Introduction To Graph Neural Networks - Zhiyuan Liu & Jie Zhou
KEYWORDS
deep graph learning, deep learning, graph neural network, graph analysis, graph convolutional network,
graph recurrent network, graph residual network
Contents
Preface
Acknowledgments
1 Introduction
1.1 Motivations
1.1.1 Convolutional Neural Networks
1.1.2 Network Embedding
1.2 Related Work
11 General Frameworks
11.1 Message Passing Neural Networks
11.2 Non-local Neural Networks
11.3 Graph Networks
15 Open Resources
15.1 Datasets
15.2 Implementations
16 Conclusion
Bibliography
Authors’ Biographies
Preface
Deep learning has achieved promising progress in many fields such as computer vision and natural
language processing. The data in these tasks are usually represented in the Euclidean domain. However,
many learning tasks require dealing with non-Euclidean graph data that contains rich relational
information between elements, such as modeling physical systems, learning molecular fingerprints,
predicting protein interfaces, etc. Graph neural networks (GNNs) are deep learning-based methods that
operate on graph domains. Due to their convincing performance and high interpretability, GNNs have
recently become a widely applied method for graph analysis.
The book provides a comprehensive introduction to the basic concepts, models, and applications of
graph neural networks. It starts with the basics of mathematics and neural networks. In the first
chapters, it gives an introduction to the basic concepts of GNNs, which aims to provide a general
overview for readers. Then it introduces different variants of GNNs: graph convolutional networks,
graph recurrent networks, graph attention networks, graph residual networks, and several general
frameworks. These variants tend to generalize different deep learning techniques to graphs, such as
convolutional neural networks, recurrent neural networks, attention mechanisms, and skip connections.
Further, the book introduces different applications of GNNs in structural scenarios (physics, chemistry,
knowledge graph), non-structural scenarios (image, text) and other scenarios (generative models,
combinatorial optimization). Finally, the book lists relevant datasets, open source platforms, and
implementations of GNNs.
This book is organized as follows. After an overview in Chapter 1, we introduce some basic
knowledge of math and graph theory in Chapter 2. We show the basics of neural networks in Chapter 3
and then give a brief introduction to the vanilla GNN in Chapter 4. Four types of models are introduced
in Chapters 5, 6, 7, and 8, respectively. Other variants for different graph types and advanced training
methods are introduced in Chapters 9 and 10. Then we introduce several general GNN frameworks in
Chapter 11. Applications of GNNs in structural scenarios, non-structural scenarios, and other scenarios
are presented in Chapters 12, 13, and 14. Finally, we provide some open resources in Chapter 15
and conclude the book in Chapter 16.
We would also like to thank those who provided feedback on the content of the book: Cheng Yang, Ruidong
Wu, Chang Shu, Yufeng Du, and Jiayou Zhang.
Finally, we would like to thank all the editors, reviewers, and staff who helped with the publication
of the book. Without you, this book would not have been possible.
Introduction
Graphs are a kind of data structure that models a set of objects (nodes) and
their relationships (edges). Recently, research on analyzing graphs with
machine learning has received more and more attention because of the great
expressive power of graphs; i.e., graphs can be used to represent a large
number of systems across various areas including social science (social
networks) [Hamilton et al., 2017b, Kipf and Welling, 2017], natural science
(physical systems [Battaglia et al., 2016, Sanchez et al., 2018] and protein-
protein interaction networks [Fout et al., 2017]), knowledge graphs
[Hamaguchi et al., 2017], and many other research areas [Khalil et al., 2017].
As a unique non-Euclidean data structure for machine learning, graphs draw
attention to analyses that focus on node classification, link prediction, and
clustering. Graph neural networks (GNNs) are deep learning-based methods
that operate on the graph domain. Due to their convincing performance and high
interpretability, GNNs have recently become a widely applied graph analysis
method. In the following paragraphs, we will illustrate the fundamental
motivations of GNNs.
1.1 MOTIVATIONS
1.1.1 CONVOLUTIONAL NEURAL NETWORKS
Firstly, GNNs are motivated by convolutional neural networks (CNNs) [LeCun
et al., 1998]. CNNs are capable of extracting and composing multi-scale
localized spatial features into representations of high expressive power, which has
resulted in breakthroughs in almost all machine learning areas and started the
revolution of deep learning. As we go deeper into CNNs and graphs, we find the keys of
CNNs: local connections, shared weights, and the use of multiple layers [LeCun et
al., 2015]. These are also of great importance in solving problems in the graph
domain, because (1) graphs are the most typical locally connected structure,
(2) shared weights reduce the computational cost compared with traditional
spectral graph theory [Chung and Graham, 1997], and (3) the multi-layer structure is
the key to dealing with hierarchical patterns, capturing features of
various sizes. However, CNNs can only operate on regular Euclidean data like
images (2D grids) and text (1D sequences), which can also be regarded as
instances of graphs. Therefore, it is straightforward to think of finding a
generalization of CNNs to graphs. As shown in Figure 1.1, it is hard to define
localized convolutional filters and pooling operators on graphs, which hinders the
transformation of CNNs from the Euclidean domain to the non-Euclidean domain.
Figure 1.1: Left: image in Euclidean space. Right: graph in non-Euclidean space.
With the Lp norm, the distance between two vectors x1, x2 (where x1 and x2 are in
the same linear space) can be defined as
dp(x1, x2) = ‖x1 − x2‖p = (Σi |x1i − x2i|^p)^(1/p).
A set of vectors x1, x2, …, xm is linearly independent if and only if there
does not exist a set of scalars λ1, λ2, …, λm, which are not all 0, such that
λ1x1 + λ2x2 + … + λmxm = 0.
• Matrix: a two-dimensional array, which can be expressed as A = (aij) with m rows and n columns,
where A ∈ ℝm × n.
Given two matrices A ∈ ℝm × n and B ∈ ℝn × p, the matrix product AB
can be denoted as C ∈ ℝm × p, where Cij = Σk=1..n AikBkj.
2.1.2 EIGENDECOMPOSITION
Let A be a matrix in ℝn × n. A nonzero vector v ∈ ℂn is called an eigenvector
of A if there exists a scalar λ ∈ ℂ such that
Av = λv.
However, not all square matrices can be diagonalized in such a form because a
matrix may not have n linearly independent eigenvectors.
Fortunately, it can be proved that every real symmetric matrix has an
eigendecomposition.
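As a quick illustration (a minimal NumPy sketch, not taken from the book), the eigendecomposition of a real symmetric matrix can be computed and verified as follows; numpy.linalg.eigh is used because it is specialized for symmetric matrices.

import numpy as np

# Build a random real symmetric matrix A = B + B^T.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B + B.T

# eigh returns eigenvalues (in ascending order) and orthonormal eigenvectors
# for symmetric matrices.
eigenvalues, Q = np.linalg.eigh(A)

# Reconstruct A = Q diag(lambda) Q^T to verify the decomposition.
A_reconstructed = Q @ np.diag(eigenvalues) @ Q.T
assert np.allclose(A, A_reconstructed)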
2.1.3 SINGULAR VALUE DECOMPOSITION
As eigendecomposition can only be applied to certain matrices, we introduce
the singular value decomposition, which is a generalization that applies to all matrices.
First we need to introduce the concept of singular values. Let r denote the
rank of ATA; then there exist r positive scalars σ1 ≥ σ2 ≥ … ≥ σr > 0 such that
for 1 ≤ i ≤ r, vi is an eigenvector of ATA with corresponding eigenvalue
σi². Note that v1, v2, …, vr are linearly
independent. The r positive scalars σ1, σ2, …, σr are called the singular values of
A. Then we have the singular value decomposition
A = UΣVT,
where U ∈ ℝm × m and V ∈ ℝn × n are orthogonal matrices and Σ is an m × n
matrix whose (i, i) entries equal σi for 1 ≤ i ≤ r and whose remaining entries are 0.
In fact, the column vectors of U are eigenvectors of AAT, and the eigenvectors
of ATA are made up of the column vectors of V.
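The following minimal NumPy sketch (an illustration, not the book's code) computes the SVD of a rectangular matrix and checks the relation between the singular values and the eigenvalues of ATA described above.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))                  # a general rectangular matrix

U, s, Vt = np.linalg.svd(A, full_matrices=True)  # A = U Sigma V^T

# Rebuild the m x n Sigma and verify the decomposition.
Sigma = np.zeros((5, 3))
np.fill_diagonal(Sigma, s)
assert np.allclose(A, U @ Sigma @ Vt)

# The squared singular values equal the eigenvalues of A^T A (here A has full
# column rank), and the rows of Vt are the corresponding eigenvectors.
eigvals = np.linalg.eigvalsh(A.T @ A)
assert np.allclose(np.sort(s**2), np.sort(eigvals))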
P(A | B) = P(B | A)P(A)/P(B),
which is the famous Bayes formula. Note that it also holds for more than two
variables: P(A | B, C) = P(B | A, C)P(A | C)/P(B | C).
3.1 NEURON
The basic units of neural networks are neurons, which receive a series of
inputs and return the corresponding output. A classic neuron is shown in
Figure 3.1, where the neuron receives n inputs x1, x2, …, xn with
corresponding weights w1, w2, …, wn and an offset b. The weighted
summation y = Σi wixi + b then passes through an activation function f,
and the neuron returns the output z = f(y). Note that the output will be the
input of the next neuron. The activation function is a kind of function that
maps a real number to a number between 0 and 1 (with rare exceptions),
which represents the activation of the neuron, where 0 indicates deactivated
and 1 indicates fully activated. Several useful activation functions are shown
as follows.
• Sigmoid Function (Figure 3.2): σ(x) = 1/(1 + e−x).
• Tanh Function (Figure 3.3): tanh(x) = (ex − e−x)/(ex + e−x).
In fact, there are many other activation functions, and each has its
corresponding derivative. But do remember that a good activation function is
usually smooth (which means that it is continuously differentiable)
and easy to calculate (in order to minimize the computational complexity of
the neural network). During the training of a neural network, the choice of the
activation function is usually essential to the outcome.
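To make the above concrete, here is a minimal sketch of a single neuron with the sigmoid activation (the inputs, weights, and offset are arbitrary illustrative values).

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def neuron(x, w, b, activation=sigmoid):
    # Weighted summation of the inputs plus the offset, then the activation.
    y = np.dot(w, x) + b
    return activation(y)

x = np.array([0.5, -1.0, 2.0])    # inputs x1, ..., xn
w = np.array([0.1, 0.4, -0.3])    # weights w1, ..., wn
b = 0.2                           # offset b
z = neuron(x, w, b)               # output z = f(y); swap in np.tanh for the tanh neuron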
By the chain rule, we can deduce the derivatives of z with respect to wi and b:
∂z/∂wi = f′(y)xi,  ∂z/∂b = f′(y).
With a learning rate of η, the update for each parameter will be
wi ← wi − η ∂L/∂wi,  b ← b − η ∂L/∂b,
where L denotes the loss to be optimized.
Figure 3.5: Feedforward neural network.
In summary, the process of the back propagation consists of the following two
steps.
• Forward calculation: given a set of parameters and an input, the neural
network computes the values at each neuron in a forward order.
• Backward propagation: compute the error at each variable to be
optimized, and update the parameters with their corresponding partial
derivatives in a backward order.
These two steps are repeated until the optimization target is reached.
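The following toy sketch illustrates the two steps for a single sigmoid neuron trained with a squared error; the data, loss, and learning rate are illustrative, and the gradients are written out by hand using the chain rule from above.

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

x = np.array([0.5, -1.0, 2.0])   # one input example
target = 0.3                     # desired output
w, b, eta = np.zeros(3), 0.0, 0.1

for step in range(100):
    # Forward calculation: compute the neuron's output for the current parameters.
    y = np.dot(w, x) + b
    z = sigmoid(y)
    loss = 0.5 * (z - target) ** 2

    # Backward propagation: chain rule, using sigmoid'(y) = z * (1 - z).
    dz = z - target
    dy = dz * z * (1.0 - z)
    grad_w, grad_b = dy * x, dy

    # Update the parameters with learning rate eta.
    w -= eta * grad_w
    b -= eta * grad_b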
4.1 INTRODUCTION
The concept of GNN was first proposed in Gori et al. [2005], Scarselli et al.
[2004, 2009]. For simplicity, we will talk about the model proposed in
Scarselli et al. [2009], which aims to extend existing neural networks for
processing graph-structured data.
A node is naturally defined by its features and related nodes in the graph.
The target of GNN is to learn a state embedding hv ∈ ℝs, which encodes the
information of the neighborhood, for each node. The state embedding hv is
used to produce an output ov, such as the distribution of the predicted node
label.
In Scarselli et al. [2009], a typical graph is illustrated in Figure 4.1. The
vanilla GNN model deals with the undirected homogeneous graph where each
node in the graph has its input features xv and each edge may also have its
features. The paper uses co[v] and ne[v] to denote the set of edges and
neighbors of node v. For processing other more complicated graphs such as
heterogeneous graphs, the corresponding variants of GNNs could be found in
later chapters.
4.2 MODEL
Given the input features of nodes and edges, next we will talk about how the
model obtains the node embedding hv and the output embedding ov.
In order to update the node state according to the input neighborhood,
there is a parametric function f, called the local transition function, shared among
all nodes. In order to produce the output of the node, there is a parametric
function g, called the local output function. Then, hv and ov are defined as follows:
hv = f(xv, xco[v], hne[v], xne[v]),
ov = g(hv, xv),
where x denotes the input feature and h denotes the hidden state. co[v] is the
set of edges connected to node v and ne[v] is the set of neighbors of node v. So
xv, xco[v], hne[v], and xne[v] are the features of v, the features of its edges, and the
states and the features of the nodes in the neighborhood of v, respectively. In
the example of node l1 in Figure 4.1, xl1 is the input feature of l1. co[l1]
contains edges l(1,4), l(1,6), l(1,2), and l(3,1). ne[l1] contains nodes l2, l3, l4, and l6.
Figure 4.1: An example of the graph based on Scarselli et al. [2009].
By stacking all the states, outputs, features, and node features and denoting them as
H, O, X, and XN, respectively, the equations can be written in a compact form:
H = F(H, X),    (4.3)
O = G(H, XN),
where F is the global transition function and G is the global output function.
They are stacked versions of the local transition function f and the local
output function g for all nodes in a graph, respectively. The value of H is the
fixed point of Eq. (4.3) and is uniquely defined with the assumption that F is a
contraction map.
With the guarantee of Banach's fixed point theorem [Khamsi and Kirk,
2011], GNN uses the following classic iterative scheme to compute the state:
Ht+1 = F(Ht, X),
where Ht denotes the t-th iteration of H.
After running the algorithm, we can get a model trained for a specific
supervised/semi-supervised task as well as hidden states of nodes in the
graph. The vanilla GNN model provides an effective way to model graph-structured
data, and it is the first step toward incorporating neural networks into the graph
domain.
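A minimal sketch of this iterative scheme is given below; the transition function here is a stand-in contraction (average of neighbor states plus the node's own features, scaled by 0.5), not the parameterization used in Scarselli et al. [2009].

import numpy as np

def fixed_point_states(adjacency, x, transition, tol=1e-6, max_steps=100):
    # Iterate H <- F(H, X) until the states stop changing.
    h = np.zeros_like(x)
    for _ in range(max_steps):
        h_next = transition(h, x, adjacency)
        if np.linalg.norm(h_next - h) < tol:
            break
        h = h_next
    return h

def toy_transition(h, x, adjacency):
    # A contraction map: averaged neighbor states plus the node's input, halved.
    degree = adjacency.sum(axis=1, keepdims=True).clip(min=1)
    return 0.5 * (adjacency @ h / degree + x)

A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X = np.eye(3)                        # toy node features
H = fixed_point_states(A, X, toy_transition)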
4.3 LIMITATIONS
Though experimental results showed that GNN is a powerful architecture for
modeling structural data, there are still several limitations of the vanilla GNN.
• First, it is computationally inefficient to update the hidden states of nodes
iteratively to get the fixed point. The model needs T steps of computation
to approximate the fixed point. If the assumption of the fixed
point is relaxed, we can design a multi-layer GNN to get a stable
representation of each node and its neighborhood.
• Second, the vanilla GNN uses the same parameters in every iteration, while
most popular neural networks use different parameters in different layers,
which serves as a hierarchical feature extraction method. Moreover, the
update of node hidden states is a sequential process that can benefit
from RNN kernels like the GRU and LSTM.
• Third, there are also informative features on the edges that cannot be
effectively modeled in the vanilla GNN. For example, the edges in a
knowledge graph have relation types, and the message propagation through
different edges should differ according to their types. Besides, how to learn
the hidden states of edges is also an important problem.
• Last, if T is fairly large, it is unsuitable to use the fixed point when we focus
on the representation of nodes instead of graphs, because the distribution
of representations at the fixed point will be much smoother in value
and less informative for distinguishing individual nodes.
Beyond the vanilla GNN, several variants have been proposed to relieve these
limitations. For example, the Gated Graph Neural Network (GGNN) [Li et al.,
2016] is proposed to solve the first problem. Relational GCN (R-GCN)
[Schlichtkrull et al., 2018] is proposed to deal with directed graphs. More
details can be found in the following chapters.
CHAPTER 5
5.1.2 CHEBNET
Hammond et al. [2011] suggest that gθ(Λ) can be approximated by a
truncated expansion in terms of Chebyshev polynomials Tk(x) up to the Kth order.
Thus, the operation is
gθ ⋆ x ≈ Σk=0..K θk Tk(L̃)x,
where L̃ = (2/λmax)L − IN, λmax denotes the largest eigenvalue of L, and θ ∈ ℝK+1
is a vector of Chebyshev coefficients.
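The truncated expansion can be evaluated with the Chebyshev recurrence Tk(x) = 2xTk−1(x) − Tk−2(x), as in the sketch below (a minimal NumPy illustration; the graph, signal, and coefficients are arbitrary).

import numpy as np

def chebyshev_filter(L, x, theta, lambda_max):
    # Apply the K-th order filter sum_k theta_k T_k(L_tilde) x.
    n = L.shape[0]
    L_tilde = (2.0 / lambda_max) * L - np.eye(n)  # rescale the spectrum into [-1, 1]
    t_prev, t_curr = x, L_tilde @ x               # T_0 x and T_1 x
    out = theta[0] * t_prev + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_next = 2.0 * (L_tilde @ t_curr) - t_prev  # Chebyshev recurrence
        out = out + theta[k] * t_next
        t_prev, t_curr = t_curr, t_next
    return out

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A                    # graph Laplacian
x = np.random.randn(3, 2)                         # node signals
out = chebyshev_filter(L, x, np.array([0.5, 0.3, 0.2]), np.linalg.eigvalsh(L).max())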
5.1.3 GCN
Kipf and Welling [2017] limit the layer-wise convolution operation to K = 1
to alleviate the problem of overfitting on local neighborhood structures for
graphs with very wide node degree distributions. They further approximate λmax
≈ 2, and the equation simplifies to
gθ ⋆ x ≈ θ(IN + D−1/2AD−1/2)x,
with a single filter parameter θ.
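With the renormalization trick Ã = A + IN used in the GCN paper, the resulting layer can be sketched as follows (a minimal NumPy illustration; the activation and weight initialization are arbitrary choices).

import numpy as np

def gcn_layer(A, H, W, activation=np.tanh):
    # One GCN layer: H' = act(D~^{-1/2} A~ D~^{-1/2} H W), with A~ = A + I.
    n = A.shape[0]
    A_tilde = A + np.eye(n)                          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))  # D~^{-1/2} as a vector
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return activation(A_hat @ H @ W)

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.eye(3)                                        # one-hot input features
W = 0.1 * np.random.randn(3, 2)                      # layer weights
H_next = gcn_layer(A, H, W)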
In the adaptive graph convolutional network (AGCN), the graph Laplacian is
augmented with a residual Laplacian Lres weighted by a parameter α. Lres is
computed from a learned graph adjacency matrix, which is in turn computed via a
learned metric. The idea behind the adaptive metric is that the Euclidean distance is
not suitable for graph-structured data and the metric should be adaptive to the task
and input features. AGCN uses the generalized Mahalanobis distance:
5.2.3 DCNN
Atwood and Towsley [2016] propose the diffusion-convolutional neural
networks (DCNNs). Transition matrices are used to define the neighborhood
for nodes in DCNN. For node classification, it has
H = f(Wc ⊙ P*X),
where X is an N × F matrix of input features, P* is an N × K × N tensor that contains
the power series {P, P², …, PK} of the degree-normalized transition matrix P, and ⊙
denotes the element-wise product.
5.2.4 DGCN
Zhuang and Ma [2018] propose the dual graph convolutional network
(DGCN) to jointly consider the local consistency and global consistency on
graphs. It uses two convolutional networks to capture the local/global
consistency and adopts an unsupervised loss to ensemble them. The first
convolutional network is the same as Eq. (5.5). And the second network
replaces the adjacency matrix with positive pointwise mutual information
(PPMI) matrix:
where XP is the PPMI matrix, DP is the diagonal degree matrix of XP, and σ is
a nonlinear activation function.
The motivations of jointly using the two perspectives are: (1) Eq. (5.5)
models the local consistency, which indicates that nearby nodes may have
similar labels, and (2) Eq. (5.13) models the global consistency which
assumes that nodes with similar context may have similar labels. The local
consistency convolution and the global consistency convolution are named
ConvA and ConvP, respectively.
Zhuang and Ma [2018] further ensemble these two convolutions via the
final loss function, which can be written as
L = L0(ConvA) + λ(t)Lreg(ConvA, ConvP).
λ(t) is the dynamic weight that balances the importance of these two loss
terms. L0(ConvA) is the supervised loss function with given node labels. If
we have c different labels to predict, ZA denotes the output matrix of ConvA,
and ẐA denotes the output of ZA after a softmax operation, then the loss
L0(ConvA), which is the cross-entropy error, can be written as
L0(ConvA) = −(1/|yL|) Σl∈yL Σi=1..c Yl,i ln(ẐAl,i),
where yL is the set of training data indices and Y is the ground truth.
Figure 5.2: Architecture of the dual graph convolutional network (DGCN) model.
The unsupervised regularizer is
Lreg(ConvA, ConvP) = (1/n) Σi=1..n ‖ẐPi − ẐAi‖²,
where ẐP denotes the output of ConvP after the softmax operation. Thus,
Lreg(ConvA, ConvP) is the unsupervised loss function that measures the
difference between ẐA and ẐP. The architecture of this model is
shown in Figure 5.2.
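A minimal sketch of the combined objective is shown below; the squared-difference form of the regularizer and the shapes are illustrative assumptions, with lam standing in for the dynamic weight λ(t).

import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def dgcn_loss(Z_a, Z_p, labels, labeled_idx, lam):
    # L = L0(ConvA) + lambda(t) * Lreg(ConvA, ConvP)
    P_a = softmax(Z_a)                  # softmax outputs of ConvA
    P_p = softmax(Z_p)                  # softmax outputs of ConvP
    # Supervised cross-entropy over the labeled nodes only.
    supervised = -np.mean(np.log(P_a[labeled_idx, labels[labeled_idx]] + 1e-12))
    # Unsupervised term: mean squared difference between the two outputs.
    regularizer = np.mean((P_a - P_p) ** 2)
    return supervised + lam * regularizer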
5.2.5 LGCN
Gao et al. [2018] propose the learnable graph convolutional networks
(LGCN). The network is based on the learnable graph convolutional layer
(LGCL) and the sub-graph training strategy. We will give the details of the
learnable graph convolutional layer in this section.
LGCL leverages CNNs as aggregators. It performs max pooling on nodes’
neighborhood matrices to get top-k feature elements and then applies 1-D
CNN to compute hidden representations. The propagation step of LGCL is
formulated as
Ĥt = g(Ht, A, k),  Ht+1 = c(Ĥt),
where A is the adjacency matrix, g(·) is the k-largest node selection operation,
and c(·) denotes the regular 1-D CNN.
The model uses the k-largest node selection operation to gather
information for each node. For a given node x, the features of its neighbors
are firstly gathered; suppose it has n neighbors and each node has c features,
then a matrix M ∈ ℝn × c is obtained. If n is less than k, then M is padded with
columns of zeros. Then the k-largest node selection is conducted: we rank
the values in each column and select the top-k values. After that, the
embedding of the node x itself is inserted into the first row of the matrix, and finally
we get a matrix M̂ ∈ ℝ(k+1) × c.
Figure 5.3: An example of the learnable graph convolutional layer (LGCL). Each node has three
features and this layer selects k = 4 neighbors. The node has five neighbors and four of them are
selected. The k-largest node selection procedure is shown in the left part, where the four largest values are
selected in each column. Then a 1-D CNN is performed to get the final output.
After the matrix M̂ is obtained, the model uses a regular 1-D CNN
to aggregate the features. The function c(·) should take a matrix of size N × (k +
1) × C as input and output a matrix of dimension N × D or N × 1 × D. Figure
5.3 gives an example of the LGCL.
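The k-largest node selection itself can be sketched as below (an illustration only; the padding convention and the subsequent 1-D CNN c(·) are treated as assumptions).

import numpy as np

def k_largest_selection(node_feat, neighbor_feats, k):
    # Build the (k + 1) x c matrix that is fed to the 1-D CNN in LGCL.
    n, c = neighbor_feats.shape
    if n < k:
        # Pad with zeros so every column has at least k values to rank.
        neighbor_feats = np.vstack([neighbor_feats, np.zeros((k - n, c))])
    # Rank the values in each column independently and keep the top k.
    top_k = -np.sort(-neighbor_feats, axis=0)[:k]
    # Insert the node's own embedding as the first row.
    return np.vstack([node_feat[None, :], top_k])

neighbors = np.random.randn(5, 3)               # 5 neighbors with 3 features each
x = np.random.randn(3)
M_hat = k_largest_selection(x, neighbors, k=4)  # shape (k + 1, 3)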
5.2.6 MONET
Monti et al. [2017] propose a spatial-domain model (MoNet) on non-
Euclidean domains which could generalize several previous techniques. The
Geodesic CNN (GCNN) [Masci et al., 2015] and Anisotropic CNN (ACNN)
[Boscaini et al., 2016] on manifolds or GCN [Kipf and Welling, 2017] and
DCNN [Atwood and Towsley, 2016] on graphs could be formulated as
particular instances of MoNet.
We use x to denote the node in the graph and y ∈ Nx to denote the
neighbor node of x. The MoNet model computes the pseudo-coordinates u(x,
y) between the node and its neighbor and uses a weighting function among
these coordinates:
Table 5.1: Different settings for different methods in the MoNet framework
where the parameters are wΘ(u) = (w1(u), …, wJ(u)) and J represents the size
of the extracted patch. Then a spatial generalization of the convolution on
non-Euclidean domains is defined as:
5.2.7 GRAPHSAGE
Hamilton et al. [2017b] propose the GraphSAGE, a general inductive
framework. The framework generates embeddings by sampling and
aggregating features from a node’s local neighborhood. The propagation step
of GraphSAGE is
htN(v) = AGGREGATEt({ht−1u, ∀u ∈ N(v)}),
htv = σ(Wt · [ht−1v ‖ htN(v)]),
where Wt is the parameter at layer t and ‖ denotes concatenation.
However, Hamilton et al. [2017b] do not utilize the full set of neighbors in
Eq. (5.20) but a fixed-size set of neighbors obtained by uniform sampling. The
AGGREGATE function can have various forms, and Hamilton et al. [2017b]
suggest three aggregator functions.
• Mean aggregator. It could be viewed as an approximation of the
convolutional operation from the transductive GCN framework [Kipf and
Welling, 2017], so that the inductive version of the GCN variant could be
derived by
htv = σ(W · MEAN({ht−1v} ∪ {ht−1u, ∀u ∈ N(v)})).
Note that any symmetric functions could be used in place of the max-
pooling operation here.
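A minimal sketch of one GraphSAGE layer with the mean aggregator and fixed-size uniform neighbor sampling is given below; the separate weight matrices, ReLU nonlinearity, and sample size are illustrative choices rather than the reference implementation.

import numpy as np

def graphsage_layer(h, neighbors, W_self, W_neigh, sample_size=5, rng=None):
    # h: (num_nodes, d) hidden states; neighbors: dict node -> list of neighbor ids.
    rng = rng or np.random.default_rng(0)
    h_next = np.zeros((h.shape[0], W_self.shape[1]))
    for v, neigh in neighbors.items():
        # Uniformly sample a fixed-size neighborhood instead of using all neighbors.
        sampled = rng.choice(neigh, size=min(sample_size, len(neigh)), replace=False)
        h_agg = h[sampled].mean(axis=0)                               # mean aggregator
        h_next[v] = np.maximum(h[v] @ W_self + h_agg @ W_neigh, 0.0)  # combine + ReLU
    # L2-normalize the new states, as GraphSAGE does after each layer.
    return h_next / (np.linalg.norm(h_next, axis=1, keepdims=True) + 1e-12)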
The node v first aggregates messages from its neighbors, where Av is the
sub-matrix of the graph adjacency matrix A that denotes the connections of
node v with its neighbors. The GRU-like update functions use information
from each node's neighbors and from the previous timestep to update the node's
hidden state. The vector a gathers the neighborhood information of node v; z and
r are the update and reset gates; and ⊙ is the Hadamard product operation.
The GGNN model is designed for problems defined on graphs that
require outputting sequences, while existing models focus on producing single
outputs such as node-level or graph-level classifications.
Li et al. [2016] further propose Gated Graph Sequence Neural Networks
(GGS-NNs), which use several GGNNs to produce an output sequence o(1) …
o(K). As shown in Figure 6.1, for the k-th output step, the matrix of node
annotations is denoted as X(k). Two GGNNs are used in this architecture: (1)
one for predicting o(k) from X(k) and (2) one for predicting X(k+1) from X(k).
We use H(k,t) to denote the t-th propagation step of the k-th output step. The
value of H(k,1) at each step k is initialized by X(k). The two GGNNs can be
different models or share the same parameters.
The model is used on the bAbI task as well as the program verification
task and has demonstrated its effectiveness.
where xt is the input vector at time t in the standard LSTM setting and ⊙ is the
Hadamard product operation.
If the number of children of each node in a tree is at most K and the
children can be ordered from 1 to K, then the N-ary Tree-LSTM can be
applied. For node v, htvk and ctvk denote the hidden state and memory cell of
its k-th child at time t, respectively. The transition equations are the following:
Compared to the Child-Sum Tree-LSTM, the N-ary Tree-LSTM
introduces separate parameter matrices for each child k, which allows the
model to learn more fine-grained representations for each node conditioned
on its children.
Figure 6.2: The propagation step of the S-LSTM model. The dashed lines connect the supernode g with its
neighbors from the last layer. The solid lines connect the word nodes with their neighbors from the last layer.
CHAPTER 7
7.1 GAT
Velickovic et al. [2018] propose a graph attention network (GAT) which
incorporates the attention mechanism into the propagation steps. It follows the
self-attention strategy and the hidden state of each node is computed by
attending over its neighbors.
Velickovic et al. [2018] define a single graph attentional layer and
construct arbitrary graph attention networks by stacking this layer. The layer
computes the coefficients in the attention mechanism of the node pair (i, j) by
αij = exp(LeakyReLU(aT[Whi ‖ Whj])) / Σk∈Ni exp(LeakyReLU(aT[Whi ‖ Whk])),
where αij is the attention coefficient of node j to i and Ni represents the
neighborhood of node i in the graph. The input node features are denoted as
h = {h1, h2, …, hN}, hi ∈ ℝF, where N is the number of nodes and F is the
dimension; the output node features (with cardinality F′) are denoted as
h′ = {h′1, h′2, …, h′N}, h′i ∈ ℝF′. W ∈ ℝF′ × F is the weight matrix of a
shared linear transformation which is applied to every node, and a ∈ ℝ2F′ is the
weight vector. The coefficients are normalized by a softmax function, and the LeakyReLU
nonlinearity (with negative input slope α = 0.2) is applied.
Then the final output features of each node can be obtained by (after
applying a nonlinearity σ)
h′i = σ(Σj∈Ni αij W hj).
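A single attention head can be sketched as follows (a NumPy illustration; the final nonlinearity and the dense loop over nodes are simplifications of the masked, vectorized computation used in practice).

import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(h, adj, W, a):
    # Single-head GAT layer. h: (N, F), W: (F, F'), a: (2F',).
    Wh = h @ W
    out = np.zeros_like(Wh)
    for i in range(h.shape[0]):
        neigh = np.append(np.where(adj[i] > 0)[0], i)   # neighborhood including i
        # e_ij = LeakyReLU(a^T [W h_i || W h_j])
        e = leaky_relu(np.array([a @ np.concatenate([Wh[i], Wh[j]]) for j in neigh]))
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                            # softmax over the neighborhood
        out[i] = np.tanh((alpha[:, None] * Wh[neigh]).sum(axis=0))
    return out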
7.2 GAAN
Besides GAT, Gated Attention Network (GaAN) [Zhang et al., 2018b] also
uses the multi-head attention mechanism. The difference between the
attention aggregator in GaAN and the one in GAT is that GaAN uses the key-
value attention mechanism and the dot product attention while GAT uses a
fully connected layer to compute the attention coefficients.
Furthermore, GaAN assigns different weights for different heads by
computing an additional soft gate. This aggregator is called the gated attention
aggregator. In detail, GaAN uses a convolutional network that takes the
features of the center node and its neighbors to generate gate values. As a
result, it could outperform GAT as well as other GNN models with different
aggregators on the inductive node classification problem.
CHAPTER 8
Xu et al. [2018] propose the Jumping Knowledge Network (JK-Net), which could learn
adaptive, structure-aware representations. The Jumping Knowledge Network
selects from all of the intermediate representations (which "jump" to the last
layer) for each node at the last layer, which enables each node to select an
effective neighborhood size as needed. Xu et al. [2018] use three approaches
(concatenation, max-pooling, and LSTM-attention) in the experiments to
aggregate information. An illustration of JK-Net can be found in Figure 8.1.
The idea of the Jumping Knowledge Network is straightforward, and it performs
well in experiments on social, bioinformatics, and citation networks. It
could also be combined with models like GCNs, GraphSAGE, and graph
attention networks to improve their performance.
8.3 DEEPGCNS
Li et al. [2019] borrow ideas from CNNs to add skip connections into graph
neural networks. There are two major challenges in stacking more layers of
GNNs: vanishing gradients and over-smoothing. Li et al. [2019] use residual
connections and dense connections from ResNet [He et al., 2016b] and
DenseNet [Huang et al., 2017] to solve the vanishing gradient problem and
use dilated convolutions [Yu and Koltun, 2015] to solve the over-smoothing
problem.
Li et al. [2019] denote the vanilla GCN as PlainGCN and further propose
ResGCN and DenseGCN. In PlainGCN, hidden states are computed by a plain
stack of graph convolution layers. In ResGCN, the matrix of hidden states Ht is
directly added to the matrix produced by the graph convolution, following the
residual connections of ResNet. In DenseGCN, the hidden states of all previous
layers are concatenated, following the dense connections of DenseNet.
Figure 8.3: An example of the dilated convolution. The dilation rate is 1, 2, 3 for figures from left to
right.
CHAPTER 9
where two parts of the adjacency matrix contain the k-hop edges for the
ancestor propagation phase and the k-hop edges for the descendant propagation
phase, respectively, and the corresponding degree matrices are defined for each of
them.
10.1 SAMPLING
The original graph neural network has several drawbacks in training and
optimization. For example, GCN requires the full-graph Laplacian, which is
computationally expensive for large graphs. Furthermore, GCN is trained
independently on a fixed graph and lacks the ability to perform inductive learning.
GraphSAGE [Hamilton et al., 2017b] is a comprehensive improvement
over the original GCN. To solve the problems mentioned above, GraphSAGE
replaces the full-graph Laplacian with learnable aggregation functions, which are
key to performing message passing and generalizing to unseen nodes. As shown in
Eq. (5.20), GraphSAGE first aggregates neighborhood embeddings, concatenates them
with the target node's embedding, and then propagates to the next layer. With learned
aggregation and propagation functions, GraphSAGE could generate
embeddings for unseen nodes. Also, GraphSAGE uses a random neighbor
sampling method to alleviate receptive field expansion.
Compared to GCN [Kipf and Welling, 2017], GraphSAGE proposes a
way to train the model via batches of nodes instead of the full-graph
Laplacian. This enables training on large graphs, though it may be time-
consuming.
PinSage [Ying et al., 2018a] is an extension of GraphSAGE to very
large graphs. It uses an importance-based sampling method, since simple random
sampling is suboptimal because of the increased variance. PinSage defines the
importance-based neighborhood of node u as the T nodes that exert the most
influence on node u. By simulating random walks starting from target nodes,
this approach calculates the L1-normalized visit count of nodes visited by the
random walk. Then the top T nodes with the highest normalized visit counts
with respect to u are selected to be the neighborhood of node u.
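A rough sketch of this importance-based selection is given below; the walk length, number of walks, and restart behavior are illustrative assumptions rather than PinSage's exact settings.

import numpy as np
from collections import Counter

def importance_neighborhood(neighbors, u, T=10, num_walks=200, walk_length=3, rng=None):
    # Pick the T nodes with the highest normalized visit counts of short random walks.
    rng = rng or np.random.default_rng(0)
    counts = Counter()
    for _ in range(num_walks):
        node = u
        for _ in range(walk_length):
            node = rng.choice(neighbors[node])   # uniform step to a neighbor
            counts[node] += 1
    total = sum(counts.values())
    # L1-normalized visit counts; normalization does not change the ranking.
    ranked = sorted(counts, key=lambda n: counts[n] / total, reverse=True)
    return [n for n in ranked if n != u][:T]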
Figure 10.1: The illustration of sampled neighborhood on an example graph, K denotes the hop of
neighborhood.
Kipf and Welling [2016] also train the GAE model in a variational
manner, and the model is named the variational graph auto-encoder
(VGAE). Furthermore, Berg et al. use GAE in recommender systems and
propose the graph convolutional matrix completion model (GC-MC)
[van den Berg et al., 2017], which outperforms other baseline models on the
MovieLens dataset.
Adversarially Regularized Graph Auto-encoder (ARGA) [Pan et al.,
2018] employs generative adversarial networks (GANs) to regularize a GCN-
based graph auto-encoder to follow a prior distribution.
Deep Graph Infomax (DGI) [Veličković et al., 2019] aims to maximize
the local-global mutual information to learn representations. The local
information comes from each node's hidden state after the graph convolution
function ℱ. The global information of a graph is computed by the readout
function ℛ. This function aggregates all node representations and is set to an
average function in the paper. The paper uses node shuffling to get negative
examples (by changing node features from X to X̃ with a corruption function
𝒞). Then it uses a discriminator 𝒟 to classify the positive samples and negative
samples. The architecture of DGI is shown in Figure 10.2.
There are also several graph auto-encoders such as NetRA [Yu et al.,
2018b], DNGR [Cao et al., 2016], SDNE [Wang et al., 2016], and DRNE
[Tu et al., 2018]; however, they do not use GNNs in their frameworks.
General Frameworks
Apart from different variants of graph neural networks, several general
frameworks have been proposed that aim to integrate different models into a single
framework. Gilmer et al. [2017] propose the message passing neural network
(MPNN), a unified framework that generalizes several graph neural
network and graph convolutional network approaches. Wang et al. [2018b]
propose the non-local neural network (NLNN), which is used to solve
computer vision tasks. It could generalize several "self-attention"-style
methods [Hoshen, 2017, Vaswani et al., 2017, Velickovic et al., 2018].
Battaglia et al. [2018] propose the graph network (GN), which unifies the
MPNN and NLNN methods as well as many other variants like Interaction
Networks [Battaglia et al., 2016, Watters et al., 2017], the Neural Physics Engine
[Chang et al., 2017], CommNet [Sukhbaatar et al., 2016], structure2vec [Dai
et al., 2016, Khalil et al., 2017], GGNN [Li et al., 2016], Relation Networks
[Raposo et al., 2017, Santoro et al., 2017], Deep Sets [Zaheer et al., 2017],
and PointNet [Qi et al., 2017a].
The message passing phase runs for T time steps and is defined in terms of a
message function Mt and a vertex update function Ut:
mt+1v = Σw∈N(v) Mt(htv, htw, evw),
ht+1v = Ut(htv, mt+1v),
where evw represents features of the edge from node v to w. The readout phase
uses a readout function R to compute a representation for the whole graph:
ŷ = R({hTv | v ∈ G}),
Figure 11.1: A spacetime non-local operation in the network trained for video classification. The
response of xi is computed as the weighted sum of all positions xj where in this figure only the highest
weighted ones are shown.
where T denotes the total time steps. The message function Mt, vertex update
function Ut, and readout function R could have different settings. Hence, the
MPNN framework could generalize several different models via different
function settings. Here we give an example of generalizing GGNN, and other
models’ function settings could be found in Gilmer et al. [2017]. The function
settings for GGNNs are
Mt(htv, htw, evw) = Aevw htw,
Ut(htv, mt+1v) = GRU(htv, mt+1v),
R = Σv∈V σ(i(hTv, h0v)) ⊙ (j(hTv)),
where Aevw is the adjacency matrix, one for each edge label e, GRU is the
Gated Recurrent Unit introduced in Cho et al. [2014], and i and j are neural
networks in the function R.
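The skeleton shared by these models can be sketched as below; the message, update, and readout functions at the end are toy stand-ins, not the GGNN setting listed above.

import numpy as np

def mpnn(edges, h, edge_feats, message_fn, update_fn, readout_fn, T=3):
    # Message phase: m_v = sum_w M_t(h_v, h_w, e_vw); h_v = U_t(h_v, m_v), for T steps.
    for _ in range(T):
        messages = {v: np.zeros_like(state) for v, state in h.items()}
        for (v, w) in edges:
            messages[v] += message_fn(h[v], h[w], edge_feats[(v, w)])
        h = {v: update_fn(h[v], messages[v]) for v in h}
    # Readout phase: compute a representation for the whole graph.
    return readout_fn(list(h.values()))

# Toy stand-ins for M_t, U_t, and R:
message = lambda hv, hw, e: e * hw
update = lambda hv, m: np.tanh(hv + m)
readout = lambda states: np.mean(states, axis=0)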
The generic non-local operation is defined as
h′i = (1/𝒞(h)) Σ∀j f(hi, hj)g(hj),
where i is the target position and the selection of j should enumerate all
possible positions. f(hi, hj) is used to compute the "attention" between positions
i and j, g(hj) denotes a transformation of the input hj, and the factor 𝒞(h) is
utilized to normalize the results.
There are several instantiations with different f and g settings. For
simplicity, Wang et al. [2018b] use the linear transformation as the function g.
That means g(hj) = Wghj, where Wg is a learned weight matrix. Next, we list
the choices for function f in the following.
Gaussian. The Gaussian function is a natural choice according to the non-
local mean [Buades et al., 2005] and bilateral filters [Tomasi and Manduchi,
1998]. Thus f(hi, hj) = e^(hiT hj), with 𝒞(h) = Σ∀j f(hi, hj).
Embedded Gaussian. A straightforward extension is to compute the
similarity in an embedding space:
f(hi, hj) = e^(θ(hi)T ϕ(hj)),
where θ(hi) = Wθhi, ϕ(hj) = Wϕhj, and 𝒞(h) = Σ∀j f(hi, hj).
It can be found that the self-attention proposed in Vaswani et al. [2017]
is a special case of the Embedded Gaussian version. For a given i,
(1/𝒞(h)) f(hi, hj) becomes the softmax computation along dimension j, so
that h′ = softmax(hTWθTWϕh)g(h), which matches the form of self-attention
in Vaswani et al. [2017].
Dot product. The function f can also be implemented as dot-product
similarity:
f(hi, hj) = θ(hi)Tϕ(hj),
where the normalization factor is set as 𝒞(h) = N, the number of positions.
The non-local operation is then wrapped into a non-local block
zi = Wzh′i + hi,
where h′i is given in Eq. (11.4) and "+hi" denotes the residual connection [He
et al., 2016a]. Hence, the non-local block can be inserted into any pre-trained
model, which makes the block more applicable. Wang et al. [2018b] conduct
experiments on the tasks of video classification, object detection and
segmentation, and pose estimation. On these tasks, the simple addition of
non-local blocks leads to significant improvements over baselines.
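The embedded Gaussian instantiation together with the residual wrapper can be sketched as follows (a NumPy illustration with arbitrary weight matrices; the batched, convolutional form used for video is omitted).

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(H, W_theta, W_phi, W_g, W_z):
    # Embedded Gaussian non-local operation followed by the residual connection.
    theta = H @ W_theta                     # (N, d) embedded queries
    phi = H @ W_phi                         # (N, d) embedded keys
    g = H @ W_g                             # (N, d) transformed inputs
    attn = softmax(theta @ phi.T, axis=1)   # normalized f(h_i, h_j) over all j
    h_prime = attn @ g                      # weighted sum over all positions
    return H + h_prime @ W_z                # z_i = W_z h'_i + h_i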
Note that the order here is not strictly enforced. For example, it is possible to
proceed from global, to per-node, to per-edge updates. The ϕ and ρ
functions need not be neural networks, though in this book we only focus on
neural network implementations.
Design Principles. The design of GN is based on three basic principles:
flexible representations, configurable within-block structure, and composable
multi-block architectures.
• Flexible representations. The GN framework supports flexible
representations of the attributes as well as different graph structures. The
global, node, and edge attributes can use different kinds of
representations and researchers usually use real-valued vectors and
tensors. One can simply tailor the output of a GN block according to
specific demands of tasks. For example, Battaglia et al. [2018] list several
edge-focused [Hamrick et al., 2018, Kipf et al., 2018], node-focused
[Battaglia et al., 2016, Chang et al., 2017, Sanchez et al., 2018, Wang et
al., 2018a], and graph-focused [Battaglia et al., 2016, Gilmer et al., 2017,
Santoro et al., 2017] GNs. In terms of graph structures, the framework
can be applied to both structural scenarios where the graph structure is
explicit and non-structural scenarios where the relational structure should
be inferred or assumed.
• Configurable within-block structure. The functions and their inputs
within a GN block can have different settings so that the GN framework
provides flexibility in within-block structure configuration. For example,
Hamrick et al. [2018] and Sanchez et al. [2018] use the full GN blocks.
Their ϕ implementations use neural networks and their ρ functions use
the elementwise summation. Based on different structure and function
settings, a variety of models (such as MPNN, NLNN, and other variants)
could be expressed by the GN framework. Figure 11.2a gives an
illustration of a full GN block and other models can be regarded as
special variants of the GN block. For example, the MPNN uses the
features of nodes and edges as input and outputs graph-level and node-
level representations. The MPNN model does not use the graph-level
input features and omits the learning process of edge embeddings.
• Composable multi-block architectures. GN blocks could be composed
to construct complex architectures. Arbitrary numbers of GN blocks
could be composed in sequence with shared or unshared parameters.
Battaglia et al. [2018] utilize GN blocks to construct an encode-process-
decode architecture and a recurrent GN-based architecture. These
architectures are demonstrated in Figure 11.3. Other techniques for
building GN-based architectures could also be useful, such as skip
connections, LSTM-, or GRU-style gating schemes and so on.
Figure 11.3: Examples of architectures composed by GN blocks. (a) The sequential processing
architecture; (b) The encode-process-decode architecture; and (c) The recurrent architecture.
CHAPTER 12
12.1 PHYSICS
Modeling real-world physical systems is one of the most basic aspects of
understanding human intelligence. By representing objects as nodes and
relations as edges, we can perform GNN-based reasoning about objects,
relations, and physics in a simplified but effective way.
Battaglia et al. [2016] propose Interaction Networks to make predictions
and inferences about various physical systems. Objects and relations are fed
into a GNN to model their interactions, and then physical dynamics are
applied to predict future states. Interaction Networks separately model relation-
centric and object-centric reasoning, making it easier to generalize across
different systems.
In CommNet [Sukhbaatar et al., 2016], interactions are not modeled
explicitly. Instead, an interaction vector is obtained by averaging all other
agents’ hidden vectors.
VAIN [Hoshen, 2017] further introduces an attention mechanism into the agent
interaction process, which preserves both the complexity advantages and the
computational efficiency.
Visual Interaction Networks [Watters et al., 2017] could make predictions
from pixels. It learns a state code from two consecutive input frames for each
object. Then, after adding their interaction effect by an Interaction Net block,
the state decoder converts state codes to next step’s state.
Sanchez et al. [2018] propose a GN-based model which could either
perform state prediction or inductive inference. The inference model takes
partially observed information as input and constructs a hidden graph for
implicit system classification. Kipf et al. [2018] also build graphs from object
trajectories; they adopt an encoder-decoder architecture for the neural relational
inference process. In detail, the encoder returns a factorized distribution of the
interaction graph through a GNN, while the decoder generates trajectory
predictions conditioned on both the latent code of the encoder and the
previous time step of the trajectory.
Figure 12.1: A physical system and its corresponding graph representation. Colored nodes denote
different objects and edges denote interactions between them.
Figure 12.2: A single CH3OH molecule and its graph representation. Nodes are elements and edges are
bonds.
where euv is the edge feature of edge (u, v). Then the node representation is updated
by an update rule in which deg(v), the degree of node v, selects a weight matrix
that is learned for each time step t and node degree N.
Kearnes et al. [2016] further explicitly model atoms and atom pairs
independently to emphasize atom interactions. Their model introduces explicit
edge representations instead of an aggregation function, and a node update
function is then applied to these representations.
Beyond atom molecular graphs, some works [Jin et al., 2018, 2019]
represent molecules as junction trees. A junction tree is generated by
contracting certain vertices in corresponding molecular graph into a single
node. The nodes in a junction tree are molecular substructures such as rings
and bonds. Jin et al. [2018] leverage a variational auto-encoder to generate
molecular graphs. Their model follows a two-step process, first generating a
junction tree scaffold over chemical substructures and then combining them into a
molecule with a graph message passing network. Jin et al. [2019] focus on
molecular optimization. This task aims to map one molecule to another
molecular graph that preserves better properties. The proposed VJTNN
uses graph attention to decode the junction tree and incorporates a GAN for
adversarial training to avoid invalid graph translation.
To better explain the function of each substructure in a molecule, Lee et
al. [2019] propose a game-theoretic approach to exhibit the transparency in
structured data. The model is set up as a two-player co-operative game
between a predictor and a witness. The predictor is trained to minimize the
discrepancy while the goal of the witness is to test how well the predictor
conforms to the transparency.
12.2.2 CHEMICAL REACTION PREDICTION
Chemical reaction product prediction is a fundamental problem in organic
chemistry. Do et al. [2019] view chemical reactions as a graph transformation
process and introduce the GTPN model. GTPN uses a GNN to learn
representation vectors of reactant and reagent molecules, then leverages
reinforcement learning to predict the optimal reaction path in the form of
bond change which transforms the reactants into products. Bradshaw et al.
[2019] give another view that chemical reactions can be described as the
stepwise redistribution of electrons in molecules. Their model tries to predict
the electron paths by learning path distribution over the electron movements.
They represent node and graph embeddings with a four-layer GGNN, and
then optimize the factorized path generation probability.
12.2.3 MEDICATION RECOMMENDATION
Using deep learning algorithms to help recommend medications has been
explored by researchers and doctors extensively. The traditional methods can
be categorized into instance-based and longitudinal electronic health records
(EHR)-based medication recommendation methods.
To fill the gap between them, Shang et al. [2019c] propose GAMENet
which takes both longitudinal patient EHR data and drug knowledge based on
drug-drug interactions (DDI) as inputs. GAMENet embeds both the EHR graph
and the DDI graph, and then feeds them into the Memory Bank for the final output.
To further exploit the hierarchical knowledge for medication
recommendation, Shang et al. [2019b] combine the power of GNNs and
BERT for medical code representation. The authors first encode the internal
hierarchical structure with GNN, and then feed the embeddings into the pre-
trained EHR encoder and the fine-tuned classifier for downstream predictive
tasks.
12.2.4 PROTEIN AND MOLECULAR INTERACTION
PREDICTION
Fout et al. [2017] focus on the task of protein interface prediction, which
is a challenging problem of predicting the interaction between proteins and the
interfaces at which they occur. The proposed GCN-based method learns
ligand and receptor protein residue representations separately and merges them for
pairwise classification. Xu et al. [2019b] introduce MR-GNN, which utilizes a
multi-resolution model to capture multi-scale node features. The model also
utilizes two long short-term memory networks to capture the interaction
between two graphs step-by-step.
GNNs can also be used in biomedical engineering. With a protein-protein
interaction network, Rhee et al. [2018] leverage graph convolution and a
relation network for breast cancer subtype classification. Zitnik et al. [2018]
also suggest a GCN-based model for polypharmacy side-effect prediction.
Their work models the drug and protein interaction network and deals
separately with edges of different types.
Figure 12.3: Example of knowledge base fragment. The nodes are entities and the edges are relations.
The dashed line is missing edge information to be inferred.
Applications – Non-Structural
Scenarios
In this chapter we will talk about applications on non-structural scenarios such
as image, text, programming source code [Allamanis et al., 2018, Li et al.,
2016], and multi-agent systems [Hoshen, 2017, Kipf et al., 2018, Sukhbaatar
et al., 2016]. We will only give a detailed introduction to the first two scenarios
due to the length limit. Roughly, there are two ways to apply graph neural
networks to non-structural scenarios: (1) incorporate structural information
from other domains to improve the performance, for example using
information from knowledge graphs to alleviate the zero-shot problems in
image tasks; and (2) infer or assume the relational structure in the scenario
and then apply a GNN model to solve the problems defined on graphs, such as
the method in Zhang et al. [2018c], which models text as graphs.
13.1 IMAGE
13.1.1 IMAGE CLASSIFICATION
Image classification is a very basic and important task in the field of computer
vision, which attracts much attention and has many famous datasets like
ImageNet [Russakovsky et al., 2015]. Recent progress in image classification
benefits from big data and the strong power of GPU computation, which
allows us to train a classifier without extracting structural information from
images. However, zero-shot and few-shot learning are becoming more and
more popular in the field of image classification, because most models can
achieve similar performance with enough data. There are several works
leveraging graph neural networks to incorporate structural information in
image classification.
First, knowledge graphs can be used as extra information to guide zero-
shot recognition classification [Kampffmeyer et al., 2019, Wang et al.,
2018c]. Wang et al. [2018c] build a knowledge graph where each node
corresponds to an object category and take the word embeddings of nodes as
input for predicting the classifiers of different categories. As the over-smoothing
effect happens when the convolution architecture goes deep, the six-layer
GCN used in Wang et al. [2018c] would wash out much useful information in
the representations. To solve the smoothing problem in the propagation of
GCN, Kampffmeyer et al. [2019] use a single-layer GCN with a
larger neighborhood that includes both one-hop and multi-hop nodes in the
graph, which proved effective in building zero-shot classifiers from existing
ones. Figure 13.1 shows an example of the propagation step in Kampffmeyer
et al. [2019] and Wang et al. [2018c].
Figure 13.1: The black lines represent the propagation step from previous methods. The red and blue
lines represent the propagation step in Kampffmeyer et al. [2019], where the node could aggregate
information from ancestor and descendant nodes.
Besides the knowledge graph, the similarity between images in the dataset
is also helpful for the few-shot learning [Garcia and Bruna, 2018]. Garcia and
Bruna [2018] propose to build a weighted fully-connected image network
based on the similarity and do message passing in the graph for few-shot
recognition.
As most knowledge graphs are too large for reasoning, Marino et al. [2017]
select some related entities to build a sub-graph based on the result of object
detection and apply GGNN to the extracted graph for prediction. Besides, Lee
et al. [2018a] propose to construct a new knowledge graph where the entities
are all the categories. They define three types of label relations (super-
subordinate, positive correlation, and negative correlation) and propagate the
confidence of labels in the graph directly.
Figure 13.2: The method in Teney et al. [2017] for visual question answering. The scene graph from the
picture and the syntactic graph from the question are first constructed and then combined for question
answering.
13.2 TEXT
Graph neural networks can be applied to several tasks based on text. They
can be applied to both sentence-level tasks (e.g., text classification) and
word-level tasks (e.g., sequence labeling). We will introduce several major
applications on text in the following.
13.2.1 TEXT CLASSIFICATION
Text classification is an important and classical problem in natural language
processing. The classical GCN models [Atwood and Towsley, 2016,
Defferrard et al., 2016, Hamilton et al., 2017b, Henaff et al., 2015, Kipf and
Welling, 2017, Monti et al., 2017] and the GAT model [Velickovic et al., 2018]
are applied to solve the problem, but they only use the structural information
among documents and do not make much use of the text information.
Peng et al. [2018] propose a graph-CNN-based deep learning model. It
first turns texts into graphs of words and then conducts the convolution
operations of Niepert et al. [2016] on the word graphs.
Zhang et al. [2018c] propose the S-LSTM to encode text. The whole
sentence is represented in a single state which contains an overall global state
and several sub-states for individual words. It uses the global sentence-level
representation for classification tasks.
These methods either view a document or a sentence as a graph of word
nodes or rely on the document citation relations to construct the graph. Yao et
al. [2019] regard the documents and words as nodes to construct the corpus
graph (hence a heterogeneous graph) and use the Text GCN to learn
embeddings of words and documents.
Sentiment classification could also be regarded as a text classification
problem, and a Tree-LSTM approach is proposed by Tai et al. [2015].
Figure 14.1: A small example of traveling salesman problem (TSP). The nodes denote different cities
and edges denote paths between cities. The edge weights are path lengths. The red line shows the
shortest possible loop that connects every city.
Open Resources
15.1 DATASETS
Many tasks related to graphs are released to test the performance of various graph neural
networks. Such tasks are based on the following commonly used datasets.
A series of datasets based on citation networks are as follows:
• Pubmed [Yang et al., 2016]
• Cora [Yang et al., 2016]
• Citeseer [Yang et al., 2016]
• DBLP [Tang et al., 2008]
A series of datasets based on Biochemical graphs are as follows:
• MUTAG [Debnath et al., 1991]
• NCI-1 [Wale et al., 2008]
• PPI [Zitnik and Leskovec, 2017]
• D&D [Dobson and Doig, 2003]
• PROTEIN [Borgwardt et al., 2005]
• PTC [Toivonen et al., 2003]
A series of datasets based on Social Networks are as follows:
• Reddit [Hamilton et al., 2017c]
• BlogCatalog [Zafarani and Liu, 2009]
A series of datasets based on Knowledge Graphs are as follows:
• FB13 [Socher et al., 2013]
• FB15K [Bordes et al., 2013]
• FB15K237 [Toutanova et al., 2015]
• WN11 [Socher et al., 2013]
• WN18 [Bordes et al., 2013]
• WN18RR [Dettmers et al., 2018]
A broader range of open-source dataset repositories is as follows:
• Network Repository
A scientific network data repository with interactive visualization and mining tools.
http://networkrepository.com
• Graph Kernel Datasets
Benchmark datasets for graph kernels.
https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets
• Relational Dataset Repository
To support the growth of relational machine learning.
https://relational.fit.cvut.cz
• Stanford Large Network Dataset Collection
The SNAP library is developed to study large social and information networks.
https://snap.stanford.edu/data/
• Open Graph Benchmark
Open Graph Benchmark (OGB) is a collection of benchmark datasets, data-loaders, and
evaluators for graph machine learning in PyTorch.
https://ogb.stanford.edu/
15.2 IMPLEMENTATIONS
We first list several platforms that provide codes for graph computing in Table 15.1.
Next, we list the hyperlinks of the current opensource implementations of some famous
GNN models in Table 15.2.
As the research field grows rapidly, we recommend to our readers the paper list published
by our team, GNNPapers (https://github.com/thunlp/gnnpapers), for recent studies.
Model Link
GGNN (2015) https://github.com/yujiali/ggnn
Neural FPs (2015) https://github.com/HIPS/neural-fingerprint
ChebNet (2016) https://github.com/mdeff/cnn_graph
DNGR (2016) https://github.com/ShelsonCao/DNGR
SDNE (2016) https://github.com/suanrong/SDNE
GAE (2016) https://github.com/limaosen0/Variational-Graph-Auto-Encoders
DRNE (2016) https://github.com/tadpole/DRNE
Structural RNN (2016) https://github.com/asheshjain399/RNNexp
DCNN (2016) https://github.com/jcatw/dcnn
GCN (2017) https://github.com/tkipf/gcn
CayleyNet (2017) https://github.com/amoliu/CayleyNet
GraphSage (2017) https://github.com/williamleif/GraphSAGE
GAT (2017) https://github.com/PetarV-/GAT
CLN (2017) https://github.com/trangptm/Column_networks
ECC (2017) https://github.com/mys007/ecc
MPNNs (2017) https://github.com/brain-research/mpnn
MoNet (2017) https://github.com/pierrebaque/GeometricConvolutionsBench
JK-Net (2018) https://github.com/ShinKyuY/Representation_Learning_on_Graphs_with_Jumping_Knowledge_Networks
SSE (2018) https://github.com/Hanjun-Dai/steady_state_embedding
LGCN (2018) https://github.com/divelab/lgcn/
FastGCN (2018) https://github.com/matenure/FastGCN
DiffPool (2018) https://github.com/RexYing/diffpool
GraphRNN (2018) https://github.com/snap-stanford/GraphRNN
MolGAN (2018) https://github.com/nicola-decao/MolGAN
NetGAN (2018) https://github.com/danielzuegner/netgan
DCRNN (2018) https://github.com/liyaguang/DCRNN
ST-GCN (2018) https://github.com/yysijie/st-gcn
RGCN (2018) https://github.com/tkipf/relational-gcn
AS-GCN (2018) https://github.com/huangwb/AS-GCN
DGCN (2018) https://github.com/ZhuangCY/DGCN
GaAN (2018) https://github.com/jennyzhang0215/GaAN
DGI (2019) https://github.com/PetarV-/DGI
GraphWaveNet (2019) https://github.com/nnzhan/Graph-WaveNet
HAN (2019) https://github.com/Jhy1993/HAN
CHAPTER 16
Conclusion
Although GNNs have achieved great success in different fields, it is
worth noting that GNN models are not good enough to offer satisfying
solutions for any graph under any condition. In this section, we will state some
open problems for further research.
Shallow Structure. Traditional DNNs can stack hundreds of layers to get
better performance, because a deeper structure has more parameters, which
improves the expressive power significantly. However, graph neural networks
are often shallow, most of them with no more than three layers. As
experiments in Li et al. [2018a] show, stacking multiple GCN layers will
result in over-smoothing; that is to say, all vertices will converge to the same
value. Although some researchers have managed to tackle this problem [Li et
al., 2018a, 2016], it remains the biggest limitation of GNNs. Designing truly
deep GNNs is an exciting challenge for future research and will be a
considerable contribution to the understanding of GNNs.
Dynamic Graphs. Another challenging problem is how to deal with
graphs with dynamic structures. Static graphs are stable, so they can be
modeled feasibly, while dynamic graphs introduce changing structures. When
edges and nodes appear or disappear, GNNs cannot adapt accordingly.
Dynamic GNNs are being actively researched, and we believe they will be a big
milestone for the stability and adaptability of general GNNs.
Non-Structural Scenarios. Although we have discussed the applications of GNNs to non-structural scenarios, we find that there is no optimal method for generating graphs from raw data. In the image domain, some works use CNNs to obtain feature maps and then upsample them to form superpixels as nodes [Liang et al., 2016], while others directly leverage object detection algorithms to obtain object nodes. In the text domain [Chen et al., 2018c], some works employ syntactic trees as syntactic graphs, while others adopt fully connected graphs. Therefore, finding the best graph generation approach will offer a wider range of fields where GNNs can contribute; a small sketch of the fully connected construction is given below.
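As a small illustration of the fully connected construction mentioned above, the following sketch (ours; the whitespace tokenizer and the edge representation are arbitrary choices made only for illustration) treats every token of a sentence as a node and connects all pairs of tokens.

```python
from itertools import combinations

def fully_connected_text_graph(sentence: str):
    tokens = sentence.lower().split()        # naive whitespace tokenizer
    nodes = list(enumerate(tokens))          # (node_id, token) pairs
    # undirected edges between every pair of distinct token positions
    edges = list(combinations(range(len(tokens)), 2))
    return nodes, edges

nodes, edges = fully_connected_text_graph("graph neural networks operate on graphs")
print(nodes)       # [(0, 'graph'), (1, 'neural'), ...]
print(len(edges))  # n * (n - 1) / 2 edges for n tokens -> 15 here
```

A syntactic-graph construction would instead keep only the edges produced by a dependency or constituency parser, which yields a sparser graph but requires an external parsing step.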
Scalability. How to apply embedding methods in web-scale settings such as social networks or recommendation systems has been a critical problem for almost all graph-embedding algorithms, and GNNs are no exception. Scaling up GNNs is difficult because many of their core steps are computationally expensive in big-data environments. There are several examples of this phenomenon. First, graph data are not regular Euclidean data; each node has its own neighborhood structure, so mini-batches cannot be applied directly. Second, computing the graph Laplacian is infeasible when there are millions of nodes and edges. Moreover, scalability determines whether an algorithm can be put to practical use. Several works have proposed solutions to this problem [Ying et al., 2018a], and recent research is paying more attention to this direction; a neighbor-sampling sketch illustrating one such solution follows.
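One common answer to the batching problem above is neighbor sampling in the style of Hamilton et al. [2017b] and the web-scale system of Ying et al. [2018a]: for each target node, only a fixed number of randomly chosen neighbors is kept per hop, so the cost of a mini-batch no longer grows with the full graph. The sketch below is ours rather than the algorithm of any specific paper; the adjacency-list format, the function name, and the fan-out values are illustrative assumptions.

```python
import random

def sample_neighborhood(adj, targets, fanouts, seed=0):
    """Per hop, sample edges by expanding outward from `targets`.

    adj     -- dict: node -> list of neighbor nodes (the full graph)
    targets -- nodes in the current mini-batch
    fanouts -- e.g., [2, 2]: max neighbors sampled at each hop
    """
    rng = random.Random(seed)
    frontier = set(targets)
    layers = []
    for fanout in fanouts:
        sampled_edges = []
        next_frontier = set()
        for node in frontier:
            neighbors = adj.get(node, [])
            chosen = neighbors if len(neighbors) <= fanout else rng.sample(neighbors, fanout)
            sampled_edges.extend((node, nbr) for nbr in chosen)
            next_frontier.update(chosen)
        layers.append(sampled_edges)
        frontier = next_frontier              # expand from the newly sampled nodes
    return layers

# Toy usage: a star graph plus a short chain, a batch of two target nodes, fan-out 2 per hop.
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0, 5], 5: [4]}
for hop, edges in enumerate(sample_neighborhood(adj, targets=[0, 5], fanouts=[2, 2]), 1):
    print(f"hop {hop}: {sorted(edges)}")
```

Because at most a fixed number of edges is kept per visited node, the sampled computation graph of a mini-batch is bounded by the batch size and the product of the fan-outs rather than by the size of the whole graph.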
In conclusion, graph neural networks have become powerful and practical tools for machine learning tasks in the graph domain. This progress owes to advances in expressive power, model flexibility, and training algorithms. In this book, we give a detailed introduction to graph neural networks. For GNN models, we introduce variants categorized as graph convolutional networks, graph recurrent networks, graph attention networks, and graph residual networks. Moreover, we summarize several general frameworks that uniformly represent different variants. In terms of application taxonomy, we divide GNN applications into structural scenarios, non-structural scenarios, and other scenarios, and then give a detailed review of the applications in each scenario. Finally, we suggest four open problems that indicate the major challenges and future research directions of graph neural networks: model depth, scalability, the ability to deal with dynamic graphs, and non-structural scenarios.
Bibliography
F. Alet, A. K. Jeewajee, M. Bauza, A. Rodriguez, T. Lozano-Perez, and L. P. Kaelbling. 2019. Graph
element networks: Adaptive, structured computation and memory. In Proc. of ICML. 68
M. Allamanis, M. Brockschmidt, and M. Khademi. 2018. Learning to represent programs with graphs.
In Proc. of ICLR. 75
G. Angeli and C. D. Manning. 2014. Naturalli: Natural logic inference for common sense reasoning. In
Proc. of EMNLP, pages 534–545. DOI: 10.3115/v1/d14-1059 81
J. Atwood and D. Towsley. 2016. Diffusion-convolutional neural networks. In Proc. of NIPS, pages
1993–2001. 2, 26, 30, 78
D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and
translate. In Proc. of ICLR. 39
J. Bastings, I. Titov, W. Aziz, D. Marcheggiani, and K. Simaan. 2017. Graph convolutional encoders for
syntax-aware neural machine translation. In Proc. of EMNLP, pages 1957–1967. DOI:
10.18653/v1/d17-1209 79
P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, et al. 2016. Interaction networks for learning about
objects, relations and physics. In Proc. of NIPS, pages 4502–4510. 1, 59, 63, 67, 81
P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A.
Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. 2018. Relational inductive biases, deep learning,
and graph networks. ArXiv Preprint ArXiv:1806.01261. 3, 59, 62, 63, 64
D. Beck, G. Haffari, and T. Cohn. 2018. Graph-to-sequence learning using gated graph neural networks.
In Proc. of ACL, pages 273–283. DOI: 10.18653/v1/p18-1026 49, 79, 81
I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio. 2017. Neural combinatorial optimization with
reinforcement learning. In Proc. of ICLR. 84
Y. Bengio, P. Simard, P. Frasconi, et al. 1994. Learning long-term dependencies with gradient descent is
difficult. IEEE TNN, 5(2):157–166. DOI: 10.1109/72.279181 17
M. Berlingerio, M. Coscia, and F. Giannotti. 2011. Finding redundant and complementary communities
in multidimensional networks. In Proc. of CIKM, pages 2181–2184. ACM. DOI:
10.1145/2063576.2063921 52
A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. 2013. Translating embeddings
for modeling multi-relational data. In Proc. of NIPS, pages 2787–2795. 87, 88
K. M. Borgwardt, C. S. Ong, S. Schönauer, S. Vishwanathan, A. J. Smola, and H.-P. Kriegel. 2005.
Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56. DOI:
10.1093/bioinformatics/bti1007 87
D. Boscaini, J. Masci, E. Rodolà, and M. Bronstein. 2016. Learning shape correspondence with anisotropic convolutional neural networks. In Proc. of NIPS, pages 3189–3197. 2, 30
J. Bradshaw, M. J. Kusner, B. Paige, M. H. Segler, and J. M. Hernández-Lobato. 2019. A generative
model for electron paths. In Proc. of ICLR. 70
M. Brockschmidt, M. Allamanis, A. L. Gaunt, and O. Polozov. 2019. Generative code modeling with
graphs. In Proc. of ICLR. 84
M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. 2017. Geometric deep learning: Going beyond Euclidean data. IEEE SPM, 34(4):18–42. DOI: 10.1109/msp.2017.2693418 2
J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun. 2014. Spectral networks and locally connected
networks on graphs. In Proc. of ICLR. 23, 59
A. Buades, B. Coll, and J.-M. Morel. 2005. A non-local algorithm for image denoising. In Proc. of
CVPR, 2:60–65. IEEE. DOI: 10.1109/cvpr.2005.38 60, 61
H. Cai, V. W. Zheng, and K. C.-C. Chang. 2018. A comprehensive survey of graph embedding:
Problems, techniques, and applications. IEEE TKDE, 30(9):1616–1637. DOI:
10.1109/tkde.2018.2807452 2
S. Cao, W. Lu, and Q. Xu. 2016. Deep neural networks for learning graph representations. In Proc. of
AAAI. 56
M. Chang, T. Ullman, A. Torralba, and J. B. Tenenbaum. 2017. A compositional object-based approach
to learning physical dynamics. In Proc. of ICLR. 59, 63
J. Chen, T. Ma, and C. Xiao. 2018a. FastGCN: Fast learning with graph convolutional networks via
importance sampling. In Proc. of ICLR. 54
J. Chen, J. Zhu, and L. Song. 2018b. Stochastic training of graph convolutional networks with variance
reduction. In Proc. of ICML, pages 941–949. 55
X. Chen, L.-J. Li, L. Fei-Fei, and A. Gupta. 2018c. Iterative visual reasoning beyond convolutions. In
Proc. of CVPR, pages 7239–7248. DOI: 10.1109/cvpr.2018.00756 77, 91
X. Chen, G. Yu, J. Wang, C. Domeniconi, Z. Li, and X. Zhang. 2019. Activehne: Active heterogeneous
network embedding. In Proc. of IJCAI. DOI: 10.24963/ijcai.2019/294 48
J. Cheng, L. Dong, and M. Lapata. 2016. Long short-term memory-networks for machine reading. In
Proc. of EMNLP, pages 551–561. DOI: 10.18653/v1/d16-1053 39
K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. of EMNLP, pages 1724–1734. DOI: 10.3115/v1/d14-1179 17, 33, 60
F. R. Chung and F. C. Graham. 1997. Spectral Graph Theory. American Mathematical Society. DOI:
10.1090/cbms/092 1
P. Cui, X. Wang, J. Pei, and W. Zhu. 2018. A survey on network embedding. IEEE TKDE. DOI:
10.1109/TKDE.2018.2849727 2
H. Dai, B. Dai, and L. Song. 2016. Discriminative embeddings of latent variable models for structured
data. In Proc. of ICML, pages 2702–2711. 59, 85
H. Dai, Z. Kozareva, B. Dai, A. Smola, and L. Song. 2018. Learning steady-states of iterative algorithms
over graphs. In Proc. of ICML, pages 1114–1122. 55
N. De Cao and T. Kipf. 2018. MolGAN: An implicit generative model for small molecular graphs.
ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models. 83
A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch. 1991.
Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds.
Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry,
34(2):786–797. DOI: 10.1021/jm00106a046 87
M. Defferrard, X. Bresson, and P. Vandergheynst. 2016. Convolutional neural networks on graphs with
fast localized spectral filtering. In Proc. of NIPS, pages 3844–3852. 24, 59, 78
T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel. 2018. Convolutional 2D knowledge graph
embeddings. In Proc. of AAAI. 71, 88
K. Do, T. Tran, and S. Venkatesh. 2019. Graph transformation policy network for chemical reaction
prediction. In Proc. of SIGKDD, pages 750–760. ACM. DOI: 10.1145/3292500.3330958 70
P. D. Dobson and A. J. Doig. 2003. Distinguishing enzyme structures from non-enzymes without
alignments. Journal of Molecular Biology, 330(4):771–783. DOI: 10.1016/s0022-2836(03)00628-4
87
D. K. Duvenaud, D. Maclaurin, J. Aguileraiparraguirre, R. Gomezbombarelli, T. D. Hirzel, A.
Aspuruguzik, and R. P. Adams. 2015. Convolutional networks on graphs for learning molecular
fingerprints. In Proc. of NIPS, pages 2224–2232. 25, 59, 68
W. Fan, Y. Ma, Q. Li, Y. He, E. Zhao, J. Tang, and D. Yin. 2019. Graph neural networks for social
recommendation. In Proc. of WWW, pages 417–426. ACM. DOI: 10.1145/3308558.3313488 74
M. Fey and J. E. Lenssen. 2019. Fast graph representation learning with PyTorch Geometric. In ICLR
Workshop on Representation Learning on Graphs and Manifolds. 89
A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur. 2017. Protein interface prediction using graph
convolutional networks. In Proc. of NIPS, pages 6530–6539. 1, 70
H. Gao, Z. Wang, and S. Ji. 2018. Large-scale learnable graph convolutional networks. In Proc. of
SIGKDD, pages 1416–1424. ACM. DOI: 10.1145/3219819.3219947 29
V. Garcia and J. Bruna. 2018. Few-shot learning with graph neural networks. In Proc. of ICLR. 76
J. Gehring, M. Auli, D. Grangier, and Y. N. Dauphin. 2017. A convolutional encoder model for neural
machine translation. In Proc. of ACL, 1:123–135. DOI: 10.18653/v1/p17-1012 39
J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. 2017. Neural message passing for
quantum chemistry. In Proc. of ICML, pages 1263–1272. 3, 59, 60, 62, 63, 64
M. Gori, G. Monfardini, and F. Scarselli. 2005. A new model for learning in graph domains. In Proc. of
IJCNN, pages 729–734. DOI: 10.1109/ijcnn.2005.1555942 19
P. Goyal and E. Ferrara. 2018. Graph embedding techniques, applications, and performance: A survey.
Knowledge-Based Systems, 151:78–94. DOI: 10.1016/j.knosys.2018.03.022 2
J. L. Gross and J. Yellen. 2004. Handbook of Graph Theory. CRC Press. DOI:
10.1201/9780203490204 49
A. Grover and J. Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proc. of
SIGKDD, pages 855–864. ACM. DOI: 10.1145/2939672.2939754 2
A. Grover, A. Zweig, and S. Ermon. 2019. Graphite: Iterative generative modeling of graphs. In Proc. of
ICML. 84
J. Gu, H. Hu, L. Wang, Y. Wei, and J. Dai. 2018. Learning region features for object detection. In Proc.
of ECCV, pages 381–395. DOI: 10.1007/978-3-030-01258-8_24 77
T. Hamaguchi, H. Oiwa, M. Shimbo, and Y. Matsumoto. 2017. Knowledge transfer for out-of-
knowledge-base entities: A graph neural network approach. In Proc. of IJCAI, pages 1802–1808.
DOI: 10.24963/ijcai.2017/250 1, 72
W. L. Hamilton, R. Ying, and J. Leskovec. 2017a. Representation learning on graphs: Methods and
applications. IEEE Data(base) Engineering Bulletin, 40:52–74. 2
W. L. Hamilton, Z. Ying, and J. Leskovec. 2017b. Inductive representation learning on large graphs. In
Proc. of NIPS, pages 1024–1034. 1, 31, 32, 53, 67, 74, 78
W. L. Hamilton, J. Zhang, C. Danescu-Niculescu-Mizil, D. Jurafsky, and J. Leskovec. 2017c. Loyalty in
online communities. In Proc. of ICWSM. 87
D. K. Hammond, P. Vandergheynst, and R. Gribonval. 2011. Wavelets on graphs via spectral graph
theory. Applied and Computational Harmonic Analysis, 30(2):129–150. DOI:
10.1016/j.acha.2010.04.005 24
J. B. Hamrick, K. Allen, V. Bapst, T. Zhu, K. R. Mckee, J. B. Tenenbaum, and P. Battaglia. 2018.
Relational inductive bias for physical construction in humans and machines. Cognitive Science. 63
K. He, X. Zhang, S. Ren, and J. Sun. 2016a. Deep residual learning for image recognition. In Proc. of
CVPR, pages 770–778. DOI: 10.1109/cvpr.2016.90 43, 61
K. He, X. Zhang, S. Ren, and J. Sun. 2016b. Identity mappings in deep residual networks. In Proc. of
ECCV, pages 630–645. Springer. DOI: 10.1007/978-3-319-46493-0_38 31, 45
M. Henaff, J. Bruna, and Y. Lecun. 2015. Deep convolutional networks on graph-structured data. ArXiv Preprint ArXiv:1506.05163. 23, 78
S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–
1780. DOI: 10.1162/neco.1997.9.8.1735 17, 33
S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, et al., 2001. Gradient flow in recurrent nets: The
difficulty of learning long-term dependencies. A Field Guide to Dynamical Recurrent Neural Networks.
IEEE Press. 17
Y. Hoshen. 2017. Vain: Attentional multi-agent predictive modeling. In Proc. of NIPS, pages 2701–
2711. 59, 60, 67, 75
H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei. 2018. Relation networks for object detection. In Proc. of
CVPR, pages 3588–3597. DOI: 10.1109/cvpr.2018.00378 77
G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. 2017. Densely connected convolutional
networks. In Proc. of CVPR, pages 4700–4708. DOI: 10.1109/cvpr.2017.243 45
W. Huang, T. Zhang, Y. Rong, and J. Huang. 2018. Adaptive sampling towards fast graph representation
learning. In Proc. of NeurIPS, pages 4563–4572. 54
T. J. Hughes. 2012. The Finite Element Method: Linear Static and Dynamic Finite Element Analysis.
Courier Corporation. 68
A. Jain, A. R. Zamir, S. Savarese, and A. Saxena. 2016. Structural-RNN: Deep learning on spatio-
temporal graphs. In Proc. of CVPR, pages 5308–5317. DOI: 10.1109/cvpr.2016.573 51, 77
W. Jin, R. Barzilay, and T. Jaakkola. 2018. Junction tree variational autoencoder for molecular graph
generation. In Proc. of ICML. 69
W. Jin, K. Yang, R. Barzilay, and T. Jaakkola. 2019. Learning multimodal graph-to-graph translation for
molecular optimization. In Proc. of ICLR. 69
M. Kampffmeyer, Y. Chen, X. Liang, H. Wang, Y. Zhang, and E. P. Xing. 2019. Rethinking knowledge
graph propagation for zero-shot learning. In Proc. of CVPR. DOI: 10.1109/cvpr.2019.01175 47, 75,
76
S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley. 2016. Molecular graph convolutions:
Moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608. DOI:
10.1007/s10822-016-9938-8 59, 69
E. Khalil, H. Dai, Y. Zhang, B. Dilkina, and L. Song. 2017. Learning combinatorial optimization
algorithms over graphs. In Proc. of NIPS, pages 6348–6358. 1, 59, 85
M. A. Khamsi and W. A. Kirk. 2011. An Introduction to Metric Spaces and Fixed Point Theory, volume
53. John Wiley & Sons. DOI: 10.1002/9781118033074 20
M. R. Khan and J. E. Blumenstock. 2019. Multi-GCN: Graph convolutional networks for multi-view
networks, with applications to global poverty. ArXiv Preprint ArXiv:1901.11213. DOI:
10.1609/aaai.v33i01.3301606 52
T. Kipf, E. Fetaya, K. Wang, M. Welling, and R. S. Zemel. 2018. Neural relational inference for
interacting systems. In Proc. of ICML, pages 2688–2697. 63, 67, 75
T. N. Kipf and M. Welling. 2016. Variational graph auto-encoders. In Proc. of NIPS. 56
T. N. Kipf and M. Welling. 2017. Semi-supervised classification with graph convolutional networks. In
Proc. of ICLR. 1, 2, 24, 30, 31, 43, 48, 53, 59, 67, 78, 79
W. Kool, H. van Hoof, and M. Welling. 2019. Attention, learn to solve routing problems! In Proc. of ICLR. https://openreview.net/forum?id=ByxBFsRqYm 85
A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. Imagenet classification with deep convolutional
neural networks. In Proc. of NIPS, pages 1097–1105. DOI: 10.1145/3065386 17
L. Landrieu and M. Simonovsky. 2018. Large-scale point cloud semantic segmentation with superpoint
graphs. In Proc. of CVPR, pages 4558–4567. DOI: 10.1109/cvpr.2018.00479 78
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document
recognition. Proc. of the IEEE, 86(11):2278–2324. DOI: 10.1109/5.726791 1, 17
Y. LeCun, Y. Bengio, and G. Hinton. 2015. Deep learning. Nature, 521(7553):436. DOI:
10.1038/nature14539 1
C. Lee, W. Fang, C. Yeh, and Y. F. Wang. 2018a. Multi-label zero-shot learning with structured
knowledge graphs. In Proc. of CVPR, pages 1576–1585. DOI: 10.1109/cvpr.2018.00170 76
G.-H. Lee, W. Jin, D. Alvarez-Melis, and T. S. Jaakkola. 2019. Functional transparency for structured
data: A game-theoretic approach. In Proc. of ICML. 69
J. B. Lee, R. A. Rossi, S. Kim, N. K. Ahmed, and E. Koh. 2018b. Attention models in graphs: A survey.
ArXiv Preprint ArXiv:1807.07984. DOI: 10.1145/3363574 3
F. W. Levi. 1942. Finite Geometrical Systems: Six Public Lectures Delivered in February, 1940, at the
University of Calcutta. The University of Calcutta. 49
G. Li, M. Muller, A. Thabet, and B. Ghanem. 2019. DeepGCNs: Can GCNs go as deep as CNNs? In
Proc. of ICCV. 45, 46
Q. Li, Z. Han, and X.-M. Wu. 2018a. Deeper insights into graph convolutional networks for semi-
supervised learning. In Proc. of AAAI. 55, 91
R. Li, S. Wang, F. Zhu, and J. Huang. 2018b. Adaptive graph convolutional neural networks. In Proc. of
AAAI. 25
Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel. 2016. Gated graph sequence neural networks. In
Proc. of ICLR. 22, 33, 59, 75, 91
Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia. 2018c. Learning deep generative models of
graphs. In Proc. of ICLR Workshop. 83
Y. Li, R. Yu, C. Shahabi, and Y. Liu. 2018d. Diffusion convolutional recurrent neural network: Data-
driven traffic forecasting. In Proc. of ICLR. DOI: 10.1109/trust-com/bigdatase.2019.00096 50
X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan. 2016. Semantic object parsing with graph LSTM. In
Proc. of ECCV, pages 125–143. DOI: 10.1007/978-3-319-46448-0_8 36, 77, 91
X. Liang, L. Lin, X. Shen, J. Feng, S. Yan, and E. P. Xing. 2017. Interpretable structure-evolving LSTM.
In Proc. of CVPR, pages 2175–2184. DOI: 10.1109/cvpr.2017.234 78
X. Liu, Z. Luo, and H. Huang. 2018. Jointly multiple events extraction via attention-based graph
information aggregation. In Proc. of EMNLP. DOI: 10.18653/v1/d18-1156 81
T. Ma, J. Chen, and C. Xiao. 2018. Constrained generation of semantically valid graphs via regularizing
variational autoencoders. In Proc. of NeurIPS, pages 7113–7124. 83
Y. Ma, S. Wang, C. C. Aggarwal, D. Yin, and J. Tang. 2019. Multi-dimensional graph convolutional
networks. In Proc. of SDM, pages 657–665. DOI: 10.1137/1.9781611975673.74 52
D. Marcheggiani and I. Titov. 2017. Encoding sentences with graph convolutional networks for
semantic role labeling. In Proc. of EMNLP, pages 1506–1515. DOI: 10.18653/v1/d17-1159 79
D. Marcheggiani, J. Bastings, and I. Titov. 2018. Exploiting semantics in neural machine translation with
graph convolutional networks. In Proc. of NAACL. DOI: 10.18653/v1/n18-2078 79
K. Marino, R. Salakhutdinov, and A. Gupta. 2017. The more you know: Using knowledge graphs for
image classification. In Proc. of CVPR, pages 20–28. DOI: 10.1109/cvpr.2017.10 76
J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst. 2015. Geodesic convolutional neural
networks on Riemannian manifolds. In Proc. of ICCV Workshops, pages 37–45. DOI:
10.1109/iccvw.2015.112 2, 30
T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013. Efficient estimation of word representations in
vector space. In Proc. of ICLR. 2
M. Miwa and M. Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree
structures. In Proc. of ACL, pages 1105–1116. DOI: 10.18653/v1/p16-1105 79
F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein. 2017. Geometric deep
learning on graphs and manifolds using mixture model CNNs. In Proc. of CVPR, pages 5425–5434.
DOI: 10.1109/cvpr.2017.576 2, 30, 73, 78
M. Narasimhan, S. Lazebnik, and A. G. Schwing. 2018. Out of the box: Reasoning with graph
convolution nets for factual visual question answering. In Proc. of NeurIPS, pages 2654–2665. 77
D. Nathani, J. Chauhan, C. Sharma, and M. Kaul. 2019. Learning attention-based embeddings for
relation prediction in knowledge graphs. In Proc. of ACL. DOI: 10.18653/v1/p19-1466 72
T. H. Nguyen and R. Grishman. 2018. Graph convolutional networks with argument-aware pooling for
event detection. In Proc. of AAAI. 81
M. Niepert, M. Ahmed, and K. Kutzkov. 2016. Learning convolutional neural networks for graphs. In
Proc. of ICML, pages 2014–2023. 26, 78
W. Norcliffebrown, S. Vafeias, and S. Parisot. 2018. Learning conditioned graph structures for
interpretable visual question answering. In Proc. of NeurIPS, pages 8334–8343. 77
A. Nowak, S. Villar, A. S. Bandeira, and J. Bruna. 2018. Revised note on learning quadratic assignment
with graph neural networks. In Proc. of IEEE DSW, pages 1–5. IEEE. DOI:
10.1109/dsw.2018.8439919 85
R. Palm, U. Paquet, and O. Winther. 2018. Recurrent relational networks. In Proc. of NeurIPS, pages
3368–3378. 81
S. Pan, R. Hu, G. Long, J. Jiang, L. Yao, and C. Zhang. 2018. Adversarially regularized graph
autoencoder for graph embedding. In Proc. of IJCAI. DOI: 10.24963/ijcai.2018/362 56
E. E. Papalexakis, L. Akoglu, and D. Ience. 2013. Do more views of a graph help? Community
detection and clustering in multi-graphs. In Proc. of FUSION, pages 899–905. IEEE. 52
H. Peng, J. Li, Y. He, Y. Liu, M. Bao, L. Wang, Y. Song, and Q. Yang. 2018. Large-scale hierarchical
text classification with recursively regularized deep graph-CNN. In Proc. of WWW, pages 1063–1072.
DOI: 10.1145/3178876.3186005 78
H. Peng, J. Li, Q. Gong, Y. Song, Y. Ning, K. Lai, and P. S. Yu. 2019. Fine-grained event categorization
with heterogeneous graph convolutional networks. In Proc. of IJCAI. DOI: 10.24963/ijcai.2019/449
48
N. Peng, H. Poon, C. Quirk, K. Toutanova, and W.-t. Yih. 2017. Cross-sentence N-ary relation
extraction with graph LSTMs. TACL, 5:101–115. DOI: 10.1162/tacl_a_00049 35, 80
B. Perozzi, R. Al-Rfou, and S. Skiena. 2014. Deepwalk: Online learning of social representations. In
Proc. of SIGKDD, pages 701–710. ACM. DOI: 10.1145/2623330.2623732 2
T. Pham, T. Tran, D. Phung, and S. Venkatesh. 2017. Column networks for collective classification. In
Proc. of AAAI. 43
M. Prates, P. H. Avelar, H. Lemos, L. C. Lamb, and M. Y. Vardi. 2019. Learning to solve NP-complete
problems: A graph neural network for decision TSP. In Proc. of AAAI, 33:4731–4738. DOI:
10.1609/aaai.v33i01.33014731 85
C. R. Qi, H. Su, K. Mo, and L. J. Guibas. 2017a. PointNet: Deep learning on point sets for 3D
classification and segmentation. In Proc. of CVPR, 1(2):4. DOI: 10.1109/cvpr.2017.16 59
S. Qi, W. Wang, B. Jia, J. Shen, and S.-C. Zhu. 2018. Learning human-object interactions by graph
parsing neural networks. In Proc. of ECCV, pages 401–417. DOI: 10.1007/978-3-030-01240-3_25
77
X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun. 2017b. 3D graph neural networks for RGBD semantic
segmentation. In Proc. of CVPR, pages 5199–5208. DOI: 10.1109/iccv.2017.556 78
A. Rahimi, T. Cohn, and T. Baldwin. 2018. Semi-supervised user geolocation via graph convolutional
networks. In Proc. of ACL, 1:2009–2019. DOI: 10.18653/v1/p18-1187 43, 67
D. Raposo, A. Santoro, D. G. T. Barrett, R. Pascanu, T. P. Lillicrap, and P. Battaglia. 2017. Discovering
objects and their relations from entangled scene representations. In Proc. of ICLR. 59, 64
S. Rhee, S. Seo, and S. Kim. 2018. Hybrid approach of relation network and localized graph
convolutional filtering for breast cancer subtype classification. In Proc. of IJCAI. DOI:
10.24963/ijcai.2018/490 70
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252. DOI: 10.1007/s11263-015-0816-y 75
A. Sanchez, N. Heess, J. T. Springenberg, J. Merel, R. Hadsell, M. A. Riedmiller, and P. Battaglia. 2018.
Graph networks as learnable physics engines for inference and control. In Proc. of ICLR, pages 4467–
4476. 1, 63, 64, 67
A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. 2017. A
simple neural network module for relational reasoning. In Proc. of NIPS, pages 4967–4976. 59, 63,
81
F. Scarselli, A. C. Tsoi, M. Gori, and M. Hagenbuchner. 2004. Graphical-based learning environments
for pattern recognition. In Proc. of Joint IAPR International Workshops on SPR and SSPR, pages 42–
56. DOI: 10.1007/978-3-540-27868-9_4 19
F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. 2009. The graph neural
network model. IEEE TNN, pages 61–80. DOI: 10.1109/tnn.2008.2005605 19, 20, 47, 62
M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling. 2018. Modeling
relational data with graph convolutional networks. In Proc. of ESWC, pages 593–607. Springer. DOI:
10.1007/978-3-319-93417-4_38 22, 50, 71
K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko. 2017. Quantum-chemical
insights from deep tensor neural networks. Nature Communications, 8:13890. DOI:
10.1038/ncomms13890 59
C. Shang, Y. Tang, J. Huang, J. Bi, X. He, and B. Zhou. 2019a. End-to-end structure-aware
convolutional networks for knowledge base completion. In Proc. of AAAI, 33:3060–3067. DOI:
10.1609/aaai.v33i01.33013060 71
J. Shang, T. Ma, C. Xiao, and J. Sun. 2019b. Pre-training of graph augmented transformers for
medication recommendation. In Proc. of IJCAI. DOI: 10.24963/ijcai.2019/825 70
J. Shang, C. Xiao, T. Ma, H. Li, and J. Sun. 2019c. GameNet: Graph augmented memory networks for
recommending medication combination. In Proc. of AAAI, 33:1126–1133. DOI:
10.1609/aaai.v33i01.33011126 70
O. Shchur, D. Zugner, A. Bojchevski, and S. Gunnemann. 2018. NetGAN: Generating graphs via
random walks. In Proc. of ICML, pages 609–618. 83
M. Simonovsky and N. Komodakis. 2017. Dynamic edge-conditioned filters in convolutional neural
networks on graphs. In Proc. CVPR, pages 3693–3702. DOI: 10.1109/cvpr.2017.11 55
K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image
recognition. ArXiv Preprint ArXiv:1409.1556. 17
R. Socher, D. Chen, C. D. Manning, and A. Ng. 2013. Reasoning with neural tensor networks for
knowledge base completion. In Proc. of NIPS, pages 926–934. 87, 88
L. Song, Z. Wang, M. Yu, Y. Zhang, R. Florian, and D. Gildea. 2018a. Exploring graph-structured
passage representation for multi-hop reading comprehension with graph neural networks. ArXiv
Preprint ArXiv:1809.02040. 81
L. Song, Y. Zhang, Z. Wang, and D. Gildea. 2018b. A graph-to-sequence model for AMR-to-text
generation. In Proc. of ACL, pages 1616–1626. DOI: 10.18653/v1/p18-1150 81
L. Song, Y. Zhang, Z. Wang, and D. Gildea. 2018c. N-ary relation extraction using graph state LSTM.
In Proc. of EMNLP, pages 2226–2235. DOI: 10.18653/v1/d18-1246 80
S. Sukhbaatar, R. Fergus, et al. 2016. Learning multiagent communication with backpropagation. In
Proc. of NIPS, pages 2244–2252. 59, 67, 75
Y. Sun, N. Bui, T.-Y. Hsieh, and V. Honavar. 2018. Multi-view network embedding via graph
factorization clustering and co-regularized multi-view agreement. In IEEE ICDMW, pages 1006–
1013. DOI: 10.1109/icdmw.2018.00145 52
R. S. Sutton and A. G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press. DOI:
10.1109/tnn.1998.712192 85
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A.
Rabinovich. 2015. Going deeper with convolutions. In Proc. of CVPR, pages 1–9. DOI:
10.1109/cvpr.2015.7298594 17
K. S. Tai, R. Socher, and C. D. Manning. 2015. Improved semantic representations from tree-structured
long short-term memory networks. In Proc. of IJCNLP, pages 1556–1566. DOI: 10.3115/v1/p15-
1150 34, 78, 81
J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. 2008. Arnetminer: Extraction and mining of
academic social networks. In Proc. of SIGKDD, pages 990–998. DOI: 10.1145/1401890.1402008 87
J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. 2015. Line: Large-scale information network
embedding. In Proc. of WWW, pages 1067–1077. DOI: 10.1145/2736277.2741093 2
D. Teney, L. Liu, and A. V. Den Hengel. 2017. Graph-structured representations for visual question
answering. In Proc. of CVPR, pages 3233–3241. DOI: 10.1109/cvpr.2017.344 77
H. Toivonen, A. Srinivasan, R. D. King, S. Kramer, and C. Helma. 2003. Statistical evaluation of the
predictive toxicology challenge 2000–2001. Bioinformatics, 19(10):1183–1193. DOI:
10.1093/bioinformatics/btg130 87
C. Tomasi and R. Manduchi. 1998. Bilateral filtering for gray and color images. In Computer Vision,
pages 839–846. IEEE. DOI: 10.1109/iccv.1998.710815 61
K. Toutanova, D. Chen, P. Pantel, H. Poon, P. Choudhury, and M. Gamon. 2015. Representing text for
joint embedding of text and knowledge bases. In Proc. of EMNLP, pages 1499–1509. DOI:
10.18653/v1/d15-1174 87
K. Tu, P. Cui, X. Wang, P. S. Yu, and W. Zhu. 2018. Deep recursive network embedding with regular
equivalence. In Proc. of SIGKDD. DOI: 10.1145/3219819.3220068 56
R. van den Berg, T. N. Kipf, and M. Welling. 2017. Graph convolutional matrix completion. In Proc. of
SIGKDD. 56, 67, 74
A. Vaswani, N. Shazeer, N. Parmar, L. Jones, J. Uszkoreit, A. N. Gomez, and L. Kaiser. 2017. Attention
is all you need. In Proc. of NIPS, pages 5998–6008. 36, 39, 59, 60, 61, 79
P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. 2018. Graph attention
networks. In Proc. of ICLR. 39, 40, 59, 60, 78
P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm. 2019. Deep graph
infomax. In Proc. of ICLR. 56
O. Vinyals, M. Fortunato, and N. Jaitly. 2015. Pointer networks. In Proc. of NIPS, pages 2692–2700. 84
N. Wale, I. A. Watson, and G. Karypis. 2008. Comparison of descriptor spaces for chemical compound
retrieval and classification. Knowledge and Information Systems, 14(3):347–375. DOI:
10.1007/s10115-007-0103-5 87
D. Wang, P. Cui, and W. Zhu. 2016. Structural deep network embedding. In Proc. of SIGKDD. DOI:
10.1145/2939672.2939753 56
P. Wang, J. Han, C. Li, and R. Pan. 2019a. Logic attention based neighborhood aggregation for
inductive knowledge graph embedding. In Proc. of AAAI, 33:7152–7159. DOI:
10.1609/aaai.v33i01.33017152 72, 89
T. Wang, R. Liao, J. Ba, and S. Fidler. 2018a. NerveNet: Learning structured policy with graph neural
networks. In Proc. of ICLR. 63
X. Wang, R. Girshick, A. Gupta, and K. He. 2018b. Non-local neural networks. In Proc. of CVPR, pages
7794–7803. DOI: 10.1109/cvpr.2018.00813 3, 59, 60, 61, 62, 64
X. Wang, Y. Ye, and A. Gupta. 2018c. Zero-shot recognition via semantic embeddings and knowledge
graphs. In Proc. of CVPR, pages 6857–6866. DOI: 10.1109/cvpr.2018.00717 75, 76
X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, and P. S. Yu. 2019b. Heterogeneous graph attention
network. In Proc. of WWW. DOI: 10.1145/3308558.3313562 48
Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon. 2018d. Dynamic graph
CNN for learning on point clouds. ArXiv Preprint ArXiv:1801.07829. DOI: 10.1145/3326362 78
Z. Wang, T. Chen, J. S. J. Ren, W. Yu, H. Cheng, and L. Lin. 2018e. Deep reasoning with knowledge
graph for social relationship understanding. In Proc. of IJCAI, pages 1021–1028. DOI:
10.24963/ijcai.2018/142 77
Z. Wang, Q. Lv, X. Lan, and Y. Zhang. 2018f. Cross-lingual knowledge graph alignment via graph
convolutional networks. In Proc. of EMNLP, pages 349–357. DOI: 10.18653/v1/d18-1032 72
N. Watters, D. Zoran, T. Weber, P. Battaglia, R. Pascanu, and A. Tacchetti. 2017. Visual interaction
networks: Learning a physics simulator from video. In Proc. of NIPS, pages 4539–4547. 59, 67
L. Wu, P. Sun, Y. Fu, R. Hong, X. Wang, and M. Wang. 2019a. A neural influence diffusion model for
social recommendation. In Proc. of SIGIR. DOI: 10.1145/3331184.3331214 74
Q. Wu, H. Zhang, X. Gao, P. He, P. Weng, H. Gao, and G. Chen. 2019b. Dual graph attention networks
for deep latent representation of multifaceted social effects in recommender systems. In Proc. of
WWW, pages 2091–2102. ACM. DOI: 10.1145/3308558.3313442 74
Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. 2019c. A comprehensive survey on graph
neural networks. ArXiv Preprint ArXiv:1901.00596. 3
Z. Wu, S. Pan, G. Long, J. Jiang, and C. Zhang. 2019d. Graph waveNet for deep spatial-temporal graph
modeling. ArXiv Preprint ArXiv:1906.00121. DOI: 10.24963/ijcai.2019/264 51
K. Xu, C. Li, Y. Tian, T. Sonobe, K. Kawarabayashi, and S. Jegelka. 2018. Representation learning on
graphs with jumping knowledge networks. In Proc. of ICML, pages 5449–5458. 43, 44
K. Xu, L. Wang, M. Yu, Y. Feng, Y. Song, Z. Wang, and D. Yu. 2019a. Cross-lingual knowledge graph
alignment via graph matching neural network. In Proc. of ACL. DOI: 10.18653/v1/p19-1304 72
N. Xu, P. Wang, L. Chen, J. Tao, and J. Zhao. 2019b. Mr-GNN: Multi-resolution and dual graph neural
network for predicting structured entity interactions. In Proc. of IJCAI. DOI: 10.24963/ijcai.2019/551
70
S. Yan, Y. Xiong, and D. Lin. 2018. Spatial temporal graph convolutional networks for skeleton-based
action recognition. In Proc. of AAAI. DOI: 10.1186/s13640-019-0476-x 51
B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng. 2015a. Embedding entities and relations for learning
and inference in knowledge bases. In Proc. of ICLR. 71
C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Y. Chang. 2015b. Network representation learning with rich
text information. In Proc. of IJCAI, pages 2111–2117. 2
Z. Yang, W. W. Cohen, and R. Salakhutdinov. 2016. Revisiting semi-supervised learning with graph
embeddings. ArXiv Preprint ArXiv:1603.08861. 87
L. Yao, C. Mao, and Y. Luo. 2019. Graph convolutional networks for text classification. In Proc. of
AAAI, 33:7370–7377. DOI: 10.1609/aaai.v33i01.33017370 78
R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec. 2018a. Graph
convolutional neural networks for web-scale recommender systems. In Proc. of SIGKDD. DOI:
10.1145/3219819.3219890 53, 67, 74, 92
Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec. 2018b. Hierarchical graph
representation learning with differentiable pooling. In Proc. of NeurIPS, pages 4805–4815. 55, 67
J. You, B. Liu, Z. Ying, V. Pande, and J. Leskovec. 2018a. Graph convolutional policy network for goal-
directed molecular graph generation. In Proc. of NeurIPS, pages 6410–6421. 83
J. You, R. Ying, X. Ren, W. Hamilton, and J. Leskovec. 2018b. GraphRNN: Generating realistic graphs
with deep auto-regressive models. In Proc. of ICML, pages 5694–5703. 83
B. Yu, H. Yin, and Z. Zhu. 2018a. Spatio-temporal graph convolutional networks: A deep learning
framework for traffic forecasting. In Proc. of IJCAI. DOI: 10.24963/ijcai.2018/505 50
F. Yu and V. Koltun. 2015. Multi-scale context aggregation by dilated convolutions. ArXiv Preprint
ArXiv:1511.07122. 45
W. Yu, C. Zheng, W. Cheng, C. C. Aggarwal, D. Song, B. Zong, H. Chen, and W. Wang. 2018b.
Learning deep network representations with adversarially regularized autoencoders. In Proc. of
SIGKDD. DOI: 10.1145/3219819.3220000 56
R. Zafarani and H. Liu. 2009. Social computing data repository at ASU. http://socialcomputing.asu.edu 87
M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. 2017. Deep
sets. In Proc. of NIPS, pages 3391–3401. 59, 64
V. Zayats and M. Ostendorf. 2018. Conversation modeling on reddit using a graph-structured LSTM.
TACL, 6:121–132. DOI: 10.1162/tacl_a_00009 35
D. Zhang, J. Yin, X. Zhu, and C. Zhang. 2018a. Network representation learning: A survey. IEEE
Transactions on Big Data. DOI: 10.1109/tbdata.2018.2850013 2
F. Zhang, X. Liu, J. Tang, Y. Dong, P. Yao, J. Zhang, X. Gu, Y. Wang, B. Shao, R. Li, et al. 2019. OAG:
Toward linking large-scale heterogeneous entity graphs. In Proc. of SIGKDD. DOI:
10.1145/3292500.3330785 72
J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D.-Y. Yeung. 2018b. GaAN: Gated attention networks for
learning on large and spatiotemporal graphs. In Proc. of UAI. 40
Y. Zhang, Q. Liu, and L. Song. 2018c. Sentence-state LSTM for text representation. In Proc. of ACL,
1:317–327. DOI: 10.18653/v1/p18-1030 36, 75, 78, 79
Y. Zhang, P. Qi, and C. D. Manning. 2018d. Graph convolution over pruned dependency trees improves
relation extraction. In Proc. of EMNLP, pages 2205–2215. DOI: 10.18653/v1/d18-1244 79
Y. Zhang, Y. Xiong, X. Kong, S. Li, J. Mi, and Y. Zhu. 2018e. Deep collective classification in heterogeneous information networks. In Proc. of WWW, pages 399–408. DOI: 10.1145/3178876.3186106 48
Z. Zhang, P. Cui, and W. Zhu. 2018f. Deep learning on graphs: A survey. ArXiv Preprint
ArXiv:1812.04202. 3
J. Zhou, X. Han, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun. 2019. Gear: Graph-based evidence
aggregating and reasoning for fact verification. In Proc. of ACL. DOI: 10.18653/v1/p19-1085 81, 82
H. Zhu, Y. Lin, Z. Liu, J. Fu, T.-S. Chua, and M. Sun. 2019a. Graph neural networks with generated
parameters for relation extraction. In Proc. of ACL. DOI: 10.18653/v1/p19-1128 79
R. Zhu, K. Zhao, H. Yang, W. Lin, C. Zhou, B. Ai, Y. Li, and J. Zhou. 2019b. AliGraph: A comprehensive graph neural network platform. ArXiv Preprint ArXiv:1902.08730. 89
Z. Zhu, S. Xu, M. Qu, and J. Tang. 2019c. GraphVite: A high-performance CPU-GPU hybrid system for node embedding. In Proc. of WWW, pages 2494–2504. ACM. 89
C. Zhuang and Q. Ma. 2018. Dual graph convolutional networks for graph-based semi-supervised
classification. In Proc. of WWW. DOI: 10.1145/3178876.3186116 28
J. G. Zilly, R. K. Srivastava, J. Koutnik, and J. Schmidhuber. 2016. Recurrent highway networks. In
Proc. of ICML, pages 4189–4198. 43
M. Zitnik and J. Leskovec. 2017. Predicting multicellular function through multi-layer tissue networks.
Bioinformatics, 33(14):i190–i198. DOI: 10.1093/bioinformatics/btx252 87
M. Zitnik, M. Agrawal, and J. Leskovec. 2018. Modeling polypharmacy side effects with graph
convolutional networks. Bioinformatics, 34(13):i457–i466. DOI: 10.1093/bioinformatics/bty294 70
Authors’ Biographies
ZHIYUAN LIU
Zhiyuan Liu is an associate professor in the Department of Computer Science and Technology, Tsinghua University. He received his B.E. in 2006 and his Ph.D. in 2011 from the Department of Computer Science and Technology, Tsinghua University. His research interests are natural language processing and social computation. He has published over 60 papers in international journals and conferences, including IJCAI, AAAI, ACL, and EMNLP.
JIE ZHOU
Jie Zhou is a second-year Master's student in the Department of Computer Science and Technology, Tsinghua University. He received his B.E. from Tsinghua University in 2016. His research interests include graph neural networks and natural language processing.