1. Introduction
as a purely theoretical model [41] and later more practically through efficient
approximated distributions [6].
The recursive models share the idea of a (neural) state transition system
that traverses the structure to compute its embedding. The main issue in ex-
tending such approaches to general graphs (cyclic/acyclic, directed/undirected)
was the processing of cycles. Indeed, the mutual dependencies between state
variables cannot be easily modeled by the recursive neural units. The earliest
models to tackle this problem have been the Graph Neural Network [104] and
the Neural Network for Graphs [88]. The former is based on a state transition
system similar to the recursive neural networks, but it allows cycles in the state
computation within a contractive setting of the dynamical system. The Neural
Network for Graphs, instead, exploits the idea that mutual dependencies can be
managed by leveraging the representations from previous layers in the architec-
ture. This way, the model breaks the recursive dependencies in the graph cycles
with a multi-layered architecture. Both models have pioneered the field by lay-
ing down the foundations of two of the main approaches for graph processing,
namely the recurrent [104] and the feedforward [88] ones. In particular, the
latter has now become the predominant approach, under the umbrella term of
graph convolutional (neural) networks (named after approaches [72, 54] which
reintroduced the above concepts around 2015).
This paper builds on this historical perspective to provide a gentle
introduction to the field of neural networks for graphs, also referred to as deep
learning for graphs in modern terminology. It is intended as a tutorial,
favoring a well-founded, consistent, and progressive introduction
to the main concepts and building blocks needed to assemble deep architectures for
graphs. Therefore, it does not aim at being an exposition of the most recently
published works on the topic. The motivations for such a tutorial approach are
multifaceted. On the one hand, the surge of recent works on deep learning for
graphs has come at the price of a certain forgetfulness, if not lack of appropriate
referencing, of pioneering and consolidated works. As a consequence, there is
the risk of running through a wave of rediscovery of known results and models.
On the other hand, the community is starting to notice troubling trends in the
assessment of deep learning models for graphs [108, 34], which calls for a more
principled approach to the topic. Lastly, a certain number of survey papers have
started to appear in the literature [7, 19, 48, 55, 144, 130, 143], while a more
slowly-paced introduction to the methodology seems lacking.
This tutorial takes a top-down approach to the field while maintaining a clear
historical perspective on the introduction of the main concepts and ideas. To
this end, in Section 2, we first provide a generalized formulation of the problem
of representation learning in graphs, introducing and motivating the architec-
ture roadmap that we will be following throughout the rest of the paper. We will
focus, in particular, on methods that deal with local and iterative processing of
information as these are more consistent with the operational regime of neural
networks. In this respect, we will pay less attention to global approaches (i.e.,
assuming a single fixed adjacency matrix) based on spectral graph theory. We
will then proceed, in Section 3, to introduce the basic building blocks that can
be assembled and combined to create modern deep learning architectures for
graphs. In this context, we will introduce the concepts of graph convolutions as
local neighborhood aggregation functions, the use of attention, sampling, and
pooling operators defined over graphs, and we will conclude with a discussion
on aggregation functions that compute whole-structure embeddings. Our char-
acterization of the methodology continues, in Section 4, with a discussion of
the main learning tasks undertaken in graph representation learning, together
with the associated cost functions and a characterization of the related induc-
tive biases. The final part of the paper surveys other related approaches and
tasks (Section 5), and it discusses interesting research challenges (Section 6) and
applications (Section 7). We conclude the paper with some final considerations
and hints for future research directions.
2. High-level Overview
We begin with a high-level overview of deep learning for graphs. To this aim,
we first summarize the necessary mathematical notation. Secondly, we present
the main ideas the vast majority of works in the literature borrow from.

Figure 1: (a) A directed graph with oriented arcs is shown. (b) If the graph is undirected,
we can transform it into a directed one to obtain a viable input for graph learning methods.
In particular, each edge is replaced by two oriented and opposite arcs with identical edge
features. (c) We visually represent the (open) neighborhood of node v1.
this is referred to as nodes (respectively edges) being "uniformly labelled". In
the general case one can consider Xg ⊆ R^d, d ∈ N, and Ag ⊆ R^{d′}, d′ ∈ N. Here
the terms d and d′ denote the number of features associated with each node and
edge, respectively. Note that, despite having defined node and edge features on
the real set for the sake of generality, in many applications these take discrete
values. Moreover, from a practical perspective, we can think of a graph with
no node (respectively edge) features as an equivalent graph in which all node
(edge) features are identical.
As far as undirected graphs are concerned, these are straightforwardly trans-
formed to their directed version. In particular, every edge {u, v} is replaced by
two distinct and oppositely oriented arcs (u, v) and (v, u), with identical edge
features as shown in Figure 1b .
A path is a sequence of edges that joins a sequence of nodes. Whenever there
exists a non-empty path from a node to itself with no other repeated nodes, we
say the graph has a cycle; when there are no cycles in the graph, the graph is
called acyclic.
A topological ordering of a directed graph g is a total sorting of its nodes
such that for every directed edge (u, v) from node u to node v, u comes before
v in the ordering. A topological ordering exists if and only if the directed graph
has no cycles, i.e., if it is a directed acyclic graph (DAG).
A graph is ordered if, for each node v, a total order on the edges incident
on v is defined and unordered otherwise. Moreover, a graph is positional if,
besides being ordered, a distinctive positive integer is associated with each edge
incident on a node v (allowing some positions to be absent) and non-positional
otherwise. To summarize, in the rest of the paper we will assume a general class
of directed/undirected, acyclic/cyclic and positional/non-positional graphs.
The neighborhood of a node v is defined as the set of nodes which are
connected to v with an oriented arc, i.e., Nv = {u ∈ Vg | (u, v) ∈ Eg}. Nv is
said to be closed if it also includes v itself, and open otherwise. If the domain of arc labels A
is discrete and finite, i.e., A = {c_1, . . . , c_m}, we define the subset of neighbors
of v with arc label c_k as N_v^{c_k} = {u ∈ N_v | a_uv = c_k}. Figure 1c provides a visual example of the (open) neighborhood of node v1.
Figure 2: The bigger picture that all graph learning methods share. A “Deep Graph Network”
takes an input graph and produces node representations hv ∀v ∈ Vg . Such representations
can be aggregated to form a single graph representation hg .
Regardless of the training objective one cares about, almost all deep learning
models working on graphs ultimately produce node representations, also called
states. The overall mechanism is sketched in Figure 2, where the input graph on
the left is mapped by a model into a graph of node states with the same topology.
In [41], this process is referred to as performing an isomorphic transduction of
the graph. This is extremely useful as it allows tackling nodes, edges, and graph-
related tasks. For instance, a graph representation can be easily computed by
aggregating together its node representations, as shown in the right-hand side
of Figure 2.
To be more precise, each node in the graph will be associated with a state
vector hv ∀v ∈ Vg . The models discussed in this work visit/traverse the input
graph to compute node states. Importantly, in our context of general graphs,
the result of this traversal does not depend on the visiting order and, in partic-
ular, no topological ordering among nodes is assumed. Being independent of a
topological ordering has repercussions on how deep learning models for graphs
deal with cycles (Section 2.3). Equivalently, we can say that the state vectors
can be computed by the model in parallel for each node of the input graph.
The work of researchers and practitioners therefore revolves around the def-
inition of deep learning models that automatically extract the relevant features
from a graph. In this tutorial, we refer to such models with the unifying name
of “Deep Graph Networks” (DGNs). On the one hand, this general terminology
serves the purpose of disambiguating the terms “Graph Neural Network”, which
we use to refer to [104], and “Graph Convolutional Network”, which refers to,
e.g., [72]. These two terms have been often used across the literature to rep-
resent the whole class of neural networks operating on graph data, generating
ambiguities and confusion among practitioners. On the other hand, we also use
it as the base of an illustrative taxonomy (shown in Figure 3), which will serve
as a road-map of the discussion in this and the following sections.
Note that with the term “DGN” (and its taxonomy) we would like to focus solely
on the part of the deep learning model that learns to produce node representa-
tions. Therefore, the term does not encompass those parts of the architecture
that compute a prediction, e.g., the output layer. In doing so, we keep a mod-
ular view on the architecture, and we can combine a deep graph network with
any predictor that solves a specific task.
We divide deep graph networks into three broad categories. The first is called
Deep Neural Graph Networks (DNGNs), which includes models inspired by neu-
ral architectures. The second category is that of Deep Bayesian Graph Networks
(DBGNs), whose representatives are probabilistic models of graphs. Lastly, the
family of Deep Generative Graph Networks (DGGNs) leverages both neural and
probabilistic models to generate graphs. This taxonomy is by no means a strict
compartmentalization of methodologies; in fact, all the approaches we will focus
on in this tutorial are based on local relations and iterative processing to diffuse
information across the graph, regardless of their neural or probabilistic nature.
Figure 3: Taxonomy of Deep Graph Networks (DGNs).
Graphs with variable topology. First of all, we need a way to seamlessly process
information of graphs that vary both in size and shape. In the literature, this
has been solved by building models that work locally at node level rather than
at graph level. In other words, the models process each node using information
coming from the neighborhood. This recalls the localized processing of images
in convolutional models [75], where the focus is on a single pixel and its set of
finite neighbors (however defined). Such a stationarity assumption significantly
reduces the number of parameters needed by the model, as they are
re-used across all nodes (similarly to how convolutional filters are shared across
pixels). Moreover, it effectively and efficiently combines the “experience” of all
nodes and graphs in the dataset to learn a single function. At the same time,
the stationarity assumption calls for the introduction of mechanisms that can
learn from the global structure of the graph as well, which we discuss in the
following section.
Notwithstanding these advantages, local processing alone does not solve the
problem of graphs of variable neighborhood shape. This issue arises in the case
of non-positional graphs, where there is no consistent way to order the nodes
of a neighborhood. In this case, one common solution is to use permutation
invariant functions acting on the neighborhood of each node. A permutation
invariant function is a function whose output does not change upon reordering
of the input elements. Thanks to this property, these functions are well suited
to handle an arbitrary number of input elements, which comes in handy when
working on unordered and non-positional graphs of variable topology. Common
examples of such functions are the sum, mean, and product of the input ele-
ments. Under some conditions, it is possible to approximate all permutation
invariant continuous functions by means of suitable transformations [140, 124].
More concretely, if the input elements belong to an uncountable space X, e.g.,
R^d, and they are in finite and fixed number M, then any permutation invariant
continuous function Ψ : X^M → Y can be expressed as (Theorem 4.1 of [124])

Ψ(Z) = φ( Σ_{z∈Z} ψ(z) ),    (1)
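As a concrete illustration of Eq. 1, the following is a minimal PyTorch sketch of a permutation-invariant aggregator, assuming that both ψ and φ are small feedforward networks (all names are illustrative):

```python
import torch
import torch.nn as nn

class PermutationInvariantAggregator(nn.Module):
    """Minimal sketch of Eq. 1: Psi(Z) = phi(sum_{z in Z} psi(z))."""
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.psi = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.phi = nn.Sequential(nn.Linear(hidden_dim, out_dim), nn.ReLU())

    def forward(self, z):
        # z: (M, in_dim) set of input elements; summing over the first
        # dimension makes the output independent of their ordering.
        return self.phi(self.psi(z).sum(dim=0))

# Permuting the input elements leaves the output unchanged.
aggregator = PermutationInvariantAggregator(in_dim=8, hidden_dim=16, out_dim=4)
z = torch.randn(5, 8)
assert torch.allclose(aggregator(z), aggregator(z[torch.randperm(5)]), atol=1e-5)
```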
Graphs contain cycles. A graph cycle models the presence of mutual dependen-
cies/influences between the nodes. In addition, the local processing of graphs
implies that any intermediate node state is a function of the state of its neigh-
bors. Under the local processing assumption, a cyclic structural dependency
translates into mutual (causal) dependencies, i.e., a potentially infinite loop,
when computing the node states in parallel. The way to solve this is to assume
an iterative scheme, i.e., the state h^{ℓ+1}_v of node v at iteration ℓ + 1 is defined
using the neighbor states computed at the previous iteration ℓ. The iterative
scheme can be interpreted as a process that incrementally refines the node
representations as ℓ increases. While this might seem reasonable, one may question
whether such an iterative process can converge, given the mutual dependencies
among node states. In practice, some approaches introduce constraints on the
nature of the iterative process that force it to be convergent. Instead, others
map each step of the iterative process to independent layers of a deep archi-
tecture. In other words, in the latter approach, the state h^{ℓ+1}_v is computed by
layer ℓ + 1 of the model based on the output of the previous layer ℓ.
For the above reasons, in the following sections, we will use the symbol ℓ
to refer, interchangeably, to an iteration step or layer by which nodes propa-
gate information across the graph. Furthermore, we will denote with h^ℓ_g the
representation of the entire graph g at layer ℓ.
Another aspect of the process we have just discussed is the spreading of local
information across the graph under the form of node states. This is arguably
the most important concept of local and iterative graph learning methods. At
a particular iteration ℓ, we (informally) define the context of a node state h^ℓ_v as
the set of node states that directly or indirectly contribute to determining h^ℓ_v;
a formal characterization of context is given in [88] for the interested reader.
An often employed formalism to explain how information is actually diffused
across the graph is message passing [48]. Focusing on a single node, message
passing consists of two distinct operations:

• message dispatching. Each node computes a message from its current state
(and, possibly, the features of the corresponding arcs) and sends it to its neighbors.

• state update. The incoming node messages, and possibly the node's own state,
are collected and used to update the node state, as sketched in the code below.
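A minimal sketch of these two operations applied to a whole graph at once, assuming a dense adjacency matrix and illustrative PyTorch modules for the message and update functions:

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One round of message passing: dispatch messages, then update states."""
    def __init__(self, state_dim):
        super().__init__()
        self.message_fn = nn.Linear(state_dim, state_dim)  # computes outgoing messages
        self.update_fn = nn.GRUCell(state_dim, state_dim)  # updates node states

    def forward(self, h, adj):
        # h: (num_nodes, state_dim) node states at iteration l
        # adj: (num_nodes, num_nodes), adj[v, u] = 1 if there is an arc u -> v
        messages = self.message_fn(h)       # message dispatching
        incoming = adj @ messages           # sum of the messages reaching each node
        return self.update_fn(incoming, h)  # state update -> states at iteration l + 1
```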
In light of the different information diffusion mechanisms they employ, we can
partition most deep graph learning models into recurrent, feedforward, and
constructive approaches. We now discuss how they work and what their differences are.
the first proposal of a feedforward architecture for graphs.
Not surprisingly, there is a close similarity between this kind of context diffusion
and the local receptive field of convolutional networks, which increases as more
layers are added to the architecture. Despite that, the main difference is that
graphs have no fixed structure as neighborhoods can vary in size, and a node
ordering is rarely given. In particular, the local receptive field of convolutional
networks can be seen as the context of a node in graph data, whereas the con-
volution operator processing corresponds to the visit of the nodes in a graph
(even though the parametrization technique is different). These are the reasons
why the term graph convolutional layer is often used in literature.
The family of feedforward models is the most popular for its simplicity, effi-
ciency, and performance on many different tasks. However, deep networks for
graphs suffer from the same gradient-related problems as other deep neural net-
works, especially when associated with an “end-to-end” learning process running
through the whole architecture [61, 9, 77].
result to solve the global task progressively.
Among the constructive approaches, we mention the Neural Network for Graphs
[88] (which is also the very first proposed feedforward architecture for graphs)
and the Contextual Graph Markov Model [3], a more recent and probabilistic
variant.
3. Building Blocks
We now turn our attention to the main constituents of local graph learning
models. The architectural bias imposed by these building blocks determines the
kind of representations that a model can compute. We remark that the aim
of this Section is not to give the most comprehensive and general formulation
under which all models can be formalized. Rather, it is intended to show the
main “ingredients” that are common to many architectures and how these can
be combined to compose an effective learning model for graphs.
structural information.
It is important to realize that the above formulation includes both Neural and
Bayesian DGNs. As an example, a popular concrete instance of the neighbor-
hood aggregation scheme presented above is the Graph Convolutional Network
[72], a DNGN which performs aggregation as follows:
h^{ℓ+1}_v = σ( W^{ℓ+1} Σ_{u∈N(v)} L_{uv} h^ℓ_u ),    (3)
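A minimal dense sketch of the aggregation in Eq. 3, assuming L is a pre-computed normalized adjacency matrix and σ is a ReLU (names are illustrative):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Dense sketch of the aggregation in Eq. 3."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)  # W^{l+1}

    def forward(self, h, L):
        # h: (num_nodes, in_dim) node states, L: (num_nodes, num_nodes)
        # normalized adjacency; row v of L @ h is the weighted sum of the
        # states of the neighbors of node v.
        return torch.relu(self.weight(L @ h))
```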
where wck is a learnable scalar parameter that weighs the contribution of arcs
with label auv = ck , and ∗ multiplies every component of its first argument by
wck . This formulation presents an inner aggregation among neighbors sharing
the same arc label, plus an outer weighted sum over each possible arc label.
This way, the contribution of each arc label is learned separately. The Neural
Network for Graphs [88] and the Relational Graph Convolutional Network [105]
implement Eq. 7 explicitly, whereas the Contextual Graph Markov Model [3]
uses the switching-parent approximation [103] to achieve the same goal. A more
general solution, which works with continuous arc labels, is to reformulate Eq.
2 as
h^{ℓ+1}_v = φ^{ℓ+1}( h^ℓ_v , Ψ({ e^{ℓ+1}(a_uv)^T ψ^{ℓ+1}(h^ℓ_u) | u ∈ N_v }) ),    (8)
where α^{ℓ+1}_{uv} ∈ R is the attention score associated with u ∈ N_v. In general,
this score is unrelated to the edge information, and as such edge processing
and attention are two quite distinct techniques. As a matter of fact, the Graph
Attention Network [120] applies attention to its neighbors but it does not take
into account edge information. To calculate the attention scores, the model
computes attention coefficients w^ℓ_uv as follows:

w^ℓ_uv = a( W^ℓ h^ℓ_u , W^ℓ h^ℓ_v ),    (10)
where a is a shared attention function and W are the layer weights. The at-
tention coefficients measure some form of similarity between the current node v
and each of its neighbors u. Moreover, in [120] the attention function a is implemented
as a single-layer feedforward network applied to the concatenation of its two arguments.
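A dense sketch of this attention-based aggregation, where a is assumed to be a small feedforward network on the concatenated projections and the coefficients are normalized with a softmax over each neighborhood (adj is assumed to contain self-loops):

```python
import torch
import torch.nn as nn

class GraphAttentionLayer(nn.Module):
    """Sketch of attention-weighted neighborhood aggregation (Eq. 10)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1)  # shared attention function

    def forward(self, h, adj):
        z = self.W(h)                         # (N, out_dim) projected states
        n = z.size(0)
        # Build all pairs (W h_v, W h_u) and score them with a.
        pairs = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                           z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = self.a(pairs).squeeze(-1)    # (N, N) attention coefficients w_uv
        scores = scores.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(scores, dim=1)  # attention scores over each neighborhood
        return torch.relu(alpha @ z)          # weighted sum of neighbor states
```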
Sampling. When graphs are large and dense, it can be unfeasible to perform
aggregations over all neighbors for each node, as the number of edges becomes
quadratic in |Vg |. Therefore, alternative strategies are needed to reduce the
computational burden, and neighborhood sampling is one of them. In this
scenario, only a random subset of neighbors is used to compute h^{ℓ+1}_v. When
the subset size is fixed, we also get an upper bound on the aggregation cost
per graph. Figure 5 depicts how a generic sampling strategy acts at node level.
Among the models that sample neighbors we mention Fast Graph Convolutional
Network (FastGCN) [23] and Graph SAmple and aggreGatE (GraphSAGE) [54].
Specifically, FastGCN samples t nodes at each layer ℓ via importance sampling
so that the variance of the gradient estimator is reduced. Differently from
FastGCN, GraphSAGE considers a neighborhood function N : Vg → 2^{Vg} that
Figure 5: The sampling technique affects the neighborhood aggregation procedure by selecting
either a subset of the neighbors [23] or a subset of the nodes in the graph [54] to compute
h^{ℓ+1}_v. Here, nodes in red have been randomly excluded from the neighborhood aggregation of
node v, and the context flows only through the wavy arrows.
associates each node with any (fixed) subset of the nodes in the given graph. In
practice, GraphSAGE can sample nodes at multiple distances and treat them
as direct neighbors of node v. Therefore, rather than learning locally, this
technique exploits a wider and heterogeneous neighborhood, trading a potential
improvement in performance for additional (but bounded) computational costs.
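A minimal sketch of fixed-size neighborhood sampling, which caps the per-node aggregation cost discussed above (the adjacency dictionary and node names are illustrative):

```python
import random

def sample_neighborhood(neighbors, k, rng=random):
    """Return at most k randomly chosen neighbors of a node."""
    neighbors = list(neighbors)
    if len(neighbors) <= k:
        return neighbors
    return rng.sample(neighbors, k)

# Aggregation is then performed over the sampled subset only.
adjacency = {"v": ["u1", "u2", "u3", "u4", "u5"]}
sampled = sample_neighborhood(adjacency["v"], k=2)
```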
3.2. Pooling
Figure 6: We show an example of the pooling technique. Each pooling layer coarsens the
graph by identifying and clustering nodes of the same community together, so that each
group becomes a node of the coarsened graph.
where A^ℓ and H^ℓ are the adjacency and encoding matrices of layer ℓ. The S^{ℓ+1}
matrix is then used to recombine the current graph into (ideally) one of reduced
size:

H^{ℓ+1} = (S^{ℓ+1})^T H^ℓ    and    A^{ℓ+1} = (S^{ℓ+1})^T A^ℓ S^{ℓ+1}.    (14)
Top-k Pooling [46], instead, learns a projection vector p^{ℓ+1} that assigns a score to each node of the graph:

s^{ℓ+1} = H^ℓ p^{ℓ+1} / ‖p^{ℓ+1}‖.    (15)
Such scores are then used to select the indices of the top ranking nodes and to
slice the matrix of the original graph to retain only the entries corresponding to
top nodes. Node selection is made differentiable by means of a gating mechanism
built on the projection scores. Self-attention Graph Pooling [76] extends Top-k
Pooling by computing the score vector as an attention score produced by a Graph
Convolutional Network [72].
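A dense sketch of the coarsening step of Eq. 14, assuming the soft cluster-assignment matrix S has already been produced by some adaptive pooling mechanism (e.g., a DGN layer followed by a row-wise softmax):

```python
import torch

def coarsen(H, A, S):
    """Eq. 14: pool node encodings H and adjacency A with soft assignments S."""
    # H: (N, d) node encodings, A: (N, N) adjacency,
    # S: (N, K) soft assignment of the N nodes to K clusters (rows sum to 1).
    H_next = S.T @ H       # (K, d) encodings of the coarsened graph
    A_next = S.T @ A @ S   # (K, K) adjacency of the coarsened graph
    return H_next, A_next

# Example: pool a 5-node graph into 2 clusters.
H, A = torch.randn(5, 8), (torch.rand(5, 5) > 0.5).float()
S = torch.softmax(torch.randn(5, 2), dim=1)
H2, A2 = coarsen(H, A, S)
```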
Edge Pooling [42] operates from a different perspective, by targeting edges in
place of nodes. Edges are ranked based on a parametric scoring function which
takes as input the concatenated embeddings of the incident nodes.
The highest ranking edge and its incident nodes are then contracted into a single
new node with appropriate connectivity, and the process is iterated.
Topological pooling, on the other hand, is non-adaptive and typically lever-
ages the structure of the graph itself as well as its communities. Note that,
being non-adaptive, such mechanisms are not required to be differentiable,
and their results are not task-dependent. Hence, these methods are potentially
reusable in multi-task scenarios. The graph clustering software (GRACLUS)
[30] is a widely used graph partitioning algorithm that leverages an efficient
approach to spectral clustering. Interestingly, GRACLUS does not require an
eigendecomposition of the adjacency matrix. From a similar perspective, Non-
negative Matrix Factorization Pooling [2] provides a soft node clustering using
a non-negative factorization of the adjacency matrix.
Pooling methods can also be used to perform graph classification by itera-
tively shrinking the graph up to the point in which the graph contains a single
node. Generally speaking, however, pooling is interleaved with DGNs layers so
that context can be diffused before the graph is shrunk.
where a common setup is to take f as the identity function and choose Ψ among
element-wise mean, sum, or max. Another, more sophisticated, aggregation
scheme draws from the work of [140], where a family of adaptive permutation-invariant
functions is defined. Specifically, it implements f as a neural network
applied to all the node representations in the graph, and Ψ is an element-wise
summation followed by a final non-linear transformation.
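A minimal sketch of such a graph-level readout, where f is a small network applied node-wise and Ψ is an element-wise sum followed by a final non-linear transformation (names are illustrative):

```python
import torch
import torch.nn as nn

class GraphReadout(nn.Module):
    """Aggregate node representations into a single graph representation h_g."""
    def __init__(self, node_dim, graph_dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(node_dim, graph_dim), nn.ReLU())
        self.out = nn.Linear(graph_dim, graph_dim)

    def forward(self, h_nodes):
        # h_nodes: (num_nodes, node_dim) -> (graph_dim,) graph embedding
        return torch.relu(self.out(self.f(h_nodes).sum(dim=0)))
```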
There are multiple ways to exploit graph embeddings at different layers for
the downstream tasks. A straightforward way is to use the graph embedding
of the last layer as a representative for the whole graph. More often, all the
intermediate embeddings are concatenated or given as input to permutation-
invariant aggregators. The work of [78] proposes a different strategy where
all the intermediate representations are viewed as a sequence, and the model
learns a final graph embedding as the output of a Long Short-Term Memory
[62] network on the sequence. Sort Pooling [142], on the other hand, uses the
concatenation of the node embeddings of all layers as the continuous equiva-
lent of node coloring algorithms. Then, such “colors” define a lexicographic
ordering of nodes across graphs. The top ordered nodes are then selected and
fed (as a sequence) to a one-dimensional convolutional layer that computes the
aggregated graph encoding.
To conclude, Table 1 provides a summary of neighborhood aggregation meth-
ods for some representative models. Figure 7 visually exemplifies how the dif-
ferent building blocks can be arranged and combined to construct a feedforward
or recurrent model that is end-to-end trainable.
4. Learning Criteria
After having introduced the main building blocks and most common tech-
niques to produce node and graph representations, we now discuss the different
learning criteria that can be used and combined to tackle different tasks. We will
focus on unsupervised, supervised, generative, and adversarial learning criteria
to give a comprehensive overview of the research in this field.
Figure 7: Two possible architectures (feedforward and recurrent) for node and graph classifi-
cation. Inside each layer, one can apply the attention and sampling techniques described in
this Section. After pooling is applied, it is not possible to perform node classification anymore,
which is why a potential model for node classification can combine graph convolutional lay-
ers. A recurrent architecture (bottom) iteratively applies the same neighborhood aggregation,
possibly until a convergence criterion is met.
Link Prediction. The most common unsupervised criterion used by graph neu-
ral networks is the so-called link prediction or reconstruction loss. This learning
objective aims at building node representations that are similar if an arc con-
nects the associated nodes, and it is suitable for link prediction tasks. Formally,
the reconstruction loss can be defined [72] as
L_rec(g) = Σ_{(u,v)∈Eg} ‖h_v − h_u‖^2.    (19)
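A sketch of this reconstruction objective, given the matrix of node embeddings and the list of arcs of a graph (tensor shapes are illustrative):

```python
import torch

def reconstruction_loss(h, edge_index):
    """Eq. 19: squared distances between the embeddings of connected nodes."""
    # h: (num_nodes, dim) node embeddings
    # edge_index: (2, num_edges) tensor listing the arcs (u, v)
    u, v = edge_index
    return ((h[v] - h[u]) ** 2).sum()
```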
Model | Neighborhood Aggregation h^{ℓ+1}_v
NN4G [88] | σ( (w^{ℓ+1})^T x_v + Σ_{i=0}^{ℓ} Σ_{c_k∈C} Σ_{u∈N_v^{c_k}} w^i_{c_k} ∗ h^i_u )
GNN [104] | Σ_{u∈N_v} MLP^{ℓ+1}( x_u, x_v, a_uv, h^ℓ_u )
GraphESN [44] | σ( W^{ℓ+1} x_v + Ŵ^{ℓ+1} [h^ℓ_{u_1}, . . . , h^ℓ_{u_{|N_v|}}] )
GCN [72] | σ( W^{ℓ+1} Σ_{u∈N(v)} L_{vu} h^ℓ_u )
GAT [120] | σ( Σ_{u∈N_v} α^{ℓ+1}_{uv} ∗ W^{ℓ+1} h^ℓ_u )
ECC [111] | σ( (1/|N_v|) Σ_{u∈N_v} MLP^{ℓ+1}(a_uv)^T h^ℓ_u )
R-GCN [105] | σ( Σ_{c_k∈C} Σ_{u∈N_v^{c_k}} (1/|N_v^{c_k}|) W^{ℓ+1}_{c_k} h^ℓ_u + W^{ℓ+1} h^ℓ_v )
GraphSAGE [54] | σ( W^{ℓ+1} ( (1/|N_v|) [h^ℓ_v , Σ_{u∈N_v} h^ℓ_u] ) )
CGMM [3] | Σ_{i=0}^{ℓ} w^i ∗ Σ_{c_k∈C} w^i_{c_k} ∗ (1/|N_v^{c_k}|) Σ_{u∈N_v^{c_k}} h^i_u
GIN [131] | MLP^{ℓ+1}( (1 + ǫ^{ℓ+1}) h^ℓ_v + Σ_{u∈N_v} h^ℓ_u )
Table 1: We report some of the neighborhood aggregations present in the literature, and we
provide a table in Appendix A to ease referencing and understanding of acronyms. Here,
square brackets denote concatenation, and W, w and ǫ are learnable parameters. Note that
GraphESN assumes a maximum size of the neighborhood. The attention mechanism of GAT
is implemented by a weight αuv that depends on the associated nodes. As for GraphSAGE,
we describe its “mean” variant, though others have been proposed by the authors. Finally,
recall that ℓ represents an iteration step in GNN rather than a layer.
There also exists a probabilistic formulation of this loss, which is used in varia-
tional auto-encoders for graphs [71] where the decoder only focuses on structural
reconstruction:
Maximum Likelihood. When the goal is to build unsupervised representations
that reflect the distribution of neighboring states, a different approach is needed.
In this scenario, probabilistic models can be of help. Indeed, one can compute
the likelihood that node u has a certain label xu conditioned on neighboring
information. Known unsupervised probabilistic learning approaches can then
maximize this likelihood. An example is the Contextual Graph Markov Model
[3], which constructs a deep network as a stack of simple Bayesian networks.
Each layer maximizes the following likelihood:
L(θ|g) = Π_{u∈Vg} Σ_{i=1}^{C} P(y_u | Q_u = i) P(Q_u = i | q_{Nu}),    (21)
Mutual Information. An alternative approach to produce node representations
focuses on local mutual information maximization between pairs of graphs. In
particular, Deep Graph Infomax [121] uses a corruption function that generates
a distorted version of a graph g, called g̃. Then, a discriminator is trained to
distinguish the two graphs, using a bilinear score on node and graph representa-
tions. This unsupervised method requires a corruption function to be manually
defined each time, e.g., injecting random structural noise in the graph, and as
such it imposes a bias on the learning process.
where H is the entropy and S_u is the row of S associated with the cluster assignment
of node u. Notice that, from a practical point of view, it is still challenging to devise
a differentiable pooling method that does not generate dense representations.
However, encouraging a one-hot community assignment of nodes can enhance
visual interpretation of the learned clusters, and it acts as a regularizer that
enforces well-separated communities.
Node Classification. As the term indicates, the goal of node classification is
to assign the correct target label to each node in the graph. There can be
two distinct settings: inductive node classification, which consists of classifying
nodes that belong to unseen graphs, and transductive node classification, in
which there is only one graph to learn from and only a fraction of the nodes
needs to be classified. It is important to remark that benchmark results for node
classification have been severely affected by delicate experimental settings; this
issue was later addressed [108] by re-evaluating state-of-the-art architectures
under a rigorous setting. Assuming a multi-class node classification task with
C classes, the most common learning criterion is the cross-entropy:

L_CE(y, t) = − log( e^{y_t} / Σ_{j=1}^{C} e^{y_j} ),    (23)
where y ∈ RC and t ∈ {1, . . . , C} are the output vector and target class, respec-
tively. The loss is then summed or averaged over all nodes in the dataset.
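A sketch of Eq. 23 applied to all nodes of a graph, assuming the model outputs one score vector per node; PyTorch's cross_entropy computes exactly this criterion and averages it over the nodes:

```python
import torch
import torch.nn.functional as F

# y: (num_nodes, C) output vectors, t: (num_nodes,) target classes
y = torch.randn(4, 3)
t = torch.tensor([0, 2, 1, 2])
loss = F.cross_entropy(y, t)  # -log softmax(y)[t], averaged over all nodes
```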
4.3. Generative learning
Figure 8: A simplified schema of graph-level (top row) and node-level (bottom row) generative
decoders is shown. Tilde symbols on top of arrows indicate sampling. Dashed arrows indicate
that the corresponding sampling procedure is not differentiable in general. Darker shades of
blue indicate higher probabilities.
of observing an arc between node i and j. This corresponds to minimizing the
following log-likelihood:
Notice that the first two alternatives are not differentiable; in those cases, the
actual reconstruction loss cannot be back-propagated during training. Thus,
the reconstruction loss is computed on the probabilistic matrix instead of the
actual matrix [112]. Graph-level decoders are not permutation invariant (unless
approximate graph matching is used) because the ordering of the output matrix
is assumed fixed.
L_decoder(g) = − (1/|Vg|) Σ_{v∈Vg} Σ_{u∈Vg} log P(ã_uv | h̃_v, h̃_u),    (26)
where P(ã_uv | h̃_v, h̃_u) = σ(h̃_v^T h̃_u) as in Eq. 20 and similarly to [71, 51], and
h̃ are sampled node representations. As opposed to graph-level decoding, this
method is permutation invariant, even though it is generally more expensive to
calculate than one-shot adjacency matrix generation.
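A dense sketch of the node-level decoder objective of Eq. 26, scoring every node pair with a sigmoid of the dot product of the sampled representations (a small constant is added for numerical stability):

```python
import torch

def decoder_loss(h_tilde, A, eps=1e-9):
    """Eq. 26: negative log-likelihood of the adjacency under dot-product scores."""
    # h_tilde: (N, d) sampled node representations, A: (N, N) binary adjacency
    probs = torch.sigmoid(h_tilde @ h_tilde.T)      # P(a_uv = 1 | h_v, h_u)
    log_lik = A * torch.log(probs + eps) + (1 - A) * torch.log(1 - probs + eps)
    return -log_lik.sum() / A.size(0)
```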
Generative Auto-Encoder for graphs. This method works by learning the prob-
ability distribution of node (or graph) representations in latent space. Samples
of this distribution are then given to the decoder to generate novel graphs. A
general formulation of the loss function for graphs is the following:

L(g) = L_decoder(g) + L_encoder(g),    (27)
where Ldecoder is the reconstruction error of the decoder as mentioned above, and
Lencoder is a divergence measure that forces the distribution of points in latent
space to resemble a “tractable” prior (usually an isotropic Gaussian N (0, I)).
For example, models based on Variational AEs [70] use the following encoder
loss:
where DKL is the Kullback-Leibler divergence, and the two parameters of the
encoding distribution are computed as µ = DGNµ (A, X) and σ = DGNσ (A, X)
[112, 80, 101]. More recent approaches such as [18] propose to replace the
encoder error term in Equation 27 with a Wasserstein distance term [116].
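A sketch of this criterion for the variational case, combining the decoder reconstruction term with the closed-form KL divergence between a diagonal Gaussian encoding distribution and the isotropic Gaussian prior (an assumption on the encoder parametrization):

```python
import torch

def vae_graph_loss(reconstruction, mu, log_var):
    """Reconstruction term plus KL( N(mu, sigma^2) || N(0, I) ) regularizer."""
    # mu, log_var: (N, d) parameters of the encoding distribution
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return reconstruction + kl
```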
and a discriminator D that is trained to recognize whether its input comes from
the generator or from the dataset. When dealing with graph-structured data,
both the generator and the discriminator are trained jointly to minimize the
following objective:
L_GAN(g) = min_G max_D E_{g∼Pdata(g)}[ log D(g) ] + E_{z∼P(z)}[ log(1 − D(G(z))) ],    (29)
where Pdata is the true unknown probability distribution of the data, and P (z)
is the prior on the latent space (usually isotropic Gaussian or uniform). Note
that this procedure provides an implicit way to sample from the probability
distribution of interest without manipulating it directly. In the case of graph
generation, G can be a graph or node-level decoder that takes a random point
in latent space as input and generates a graph, while D takes a graph as input
and outputs the probability of being a “fake” graph produced by the generator.
As an example, [36] implements G as a graph-level decoder that outputs both
a probabilistic adjacency matrix Ã and a node label matrix L̃. The
discriminator takes an adjacency matrix A and a node label matrix L as input,
applies a Jumping Knowledge Network [132] to it, and decides whether the graph
is sampled from the generator or the dataset with a multi-layer perceptron. In
contrast, [126] works at the node level. Specifically, G generates structure-aware
node representations (based on the connectivity of a breadth-first search tree of
a random graph sampled from the training set), while the discriminator takes as
input two node representations and decides whether they come from the training
set or the generator, optimizing an objective function similar to Eq. 26.
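A compact sketch of one adversarial training step for the objective in Eq. 29, where G and D are placeholder modules mapping latent vectors to (probabilistic) graph representations and graphs to probabilities, respectively:

```python
import torch

def gan_step(G, D, real_graphs, opt_g, opt_d, latent_dim):
    """One alternating update of the discriminator and the generator (Eq. 29)."""
    z = torch.randn(len(real_graphs), latent_dim)
    fake_graphs = G(z)

    # Discriminator step: maximize log D(g) + log(1 - D(G(z))).
    opt_d.zero_grad()
    d_loss = -(torch.log(D(real_graphs)).mean()
               + torch.log(1 - D(fake_graphs.detach())).mean())
    d_loss.backward()
    opt_d.step()

    # Generator step: minimize log(1 - D(G(z))).
    opt_g.zero_grad()
    g_loss = torch.log(1 - D(fake_graphs)).mean()
    g_loss.backward()
    opt_g.step()
```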
4.4. Summary
Model Context Embedding Layers Nature
GNN [104] Recurrent Supervised Single Neural
NN4G [88] Constructive Supervised Adaptive Neural
GraphESN [44] Recurrent Untrained Single Neural
GCN [72] Feedforward Supervised Fixed Neural
GG-NN [78] Recurrent Supervised Fixed Neural
ECC [111] Feedforward Supervised Fixed Neural
GraphSAGE [54] Feedforward Both Fixed Neural
CGMM [3] Constructive Unsupervised Fixed Probabilistic
DGCNN [142] Feedforward Supervised Fixed Neural
DiffPool [137] Feedforward Supervised Fixed Neural
GAT [120] Feedforward Supervised Fixed Neural
R-GCN [105] Feedforward Supervised Fixed Neural
DGI [121] Feedforward Unsupervised Fixed Neural
GMNN [97] Feedforward Both Fixed Hybrid
GIN [131] Feedforward Supervised Fixed Neural
NMFPool [2] Feedforward Supervised Fixed Neural
SAGPool [76] Feedforward Supervised Fixed Neural
Top-k Pool [46] Feedforward Supervised Fixed Neural
FDGNN [45] Recurrent Untrained Fixed Neural
Table 2: Here we recap the main properties of DGNs, according to what we have discussed so
far. Please refer to Appendix A for a description of all acronyms. For clarity, “-” means not
applicable, as the model is a framework that relies on any generic learning methodology. The
“Layers” column describes how many layers are used by an architecture, which can be just
one, a fixed number or adaptively determined by the learning process. On the other hand,
“Context” refers to the context diffusion method of a specific layer, which was discussed in
Section 2.4.
5. Summary of Other Approaches and Tasks
There are several approaches and topics that are not covered by the tax-
onomy discussed in earlier sections. In particular, we focused our attention on
deep learning methods for graphs, which are mostly based on local and iterative
processing. For completeness of exposition, we now briefly review some of the
topics that were kept out.
5.1. Kernels
Spectral graph theory studies the properties of a graph by means of the as-
sociated adjacency and Laplacian matrices. Many machine learning problems
can be tackled with these techniques, for example Laplacian smoothing [100],
graph semi-supervised learning [22, 21] and spectral clustering [123]. A graph
can also be analyzed with signal processing tools, such as the Graph Fourier
Transform [59] and related adaptive techniques [20]. Generally speaking, spec-
tral techniques are meant to work on graphs with the same shape and different
node labels, as they are based on the eigen-decomposition of adjacency and
Laplacian matrices. More in detail, the eigenvector matrix Q of the Laplacian
constitutes an orthonormal basis used to compute the Graph Fourier Transform
of the node signal f ∈ R^{|Vg|}. The transform is defined as F(f) = Q^T f, and
its inverse is simply F^{-1}(Q^T f) = Q Q^T f thanks to the orthogonality of Q. Then,
the graph convolution between a filter θ and the graph signal f resembles the
convolution of standard Fourier analysis [12]:

F(f ⊗ θ) = Q W Q^T f,    (30)
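A minimal sketch of spectral filtering with the Graph Fourier Transform, where the diagonal filter W is illustratively taken as a function of the Laplacian eigenvalues:

```python
import torch

def spectral_filter(L, f):
    """Transform a node signal to the spectral domain, filter it, transform back."""
    # L: (N, N) symmetric graph Laplacian, f: (N,) node signal
    eigvals, Q = torch.linalg.eigh(L)   # columns of Q form an orthonormal basis
    f_hat = Q.T @ f                     # Graph Fourier Transform F(f) = Q^T f
    w = torch.exp(-eigvals)             # illustrative spectral filter coefficients
    return Q @ (w * f_hat)              # Q W Q^T f, with W = diag(w)
```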
5.3. Random-walks
Random walks on graphs have been studied for a long time [81, 122, 99, 64]. A random walk is defined as a random path
that connects two nodes in the graph. Depending on the reachable nodes, we
can devise different frameworks to learn a node representation: for example,
Node2Vec [50] maximizes the likelihood of a node given its surroundings by
exploring the graph using a random walk. Moreover, learnable parameters guide
the bias of the walk in the sense that a depth-first search can be preferred to a
breadth-first search and vice-versa. Similarly, DeepWalk [96] learns continuous
node representations by modeling random walks as sentences and maximizing a
likelihood objective. More recently, random walks have been used to generate
graphs as well [14], and a formal connection between the contextual information
diffusion of GCN and random walks has been explored [132].
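A minimal sketch of uniform random-walk generation, the core ingredient of DeepWalk-style approaches; the walks can then be treated as "sentences" of node identifiers and fed to a skip-gram model (the adjacency dictionary is illustrative):

```python
import random

def random_walk(adjacency, start, length, rng=random):
    """Generate a random walk of at most the given length from a starting node."""
    walk = [start]
    for _ in range(length):
        neighbors = adjacency[walk[-1]]
        if not neighbors:          # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk

graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(random_walk(graph, start=0, length=5))
```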
Given the importance of real-world applications that use graph data struc-
tures, there has recently been an increasing interest in studying the robustness
of DGNs to malicious attacks. The term adversarial training is used in the
context of deep neural networks to identify a regularization strategy based on
feeding the model with perturbed inputs. The aim is to make the network
resilient to adversarial attacks [11]. Recently, neural DGNs have been shown
to be prone to adversarial attacks as well [148], while the use of adversarial
training for regularization is relatively new [37]. The adversarial training ob-
jective function is formulated as a min-max game where one tries to minimize
the harmful effect of an adversarial example. Briefly, the model is trained with
original graphs from the training set, as well as with adversarial graphs. Ex-
amples of perturbations to make a graph adversarial include arc insertion and
deletions [134] or the addition of adversarial noise to the node representations
[68]. The adversarial graphs are labeled according to their closest match in the
dataset. This way, the space of the loss function is smooth, and it preserves the
predictive power of the model even in the presence of perturbed graphs.
5.5. Sequential generative models of graphs
Despite the steady increase in the number of works on graph learning method-
ologies, there are some lines of research that have not been widely investigated
yet. Below, we mention some of them to give practitioners insights about po-
tential research avenues.
[141] proposals in the literature. However, the limiting factor for the development
of this research line currently seems to be the lack of large datasets, especially
of a non-synthetic nature.
7. Applications
Among them, we mention NCI1 [125], PROTEINS [17], D&D [31], MUTAG
[28], PTC [60], and ENZYMES [106].
7.4. Security
The field of static code analysis is a promising new application avenue for
graph learning methods. Practical applications include: i) determining if two
assembly programs, which stem from the same source code, have been compiled
by means of different optimization techniques; ii) prediction of specific types of
bugs by means of augmented Abstract Syntax Trees [63]; iii) predicting whether
a program is likely to be the obfuscated version of another one; iv) automatically
extracting features from Control Flow Graphs [87].
DGNs are also interesting to solve tasks where the structure of a graph
changes over time. In this context, one is interested not only in capturing
the structural dependencies between nodes but also in the evolution of these
dependencies on the temporal domain. Approaches to this problem usually
combine a DGN (to extract structural properties of the graph) and a Recurrent
Neural Network (to model the temporal dependencies). Examples of applications
include the prediction of traffic in road networks [139], action recognition [128]
and supply chain [102] tasks.
8. Conclusions
After a pioneering phase in the early years of the millennium, the topic of
neural networks for graph processing is now a consolidated and vibrant research
area. In this expansive phase, research works at a fast pace producing a plethora
of models and variants thereof, with less focus on systematization and tracking
of early and recent literature. For the field to move further to a maturity phase,
we believe that certain aspects should be deepened and pursued with higher
priority. A first challenge, in this sense, pertains to a formalization of the differ-
ent adaptive graph processing models under a unified framework that highlights
their similarities, differences, and novelties. Such a framework should also allow
reasoning on theoretical and expressiveness properties [131] of the models at a
higher level. A notable attempt in this sense has been made by [48], but it does
not account for the most recent developments and the variety of mechanisms
being published (e.g., pooling operators and graph generation, to name a few).
An excellent reference, with respect to this goal, is the seminal work of [41],
which provided a general framework for tree-structured data processing. This
framework is expressive enough to generalize supervised learning to tree-to-tree
non-isomorphic transductions, and it generated a follow-up of theoretical research
[58, 56] which consolidated the field of recursive neural networks. The second
challenge relates to the definition of a set of rich and robust benchmarks to test
and assess models in fair, consistent, and reproducible conditions. Some works
[34, 108] are already bringing to the attention of the community some troubling
trends and pitfalls as concerns datasets and methodologies used to assess DGNs
in the literature. We believe such criticisms should be positively embraced by
the community to pursue the growth of the field. Some attempts to provide a
set of standardized data and methods now appear to be under development. Also,
recent progress has been facilitated by the growth and wide adoption by the
community of new software packages for the adaptive processing of graphs. In
particular, the PyTorch Geometric [39] and Deep Graph Library [127] packages
provide standardized interfaces to operate on graphs for ease of development.
Moreover, they allow training models using all the Deep Learning tricks of the
trade, such as GPU compatibility and graph mini-batching. The last challenge
relates to applications. We believe a methodology reaches its maturity when
it shows the transfer of research knowledge into impactful innovations for
society. Again, attempts in this sense are already underway, with good
candidates being in the fields of chemistry [18] and life-sciences [147].
Acknowledgements
This work has been partially supported by the Italian Ministry of Education,
University, and Research (MIUR) under project SIR 2014 LIST-IT (grant n.
RBSI14STDE).
Appendix A. Acronyms Table
Table A.3: Reference table with acronyms, their extended names, and associated references.
References
[1] Davide Bacciu and Antonio Bruno. Deep tree transductions - a short
survey. In Recent Advances in Big Data and Deep Learning, pages 236–
245. Springer, 2020.
[3] Davide Bacciu, Federico Errica, and Alessio Micheli. Contextual Graph
Markov Model: A deep and generative approach to graph processing. In
Proceedings of the 35th International Conference on Machine Learning
(ICML), volume 80, pages 294–303. PMLR, 2018.
[4] Davide Bacciu, Alessio Micheli, and Marco Podda. Edge-based sequen-
tial graph generation with recurrent neural networks. Neurocomputing.
Accepted, 2019.
[5] Davide Bacciu, Alessio Micheli, and Marco Podda. Graph generation by
sequential edge prediction. In Proceedings of the European Symposium
on Artificial Neural Networks, Computational Intelligence and Machine
Learning (ESANN), 2019.
[8] Daniel Beck, Gholamreza Haffari, and Trevor Cohn. Graph-to-sequence
Learning using Gated Graph Neural Networks. In Proceedings of the 56th
Annual Meeting of the Association for Computational Linguistics (ACL),
Volume 1 (Long Papers), pages 273–283, 2018.
[9] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term
dependencies with gradient descent is difficult. IEEE Transactions on
Neural Networks, 5(2):157–166, 1994.
[10] Anna Maria Bianucci, Alessio Micheli, Alessandro Sperduti, and Anton-
ina Starita. Application of cascade correlation networks for structures
to chemistry. Applied Intelligence, 12(1-2):117–147, 2000. Publisher:
Springer.
[11] Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise
of adversarial machine learning. Pattern Recognition, 84:317–331, 2018.
Publisher: Elsevier.
[15] John Adrian Bondy, Uppaluri Siva Ramachandra Murty, et al. Graph
theory with applications, volume 290. Macmillan London, 1976.
[16] Marco Bongini, Leonardo Rigutini, and Edmondo Trentin. Recursive neu-
ral networks for density estimation over generalized random graphs. IEEE
Transactions on Neural Networks and Learning Systems, 29(11):5441–
5458, 2018. Publisher: IEEE.
[17] Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vish-
wanathan, Alex J Smola, and Hans-Peter Kriegel. Protein function pre-
diction via graph kernels. Bioinformatics, 21(suppl 1):i47–i56, 2005. Pub-
lisher: Oxford University Press.
[18] John Bradshaw, Brooks Paige, Matt J Kusner, Marwin Segler, and
José Miguel Hernández-Lobato. A model to search for synthesizable
molecules. In Proceedings of the 33rd Conference on Neural Information
Processing Systems (NeurIPS), pages 7935–7947, 2019.
[19] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and
Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean
data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
[20] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral
networks and locally connected networks on graphs. Proceedings of the 2nd
International Conference on Learning Representations (ICLR), 2014.
[23] Jie Chen, Tengfei Ma, and Cao Xiao. FastGCN: Fast learning with graph
convolutional networks via importance sampling. In Proceedings of the
6th International Conference on Learning Representations (ICLR), 2018.
[24] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bah-
danau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning
phrase representations using rnn encoder-decoder for statistical machine
translation. In Proceedings of the 2014 Conference on Empirical Methods
in Natural Language Processing, (EMNLP), pages 1724–1734, 2014.
[27] Nicola De Cao and Thomas Kipf. MolGAN: An implicit generative model
for small molecular graphs. Workshop on Theoretical Foundations and
Applications of Deep Generative Models, International Conference on Ma-
chine Learning (ICML), 2018.
[30] Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Weighted graph cuts
without eigenvectors a multilevel approach. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 29(11):1944–1957, 2007. Publisher:
IEEE.
[32] David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bom-
barelli, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P. Adams. Con-
volutional networks on graphs for learning molecular fingerprints. In Pro-
ceedings of the 29th Conference on Neural Information Processing Systems
(NIPS), pages 2224–2232, 2015.
[33] Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publica-
tions of the Mathematical Institute of the Hungarian Academy of Science,
5(1):17–60, 1960.
[34] Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. A
fair comparison of graph neural networks for graph classification. In Pro-
ceedings of the 8th International Conference on Learning Representations
(ICLR), 2020.
[36] S. Fan and B. Huang. Conditional labeled graph generation with GANs.
In Workshop on Representation Learning on Graphs and Manifolds, In-
ternational Conference on Learning Representations (ICLR), 2019.
[37] Fuli Feng, Xiangnan He, Jie Tang, and Tat-Seng Chua. Graph adversarial
training: Dynamically regularizing based on graph structure. IEEE Trans-
actions on Knowledge and Data Engineering, 2019. Publisher: IEEE.
[38] Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, and Yue Gao.
Hypergraph neural networks. In Proceedings of the 33rd AAAI Conference
on Artificial Intelligence (AAAI), volume 33, pages 3558–3565, 2019.
[39] Matthias Fey and Jan Eric Lenssen. Fast graph representation learn-
ing with PyTorch Geometric. Workshop on Representation Learning on
Graphs and Manifolds, International Conference on Learning Represen-
tations (ICLR), 2019.
[40] Paolo Frasconi, Fabrizio Costa, Luc De Raedt, and Kurt De Grave. klog:
A language for logical and relational learning with kernels. Artificial In-
telligence, 217:117–143, 2014. Publisher: Elsevier.
[41] Paolo Frasconi, Marco Gori, and Alessandro Sperduti. A general frame-
work for adaptive processing of data structures. IEEE Transactions on
Neural Networks, 9(5):768–786, 1998. Publisher: IEEE.
[42] Frederik Diehl, Thomas Brunner, Michael Truong Le, and Alois Knoll. To-
wards graph pooling by edge contraction. In Workshop on Learning and
Reasoning with Graph-Structured Data, International Conference on Ma-
chine Learning (ICML), 2019.
[43] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of
statistical learning, volume 1. Springer series in statistics New York, 2001.
[44] Claudio Gallicchio and Alessio Micheli. Graph echo state networks. In
Proceedings of the International Joint Conference on Neural Networks
(IJCNN), pages 1–8. IEEE, 2010.
[45] Claudio Gallicchio and Alessio Micheli. Fast and deep graph neural net-
works. In Proceedings of the 34th AAAI Conference on Artificial Intelli-
gence (AAAI), 2020.
[46] Hongyang Gao and Shuiwang Ji. Graph U-nets. In Proceedings of the 36th
International Conference on Machine Learning (ICML), pages 2083–2092,
2019.
[48] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and
George E Dahl. Neural message passing for quantum chemistry. In
Proceedings of the 34th International Conference on Machine Learning
(ICML), pages 1263–1272, 2017.
[49] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Gener-
ative Adversarial Nets. In Proceedings of the 28th Conference on Neural
Information Processing Systems (NIPS), pages 2672–2680, 2014.
[50] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning
for networks. In Proceedings of the 22nd International Conference on
Knowledge Discovery and Data Mining (SIGKDD), pages 855–864. ACM,
2016.
[51] Aditya Grover, Aaron Zweig, and Stefano Ermon. Graphite: Iterative
generative modeling of graphs. In Proceedings of the 36th International
Conference on Machine Learning (ICML), pages 2434–2444, 2019.
[54] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation
learning on large graphs. In Proceedings of the 31st Conference on Neural
Information Processing Systems (NIPS), pages 1024–1034, 2017.
[55] William L. Hamilton, Rex Ying, and Jure Leskovec. Representation learn-
ing on graphs: Methods and applications. IEEE Data Engineering Bul-
letin, 40(3):52–74, 2017.
[57] Barbara Hammer, Alessio Micheli, Alessandro Sperduti, and Marc Strick-
ert. A general framework for unsupervised processing of structured data.
Neurocomputing, 57:3–35, 2004. Publisher: Elsevier.
[58] Barbara Hammer, Alessio Micheli, Alessandro Sperduti, and Marc Strick-
ert. Recursive self-organizing network models. Neural Networks, 17(8-
9):1061–1085, 2004. Publisher: Elsevier.
[60] Christoph Helma, Ross D. King, Stefan Kramer, and Ashwin Srinivasan.
The predictive toxicology challenge 2000–2001. Bioinformatics, 17(1):107–
108, 2001. Publisher: Oxford University Press.
[62] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neu-
ral computation, 9(8):1735–1780, 1997. Publisher: MIT Press.
[65] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparametrization
with gumbel-softmax. In Proceedings of the 5th International Conference
on Learning Representations (ICLR), 2017.
[66] Woosung Jeon and Dongsup Kim. FP2VEC: A new molecular featurizer
for learning molecular properties. Bioinformatics, 35(23):4979–4985, 2019.
[67] Jianwen Jiang, Yuxuan Wei, Yifan Feng, Jingxuan Cao, and Yue Gao.
Dynamic hypergraph neural networks. In Proceedings of the 28th Inter-
national Joint Conference on Artificial Intelligence (IJCAI), pages 2635–
2641, 2019.
[69] Wengong Jin, Regina Barzilay, and Tommi S. Jaakkola. Junction tree
variational autoencoder for molecular graph generation. In Proceedings of
the 35th International Conference on Machine Learning (ICML), pages
2328–2337, 2018.
[74] Youngchun Kwon, Jiho Yoo, Youn-Suk Choi, Won-Joon Son, Dongseon
Lee, and Seokho Kang. Efficient learning of non-autoregressive graph vari-
ational autoencoders for molecular graph generation. Journal of Chemin-
formatics, 11(1):70, 2019. Publisher: Springer.
[75] Yann LeCun, Yoshua Bengio, and others. Convolutional networks for
images, speech, and time series. The Handbook of Brain Theory and Neural
Networks, 3361(10):1995, 1995.
[76] Junhyun Lee, Inyeop Lee, and Jaewoo Kang. Self-attention graph pooling.
In Proceedings of the 36th International Conference on Machine Learning
(ICML), pages 3734–3743, 2019.
[77] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph
convolutional networks for semi-supervised learning. In Proceedings of the
32nd AAAI Conference on Artificial Intelligence (AAAI), 2018.
[78] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. Gated
Graph Sequence Neural Networks. In Proceedings of the 4th International
Conference on Learning Representations, (ICLR), 2016.
[79] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter W.
Battaglia. Learning deep generative models of graphs. CoRR,
abs/1803.03324, 2018.
[81] László Lovász and others. Random walks on graphs: A survey. Combina-
torics, Paul Erdos is eighty, 2(1):1–46, 1993.
[82] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlin-
earities improve neural network acoustic models. In Workshop on Deep
Learning for Audio, Speech and Language Processing, International Con-
ference on Machine Learning (ICML), 2013.
[84] Diego Marcheggiani, Joost Bastings, and Ivan Titov. Exploiting seman-
tics in neural machine translation with graph convolutional networks. In
Proceedings of the 2018 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technolo-
gies (NAACL-HLT), Volume 2 (Short Papers), pages 486–492, 2018.
[85] Diego Marcheggiani and Ivan Titov. Encoding sentences with graph con-
volutional networks for semantic role labeling. In Proceedings of the
2017 Conference on Empirical Methods in Natural Language Processing
(EMNLP), pages 1506–1515, 2017.
[86] Enrique S Marquez, Jonathon S Hare, and Mahesan Niranjan. Deep cas-
cade learning. IEEE Transactions on Neural Networks and Learning Sys-
tems, 29(11):5475–5485, 2018. Publisher: IEEE.
[87] Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Roberto Bal-
doni, and Leonardo Querzoni. Safe: Self-attentive function embeddings
for binary similarity. In Proceedings of the 16th International Conference
on Detection of Intrusions and Malware, and Vulnerability Assessment
(DIMVA), pages 309–329. Springer, 2019.
[89] Alessio Micheli, Diego Sona, and Alessandro Sperduti. Contextual pro-
cessing of structured data by recursive cascade correlation. IEEE Trans-
actions on Neural Networks, 15(6):1396–1410, 2004. Publisher: IEEE.
of the 2nd Workshop on Abusive Language Online (ALW2), pages 1–10,
2018.
[93] Galileo Mark Namata, Ben London, Lise Getoor, and Bert Huang. Query-
driven active surveying for collective classification. In Proceedings of the
Workshop on Mining and Learning with Graphs, 2012.
[95] Michel Neuhaus and Horst Bunke. Self-organizing maps for learning the
edit costs in graph matching. IEEE Transactions on Systems, Man, and
Cybernetics, Part B (Cybernetics), 35(3):503–514, 2005.
[96] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online
learning of social representations. In Proceedings of the 20th International
Conference on Knowledge Discovery and Data Mining (SIGKDD), pages
701–710. ACM, 2014.
[97] Meng Qu, Yoshua Bengio, and Jian Tang. GMNN: Graph Markov Neural
Networks. In Proceedings of the 36th International Conference on Machine
Learning (ICML), pages 5241–5250, 2019.
[98] Liva Ralaivola, Sanjay J Swamidass, Hiroto Saigo, and Pierre Baldi.
Graph kernels for chemical informatics. Neural Networks, 18(8):1093–
1110, 2005. Publisher: Elsevier.
[99] Leonardo F. R. Ribeiro, Pedro H. P. Saverese, and Daniel R. Figueiredo.
struc2vec: Learning node representations from structural identity. In Pro-
ceedings of the 23rd International Conference on Knowledge Discovery and
Data Mining (SIGKDD), pages 385–394. ACM, 2017.
[100] Veeru Sadhanala, Yu-Xiang Wang, and Ryan Tibshirani. Graph sparsifi-
cation approaches for Laplacian smoothing. In Artificial Intelligence and
Statistics, pages 1250–1259, 2016.
[101] Bidisha Samanta, Abir De, Gourhari Jana, Pratim Kumar Chattaraj,
Niloy Ganguly, and Manuel Gomez Rodriguez. NeVAE: A deep generative
model for molecular graphs. In Proceedings of the 33rd AAAI Conference
on Artificial Intelligence (AAAI), pages 1110–1117, 2019.
[102] Tae San Kim, Won Kyung Lee, and So Young Sohn. Graph convolu-
tional network approach applied to predict hourly bike-sharing demands
considering spatial, temporal, and global effects. PLoS ONE, 14(9), 2019.
Publisher: Public Library of Science.
[103] Lawrence K Saul and Michael I Jordan. Mixed memory Markov models:
Decomposing complex stochastic processes as mixtures of simpler ones.
Machine Learning, 37(1):75–87, 1999. Publisher: Springer.
[104] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and
Gabriele Monfardini. The graph neural network model. IEEE Transac-
tions on Neural Networks, 20(1):61–80, 2009. Publisher: IEEE.
[105] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg,
Ivan Titov, and Max Welling. Modeling relational data with graph con-
volutional networks. In Proceedings of the 15th European Semantic Web
Conference (ESWC), pages 593–607. Springer, 2018.
[106] Ida Schomburg, Antje Chang, Christian Ebeling, Marion Gremse, Chris-
tian Heldt, Gregor Huhn, and Dietmar Schomburg. BRENDA, the enzyme
database: updates and major new developments. Nucleic Acids Research,
32(suppl 1), 2004.
[107] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gal-
lagher, and Tina Eliassi-Rad. Collective classification in network data. AI
Magazine, 29(3):93–93, 2008.
[109] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt
Mehlhorn, and Karsten M Borgwardt. Weisfeiler-Lehman graph kernels.
Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.
[113] Richard Socher, Cliff C Lin, Chris Manning, and Andrew Y Ng. Parsing
natural scenes and natural language with recursive neural networks. In
Proceedings of the 28th International Conference on Machine Learning
(ICML), pages 129–136, 2011.
[114] Alessandro Sperduti and Antonina Starita. Supervised neural networks for
the classification of structures. IEEE Transactions on Neural Networks,
8(3):714–735, 1997. Publisher: IEEE.
[115] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved
semantic representations from tree-structured Long Short-Term Memory
networks. In Proceedings of the 53rd Annual Meeting of the Association
for Computational Linguistics (ACL), pages 1556–1566, 2015.
[116] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schölkopf.
Wasserstein auto-encoders. In Proceedings of the 6th International Con-
ference on Learning Representations (ICLR), 2018.
[117] Edmondo Trentin and Ernesto Di Iorio. Nonparametric small random net-
works for graph-structured pattern recognition. Neurocomputing, 313:14–
24, 2018.
[119] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is
all you need. In Proceedings of the 31st Conference on Neural Information
Processing Systems (NIPS), pages 5998–6008, 2017.
[121] Petar Velickovic, William Fedus, William L. Hamilton, Pietro Liò, Yoshua
Bengio, and R. Devon Hjelm. Deep Graph Infomax. In Proceedings of the
7th International Conference on Learning Representations (ICLR), 2019.
[122] S Vichy N Vishwanathan, Nicol N Schraudolph, Risi Kondor, and
Karsten M Borgwardt. Graph kernels. Journal of Machine Learning Re-
search, 11(Apr):1201–1242, 2010.
[124] Edward Wagstaff, Fabian B Fuchs, Martin Engelcke, Ingmar Posner, and
Michael Osborne. On the limitations of representing functions on sets.
In Proceedings of the 36th International Conference on Machine Learning
(ICML), pages 6487–6494, 2019.
[125] Nikil Wale, Ian A Watson, and George Karypis. Comparison of descriptor
spaces for chemical compound retrieval and classification. Knowledge and
Information Systems, 14(3):347–375, 2008. Publisher: Springer.
[126] Hongwei Wang, Jia Wang, Jialin Wang, Miao Zhao, Weinan Zhang,
Fuzheng Zhang, Xing Xie, and Minyi Guo. GraphGAN: Graph repre-
sentation learning with generative adversarial nets. In Proceedings of the
32nd AAAI Conference on Artificial Intelligence (AAAI), pages 2508–
2515, 2018.
[127] Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye,
Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, Ziyue Huang, Qipeng Guo,
Hao Zhang, Haibin Lin, Junbo Zhao, Jinyang Li, Alexander J Smola,
and Zheng Zhang. Deep Graph Library: Towards efficient and scal-
able deep learning on graphs. In Workshop on Representation Learning on
Graphs and Manifolds, International Conference on Learning Represen-
tations (ICLR), 2019.
[128] Xiaolong Wang and Abhinav Gupta. Videos as space-time region graphs.
In Proceedings of the 15th European Conference on Computer Vision
(ECCV), pages 399–417, 2018.
[129] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bron-
stein, and Justin M Solomon. Dynamic graph CNN for learning on point
clouds. ACM Transactions on Graphics (TOG), 38(5):146, 2019. Pub-
lisher: ACM.
[130] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang,
and Philip S. Yu. A comprehensive survey on graph neural networks.
CoRR, abs/1901.00596, 2019.
[131] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How pow-
erful are graph neural networks? In Proceedings of the 7th International
Conference on Learning Representations (ICLR), 2019.
[132] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi
Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs
with jumping knowledge networks. In Proceedings of the 35th International
Conference on Machine Learning (ICML), pages 5453–5462, 2018.
[133] Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceed-
ings of the 21st International Conference on Knowledge Discovery and
Data Mining (SIGKDD), pages 1365–1374. ACM, 2015.
[134] Liang Yang, Zesheng Kang, Xiaochun Cao, Di Jin, Bo Yang, and Yuan-
fang Guo. Topology optimization based graph convolutional network. In
Proceedings of the 28th International Joint Conference on Artificial Intel-
ligence (IJCAI), pages 4054–4061, 2019.
[135] Ruiping Yin, Kan Li, Guangquan Zhang, and Jie Lu. A deeper graph
neural network for recommender systems. Knowledge-Based Systems,
185:105020, 2019. Publisher: Elsevier.
[136] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L.
Hamilton, and Jure Leskovec. Graph convolutional neural networks for
web-scale recommender systems. In Proceedings of the 24th International
Conference on Knowledge Discovery and Data Mining (SIGKDD), pages
974–983. ACM, 2018.
[137] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamil-
ton, and Jure Leskovec. Hierarchical graph representation learning with
differentiable pooling. In Proceedings of the 32nd Conference on Neural
Information Processing Systems (NeurIPS), 2018.
[138] Jiaxuan You, Rex Ying, Xiang Ren, William L. Hamilton, and Jure
Leskovec. GraphRNN: Generating realistic graphs with deep auto-
regressive models. In Proceedings of the 35th International Conference
on Machine Learning (ICML), 2018.
[139] Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph con-
volutional networks: A deep learning framework for traffic forecasting.
In Proceedings of the 27th International Joint Conference on Artificial
Intelligence (IJCAI), 2018.
[141] Daniele Zambon, Cesare Alippi, and Lorenzo Livi. Concept drift and
anomaly detection in graph streams. IEEE Transactions on Neural Net-
works and Learning Systems, 29(11):5592–5605, 2018. Publisher: IEEE.
[142] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-
to-end deep learning architecture for graph classification. In Proceedings
of the 32nd AAAI Conference on Artificial Intelligence (AAAI), 2018.
[143] Si Zhang, Hanghang Tong, Jiejun Xu, and Ross Maciejewski. Graph
convolutional networks: a comprehensive review. Computational Social
Networks, 6(1):11, 2019. Publisher: Springer.
[144] Ziwei Zhang, Peng Cui, and Wenwu Zhu. Deep learning on graphs: A
survey. CoRR, abs/1812.04202, 2018.
[145] Zizhao Zhang, Haojie Lin, and Yue Gao. Dynamic hyper-
graph structure learning. In Proceedings of the 27th International Joint
Conference on Artificial Intelligence (IJCAI), pages 3162–3169, 2018.
[146] Dengyong Zhou, Jiayuan Huang, and Bernhard Schölkopf. Learning with
hypergraphs: Clustering, classification, and embedding. In Proceedings
of the 21st Conference on Neural Information Processing Systems (NIPS),
pages 1601–1608, 2007.
[147] Marinka Zitnik, Monica Agrawal, and Jure Leskovec. Modeling polyphar-
macy side effects with graph convolutional networks. Bioinformatics,
34(13):i457–i466, 2018.