A New Model For Learning in Graph Domains
I. INTRODUCTION

In several machine learning applications the data of interest can be suitably represented in the form of sequences, trees, and, more generally, directed or undirected graphs, for instance in chemistry [1], software engineering, and image processing [2]. In those applications, the goal consists of learning from examples a function τ that maps a graph G and one of its nodes n to a vector of reals: τ(G, n) ∈ R^m.

More precisely, we can distinguish two classes of applications according to whether τ(G, n) depends on the node n or not. Those applications will be called node focused and graph focused, respectively. Object localization is an example of a node focused application. An image can be represented by a Region Adjacency Graph (RAG), where the nodes denote the homogeneous regions of the image and the arcs represent their adjacency relationship (Fig. 1). This problem can be solved by a function τ which classifies the nodes of the RAG according to whether the corresponding region belongs to the object or not. For example, the output of τ for Fig. 1 might be 1 for the black nodes, which correspond to the house, and −1 otherwise. On the other hand, image classification is an example of a graph focused application. For instance, τ(G) may classify an image represented by G into different classes, e.g., houses, cars, people, and so on.

Traditional approaches usually cope with graphs by a preprocessing procedure that transforms the graphs into simpler representations (e.g., vectors or sequences of reals), which can then be processed by common machine learning techniques. However, valuable information may be lost during the preprocessing and, as a consequence, the application may suffer from poor performance and generalization.

Recursive neural networks (RNNs) [3], [4] are a neural model that tries to overcome this problem. In fact, RNNs can directly process graphs. The main idea consists of encoding the graphical information into a set of states associated with the graph nodes. The states are dynamically updated following the topological relationships among the nodes. Finally, an output is computed using the encodings stored in the states. However, the RNN model suffers from a number of limitations: RNNs can process only directed acyclic graphs and can be used only on graph focused problems, i.e., τ(G, n) must be independent of n.

In this paper, we present a new neural network model, called graph neural network (GNN), that extends recursive neural networks. GNNs can process most of the practically useful graphs and can be applied to both graph and node focused problems. A learning algorithm for GNNs is also described, along with some experimental results that assess the properties of the model. Finally, it is worth mentioning that, under mild conditions, any function τ on graphs can be approximated in probability by a GNN. Such a result, which, for reasons of space, is not further discussed in this paper, is proved in [5].

The structure of the paper is as follows: Section II presents the GNN model along with its main properties. Section III contains some experimental results. Finally, in Section IV conclusions are drawn.
II. GRAPH NEURAL NETWORKS

In the following, |·| denotes the modulus or the cardinality operator, according to whether it is applied to a real number or to a set, respectively. The one-norm of a vector v is denoted by ‖v‖1, i.e., ‖v‖1 = Σi |vi|. A graph G is a pair (N, E), where N is a set of nodes and E a set of edges. The nodes connected to n by an arc are represented by ne[n]. Each node may have a label, denoted by ln ∈ R^q. Labels usually include features of the object corresponding to the node. For example, in the case of a RAG (Fig. 1), node labels may represent properties of the regions, e.g., area and perimeter.

The considered graphs may be either positional or non-positional. Non-positional graphs are those described so far. In positional graphs, an injective function µn : ne[n] → IN is defined for each node n. Here, IN is the set of natural numbers, and µn assigns a different position to each neighbor u ∈ ne[n]. The position can be used to store useful information, e.g., a sorting of the neighbors according to their importance.
The intuitive idea underlying GNNs is that nodes in a graph represent objects or concepts and edges represent their relationships. Thus, we can attach to each node n a vector xn ∈ R^s, called state, which collects a representation of the object denoted by n. In order to define xn, we observe that related nodes are connected by edges. Thus, xn can be naturally specified using the information contained in the neighborhood of n (see Fig. 2), by means of a parametric transition function fw:

xn = fw(ln, xne[n], lne[n]),   n ∈ N,   (1)

where ln, xne[n], lne[n] are the label of n, and the states and the labels of the nodes in the neighborhood of n, respectively.

Fig. 2. State x1 depends on the neighborhood information: x1 = fw(l1, x2, x3, x5, l2, l3, l5).
For each node n, an output vector on ∈ R^m is also defined, which depends on the state xn and the label ln. The dependence is described by a parametric output function gw:

on = gw(xn, ln),   n ∈ N.   (2)

Let x and l be the vectors constructed by stacking all the states and all the labels, respectively. Then, Equations (1) and (2) can be written as

x = Fw(x, l),
o = Gw(x, l),   (3)

where Fw and Gw are the composition of |N| instances of fw and gw, respectively.

Notice that x is correctly defined only if the solution of system (3) is unique. The key choice adopted in the proposed approach consists of designing fw such that Fw is a contraction mapping¹ with respect to the state x. In fact, the Banach fixed point theorem [6] guarantees that if Fw is a contraction mapping, then Eq. (3) has a solution and the solution is unique.

¹ A function l : R^a → R^a is a contraction mapping w.r.t. a vector norm ‖·‖ if there exists a real µ, 0 ≤ µ < 1, such that ‖l(y1) − l(y2)‖ ≤ µ‖y1 − y2‖ for any y1, y2 ∈ R^a.

Thus, Eqs. (1) and (2) define a method to produce an output on for each node, i.e., they realize a parametric function ϕw(G, n) = on which operates on graphs. The corresponding machine learning problem consists of adapting the parameters w such that ϕw approximates the data in the learning set L = {(Gi, ni, ti) | 1 ≤ i ≤ p}, where each triple (Gi, ni, ti) denotes a graph Gi, one of its nodes ni, and the desired output ti. In practice, the learning problem can be implemented by the minimization of the quadratic error function

ew = Σ_{i=1}^{p} (ti − ϕw(Gi, ni))^2.   (4)
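As an illustration of the notation in Eqs. (1)-(4), the following sketch (Python with NumPy, not the authors' Matlab implementation) builds a toy labeled graph, applies the global transition Fw and output Gw once, and evaluates the quadratic error ew. The concrete fw and gw are placeholder functions introduced only for this example; the models actually used in the paper are given in Section II-C.

```python
# Minimal sketch of Eqs. (1)-(4); fw and gw below are hypothetical placeholders.
import numpy as np

s, m = 2, 1                      # state and output dimensions

# A toy labeled graph: node labels and neighbor lists (undirected).
labels = {1: np.array([0.3]), 2: np.array([0.7]), 3: np.array([0.1])}
ne = {1: [2, 3], 2: [1], 3: [1]}

def fw(l_n, x_neigh, l_neigh):
    # Placeholder local transition function (Eq. 1): any map into R^s works here.
    total = sum(x.sum() + l.sum() for x, l in zip(x_neigh, l_neigh))
    return np.tanh(l_n.sum() + total * np.ones(s))

def gw(x_n, l_n):
    # Placeholder local output function (Eq. 2).
    return np.array([np.tanh(x_n.sum() + l_n.sum())])

def Fw(x, labels, ne):
    # Global transition (Eq. 3): one instance of fw per node, collected per node.
    return {n: fw(labels[n], [x[u] for u in ne[n]], [labels[u] for u in ne[n]])
            for n in labels}

def Gw(x, labels):
    return {n: gw(x[n], labels[n]) for n in labels}

x = {n: np.zeros(s) for n in labels}     # initial states
x = Fw(x, labels, ne)                    # one application of Fw
o = Gw(x, labels)                        # outputs o_n

# Quadratic error (Eq. 4) on a toy learning set {(G, n_i, t_i)}.
targets = {1: 1.0, 3: -1.0}
e_w = sum((t - o[n][0]) ** 2 for n, t in targets.items())
print(e_w)
```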
In GNNs, the transition function fw is implemented as a sum of contributions, one for each neighbor of n:

fw(ln, xne[n], lne[n]) = Σ_{u ∈ ne[n]} hw(ln, xu, lu),   (5)

where hw is a parametric function. The intuitive idea underlying Eq. (5) consists of computing the state xn by summing a set of "contributions", each generated considering only one node in the neighborhood of n. A similar approach was already used with success in recursive neural networks [7], [8].

Moreover, it is worth mentioning that GNNs can also be applied to directed graphs. For this purpose, the input of fw (or hw) must be extended with information about the edge directions, for instance a flag du for each node u ∈ ne[n] such that du = 1 if the edge between u and n is directed toward n, and du = 0 otherwise. Finally, in graph focused applications only one output for each graph is produced. This can be achieved in several ways. For example, a special node s can be selected in each graph and the corresponding output os is returned. Such an approach is also used in recursive neural networks.
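A sketch of the sum-of-contributions transition of Eq. (5) follows (Python/NumPy). Here hw is a hypothetical placeholder standing in for the parametric function (the linear and neural realizations appear in Section II-C), and the optional du flag illustrates the directed-graph extension mentioned above.

```python
# Sketch of Eq. (5): the new state of n is the sum of one contribution per
# neighbor, with an optional direction flag d_u.  hw is a placeholder.
import numpy as np

s = 2  # state dimension

def hw(l_n, x_u, l_u, d_u=0.0):
    # Placeholder contribution of neighbor u to the state of n.
    return np.tanh(l_n.sum() + x_u.sum() + l_u.sum() + d_u) * np.ones(s)

def fw(l_n, x_neigh, l_neigh, d_neigh=None):
    # Eq. (5): sum over the neighborhood of n.
    if d_neigh is None:
        d_neigh = [0.0] * len(x_neigh)
    return sum(hw(l_n, x_u, l_u, d_u)
               for x_u, l_u, d_u in zip(x_neigh, l_neigh, d_neigh))

# Example: node n with two neighbors, the first edge pointing toward n.
l_n = np.array([0.5])
x_neigh = [np.zeros(s), np.ones(s)]
l_neigh = [np.array([0.2]), np.array([0.9])]
print(fw(l_n, x_neigh, l_neigh, d_neigh=[1.0, 0.0]))
```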
In order to implement the model formally defined by Equations (1) and (2), the following items must be provided:
1) a method to solve Eq. (1);
2) a learning algorithm to adapt fw and gw by examples from the train set;
3) an implementation of fw and gw.
These aspects will be considered in the following subsections.

A. Computing the states

The Banach fixed point theorem suggests a simple algorithm to compute the fixed point of Eq. (3). It states that if Fw is a contraction mapping, then the dynamical system

x(t + 1) = Fw(x(t), l),   (6)

where x(t) denotes the t-th iterate of x, converges exponentially fast to the solution of Eq. (3) for any initial state x(0). Thus, xn and on can be obtained by iterating

xn(t + 1) = fw(ln, xne[n](t), lne[n]),
on(t + 1) = gw(xn(t + 1), ln),   n ∈ N.   (7)

Note that the computation described in Eq. (7) can be interpreted as the representation of a neural network, called encoding network, that consists of units which compute fw and gw (see Fig. 3). In order to build the encoding network, each node of the graph is replaced by a unit computing the function fw. Each unit stores the current state xn(t) of the corresponding node n and, when activated, it calculates the state xn(t + 1) using the labels and the states stored in its neighborhood. The simultaneous and repeated activation of the units produces the behavior described by Eq. (7). In the encoding network, the output of node n is produced by another unit, which implements gw.

Fig. 3. A graph and its corresponding encoding network.
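A minimal sketch of the iteration in Eqs. (6)-(7) (Python/NumPy), reusing the placeholder fw and gw from the previous sketches. The tolerance-based stopping rule is an assumption, since the paper does not specify the convergence criterion.

```python
# Sketch of the fixed-point computation of Eqs. (6)-(7).
# fw, gw, labels, ne are assumed to be defined as in the previous sketches.
import numpy as np

def compute_states(labels, ne, fw, gw, s, tol=1e-6, max_iter=1000):
    x = {n: np.zeros(s) for n in labels}          # initial state x(0)
    for _ in range(max_iter):
        x_new = {n: fw(labels[n],
                       [x[u] for u in ne[n]],
                       [labels[u] for u in ne[n]]) for n in labels}
        # Stop when the update is small; if Fw is a contraction,
        # convergence is exponentially fast (Banach fixed point theorem).
        delta = max(np.abs(x_new[n] - x[n]).max() for n in labels)
        x = x_new
        if delta < tol:
            break
    o = {n: gw(x[n], labels[n]) for n in labels}  # outputs, second line of Eq. (7)
    return x, o
```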
B. A learning algorithm

The learning algorithm consists of two phases:
(a) the states xn(t) are iteratively updated, using Eq. (7), until they reach a stable fixed point x(T) = x at time T;
(b) the gradient ∂ew(T)/∂w is computed and the weights w are updated according to a gradient descent strategy.
Thus, while phase (a) moves the system to a stable point, phase (b) adapts the weights to move the outputs towards the desired targets. These two phases are repeated until a given stopping criterion is reached. It can be formally proved that if Fw and Gw in Eq. (3) are differentiable w.r.t. w and x, then the above learning procedure implements a gradient descent strategy on the error function ew [5].

In fact, our algorithm is obtained by combining the backpropagation through structure algorithm, which is adopted for training recursive neural networks [4], and the Almeida–Pineda algorithm [9], [10]. The latter is a particular version of the backpropagation through time algorithm which can be used to train recurrent networks. Our approach applies the Almeida–Pineda algorithm to the encoding network, where all instances of fw and gw are considered to be independent networks. It produces a set of gradients, one for each instance of fw and gw. Those gradients are accumulated to compute ∂ew(T)/∂w.
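The two-phase scheme can be outlined as follows (Python/NumPy). The gradient here is obtained by crude finite differences purely for illustration; the paper's actual algorithm combines backpropagation through structure with the Almeida–Pineda procedure on the encoding network, which is not reproduced in this sketch. compute_states is taken from the previous sketch, and make_fw_gw is a hypothetical factory mapping a parameter vector w to the pair (fw, gw).

```python
# Outline of the two-phase learning loop: phase (a) relaxes to the fixed point,
# phase (b) takes a gradient step.  The finite-difference gradient is a stand-in
# for the Almeida-Pineda / backpropagation-through-structure gradient of the paper.
import numpy as np

def error(w, learning_set, make_fw_gw, s):
    # e_w of Eq. (4): run phase (a) on each graph, then sum the squared errors.
    fw, gw = make_fw_gw(w)            # hypothetical factory: parameters -> (fw, gw)
    e = 0.0
    for labels, ne, node, target in learning_set:
        _, o = compute_states(labels, ne, fw, gw, s)
        e += float((target - o[node][0]) ** 2)
    return e

def train(w, learning_set, make_fw_gw, s, lr=0.01, epochs=100, eps=1e-5):
    for _ in range(epochs):
        grad = np.zeros_like(w)
        base = error(w, learning_set, make_fw_gw, s)
        for j in range(w.size):       # finite-difference gradient (illustration only)
            w_pert = w.copy()
            w_pert[j] += eps
            grad[j] = (error(w_pert, learning_set, make_fw_gw, s) - base) / eps
        w = w - lr * grad             # phase (b): gradient descent step
    return w
```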
C. Implementing the transition and the output functions

In the following, two different GNN models, called linear GNN and neural GNN, are described. In both cases, the output function gw is implemented by a multilayer feedforward neural network and the transition function defined in (5) is used. On the other hand, linear and neural GNNs differ in the implementation of the function hw of Eq. (5) and in the strategy adopted to ensure that Fw is a contraction mapping.

1) Linear GNN: In this model, hw is

hw(ln, xu, lu) = An,u xu + bn,

where the vector bn ∈ R^s and the matrix An,u ∈ R^(s×s) are defined by the outputs of two feedforward neural networks, whose parameters correspond to the parameters of the GNN. More precisely, let φw : R^(2q) → R^(s^2) and ρw : R^q → R^s be the functions implemented by two multilayer feedforward neural networks. Then,

An,u = (µ / (s·|ne[u]|)) · Resize(φw(ln, lu)),
bn = ρw(ln),

where µ ∈ (0, 1) and Resize(·) denotes the operator that allocates the elements of an s^2-dimensional vector into an s × s matrix. Here, it is further assumed that ‖φw(ln, lu)‖1 ≤ s^2 holds, which is straightforwardly verified if the output neurons of the network implementing φw use an appropriately bounded activation function, for instance a hyperbolic tangent.

Notice that, in this case, Fw is a contraction mapping for any set of parameters w. In fact,

Fw(x, l) = Ax + b,   (8)

where b is the vector constructed by stacking all the bn, and A is a block matrix {Ān,u}, with Ān,u = An,u if u is a neighbor of n and Ān,u = 0 otherwise. By simple algebra, it is easily proved that ‖∂Fw/∂x‖1 = ‖A‖1 ≤ µ, which implies that Fw is a contraction mapping.
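A sketch of the linear transition contribution follows (Python/NumPy). Here phi_w and rho_w stand in for the two feedforward networks of the paper (replaced by fixed toy functions with tanh-bounded outputs, so the boundedness assumption holds), which lets the scaling in A_{n,u} and the resulting norm bound be checked numerically.

```python
# Sketch of the linear GNN contribution h_w(l_n, x_u, l_u) = A_{n,u} x_u + b_n.
# phi_w and rho_w are stand-ins for the two feedforward networks of the paper.
import numpy as np

s, q, mu = 2, 1, 0.9
rng = np.random.default_rng(0)
W_phi = rng.normal(size=(s * s, 2 * q))   # toy "network" weights (assumption)
W_rho = rng.normal(size=(s, q))

def phi_w(l_n, l_u):
    # Bounded output (each component in [-1, 1]), hence ||phi_w||_1 <= s^2.
    return np.tanh(W_phi @ np.concatenate([l_n, l_u]))

def rho_w(l_n):
    return np.tanh(W_rho @ l_n)

def A_nu(l_n, l_u, deg_u):
    # A_{n,u} = mu / (s * |ne[u]|) * Resize(phi_w(l_n, l_u))
    return (mu / (s * deg_u)) * phi_w(l_n, l_u).reshape(s, s)

def h_w(l_n, x_u, l_u, deg_u):
    return A_nu(l_n, l_u, deg_u) @ x_u + rho_w(l_n)

# Numerical check of the contraction bound for one block:
l_n, l_u = np.array([0.4]), np.array([0.8])
A = A_nu(l_n, l_u, deg_u=3)
print(np.abs(A).sum(axis=0).max())  # 1-norm of the block, at most mu / |ne[u]|
```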
2) Neural GNN: In this model, hw is implemented by a feedforward neural network. Since three-layer neural networks are universal approximators, this method makes it possible to implement any function hw. However, not all the parameters w can be used, because it must be ensured that the corresponding global transition function Fw is a contraction. In practice, this goal can be achieved by adding a penalty term to the error function:

ew = Σ_{i=1}^{p} (ti − ϕw(Gi, ni))^2 + β L(‖∂Fw/∂x‖1),

where L(y) = (y − µ)^2 if y > µ, and L(y) = 0 otherwise. Moreover, β is a predefined parameter balancing the importance of the penalty term against the error on the patterns, and the parameter µ ∈ (0, 1) defines a desired upper bound on the contraction constant of Fw.
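The penalty term can be sketched as follows (Python). The one-norm of the Jacobian ∂Fw/∂x is passed in as a number, since how it is obtained (analytically, as for the linear model, or estimated for the neural one) is not detailed here; the helper below is only an assumption for illustration.

```python
# Sketch of the penalized error of the neural GNN:
#   e_w = sum_i (t_i - phi_w(G_i, n_i))**2 + beta * L(||dFw/dx||_1),
# with L(y) = (y - mu)**2 for y > mu and 0 otherwise.

def penalty(jac_norm_1, mu=0.9, beta=10.0):
    # L applied to the 1-norm of the Jacobian of Fw w.r.t. x.
    return beta * (jac_norm_1 - mu) ** 2 if jac_norm_1 > mu else 0.0

def penalized_error(residuals, jac_norm_1, mu=0.9, beta=10.0):
    # residuals: list of (t_i - phi_w(G_i, n_i)) values.
    return sum(r ** 2 for r in residuals) + penalty(jac_norm_1, mu, beta)

print(penalized_error([0.1, -0.3], jac_norm_1=1.2))   # norm above mu: penalized
print(penalized_error([0.1, -0.3], jac_norm_1=0.5))   # within the bound: no penalty
```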
III. EXPERIMENTAL RESULTS

The approach has been evaluated on a set of toy problems derived from graph theory and on applications of practical relevance in machine learning. Each problem belongs to one of the following categories:
1) connection-based problems;
2) label-based problems;
3) general problems.
The first category contains problems where τ(G, n) depends only on the graph connectivity and is independent of the labels. On the other hand, in label-based problems τ(G, n) can be computed using only the label ln of node n. Finally, the last category collects examples in which GNNs must use both topological and labeling information.

Both the linear and the neural model were tested. Three-layer (one hidden layer) feedforward networks with sigmoidal activation functions were used to implement the functions involved in the two models, i.e., gw, φw, and ρw in linear GNNs, and gw, hw in neural GNNs (see Section II-C).

Unless otherwise stated, the state dimension s was 2. The presented results were averaged over five different trials. In each trial, the dataset was a collection of random connected graphs with a given density δ. The dataset construction procedure consisted of two steps: i) each pair of nodes is connected with probability δ; ii) the graph is checked to verify whether it is connected and, if it is not, random edges are inserted until the condition is satisfied. The dataset was split into a train, a validation, and a test set. The validation set was used to select the best GNN produced by the learning procedure. In every trial, the training procedure performed 5000 epochs and every 20 epochs it evaluated the current GNN on the validation set. The best GNN was the one that achieved the lowest error on the validation set.
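The two-step dataset construction described above can be sketched as follows (Python). The breadth-first connectivity check and the way missing edges are added are assumptions, since the paper does not specify those details.

```python
# Sketch of the random connected graph construction: i) connect each pair of
# nodes with probability delta; ii) add random edges until the graph is connected.
import random
from collections import deque

def random_connected_graph(n_nodes, delta, rng=random):
    edges = {(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)
             if rng.random() < delta}
    while not _is_connected(n_nodes, edges):
        i, j = rng.sample(range(n_nodes), 2)        # assumed repair strategy
        edges.add((min(i, j), max(i, j)))
    return edges

def _is_connected(n_nodes, edges):
    adj = {i: [] for i in range(n_nodes)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    seen, queue = {0}, deque([0])
    while queue:
        for v in adj[queue.popleft()]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == n_nodes

graph = random_connected_graph(20, delta=0.2)
```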
The GNN software was implemented in Matlab 7.0.1². The experiments were run on a Power Mac G5 with a 2 GHz PowerPC processor and 2 GB of RAM. For all the experiments, memory requirements never grew beyond 200 MB.

² Matlab is copyright 1994-2004 by The MathWorks, Inc.
A. Connection-based problems

1) The Clique problem: A clique of size k is a complete subgraph with k nodes³ in a larger graph. The goal of this experiment consisted of detecting all the cliques of size 5 in the input graphs. More precisely, the function τ that should be implemented by the GNN was τ(G, n) = 1 if n belongs to a clique of size 5, and τ(G, n) = −1 otherwise. The dataset contained 1400 random graphs with 20 nodes: 200 graphs in the train set, 200 in the validation set, and 1000 in the test set. A clique of size 5 was forced into every graph of the dataset. Thus, each graph had at least one clique, but it might contain more cliques due to the random dataset construction. The desired target tn = τ(G, n) of each node was generated by a brute force algorithm that looked for cliques in the graphs.

³ A graph is complete if there is an edge between each pair of nodes.
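A brute-force target generator of the kind mentioned above can be sketched as follows (Python); it simply enumerates all 5-node subsets and checks completeness, which is feasible for the 20-node graphs used here. The edge-set representation matches the earlier dataset sketch.

```python
# Sketch of the brute-force labeling: t_n = 1 if n belongs to some clique of
# size 5, and t_n = -1 otherwise.  `edges` is a set of (i, j) pairs with i < j.
from itertools import combinations

def clique_targets(n_nodes, edges, k=5):
    targets = {n: -1 for n in range(n_nodes)}
    for subset in combinations(range(n_nodes), k):
        # The subset is a clique iff every pair of its nodes is connected.
        if all((min(a, b), max(a, b)) in edges for a, b in combinations(subset, 2)):
            for n in subset:
                targets[n] = 1
    return targets

targets = clique_targets(20, graph)   # `graph` from the dataset sketch above
```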
Table I shows the accuracies⁴ achieved on this problem by a set of GNNs obtained by varying the number of hidden neurons of the feedforward networks. For the sake of simplicity, all the feedforward networks involved in a GNN contained the same number of hidden neurons.

⁴ Accuracy is defined as the ratio between the correct results and the total number of patterns. A zero threshold was used to decide whether the output of the GNN for a certain node is positive or negative.
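Per this definition, the accuracy figure can be computed from the GNN outputs with a zero threshold, e.g. (Python/NumPy):

```python
# Accuracy as in footnote 4: an output is counted as correct when its sign
# (zero threshold) matches the +1/-1 target of the node.
import numpy as np

def accuracy(outputs, targets):
    outputs, targets = np.asarray(outputs), np.asarray(targets)
    predictions = np.where(outputs > 0.0, 1, -1)
    return float((predictions == targets).mean())

print(accuracy([0.3, -0.2, 0.7, -0.9], [1, -1, -1, -1]))   # 0.75
```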
TABLE I
RESULTS ON THE CLIQUE PROBLEM

Model    Hidden   Accuracy (Test)   Accuracy (Train)   Time (Test)   Time (Train)
neural    2        83.73%            83.45%             14.2s         36m 21s
neural    5        86.95%            86.60%             20.0s         52m 18s
neural   10        90.74%            90.33%             31.3s         1h 15m 52s
neural   20        90.20%            89.72%             50.3s         1h 34m 03s
neural   30        90.32%            89.82%             1m 10s        2h 11m 53s
linear    2        81.25%            86.90%             2.6s          53m 21s
linear    5        79.87%            86.90%             2.8s          1h 01m 20s
linear   10        82.92%            87.54%             3.1s          48m 54s
linear   20        80.90%            88.51%             3.6s          1h 00m 33s
linear   30        77.79%            84.94%             4.1s          1h 06m 21s

The clique problem is a difficult test for GNNs. In fact, GNNs are based on a local computation framework, where the computing activity is localized on the nodes of the graph (see Eq. (1)). On the other hand, the detection of a clique requires knowledge of the properties of all the nodes involved in the clique. Nevertheless, the results of Table I confirm that GNNs can learn to solve this problem.

Notice that Table I compares the accuracy achieved on the test set with the accuracy on the train set. The results are very close, particularly for the neural model, which shows that the GNN model does not suffer from generalization problems in this experiment. It is also observed that, for the neural GNN, the number of hidden neurons has a clear influence on the results: a larger number of hidden neurons corresponds to a better averaged accuracy. On the other hand, a clear relationship between the number of hidden neurons and the accuracy is not evident for the linear model.

Finally, Table I displays the time spent by the training and the testing procedures. It is worth mentioning that the computational cost of each learning epoch may depend on the particular train dataset. For example, the number of iterations of system (7) needed to reach the fixed point depends on the initial state x(0) (see Section II-C). For this reason, in some cases, even if the neural networks involved in the learning procedure are larger, the computation time may be smaller, e.g., for the linear model with 5 and 10 hidden neurons.

In [11], it is stated that the performance of recursive neural networks may be improved if the labels of the graphs are extended with random vectors.
Intuitively, the reason why this approach works is that the random vectors are a sort of identifiers that allow the RNNs to distinguish among the nodes. In practice, the method may or may not work, since the random vectors also inject noise into the dataset, making the learning more difficult. In order to investigate whether such a result holds also for GNNs, we added random integer labels between 0 and 8 to the graphs of the previous dataset and ran the experiments again. Table II seems to confirm that the result holds also for GNNs, since in most cases GNNs on graphs with random labels outperform GNNs on graphs with no labels.

TABLE II
RESULTS ON THE CLIQUE PROBLEM WITH RANDOM LABELS

TABLE III
RESULTS ON THE NEIGHBORS PROBLEM

Model    Hidden   Test (er < 0.05)   Test (er < 0.1)   Training time
neural    2        73.64%             77.40%            47m 28s
neural    5        89.56%             89.76%            1h 06m 20s
neural   10        90.64%             91.44%            1h 21m 00s
neural   20        99.04%             99.72%            2h 23m 27s
neural   30        88.48%             89.48%            2h 33m 03s
linear    2        72.48%             77.24%            58m 45s
linear    5        89.60%             89.84%            46m 38s
linear   10        99.44%             99.72%            42m 57s
linear   20        98.92%             99.68%            42m 53s
linear   30        99.16%             99.68%            49m 58s
C. General problems

1) The Subgraph Matching problem: The subgraph matching problem consists of identifying the presence of a subgraph S in a larger graph G. Such a problem has a number of applications, including object localization and the detection of active parts in chemical compounds. Machine learning techniques are useful for this problem when the subgraph is not known in advance and is available only from a set of examples, or when the graphs are corrupted by noise.

In our tests, we used 600 connected random graphs, equally divided into the train, the validation, and the test set. A smaller subgraph S, which was randomly generated in each trial, was inserted into every graph of the dataset. The nodes had integer labels in the range [0, 10], and a small normal noise, with zero mean and a standard deviation of 0.25, was added to all the labels. The goal consisted of predicting whether n is a node of the subgraph S, i.e., τ(G, n) = 1 if n belongs to S, and τ(G, n) = −1 otherwise.
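The dataset for this experiment can be sketched as follows (Python), reusing random_connected_graph from the earlier sketch. How the subgraph S is generated and attached to the host graph is not specified in the paper, so the merging strategy below (relabel the nodes of S and connect it to the host with one random bridging edge) is an assumption.

```python
# Sketch of the Subgraph Matching dataset: a random subgraph S is inserted into
# a larger random graph, node labels are integers in [0, 10] plus N(0, 0.25)
# noise, and targets are +1 for the nodes of S, -1 otherwise.
import random
import numpy as np

def make_instance(n_host, n_sub, delta, rng=random):
    sub = random_connected_graph(n_sub, delta, rng)           # subgraph S
    host = random_connected_graph(n_host, delta, rng)         # host graph
    # Insert S: its nodes get ids n_host .. n_host + n_sub - 1.
    edges = set(host) | {(i + n_host, j + n_host) for i, j in sub}
    bridge = (rng.randrange(n_host), n_host + rng.randrange(n_sub))
    edges.add(bridge)                                          # assumed bridging edge
    labels = {n: np.array([np.random.randint(0, 11) + np.random.normal(0.0, 0.25)])
              for n in range(n_host + n_sub)}
    targets = {n: (1 if n >= n_host else -1) for n in range(n_host + n_sub)}
    return edges, labels, targets

edges, labels, targets = make_instance(n_host=12, n_sub=6, delta=0.2)
```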
In all the experiments, the state dimension was s = 5 and all the neural networks involved in the GNNs had 5 hidden neurons. Table VI shows the results for several dimensions of S and G. In order to evaluate the relative importance of the labels and of the connectivity in the localization of the subgraph, a feedforward neural network (FNN) with 20 hidden neurons was also applied to this test. The FNN tries to solve the problem using only the label ln of the node. Table VI shows that the GNNs outperform the FNN, confirming that the GNNs also used the graph topology to find S.

IV. CONCLUSIONS

A new neural model, called graph neural network (GNN), was presented. GNNs extend recursive neural networks, since they can process a larger class of graphs and can be used on node focused problems. Some preliminary experimental results confirmed that the model is very promising. The experimentation of the approach on larger applications is a matter of future research. From a theoretical point of view, it is also interesting to study the case when the input graph is not predefined but changes during the learning procedure.

REFERENCES

[1] T. Schmitt and C. Goller, "Relating chemical structure to activity: An application of the neural folding architecture," in Workshop on Fuzzy-Neuro Systems '98 and Conference on Engineering Applications of Neural Networks, EANN '98, 1998.
[2] E. Francesconi, P. Frasconi, M. Gori, S. Marinai, J. Sheng, G. Soda, and A. Sperduti, "Logo recognition by recursive neural networks," in Lecture Notes in Computer Science — Graphics Recognition, K. Tombre and A. K. Chhabra, Eds. Springer, 1997, GREC'97 Proceedings.
[3] P. Frasconi, M. Gori, and A. Sperduti, "A general framework for adaptive processing of data structures," IEEE Transactions on Neural Networks, vol. 9, no. 5, pp. 768-786, September 1998.
[4] A. Sperduti and A. Starita, "Supervised neural networks for the classification of structures," IEEE Transactions on Neural Networks, vol. 8, pp. 429-459, 1997.
[5] F. Scarselli, A. C. Tsoi, M. Gori, and M. Hagenbuchner, "A new neural network model for graph processing," Department of Information Engineering, University of Siena, Tech. Rep. DII 01/05, 2005.
[6] M. A. Khamsi, An Introduction to Metric Spaces and Fixed Point Theory. John Wiley & Sons, 2001.
[7] M. Gori, M. Maggini, and L. Sarti, "A recursive neural network model for processing directed acyclic graphs with labeled edges," in Proceedings of the International Joint Conference on Neural Networks, Portland (USA), July 2003, pp. 1351-1355.
[8] M. Bianchini, P. Mazzoni, L. Sarti, and F. Scarselli, "Face spotting in color images using recursive neural networks," in Proceedings of the 1st ANNPR Workshop, Florence (Italy), Sept. 2003.
[9] L. Almeida, "A learning rule for asynchronous perceptrons with feedback in a combinatorial environment," in IEEE International Conference on Neural Networks, M. Caudill and C. Butler, Eds., vol. 2. San Diego, CA: IEEE, 1987, pp. 609-618.
[10] F. Pineda, "Generalization of back-propagation to recurrent neural networks," Physical Review Letters, vol. 59, pp. 2229-2232, 1987.
[11] M. Bianchini, M. Gori, and F. Scarselli, "Recursive processing of cyclic graphs," in Proceedings of the IEEE International Conference on Neural Networks, Washington, DC, USA, May 2002, pp. 154-159.