
Hierarchical Graph Representation Learning with Differentiable Pooling

Rex Ying, Stanford University
Jiaxuan You, Stanford University
Christopher Morris, TU Dortmund University
Xiang Ren, University of Southern California
William L. Hamilton, Stanford University
Jure Leskovec, Stanford University

Abstract
Recently, graph neural networks (GNNs) have revolutionized the field of graph
representation learning through effectively learned node embeddings, and achieved
state-of-the-art results in tasks such as node classification and link prediction.
However, current GNN methods are inherently flat and do not learn hierarchical
representations of graphs—a limitation that is especially problematic for the task
of graph classification, where the goal is to predict the label associated with an
entire graph. Here we propose DiffPool, a differentiable graph pooling module
that can generate hierarchical representations of graphs and can be combined with
various graph neural network architectures in an end-to-end fashion. DiffPool
learns a differentiable soft cluster assignment for nodes at each layer of a deep
GNN, mapping nodes to a set of clusters, which then form the coarsened input
for the next GNN layer. Our experimental results show that combining existing
GNN methods with DiffPool yields an average improvement of 5–10% accuracy
on graph classification benchmarks, compared to all existing pooling approaches,
achieving a new state of the art on four out of five benchmark data sets.

1 Introduction
In recent years there has been a surge of interest in developing graph neural networks (GNNs)—
general deep learning architectures that can operate over graph structured data, such as social network
data ??? or graph-based representations of molecules ???. The general approach with GNNs is to
view the underlying graph as a computation graph and learn neural network primitives that generate
individual node embeddings by passing, transforming, and aggregating node feature information
across the graph ??. The generated node embeddings can then be used as input to any differentiable
prediction layer, e.g., for node classification ? or link prediction ?, and the whole model can be
trained in an end-to-end fashion.
However, a major limitation of current GNN architectures is that they are inherently flat as they
only propagate information across the edges of the graph and are unable to infer and aggregate the
information in a hierarchical way. For example, in order to successfully encode the graph structure
of organic molecules, one would ideally want to encode the local molecular structure (e.g., individual

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
[Figure 1: Original network → Pooled network at level 1 → Pooled network at level 2 → Pooled network at level 3 → Graph classification]

Figure 1: High-level illustration of our proposed method DiffPool. At each hierarchical layer, we
run a GNN model to obtain embeddings of nodes. We then use these learned embeddings to cluster
nodes together and run another GNN layer on this coarsened graph. This whole process is repeated
for L layers and we use the final output representation to classify the graph.

atoms and their direct bonds) as well as the coarse-grained structure of the molecular graph (e.g.,
groups of atoms and bonds representing functional units in a molecule). This lack of hierarchical
structure is especially problematic for the task of graph classification, where the goal is to predict
the label associated with an entire graph. When applying GNNs to graph classification, the standard
approach is to generate embeddings for all the nodes in the graph and then to globally pool all these
node embeddings together, e.g., using a simple summation or neural network that operates over sets
????. This global pooling approach ignores any hierarchical structure that might be present in the
graph, and it prevents researchers from building effective GNN models for predictive tasks over entire
graphs.
Here we propose DiffPool, a differentiable graph pooling module that can be adapted to various
graph neural network architectures in a hierarchical and end-to-end fashion (Figure 1). DiffPool
allows for developing deeper GNN models that can learn to operate on hierarchical representations of
a graph. We develop a graph analogue of the spatial pooling operation in CNNs ?, which allows
deep CNN architectures to iteratively operate on coarser and coarser representations of an image. The
challenge in the GNN setting, compared to standard CNNs, is that graphs contain no natural notion
of spatial locality, i.e., one cannot simply pool together all nodes in an "m × m patch" of a graph,
because the complex topological structure of graphs precludes any straightforward, deterministic
definition of a "patch". Moreover, unlike image data, graph data sets often contain graphs with
varying numbers of nodes and edges, which makes defining a general graph pooling operator even
more challenging.
In order to solve the above challenges, we require a model that learns how to cluster nodes together
to build a hierarchical multi-layer scaffold on top of the underlying graph. Our approach, DiffPool,
learns a differentiable soft assignment at each layer of a deep GNN, mapping nodes to a set of clusters
based on their learned embeddings. In this framework, we generate deep GNNs by "stacking" GNN
layers in a hierarchical fashion (Figure 1): the input nodes at the layer-l GNN module correspond
to the clusters learned at the layer l − 1 GNN module. Thus, each layer of DiffPool coarsens
the input graph more and more, and DiffPool is able to generate a hierarchical representation
of any input graph after training. We show that DiffPool can be combined with various GNN
approaches, resulting in an average 7% gain in accuracy and a new state of the art on four out of
five benchmark graph classification tasks. Finally, we show that DiffPool can learn interpretable
hierarchical clusters that correspond to well-defined communities in the input graphs.

2 Related Work

Our work builds upon a rich line of recent research on graph neural networks and graph classification.
General graph neural networks. A wide variety of graph neural network (GNN) models have been
proposed in recent years, including methods inspired by convolutional neural networks ????????,
recurrent neural networks ?, recursive neural networks ?? and loopy belief propagation ?. Most
of these approaches fit within the framework of “neural message passing” proposed by Gilmer
et al. ?. In the message passing framework, a GNN is viewed as a message passing algorithm

where node representations are iteratively computed from the features of their neighbor nodes using
a differentiable aggregation function. Hamilton et al. ? provide a conceptual review of recent
advancements in this area, and Bronstein et al. ? outline connections to spectral graph convolutions.
Graph classification with graph neural networks. GNNs have been applied to a wide variety of
tasks, including node classification ??, link prediction ?, graph classification ???, and chemoinformatics ?????.
In the context of graph classification, the task that we study here, a major challenge
in applying GNNs is going from node embeddings, which are the output of GNNs, to a representation
of the entire graph. Common approaches to this problem include simply summing up or averaging all
the node embeddings in a final layer ?, introducing a “virtual node” that is connected to all the nodes
in the graph ?, or aggregating the node embeddings using a deep learning architecture that operates
over sets ?. However, all of these methods have the limitation that they do not learn hierarchical
representations (i.e., all the node embeddings are globally pooled together in a single layer), and
thus are unable to capture the natural structures of many real-world graphs. Some recent approaches
have also proposed applying CNN architectures to the concatenation of all the node embeddings ??,
but this requires specifying (or learning) a canonical ordering over nodes, which is in general very
difficult and equivalent to solving graph isomorphism.
Lastly, there are some recent works that learn hierarchical graph representations by combining GNNs
with deterministic graph clustering algorithms ???, following a two-stage approach. However, unlike
these previous approaches, we seek to learn the hierarchical structure in an end-to-end fashion, rather
than relying on a deterministic graph clustering subroutine.

3 Proposed Method
The key idea of DiffPool is that it enables the construction of deep, multi-layer GNN models by
providing a differentiable module to hierarchically pool graph nodes. In this section, we outline the
DiffPool module and show how it is applied in a deep GNN architecture.

3.1 Preliminaries

We represent a graph G as (A, F), where A ∈ {0, 1}^{n×n} is the adjacency matrix and F ∈ R^{n×d}
is the node feature matrix, assuming each node has d features.¹ Given a set of labeled graphs
D = {(G_1, y_1), (G_2, y_2), ...}, where y_i ∈ Y is the label corresponding to graph G_i ∈ G, the goal
of graph classification is to learn a mapping f : G → Y that maps graphs to the set of labels. The
challenge, compared to the standard supervised machine learning setup, is that we need a way to
extract useful feature vectors from these input graphs. That is, in order to apply standard machine
learning methods for classification, e.g., neural networks, we need a procedure to convert each graph
to a finite-dimensional vector in R^D.
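To make the setup concrete, below is a minimal sketch of this data representation; the LabeledGraph container and its field names are illustrative additions, not part of the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LabeledGraph:
    """A graph G = (A, F) with a classification label y, as defined above."""
    A: np.ndarray  # adjacency matrix, shape (n, n), entries in {0, 1}
    F: np.ndarray  # node feature matrix, shape (n, d)
    y: int         # label from the label set Y

# A labeled data set D = {(G_1, y_1), (G_2, y_2), ...} is then simply a list of
# LabeledGraph instances, and graph classification learns a mapping f that
# assigns each graph its label.
```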
Graph neural networks. In this work, we build upon graph neural networks in order to learn useful
representations for graph classification in an end-to-end fashion. In particular, we consider GNNs
that employ the following general “message-passing” architecture:
    H^{(k)} = M(A, H^{(k−1)}; θ^{(k)}),    (1)

where H^{(k)} ∈ R^{n×d} are the node embeddings (i.e., "messages") computed after k steps of the
GNN and M is the message propagation function, which depends on the adjacency matrix, trainable
parameters θ^{(k)}, and the node embeddings H^{(k−1)} generated from the previous message-passing
step.² The input node embeddings H^{(0)} at the initial message-passing iteration (k = 1) are initialized
using the node features of the graph, H^{(0)} = F.
There are many possible implementations of the propagation function M ??. For example, one
popular variant of GNNs, the Graph Convolutional Network (GCN) of Kipf et al. ?, implements M
using a combination of linear transformations and ReLU non-linearities:

    H^{(k)} = M(A, H^{(k−1)}; W^{(k)}) = ReLU(D̃^{−1/2} Ã D̃^{−1/2} H^{(k−1)} W^{(k)}),    (2)

where Ã = A + I, D̃_{ii} = Σ_j Ã_{ij}, and W^{(k)} ∈ R^{d×d} is a trainable weight matrix. The differentiable
pooling model we propose can be applied to any GNN model implementing Equation (1), and is
agnostic with regard to the specifics of how M is implemented.

¹ We do not consider edge features, although one can easily extend the algorithm to support edge features using techniques introduced in ?.
² For notational convenience, we assume that the embedding dimension is d for all H^{(k)}; however, in general this restriction is not necessary.
A full GNN module will run K iterations of Equation (1) to generate the final output node embeddings
Z = H^{(K)} ∈ R^{n×d}, where K is usually in the range 2–6. For simplicity, in the following
sections we will abstract away the internal structure of the GNNs and use Z = GNN(A, X) to
denote an arbitrary GNN module implementing K iterations of message passing according to some
adjacency matrix A and initial input node features X.
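As a concrete reference point, the sketch below implements the GCN-style propagation of Equation (2) and stacks K such steps into Z = GNN(A, X) using plain NumPy. It is a deliberately minimal illustration, not the implementation used in the experiments: it assumes dense adjacency matrices and omits biases, normalization, and training logic.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One propagation step in the style of Equation (2):
    ReLU(D̃^{-1/2} Ã D̃^{-1/2} H W) with Ã = A + I."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ H @ W, 0.0)

def gnn(A, X, weights):
    """Z = GNN(A, X): K stacked propagation steps, with K = len(weights)."""
    H = X
    for W in weights:
        H = gcn_layer(A, H, W)
    return H
```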
Stacking GNNs and pooling layers. GNNs implementing Equation (1) are inherently flat, as they
only propagate information across the edges of a graph. The goal of this work is to define a general,
end-to-end differentiable strategy that allows one to stack multiple GNN modules in a hierarchical
fashion. Formally, given Z = GNN(A, X), the output of a GNN module, and a graph adjacency
matrix A ∈ R^{n×n}, we seek to define a strategy to output a new coarsened graph containing m < n
nodes, with weighted adjacency matrix A′ ∈ R^{m×m} and node embeddings Z′ ∈ R^{m×d}. This new
coarsened graph can then be used as input to another GNN layer, and this whole process can be
repeated L times, generating a model with L GNN layers that operate on a series of coarser and
coarser versions of the input graph (Figure 1). Thus, our goal is to learn how to cluster or pool
together nodes using the output of a GNN, so that we can use this coarsened graph as input to another
GNN layer. What makes designing such a pooling layer for GNNs especially challenging, compared
to the usual graph coarsening task, is that our goal is not simply to cluster the nodes in one graph,
but to provide a general recipe to hierarchically pool nodes across a broad set of input graphs. That is,
we need our model to learn a pooling strategy that will generalize across graphs with different
numbers of nodes and edges, and that can adapt to the various graph structures during inference.

3.2 Differentiable Pooling via Learned Assignments

Our proposed approach, DiffPool, addresses the above challenges by learning a cluster assignment
matrix over the nodes using the output of a GNN model. The key intuition is that we stack L GNN
modules and learn to assign nodes to clusters at layer l in an end-to-end fashion, using embeddings
generated from a GNN at layer l − 1. Thus, we are using GNNs both to extract node embeddings that
are useful for graph classification and to extract node embeddings that are useful for hierarchical
pooling. Using this construction, the GNNs in DiffPool learn to encode a general pooling strategy
that is useful for a large set of training graphs. We first describe how the DiffPool module pools
nodes at each layer given an assignment matrix; following this, we discuss how we generate the
assignment matrix using a GNN architecture.
Pooling with an assignment matrix. We denote the learned cluster assignment matrix at layer l as
S^{(l)} ∈ R^{n_l×n_{l+1}}. Each row of S^{(l)} corresponds to one of the n_l nodes (or clusters) at layer l, and
each column of S^{(l)} corresponds to one of the n_{l+1} clusters at the next layer l + 1. Intuitively, S^{(l)}
provides a soft assignment of each node at layer l to a cluster in the next coarsened layer l + 1.
Suppose that S^{(l)} has already been computed, i.e., that we have computed the assignment matrix at
the l-th layer of our model. We denote the input adjacency matrix at this layer as A^{(l)} and denote
the input node embedding matrix at this layer as Z^{(l)}. Given these inputs, the DiffPool layer
(A^{(l+1)}, X^{(l+1)}) = DiffPool(A^{(l)}, Z^{(l)}) coarsens the input graph, generating a new coarsened
adjacency matrix A^{(l+1)} and a new matrix of embeddings X^{(l+1)} for each of the nodes/clusters in
this coarsened graph. In particular, we apply the two following equations:

    X^{(l+1)} = S^{(l)T} Z^{(l)} ∈ R^{n_{l+1}×d},    (3)
    A^{(l+1)} = S^{(l)T} A^{(l)} S^{(l)} ∈ R^{n_{l+1}×n_{l+1}}.    (4)
Equation (3) takes the node embeddings Z^{(l)} and aggregates these embeddings according to the
cluster assignments S^{(l)}, generating embeddings for each of the n_{l+1} clusters. Similarly, Equation
(4) takes the adjacency matrix A^{(l)} and generates a coarsened adjacency matrix denoting the
connectivity strength between each pair of clusters.
Through Equations (3) and (4), the DiffPool layer coarsens the graph: the next-layer adjacency
matrix A^{(l+1)} represents a coarsened graph with n_{l+1} nodes or cluster nodes, where each individual
cluster node in the new coarsened graph corresponds to a cluster of nodes in the graph at layer l.
Note that A^{(l+1)} is a real matrix and represents a fully connected edge-weighted graph; each entry
A^{(l+1)}_{ij} can be viewed as the connectivity strength between cluster i and cluster j. Similarly, the i-th
row of X^{(l+1)} corresponds to the embedding of cluster i. Together, the coarsened adjacency matrix
A^{(l+1)} and cluster embeddings X^{(l+1)} can be used as input to another GNN layer, a process which
we describe in detail below.
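The coarsening step itself is just two matrix products. The following sketch applies Equations (3) and (4) directly; it assumes dense NumPy arrays and a row-stochastic assignment matrix S, and is meant only to make the shapes explicit.

```python
import numpy as np

def diffpool_coarsen(S, Z, A):
    """Coarsen a graph with an assignment matrix S of shape (n_l, n_{l+1})."""
    X_next = S.T @ Z      # Equation (3): cluster embeddings, shape (n_{l+1}, d)
    A_next = S.T @ A @ S  # Equation (4): weighted cluster adjacency, shape (n_{l+1}, n_{l+1})
    return X_next, A_next
```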
Learning the assignment matrix. In the following we describe the architecture of DiffPool, i.e.,
how DiffPool generates the assignment matrix S^{(l)} and embedding matrices Z^{(l)} that are used in
Equations (3) and (4). We generate these two matrices using two separate GNNs that are both
applied to the input cluster node features X^{(l)} and coarsened adjacency matrix A^{(l)}. The embedding
GNN at layer l is a standard GNN module applied to these inputs:

    Z^{(l)} = GNN_{l,embed}(A^{(l)}, X^{(l)}),    (5)

i.e., we take the adjacency matrix between the cluster nodes at layer l (from Equation (4)) and the
pooled features for the clusters (from Equation (3)) and pass these matrices through a standard GNN
to get new embeddings Z^{(l)} for the cluster nodes. In contrast, the pooling GNN at layer l uses the
input cluster features X^{(l)} and cluster adjacency matrix A^{(l)} to generate an assignment matrix:

    S^{(l)} = softmax(GNN_{l,pool}(A^{(l)}, X^{(l)})),    (6)

where the softmax function is applied in a row-wise fashion. The output dimension of GNN_{l,pool}
corresponds to a pre-defined maximum number of clusters in layer l, and is a hyperparameter of the
model.
Note that these two GNNs consume the same input data but have distinct parameterizations and play
separate roles: the embedding GNN generates new embeddings for the input nodes at this layer,
while the pooling GNN generates a probabilistic assignment of the input nodes to n_{l+1} clusters.
In the base case, the inputs to Equations (5) and (6) at layer l = 0 are simply the input
adjacency matrix A and the node features F of the original graph. At the penultimate layer L − 1 of
a deep GNN model using DiffPool, we set the assignment matrix S^{(L−1)} to be a vector of 1's, i.e.,
all nodes at the final layer L are assigned to a single cluster, generating a final embedding vector
corresponding to the entire graph. This final output embedding can then be used as feature input to a
differentiable classifier (e.g., a softmax layer), and the entire system can be trained end-to-end using
stochastic gradient descent.
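Putting Equations (3)–(6) together, a single DiffPool layer can be sketched as below. The two GNNs are passed in as callables; in the toy usage they are simple linear maps standing in for the GraphSAGE modules used in the paper, so this is an illustration of the data flow rather than the authors' implementation.

```python
import numpy as np

def softmax_rows(M):
    e = np.exp(M - M.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def diffpool_layer(A, X, gnn_embed, gnn_pool):
    """One DiffPool layer. `gnn_embed` and `gnn_pool` map (A, X) to node-level
    matrices; the column count of `gnn_pool`'s output fixes the cluster count."""
    Z = gnn_embed(A, X)               # Equation (5)
    S = softmax_rows(gnn_pool(A, X))  # Equation (6), row-wise softmax
    X_next = S.T @ Z                  # Equation (3)
    A_next = S.T @ A @ S              # Equation (4)
    return A_next, X_next, S

# Toy usage with linear maps standing in for the embedding and pooling GNNs.
rng = np.random.default_rng(0)
n, d, n_clusters = 8, 5, 3
A = rng.integers(0, 2, (n, n)).astype(float)
A = np.triu(A, 1); A = A + A.T                      # symmetric, no self-loops
X = rng.random((n, d))
W_embed, W_pool = rng.random((d, d)), rng.random((d, n_clusters))
A1, X1, S = diffpool_layer(A, X,
                           lambda A, X: A @ X @ W_embed,
                           lambda A, X: A @ X @ W_pool)
# At the final layer, S is a single column of ones, so the whole graph collapses
# into one embedding vector that is fed to the downstream classifier.
```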
Permutation invariance. Note that in order to be useful for graph classification, the pooling layer
should be invariant under node permutations. For DiffPool we get the following positive result,
which shows that any deep GNN model based on DiffPool is permutation invariant, as long as the
component GNNs are permutation invariant.
Proposition 1. Let P ∈ {0, 1}^{n×n} be any permutation matrix; then DiffPool(A, Z) =
DiffPool(PAP^T, PX) as long as GNN(A, X) = GNN(PAP^T, X) (i.e., as long as the GNN
method used is permutation invariant).

Proof. Equations (5) and (6) are permutation invariant by the assumption that the GNN module
is permutation invariant. Since any permutation matrix is orthogonal, applying P^T P = I to
Equations (3) and (4) finishes the proof.
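A quick numerical check of this argument, under the stated assumption that the GNN is permutation-equivariant so that permuting the input simply permutes the rows of Z and S, can be written as follows; it is a sanity check, not a substitute for the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_next, d = 6, 3, 4

A = rng.random((n, n)); A = (A + A.T) / 2                  # symmetric adjacency
Z = rng.random((n, d))                                     # node embeddings
S = rng.random((n, n_next)); S /= S.sum(1, keepdims=True)  # row-stochastic assignments
P = np.eye(n)[rng.permutation(n)]                          # random permutation matrix

# DiffPool outputs on the original node ordering (Equations 3 and 4).
X_next, A_next = S.T @ Z, S.T @ A @ S

# Permuting the input permutes the rows of Z and S identically; since P^T P = I,
# the pooled outputs are unchanged.
X_perm = (P @ S).T @ (P @ Z)
A_perm = (P @ S).T @ (P @ A @ P.T) @ (P @ S)
assert np.allclose(X_next, X_perm) and np.allclose(A_next, A_perm)
```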

3.3 Auxiliary Link Prediction Objective and Entropy Regularization

In practice, it can be difficult to train the pooling GNN (Equation (6)) using only the gradient signal from
the graph classification task. Intuitively, we have a non-convex optimization problem, and it can be
difficult to push the pooling GNN away from spurious local minima early in training. To alleviate
this issue, we train the pooling GNN with an auxiliary link prediction objective, which encodes the
intuition that nearby nodes should be pooled together. In particular, at each layer l, we minimize
L_LP = ||A^{(l)} − S^{(l)} S^{(l)T}||_F, where ||·||_F denotes the Frobenius norm. Note that the adjacency matrix
A^{(l)} at deeper layers is a function of lower-level assignment matrices, and changes during training.

Another important characteristic of the pooling GNN (Equation (6)) is that the output cluster assignment
for each node should generally be close to a one-hot vector, so that the membership of each
cluster or subgraph is clearly defined. We therefore regularize the entropy of the cluster assignment
by minimizing L_E = (1/n) Σ_{i=1}^{n} H(S_i), where H denotes the entropy function and S_i is the i-th row
of S.
During training, L_LP and L_E from each layer are added to the classification loss. In practice we
observe that training with these side objectives takes longer to converge, but nevertheless achieves better
performance and more interpretable cluster assignments.
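The two side objectives can be computed per layer as in the sketch below; the dense NumPy formulation and the small epsilon inside the logarithm are illustrative choices, not details specified in the paper.

```python
import numpy as np

def auxiliary_losses(A, S, eps=1e-12):
    """Per-layer side objectives from this section.
    L_LP = ||A - S S^T||_F encourages nearby nodes to share a cluster;
    L_E is the mean row entropy of S, pushing assignments toward one-hot."""
    L_lp = np.linalg.norm(A - S @ S.T, ord="fro")
    row_entropy = -(S * np.log(S + eps)).sum(axis=1)
    L_e = row_entropy.mean()
    return L_lp, L_e
```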

4 Experiments

We evaluate the benefits of DiffPool against a number of state-of-the-art graph classification
approaches, with the goal of answering the following questions:
Q1 How does DiffPool compare to other pooling methods proposed for GNNs (e.g., using sort
pooling ? or the Set2Set method ?)?
Q2 How does DiffPool combined with GNNs compare to the state of the art for the graph classification
task, including both GNNs and kernel-based methods?
Q3 Does DiffPool compute meaningful and interpretable clusters on the input graphs?
Data sets. To probe the ability of DiffPool to learn complex hierarchical structures from graphs in
different domains, we evaluate on a variety of relatively large graph data sets chosen from benchmarks
commonly used in graph classification ?. We use the protein data sets ENZYMES, PROTEINS ?? and
D&D ?, the social network data set REDDIT-MULTI-12K ?, and the scientific collaboration data set
COLLAB ?. See Appendix A for statistics and properties. For all these data sets, we perform 10-fold
cross-validation to evaluate model performance, and report the accuracy averaged over the 10 folds.
Model configurations. In our experiments, the GNN model used for DiffPool is built on top of
the GraphSAGE architecture, as we found this architecture to have superior performance compared
to the standard GCN approach introduced in ?. We use the "mean" variant of GraphSAGE ?
and apply a DiffPool layer after every two GraphSAGE layers in our architecture. A total of 2
DiffPool layers are used for the data sets. For small data sets such as ENZYMES, PROTEINS and
COLLAB, 1 DiffPool layer can achieve similar performance. After each DiffPool layer, 3 layers
of graph convolutions are performed before the next DiffPool layer or the readout layer. The
embedding matrix and the assignment matrix are computed by two separate GraphSAGE models,
respectively. In the 2-DiffPool-layer architecture, the number of clusters is set to 25% of the number
of nodes before applying DiffPool, while in the 1-DiffPool-layer architecture it is set to 10% (a small
sketch of this cluster-size schedule follows the list below). Batch normalization ? is applied after every
layer of GraphSAGE. We also found that adding an ℓ2 normalization to the node embeddings at each
layer made training more stable. In Section 4.2, we also test an analogous variant of DiffPool on the
Structure2Vec ? architecture, in order to demonstrate how DiffPool can be applied on top of other
GNN models. All models are trained for 3,000 epochs with early stopping applied when the validation
loss starts to drop. We also evaluate two simplified versions of DiffPool:
• DiffPool-Det is a DiffPool model where assignment matrices are generated using a deterministic
graph clustering algorithm ?.
• DiffPool-NoLP is a variant of DiffPool where the link prediction side objective is turned off.
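As promised above, the cluster-size settings can be made concrete with a small sketch; the max_nodes value and the use of ceiling rounding here are illustrative assumptions, not numbers taken from the paper.

```python
import math

def cluster_schedule(max_nodes, num_diffpool_layers, ratio=0.25):
    """Pre-defined cluster counts ahead of each DiffPool layer: 25% of the
    current node count for the 2-layer setup (10% for the 1-layer variant)."""
    sizes, n = [], max_nodes
    for _ in range(num_diffpool_layers):
        n = max(1, math.ceil(ratio * n))
        sizes.append(n)
    return sizes

print(cluster_schedule(max_nodes=500, num_diffpool_layers=2))               # [125, 32]
print(cluster_schedule(max_nodes=500, num_diffpool_layers=1, ratio=0.10))   # [50]
```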

4.1 Baseline Methods

In the performance comparison on graph classification, we consider baselines based upon GNNs
(combined with different pooling methods) as well as state-of-the-art kernel-based approaches.
GNN-based methods.
• GraphSAGE with global mean-pooling ?. Other GNN variants, such as those proposed in ?, are
omitted, as GraphSAGE empirically obtained higher performance on this task.
• Structure2Vec (S2V) ? is a state-of-the-art graph representation learning algorithm that
combines a latent variable model with GNNs. It uses global mean pooling.
• Edge-conditioned filters in CNNs for graphs (ECC) ? incorporates edge information into the GCN
model and performs pooling using a graph coarsening algorithm.

Table 1: Classification accuracies in percent. The far-right column gives the relative increase in
accuracy compared to the baseline GraphSAGE approach.

          Method          ENZYMES   D&D     REDDIT-MULTI-12K   COLLAB   PROTEINS   Gain
  Kernel  Graphlet         41.03    74.85        21.73          64.66     72.91
          Shortest-Path    42.32    78.86        36.93          59.10     76.43
          1-WL             53.43    74.02        39.03          78.61     73.76
          WL-OA            60.13    79.04        44.38          80.74     75.26
  GNN     PatchySAN          –      76.27        41.32          72.60     75.00     4.17
          GraphSAGE        54.25    75.42        42.24          68.25     70.48      –
          ECC              53.50    74.10        41.73          67.79     72.65     0.11
          Set2Set          60.15    78.12        43.49          71.75     74.29     3.32
          SortPool         57.12    79.37        41.82          73.76     75.54     3.39
          DiffPool-Det     58.33    75.47        46.18          82.13     75.62     5.42
          DiffPool-NoLP    61.95    79.98        46.65          75.58     76.22     5.95
          DiffPool         62.53    80.64        47.08          75.48     76.25     6.27

• PatchySAN ? defines a receptive field (neighborhood) for each node and, using a canonical
node ordering, applies convolutions on linear sequences of node embeddings.
• Set2Set replaces the global mean-pooling in traditional GNN architectures with the aggregation
used in Set2Set ?. Set2Set aggregation has been shown to perform better than mean
pooling in previous work ?. We use GraphSAGE as the base GNN model.
• SortPool ? applies a GNN architecture and then performs a single layer of soft pooling
followed by a 1D convolution on sorted node embeddings.
For all the GNN baselines, we use the 10-fold cross-validation numbers reported by the original authors
when possible. For the GraphSAGE and Set2Set baselines, we use the same base implementation and
hyperparameter sweeps as in our DiffPool approach. When baseline approaches did not have the
necessary published numbers, we contacted the original authors and used their code (if available) to
run the model, performing a hyperparameter search based on the original authors' guidelines.
Kernel-based algorithms. We use the Graphlet ?, Shortest-Path ?, Weisfeiler-Lehman (WL) ?, and
Weisfeiler-Lehman Optimal Assignment (WL-OA) ? kernels as baselines. For each kernel, we
computed the normalized Gram matrix. We computed the classification accuracies using the C-SVM
implementation of LIBSVM ?, using 10-fold cross-validation. The C parameter was selected from
{10^{-3}, 10^{-2}, ..., 10^{2}, 10^{3}} by 10-fold cross-validation on the training folds. Moreover, for WL and
WL-OA we additionally selected the number of iterations from {0, ..., 5}.
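For reference, the kernel evaluation protocol can be sketched as below using scikit-learn's libsvm-backed SVC in place of the LIBSVM binary; the nested 10-fold selection of C follows the description above, while the random seeds, the fold splitter, and the input names G and y are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Assumed inputs (illustrative): G is the (n, n) normalized Gram matrix of a
# graph kernel, y is the (n,) array of graph labels, both NumPy arrays.
def kernel_cv_accuracy(G, y, C_grid=(1e-3, 1e-2, 1e-1, 1, 10, 100, 1000)):
    """10-fold CV accuracy of a C-SVM on a precomputed Gram matrix, with C
    selected by a nested 10-fold CV on the training folds."""
    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = []
    for tr, te in outer.split(G, y):
        best_C, best_acc = None, -1.0
        inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
        for C in C_grid:
            accs = []
            for itr, iva in inner.split(G[np.ix_(tr, tr)], y[tr]):
                clf = SVC(C=C, kernel="precomputed")
                clf.fit(G[np.ix_(tr[itr], tr[itr])], y[tr][itr])
                accs.append(clf.score(G[np.ix_(tr[iva], tr[itr])], y[tr][iva]))
            if np.mean(accs) > best_acc:
                best_acc, best_C = np.mean(accs), C
        clf = SVC(C=best_C, kernel="precomputed").fit(G[np.ix_(tr, tr)], y[tr])
        scores.append(clf.score(G[np.ix_(te, tr)], y[te]))
    return float(np.mean(scores))
```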

4.2 Results for Graph Classification

Table 1 compares the performance of DiffPool to these state-of-the-art graph classification
baselines. These results provide positive answers to our motivating questions Q1 and Q2: we
observe that our DiffPool approach obtains the highest average performance among all pooling
approaches for GNNs, improves upon the base GraphSAGE architecture by an average of 6.27%,
and achieves state-of-the-art results on 4 out of 5 benchmarks. Interestingly, our simplified model
variant, DiffPool-Det, achieves state-of-the-art performance on the COLLAB benchmark. This
is because many collaboration graphs in COLLAB show only single-layer community structures,
which can be captured well by a pre-computed graph clustering algorithm ?. One observation is that,
despite the significant performance improvement, DiffPool can be unstable to train, and there is
significant variation in accuracy across different runs, even with the same hyperparameter setting.
We observe that adding the link prediction objective makes training more stable and reduces the
standard deviation of accuracy across different runs.
Differentiable Pooling on Structure2Vec. DiffPool can be applied to other GNN architectures
besides GraphSAGE to capture hierarchical structure in graph data. To further support
answering Q1, we also applied DiffPool to Structure2Vec (S2V). We ran experiments using S2V
with a three-layer architecture, as reported in ?. In the first variant, one DiffPool layer is applied after
the first layer of S2V, and two more S2V layers are stacked on top of the output of DiffPool. The
second variant applies one DiffPool layer after the first and second layers of S2V, respectively. In
both variants, the S2V model is used to compute the embedding matrix, while a GraphSAGE model is
used to compute the assignment matrix.

Table 2: Accuracy results of applying DiffPool to S2V.

  Data Set   S2V     S2V with 1 DiffPool   S2V with 2 DiffPool
  ENZYMES    61.10         62.86                 63.33
  D&D        78.92         80.75                 82.07

The results in terms of classification accuracy are summarized in Table 2. We observe that DiffPool
significantly improves the performance of S2V on both the ENZYMES and D&D data sets. Similar
performance trends are also observed on other data sets. The results demonstrate that DiffPool is a
general strategy for pooling over hierarchical structure that can benefit different GNN architectures.
Running time. Although applying DiffPool requires additional computation of an assignment
matrix, we observed that DiffPool did not incur substantial additional running time in practice. This
is because each DiffPool layer reduces the size of the graphs by extracting a coarser representation of
the graph, which speeds up the graph convolution operation in the next layer. Concretely, we found
that GraphSAGE with DiffPool was 12× faster than the GraphSAGE model with Set2Set
pooling, while still achieving significantly higher accuracy on all benchmarks.

4.3 Analysis of Cluster Assignment in DiffPool

Hierarchical cluster structure. To address Q3, we investigated the extent to which DiffPool
learns meaningful node clusters by visualizing the cluster assignments in different layers. Figure 2
shows such a visualization of node assignments in the first and second layers on a graph from the
COLLAB data set, where node color indicates cluster membership. Node cluster membership
is determined by taking the argmax of the node's cluster assignment probabilities. We observe that even
when learning the cluster assignment based solely on the graph classification objective, DiffPool
can still capture the hierarchical community structure. We also observe significant improvement in
membership assignment quality with the link prediction auxiliary objective.
Dense vs. sparse subgraph structure. In addition, we observe that DiffPool learns to collapse
nodes into soft clusters in a non-uniform way, with a tendency to collapse densely connected
subgraphs into clusters. Since GNNs can efficiently perform message passing on dense, clique-like
subgraphs (due to their small diameters) ?, pooling together the nodes of such a dense subgraph is not
likely to lead to any loss of structural information. This intuitively explains why collapsing dense
subgraphs is a useful pooling strategy for DiffPool. In contrast, sparse subgraphs may contain many
interesting structures, including path-, cycle- and tree-like structures, and given the high diameter
induced by sparsity, GNN message passing may fail to capture these structures. Thus, by separately
pooling distinct parts of a sparse subgraph, DiffPool can learn to capture the meaningful structures
present in sparse graph regions (e.g., as in Figure 2).
Assignment for nodes with similar representations. Since the assignment network computes the
soft cluster assignment based on the features of input nodes and their neighbors, nodes with both similar
input features and similar neighborhood structure will have similar cluster assignments. In fact, one can
construct synthetic cases where two nodes, although far apart, have exactly the same neighborhood
structure and the same features for themselves and all their neighbors. In this case the pooling network
is forced to assign them to the same cluster, which is different from the concept of pooling in other
architectures such as image ConvNets. In some cases we do observe that disconnected nodes are pooled together.
In practice we rely on an identifiability assumption similar to Theorem 1 in GraphSAGE ?, where
nodes are identifiable via their features. This holds in many real data sets.³ The auxiliary link
prediction objective is also observed to help discourage nodes that are far apart from being pooled
together. Furthermore, it is possible to use more sophisticated GNN aggregation functions, such as
high-order moments ?, to distinguish nodes that are similar in structure and feature space; the overall
framework remains unchanged.
³ However, some molecular graph data sets in chemistry contain many nodes that are structurally similar, and
the assignment network is observed to pool together nodes that are far apart.

[Figure 2: panel titles "Pooling at Layer 1" and "Pooling at Layer 2"; panels (a), (b), (c).]

Figure 2: Visualization of hierarchical cluster assignment in DiffPool, using example graphs from
COLLAB. The left figure (a) shows hierarchical clustering over two layers, where nodes in the second
layer correspond to clusters in the first layer. (Colors are used to connect the nodes/clusters across the
layers, and dotted lines are used to indicate clusters.) The right two plots (b and c) show two more
example first-layer clusterings in different graphs. Note that although we globally set the number of
clusters to be 25% of the number of nodes, the assignment GNN automatically learns the appropriate
number of meaningful clusters to assign for these different graphs.

Sensitivity to the pre-defined maximum number of clusters. We found that the assignment
varies according to the depth of the network and C, the maximum number of clusters. With larger C,
the pooling GNN can model more complex hierarchical structure. The trade-off is that a very large
C results in more noise and less efficiency. Although the value of C is a pre-defined parameter, the
pooling network learns to use the appropriate number of clusters through end-to-end training. In particular,
some clusters might not be used by the assignment matrix: columns corresponding to unused clusters have
low values for all nodes. This is observed in Figure 2(c), where nodes are assigned predominantly
to 3 clusters.
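In practice, unused clusters can be spotted directly from the assignment matrix; the sketch below flags columns of S whose strongest assignment stays near zero, with the threshold being an arbitrary illustrative value.

```python
import numpy as np

def used_clusters(S, threshold=0.05):
    """Indices of clusters that receive non-negligible assignment mass:
    a column of S whose maximum entry stays near zero is effectively unused."""
    return np.flatnonzero(S.max(axis=0) > threshold)
```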

5 Conclusion
We introduced a differentiable pooling method for GNNs that is able to extract the complex hierarchical
structure of real-world graphs. By using the proposed pooling layer in conjunction with existing
GNN models, we achieved new state-of-the-art results on several graph classification benchmarks.
Interesting future directions include learning hard cluster assignments to further reduce computational
cost in higher layers while also ensuring differentiability, and applying the hierarchical pooling
method to other downstream tasks that require modeling of the entire graph structure.

Acknowledgement
This research has been supported in part by DARPA SIMPLEX, Stanford Data Science Initiative,
Huawei, JD and Chan Zuckerberg Biohub. Christopher Morris is funded by the German Science
Foundation (DFG) within the Collaborative Research Center SFB 876 “Providing Information by
Resource-Constrained Data Analysis”, project A6 “Resource-efficient Graph Mining”. The authors
also thank Marinka Zitnik for help in visualizing the high-level illustration of the proposed methods.
