unsupervised loss with any task-specific supervised loss function to allow truly end-to-end graph classification. Finally, we evaluate the performance of HoscPool on a plethora of graph datasets, and the reliability of its clustering algorithm on a variety of graphs endowed with ground-truth community structure. During this experimental phase, we conduct a deep analysis aimed at understanding why existing pooling methods fail to truly outperform random baselines, and attempt to provide explanations. This is another important contribution, which we hope will help future work.

2 RELATED WORK
Graph pooling. Leaving aside global pooling [2, 39, 47], we distinguish between two main types of hierarchical approaches. Node drop methods [3, 13, 19, 31, 48, 50, 52] use a learnable scoring function based on message passing representations to assess all nodes and drop the ones with the lowest score. The drawback is that we lose information during pooling by dropping certain nodes entirely. On the other hand, clustering approaches cast the pooling problem as a clustering one [10, 23–25, 33, 46, 51]. For instance, StructPool [51] utilizes conditional random fields to learn the cluster assignment matrix; HaarPool [46] uses the compressive Haar transform; EdgePool [10] gradually merges nodes by contracting high-scoring edges. Of particular interest here are two very popular end-to-end clustering methods, namely DiffPool [49] and MinCutPool [5], because of their original and efficient underlying idea. While DiffPool utilises a link prediction objective along with an entropy regularization to learn the cluster assignment matrix, MinCutPool leverages a min-cut score objective along with an orthogonality term. Although there are more pooling operators, we wish to improve this line of methods, which we consider promising and perfectible. In addition to solving existing limitations, we want to introduce the notion of higher-order to pooling for graph classification, which is unexplored yet.

Higher-order connectivity patterns (i.e. motifs – small network subgraphs like triangles) are known to be the fundamental building blocks of complex networks [6, 28]. They are essential for modelling and understanding the organization of various types of networks. For instance, they play an essential role in the characterisation of social, biological or molecular networks [30]. [11] showed that vertices participating in the same higher-order structure often share the same label, spreading its adoption to node classification tasks [20, 22]. Going further, several recent research papers have clearly demonstrated the benefits of leveraging higher-order structure for link prediction [1, 37], explanation generation [32, 36], ranking [34], and clustering [15, 18]. Regarding the latter, [4, 42] argue that domain-specific motifs are a better signature of the community structure than simple edges. Their intuition is that motifs allow us to focus on particular network substructures that are important for networks of a given domain. As a result, they generalized the notion of conductance to triangle conductance (Section 3), which was found highly beneficial by [6, 40].

3 PRELIMINARY KNOWLEDGE
G = (V, E) is a graph with vertex set V and edge set E, characterised by its adjacency matrix A ∈ R^{N×N} and node feature matrix X ∈ R^{N×F}. D = diag(A 1_N) is the degree matrix and L = D − A the Laplacian matrix of G. Ã = D^{−1/2} A D^{−1/2} ∈ R^{N×N} is the symmetrically normalised adjacency matrix, with corresponding D̃, L̃.

3.1 Graph Cut and Normalised Cut
Clustering involves partitioning the vertices of a graph into K disjoint subsets with more intra-connections than inter-connections [43]. One of the most common and effective ways to do it [35] is to solve the Normalised Cut problem [38]:

    \min_{S_1,\dots,S_K} \sum_{k=1}^{K} \frac{\mathrm{cut}(S_k, \bar{S}_k)}{\mathrm{vol}(S_k)},    (1)

where S̄_k = V \ S_k, cut(S_k, S̄_k) = Σ_{i∈S_k, j∈S̄_k} A_{ij}, and vol(S_k) = Σ_{i∈S_k, j∈V} A_{ij}. Unlike the simple min-cut objective, (1) scales each term by the cluster volume, thus enforcing clusters to be "reasonably large" and avoiding degenerate solutions where most nodes are assigned to a single cluster. Although minimising (1) is NP-hard [44], there are approximation algorithms with theoretical guarantees [8] for finding clusters with small conductance, such as Spectral Clustering (SC), which proposes clusters determined based on the eigen-decomposition of the Laplacian matrix. A refresher on SC is provided in [43].
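To make the SC baseline referenced above concrete, the following is a minimal sketch of the classical spectral clustering pipeline ([38, 43]): relax the normalised cut via the eigenvectors of the symmetrically normalised Laplacian and run k-means on the resulting embedding. It illustrates the standard algorithm only, not the method proposed in this paper.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(A: np.ndarray, K: int, seed: int = 0) -> np.ndarray:
    """Classical normalised spectral clustering on a dense adjacency matrix A."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    A_norm = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # D^{-1/2} A D^{-1/2}
    L_sym = np.eye(A.shape[0]) - A_norm                      # symmetrically normalised Laplacian
    _, vecs = eigh(L_sym, subset_by_index=[0, K - 1])        # K smallest eigenpairs
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    emb = vecs / np.clip(norms, 1e-12, None)                 # row-normalised spectral embedding
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(emb)
```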
3.2 Motif conductance
While the Normalised Cut builds on first-order connectivity patterns (i.e. edges), [4, 42] propose to cluster a network based on specific higher-order substructures. Formally, for a graph G, a motif M made of |M| nodes, and ℳ = {v ∈ V^{|M|} | v = M} the set of all instances of M in G, they propose to search for the partition S_1, ..., S_K minimising motif conductance:

    \min_{S_1,\dots,S_K} \sum_{k=1}^{K} \frac{\mathrm{cut}_M^{(G)}(S_k, \bar{S}_k)}{\mathrm{vol}_M^{(G)}(S_k)},    (2)

where cut_M^{(G)}(S_k, S̄_k) = Σ_{v∈ℳ} 1(∃ i, j ∈ v | i ∈ S_k, j ∈ S̄_k), i.e. the number of instances v of M with at least one node in S_k and at least one node in S̄_k; and vol_M^{(G)}(S_k) = Σ_{v∈ℳ} Σ_{i∈v} 1(i ∈ S_k), i.e. the number of motif instance endpoints in S_k.

4 PROPOSED METHOD
The objective of this paper is to design a differentiable cluster assignment matrix S that learns to find relevant clusters based on higher-order connectivity patterns, in an end-to-end manner within any GNN architecture. To achieve this, we formulate a continuous relaxation of motif spectral clustering and embed the derived formulation into the model objective function to enforce its learning.

4.1 Probabilistic motif spectral clustering
Before exploring how we can rewrite the motif conductance optimisation problem (2) in a solvable way, we introduce the motif adjacency matrix A_M, where each entry (A_M)_{ij} represents the number of motifs in which both node i and node j participate. Its diagonal has zero values. Formally, (A_M)_{ij} = Σ_{v∈ℳ} 1(i, j ∈ v, i ≠ j). G_M is the graph induced by A_M. (D_M)_{ii} = Σ_{j=1}^{N} (A_M)_{ij} and L_M are the motif degree and motif Laplacian matrices.
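As an illustration of this definition for the triangle motif used below (where, as noted in the complexity discussion of Section 5.1, A_M = A² ⊙ A), here is a minimal dense-tensor sketch; the function name and setting are ours, not the authors' released code.

```python
import torch

def triangle_motif_adjacency(A: torch.Tensor) -> torch.Tensor:
    """Motif adjacency matrix for the triangle motif K3.

    (A @ A)[i, j] counts length-2 paths between i and j; masking with A keeps
    only pairs that are also directly connected, i.e. the number of triangles
    containing both i and j. The diagonal is zeroed out.
    """
    A = (A > 0).float()            # assume an unweighted, symmetric adjacency
    A_M = (A @ A) * A              # Hadamard mask with the original edges
    A_M.fill_diagonal_(0.0)
    return A_M

# Example: in a 4-clique, every pair of distinct nodes shares exactly 2 triangles.
A = torch.ones(4, 4) - torch.eye(4)
print(triangle_motif_adjacency(A))  # off-diagonal entries equal 2
```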
For now, we focus on triangle motifs (M = K_3), and extend to more complex motifs in Section 4.2. From [4], we have:

    \mathrm{cut}_M^{(G)}(S_k, \bar{S}_k) = \frac{1}{2} \sum_{i \in S_k} \sum_{j \in \bar{S}_k} (A_M)_{ij},
    \qquad
    \mathrm{vol}_M^{(G)}(S_k) = \frac{1}{2} \sum_{i \in S_k} \sum_{j \in V} (A_M)_{ij},

which enables us to rewrite (2) as:

    \min_{S_1,\dots,S_K} \sum_{k=1}^{K} \frac{\sum_{i \in S_k, j \in \bar{S}_k} (A_M)_{ij}}{\sum_{i \in S_k, j \in V} (A_M)_{ij}}
    \;\equiv\;
    \max_{S_1,\dots,S_K} \sum_{k=1}^{K} \frac{\sum_{i, j \in S_k} (A_M)_{ij}}{\sum_{i \in S_k, j \in V} (A_M)_{ij}},    (3)

where the last equivalence follows from

    \sum_{i, j \in S_k} (A_M)_{ij} + \sum_{i \in S_k, j \in \bar{S}_k} (A_M)_{ij} = \sum_{i \in S_k, j \in V} (A_M)_{ij}.

Instead of using partition sets, we define a discrete cluster assignment matrix S ∈ {0,1}^{N×K} where S_{ij} = 1 if v_i ∈ S_j and 0 otherwise. We denote by S_j = [S_{1j}, ..., S_{Nj}]^⊤ the j-th column of S, which indicates the nodes belonging to cluster S_j. Using this, we transform (3) into:

    \max_{S \in \{0,1\}^{N \times K}} \sum_{k=1}^{K} \frac{\sum_{i,j \in V} (A_M)_{ij} S_{ik} S_{jk}}{\sum_{i,j \in V} S_{ik} (A_M)_{ij}}
    \;\equiv\;
    \max_{S \in \{0,1\}^{N \times K}} \sum_{k=1}^{K} \frac{S_k^{\top} A_M S_k}{S_k^{\top} D_M S_k}
    \;\equiv\;
    \min_{S \in \{0,1\}^{N \times K}} -\mathrm{Tr}\!\left( \frac{S^{\top} A_M S}{S^{\top} D_M S} \right),    (4)

where the division sign in the last line is an element-wise division on the diagonal of both matrices. By definition, S is subject to the constraint S 1_K = 1_N, i.e. each node belongs to exactly one cluster. This optimisation problem is NP-hard since S takes discrete values. We thus relax it to a probabilistic framework, where S takes continuous values in the range [0, 1], representing cluster membership probabilities, i.e. each entry S_{ik} denotes the probability that node i belongs to cluster k. Referring to [43] and [4], solving this continuous relaxation of motif spectral clustering approximates a closed-form solution with theoretical guarantees, provided by the Cheeger inequality [8]. Compared to the original hard assignment problem, this soft cluster assignment formulation is less likely to be trapped in local minima [16]. It also generalises easily to multi-class assignment, expresses uncertainty in clustering, and can be optimised within any GNN.

4.2 End-to-end clustering framework
In this section, we leverage this probabilistic approximation of motif conductance to learn our cluster assignment matrix S in a trainable manner. Our method addresses the limitations of (motif) spectral clustering: we cluster nodes based both on graph topology and node features; leverage higher-order connectivity patterns; avoid the expensive eigen-decomposition of the motif Laplacian; and allow clustering of out-of-sample graphs.

We compute the soft cluster assignment matrix S using one (or more) fully connected layer(s), mapping each node's representation X_{i*} to its probabilistic cluster assignment vector S_{i*}. We apply a softmax activation function to enforce the constraint inherited from (4): S_{ij} ∈ [0, 1] and S 1_K = 1_N:

    S = FC(X; Θ).    (5)

Θ are trainable parameters, optimised by minimising the unsupervised loss function L_mc, which approximates the relaxed formulation of the motif conductance problem (4):

    \mathcal{L}_{mc} = -\frac{1}{K} \cdot \mathrm{Tr}\!\left( \frac{S^{\top} A_M S}{S^{\top} D_M S} \right).    (6)

Referring to the spectral clustering formulation¹, L_mc ∈ [−1, 0]. It reaches −1 when G_M has ≥ K connected components (no motif endpoints are separated by clustering), and 0 when, for each pair of nodes participating in the same motif (i.e. (A_M)_{ij} > 0), the cluster assignments are orthogonal: ⟨S_{i*}, S_{j*}⟩ = 0. L_mc is a non-convex function and its minimisation can lead to local minima, although our probabilistic membership formulation makes this less likely to happen than with hard membership [16].

In fact, we allow the combination of several motifs inside our objective function (6) via L_mc = Σ_j α_j L_mc_j, where L_mc_j denotes the objective function with respect to a particular motif (e.g., edge, triangle, 4-node cycle) and α_j is an importance factor. This also increases the power of our method, allowing us to find communities of nodes w.r.t. a hierarchy of higher-order substructures. As a result, the graph coarsening step will pool together more relevant groups of nodes, potentially capturing more relevant patterns in subsequent layers, ultimately producing richer graph representations. We implement it for edge and triangle motifs:

    \mathcal{L}_{mc} = -\frac{\alpha_1}{K} \cdot \mathrm{Tr}\!\left( \frac{S^{\top} A S}{S^{\top} D S} \right) - \frac{\alpha_2}{K} \cdot \mathrm{Tr}\!\left( \frac{S^{\top} A_M S}{S^{\top} D_M S} \right).    (7)

We let α_1, α_2 be dynamic functions of the epoch, subject to α_1 + α_2 = 1, allowing us to first optimise higher-order motifs before moving on to smaller ones. It helps refine the level of granularity progressively and was found desirable empirically. This is the higher-order clustering formulation that we consider in the paper.

In case we would like to enforce more rigorously the hard cluster assignment, characteristic of the original motif conductance formulation, we design an auxiliary loss function:

    \mathcal{L}_{o} = \frac{1}{\sqrt{K} - 1} \left( \sqrt{K} - \frac{1}{\sqrt{N}} \sum_{j=1}^{K} \lVert S_{*j} \rVert_F \right),    (8)

where ∥·∥_F indicates the Frobenius norm. This orthogonality loss encourages more balanced and discrete clusters (i.e. a node is assigned to one cluster with high probability and to the other clusters with low probability), further discouraging degenerate solutions. Although its effect overlaps with L_mc, it often smooths out the optimisation process and even slightly improves performance in complex tasks or networks, such as graph classification. In (8), we rescale L_o to [0, 1], making it commensurable with L_mc. As a result, the two terms can be safely summed and optimised together when specified. A parameter μ controls the strength of this regularisation.

¹The largest eigenvalue of A_M S = λ D_M S is 1 and the smallest is 0; we are summing only the K largest eigenvalues.
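To make Eqs. (5)–(8) concrete, here is a minimal PyTorch sketch of the assignment head and the two unsupervised losses for a single dense graph; the helper names and the fixed (α₁, α₂) values are illustrative choices of ours (the paper schedules them over epochs), not the authors' reference implementation.

```python
import torch

def soft_assignments(X, fc):                     # Eq. (5): S = softmax(FC(X)); rows sum to 1
    return torch.softmax(fc(X), dim=-1)

def motif_cut_loss(S, A):                        # one term of Eq. (6)/(7) for a given adjacency
    D = torch.diag(A.sum(dim=-1))
    num = torch.einsum('nk,nm,mk->k', S, A, S)   # diag(S^T A S)
    den = torch.einsum('nk,nm,mk->k', S, D, S)   # diag(S^T D S)
    return -(num / den.clamp_min(1e-9)).sum() / S.shape[-1]

def hosc_losses(X, A, A_M, fc, alpha1, alpha2, mu):
    S = soft_assignments(X, fc)
    L_mc = alpha1 * motif_cut_loss(S, A) + alpha2 * motif_cut_loss(S, A_M)      # Eq. (7)
    K, N = S.shape[-1], S.shape[-2]
    col_norms = S.norm(dim=-2)                   # ||S_{*j}|| for each cluster j
    L_o = (K ** 0.5 - col_norms.sum() / N ** 0.5) / (K ** 0.5 - 1)              # Eq. (8)
    return L_mc + mu * L_o, S
```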
[…] computing cluster assignments. The latter is realistic due to the homophily property of many real-world networks [26] as well as the smoothing effect of message passing layers [7], which render connected nodes more similar.

We conclude this section with a note for future work. An interesting research direction would be to extend this framework to 4-node motifs. Despite having managed to derive a theoretical formulation for the 4-node motif conductance problem in Appendix C, it becomes complex and would probably necessitate its own dedicated research, as it could be a promising extension.

4.3 Higher-order graph coarsening
The methodology detailed in the previous sections is a general clustering technique that can be used for any clustering task on any graph dataset. In this paper, we utilise it to form a pooling operator, called HoscPool, which exploits the cluster assignment matrix S to generate a coarsened version of the graph (with fewer nodes and edges) that preserves critical information and embeds higher-order connectivity patterns. More precisely, it coarsens the existing graph by creating super-nodes from the derived clusters, with a new edge set and feature vectors depending on the previous nodes belonging to each cluster. Mathematically,

    HoscPool: G = (X, A) → G^{pool} = (X^{pool}, A^{pool}),
    A^{pool} = S^{⊤} A S  and  X^{pool} = S^{⊤} X.

Each entry X^{pool}_{ij} denotes feature j's value for cluster i, calculated as a sum of feature j's values over the nodes belonging to cluster i, weighted by the corresponding cluster assignment scores. A^{pool} ∈ R^{K×K} is a symmetric matrix where A^{pool}_{ij} can be viewed as the connection strength between cluster i and cluster j. Given our optimisation function, it will be a diagonally dominant matrix, which would hamper the propagation across adjacent nodes. For this reason, we remove self-loops. We also symmetrically normalise the new adjacency matrix. Lastly, note that we use the original A and X for this graph coarsening step; their motif counterparts A_M and X_M are simply leveraged to compute the loss function. Our work thus differs clearly from diffusion methods and traditional GNNs leveraging higher-order structure.
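The coarsening step above, including the two post-processing choices (self-loop removal and symmetric normalisation), can be sketched as follows for dense tensors; this is our own illustrative rewrite of the two formulas, not the authors' code.

```python
import torch

def hosc_coarsen(X, A, S):
    """Coarsen (X, A) into (X_pool, A_pool) using soft assignments S of shape (N, K)."""
    X_pool = S.t() @ X                     # cluster features: weighted sums of member features
    A_pool = S.t() @ A @ S                 # cluster-to-cluster connection strengths
    A_pool.fill_diagonal_(0.0)             # remove self-loops (A_pool is diagonally dominant)
    deg = A_pool.sum(dim=-1)
    d_inv_sqrt = deg.clamp_min(1e-9).pow(-0.5)
    A_pool = d_inv_sqrt[:, None] * A_pool * d_inv_sqrt[None, :]   # symmetric normalisation
    return X_pool, A_pool
```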
Because our GNN-based implementation of motif spectral clustering is fully differentiable, we can stack several HoscPool layers, intertwined with message passing layers, to hierarchically coarsen the graph representation. In the end, a global pooling step and some dense layers produce a graph prediction. The parameters of each HoscPool layer can be learned end-to-end by jointly optimising:

    \mathcal{L} = \mathcal{L}_{mc} + \mu \mathcal{L}_{o} + \mathcal{L}_{s},    (9)

where L_s denotes any supervised loss for a particular downstream task (here the cross-entropy loss). This way, we should be able to hierarchically capture relevant higher-order graph structure while learning GNN parameters so as to ultimately better classify the graphs within our dataset.
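As an illustration of how such a stack could be wired for graph classification (message passing, HoscPool, message passing, HoscPool, global pooling, dense head, joint loss of Eq. (9)), here is a hedged sketch reusing the helper functions from the earlier sketches; the layer sizes and the plain dense GCN-style layer are assumptions of ours, not the authors' architecture.

```python
import torch
import torch.nn as nn

class DenseGCNLayer(nn.Module):
    """A simple dense GCN-style layer, ReLU(A X W); stands in for any MP layer."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)
    def forward(self, X, A):
        return torch.relu(A @ self.lin(X))

class HoscPoolBlock(nn.Module):
    """MP layer + assignment MLP; returns the coarsened graph and the unsupervised loss."""
    def __init__(self, d, K, alpha1=0.5, alpha2=0.5, mu=0.1):
        super().__init__()
        self.mp = DenseGCNLayer(d, d)
        self.assign = nn.Linear(d, K)
        self.alpha1, self.alpha2, self.mu = alpha1, alpha2, mu
    def forward(self, X, A):
        X = self.mp(X, A)
        A_M = triangle_motif_adjacency(A)            # from the earlier sketch
        loss, S = hosc_losses(X, A, A_M, self.assign,
                              self.alpha1, self.alpha2, self.mu)
        X_pool, A_pool = hosc_coarsen(X, A, S)       # from the earlier sketch
        return X_pool, A_pool, loss

class HoscPoolClassifier(nn.Module):
    def __init__(self, d_in, d_hid, n_classes, K1=32, K2=8):
        super().__init__()
        self.embed = DenseGCNLayer(d_in, d_hid)
        self.block1 = HoscPoolBlock(d_hid, K1)
        self.block2 = HoscPoolBlock(d_hid, K2)
        self.head = nn.Linear(d_hid, n_classes)
    def forward(self, X, A, y=None):
        X = self.embed(X, A)
        X, A, l1 = self.block1(X, A)
        X, A, l2 = self.block2(X, A)
        logits = self.head(X.mean(dim=-2))           # global mean pooling over clusters
        unsup = l1 + l2
        if y is None:
            return logits, unsup
        # Eq. (9): unsupervised terms plus the supervised cross-entropy loss.
        return logits, unsup + nn.functional.cross_entropy(logits.unsqueeze(0), y)
```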
[Figure 1: loss value w.r.t. epochs (0–60), with "objective" and "regularizer" curves; panels include MinCutPool and MinCutPool – degenerate solution.]
Figure 1: Loss function value w.r.t. epochs. MinCutPool optimises the orthogonality loss, which decreases smoothly, while its min-cut objective remains constant (acting like a regularizer); whereas HoscPool optimises the main objective directly. Sometimes, MinCutPool does not manage to optimise the regularizer loss, yielding a degenerate clustering.

4.4 Comparison with relevant baselines
Before moving to the experiments, we take a moment to emphasise the key differences with respect to core end-to-end clustering-based pooling baselines. We focus on MinCutPool in the following since it is our closest baseline; DiffPool and others differ more significantly, in addition to being less theoretically grounded and efficient. Firstly, MinCutPool focuses on first-order connectivity patterns, while we work on higher-order ones, which implies a more elaborate background theory with the construction and combination of several motif adjacency matrices (each specific to a particular motif). This should lead to capturing more advanced types of communities, ultimately producing a better coarsening of the graph. Secondly, we approximate a probabilistic version of the motif conductance problem (an extension of the normalised min-cut to motifs), whereas MinCutPool approximates the relaxed unnormalised min-cut problem.
Table 1: (Right) NMI obtained by clustering the nodes of various networks over 10 different runs. Best results are in bold, second
best underlined. The number of clusters 𝐾 is equal to the number of node classes. (Left) Dataset properties.
Dataset Nodes Edges Feat. 𝐾 SC MSC DiffPool MinCutPool HP-1 HP-2 HoscPool
Cora 2,708 5,429 1,433 7 0.150 ± 0.002 0.056 ± 0.014 0.308 ± 0.023 0.391 ± 0.028 0.435 ± 0.032 0.464 ± 0.036 0.502 ± 0.029
PubMed 19,717 88,651 500 3 0.183 ± 0.002 0.002 ± 0.000 0.098 ± 0.006 0.214 ± 0.066 0.230 ± 0.071 0.215 ± 0.073 0.260 ± 0.054
Photo 7,650 287,326 745 8 0.592 ± 0.008 0.451 ± 0.011 0.171 ± 0.004 0.086 ± 0.014 0.495 ± 0.068 0.513 ± 0.083 0.598 ± 0.101
PC 13,752 245,861 767 10 0.464 ± 0.002 0.166 ± 0.009 0.043 ± 0.008 0.026 ± 0.006 0.497 ± 0.040 0.499 ± 0.036 0.528 ± 0.041
CS 18,333 81,894 6,805 15 0.273 ± 0.006 0.011 ± 0.009 0.383 ± 0.048 0.431 ± 0.060 0.479 ± 0.022 0.701 ± 0.029 0.731 ± 0.018
Karate 34 156 10 2 0.792 ± 0.035 0.870 ± 0.031 0.715 ± 0.018 0.751 ± 0.090 0.792 ± 0.038 0.862 ± 0.046 0.894 ± 0.039
DBLP 17,716 105,734 1,639 4 0.027 ± 0.003 0.005 ± 0.006 0.186 ± 0.014 0.334 ± 0.026 0.326 ± 0.027 0.284 ± 0.026 0.312 ± 0.027
Polblogs 1,491 33,433 10 2 0.017 ± 0.000 0.014 ± 0.001 0.317 ± 0.010 0.440 ± 0.390 0.992 ± 0.003 0.994 ± 0.001 0.994 ± 0.005
Email-eu 1,005 32,770 10 42 0.485 ± 0.030 0.382 ± 0.019 0.096 ± 0.034 0.253 ± 0.028 0.317 ± 0.026 0.488 ± 0.025 0.476 ± 0.021
Syn1 1,000 6,243 10 3 0.000 ± 0.000 1.000 ± 0.000 0.035 ± 0.000 0.043 ± 0.008 0.041 ± 0.006 1.000 ± 0.000 1.000 ± 0.000
Syn2 1,000 5,496 10 2 0.003 ± 0.000 0.050 ± 0.003 0.081 ± 0.008 0.902 ± 0.028 0.942 ± 0.028 1.000 ± 0.000 1.000 ± 0.000
Syn3 500 48,205 10 5 1.000 ± 0.000 1.000 ± 0.000 0.067 ± 0.001 0.052 ± 0.002 0.115 ± 0.006 0.826 ± 0.005 1.000 ± 0.000
Despite claiming to formulate a relaxation of the normalised min-cut (a trace ratio), it actually minimises a ratio of traces in its objective function: −Tr(S^⊤ Ã S) / Tr(S^⊤ D̃ S). Since Tr(S^⊤ D̃ S) = Σ_{i∈V} D̃_{ii} is a constant, this yields the unnormalised min-cut −Tr(S^⊤ Ã S), which often produces degenerate solutions. To cope with this limitation, MinCutPool optimises in parallel a penalty term L_o encouraging balanced and discrete cluster assignments. But despite this regularizer, it often gets stuck in local minima [45] (see Fig. 1), as we will see empirically in Section 5. We spot and correct this weakness in HoscPool. Thirdly, we introduce a new and more powerful orthogonality term together with a regularisation control parameter; unlike in MinCutPool, it is not strictly necessary but often smooths out training and improves performance. Lastly, we showcase a different architecture involving a more general way of computing S.

5 EVALUATION
We now evaluate the benefits of the proposed method, with the goal of answering the following questions:
(1) Does our differentiable higher-order clustering algorithm compute meaningful clusters? Is considering higher-order structures beneficial?
(2) How does HoscPool compare with state-of-the-art pooling approaches for graph classification tasks?
(3) Why do existing pooling operators fail to significantly outperform random pooling?

5.1 Clustering
Experimental setup. For this experiment, we first run a Message Passing (MP) layer; in this case a GCN model with a skip connection for initial features [30]: X̄ = ReLU(A X Θ₁ + X Θ₂), where Θ₁ and Θ₂ are trainable weight matrices. It has 32 hidden units and a ReLU activation function. We then run a Multi-Layer Perceptron (MLP) with 32 hidden units to produce the cluster assignment matrix of dimension num_nodes × num_clusters, trained end-to-end by optimising the unsupervised loss function L_mc + μ L_o. This architecture is trained using a learning rate of 0.001 with an Adam optimizer, 500 epochs, a gradient clip of 2.0, an early-stopping patience of 200, a learning-rate decay patience of 25, and μ ∈ {0, 0.1, 1}.
Metrics. We evaluate the quality of S by comparing the distribution of true node labels with that of predicted labels, via Normalised Mutual Information NMI(ỹ, y) = (H(ỹ) − H(ỹ|y)) / √(H(ỹ) H(y)), where H(·) is the entropy and node cluster membership is determined by the argmax of its assignment probabilities. We also calculate completeness, modularity, normalised cut, and motif conductance (App. Table 7).
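A compact sketch of this evaluation pipeline (MP layer with skip connection, assignment MLP, unsupervised training, NMI against ground-truth labels) might look as follows; training details beyond those stated above are kept minimal, and the loss helper comes from the earlier sketch.

```python
import torch
import torch.nn as nn
from sklearn.metrics import normalized_mutual_info_score

class ClusterNet(nn.Module):
    """MP layer with a skip connection on the initial features, plus an assignment MLP."""
    def __init__(self, d_in, d_hid, K):
        super().__init__()
        self.theta1 = nn.Linear(d_in, d_hid, bias=False)
        self.theta2 = nn.Linear(d_in, d_hid, bias=False)
        self.mlp = nn.Sequential(nn.Linear(d_hid, d_hid), nn.ReLU(), nn.Linear(d_hid, K))
    def forward(self, X, A):
        return torch.relu(A @ self.theta1(X) + self.theta2(X))   # X̄ = ReLU(A X Θ1 + X Θ2)

def train_clustering(X, A, A_M, y_true, K, epochs=500, mu=0.1):
    model = ClusterNet(X.shape[-1], 32, K)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        X_bar = model(X, A)
        loss, S = hosc_losses(X_bar, A, A_M, model.mlp, 0.5, 0.5, mu)  # from the earlier sketch
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 2.0)
        opt.step()
    y_pred = S.argmax(dim=-1).cpu().numpy()
    return normalized_mutual_info_score(y_true, y_pred)
```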
Datasets. We use a collection of node classification datasets with ground-truth community labels: citation networks Cora and PubMed; collaboration networks DBLP and Coauthor CS; co-purchase networks Amazon Photo and Amazon PC; the KarateClub community network; and communication networks Polblogs and Email-eu. They are all taken from PyTorch Geometric. We also construct three synthetic datasets, Syn1, Syn2, Syn3 (based on several random graphs), where node labels are determined based on higher-order community structure and node features are simple graph statistics (Appendix A). They are designed to show the additional efficiency of HoscPool when datasets have clear higher-order structure, which is not always the case for the standard baseline datasets chosen.
Baselines. We compare HoscPool with the original spectral clustering (SC), motif spectral clustering (MSC)², as well as the key pooling baselines DiffPool and MinCutPool. We refer to all methods by their pooling name for simplicity, although this experiment focuses on the clustering part and does not involve the coarsening step. We repeat all experiments 10 times and average results across runs. For the ablation study, let HP-1 and HP-2 denote HoscPool where L_mc in Eq. (7) has α₂ = 0 (first-order connectivity only) and α₁ = 0 (higher-order only), respectively.

Results are reported in Table 1. HoscPool achieves better performance than all baselines across most datasets. This trend is emphasised on the synthetic datasets, where we know higher-order structure is critical, proving the benefits of our clustering method. DiffPool often fails to converge to a good solution. MinCutPool, as evoked earlier and in [41], sometimes gets stuck in degenerate solutions (e.g., Amazon PC and Photo – all nodes are assigned to less than 10% of the clusters), failing completely to converge even when tuning model architecture and hyper-parameters (see Fig. 1). HP-1 shows superior performance and alleviates this issue, meaning that it can be considered as an improved version of MinCutPool.

²SC based on motif conductance [4] instead of edge conductance, i.e. SC applied to A_M.
Spectral Clustering (SC) performs really well on some datasets, poorly on others. MSC often performs badly, revealing its excessive dependence on the presence of motifs. On the contrary, our results highlight the robustness of HoscPool to the limited presence of motifs, thanks to its consideration of node features. Besides, HoscPool's consideration of finer granularity levels allows it to group nodes primarily based on motifs while still considering edges when necessary, which may be the reason for its superior performance with respect to HP-2, itself more desirable than HP-1 (edge-only). This ablation study supports the relevance of our underlying claims: incorporating higher-order information leads to better communities, and combining several motifs helps further. See Table 7 for more results.
Complexity. The main complexity of HoscPool lies in the derivation of A_M, which remains relatively fast for triangles: A_M = A² ⊙ A. In Table 2, we remark that HoscPool (and HP-2) has a running time comparable to MinCutPool on small or average-size datasets. It is slower to compute than MinCutPool on large datasets, while staying relatively affordable. This extra time lies in the computation and processing of the motif adjacency matrix as well as the combination of several connectivity orders, which grows with the graph size. Note however that we could avoid the computation of the regularisation loss, which both MinCutPool and DiffPool cannot afford. HP-1 is not reported as it shares similar running times with MinCutPool while reaching better performance.

Table 2: Running time (s) of the entire clustering experiment.

Dataset DiffPool MinCutPool HP-2 HoscPool
Cora 13 16 17 24
PubMed 80 95 264 501
Photo 23 48 91 182
PC 89 101 304 510
CS 157 251 683 1406
Karate 9 9 9 9
DBLP 126 210 635 1330
Polblogs 8 9 10 10
Email-eu 9 9 10 12

5.2 Graph classification
[…] of 2.0, 100 early stop patience, a learning decay patience of 50, and a regularisation parameter μ ∈ {0, 0.1}.
Baselines. We compare our method to representative state-of-the-art graph classification baselines, involving the pooling operators DiffPool [49], MinCutPool [5], EigPool [25], SAGPool [19], ASAP [33], and GMT [3], by replacing the pooling layer in the above pipeline. We implement a random pooling operator (Random) to assess the benefits of pooling similar nodes together, and a model with a single global pooling operator (NoPool) to assess how useful leveraging hierarchical information is.
Datasets. We use several common benchmark datasets for graph classification, taken from TUDataset [29], including three bioinformatics protein datasets Proteins, Enzymes, D&D; one mutagen dataset Mutagenicity; one anticancer activity dataset NCI1; two chemical compound datasets Cox2-MD and ER-MD; and one social network Reddit-Binary. Bench-hard is taken from a source where X and A are completely uninformative if considered alone. We split them into a training set (80%), validation set (10%), and test set (10%). We adopt the accuracy metric to measure performance and average the results over 10 runs, each with a different split. We select the best model using validation set accuracy, and report the corresponding test set accuracy. For featureless graphs, we use constant features. Model hyperparameters are tuned for each dataset, but are kept fixed across all baselines. Lastly, despite being used by all baselines, note that these datasets are known to be small and noisy, leading to large errors.

Results are reported in Table 3, from which we draw the following conclusions. Performing pooling proves useful (NoPool) in most cases. HoscPool compares favourably with pooling baselines on all datasets. Higher-order connectivity patterns are more desirable than first-order ones, and combining both is even better. This confirms the findings from Section 5.1 and shows that better clustering (i.e. graph coarsening) is correlated with better classification performance. However, while the clustering performance of HoscPool is significantly better than the baselines, the performance gap has slightly closed on this task. Even more surprising, the benefits of existing advanced node-grouping or node-dropping methods are not considerable with respect to the Random pooling baseline. Faithfully to what we announced in Section 1, we attempt to provide explanations.
Table 3: Graph classification accuracy. Top results are in bold, second best underlined.
Dataset NoPool Random GMT MinCutPool DiffPool EigPool SAGPool ASAP HP-1 HP-2 HoscPool
Proteins 71.6±4.1 75.7±3.2 75.0±4.2 75.9±2.4 73.8±3.7 74.2±3.1 70.6±3.5 74.4±2.6 76.7±2.5 77.0±3.1 77.5±2.3
NCI1 77.1±1.9 77.0±1.7 74.9±4.3 76.8±1.6 76.7±2.1 75.0±2.2 74.1±3.9 74.3±1.6 77.3±1.6 80.3±2.0 79.9±1.7
Mutagen. 78.1±1.3 79.2±1.3 79.4±2.2 78.6±1.8 77.9±2.3 75.2±2.7 74.4±2.7 76.8±2.4 79.8±1.6 81.7±2.1 82.3±1.3
DD 71.2±2.2 77.1±1.5 78.1±3.2 78.4±2.8 76.3±2.1 75.1±1.8 71.5±4.1 73.2±2.5 78.8±2.0 78.2±2.1 79.4±1.8
Reddit-B 80.1±2.6 89.3±2.6 86.7±2.6 89.0±1.4 87.3±2.4 82.8±2.1 74.7±4.5 84.1±1.1 91.2±1.0 92.8±1.5 93.6±0.9
Cox2-MD 58.7±3.2 62.9±3.6 58.9±3.6 58.9±5.1 57.1±4.8 59.8±3.4 56.9±9.7 60.5±5.5 61.6±3.5 66.4±4.6 64.6±3.9
ER-MD 72.2±2.9 73.0±4.5 74.3±4.5 75.5±4.0 76.8±4.8 73.1±3.8 71.7±8.2 74.5±5.9 76.2±4.2 77.9±4.3 78.2±3.8
b-hard 66.5±0.5 69.1±2.1 70.1±3.4 72.6±1.5 70.7±2.0 69.1±3.1 39.6±9.6 70.5±1.7 72.4±0.8 73.5±0.8 74.0±0.4
Table 4: (Left) Simple graph statistics. (Middle) The clustering coefficient (cc), proportion of triangles attached per node (triangle), transitivity (transi), homophily (homo), and proportion of node labels in a graph w.r.t. all graphs (diff-labels) are computed on each graph individually and averaged over the whole dataset. (Right) msc, sc and sc-mod denote the motif conductance, normalised cut, and modularity obtained by clustering each graph using traditional deterministic spectral clustering, where the number of clusters is equal to the number of labels in a graph. The last column refers to the NMI obtained through HoscPool clustering only. All metrics provide information on graph community structure. Reddit-Binary has no node labels and is treated differently.
Datasets # graphs # edges av. # nodes labels cc triangle transi homo diff-labels msc sc sc-mod NMI
Proteins 1,113 162,088 39 3 .575 1.03 .517 .476 .833 .034 .005 .460 .46
NCI1 4,110 132,753 29 37 .125 .125 .214 .667 .054 .111 0.0 .388 .71
DD 1,178 843,046 284 89 .496 2.0 .462 .058 .219 .021 .013 .402 .38
Mutagenicity 4,337 133,447 30 14 .002 .003 .002 .376 .244 .056 0.0 .378 .85
Reddit-Binary 2,000 995,508 429 no .051 .069 .009 - - .008 .011 .071 -
COX2-MD 303 203,084 26.2 7 1.00 103 1.00 .707 .482 .302 .333 .01 .45
ER-MD 446 209482 21.1 10 1.00 77.4 1.00 .701 .232 .331 .323 .01 .56
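For reference, the per-graph statistics in the middle block of Table 4 can be computed with standard NetworkX routines along the following lines; the exact homophily definition used in the paper is not spelled out, so the edge-homophily variant below is an assumption on our part.

```python
import networkx as nx

def table4_stats(G: nx.Graph, labels: dict) -> dict:
    """Rough per-graph statistics in the spirit of Table 4 (middle block)."""
    n = G.number_of_nodes()
    tri = nx.triangles(G)                              # triangles attached to each node
    same = sum(labels[u] == labels[v] for u, v in G.edges())
    return {
        "cc": nx.average_clustering(G),                # clustering coefficient
        "triangle": sum(tri.values()) / max(n, 1),     # triangles attached per node
        "transi": nx.transitivity(G),                  # transitivity
        "homo": same / max(G.number_of_edges(), 1),    # edge homophily (assumed definition)
    }

# Example on Zachary's karate club, using its two factions as labels.
G = nx.karate_club_graph()
labels = {v: G.nodes[v]["club"] for v in G}
print(table4_stats(G, labels))
```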
Table 5: Ablation study of HoscPool, denoted as Base. GIN, SAGE, GAT change the core GNN model; No-diag does not zero out the diagonal of the coarsened adjacency matrix S⊤AS in the pooling step; 1-pooling uses an architecture with only one HoscPool block; skip-co adds a skip connection between every GNN layer and the dense layer; c-ratio involves a higher clustering ratio; and no-adapt refers to the discussed dynamic adaptive loss. For dense-feat, we simply added some graph statistics to boost node identifiability.
[…] study in Table 5). However, despite clear progress – we learn to decently optimise S, to assign nodes to more clusters, and to better balance the number of nodes per cluster – there still seems to be room for improvement. We thus look for other potential causes which could prevent proper learning, targeting especially the graph classification model architecture and the nature of the selected datasets.

Concerning model architecture, we show in Appendix B that using more complex clustering frameworks (2-layer clustering: GNN – Pooling – GNN – Pooling) totally prevents the learning of meaningful clusters for MinCutPool (and DiffPool), which illustrates
a feature oversmoothing issue. HoscPool, on the other hand, fixes this issue and still manages to learn meaningful clusters. Nevertheless, the learning process becomes longer and more difficult, leading to a drop in performance. In addition to showing the robustness of HoscPool with respect to existing pooling baselines, this experiment reveals that the clustering performed in graph classification tasks may not lead to meaningful clusters because of the more complex framework. Although it is likely to contribute, it is probably one factor among others, since simpler GC models like GNN – Pooling – GNN – Global Pooling – Dense (1-pooling in Table 5) do not improve things.

We therefore also look for answers from a dataset perspective. In Table 4, the computed graph properties and clustering results on individual graphs suggest that graphs are relatively small, with few node types co-existing in a same graph, weak homophily, and a relatively poor community structure for clustering algorithms to exploit. Besides, because most datasets do not have dense node features (only labels), the node identifiability assumption is shaken and does not enable our MLP (5) to fully distinguish between same-label nodes, thus making it impossible to place them in distinct clusters. On top of that, we now need to learn a clustering pattern that extends to all graphs, which is a much more complex task (compared to a single graph in Section 5.1).

As a result, taking into consideration the multiple pooling layers, the joint optimisation with a supervised loss, the poor individual graph community structure, and the complexity of learning to cluster all graphs with few features, learning meaningful clusters becomes extremely challenging. This would explain the optimisation difficulties encountered by existing pooling operators so far. Although HoscPool makes a step towards better pooling, we advise future research to explore more appropriate datasets than TUDataset [29], such as Open Graph Benchmark (OGB) datasets, even though TUDataset is used as a benchmark by all pooling baselines. We also recommend designing simpler node-grouping approaches, using higher-order information so as to capture more relevant communities even with complex model architectures, and exploiting graph structure information more directly (as the targeted graphs do not have dense node features). Finally, the heterophilous nature of these datasets (Table 4) calls into question the true benefit of grouping together nodes with similar embeddings (homophily assumption) when coarsening the graph.

6 CONCLUSION
[…] and efficient pooling operators, ensuring significant improvement over the random baseline for graph classification tasks.
Acknowledgements. Supported in part by ANR (French National Research Agency) under the JCJC project GraphIA (ANR-20-CE23-0009-01).

A SYNTHETIC DATASETS
(1) Syn1 is made of k communities, each densely intra-connected by triangles. We then widely link these communities without creating new triangles through these new links. We create random Gaussian features (including one correlated to node labels) since our method is dependent on node features.
(2) Syn2 is an Erdős–Rényi random graph with 1,000 nodes and p = 0.012. Each node receives label 0 if it does not belong to a triangle and label 1 otherwise. Node features include several graph statistics (see the sketch after this list).
(3) Syn3 is designed using a Gaussian random partition graph with k partitions whose sizes are drawn from a normal distribution. Nodes within the same partition are connected with probability p = 0.8, while nodes across partitions are connected with probability 0.2. Here, only random features are used.
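A minimal sketch of how a dataset in the spirit of Syn2 could be generated (Erdős–Rényi graph, binary label marking membership in at least one triangle, simple graph statistics as features); the specific statistics used as features below are our choice, since the paper only says "several graph statistics".

```python
import networkx as nx
import numpy as np

def make_syn2(n=1000, p=0.012, seed=0):
    G = nx.erdos_renyi_graph(n, p, seed=seed)
    tri = nx.triangles(G)                                   # triangles per node
    y = np.array([1 if tri[v] > 0 else 0 for v in G])       # label 1 iff the node is in a triangle
    feats = np.column_stack([
        [G.degree(v) for v in G],                           # degree
        [nx.clustering(G, v) for v in G],                   # local clustering coefficient
        np.random.default_rng(seed).normal(size=n),         # noise dimension
    ])
    return G, feats.astype(np.float32), y

G, X, y = make_syn2()
print(X.shape, y.mean())   # fraction of nodes touching a triangle
```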
B 2-LAYER CLUSTERING: PRECISIONS
In this experiment, we complexify the clustering framework (MP – MLP), making it more similar to its use as a pooling operator inside supervised graph classification tasks. More precisely, we follow the architecture MP – Pooling – MP – Pooling. As before, the pooling step combines an MLP computing the first cluster assignment matrix S₁ with a graph coarsening step. In the end, we provide a unique cluster assignment matrix S of dimension N × K, composed of the two matrices derived above (S₁ and S₂), such that the probability that node i belongs to cluster k is written S_{ik} = Σ_j (S₁)_{ij} (S₂)_{jk}.

The results, given in Table 6, are obtained using 1,000 epochs with early_stop_patience = 500, i.e. many more epochs than for the standard 1-layer clustering. This is because the convergence to a desirable solution is weaker. Furthermore, the obtained solution yields a less desirable clustering. Overall, this observation is very important as it suggests that the clustering obtained in supervised graph classification tasks might not be as accurate as what our original evaluation on real-world datasets with ground-truth community structure suggested.
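The composition of the two assignment matrices described above reduces to a single matrix product; a small sketch (the shapes are only an illustrative assumption):

```python
import torch

N, K1, K2 = 100, 20, 5
S1 = torch.softmax(torch.randn(N, K1), dim=-1)    # first-level assignments (N x K1)
S2 = torch.softmax(torch.randn(K1, K2), dim=-1)   # second-level assignments (K1 x K2)

S = S1 @ S2                                       # S_ik = sum_j (S1)_ij (S2)_jk
assert torch.allclose(S.sum(dim=-1), torch.ones(N))   # rows still sum to 1
```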
C EXTENSION TO 4-NODE MOTIFS
Here, we consider motifs composed of 4 nodes (|M| = 4), such as the 4-cycle or K₄, with an instance written as v = {l, q, r, k}. In Section 4.1, we formulated a relation between triangle normalised cut and graph normalised cut, in order to compute the triangle normalised cut easily. Here, we do the same, but for 4-node-motif conductance. Again, we derive this relation by looking at a single cluster S with corresponding cluster assignment vector y, where y_i = 1 if i ∈ S and 0 otherwise.

    3\,\mathrm{cut}_M^{(G)}(S, \bar{S}) = 3 \sum_{v \in \mathcal{M}} 1\{\exists\, i, j \in v \mid i \in S, j \in \bar{S}\}
    = \sum_{v \in \mathcal{M}} \Big[ 3(y_l + y_q + y_r + y_k) - 2(y_l y_q + y_l y_r + y_l y_k + y_q y_r + y_q y_k + y_r y_k) - 1\{\text{exactly 2 of } l, q, r, k \text{ are in } S\} \Big].

The bracketed expression without the last indicator equals, for each instance v: 0 if all of y_l, y_q, y_r, y_k are the same; 3 if exactly three are the same; and 4 if exactly two are the same. Thus,

    3\,\mathrm{cut}_M^{(G)}(S, \bar{S}) + \sum_{v \in \mathcal{M}} 1\{\text{exactly 2 of } l, q, r, k \text{ are in } S\}
    = \sum_{v \in \mathcal{M}} \Big[ 3(y_l + y_q + y_r + y_k) - 2(y_l y_q + y_l y_r + y_l y_k + y_q y_r + y_q y_k + y_r y_k) \Big]
    = y^{\top} D_M y - y^{\top} A_M y
    = y^{\top} L_M y
    = \mathrm{cut}^{(G_M)}(S, \bar{S}),

where the second equality holds because

    y^{\top} D_M y = \sum_{i \in S} \sum_{j \in V} (A_M)_{ij} = \mathrm{vol}^{(G_M)}(S) = (|M| - 1)\,\mathrm{vol}_M^{(G)}(S) = 3\,\mathrm{vol}_M^{(G)}(S) = 3 \sum_{v \in \mathcal{M}} (y_l + y_q + y_r + y_k),

    y^{\top} A_M y = \sum_{i \in V} \sum_{j \in V} y_i y_j (A_M)_{ij} = \sum_{i, j \in S} (A_M)_{ij} = \sum_{i, j \in S} \sum_{v \in \mathcal{M}} 1\{i, j \in v\} = \sum_{v \in \mathcal{M}} 2(y_l y_q + y_l y_r + y_l y_k + y_q y_r + y_q y_k + y_r y_k).

Overall, we obtain the following equality:

    \mathrm{cut}_M^{(G)}(S, \bar{S}) = \frac{1}{3}\,\mathrm{cut}^{(G_M)}(S, \bar{S}) - \frac{1}{3} \sum_{v \in \mathcal{M}} 1\{\text{exactly 2 of } l, q, r, k \in S\}.

The optimisation problem can then be written as:

    \min_{S} \sum_{k} \frac{\mathrm{cut}_M^{(G)}(S_k, \bar{S}_k)}{\mathrm{vol}_M^{(G)}(S_k)}
    \;\equiv\; \min_{S} \sum_{k} \frac{\frac{1}{3}\,\mathrm{cut}^{(G_M)}(S_k, \bar{S}_k) - \frac{1}{3} \sum_{v \in \mathcal{M}} 1\{\text{exactly 2 nodes of } v \in S_k\}}{\frac{1}{3}\,\mathrm{vol}^{(G_M)}(S_k)}
    \;\equiv\; \min_{S \in [0,1]^{N \times K}} -\mathrm{Tr}\!\left( \frac{S^{\top} A_M S}{S^{\top} D_M S} \right) - \sum_{k} \frac{\sum_{v \in \mathcal{M}} 1\{\text{exactly 2 nodes of } v \in S_k\}}{\mathrm{vol}^{(G_M)}(S_k)}.

In practice however, unlike the triangle normalised cut, this expression is not easy to compute. First of all, computing the related motif adjacency matrix is difficult; it cannot be written as a simple matrix product. Secondly, there is the additional term on the right-hand side to take into consideration. And although we might be able to compute both directly via a complex algorithm, it is not guaranteed that solving this problem is quicker than the original optimisation problem (using the definitions of vol_M^{(G)} and cut_M^{(G)} directly).
Table 7: Modularity (Mod), Conductance (Cond), Motif Conductance (M.Cond), and Homogeneity (Homog) obtained by clustering the nodes of various networks over 10 different runs, reported for MinCutPool, HP-2, and HoscPool (from left to right). The number of clusters K is equal to the number of node classes. HP-2 optimises the motif conductance metric better than MinCutPool. HoscPool achieves a similar motif conductance but a better conductance than HP-2, which it also often outperforms in terms of modularity. Finally, MinCutPool does reach degenerate solutions for several datasets (e.g., PC, Photo, CS, Email-eu).

Dataset MinCutPool: Mod Cond M.Cond Homog HP-2: Mod Cond M.Cond Homog HoscPool: Mod Cond M.Cond Homog
Cora 0.700 0.156 0.094 0.464 0.621 0.125 0.025 0.338 0.654 0.091 0.026 0.314
PubMed 0.532 0.120 0.047 0.225 0.478 0.069 0.029 0.101 0.454 0.082 0.038 0.096
CS −0.005 0.001 0.000 0.000 0.684 0.141 0.087 0.637 0.695 0.131 0.084 0.638
Photo 0.000 0.008 0.002 0.002 0.566 0.084 0.033 0.470 0.684 0.093 0.043 0.580
PC −0.001 0.000 0.000 0.000 0.546 0.285 0.263 0.457 0.591 0.149 0.082 0.556
DBLP 0.533 0.182 0.157 0.363 0.588 0.131 0.065 0.277 0.608 0.114 0.066 0.318
Karate 0.370 0.269 0.281 0.543 0.389 0.192 0.088 0.715 0.417 0.217 0.133 0.861
Email-eu 0.002 0.011 0.003 0.025 0.189 0.455 0.382 0.166 0.185 0.488 0.396 0.208
Polblogs 0.409 0.090 0.048 0.991 0.409 0.087 0.035 0.993 0.429 0.073 0.035 0.991
REFERENCES
[1] Ghadeer AbuOda, Gianmarco De Francisci Morales, and Ashraf Aboulnaga. 2019. Link prediction via higher-order motif features. arXiv preprint arXiv:1902.06679 (2019).
[2] James Atwood and Don Towsley. 2016. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems. 1993–2001.
[3] Jinheon Baek, Minki Kang, and Sung Ju Hwang. 2021. Accurate Learning of Graph Representations with Graph Multiset Pooling. arXiv preprint arXiv:2102.11533 (2021).
[4] Austin R Benson, David F Gleich, and Jure Leskovec. 2016. Higher-order organization of complex networks. Science 353, 6295 (2016), 163–166.
[5] Filippo Maria Bianchi, Daniele Grattarola, and Cesare Alippi. 2020. Spectral clustering with graph neural networks for graph pooling. In International Conference on Machine Learning. PMLR, 874–883.
[6] Aldo G Carranza, Ryan A Rossi, Anup Rao, and Eunyee Koh. 2020. Higher-order clustering in complex heterogeneous networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 25–35.
[7] Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. 2020. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 3438–3445.
[8] Fan Chung. 2007. Four proofs for the Cheeger inequality and graph partition algorithms. In Proceedings of ICCM, Vol. 2. Citeseer, 378.
[9] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems 29 (2016), 3844–3852.
[10] Frederik Diehl. 2019. Edge contraction pooling for graph neural networks. arXiv preprint arXiv:1905.10990 (2019).
[11] Dhivya Eswaran, Srijan Kumar, and Christos Faloutsos. 2020. Higher-order label homogeneity and spreading in graphs. In Proceedings of The Web Conference 2020. 2493–2499.
[12] Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. 2018. SplineCNN: Fast geometric deep learning with continuous B-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 869–877.
[13] Hongyang Gao and Shuiwang Ji. 2019. Graph U-Nets. In International Conference on Machine Learning. PMLR, 2083–2092.
[14] William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 1025–1035.
[15] Lun Hu, Jun Zhang, Xiangyu Pan, Hong Yan, and Zhu-Hong You. 2021. HiSCF: leveraging higher-order structures for clustering analysis in biological networks. Bioinformatics 37, 4 (2021), 542–550.
[16] Rong Jin, Feng Kang, and Chris Ding. 2005. A probabilistic approach for optimizing spectral clustering. Advances in Neural Information Processing Systems 18 (2005), 571–578.
[17] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[18] Christine Klymko, David Gleich, and Tamara G Kolda. 2014. Using triangles to improve community detection in directed networks. arXiv preprint arXiv:1404.5874 (2014).
[19] Junhyun Lee, Inyeop Lee, and Jaewoo Kang. 2019. Self-attention graph pooling. In International Conference on Machine Learning. PMLR, 3734–3743.
[20] John Boaz Lee, Ryan A Rossi, Xiangnan Kong, Sungchul Kim, Eunyee Koh, and Anup Rao. 2018. Higher-order graph convolutional networks. arXiv preprint arXiv:1809.07697 (2018).
[21] Jure Leskovec, Kevin J Lang, Anirban Dasgupta, and Michael W Mahoney. 2009. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics 6, 1 (2009), 29–123.
[22] Jianxin Li, Hao Peng, Yuwei Cao, Yingtong Dou, Hekai Zhang, Philip Yu, and Lifang He. 2021. Higher-order attribute-enhancing heterogeneous graph neural networks. IEEE Transactions on Knowledge and Data Engineering (2021).
[23] Ning Liu, Songlei Jian, Dongsheng Li, Yiming Zhang, Zhiquan Lai, and Hongzuo Xu. 2021. Hierarchical Adaptive Pooling by Capturing High-order Dependency for Graph Representation Learning. IEEE Transactions on Knowledge and Data Engineering (2021).
[24] Enxhell Luzhnica, Ben Day, and Pietro Lio. 2019. Clique pooling for graph classification. arXiv preprint arXiv:1904.00374 (2019).
[25] Yao Ma, Suhang Wang, Charu C Aggarwal, and Jiliang Tang. 2019. Graph convolutional networks with EigenPooling. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 723–731.
[26] Miller McPherson, Lynn Smith-Lovin, and James M Cook. 2001. Birds of a feather: Homophily in social networks. Annual Review of Sociology 27, 1 (2001), 415–444.
[27] Diego Mesquita, Amauri H Souza, and Samuel Kaski. 2020. Rethinking pooling in graph neural networks. arXiv preprint arXiv:2010.11418 (2020).
[28] Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. 2002. Network motifs: simple building blocks of complex networks. Science 298, 5594 (2002), 824–827.
[29] Christopher Morris, Nils M Kriege, Franka Bause, Kristian Kersting, Petra Mutzel, and Marion Neumann. 2020. TUDataset: A collection of benchmark datasets for learning with graphs. arXiv preprint arXiv:2007.08663 (2020).
[30] Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. 2019. Weisfeiler and Leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4602–4609.
[31] Yunsheng Pang, Yunxiang Zhao, and Dongsheng Li. 2021. Graph pooling via coarsened graph infomax. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2177–2181.
[32] Alan Perotti, Paolo Bajardi, Francesco Bonchi, and André Panisson. 2022. GRAPHSHAP: Motif-based Explanations for Black-box Graph Classifiers. arXiv preprint arXiv:2202.08815 (2022).
[33] Ekagra Ranjan, Soumya Sanyal, and Partha Talukdar. 2020. ASAP: Adaptive structure aware pooling for learning hierarchical graph representations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 5470–5477.
[34] Ryan A Rossi, Anup Rao, Sungchul Kim, Eunyee Koh, Nesreen K Ahmed, and Gang Wu. 2019. Higher-order ranking and link prediction: From closing triangles to closing higher-order motifs. arXiv preprint arXiv:1906.05059 (2019).
[35] Satu Elisa Schaeffer. 2007. Graph clustering. Computer Science Review 1, 1 (2007), 27–64.
[36] Thomas Schnake, Oliver Eberle, Jonas Lederer, Shinichi Nakajima, Kristof T Schütt, Klaus-Robert Müller, and Grégoire Montavon. 2020. Higher-order explanations of graph neural networks via relevant walks. arXiv preprint arXiv:2006.03589 (2020).
[37] Govind Sharma, Aditya Challa, Paarth Gupta, and M Narasimha Murty. 2021. Higher-Order Relations Skew Link Prediction in Graphs. arXiv preprint arXiv:2111.00271 (2021).
[38] Jianbo Shi and Jitendra Malik. 2000. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 8 (2000), 888–905.
[39] Martin Simonovsky and Nikos Komodakis. 2017. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3693–3702.
[40] Konstantinos Sotiropoulos and Charalampos E Tsourakakis. 2021. Triangle-aware Spectral Sparsifiers and Community Detection. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 1501–1509.
[41] Anton Tsitsulin, John Palowitch, Bryan Perozzi, and Emmanuel Müller. 2020. Graph clustering with graph neural networks. arXiv preprint arXiv:2006.16904 (2020).
[42] Charalampos E Tsourakakis, Jakub Pachocki, and Michael Mitzenmacher. 2017. Scalable motif-aware graph clustering. In Proceedings of the 26th International Conference on World Wide Web. 1451–1460.
[43] Ulrike Von Luxburg. 2007. A tutorial on spectral clustering. Statistics and Computing 17, 4 (2007), 395–416.
[44] Dorothea Wagner and Frank Wagner. 1993. Between min cut and graph bisection. In International Symposium on Mathematical Foundations of Computer Science. Springer, 744–750.
[45] Huan Wang, Shuicheng Yan, Dong Xu, Xiaoou Tang, and Thomas Huang. 2007. Trace ratio vs. ratio trace for dimensionality reduction. In 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1–8.
[46] Yu Guang Wang, Ming Li, Zheng Ma, Guido Montufar, Xiaosheng Zhuang, and Yanan Fan. 2020. Haar graph pooling. In International Conference on Machine Learning. PMLR, 9952–9962.
[47] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018).
[48] Yuhua Xu, Junli Wang, Mingjian Guang, Chungang Yan, and Changjun Jiang. 2022. Multistructure Graph Classification Method With Attention-Based Pooling. IEEE Transactions on Computational Social Systems (2022).
[49] Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L Hamilton, and Jure Leskovec. 2018. Hierarchical graph representation learning with differentiable pooling. arXiv preprint arXiv:1806.08804 (2018).
[50] Hualei Yu, Jinliang Yuan, Hao Cheng, Meng Cao, and Chongjun Wang. 2021. GSAPool: Gated Structure Aware Pooling for Graph Representation Learning. In 2021 International Joint Conference on Neural Networks (IJCNN). 1–8. https://doi.org/10.1109/IJCNN52387.2021.9534320
[51] Hao Yuan and Shuiwang Ji. 2020. StructPool: Structured graph pooling via conditional random fields. In Proceedings of the 8th International Conference on Learning Representations.
[52] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. 2018. An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence.
[53] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2020. Graph neural networks: A review of methods and applications. AI Open 1 (2020), 57–81.