Community Detection With Graph Neural Networks
Abstract
We study data-driven methods for community detection in graphs. This estimation
problem is typically formulated in terms of the spectrum of certain operators, as well
as via posterior inference under certain probabilistic graphical models. Focusing on
random graph families such as the Stochastic Block Model, recent research has unified
these two approaches, and identified both statistical and computational signal-to-noise
detection thresholds.
We embed the resulting class of algorithms within a generic family of graph neural networks and show that they can reach those detection thresholds in a purely data-driven manner, without access to the underlying generative models and with no parameter assumptions. The resulting model is also tested on real datasets, requiring fewer computational steps and performing significantly better than rigid parametric models.
1 Introduction
Clustering and community detection are fundamental unsupervised data analysis tasks. Given relational observations between a set of datapoints, the task consists of inferring a community structure across the dataset that enables non-linear dimensionality reduction and analysis. Efficient clustering algorithms such as k-means exist and are heavily used across diverse data science areas. However, they rely on having an appropriate Euclidean embedding of the data.
By formulating this task as a graph partitioning problem, spectral clustering methods obtain such an embedding from the leading eigenvectors of appropriate operators defined on the graph, such as the Normalized Graph Laplacian. This leads to efficient algorithms, yet
there is no general procedure to construct the “correct” graph operator from the data. Another formalism is based on Probabilistic Graphical Models. By postulating the community structure as a latent, unobserved variable, authors have constructed latent generative models, using for instance pairwise Markov random fields, where inferring the community structure can be seen as a form of posterior inference over the graphical model. Whereas such models offer a flexibility that spectral clustering lacks, Maximum-Likelihood Estimation over such graphical models is in general intractable. However, when the underlying graph has a particular ‘tree-like’ structure, Belief Propagation (BP) provides a way forward. In the specific instance of the Stochastic Block Model (SBM), a recent research program ([1] and references therein) has bridged the gap between probabilistic and spectral methods using tools from Statistical Physics, leading to a rich understanding of statistical and computational estimation limits.
In this work, we study to what extent one can learn those algorithms from examples. In other words, we observe instances of graphs together with their true community structure, and attempt to learn a mapping between graphs and their predicted communities. We propose to do so by unrolling the generic inference algorithms above and training them with backpropagation. Our motivation is both to obtain computationally efficient estimation and to gain robustness against model misspecification.
Spectral clustering algorithms perform power iterations, which alternate between localized linear operators on the graph, point-wise nonlinearities, and normalization. Such an algorithm can therefore be “unrolled” and recast as a neural network, similarly to [8]. In our scenario, the resulting neural network for the community detection task is the graph neural network (GNN) [20, 3]. Inspired by works that assimilate efficient Belief Propagation algorithms with specific perturbations of the spectrum of the Graph Laplacian, we propose key modifications to GNNs that provide a robust model which operates well in scenarios where standard spectral clustering methods fail.
We first study our GNN model in the synthetic Stochastic Block Model, for which the
computational and information-theoretic detection regimes and corresponding algorithms
are well-known. We show that our network learns to reach those detection thresholds with
no explicit knowledge of the model, and with improved computational efficiency. Next,
we demonstrate the applicability of our framework on real-world community detection
problems, by showing that our data-driven model is able to outperform existing community
detection algorithms based on parametric generative families.
To summarize, our main contributions are:
• We propose to use graph neural networks to perform data-driven spectral analysis
by unrolling power iterations.
• We show that on the Stochastic Block Model we reach detection thresholds in a purely data-driven fashion.
• We show how our model can be applied to real-world datasets, leading to state-of-
the-art community detection results.
For a partition of the vertices into two communities encoded by $s \in \{\pm 1\}^V$, the quantity $\sum_{s(i) \neq s(j)} A_{i,j}$ measures the cost associated with cutting the graph between the communities encoded by $s$, which we wish to minimize under appropriate constraints [17]. Note that $\sum_{i,j} A_{i,j} = s^\top D s$, with $D = \mathrm{diag}(A \mathbf{1})$ (called the degree matrix), so the cut cost can be expressed, up to constants, through the positive semidefinite quadratic form

$$\min_{s(i) = \pm 1} s^\top (D - A)\, s = s^\top \Delta\, s$$

that we wish to minimize. This shows a fundamental connection between the community structure and the spectrum of certain linear operators of the graph, which provides a powerful and stable relaxation of the discrete combinatorial optimization problem of estimating the community labels for each node. In the case of the graph Laplacian $\Delta = D - A$, the eigenvector associated with the smallest eigenvalue is trivial, but its Fiedler vector (the eigenvector associated with the second smallest eigenvalue) reveals important community information of the graph [17] under appropriate conditions, and is associated with the graph conductance [21] under certain normalization schemes.
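As a toy illustration of this connection, the following sketch (NumPy; all helper names are ours) builds the unnormalized Laplacian of a small two-block graph and reads the partition off the sign pattern of the Fiedler vector:

```python
import numpy as np

def fiedler_partition(A):
    """Split a graph in two using the sign pattern of the Fiedler vector
    of the unnormalized Laplacian Delta = D - A."""
    D = np.diag(A.sum(axis=1))
    eigvals, eigvecs = np.linalg.eigh(D - A)   # ascending eigenvalues
    fiedler = eigvecs[:, 1]                    # second-smallest eigenvalue
    return np.where(fiedler >= 0, 1, -1)

# Two triangles joined by a single edge: nodes {0,1,2} vs {3,4,5}.
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
np.fill_diagonal(A, 0.0)
A[2, 3] = A[3, 2] = 1.0
print(fiedler_partition(A))   # e.g. [ 1  1  1 -1 -1 -1]
```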
For a given linear operator $L(A)$ extracted from the graph (that we assume symmetric), we are thus interested in extracting eigenvectors at the edge of its spectrum. A particularly simple algorithmic framework is given by the power iteration method. Indeed, the Fiedler vector of $L(A)$ can be obtained by first extracting the leading eigenvector $v$ of $\tilde{A} = \|L(A)\| I - L(A)$, and then iterating

$$y^{(n)} = \tilde{A}\, w^{(n-1)}\,, \qquad w^{(n)} = \frac{y^{(n)} - \langle y^{(n)}, v \rangle v}{\| y^{(n)} - \langle y^{(n)}, v \rangle v \|}\,.$$
Unrolling power iterations and recasting the resulting model as a trainable neural network
is akin to the LISTA [8] sparse coding model, which unrolled iterative proximal splitting
algorithms.
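To make the unrolling concrete, here is a minimal sketch of the deflated power iteration above (NumPy; the iteration count and random initialization are our illustrative choices, not prescribed by the text):

```python
import numpy as np

def fiedler_by_power_iteration(L, n_iter=200, seed=0):
    """Approximate the Fiedler vector of a symmetric operator L(A) by
    power iteration on A_tilde = ||L|| I - L, deflating against the
    leading eigenvector v of A_tilde at every step."""
    norm_L = np.linalg.norm(L, 2)                 # spectral norm ||L(A)||
    A_tilde = norm_L * np.eye(L.shape[0]) - L
    v = np.linalg.eigh(A_tilde)[1][:, -1]         # leading eigenvector of A_tilde
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(L.shape[0])
    for _ in range(n_iter):
        y = A_tilde @ w                           # y^(n) = A_tilde w^(n-1)
        y -= (y @ v) * v                          # deflation: remove the v component
        w = y / np.linalg.norm(y)                 # renormalize
    return w
```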
Despite the appeal of graph Laplacian spectral approaches, it is well known [13] that these methods fail on sparsely connected graphs. Indeed, in such scenarios, the eigenvectors of graph Laplacians concentrate on nodes with dominant degree, losing their ability to correlate with the community structure. In order to overcome this important limitation, authors have resorted to ideas inspired by statistical physics, as explained next.
The beliefs $b_{i \to j}(\sigma_i)$ are interpreted as marginal distributions of $\sigma_i$. Fixed points of BP can be used to recover the marginals of the MRF above. When the graph is a tree, the correspondence is exact: $P_i(\sigma_i) = b_i(\sigma_i)$. Some sparse graphs, such as the Stochastic Block Model with constant average degree [16], are sufficiently locally tree-like for this approximation to be successful. BP approximates the MLE solution, but convergence is not guaranteed on graphs that are not trees. Furthermore, in order to apply BP, we need a generative model together with the correct parameters of that model. If unknown, the parameters can be estimated with expectation maximization, which adds complexity and instability to the method, since the iterations may learn parameters for which BP does not converge.
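To make the message-passing explicit, here is a hedged sketch of synchronous loopy BP on a pairwise MRF with q states, uniform node potentials, and a single shared edge potential `phi` (all names are ours; in the SBM case `phi` would be derived from the model parameters a and b):

```python
import numpy as np

def loopy_bp(edges, n, phi, q, n_iter=50):
    """Synchronous BP on a pairwise MRF with q states, uniform node
    potentials, and one shared (q, q) edge potential phi.
    Returns the node beliefs b_i(sigma_i)."""
    directed = [(i, j) for i, j in edges] + [(j, i) for i, j in edges]
    msg = {e: np.full(q, 1.0 / q) for e in directed}   # uniform start
    neighbors = {v: [] for v in range(n)}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    for _ in range(n_iter):
        new = {}
        for i, j in directed:
            prod = np.ones(q)                          # a function of sigma_i
            for k in neighbors[i]:
                if k != j:
                    prod *= msg[(k, i)]                # incoming messages to i
            m = phi.T @ prod                           # marginalize sigma_i out
            new[(i, j)] = m / m.sum()
        msg = new
    beliefs = np.ones((n, q))
    for k, i in directed:
        beliefs[i] *= msg[(k, i)]
    return beliefs / beliefs.sum(axis=1, keepdims=True)
```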
2.3 Non-backtracking operator and Bethe Hessian
The BP equations have a trivial fixed point where every node takes equal probability in each group. Linearizing the BP equations around this point is equivalent to spectral clustering with the non-backtracking matrix (NB), a matrix defined on the directed edges of the graph that indicates whether two edges are adjacent and do not coincide. Spectral clustering with NB gives significant improvements over spectral clustering with versions of the Laplacian (L) and the adjacency matrix (A) [13]: high degree fluctuations drown out the signal of the informative eigenvalues in the case of A and L, whereas the bulk of NB's spectrum is confined to a disk in the complex plane, so the eigenvalues corresponding to community structure lie outside the disk and are easily distinguishable.
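The non-backtracking matrix is straightforward to materialize from the edge list; the following dense sketch (NumPy, quadratic in the number of directed edges, so for illustration only) spells out its definition:

```python
import numpy as np

def nonbacktracking_matrix(edges):
    """Materialize the non-backtracking matrix B on directed edges:
    B[(u -> v), (w -> x)] = 1 iff v == w and x != u."""
    directed = [(u, v) for u, v in edges] + [(v, u) for u, v in edges]
    index = {e: k for k, e in enumerate(directed)}
    B = np.zeros((len(directed), len(directed)))
    for u, v in directed:
        for w, x in directed:
            if v == w and x != u:       # continue the walk, never backtrack
                B[index[(u, v)], index[(w, x)]] = 1.0
    return B, directed
```

Since B is not symmetric, one inspects its complex spectrum (e.g. with `np.linalg.eigvals(B)`); the real eigenvalues escaping the bulk disk carry the community signal discussed above.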
NB matrices are still not optimal, in that they are defined on the edge set and are not symmetric (so they cannot benefit from the numerical linear algebra tools available for symmetric matrices). Recently, Saade et al. [19] showed that a spectral method can do as well as BP in this regime, using the Bethe Hessian operator $BH(r) := (r^2 - 1) I - r A + D$ (where $r$ is a scalar). This is due to a one-to-one correspondence between the fixed points of BP and the stationary points of the Bethe free energy (the Gibbs energy of the Bethe approximation) [25]. The Bethe Hessian is a scaling of the Hessian of the Bethe free energy at the extremum corresponding to the trivial fixed point of BP. Negative eigenvalues of $BH(r)$ correspond to phase transitions in the Ising model where new clusters become identifiable. This all gives theoretical motivation for why [I, D, A], defined in Section 3, is a good family of generators for spectral clustering. In the case of the SBM, these generators can express the Bethe Hessian, which achieves community detection down to the information-theoretic threshold. The GNN is capable of expressing spectral approximations of complicated functions of [I, D, A], and of performing nonlinear power iterations, in order to infer global structure (for instance community structure). Furthermore, unlike belief propagation, our method does not require a generative model, which oftentimes requires extensive statistical analysis to motivate and is exposed to model misspecification when deployed on real data. Instead, our framework finds structure in a data-driven way, learning it from the available training data.
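A hedged sketch of Bethe Hessian spectral clustering for two communities follows (NumPy); the choice $r = \sqrt{\bar d}$ is a common heuristic for assortative graphs and is our assumption here, not a prescription of the text:

```python
import numpy as np

def bethe_hessian(A, r):
    """BH(r) = (r^2 - 1) I - r A + D."""
    D = np.diag(A.sum(axis=1))
    return (r ** 2 - 1.0) * np.eye(A.shape[0]) - r * A + D

def bh_two_communities(A):
    r = np.sqrt(A.sum(axis=1).mean())      # heuristic r ~ sqrt(mean degree)
    eigvals, eigvecs = np.linalg.eigh(bethe_hessian(A, r))
    # Eigenvectors with negative eigenvalues carry community information;
    # the most negative mode is typically a near-constant "trivial" one,
    # so for two communities we read the partition off the next one.
    return np.where(eigvecs[:, 1] >= 0, 1, 0)
```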
Powers $A^j$ of the adjacency matrix encode $j$-hop neighborhoods of each node, and allow us to combine and aggregate local information at different scales. We also allow a channel $U$ that broadcasts information globally, giving the GNN the ability to recover average degrees and moments of degrees, via $(U(x))_i = \frac{1}{|V|} \sum_j x_j$.
We consider a multiscale GNN layer that receives as input a signal $x^{(k)} \in \mathbb{R}^{V \times d_k}$ and produces $x^{(k+1)} \in \mathbb{R}^{V \times d_{k+1}}$ as

$$x^{(k+1)}_{i,l} = \rho\Big( \theta^{(k)}_{1,l}\, x_i + \theta^{(k)}_{2,l}\, (Dx)_i + \theta^{(k)}_{3,l}\, (Ux)_i + \sum_{j=0}^{J-1} \theta^{(k)}_{4+j,l}\, (A^{2^j} x)_i \Big)\,, \quad l = 1, \dots, d_{k+1}/2\,,$$

$$x^{(k+1)}_{i,l} = \tilde\theta^{(k)}_{1,l}\, x_i + \tilde\theta^{(k)}_{2,l}\, (Dx)_i + \tilde\theta^{(k)}_{3,l}\, (Ux)_i + \sum_{j=0}^{J-1} \tilde\theta^{(k)}_{4+j,l}\, (A^{2^j} x)_i\,, \quad l = d_{k+1}/2 + 1, \dots, d_{k+1}\,. \tag{3}$$
Such a layer can in particular emulate one step of the deflated power iteration from Section 2, with $\tilde{M}$ a learned operator built from the generators:

$$y^{(n+1)} = \tilde{M} x^{(n)}\,, \qquad x^{(n+1)} = \frac{y^{(n+1)} - v v^\top y^{(n+1)}}{\| y^{(n+1)} - v v^\top y^{(n+1)} \|}\,.$$
Figure 1: We input an arbitrary signal (random or informative) and output a classification of the nodes. Colour saturation represents the magnitude of the signal, whereas colour differences encode the label classes (red versus blue in this case).
If $v$ is a constant vector, then the normalization above is precisely the one performed by the Batch Normalization step.
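A minimal PyTorch sketch of one layer of (3) is given below; we take $\rho$ to be a ReLU (an assumption, since the text does not fix $\rho$), apply the dyadic powers $A^{2^j}$ by repeated matrix products, and all module and variable names are ours:

```python
import torch
import torch.nn as nn

class MultiscaleGNNLayer(nn.Module):
    """One layer of eq. (3), with generators {I, D, U, A^(2^0), ..., A^(2^(J-1))}.
    Half of the output channels pass through the nonlinearity rho (a ReLU
    here), the other half stay linear, as in the second line of (3)."""
    def __init__(self, d_in, d_out, J):
        super().__init__()
        self.J = J
        n_ops = 3 + J                                  # I, D, U, and J powers of A
        self.lin_rho = nn.Linear(n_ops * d_in, d_out // 2)
        self.lin_id = nn.Linear(n_ops * d_in, d_out - d_out // 2)

    def forward(self, x, A, deg):
        # x: (V, d_in); A: (V, V) adjacency; deg: (V,) node degrees.
        u = x.mean(dim=0, keepdim=True).expand_as(x)   # (U x)_i = mean_j x_j
        feats = [x, deg.unsqueeze(1) * x, u]           # I x, D x, U x
        p, applied = x, 0
        for j in range(self.J):
            while applied < 2 ** j:                    # apply A until p = A^(2^j) x
                p = A @ p
                applied += 1
            feats.append(p)
        z = torch.cat(feats, dim=1)
        return torch.cat([torch.relu(self.lin_rho(z)), self.lin_id(z)], dim=1)
```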
We bootstrap the network by taking as input signal $x^{(0)} = \deg$, the vector of node degrees. After performing $K$ steps of (3), we use the resulting node-level features to predict the community of each node. Let $\mathcal{C} = \{c_1, \dots, c_C\}$ denote the set of possible community labels that each node can take.
Consider first the case where communities do not overlap, so that $C$ equals the number of existing communities. We define the network output at each node with a standard softmax, computing the conditional probability that node $i$ belongs to community $c$:

$$o_{i,c} = \frac{\exp\!\big(\langle \theta^{(o)}_{c}, x^{(K)}_{i,\cdot} \rangle\big)}{\sum_{c'} \exp\!\big(\langle \theta^{(o)}_{c'}, x^{(K)}_{i,\cdot} \rangle\big)}\,, \qquad c = 1, \dots, C\,.$$

Let $G = (V, E)$ be the input graph and let $y \in \mathcal{C}^V$ be the ground-truth community structure. Since community belonging is defined only up to a global relabeling of the communities, we define the loss associated with a given graph instance as

$$\ell(\theta) = \inf_{\sigma \in S_C} \sum_{i \in V} -\log o_{i,\sigma(y_i)}\,. \tag{4}$$
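Loss (4) can be implemented by brute force over the $C!$ global permutations, which is feasible for the small numbers of communities considered here (PyTorch sketch; names are ours):

```python
import itertools
import torch
import torch.nn.functional as F

def permutation_invariant_loss(logits, y, n_classes):
    """Eq. (4) up to a 1/|V| normalization: the minimum over all global
    label permutations sigma of the cross-entropy -mean_i log o_{i, sigma(y_i)}.
    logits: (V, C) network outputs; y: (V,) ground-truth labels."""
    log_probs = F.log_softmax(logits, dim=1)            # log o_{i,c}
    losses = []
    for perm in itertools.permutations(range(n_classes)):
        sigma = torch.tensor(perm, device=y.device)
        losses.append(F.nll_loss(log_probs, sigma[y]))  # relabel the ground truth
    return torch.stack(losses).min()
```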
4 Related Work
The GNN was first proposed in [20]. [4] generalized convolutional networks to graphs by using the graph Laplacian's eigenbasis. Signals on a graph projected onto this basis have rapidly decaying coefficients when they vary smoothly along the graph. [5] and [12] both use symmetric Laplacian modules as effective embedding mechanisms for graph signals, and the latter considered the related task of semi-supervised learning on graphs. However, this basis restricts the expressive power of the resulting spectral approximations, a limitation that precludes their application to very sparse graphs (for instance near the information-theoretic threshold of the SBM). [7] interpreted the GNN architecture as learning a message-passing algorithm. As mentioned in our setup, that is a natural relaxation of the inference problem on the MRF, but it does not tell the whole story with respect to clustering and the need for the generators [I, D, A] in the GNN layers. The present work deals with graphs that are far sparser than those in previous applications of neural networks on graphs. We are able to be competitive with spectral methods specifically designed to work down to the information-theoretic regime, which could not be achieved with previous GNN implementations. See [3] for a recent and more exhaustive survey of deep learning on graphs.
[26] works on data-driven regularization for clustering and rank estimation, and is also motivated by the success of Bethe-Hessian-like perturbations in improving spectral methods on sparse networks. They find good perturbations via matrix perturbation analysis, and also report success on the Stochastic Block Model. Leskovec and coauthors have done extensive work on community detection: curating benchmark datasets and quantifying their quality [24], and designing new community detection algorithms that fit data to newly designed generative models, models exhibiting the statistical structure learned from their analysis of the aforementioned datasets [23].
5 SBM
We briefly review the main properties needed in our analysis, and refer the interested reader to [1] for an excellent recent review. The Stochastic Block Model (SBM) is a random graph model denoted by $SBM(n, p, q, K)$. Implicitly, there is a map $F : V \to \{1, \dots, K\}$ associated with each SBM graph, which assigns a community label to each vertex. One obtains a graph from this generative model by starting with $n$ vertices and connecting any two vertices $u, v$ independently at random with probability $p$ if $F(u) = F(v)$, and with probability $q$ if $F(u) \neq F(v)$. We say the SBM is balanced if the communities have the same size. Let $\bar F_n : V \to \{1, \dots, K\}$ be our predicted community labels for $SBM(n, p, q, K)$. The $\bar F_n$ give exact recovery on a sequence $\{SBM(n, p, q, K)\}_n$ if $\mathbb{P}(F_n = \bar F_n) \to_n 1$, and give detection if there exists $\epsilon > 0$ such that, with probability tending to one, the agreement between $F_n$ and $\bar F_n$ (up to a global permutation of the labels) is at least $1/K + \epsilon$, i.e. the $\bar F_n$ do better than random guessing.
It is harder to tell communities apart when $p$ is close to $q$ (if $p = q$ we obtain an Erdős–Rényi random graph, which has no communities). In the two-community case, it was shown that exact recovery is possible on $SBM(n, p = \frac{a \log n}{n}, q = \frac{b \log n}{n})$ if and only if $\frac{a+b}{2} \geq 1 + \sqrt{ab}$ [15, 2]. For exact recovery to be possible, $p$ and $q$ must scale at least like $\log n / n$, or else the sequence of graphs will not be connected, and thus vertex labels will be underdetermined. There is no information-computation gap in this regime, so polynomial-time algorithms exist whenever recovery is possible [1, 15]. In the much sparser constant-degree regime $SBM(n, p = \frac{a}{n}, q = \frac{b}{n})$, detection is the best we can hope for. The constant-degree regime is also of most interest for real-world applications, as most large datasets are extremely sparse with bounded degree. It is also a very challenging regime: spectral approaches using the Laplacian in its various (un)normalized forms or the adjacency matrix, as well as SDP methods, cannot detect communities in this regime [1], due to large fluctuations in the degree distribution that prevent eigenvectors from concentrating on the clusters.
In the constant-degree regime with $k$ balanced communities, the Kesten–Stigum threshold is expressed in terms of the signal-to-noise ratio $SNR := (a - b)^2 / (k(a + (k-1)b))$, with the threshold at $SNR = 1$ [1]. It has been shown for $k = 2$ that $SNR = 1$ is both the information-theoretic threshold and the computational one, at which detection becomes achievable in polynomial time by belief propagation (BP). For $k \geq 4$, a gap emerges between the information-theoretic threshold and the computational one. It is conjectured that no polynomial-time algorithm exists for $SNR < 1$, while a BP algorithm works for $SNR > 1$ [1]. The existence of the gap was shown in [1] by proving that a non-polynomial algorithm can perform detection at some $SNR < 1$.
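For reference, a sampler for the balanced constant-degree regime $SBM(n, p = a/n, q = b/n, K)$ might look as follows (NumPy sketch; not the authors' code):

```python
import numpy as np

def sample_sbm(n, a, b, k, seed=0):
    """Constant-degree SBM: within-class edge probability a/n,
    across-class probability b/n. Returns (adjacency, labels)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)                # roughly balanced classes
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, a / n, b / n)
    upper = np.triu(rng.random((n, n)) < probs, 1)     # each pair sampled once
    A = (upper | upper.T).astype(float)                # symmetric, no self-loops
    return A, labels
```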
6 Experiments
6.1 GNN Performance Near Information Theoretic Threshold of SBM
Our performance measure is the overlap between the predicted labels $\bar F$ and the true labels $F$, which quantifies how much better than random guessing a predicted labelling is. The overlap is given by

$$\Big( \frac{1}{n} \sum_u \delta_{F(u), \bar F(u)} - \frac{1}{k} \Big) \Big/ \Big( 1 - \frac{1}{k} \Big)\,,$$

where $\delta$ is the Kronecker delta and the labels are defined up to a global permutation (a brute-force sketch of this metric is given after this paragraph). The GNNs were all trained with 30 layers, 10 feature maps, $J = 3$ in the middle layers, and $n = 1000$. We used Adamax [11] with learning rate 0.001. We consider two learning scenarios. In the first scenario, we train parameters $\theta$ conditional on $a$ and $b$, by producing 6000 samples $G \sim SBM(n = 1000, a_i, b_i, k = 2)$ for different pairs $(a_i, b_i)$ and estimating the resulting $\theta(a_i, b_i)$. In the second scenario, we train a single set of parameters $\theta$ from 6000 samples

$$G \sim SBM\big(n = 1000,\; a = k \bar d - b,\; b \sim \mathrm{Unif}(0, \bar d - \sqrt{\bar d}),\; k = 2\big)\,,$$

where the average degree $\bar d$ is either a fixed constant or also randomized, with $\bar d \sim \mathrm{Unif}(1, t)$. This training set is important because it shows that our GNN is not merely approximating the BH spectral method: the optimal $r$ is not constant over this dataset. Instead, the model's competitive performance in this regime shows that the GNN is able to learn a higher-dimensional representation of the optimal $r$ as a function of the data.
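Below is the promised brute-force sketch of the overlap (NumPy; fine for small $k$, since it enumerates all $k!$ permutations):

```python
import itertools
import numpy as np

def overlap(y_true, y_pred, k):
    """Overlap between a predicted and a true labelling: the best agreement
    over global label permutations, rescaled so that random guessing
    scores about 0 and perfect recovery scores 1."""
    best = max(
        np.mean(np.asarray(perm)[y_pred] == y_true)
        for perm in itertools.permutations(range(k))
    )
    return (best - 1.0 / k) / (1.0 - 1.0 / k)
```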
Our GNN model is either competitive with or outperforms BH, which together with BP achieves the state of the art, despite having no access to the underlying generative model (this is especially notable in the cases where the GNN was trained on a mixture of SBMs and thus must generalize over the $r$ parameter of BH). All of these beat, by a wide margin, spectral clustering with the symmetric Laplacian (Ls) and the power method (pm) applied to $\|BH\| I - BH$ with the same number of layers as our model. Thus the GNN's ability to predict labels goes beyond approximating a spectral decomposition by learning the optimal $r$ for $BH(r)$. The model architecture allows it to learn a higher-dimensional function of the optimal perturbation of the multiscale adjacency basis, as well as nonlinear power iterations that amplify the informative signals in the spectrum; in a data-driven way, it can generalize without needing to study a generative model.
Figure 2: SBM detection. Left: $k = 2$ associative; right: $k = 2$ disassociative. The x-axis corresponds to SNR, the y-axis to overlap; see text.

Figure 3: GNN mixture (Graph Neural Network trained on a mixture of SBMs with average degree 3), GNN full mixture (GNN trained over different SNR regimes, some below threshold), $BH(\sqrt{\bar d})$ and $BH(-\sqrt{\bar d})$. Left: $k = 2$; right: $k = 4$. We verify that $BH(r)$ models cannot perform detection at both ends of the spectrum simultaneously.
Due to computational limitations, we restrict our attention to the three smallest graphs in the SNAP collection (Youtube, DBLP and Amazon), and we restrict the largest community size to 800 nodes, which is a conservative bound, since the average community size on these graphs is below 30.
We compare GNN's performance with the Community-Affiliation Graph Model (AGM). The AGM is a generative model defined in [23] that allows for overlapping communities, where overlapping areas have higher density. This statistical property was observed in many real datasets with ground-truth communities but absent from earlier generative models, and AGM was shown to outperform prior algorithms. AGM fits its model parameters to the data in order to give community predictions, and we use the recommended default parameters. Table 1 compares the performance, measured with a 3-class {1, 2, 1+2} classification accuracy up to the global permutation 1 ↔ 2. We stress, however, that the experimental setup is different from the one in [23], which may impact the performance of AGM. Nonetheless, this experiment illustrates the benefits of data-driven models that strike the right balance between the expressive power needed to adapt to model mis-specifications and the structural assumptions of the task at hand.
7 Conclusion
In this work we have studied data-driven approaches to clustering with graph neural networks. Our results confirm that, even when the signal-to-noise ratio is at the lowest detectable regime, it is possible to backpropagate detection errors through a graph neural network that can ‘learn’ to extract the spectrum of an appropriate operator. This is made possible by considering generators that span the appropriate family of graph operators, which can operate on sparsely connected graphs.

One word of caution: our results are inherently non-asymptotic, and further work is needed to confirm that learning remains possible as $|V|$ grows. Nevertheless, our results open up interesting questions, namely understanding the energy landscape that our model traverses as a function of the signal-to-noise ratio, and whether the network parameters can be interpreted mathematically. This could be useful in the study of computational-to-statistical gaps, where our model could be used to inquire about the form of computationally tractable approximations.
Besides such theoretical considerations, we are also interested in pursuing further applications of our model, owing to its good computational complexity of $O(|V| \log |V|)$. So far the model presumes that the number of communities to estimate is known; we will explore generalisations that also estimate this number, and scale the approach to large graphs. Our model can also readily be applied to data-driven ranking tasks.
References
[1] Emmanuel Abbe. Community detection and stochastic block models: recent developments. arXiv preprint arXiv:1703.10146, 2017.

[2] Emmanuel Abbe, Afonso S. Bandeira, and Georgina Hall. Exact recovery in the stochastic block model. arXiv preprint arXiv:1405.3267, 2014.

[3] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 2017.

[4] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
[6] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy
Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs
for learning molecular fingerprints. In Proc. NIPS, 2015.
[7] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.

[8] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proc. ICML, 2010.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for
image recognition. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 770–778, 2016.
[10] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network
training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[11] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
[12] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolu-
tional networks. arXiv preprint arXiv:1609.02907, 2016.
[13] Florent Krzakala, Cristopher Moore, Elchanan Mossel, Joe Neeman, Allan Sly, Lenka Zdeborová, and Pan Zhang. Spectral redemption in clustering sparse networks. Proceedings of the National Academy of Sciences, 110(52):20935–20940, 2013.

[14] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.

[15] Elchanan Mossel, Joe Neeman, and Allan Sly. A proof of the block model threshold conjecture. arXiv preprint arXiv:1311.4115, 2014.

[16] Elchanan Mossel, Joe Neeman, and Allan Sly. A proof of the block model threshold conjecture. arXiv preprint arXiv:1311.4115, 2016.

[18] Oliver Riordan and Nicholas Wormald. The diameter of sparse random graphs. Combinatorics, Probability and Computing, 19(5-6):835–926, 2010.
[19] Alaa Saade, Florent Krzakala, and Lenka Zdeborová. Spectral clustering of graphs
with the bethe hessian. In Advances in Neural Information Processing Systems, pages
406–414, 2014.
[20] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele
Monfardini. The graph neural network model. IEEE Trans. Neural Networks,
20(1):61–80, 2009.
[22] Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Learning multiagent communication with backpropagation. In Proc. NIPS, 2016.

[23] Jaewon Yang and Jure Leskovec. Community-affiliation graph model for overlapping network community detection. In Proceedings of the 2012 IEEE 12th International Conference on Data Mining (ICDM), pages 1170–1175, 2012.

[24] Jaewon Yang and Jure Leskovec. Defining and evaluating network communities based on ground-truth. In Proc. ICDM, 2012.
[25] Jonathan S Yedidia, William T Freeman, and Yair Weiss. Understanding belief propagation and its generalizations. Exploring artificial intelligence in the new millennium, 8, 2003.
[26] Pan Zhang. Robust spectral detection of global structures in the data by learning a regularization. arXiv preprint arXiv:1609.02906, 2016.