features. We note that in our approach the number of fixed clusters p is arbitrary. It does not need to be a power of two, i.e. $p = 2^k$, as when repeated bisection is used k times.
Also, we analyse the effects of using single and double precision to solve the problem, the latter sometimes being faster overall because lower precision can require more iterations to converge. Moreover, we outline how the proposed approach could be modified to find an adaptive number of clusters. In particular, we show that the clustering information can be derived from a smaller, the same, or a larger number of eigenvectors, with the first case often trading lower quality for higher performance.
Finally, in our experiments we compare the clustering obtained by the modularity approach developed in this paper to previous work. We comment on the quality and performance tradeoffs when they are applied to large social network graphs, which often have a power-law-like distribution of edges per node. Also, we highlight the performance obtained by our novel parallel approach on the GPU. For example, it can find 7 clusters with a modularity score over 0.5 in about 0.8 seconds for the hollywood-2009 network with over a hundred million undirected edges.
2 Graph Clustering
Let a graph G = (V, E) be defined by its vertex set V and edge set E. The vertex set V = {1, ..., n} represents the n nodes in a graph, with each node identified by a unique integer number i ∈ V. The edge set $E = \{w_{i_1,j_1}, ..., w_{i_m,j_m}\}$ represents the m weighted edges in a graph, with each undirected edge identified by its weight $w_{i,j} \ge 0$, $w_{i,j} \in E$.
Let the weighted adjacency matrix $A = [a_{i,j}]$ of a graph G = (V, E) be defined through its elements $a_{i,j} = w_{i,j}$ if there is an edge connecting i to j, and 0 otherwise. Notice that the matrix A is symmetric because the graph is assumed to be undirected and therefore $w_{i,j} \equiv w_{j,i}$. Also, assume that we do not include self-edges (diagonal elements) in the definition of the weighted adjacency matrix A.
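As a small illustration (a hypothetical edge list, with scipy.sparse as a stand-in for the paper's GPU data structures), the weighted adjacency matrix can be assembled as follows:

```python
import numpy as np
from scipy.sparse import coo_matrix

# A small hypothetical undirected graph as a weighted edge list (i, j, w_ij);
# vertices are 0-indexed here, unlike the 1-indexed notation in the text.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 2.0), (2, 3, 0.5), (3, 4, 1.0)]
n = 5

# Insert each undirected edge in both directions so that A is symmetric;
# self-edges (diagonal elements) are simply never inserted.
rows = [i for i, j, w in edges] + [j for i, j, w in edges]
cols = [j for i, j, w in edges] + [i for i, j, w in edges]
vals = [w for i, j, w in edges] * 2

A = coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()
```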
In graph clustering we are often interested in finding a partitioning of the vertices V into p disjoint sets $S_k \subseteq V$ such that $\bigcup_{k=1}^{p} S_k = V$. Notice that we can equivalently express this partitioning as a function c(i) = k specifying the assignment of nodes i ∈ V into clusters k = 1, ..., p.
In the following discussion, let |.| denote the cardinality (number of elements) of a set and $d_i$ denote the degree (number of edges) of the vertex i ∈ V. Also, let us define the volume of a node $v_i = \sum_{j=1}^{n} a_{i,j}$ and the volume of a set of vertices

$$\mathrm{vol}(V) = \sum_{i=1}^{n} v_i = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{i,j} = 2\omega \quad (1)$$
Notice that for unweighted graphs $a_{i,j} = 1$ and therefore $v_i = d_i$ and $2\omega = 2m$.
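Continuing the sketch above, the node volumes and the total volume $2\omega$ follow directly from the row sums of A:

```python
# Node volumes v_i = sum_j a_ij, and vol(V) = sum_i v_i = 2*omega from Eq. (1).
v = np.asarray(A.sum(axis=1)).ravel()
two_omega = v.sum()

# For this weighted toy graph, 2*omega is twice the total edge weight.
assert np.isclose(two_omega, 2 * sum(w for _, _, w in edges))
```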
3 Modularity
An intuitive way to identify structure in a graph is to assume that similar vertices are connected by more edges in the current graph than they would be if the edges were placed randomly. The modularity measures the difference between how well vertices are assigned into clusters for the current graph G = (V, E), when compared to a random graph R = (V, F) [16, 17].
The reference random graph R = (V, F) is constructed with the same set of vertices, but a different set of edges than the current graph. The set of edges F of the random graph is
constructed such that the number of edges |F| = |E| = m and the degree $d_i$ of each vertex remain the same, but the edges themselves are rewired randomly between the vertices in V.
Notice that every broken edge generates two edge ends that are available for rewiring. Then, the weighted probability of a particular edge end being connected with some edge end at node i is $v_i/2\omega$. Therefore, the probability of nodes i and j being connected during the rewiring is $(v_i v_j)/2\omega$.
The modularity is the difference between the existing edges and the expected edges of the random graph, summed over all pairs of nodes that belong to the same cluster.
Definition 1. Let graph G = (V, E) and c(i) be an assignment of nodes into clusters. Then, modularity Q can be expressed as

$$Q = \frac{1}{2\omega} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( a_{i,j} - \frac{v_i v_j}{2\omega} \right) \delta_{c(i),c(j)}, \quad \text{where } \delta_{c(i),c(j)} = \begin{cases} 1 & \text{if } c(i) = c(j) \\ 0 & \text{otherwise} \end{cases} \quad (2)$$
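As a sanity check of Definition 1, the following continues the sketch with a direct (dense, $O(n^2)$) evaluation of Eq. (2); it is meant for small examples only, not as the paper's parallel method:

```python
def modularity(A, c):
    """Modularity Q of assignment c, evaluated directly from Eq. (2)."""
    Ad = A.toarray()                 # dense loop for clarity only
    v = Ad.sum(axis=1)
    two_omega = v.sum()
    n = Ad.shape[0]
    Q = 0.0
    for i in range(n):
        for j in range(n):
            if c[i] == c[j]:         # the Kronecker delta in Eq. (2)
                Q += Ad[i, j] - v[i] * v[j] / two_omega
    return Q / two_omega

c = np.array([0, 0, 0, 1, 1])        # a hypothetical assignment into 2 clusters
print(modularity(A, c))              # stays within [-0.5, 1], cf. Eq. (3)
```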
The above definition can be reduced to the special case in [16, 17] if we choose to ignore the edge weights during rewiring or work with unweighted graphs, in which case $\frac{v_i v_j}{2\omega} = \frac{d_i d_j}{2m}$. Also, notice that the modularity score is always bounded:

$$-\frac{1}{2} \le Q \le 1 \quad (3)$$
Let us now define the modularity matrix, state its properties and show its relationship to the modularity metric.

Definition 2. Let the volume vector $v^T = [v_1, ..., v_n]$, then the modularity matrix can be written as

$$B = A - \frac{1}{2\omega} vv^T \quad (4)$$

Lemma 2. The modularity and adjacency matrices have the following properties

$$Be = 0, \quad Ae = v, \quad v^T e = e^T A e = 2\omega \quad (5)$$

Notice that the modularity matrix B is symmetric indefinite. Also, using Lemma 2 we may conclude that it is singular, with an eigenvalue 0 and corresponding eigenvector $e = [1, ..., 1]^T$.
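Definition 2 and Lemma 2 are easy to check numerically on the small example above; note that B is completely dense even when A is sparse, so forming it explicitly is only sensible for tiny graphs:

```python
# Dense modularity matrix B = A - v v^T / (2*omega) from Eq. (4).
B = A.toarray() - np.outer(v, v) / two_omega

e = np.ones(n)
assert np.allclose(B @ e, 0.0)             # B e = 0          (Lemma 2)
assert np.allclose(A @ e, v)               # A e = v
assert np.isclose(e @ (A @ e), two_omega)  # e^T A e = 2*omega
```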
Let us now define a tall matrix $U = [u_{i,k}]$ that can be interpreted as a set of vectors $U = [u_1, ..., u_p]$, where each vector $u_k$ corresponds to a cluster $S_k$ for k = 1, ..., p, with elements $u_{i,k} = 1$ if c(i) = k and 0 otherwise. Then the modularity in (2) can equivalently be written in trace form as

$$Q = \frac{1}{2\omega} \mathrm{Tr}(U^T B U) \quad (6)$$
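The trace form (6) can be verified against Definition 1 by materializing the indicator matrix U, continuing the same sketch:

```python
# Indicator matrix U with u_ik = 1 if c(i) = k and 0 otherwise.
p = int(c.max()) + 1
U = np.zeros((n, p))
U[np.arange(n), c] = 1.0

Q_trace = np.trace(U.T @ B @ U) / two_omega
assert np.isclose(Q_trace, modularity(A, c))   # agrees with Eq. (2)
```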
Notice that ultimately we are interested in finding the cluster assignment c that achieves the maximum modularity

$$\max_{c} Q = \frac{1}{2\omega} \max_{U \in \mathcal{C}} \mathrm{Tr}(U^T B U) \quad (8)$$

where $\mathcal{C}$ denotes the set of admissible cluster indicator matrices.
Finding the exact solution of the modularity maximization problem stated in (8) is NP-complete [5]. However, we can find an approximation by relaxing the requirement that the elements of the matrix U take discrete values [12, 13].
Notice that $U^T U = D$, where $D = [d_{k,k}]$ is a $p \times p$ diagonal matrix with elements $d_{k,k} = |S_k|$. Then, introducing the auxiliary matrix $\tilde{U} = U D^{-1/2} \in \mathbb{R}^{n \times p}$, we can start by looking for

$$\max_{\tilde{U}^T \tilde{U} = I} \mathrm{Tr}(\tilde{U}^T B \tilde{U}) \quad (9)$$
Notice that by the Courant-Fischer theorem [9] this maximum is achieved by the largest eigenpairs of the modularity matrix. Now, we still need to convert the real values obtained in (9) back into a discrete assignment into clusters.
Since we are working in multiple dimensions, it is natural to use the distance between points as a metric for how to group them. In this case, if we interpret each row of the matrix $\tilde{U}$ as a point in a p-dimensional space, then it becomes natural to use a clustering algorithm, such as k-means [1, 11], to identify the p distinct partitions. We are not aware of a theoretical result guaranteeing that the obtained approximate solution will closely match the optimal discrete solution, but in practice we often do obtain a good approximation.
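A minimal sketch of this rounding step, with scipy's k-means (and the hypothetical helper name round_with_kmeans) standing in for the paper's parallel k-means:

```python
from scipy.cluster.vq import kmeans2

# Treat each row of the relaxed eigenvector matrix as a point in
# p-dimensional space and group the rows with k-means.
def round_with_kmeans(eigvecs, p, seed=0):
    _, labels = kmeans2(eigvecs, p, minit='++', seed=seed)
    return labels
```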
4 Algorithm
The outline of the modularity clustering technique is described in Alg. 1.
Notice that the general outline of the modularity clustering closely resembles the spectral
clustering in [13]. The main difference is that in the former case we use the modularity matrix
B and find its largest eigenpairs, while in the latter case we use the Laplacian matrix L and find
its smallest eigenpairs. The properties of modularity and Laplacian matrices are also different,
requiring a different choice of eigenvalue problem solvers.
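Alg. 1 is not reproduced here; the following is a minimal CPU sketch of the pipeline it outlines, with scipy standing in for the paper's GPU Lanczos solver and k-means. The modularity matrix is applied matrix-free, $Bx = Ax - v(v^T x)/2\omega$, so B is never formed explicitly:

```python
from scipy.sparse.linalg import LinearOperator, eigsh

def modularity_clusters(A, p, tol=1e-3, maxiter=800):
    """Sketch of the pipeline: largest eigenpairs of B, then k-means rounding."""
    n = A.shape[0]
    v = np.asarray(A.sum(axis=1)).ravel()
    two_omega = v.sum()

    # Matrix-free application of B from Eq. (4): B x = A x - v (v^T x) / (2w).
    B_op = LinearOperator((n, n),
                          matvec=lambda x: A @ x - v * (v @ x) / two_omega,
                          dtype=A.dtype)

    # Largest algebraic eigenpairs of B, cf. the relaxed problem in Eq. (9).
    _, eigvecs = eigsh(B_op, k=p, which='LA', tol=tol, maxiter=maxiter)
    return round_with_kmeans(eigvecs, p)

labels = modularity_clusters(A, p=2)
print(labels, modularity(A, labels))
```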
Figure 1: Profiles. (a) Profiling of the modularity algorithm; (b) profiling of the Lanczos eigensolver.
terms we can relate it to the difference in TDP¹ and time consumed by the algorithm on different hardware platforms. For instance, in the next section we will perform experiments on an Nvidia Titan X (Pascal) GPU and an Intel Core i7-3930K CPU with 250 and 130 Watts TDP, respectively. Also, we will show that our algorithm on the GPU outperforms the state-of-the-art implementation on the CPU by ∼3× on average. Since the speedup (∼3×) exceeds the ratio of TDPs (250/130 ∼ 2×) on these platforms, we can in general expect to achieve a better power efficiency on the GPU.
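As a back-of-the-envelope check with these figures (t denotes the CPU run time, so the GPU run time is roughly t/3):

$$\frac{E_{\mathrm{GPU}}}{E_{\mathrm{CPU}}} = \frac{P_{\mathrm{GPU}}\, t_{\mathrm{GPU}}}{P_{\mathrm{CPU}}\, t_{\mathrm{CPU}}} \approx \frac{250 \cdot (t/3)}{130 \cdot t} \approx 0.64$$

i.e., under these TDP assumptions the GPU run would consume roughly two thirds of the energy of the CPU run.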
5 Numerical Experiments
Let us now study the performance and quality of the clustering obtained by the proposed
modularity algorithm Alg. 1 on a relevant sample of graphs from the DIMACS10, LAW and SNAP
graph collections [24], shown in Tab. 1.
In the modularity algorithm, we let the stopping criterion for the Lanczos solver be based on the norm of the residual of the largest eigenpair, $\|r_1\|_2 = \|Bu_1 - \lambda_1 u_1\|_2 \le 10^{-3}$, with a maximum of 800 iterations (restarting every 20 iterations), while for k-means we let it be based on the scaled error difference $|\ell_t - \ell_{t-1}|/n < 10^{-2}$, with a maximum of 20 iterations.
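For reference, a hypothetical mapping of these criteria onto scipy's Lanczos wrapper, continuing the earlier sketch (scipy's tol is a relative eigenvalue tolerance, so it only approximates the residual test above, and the basis size ncv=20 plays the role of the 20-iteration restart length):

```python
# B_op denotes a modularity-matrix operator built as inside
# modularity_clusters above, for a graph with more than 20 vertices.
def solve_eigs(B_op, k=7):
    return eigsh(B_op, k=k, which='LA', tol=1e-3, maxiter=800, ncv=20)
```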
Also, all numerical experiments are performed on a workstation with the Ubuntu 14.04 operating system, the gcc 4.8.4 compiler, CUDA Toolkit 8.0, an Intel Core i7-3930K CPU at 3.2 GHz, and an Nvidia Titan X (Pascal) GPU. The performance of the algorithms was always measured across multiple runs to ensure consistency.
Table 1: The modularity (Mod), time (T) in milliseconds and # of iterations (It) achieved for 64 and 32 bit precision, when splitting the graph into 7 clusters. The column (Rand) contains the modularity score resulting from random cluster assignments.
First, notice that the modularity algorithm is robust and converged to a solution on all networks of interest. Also, notice that the computed modularity score remained in the interval [−0.5, 1], as predicted by the theory, for all the problems. Moreover, the modularity score computed by random assignment of nodes into clusters was 0 for all the networks, as expected. It is an important baseline for comparing attained modularity scores.

¹ Thermal Design Power (TDP) measures the average power a processor dissipates when operating with all cores active. The real energy usage may be different and may change depending on the hardware generation.
Second, notice the difference in behaviour of the algorithm when the computation is performed using single (32 bit) and double (64 bit) floating point arithmetic. In particular, notice that the total time to the solution can be significantly better in 64 bit than in 32 bit precision, as shown in the time column of Tab. 1. Indeed, single precision can result in unwanted perturbations during the computation of the Krylov subspace by the Lanczos eigenvalue solver. Those perturbations can impact the number of iterations and the overall quality of the approximation. Therefore, we have found that using 64 bit precision is a safer option.
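This comparison can be sketched by rerunning the earlier pipeline in both precisions; on graphs of realistic size the float32 Krylov basis accumulates larger rounding perturbations, which is what drives the iteration counts in Tab. 1:

```python
# Run the same solve in 32- and 64-bit arithmetic and compare the outcome.
for dtype in (np.float32, np.float64):
    A_t = A.astype(dtype)
    labels_t = modularity_clusters(A_t, p=2)
    print(dtype.__name__, modularity(A_t, labels_t))
```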
Figure 2: (a) The impact of varying the # of clusters used for assignment for different # of computed eigenvectors; (b) the modularity achieved when changing the # of clusters for the citationCiteseer network in 64 bit precision.
The above experiments lead us to propose the following method for discovering an approximation to the natural number of clusters, which has also been proposed for small networks in [22]. We propose computing as many eigenpairs as clusters up to a fixed point, such as 7, and afterwards continuing to increase the number of k-means clusters only, while keeping track of the modularity score, as shown in Fig. 2b and sketched below. Since the plotted modularity score curve has a Gaussian-like shape, it is straightforward to detect that its maximum is at 17 clusters on the x-axis. A similar trend can be seen for several other networks in our experiments. Moreover, we also found that it is better to overestimate than to underestimate the number of clusters.
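A minimal sketch of this sweep, reusing the hypothetical helpers above: the eigenvectors are computed once for a fixed number of eigenpairs, and only the k-means cluster count grows:

```python
def best_cluster_count(A, max_eigs=7, max_clusters=20):
    """Fix the number of eigenvectors, sweep the k-means cluster count."""
    n = A.shape[0]                    # assumes n > max_eigs
    v = np.asarray(A.sum(axis=1)).ravel()
    two_omega = v.sum()
    B_op = LinearOperator((n, n),
                          matvec=lambda x: A @ x - v * (v @ x) / two_omega,
                          dtype=A.dtype)
    _, eigvecs = eigsh(B_op, k=max_eigs, which='LA')   # computed once

    best = (None, -1.0)
    for p in range(2, max_clusters + 1):               # only k-means grows
        labels = round_with_kmeans(eigvecs, p)
        Q = modularity(A, labels)
        if Q > best[1]:
            best = (p, Q)
    return best   # (cluster count with the highest modularity, its score)
```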
Also, notice in Fig. 2b that when we increase the number of clusters by 10×, from 2 to 20, the time to compute them only grows from about 95 ms to 120 ms. The plotted time line has a very low slope with respect to the x-axis, because the number of computed eigenpairs does not increase past 7 in this experiment. Hence, the time growth shown in the figure only reflects the additional time spent in the k-means step.
Using this technique we were able to detect the best clustering for all of the networks in
Tab. 1. The resulting number of clusters and the modularity score found by our method are
shown in Tab. 2.
In general, our algorithm can very quickly compute modularity for many clusters with only limited memory requirements. For example, we computed 53 clusters in half a second for the coPapersCiteseer network with 16 million edges. Also, it takes only 0.8 seconds to find a clustering with a modularity score over 0.5 for hollywood-2009, which has 1,139,905 vertices and 113,891,327 edges.
Table 2: Modularity (Mod) for a given # of clusters (Clu) vs. reference results (Rmod) in [2]
On the other hand, for the small cases the modularity algorithm we propose often attains a better modularity score than the reference results in [2].² In fact, its score is better in 5 out of 6 cases considered in the study, as shown in Tab. 2b. Since we implement different algorithms for computing modularity, it is not completely surprising that their behavior varies on different data sets. Unfortunately, we could not identify any particular trends that would tell us when one algorithm would be better than the other in terms of quality. However, we always outperform the reference approach on large cases.

Figure 3: The speedup and relative quality when compared to the reference results. (a) Large data sets on GPU in [2]; (b) large data sets on CPU in [10].

² We do not have access to the corresponding code and are forced to make the comparisons with the results obtained on the Tesla C2075 GPU in [2]. Since in both algorithms the execution time is limited by memory bandwidth, we estimate a factor of ∼3× as the baseline performance difference between the Tesla C2075 with 144 GB/s and the Titan X with 337 GB/s of bandwidth.
Another, more recent, work on modularity developed a hierarchical algorithm for computing it on the CPU [10]. We have experimented with this algorithm by computing 7 clusters in 64 bit precision and using all the CPU cores available on the machine. The performance of our approach versus these results is plotted in Fig. 3b. Notice that on average our algorithm outperforms the hierarchical approach by ∼3×, but it has the same quality tradeoffs.
7 Acknowledgements
The authors would like to acknowledge Steven Dalton, Joe Eaton, Alex Fit-Florea and Michael
Garland for their useful comments and suggestions.
References
[1] D. Arthur and S. Vassilvitskii, K-means++: The Advantages of Careful Seeding, Proc. 18th Annual
ACM-SIAM Symposium on Discrete algorithms, pp. 1027-1035, 2007.
[2] B. O. F. Auer, GPU Acceleration of Graph Matching, Clustering and Partitioning, Ph.D. Thesis,
Utrecht University, 2013.
[3] Z. Bai, J. Demmel, J. Dongarra, A. Ruhe and H. van der Vorst, Templates for the solution of
Algebraic Eigenvalue Problems: A Practical Guide, SIAM, Philadelphia, PA, 2000.
[4] N. Bell and M. Garland, Implementing Sparse Matrix-Vector Multiplication on Throughput-
Oriented Processors, Proc. SC09, 2009.
[5] U. Brandes, D. Delling, M. Gaertler, R. Gorke, M. Hoefer, Z. Nikoloski and D. Wagner, On
Modularity Clustering, IEEE Trans. Knowledge and Data Engineering, Vol. 20, pp. 172-188, 2008.
[6] W. M. Campbell, C. K. Dagli, and C. J. Weinstein, Social Network Analysis with Content and
Graphs, Lincoln Lab. Journal, Vol. 20, 2013.
[7] W. E. Donath and A. J. Hoffman, Lower Bounds for the Partitioning of Graphs, IBM Journal of
Research and Development, Vol. 17, pp. 420-425, 1973.
[8] M. Fiedler, Algebraic Connectivity of Graphs, Czechoslovak Mathematical Journal, Vol. 23, pp.
298-305, 1973.
[9] R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge University Press, NY, 1999.
[10] D. LaSalle and G. Karypis, Multi-threaded Modularity Based Graph Clustering Using the Multilevel Paradigm, J. Parallel Distrib. Comput., Vol. 76, pp. 66-80, 2015.
[11] S. P. Lloyd, Least Square Quantization in PCM, IEEE Trans. Information Theory, Vol. 28, pp.
129-137, 1982.
[12] U. von Luxburg, A Tutorial on Spectral Clustering, Technical Report No. TR-149, Max Planck
Institute, 2007.
[13] M. Naumov and T. Moon, Parallel Spectral Graph Partitioning, NVIDIA Technical Report, NVR-
2016-001, 2016.
[14] M. E. J. Newman, Assortative Mixing in Networks, Phys. Rev. Lett., Vol. 89, pp. 208701, 2002.
[15] M. E. J. Newman, The Structure and Function of Complex Networks, SIAM Review, Vol. 45, pp.
167-256, 2003.
[16] M. E. J. Newman and M. Girvan, Finding and Evaluating Community Structure in Networks,
Phys. Rev. E, Vol. 69, pp. 026113, 2004.
[17] M. E. J. Newman, Networks: An Introduction, Oxford University Press, New York, NY, 2010.
[18] D. Pelleg and A. Moore, X-means: Extending K-means with Efficient Estimation of the Number
of Clusters, Proc. 17th Int. Conf. on Machine Learning, pp. 727-734, 2000.
[19] M. Rosvall and C. T. Bergstrom, Maps of Random Walks on Complex Networks Reveal Community
Structure, Proc. Natl. Acad. Sci. USA, Vol. 105, pp. 1118-1123, 2008.
[20] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia, PA, 2nd Ed., 2003.
[21] M. Chen, K. Kuzmin and B. K. Szymanski, Community Detection via Maximization of Modularity and Its Variants, IEEE Trans. Computational Social Systems, Vol. 1, pp. 46-65, 2014.
[22] S. White and P. Smyth, A Spectral Approach to Finding Communities in Graphs, SIAM Conf.
Data Mining, 2005.
[23] Nvidia, CUDA Toolkit, http://developer.nvidia.com/cuda-downloads
[24] The University of Florida Sparse Matrix Collection, http://www.cise.ufl.edu/research/sparse/matrices