High Quality, Scalable and Parallel Community Detection For Large Real Graphs
ABSTRACT

Community detection has arisen as one of the most relevant topics in the field of graph mining, principally for its applications in domains such as social or biological network analysis. Different community detection algorithms have been proposed during the last decade, approaching the problem from different perspectives. However, existing algorithms are, in general, based on complex and expensive computations, making them unsuitable for large graphs with millions of vertices and edges, such as those usually found in the real world.

In this paper, we propose a novel disjoint community detection algorithm called Scalable Community Detection (SCD). By combining different strategies, SCD partitions the graph by maximizing the Weighted Community Clustering (WCC), a recently proposed community detection metric based on triangle analysis. Using real graphs with ground truth overlapping communities, we show that SCD outperforms the current state of the art proposals (even those aimed at finding overlapping communities) in terms of quality and performance. SCD provides the speed of the fastest algorithms and the quality, in terms of NMI and F1Score, of the most accurate state of the art proposals. We show that SCD is able to run up to two orders of magnitude faster than practical existing solutions by exploiting the parallelism of current multi-core processors, enabling us to process graphs of unprecedented size in short execution times.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.3.4 [Information Storage and Retrieval]: Systems and Software; G.2.2 [Discrete Mathematics]: Graph Theory

Keywords

Graph Algorithms; Community Detection; Clustering; Parallel; Social Networks; Graph Partition; Modularity; WCC

1. INTRODUCTION

During the last years, the analysis of complex networks has become a hot research topic in the field of data mining. Social, biological, information and collaboration networks are typical targets for such analysis, to cite just a few. Among all the tools used to analyze these networks, community detection is one of the most relevant [7, 22]. Communities, also known as clusters, are often described as sets of vertices with a high density of connections among them that are seldom connected with the rest of the graph [9]. Community detection provides valuable information about the structural properties of the network [5, 9], the interactions among the agents of a network [3], or the role the agents play inside the network [21].

Community detection algorithms are often computationally expensive and do not scale to large graphs with billions of edges. Recently, Yang and Leskovec provided a benchmark with real datasets and their corresponding ground truth communities [24]. In that work, they measured the time spent by several state of the art algorithms, such as clique percolation [16] or link clustering [1], and found that they did not scale to networks with more than hundreds of thousands of edges. Even their new proposal aimed at large networks, BigClam [24], was not able to process the largest graph in the benchmark, the Friendster graph, with roughly 2 billion edges. On the other hand, algorithms such as Louvain [4], which can locate communities in graphs with a scale similar to that of the Friendster graph, do not scale in quality [2, 8, 10, 22].

In this paper, we present SCD, a new community detection algorithm that is much faster than the most accurate state of the art solutions, while maintaining or even improving their quality. SCD is able to compute the communities of the Friendster graph using a modest 32 GB RAM computer. Figure 1 illustrates schematically the two most important dimensions used to evaluate community detection algorithms: quality and scalability. We observe that no algorithm in the state of the art excels in both scalability and quality.

SCD detects disjoint communities in undirected and unweighted networks by maximizing WCC, a recently proposed community metric [18]. WCC is a metric based on triangle structures in a community. In contrast to modularity, which is the most widely used metric and has resolution limits, ...
[Figure 1: Quality vs. scalability of community detection algorithms; the plotted labels include scd and oslom.]

... provides better quality communities. In particular, WCC, which is based on triangle counting, is very effective at locating meaningful communities.

4. According to our results, we observe that overlapping community detection metrics are still far from obtaining ...
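To illustrate the triangle-based intuition behind WCC mentioned above, the following is a minimal Python sketch that scores a community by the fraction of each member's triangles that close inside it. This is a simplified illustration under our own assumptions, not the exact WCC definition from [18], and all function names are ours.

from itertools import combinations

def triangles_with(adj, v, allowed):
    # Count triangles closed by vertex v whose other two endpoints lie in `allowed`.
    neigh = [u for u in adj[v] if u in allowed]
    return sum(1 for a, b in combinations(neigh, 2) if b in adj[a])

def community_cohesion(adj, community):
    # Illustrative triangle-based score: the average fraction of each member's
    # triangles that are closed inside its own community (a simplified stand-in
    # for the intuition behind WCC, not the metric's exact definition).
    community = set(community)
    score = 0.0
    for v in community:
        total = triangles_with(adj, v, adj[v])        # all triangles v closes in the graph
        inside = triangles_with(adj, v, community)    # triangles v closes inside the community
        if total > 0:
            score += inside / total
    return score / len(community) if community else 0.0

# Toy usage: undirected graph as an adjacency dict of sets.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2, 4}, 4: {3}}
print(community_cohesion(adj, {0, 1, 2}))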
Regarding the second phase, let α be the number of iterations required to find the best partition P′, which in our experiments is between 3 and 7. In each iteration, for each vertex v of the graph we compute, in the worst case, d + 1 movements of type WCC′(I), each with a cost of O(1). Then, the computation of the best movement for all vertices of the graph in one iteration is O(n · (d + 1)) = O(m). The application of all the movements is linear with respect to the number of vertices, O(n). We also need to update, in each iteration of the second phase, the statistics δ, c_out, d_in and d_out for each vertex and community, which has a cost of O(m). Finally, the computation of the WCC of the current partition is performed by counting, for each edge, the triangles it closes, which is O(m · log n) as already stated. Hence, the cost of the refinement phase becomes O(α · (m + n + m + m · log n)), which, assuming α constant, simplifies to O(m · log n). (A structural sketch of one refinement iteration is given after the dataset descriptions below.)

The final cost of the algorithm is the sum of the two phases: O(m · log n + m · log n) = O(m · log n).

DBLP: This graph represents a network of coauthorships, where each vertex is an author and two authors are connected if they have written a paper together. Each journal or conference defines a ground truth community, formed by those authors who published in that journal or conference.

Youtube: This graph represents the Youtube social network, where each vertex is a user and two users are linked if they have established a friendship relation. Communities are defined by the groups created by the users; each community is formed by the users that joined that group.

LiveJournal: This graph represents the social network around LiveJournal. Similar to the Youtube network, the vertices are users, who establish friendship relationships with other users. Users can create groups, which define the ground truth communities and are formed by those users that joined the group.
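To make the per-vertex movement evaluation of the refinement phase analyzed above more concrete, here is a minimal structural sketch of one iteration in Python: every vertex evaluates at most d + 1 candidate movements and the chosen movements are then applied all at once. The gain estimators wcc_gain_remove and wcc_gain_insert are hypothetical placeholders for the O(1) WCC improvement computations that rely on the precomputed statistics δ, c_out, d_in and d_out; this illustrates the loop structure only, not the paper's implementation.

def refinement_iteration(adj, label, wcc_gain_remove, wcc_gain_insert):
    # `label` maps every vertex to an integer community identifier.
    moves = {}
    for v in adj:
        best_gain, best_target = 0.0, None
        stay = label[v]
        # Candidate 1: remove v from its community (v becomes a singleton).
        gain = wcc_gain_remove(v, stay)
        if gain > best_gain:
            best_gain, best_target = gain, "singleton"
        # Candidates 2..d+1: transfer v to the community of one of its neighbours.
        for c in {label[u] for u in adj[v] if label[u] != stay}:
            gain = wcc_gain_remove(v, stay) + wcc_gain_insert(v, c)
            if gain > best_gain:
                best_gain, best_target = gain, c
        if best_target is not None:
            moves[v] = best_target
    # Apply all chosen movements; linear in the number of vertices.
    next_label = max(label.values(), default=-1) + 1
    for v, target in moves.items():
        if target == "singleton":
            label[v] = next_label
            next_label += 1
        else:
            label[v] = target
    return len(moves)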
H(a, b) = (2 · a · b) / (a + b)

F1(A, B) = H(precision(A, B), recall(A, B))
Then, the average F1Score of two sets of communities C1 and C2 is given by matching each community in one set with the community in the other set that maximizes F1, and averaging these best-match values symmetrically over both sets, as sketched below.
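A minimal sketch of this computation in Python, assuming each community is a set of vertex identifiers (the function names are ours, not the paper's):

def f1(a, b):
    # F1 between a found community a and a ground truth community b,
    # i.e. the harmonic mean H of precision and recall defined above.
    a, b = set(a), set(b)
    inter = len(a & b)
    if inter == 0:
        return 0.0
    precision, recall = inter / len(a), inter / len(b)
    return 2 * precision * recall / (precision + recall)

def average_f1(c1, c2):
    # Symmetric best-match average: every community is matched with the
    # community of the other set that maximizes F1, in both directions.
    def one_side(xs, ys):
        return sum(max(f1(x, y) for y in ys) for x in xs) / len(xs)
    return 0.5 * (one_side(c1, c2) + one_side(c2, c1))

# Toy usage.
found = [{1, 2, 3}, {4, 5}]
truth = [{1, 2}, {4, 5, 6}]
print(average_f1(found, truth))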
The computer used in the experiments has: a CPU Intel Xeon E5530 at 2.4 GHz, 32 GB of RAM, 1 TB of disk space and Debian Linux with kernel 2.6.32-5-amd64.
5. EXPERIMENTAL RESULTS
Figure 7: SCD normalized execution time for different number of threads.

Figure 8: SCD execution time with four threads vs number of edges.
Table 3 (excerpt): communities found by SCD for the selected Amazon products.

Harry Potter:
Harry Potter and the Sorcerer’s Stone (Book 1); Harry Potter and the Chamber of Secrets (Book 2); Harry Potter and the Prisoner of Azkaban (Book 3); Harry Potter and the Goblet of Fire (Book 4); Harry Potter and the Order of the Phoenix (Book 5); Harry Potter and the Sorcerer’s Stone (Hardcover); Harry Potter and the Prisoner of Azkaban (Hardcover); Harry Potter and the Order of the Phoenix (Book 5, Deluxe Edition); Harry Potter and the Sorcerer’s Stone (Book 1, Large Print); Harry Potter and the Sorcerer’s Stone (Book 1, Audio CD); Harry Potter and the Prisoner of Azkaban (Book 3, Audio CD); Harry Potter and the Goblet of Fire (Book 4, Audio CD); Harry Potter and the Order of the Phoenix (Book 5, Audio);

Lonely Planet Barcelona:
Eyewitness Top 10 Travel Guides: Barcelona; Barcelona and Catalonia (Eyewitness Travel Guides); Eyewitness Travel Guide to Barcelona and Catalonia; Lonely Planet Barcelona; The National Geographic Traveler: Barcelona; Lonely Planet Barcelona City Map; Streetwise Barcelona;
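Communities such as those in Table 3 can directly drive a simple co-purchase style recommendation: given the community of a purchased product, suggest its other members. A minimal sketch with hypothetical data structures and names, not code from the paper:

def build_recommender(communities):
    # Map every product to the set of members of its (disjoint) community.
    community_of = {}
    for members in communities:
        for product in members:
            community_of[product] = members
    def recommend(purchased):
        # Recommend the other products of the community of the purchased item.
        return sorted(community_of.get(purchased, set()) - {purchased})
    return recommend

# Toy usage with title strings standing in for product identifiers.
communities = [
    {"Harry Potter and the Sorcerer's Stone (Book 1)",
     "Harry Potter and the Chamber of Secrets (Book 2)"},
    {"Lonely Planet Barcelona", "Streetwise Barcelona"},
]
recommend = build_recommender(communities)
print(recommend("Lonely Planet Barcelona"))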
The product metadata includes information such as the title of the product, the type and the categories it belongs to, and was collected in 2006. As stated above, a ground truth community in the Amazon graph is formed by those products that belong to the same group and form a connected component. Therefore, the communities in the Amazon graph should contain similar products that are usually co-purchased. Hence, once a buyer purchases a product, the other products in the same community could be recommended to them. We select three profiles of well known products: the technical computer science book "Modern Information Retrieval", by Ricardo Baeza; the first book of a popular novel series, "Harry Potter and the Sorcerer's Stone"; and a travel book, "Lonely Planet Barcelona". We run the SCD algorithm on the whole network and report the communities to which these books were assigned in Table 3. In the case of "Modern Information Retrieval", we see that the community is formed by relevant books in the field of information retrieval and text analysis. In the case of Harry Potter, we see that the community contains different books of the Harry Potter series (for the curious reader, the last two books of the series were published after the crawl and are not present in the dataset), as well as Harry Potter audio books and some special editions. Finally, for the "Lonely Planet Barcelona" guide, the community found by SCD contains Barcelona travel guides from other publishers as well. We observe that SCD is able to perform a good selection of the relations in the graph in order to produce meaningful communities.

Moreover, the results show that SCD is able to run faster than these highest quality existing solutions, matching the speed of the algorithms aimed at large scale graphs. This translates into SCD being able to process graphs of unprecedented size, such as the Friendster graph, which has roughly 2 billion edges, in just 4.3 hours on off-the-shelf computer hardware. The design of SCD also allows a remarkable scalability, with close to a four-fold improvement on a four-core processor. Also, we showed that SCD is able to deliver meaningful communities, by means of a case study consisting of a product recommendation application. Finally, we can conclude that going beyond edge counting, i.e. focusing on richer structures such as triangles for community detection, provides better results.

The fact that SCD, being a disjoint community detection algorithm, performs better than pure overlapping community detection algorithms gives us the hint that overlapping community detection is a problem still far from being solved. Hence, one of the future research lines is to extend the ideas behind the topological analysis of the graph performed by SCD to overlapping communities. On the other hand, we have seen that SCD is able to scale on current multi-core architectures. Another interesting research line to explore in the future is how to adapt SCD to a vertex-centric large graph processing model such as GraphLab or Pregel.