
High Quality, Scalable and Parallel Community Detection for Large Real Graphs

Arnau Prat-Pérez (DAMA-UPC, Universitat Politècnica de Catalunya, [email protected])
David Dominguez-Sal (Sparsity Technologies, david@sparsity-technologies.com)
Josep-Lluis Larriba-Pey (DAMA-UPC, Universitat Politècnica de Catalunya, [email protected])

ABSTRACT
Community detection has arisen as one of the most relevant topics in the field of graph mining, principally for its applications in domains such as social or biological network analysis. Different community detection algorithms have been proposed during the last decade, approaching the problem from different perspectives. However, existing algorithms are, in general, based on complex and expensive computations, making them unsuitable for large graphs with millions of vertices and edges such as those usually found in the real world.

In this paper, we propose a novel disjoint community detection algorithm called Scalable Community Detection (SCD). By combining different strategies, SCD partitions the graph by maximizing the Weighted Community Clustering (WCC), a recently proposed community detection metric based on triangle analysis. Using real graphs with ground truth overlapping communities, we show that SCD outperforms the current state of the art proposals (even those aimed at finding overlapping communities) in terms of quality and performance. SCD provides the speed of the fastest algorithms and the quality, in terms of NMI and F1Score, of the most accurate state of the art proposals. We show that SCD is able to run up to two orders of magnitude faster than practical existing solutions by exploiting the parallelism of current multi-core processors, enabling us to process graphs of unprecedented size in short execution times.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.3.4 [Information Storage and Retrieval]: Systems and Software; G.2.2 [Discrete Mathematics]: Graph Theory

Keywords
Graph Algorithms; Community Detection; Clustering; Parallel; Social Networks; Graph Partition; Modularity; WCC

1. INTRODUCTION
During the last years, the analysis of complex networks has become a hot research topic in the field of data mining. Social, biological, information and collaboration networks are typical targets for such analysis, just to cite a few. Among all the tools used to analyze these networks, community detection is one of the most relevant [7, 22]. Communities, also known as clusters, are often described as sets of vertices with a high density of connections among them that are seldom connected with the rest of the graph [9]. Community detection provides valuable information about the structural properties of the network [5, 9], the interactions among the agents of a network [3] or the role the agents play inside the network [21].

Community detection algorithms are often computationally expensive and do not scale to large graphs with billions of edges. Recently, Yang and Leskovec provided a benchmark with real datasets and their corresponding ground truth communities [24]. In that work, they measured the time spent by several state of the art algorithms, such as clique percolation [16] or link clustering [1], and found that they did not scale to networks with more than hundreds of thousands of edges. Even their new proposal aimed at large networks, BigClam [24], was not able to process the largest graph in the benchmark, the Friendster graph, with roughly 2 billion edges. On the other hand, algorithms such as Louvain [4], which can locate communities in graphs of a scale similar to that of the Friendster graph, do not scale in quality [2, 8, 10, 22].

In this paper, we present SCD, a new community detection algorithm that is much faster than the most accurate state of the art solutions, while matching or even improving their quality. SCD is able to compute the communities of the Friendster graph using a modest computer with 32 GB of RAM. Figure 1 schematically illustrates the two most important dimensions along which community detection algorithms are evaluated: quality and scalability. We observe that no algorithm in the state of the art excels in both scalability and quality.

SCD detects disjoint communities in undirected and unweighted networks by maximizing WCC, a recently proposed community metric [18]. WCC is a metric based on triangle structures in a community. In contrast to modularity, which is the most widely used metric and has resolution

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media.
WWW'14, April 7-11, 2014, Seoul, Korea.
ACM 978-1-4503-2744-2/14/04.
http://dx.doi.org/10.1145/2566486.2568010
[Figure 1: Scale vs quality of different community detection algorithms. Axes: Scalability (x) and Quality (y); plotted algorithms: scd, oslom, bigclam, louvain, infomap, walktrap.]

problems when optimized [2, 8], WCC mathematically guarantees that the communities that emerge from its optimization are cohesive and structured [18]. Furthermore, we show in this paper that the computation of WCC can be efficiently parallelized, allowing the design of algorithms that take advantage of current multi-core processors.

SCD implements a two-phase procedure that combines different strategies. In the first phase, SCD uses the clustering coefficient as a heuristic to obtain a preliminary partition of the graph. In the second phase, SCD refines the initial partition by moving vertices between communities as long as the WCC of the communities increases. In order to speed up this second phase, we propose a WCC estimator that approximates the original metric and is faster to compute.

The evaluation of SCD indicates that the quality of the communities found by SCD is at least as good as that of the best state of the art algorithms, while being orders of magnitude faster. Although the communities found by SCD are disjoint, we evaluate our algorithm with overlapping community benchmarks. We note that this quality comparison is biased against our algorithm, because SCD is not able to detect overlaps, which are present in the ground truth. The reason for our good quality scores despite this handicap is that our algorithm goes beyond edge counting metrics and accounts for triangle structures using WCC. The communities obtained using WCC are meaningful and precise, and thus match real communities very well. It is beyond the scope of this paper to extend WCC and SCD to handle overlapping partitions, which are also known as covers.

We summarize the main contributions as follows:

1. We propose a very scalable parallel community detection algorithm that is able to handle graphs with billions of edges. Our algorithm is in the same order of magnitude of speed as the fastest community detection techniques [4], whose quality is, in general, lower than that of the best state of the art algorithms, and thus far below SCD.

2. By using real world benchmarks with ground truth, we show that the quality of the communities of SCD is as good as that of the best state of the art algorithms [12, 24], which are at least two orders of magnitude slower than SCD.

3. SCD shows that the topological and structural analysis of communities beyond edge counting metrics provides better quality communities. In particular WCC, which is based on triangle counting, is very effective at locating meaningful communities.

4. According to our results, we observe that overlapping community detection methods are still far from obtaining high quality results, since non-overlapping methods are able to obtain better communities on overlapping benchmarks.

The rest of the paper is structured as follows. In Section 2, we review the related work. In Section 3, we describe SCD, including a brief introduction to WCC, the proposal of the estimators and an analysis of the complexity of the algorithm. In Section 4, we describe the experimental setup. In Section 5, we show the results and discuss them. In Section 6 we introduce a case study where the use of SCD is proven successful, and in Section 7 we conclude the paper and give guidelines for future work.

2. RELATED WORK
In the literature, we find a wide range of community detection algorithms which follow different strategies. The biggest family of community detection algorithms is formed by those based on maximizing modularity [14]. Modularity scores high those partitions containing communities with an internal edge density larger than that expected in a given graph model, which is almost always an Erdös-Rényi model. Several strategies have been proposed for its optimization, such as agglomerative greedy [6] or simulated annealing [13]. A multilevel approach has been proposed which scales to graphs with hundreds of millions of objects [4], but the quality of its results decreases considerably as the size of the graph increases [10]. Moreover, it has been reported that modularity has resolution limits [2, 8]: modularity is unable to detect small and well defined communities when the graph is large, and its maximization delivers sets with a tree-like structure, which cannot be considered communities. SCD does not suffer from these problems because it is based on WCC, whose maximization is proven to deliver cohesive and structured communities [18].

Random walks are a tool on which several community detection algorithms rely. The intuition behind them is that, in a random walk, the probability of remaining inside a community is higher than that of leaving it, due to the higher density of internal edges. This strategy is the main idea exploited in Walktrap [17]. Another algorithm based on random walks is Infomap [20]. In this case, a codification for describing random walks in terms of communities is searched for, and the codification that requires the least memory space (attains the highest compression rate) is selected. According to the comparison performed by Lancichinetti et al. [10], Infomap stands as one of the best community detection algorithms.

Another category of algorithms is formed by those capable of finding overlapping communities. An example of such an algorithm is Oslom [12], which uses the significance as a fitness measure to assess the quality of a cluster. Similar to modularity, the significance is defined as the probability of finding a given cluster in a random null model. Another algorithm that falls into this category is the Link Clustering Algorithm (LCA) [1]. This algorithm is based on the idea of taking edges instead of vertices to form a community. By means of an iterative process, the similarity of adjacent edges (i.e. those edges that share a vertex, forming
an open triad) is assessed by using the Jaccard coefficient of the adjacency lists of the two other vertices of the edges. In this case, thanks to taking the edges instead of the vertices, the overlapping communities emerge naturally. Label propagation is another family of iterative techniques; it initially assigns labels to nodes and then defines rules that simulate the spread of these labels through the network, similarly to infections [19, 23]. Finally, a recently proposed algorithm is BigClam by Yang et al. [24]. This algorithm computes an affiliation of vertices to communities that maximizes an objective function using non-negative matrix factorization. The objective function is based on the intuition that the probability of an edge existing between two vertices increases with the number of communities the vertices share (i.e. the number of communities in which the vertices overlap).

Finally, the exploitation of parallelism has been an elusive topic in the field of community detection, with few remarkable exceptions [24]. SCD differs from existing solutions by being designed with parallelism in mind, and hence can take advantage of current multi-core architectures.

3. SCALABLE COMMUNITY DETECTION
Scalable Community Detection (SCD) finds disjoint communities in an undirected and unweighted graph by maximizing WCC. First, we briefly introduce WCC, and then we describe the proposed algorithm. For a more detailed description of WCC, please refer to [18].

3.1 WCC
WCC is a community metric based on the fact that real networks contain a large number of triangles due to their community structure. Since communities are groups of highly connected vertices, the probability that these vertices close triangles among them is larger than that expected between vertices of different communities. WCC uses this feature to quantify the quality of a partition of vertices.

Given a graph G(V, E), a vertex x and a community C, t(x, C) denotes the number of triangles that vertex x closes with the vertices in C, and vt(x, C) denotes the number of vertices in C that close at least one triangle with x and another vertex in G. Then, the level of cohesion of a vertex x with respect to community C is denoted by WCC(x, C), which is defined as follows:

    WCC(x, C) = ( t(x, C) / t(x, V) ) · ( vt(x, V) / ( |C \ {x}| + vt(x, V \ C) ) )   if t(x, V) ≠ 0;
    WCC(x, C) = 0                                                                     if t(x, V) = 0.    (1)

Using Equation 1, the WCC of a community C is computed as follows:

    WCC(C) = (1/|C|) · Σ_{x ∈ C} WCC(x, C).    (2)

That is, WCC(C) is the average WCC of all the vertices in C. Finally, the WCC of a graph partition P = {C1, ..., Cn}, where Ci ∩ Cj = ∅ for all i ≠ j, is:

    WCC(P) = (1/|V|) · Σ_{i=1}^{n} ( |Ci| · WCC(Ci) ).    (3)

The WCC of a partition is the weighted average of the WCC of all the communities in the partition. In other words, it is the average WCC of every vertex x in the graph with respect to the community C in P where x belongs.

[Figure 2: The algorithm's general schema. A graph G(V,E) feeds an "Initial Partition" stage producing partition P, which a "Partition refinement" stage turns into the final partition P'.]

3.2 Algorithm description
SCD takes a graph G as input and generates a partition of G resulting from a WCC optimization process. Before starting the execution of the algorithm, we perform a preprocessing step where unnecessary edges of the graph are removed. Figure 2 shows the general structure of SCD. The algorithm is divided into two phases: the first phase finds an initial partition; the second phase is fed with this initial partition and refines it.

3.2.1 Graph loading and preprocessing
During the loading of the graph, we perform the following cleanup. We compute the number of triangles that each edge of the graph belongs to. Then, we remove from the graph those edges that do not close any triangle, since such edges are irrelevant from the point of view of WCC. By removing these edges we reduce memory consumption and improve the performance of SCD. Furthermore, removing them simplifies the estimator proposed in Section 3.3, since we can then assume that each edge closes at least one triangle.

3.2.2 Initial partition
The initial partition is computed by a fast heuristic process, described in Algorithm 1. First, we compute the clustering coefficient of each vertex (Line 2); this information is needed by this step as well as by the refinement step. Then, we sort the vertices of the graph by their clustering coefficient in decreasing order, using the degree as a second sorting criterion for vertices with equal clustering coefficient (Line 3). Then, the vertices are iterated and, for each vertex v not previously visited, we create a new community C that contains v and all the neighbors of v that were not visited previously (Lines 7 to 13). Finally, community C is added to partition P (Line 14) and all the vertices of the community are marked as visited. The process finishes when all the vertices in the graph have been visited.

The intuition behind this heuristic is the following: the larger the clustering coefficient of a vertex, the larger the number of triangles the vertex closes with its neighbors, and the larger the probability that its neighbors form triangles among themselves. Hence, considering Equation 1, the larger the clustering coefficient of a vertex, the larger the probability that the WCC of its neighbors is large if we include them in the same community.
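As a concrete illustration of Equations 1-3, the following is our own minimal, unoptimized Python sketch (not the paper's C++ implementation); the graph representation as a dict of neighbor sets is an assumption of this sketch:

```python
from itertools import combinations

def t(g, x, s):
    """t(x, S): number of triangles x closes with pairs of vertices in S."""
    nbrs = g[x] & s
    return sum(1 for a, b in combinations(sorted(nbrs), 2) if b in g[a])

def vt(g, x, s):
    """vt(x, S): vertices of S closing at least one triangle with x."""
    return sum(1 for y in g[x] & s if g[x] & g[y])

def wcc_vertex(g, x, c):
    """Equation 1: cohesion of vertex x with respect to community c."""
    v = set(g)
    t_xv = t(g, x, v)
    if t_xv == 0:
        return 0.0
    return (t(g, x, c) / t_xv) * vt(g, x, v) / (len(c - {x}) + vt(g, x, v - c))

def wcc_partition(g, partition):
    """Equation 3, expanded via Equation 2: average vertex WCC over the graph."""
    return sum(wcc_vertex(g, x, c) for c in partition for x in c) / len(g)
```

For example, on two triangles joined by a single bridging edge, the partition that separates the two triangles scores a WCC of 1.0, while putting all six vertices into one community scores lower.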
Data: Given a graph G(V,E)
Result: Computes a partition of G
 1  Let P be a set of sets of vertices;
 2  ComputeCC(G);
 3  S ← sortByCC(V);
 4  foreach v in S do
 5      if not visited(v) then
 6          markAsVisited(v);
 7          C ← {v};
 8          foreach u in neighbors(v) do
 9              if not visited(u) then
10                  markAsVisited(u);
11                  C.add(u);
12              end
13          end
14          P.add(C);
15      end
16  end
17  return P;
Algorithm 1: Phase 1, initial partition.

3.2.3 Partition refinement
The partition refinement phase is described in Algorithm 2. The goal of this phase is to refine the partition received from the previous phase. This phase follows a hill climbing strategy: in each iteration, a new partition is computed from the previous one by performing a set of modifications (movements of vertices between communities). The algorithm repeats the process until the WCC of the new partition does not improve over the previous one by more than a given threshold percentage. In our tests, this threshold is set at 1%, which provided a good tradeoff between performance and quality.

Data: Given a graph G(V,E) and a partition P
Result: A refined partition P'
 1  newP ← P;
 2  newWCC ← computeWCC(P);
 3  repeat
 4      WCC' ← newWCC;
 5      P' ← newP;
 6      M ← ∅;
 7      foreach v in V do
 8          M.add( bestMovement(v, P') );
 9      end
10      newP ← applyMovements(M, P');
11      newWCC ← computeWCC(newP);
12  until (newWCC − WCC') / WCC' < t;
13  return P';
Algorithm 2: Phase 2, refinement.

In each iteration, for each vertex v of the graph, we use the bestMovement function to compute the movement of v that improves the WCC of the partition most (Line 8). There are three types of possible movements:

• No Action: leave the vertex in the community where it currently is.

• Remove: remove the vertex from its current community and place it as a singleton (a community formed by a single vertex).

• Transfer: remove the vertex from its current community and insert it into another one.

Note that bestMovement does not modify the current partition, and that the best movement of each vertex is computed independently of the others. This feature allows the best movements of all the vertices in the graph to be computed in parallel. Once each vertex of the graph has a best movement, we apply all the movements simultaneously (applyMovements, Line 10). Finally, we update the WCC of the new partition (Line 11) and check whether it improved since the last iteration.

Before describing the function bestMovement, we introduce some auxiliary functions that are used in its computation. The proofs of the theorems introduced are in Appendix A.

• WCC_I(v, C) computes the improvement of the WCC of a partition when vertex v (which belongs to a singleton community) is inserted into community C.

  Theorem 1. Let P = {C1, ..., Ck, {v}} and P' = {C1', ..., Ck} be partitions of a graph G = (V, E), where C1' = C1 ∪ {v}. Then,

      WCC(P') − WCC(P) = WCC_I(v, C1)
          = 1/|V| · Σ_{x ∈ C1} [ WCC(x, C1') − WCC(x, C1) ] + 1/|V| · WCC(v, C1').

• WCC_R(v, C) computes the improvement of the WCC of a partition when vertex v is removed from community C and placed as a singleton community.

  Theorem 2. Let P = {C1, ..., Ck} and P' = {C1', ..., Ck, {v}} be partitions of a graph G = (V, E), where C1 = C1' ∪ {v}. Then,

      WCC(P') − WCC(P) = WCC_R(v, C1) = −WCC_I(v, C1').

• WCC_T(v, C1, C2) computes the improvement of the WCC of a partition when vertex v is transferred from community C1 to C2.

  Theorem 3. Let P = {C1, C2, ..., C_{k−1}, Ck} and P' = {C1', C2, ..., C_{k−1}, Ck'} be partitions of a graph G = (V, E), where C1 = C1' ∪ {v} and Ck' = Ck ∪ {v}. Then,

      WCC(P') − WCC(P) = WCC_T(v, C1, Ck) = −WCC_I(v, C1') + WCC_I(v, Ck).

From Theorem 1, we conclude that, in order to compute the improvement of WCC derived from inserting a vertex v (i.e. a singleton community) into a community C, only the WCC of vertex v and of the vertices in C has to be computed. This limits the number of computations needed to evaluate a movement, because only a very local portion of the graph needs to be accessed for each vertex. Theorems 2 and 3 show that we can express any of the movements needed by the algorithm in terms of the function WCC_I(), simplifying the implementation of the algorithm.

Algorithm 3 describes the bestMovement function. First, we compute the improvement of removing vertex v from its
current community (Line 3). Then, we obtain the set of candidate communities, formed by those communities containing the neighbors of v (Line 6). After that, we calculate the candidate community where transferring vertex v improves the WCC most (Lines 7 to 13). Finally, we select whether the best improvement is obtained by removing the vertex from its current community (REMOVE) or by transferring it to another community (TRANSFER) (Lines 14 to 18). If neither of the two movements improves the WCC of the partition, we keep the vertex in its current community (NO ACTION) (Line 1).

Data: Given a graph G(V,E), a partition P and a vertex v
Result: Computes the best movement of v.
 1  m ← [NO ACTION];
 2  sourceC ← GetCommunity(v, P);
 3  wcc_r ← WCC_R(v, sourceC);
 4  wcc_t ← 0.0;
 5  bestC ← ∅;
 6  Candidates ← candidateCommunities(v, P);
 7  for c in Candidates do
 8      aux ← WCC_T(v, sourceC, c);
 9      if aux > wcc_t then
10          wcc_t ← aux;
11          bestC ← c;
12      end
13  end
14  if wcc_r > wcc_t and wcc_r > 0.0 then
15      m ← [REMOVE];
16  else if wcc_t > 0.0 then
17      m ← [TRANSFER, bestC];
18  end
19  return m;
Algorithm 3: bestMovement.

3.3 WCC_I Estimation
The exact computation of WCC requires computing the triangles that the target vertex closes with its neighbors. For a vertex of degree d, the complexity of this operation is O(d²), and since real graphs typically have power law degree distributions, this cost is large for the highest degree vertices in the graph. Furthermore, considering that the functions that compute the WCC improvements are called several times per vertex and iteration, they become the most time consuming part of the algorithm. As shown in the previous section, all auxiliary functions can be computed with only the WCC_I() function. In this section, we propose a model to estimate WCC_I() with a constant time complexity function (given some easy to compute statistics) that we call WCC_I'().

WCC_I'() stands for the approximated increment of WCC when vertex v is inserted into a community C. We depict the simplified schema in Figure 3. For the considered vertex v, we only record the number of edges that connect it to community C. For each community C, we keep the following statistics: the size of the community r; the edge density of the community δ; and the number of edges b in the boundary of the community. We also record the clustering coefficient of the graph ω, which is constant along the whole community detection process. These statistics homogenize the community members and allow the computation of WCC_I'() as follows:

[Figure 3: Model for estimating the WCC improvement. Vertex v has d_in edges to community C (of size r, internal edge density δ, and b boundary edges) and d_out edges to the rest of the graph G, which has clustering coefficient ω.]

Theorem 4. Consider the situation depicted in Figure 3, with the following assumptions:

• Every edge in the graph closes at least one triangle.

• The edge density inside community C is homogeneous and equal to δ.

• The clustering coefficient of the graph is homogeneous for all nodes outside C and equal to ω.

Then,

    WCC(P') − WCC(P) = WCC_I'(v, C)
        = 1/|V| · ( d_in · Θ1 + (r − d_in) · Θ2 + Θ3 ),    (4)

where

    Θ1 = [ ((r−1)δ + 1 + q) / ( (r+q) · ((r−1)(r−2)δ³ + (d_in−1)δ + q(q−1)δω + q(q−1)ω + d_out·ω) ) ] · (d_in−1)δ;

    Θ2 = − [ (r−1)(r−2)δ³ / ( (r−1)(r−2)δ³ + q(q−1)ω + q(r−1)δω ) ] · [ ((r−1)δ + q) / ( (r+q)(r−1+q) ) ];

    Θ3 = [ d_in(d_in−1)δ / ( d_in(d_in−1)δ + d_out(d_out−1)ω + d_out·d_in·ω ) ] · [ (d_in + d_out) / (r + d_out) ];

and q = (b − d_in)/r.

Conceptually, Θ1, Θ2 and Θ3 are the WCC improvements of, respectively, the vertices in C connected to v, the vertices in C not connected to v, and vertex v itself, when v is added to community C. The evaluation of Equation 4 is O(1) given all the statistics, and the update of all the statistics is only performed when all communities are updated, with a cost of O(m) for the whole graph. Note that we use aggregated statistics to estimate the number of triangles; we do not actually compute triangles when evaluating WCC_I'().

3.4 Complexity of the Algorithm
Let n be the number of vertices and m the number of edges in the graph. We assume that the average degree of the graph is d = m/n and that real graphs have a quasi-linear relation between vertices and edges, O(m) = O(n · log n).

In the initial partition phase, for each edge in the graph, we compute the triangles that the edge participates in. The triangles are found by intersecting the adjacency lists of the two connected vertices. Since we assume sorted adjacency lists, the complexity of computing the intersection is O(d). Since the average degree is m/n, the cost of the first phase is O(m · d) = O(m · log n).
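The per-edge triangle count used in the first phase can be sketched as a merge-style intersection of sorted adjacency lists; the following Python rendering is our own illustration (the paper's implementation is C++), with adjacency stored as sorted lists:

```python
def sorted_intersection_size(a, b):
    """Merge-style intersection of two sorted adjacency lists, O(|a| + |b|)."""
    i = j = n = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            n += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return n

def edge_triangles(adj):
    """Triangles each edge participates in. Edges with a count of zero are
    the ones pruned by the preprocessing of Section 3.2.1."""
    return {(u, v): sorted_intersection_size(adj[u], adj[v])
            for u in adj for v in adj[u] if u < v}
```

On two triangles joined by a bridging edge, for instance, the bridging edge gets a count of zero and would be removed during preprocessing.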
Graph        Vertices     Edges          Communities  % non-overlap  % two-overlap  % three-plus-overlap
Amazon       334,863      925,872        151,037      3.9            3.6            92.4
Dblp         317,080      1,049,866      13,477       57.5           16.6           25.9
Youtube      1,134,890    2,987,624      8,385        62.4           16.1           21.5
LiveJournal  3,997,962    34,681,189     287,512      35.8           17.2           47.0
Orkut        3,072,441    117,185,083    6,288,363    6.2            5.7            88.1
Friendster   65,608,366   1,806,067,135  957,154      45.5           20.8           33.6

Table 1: Characteristics of the test graphs.
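The movement selection of Algorithm 3 can be expressed, via Theorems 2 and 3, purely in terms of the insertion gain WCC_I (or its estimator WCC_I'). The sketch below is our own Python illustration of that selection logic; the gain function `wcc_i` is supplied by the caller, and the toy version used in the usage example is purely illustrative:

```python
def best_movement(v, source, candidates, wcc_i):
    """Algorithm 3 sketch. wcc_i(v, C) is the WCC gain of inserting the
    singleton vertex v into community C (WCC_I, or the estimator WCC_I')."""
    # Theorem 2: removing v from its community negates the insertion gain.
    wcc_r = -wcc_i(v, source - {v})
    best_c, wcc_t = None, 0.0
    for c in candidates:  # communities containing neighbors of v
        gain = wcc_r + wcc_i(v, c)  # Theorem 3: a removal plus an insertion
        if gain > wcc_t:
            best_c, wcc_t = c, gain
    if wcc_r > wcc_t and wcc_r > 0.0:
        return ("REMOVE",)
    if wcc_t > 0.0:
        return ("TRANSFER", best_c)
    return ("NO_ACTION",)
```

For example, with a toy gain function that scores insertion into a neighboring community higher than staying put, the function returns a TRANSFER into that community; if every insertion gain is negative, it returns REMOVE.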

Regarding the second phase, let α be the number of iterations required to find the best partition P', which in our experiments is between 3 and 7. In each iteration, for each vertex v of the graph, we compute, in the worst case, d + 1 movements of type WCC_I'(), each with a cost of O(1). Then, the computation of the best movement for all vertices in the graph in an iteration is O(n · (d + 1)) = O(m). The application of all the movements is linear with respect to the number of vertices, O(n). We also need to update, in each iteration of the second phase, the statistics δ, c_out, d_in and d_out for each vertex and community, which has a cost of O(m). Finally, the computation of the WCC of the current partition is performed by computing the triangles of each edge, which is O(m · log n) as already stated. Hence, the cost of the refinement phase becomes O(α · (m + n + m + m · log n)), which after simplification becomes O(m · log n), assuming α constant.

The final cost of the algorithm is the sum of the two phases: O(m · log n + m · log n) = O(m · log n).

4. EXPERIMENTAL SETUP
For the experimental evaluation of SCD, we used all the community benchmark datasets provided by SNAP¹ that include a community ground truth, which we use to verify the quality of the algorithms. To our knowledge, these are the largest datasets that include such a gold standard verification; they range from a million edges to almost two billion edges, as reported in Table 1. The communities field indicates the number of ground truth communities. Note that, in all the graphs, a vertex can belong to more than one community. In Table 1, we show the percentage of vertices that belong to one, two, or three or more ground truth communities. We see that the number of overlaps is not homogeneous, and in some graphs a vertex tends to participate in more communities than in others.

In this paper, we do not make any special distinction between disjoint and overlapping community detection algorithms. We are interested in determining how meaningful the output communities are. Therefore, we simply analyze the matching between the output of the algorithms and the real communities specified in the ground truth. We consider better, in terms of quality, those algorithms that match the gold standard better, independently of the type of algorithm under consideration.

Amazon: This graph represents a network of products, where each vertex is a product and an edge exists between two products if they have been frequently co-purchased. The product categories provided by Amazon define the ground truth communities.

DBLP: This graph represents a network of coauthorships, where each vertex is an author and two authors are connected if they have written a paper together. Each journal or conference defines a ground truth community formed by those authors who published in that journal or conference.

Youtube: This graph represents the Youtube social network, where each vertex is a user and two users are linked if they have established a friendship relation. Communities are defined by the groups created by the users, each formed by the users that joined that group.

LiveJournal: This graph represents the social network around LiveJournal. Similar to the Youtube network, the vertices are the users, who establish friendship relationships with other users. Users can create groups, which define the ground truth communities and are formed by the users that joined each group.

Orkut: This graph represents the Orkut social network. Similar to the Youtube and LiveJournal networks, the vertices are the users of the social network and the edges represent the friendship relationships between them. The groups created by the users define the ground truth communities, each formed by the users that joined the group.

Friendster: This graph represents the Friendster gaming social network, where each vertex is a user of the network and two users are connected if they established a friendship relation. In this network, users create groups, which define the ground truth communities; each ground truth community is formed by the vertices in that group.

We select the most relevant community detection algorithms in the state of the art: Infomap [20], Louvain [4], Walktrap [17], BigClam [24] and Oslom [12]. Infomap, Louvain and Walktrap are considered the best algorithms for disjoint community detection, according to [10, 15]. On the other hand, we take BigClam and Oslom as the state of the art of overlapping community detection. We use the implementations provided on the websites of the authors, all of them written in C++. SCD was also implemented in C++ and can be found on the website of the authors².

We use two metrics to evaluate the quality of the different algorithms. The first metric is the Average F1Score, F̄1, following the approach taken by the authors of the community benchmark [24]. The F1Score of a set A with respect to a set B is defined as the harmonic mean (H) of the precision

¹ http://snap.stanford.edu
² http://www.dama.upc.edu
and the recall of A with respect to B: 1e+06

Execution Time (s)


|A ∩ B| |A ∩ B| 100000
precision(A, B) = , recall(A, B) = .
|A| |B| 10000

2·a·b 1000
H(a, b) =
a+b 100
F1 (A, B) = H(precision(A, B), recall(A, B))
10
Then, the average F1Score of two sets of communities C1 1
and C2 is given by: Amz Dblp You Live Orkut Friend

F1 (A, C) = arg max F1 (A, Ci ), ci ∈ C = {C1 , · · · , Cn } walktrap louvain oslom


i infomap bigclam scd
1 X 1
F¯1 (C1 , C2 ) =
X
F1 (ci , C ′ ) + ′
F1 (ci , C)
2|C| c ∈C 2|C | ′
i ci ∈C
Figure 4: Execution times in seconds.
The second metric is the Normalized Mutual Infor-
mation (NMI) [11], which is based on information theory
concepts. The NMI provides a real number between zero
0.6
and one that gives the similarity between two sets of sets of
objects. An exact match between the two inputs obtains a 0.5
value of one.

F1Score
0.4
The computer used in the experiments has: a CPU Intel
0.3
Xeon E5530 at 2.4 GHz, 32 GB of RAM, 1TB of disk space
and Debian Linux with kernel 2.6.32-5-amd64. 0.2
0.1
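The F̄1 score defined above is straightforward to compute directly from its definition. The following Python sketch is our own illustration, not the authors' code; it models communities as sets of vertex identifiers and assumes all communities are non-empty:

```python
def precision(a, b):
    """|A ∩ B| / |A|: fraction of A's members that also appear in B."""
    return len(a & b) / len(a)

def recall(a, b):
    """|A ∩ B| / |B|: fraction of B's members recovered by A."""
    return len(a & b) / len(b)

def f1(a, b):
    """Harmonic mean H of the precision and recall of A with respect to B."""
    p, r = precision(a, b), recall(a, b)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def best_f1(a, communities):
    """F1 of set a against its best-matching community in the other cover."""
    return max(f1(a, c) for c in communities)

def average_f1(c1, c2):
    """Symmetric Average F1Score between two sets of communities."""
    return (sum(best_f1(a, c2) for a in c1) / (2 * len(c1)) +
            sum(best_f1(b, c1) for b in c2) / (2 * len(c2)))
```

Two identical covers obtain the maximum score, e.g. `average_f1([{1, 2, 3}, {4, 5}], [{1, 2, 3}, {4, 5}])` evaluates to 1.0.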
5. EXPERIMENTAL RESULTS

5.1 Quality and Performance

Figure 4 shows the execution time of the different tested algorithms. Missing columns indicate that the algorithm was not able to process that graph, either due to memory consumption or because it took more than a week of computing. All the state of the art algorithms are executed using their default parameters. In BigClam, parameter k, which indicates the number of communities, is set to the actual number of ground-truth communities. In this experiment, all the algorithms were set to run single threaded, including ours, because most of the provided implementations were single threaded. Note that since the differences between the fast and the slow algorithms are of orders of magnitude, multi-threading does not alter the conclusions.

Figure 5: F̄1 score using ground truth. (bar chart; y-axis: F1Score, 0 to 0.6; x-axis: Amz, Dblp, You, Live, Orkut, Friend; series: walktrap, louvain, oslom, infomap, bigclam, scd)

Figure 6: NMI using ground truth. (bar chart; y-axis: NMI, 0 to 0.35; x-axis: Amz, Dblp, You, Live, Orkut, Friend; series: walktrap, louvain, oslom, infomap, bigclam, scd)

When we look at the execution times, we see that there are mainly two groups of algorithms in terms of scalability: SCD and Louvain on one side, and the rest of the algorithms on the other. In general, the group formed by SCD and Louvain runs about two orders of magnitude faster than the rest. For practical purposes, this computational cost creates a barrier for the group of slow algorithms when analyzing large networks. Oslom takes several days to analyze the Orkut graph, whereas SCD finds the communities in a few minutes. Even assuming that these slow algorithms scale linearly with the problem size, which is not true for most of them, the analysis of large graphs may require unaffordable times. We also see that SCD's performance is better than that of Louvain, the fastest algorithm in the state of the art.

Figures 5 and 6 show the F̄1 and NMI quality scores of the tested algorithms, respectively. We observe that both metrics are correlated, though some small deviations exist among them. From these results, we conclude that SCD obtains the best quality, followed by Oslom and Louvain. In the case of F̄1, SCD obtains the best quality in all the test graphs except Livejournal, where it is close to BigClam. On the other hand, Oslom stands as the second best option except for Amazon and Livejournal. Note that the F̄1 of Louvain for the Friendster graph cannot be appreciated in the figure due to its small value. When we turn to NMI, we see that SCD obtains the best quality in four out of six of the tested graphs. For the rest of the graphs, SCD obtains a score close to the best one. In terms of NMI, Louvain and Oslom are the closest to SCD on average, but with less quality in most of the graphs. Broadly speaking, SCD improves on the quality of the best community detection algorithms while running much faster.
Figure 7: SCD normalized execution time for different number of threads. (bar chart; y-axis: normalized execution time, 0 to 1; x-axis: Amz, Dblp, You, Live, Orkut, Friend; series: 1, 2 and 4 threads)

Figure 8: SCD execution time with four threads vs number of edges. (log-log scatter; y-axis: execution time in seconds, 1 to 100000; x-axis: million edges, 0.1 to 10000; points: Amz, Dblp, You, Live, Orkut, Friend)

5.2 Scalability

We parallelized the two most time consuming parts of the algorithm: the computation of the clustering coefficient of the vertices and the refinement part. In the former, we parallelized the loop that computes, for each edge, the number of triangles that the edge closes. In the latter, we parallelized the loop in Line 7 of Algorithm 2, which calls the function bestMovement for each of the vertices in the graph, as well as the computation of WCC for the partition at the end of the iteration. Since all the parallel code is in the form of loops, we used OpenMP with dynamic scheduling, using a chunk size of 32. Figure 7 shows the normalized execution times of SCD with different numbers of threads. In this experiment, we have excluded the time spent in I/O, which includes reading the graph file and printing the results.

Broadly speaking, we see that SCD is able to achieve very good scalability, especially for the larger graphs, which are also the graphs with the largest degree. The larger the degree of the graph, the larger the cost of those parts that have been parallelized, which become dominant over the sequential ones. This translates into better scalability as a direct application of Amdahl's Law.

For large graphs, the implementation of SCD is able to exploit all the available processor resources. The configuration with four threads of SCD keeps the four cores of the processor active, and hence obtains about a fourfold improvement over the single threaded version. These results show that SCD is an algorithm capable of exploiting multi-core architectures efficiently, especially on large graphs, where this feature is more appreciated. More specifically, SCD processes the Friendster graph using four threads in just 4.3 hours.

Figure 8 shows the execution time of SCD with respect to the number of edges of the graph. Each point represents the time spent by the four thread version of SCD on one of the six datasets. We observe that SCD scales approximately linearly with respect to the number of edges of the graph.

5.3 Memory Consumption

In Table 2, we show the memory consumption of SCD for each of the graphs. We split the memory consumption into three different concepts:

• Graph: the size of the data structure that stores the graph as a list of adjacencies. We relabel the indices of the vertices to the range from 1 to n, and hence, we also account for an array containing a mapping between our vertex identifiers and the original identifiers.

• Triangles: an array whose size equals the number of vertices, which contains the number of triangles each vertex belongs to.

• Partitions: accounts for the statistics of the partitions. We report the iteration with the largest memory consumption.

              Graph    Triangles   Partitions      Total
Amazon         11.4          1.3         16.0       28.7
Dblp           12.2          1.3         14.9       28.4
Youtube        37.5          4.5         68.9      110.9
Livejournal   325.4         16.0        197.7      539.1
Orkut         974.3         12.3        124.4     1111.0
Friendster  15235.8        262.4       3317.6    18815.8

Table 2: SCD Memory consumption in MB.

The auxiliary data structures (triangles and partitions) built by SCD scale linearly with the number of vertices of the graph, and not with the number of edges. Since the number of statistics stored per node is small, the amount of memory consumed by SCD is often dominated by the graph representation itself. Among the tested benchmarks, the only exception is Youtube, because of its very small average degree of 2.6. For the largest graphs, SCD allocates auxiliary data structures amounting to only an additional 23% (Friendster) and 14% (Orkut) of the original graph. The amount of memory consumed for the Friendster graph is roughly 18 GB, indicating that even larger graphs could be processed within the 32 GB of memory of the test machine.

6. CASE STUDY

In this section, we present results showing the behavior of SCD in a real world application: product purchasing recommendation. We have downloaded, from the SNAP website, the metadata associated with the products of the co-purchasing Amazon graph described above.
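As a sketch of how detected communities could back such a recommender, the lookup amounts to returning the other members of the community that contains a purchased product. The following Python fragment is our own illustration with toy data; the function names and example products are not part of SCD or the Amazon dataset:

```python
def build_community_index(communities):
    """Map each product to the community (set of products) it was assigned to."""
    index = {}
    for community in communities:
        for product in community:
            index[product] = community
    return index

def recommend(purchased, index, limit=5):
    """Suggest other members of the community containing the purchased product."""
    community = index.get(purchased, set())
    return sorted(p for p in community if p != purchased)[:limit]
```

With a toy partition such as `[{"book-a", "book-b", "book-c"}, {"map-x", "map-y"}]`, buying `"book-a"` yields the other two books of its community as suggestions; in practice SCD outputs communities of vertex identifiers, which the product metadata would map back to titles.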
Information Retrieval:
Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW (Richard K. Belew);
Understanding Search Engines: Mathematical Modeling and Text Retrieval (Michael W. Berry);
Managing Gigabytes: Compressing and Indexing Documents and Images (Ian H. Witten);
Foundations of Statistical Natural Language Processing (Christopher D. Manning);
Information Retrieval: Data Structures and Algorithms (William B. Frakes and Ricardo Baeza-Yates);
Modern Information Retrieval (Ricardo Baeza-Yates);
Natural Language Processing for Online Applications: Text Retrieval, Extraction, and Categorization (P. Jackson and I. Moulinier);
Readings in Information Retrieval (Karen Sparck Jones and Peter Willett).

Harry Potter:
Harry Potter and the Sorcerer's Stone (Book 1);
Harry Potter and the Chamber of Secrets (Book 2);
Harry Potter and the Prisoner of Azkaban (Book 3);
Harry Potter and the Goblet of Fire (Book 4);
Harry Potter and the Order of the Phoenix (Book 5);
Harry Potter and the Sorcerer's Stone (Hardcover);
Harry Potter and the Prisoner of Azkaban (Hardcover);
Harry Potter and the Order of the Phoenix (Book 5, Deluxe Edition);
Harry Potter and the Sorcerer's Stone (Book 1, Large Print);
Harry Potter and the Sorcerer's Stone (Book 1, Audio CD);
Harry Potter and the Prisoner of Azkaban (Book 3, Audio CD);
Harry Potter and the Goblet of Fire (Book 4, Audio CD);
Harry Potter and the Order of the Phoenix (Book 5, Audio).

Lonely Planet Barcelona:
Eyewitness Top 10 Travel Guides: Barcelona;
Barcelona and Catalonia (Eyewitness Travel Guides);
Eyewitness Travel Guide to Barcelona and Catalonia;
Lonely Planet Barcelona;
The National Geographic Traveler: Barcelona;
Lonely Planet Barcelona City Map;
Streetwise Barcelona.

Table 3: Examples of communities of co-purchased products of WCC.

This metadata includes information such as the title of the product, the type, and the categories it belongs to, and was collected in 2006. As stated above, a ground-truth community in the Amazon graph is formed by those products that belong to the same groups and form a connected component. Therefore, the communities in the Amazon graph should contain similar products that have usually been co-purchased. Hence, once a new buyer purchases a product, she could be recommended those products in the same communities as the one she bought. We select three profiles of well known products: the technical computer science book titled "Modern Information Retrieval", by Ricardo Baeza-Yates; the first book of a popular novel series, "Harry Potter and the Sorcerer's Stone"; and a travel book, "Lonely Planet Barcelona". We run the SCD algorithm on the whole network and report in Table 3 the communities to which these books were assigned. In the case of "Modern Information Retrieval", we see that the community is formed by relevant books in the field of information retrieval and text analysis. In the case of Harry Potter, we see that the community contains different books of the Harry Potter series (for the curious reader, the two last books of the series were published after the crawl and are not present in the dataset), as well as Harry Potter audio books and some special editions. Finally, for the "Lonely Planet Barcelona" guide, the community contains Barcelona travel guides from other publishers. We observe that SCD is able to perform a good selection of the relations in the graph in order to give meaningful communities.

7. CONCLUSIONS AND FUTURE WORK

In this paper, we proposed SCD, a novel algorithm for disjoint community detection based on optimizing WCC. We proposed a mechanism to estimate WCC, allowing a faster computation of WCC. We compared SCD with the best community detection algorithms in the state of the art, by means of a methodology for overlapping community detection using ground truth data. SCD is able to detect communities as meaningful as (and in most of the cases better than) those of the most high-profile algorithms in the literature. Moreover, the results show that SCD is able to run faster than these highest quality existing solutions, matching the speed of those algorithms aimed at large scale graphs. This translates into SCD being able to process graphs of an unprecedented size, such as Friendster, which has roughly 2 billion edges, in just 4.3 hours on off-the-shelf computer hardware. The design of SCD also allows a remarkable scalability, close to a fourfold improvement on a four core processor. Also, we showed that SCD is able to deliver meaningful communities, by means of a case study consisting of a product recommendation application. Finally, we can conclude that going beyond edge counting, i.e. focusing on richer structures such as triangles for community detection, provides better results.

The fact that SCD, being a disjoint community detection algorithm, performs better than pure overlapping community detection algorithms gives us the hint that overlapping community detection is a problem still far from solved. Hence, one of the future research lines is to extend the ideas behind the topological analysis of the graph performed by SCD to overlapping communities. On the other hand, we have seen that SCD is able to scale on current multi-core architectures. Another interesting research line to explore in the future is how to adapt SCD to a vertex centric large graph processing model such as GraphLab or Pregel.

8. ACKNOWLEDGEMENTS

The members of DAMA-UPC thank the Ministry of Science and Innovation of Spain and Generalitat de Catalunya for grant numbers TIN2009-14560-C03-03 and GRC-1187, respectively. The members of DAMA-UPC thank the EU FP7 project LDBC (FP7-ICT2011-8-317548) for funding the LDBC project. David Dominguez-Sal thanks the Ministry of Science and Innovation of Spain for the grant Torres Quevedo PTQ-11-04970. The authors would like to thank the reviewers for their useful comments.
APPENDIX

A. PROOF OF THEOREM 1

Proof.
\begin{align*}
WCC(P') - WCC(P) &= \frac{1}{|V|}\Big(|C_1 \cup \{v\}| \cdot WCC(C_1 \cup \{v\}) + \sum_{i=2}^{k} |C_i| \cdot WCC(C_i)\Big) \\
&\quad - \frac{1}{|V|}\Big(|C_1| \cdot WCC(C_1) + \sum_{i=2}^{k} |C_i| \cdot WCC(C_i) + WCC(\{v\})\Big) \\
&= \frac{1}{|V|}\big(|C_1'| \cdot WCC(C_1')\big) - \frac{1}{|V|}\big(|C_1| \cdot WCC(C_1) + 0\big) \\
&= \frac{1}{|V|}\Big(\sum_{x \in C_1'} WCC(x, C_1') - \sum_{x \in C_1} WCC(x, C_1)\Big) \\
&= \frac{1}{|V|}\Big(\sum_{x \in C_1} WCC(x, C_1') + WCC(v, C_1') - \sum_{x \in C_1} WCC(x, C_1)\Big)
\end{align*}

B. PROOF OF THEOREM 2

Proof. As stated in the theorem assumptions, the partition P' is built by removing v from C_1. Alternatively, the partition P can be built by inserting vertex v into C_1' in P'. Then, the two following equalities hold:
\[ WCC(P) + WCC_R(v, C_1) = WCC(P'), \]
\[ WCC(P) = WCC(P') + WCC_I(v, C_1'), \]
and thus: \( WCC_R(v, C_1) = -WCC_I(v, C_1') \).

C. PROOF OF THEOREM 3

Proof. Since WCC is a state function, all paths from P to P' have the same differential. Then, we express the transfer operation as a combination of a removal and an insertion:
\begin{align*}
WCC(P) + WCC_T(v, C_1, C_k) &= WCC(P') \\
WCC(P) + WCC_R(v, C_1) + WCC_I(v, C_k) &= WCC(P') \\
WCC(P') - WCC(P) &= -WCC_I(v, C_1') + WCC_I(v, C_k)
\end{align*}

D. PROOF OF THEOREM 4

Proof. Consider the situation depicted in Figure 3. Let N(x) be the set of neighbors of x. Given that, we define the sets F = N(v) ∩ C, which contains those vertices in C that are actual neighbors of v, and G = C \ N(v), which contains those vertices in C that are not neighbors of v. Therefore, from Theorem 1 we have:
\begin{align*}
WCC_I(v, C) &= \frac{1}{|V|}\sum_{x \in C} \big(WCC(x, C \cup \{v\}) - WCC(x, C)\big) + \frac{1}{|V|} WCC(v, C \cup \{v\}) \\
&= \frac{1}{|V|}\sum_{x \in F} \big(WCC(x, C \cup \{v\}) - WCC(x, C)\big) + \frac{1}{|V|}\sum_{x \in G} \big(WCC(x, C \cup \{v\}) - WCC(x, C)\big) \\
&\quad + \frac{1}{|V|} WCC(v, C \cup \{v\})
\end{align*}
We know that |F| = d_{in} and |G| = r - d_{in}; then we can define WCC'_I(v, C) with respect to three variables Θ1, Θ2 and Θ3, which represent the WCC improvement of a vertex of F, a vertex of G and v, respectively. Then,
\[ WCC'_I(v, C) = \frac{1}{|V|}\big(|F| \cdot \Theta_1 + |G| \cdot \Theta_2 + \Theta_3\big). \]
We define q = (b - d_{in})/r as the number of edges connecting each vertex in C with the rest of the graph excluding v. Then,

(i) If x ∈ F, we have
\begin{align*}
t(x, C) &= (r-1)(r-2)\delta^3; \\
t(x, C \cup \{v\}) &= (r-1)(r-2)\delta^3 + (d_{in}-1)\delta; \\
t(x, V) &= (r-1)(r-2)\delta^3 + (d_{in}-1)\delta + q(r-1)\delta\omega + q(q-1)\omega + d_{out}\,\omega; \\
vt(x, V) &= (r-1)\delta + 1 + q; \\
|C \cup \{v\} \setminus \{x\}| &+ vt(x, V \setminus (C \cup \{v\})) = r + q; \\
|C \setminus \{x\}| &+ vt(x, V \setminus C) = r - 1 + q + 1 = r + q.
\end{align*}
In t(x, C), we account for those triangles that x closes with two other vertices in C. Similarly, in t(x, C ∪ {v}) we account for those triangles that x closes with two other vertices in C, plus those triangles that x closes with v and another vertex in C. t(x, V) accounts for all the triangles that vertex x closes in the graph, which are: t(x, C ∪ {v}), plus those triangles that vertex x closes with another vertex of C and a vertex of V \ C, plus those triangles that vertex x closes with two other vertices in V \ C, plus those triangles that vertex x closes with v and another vertex of V \ C. Since we assume that every edge in the graph closes at least one triangle, vt(x, V) accounts for the number of vertices in C that are actual neighbors of x, plus 1 (for vertex v), plus the q vertices that are connected to x. Finally, we have that the union of the vertices in C and those vertices in V with whom x closes at least one triangle has size r + q. Therefore,
\begin{align*}
\Theta_1 &= WCC(x, C \cup \{v\}) - WCC(x, C) \\
&= \frac{t(x, C \cup \{v\})}{t(x, V)} \cdot \frac{vt(x, V)}{|C \cup \{v\} \setminus \{x\}| + vt(x, V \setminus (C \cup \{v\}))} - \frac{t(x, C)}{t(x, V)} \cdot \frac{vt(x, V)}{|C \setminus \{x\}| + vt(x, V \setminus C)} \\
&= \frac{vt(x, V)}{(r+q) \cdot t(x, V)} \cdot \big(t(x, C \cup \{v\}) - t(x, C)\big) \\
&= \frac{(r-1)\delta + 1 + q}{(r+q) \cdot \big((r-1)(r-2)\delta^3 + (d_{in}-1)\delta + q(r-1)\delta\omega + q(q-1)\omega + d_{out}\,\omega\big)} \cdot (d_{in}-1)\delta.
\end{align*}

(ii) If x ∈ G, we have
\begin{align*}
t(x, C) &= (r-1)(r-2)\delta^3; \\
t(x, C \cup \{v\}) &= (r-1)(r-2)\delta^3; \\
t(x, V) &= (r-1)(r-2)\delta^3 + q(q-1)\omega + q(r-1)\delta\omega; \\
vt(x, V) &= (r-1)\delta + q; \\
|C \cup \{v\} \setminus \{x\}| &+ vt(x, V \setminus (C \cup \{v\})) = r + q; \\
|C \setminus \{x\}| &+ vt(x, V \setminus C) = r - 1 + q.
\end{align*}
t(x, C) accounts for those triangles that x closes with two other vertices in C. Since x is not connected to v, we have that t(x, C) = t(x, C ∪ {v}). t(x, V) accounts for the number of triangles that x closes with the rest of the vertices in V. These are t(x, C) plus those triangles that vertex x closes with
another vertex of C and a vertex of V \ C, plus those triangles that vertex x closes with two other vertices in V \ C. vt(x, V) accounts for the number of vertices in V with whom x closes at least one triangle, which are the neighbors of x in C and the q vertices with whom x is connected. Finally, we have that the union of the vertices in C ∪ {v} and the vertices in V with whom x closes at least one triangle has size r + q, and the union of the vertices in C and the vertices in V with whom x closes at least one triangle has size r + q - 1. Therefore,
\begin{align*}
\Theta_2 &= WCC(x, C \cup \{v\}) - WCC(x, C) \\
&= \frac{t(x, C \cup \{v\})}{t(x, V)} \cdot \frac{vt(x, V)}{|C \cup \{v\} \setminus \{x\}| + vt(x, V \setminus (C \cup \{v\}))} - \frac{t(x, C)}{t(x, V)} \cdot \frac{vt(x, V)}{|C \setminus \{x\}| + vt(x, V \setminus C)} \\
&= -\frac{(r-1)(r-2)\delta^3}{(r-1)(r-2)\delta^3 + q(q-1)\omega + q(r-1)\delta\omega} \cdot \frac{(r-1)\delta + q}{(r+q)(r-1+q)}.
\end{align*}

(iii) If x = v, we have
\begin{align*}
t(x, C \cup \{v\}) &= d_{in}(d_{in}-1)\delta; \\
t(x, V) &= d_{in}(d_{in}-1)\delta + d_{out}(d_{out}-1)\omega + d_{out}\,d_{in}\,\omega; \\
vt(x, V) &= d_{in} + d_{out}; \\
|C| + vt(x, V \setminus C) &= r + d_{out}.
\end{align*}
In this case, t(x, C ∪ {v}) accounts for those triangles that x closes with C, to which it is connected by d_{in} vertices. t(x, V) accounts for the triangles that vertex x closes in V, which are those x closes with C, plus those x closes with two other vertices in V \ C, plus those x closes with a vertex of C and a vertex of V \ C. vt(x, V) accounts for the number of vertices in V with whom x closes at least one triangle, which are d_{in} plus d_{out}, since we assume that every edge closes at least one triangle. Finally, the union of the vertices in C and those vertices in V with whom x closes at least one triangle has size r + d_{out}. Therefore,
\begin{align*}
\Theta_3 &= WCC(v, C \cup \{v\}) = \frac{t(x, C \cup \{v\})}{t(x, V)} \cdot \frac{vt(x, V)}{|C| + vt(x, V \setminus C)} \\
&= \frac{d_{in}(d_{in}-1)\delta}{d_{in}(d_{in}-1)\delta + d_{out}(d_{out}-1)\omega + d_{out}\,d_{in}\,\omega} \cdot \frac{d_{in}+d_{out}}{r+d_{out}}.
\end{align*}

E. REFERENCES
[1] Y. Ahn, J. Bagrow, and S. Lehmann. Link communities reveal multiscale complexity in networks. Nature, 466(7307):761–764, 2010.
[2] J. P. Bagrow. Are communities just bottlenecks? Trees and treelike networks have high modularity. CoRR, abs/1201.0745, 2012.
[3] A.-L. Barabási and Z. N. Oltvai. Network biology: understanding the cell's functional organization. Nature Reviews Genetics, 5(2):101–113, 2004.
[4] V. Blondel et al. Fast unfolding of communities in large networks. JSTAT, 2008:P10008, 2008.
[5] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang. Complex networks: Structure and dynamics. Physics Reports, 424(4):175–308, 2006.
[6] A. Clauset et al. Finding community structure in very large networks. Phys. Rev. E, 70(6):066111, 2004.
[7] S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.
[8] S. Fortunato and M. Barthélemy. Resolution limit in community detection. PNAS, 104(1):36, 2007.
[9] M. Girvan and M. Newman. Community structure in social and biological networks. PNAS, 99(12):7821, 2002.
[10] A. Lancichinetti. Community detection algorithms: a comparative analysis. Phys. Rev. E, 80(5):056117, 2009.
[11] A. Lancichinetti, S. Fortunato, and J. Kertész. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics, 11(3):033015, 2009.
[12] A. Lancichinetti, F. Radicchi, J. Ramasco, and S. Fortunato. Finding statistically significant communities in networks. PLoS ONE, 6(4):e18961, 2011.
[13] A. Medus et al. Detection of community structures in networks via global optimization. Physica A, 358(2-4):593–604, 2005.
[14] M. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E, 69(2):026113, 2004.
[15] G. K. Orman, V. Labatut, and H. Cherifi. Qualitative comparison of community detection algorithms. In Digital Information and Communication Technology and Its Applications, pages 265–279. Springer, 2011.
[16] G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435(7043):814–818, 2005.
[17] P. Pons and M. Latapy. Computing communities in large networks using random walks. J. Graph Algorithms Appl., 10(2):191–218, 2006.
[18] A. Prat-Pérez, D. Dominguez-Sal, J. M. Brunat, and J.-L. Larriba-Pey. Shaping communities out of triangles. In CIKM, pages 1677–1681, 2012.
[19] U. N. Raghavan, R. Albert, and S. Kumara. Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E, 76:036106, Sep 2007.
[20] M. Rosvall and C. Bergstrom. Maps of random walks on complex networks reveal community structure. PNAS, 105(4):1118, 2008.
[21] Y. Wang, G. Cong, G. Song, and K. Xie. Community-based greedy algorithm for mining top-k influential nodes in mobile social networks. In SIGKDD, pages 1039–1048. ACM, 2010.
[22] J. Xie, S. Kelley, and B. K. Szymanski. Overlapping community detection in networks: the state of the art and comparative study. ACM Computing Surveys, 45(4), 2013.
[23] J. Xie and B. Szymanski. Towards linear time overlapping community detection in social networks. In PAKDD, pages 25–36, 2012.
[24] J. Yang and J. Leskovec. Overlapping community detection at scale: a nonnegative matrix factorization approach. In WSDM, pages 587–596, 2013.
