FOCS Fast Overlapped Community Search

This document discusses an algorithm called Fast Overlapped Community Search (FOCS) for detecting overlapping communities in large networks. It begins with background on community detection and its applications. It notes that most existing algorithms falsely identify overlaps as communities or do not scale well for large networks. The document then introduces FOCS, which accounts for local connectivity to identify overlapped communities efficiently in linear time. FOCS finds multiple near-best communities simultaneously rather than just the best one. It is shown to outperform other algorithms in computational time without compromising on quality.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 11, NOVEMBER 2015

FOCS: Fast Overlapped Community Search


Sanghamitra Bandyopadhyay, Senior Member, IEEE, Garisha Chowdhary, and Debarka Sengupta

Abstract—Discovery of natural groups of similarly functioning individuals is a key task in analysis of real world networks. Also, overlap
between community pairs is commonplace, particularly in large social and biological graphs. In fact, overlaps between communities
are known to be denser than the non-overlapped regions of the communities. However, most existing algorithms that detect
overlapping communities assume that communities are denser than their surrounding regions, and so falsely identify overlaps as
communities. Further, many of these algorithms are computationally demanding and thus do not scale reasonably with
network size. In this article, we propose Fast Overlapped Community Search (FOCS), an algorithm that accounts for local
connectedness in order to identify overlapped communities. FOCS is shown to be linear in the number of edges and nodes. It
additionally gains in speed via simultaneous selection of multiple near-best communities, rather than merely the best, at each iteration.
FOCS outperforms some popular overlapped community finding algorithms in terms of computational time without compromising
quality.

Index Terms—Overlapping community search, social network, local heuristic, complex network

1 INTRODUCTION

A social network comprises a finite number of individuals and connections among them. A connection or tie usually links a pair of individuals based on their common interest, relationship through work, family, romance, friendship, partnership in crime, etc. The complexity involved in the appearance and disappearance of such connections leads to the formation of some non-trivial topological structures. Moreover, these networks are often huge in size, which prevents the application of most traditional graph theoretic algorithms, as these do not scale well. Social network analysis intends to generate useful insights into such large, complex networks with the help of a range of novel and scalable computational methods.

In a social system, individuals tend to group with others who are like-minded or with whom they interact more regularly and intensely than others. This process leads to the formation of communities. In a community the participant actors are densely connected to each other, whereas nodes that belong to different communities do not interact much. Furthermore, actors with interests and purposes in different fields result in overlapped communities. Such overlapping communities are frequent in social graphs.

Identification of communities has many real life applications. For example, a telecom service provider might take additional measures to retain a consumer who has significant connectivity within a specific community. This is important because the exit of such an important consumer might go viral and lead to an undesired shrinkage in the respective community [1]. Commercial web sites may market offers by eyeing increased sales within certain communities of individuals. Sometimes, a piece of information can be diffused easily into a community by informing a handful of its influential individuals. This, in fact, is quite evident today in the way locally published information is spread across a huge mass of socially connected people. Information about political views, social irregularities, natural calamities, important conferences, newly created media, etc. diffuses quickly through social networks. There have been instances of identifying dubious communities involved in organized crime [2]. To summarize, community detection has diverse applications including the prediction of forthcoming events, activities or developments, business intelligence, campaign management, infrastructure management, churn prediction, etc.

Networks today typically consist of millions of nodes and billions of edges. Mining useful information from such large-scale networks demands methods that are fast and efficient and that require only information local to the nodes under consideration. Such methods, apart from being fast, are able to overcome memory constraints. In this paper, we propose the Fast Overlapped Community Search (FOCS) algorithm, which searches for overlapped communities in large networks based on locally computed scores. The method has been applied to several large social and biological networks. The detected communities have been compared with the respective ground-truth communities for the networks. FOCS has performed well in terms of both time and efficiency when compared with some popular overlapped community detection algorithms.

• S. Bandyopadhyay and G. Chowdhary are with the Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India. E-mail: [email protected], [email protected].
• D. Sengupta is with the Computational and Systems Biology Group, Genome Institute of Singapore, 60 Biopolis St., Singapore 138672. E-mail: [email protected].
Manuscript received 28 Dec. 2013; revised 1 June 2015; accepted 7 June 2015. Date of publication 15 June 2015; date of current version 2 Oct. 2015.
Recommended for acceptance by A. Gionis.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TKDE.2015.2445775
1041-4347 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

2 RELATED WORK

The problem of community detection is to identify naturally existing groups of actors such that nodes within a group are densely connected with each other while being sparsely connected to the nodes that belong to different groups. Social communities, in topological terms, are nothing but graph
BANDYOPADHYAY ET AL.: FOCS: FAST OVERLAPPED COMMUNITY SEARCH 2975

clusters. However, graph-clustering methods, in general, are only applicable to networks that are far smaller than today's social networks. Some excellent reviews of existing community detection methods can be found in [3], [4] and [5].

Graph clustering is an optimization problem, which is computationally intractable. The advent of social networks and their impact on day to day life have rejuvenated this area of research with the demand for faster algorithms that scale well for large-sized graphs. A number of graph clustering approaches exist in the literature [6]. Hierarchical methods, for example, usually optimize a global score such as conductance, spectral distance, modularity etc., and obtain disjoint clusters. In [7] Rosvall and Bergstrom created a partition of a network into clusters based on compression of information on a random walk taken on the network. Blondel et al. in [8] found hierarchical disjoint communities in massive networks based on modularity optimization.

Fig. 1. Snapshot of a section of DBLP network.
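To make one of the global scores named above concrete, the following is a minimal sketch of conductance for a single node set, using its standard definition (cut edges divided by the smaller of the two side volumes). The adjacency-dictionary representation and the function name `conductance` are assumptions made for this illustration; they do not come from the cited methods.

```python
def conductance(adj, community):
    """Conductance of a node set: cut edges / min(volume inside, volume outside).

    adj is an assumed representation: dict mapping node -> set of neighbors
    of a simple undirected graph.
    """
    community = set(community)
    cut = 0      # edges with exactly one endpoint inside the community
    vol_in = 0   # sum of degrees of community nodes
    vol_out = 0  # sum of degrees of all other nodes
    for node, neighbors in adj.items():
        if node in community:
            vol_in += len(neighbors)
            cut += sum(1 for nb in neighbors if nb not in community)
        else:
            vol_out += len(neighbors)
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 0.0


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
left_triangle_score = conductance(adj, {0, 1, 2})
```

A low value indicates a set that is well separated from the rest of the graph, which is the situation seed-expansion methods implicitly look for.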
As already mentioned, partitioning a graph into disjoint clusters is inapplicable in the context of social networks, since they are generally organized into overlapped communities. Secondly, communities in social graphs exist at both the vertical and horizontal tiers. For example, in corporate scenarios, communities naturally exist within a horizontal tier, a subunit assigned to a task. Concurrently, a committee formed to activate tasks up and down the vertical differentiation of organization exemplifies a community existing across horizontal tiers. Therefore, methods that find communities through hierarchical clustering of nodes [9] fail to capture such communities. This enforces the need for clustering methods that allow a node to be present in communities detected at all hierarchical levels.

Further, there are approaches that identify certain predetermined structures in the network, such as cliques [10], k-cores [11], and n-cliques [12], as communities. These methods generally perform well but are computationally demanding, and restrictive at times. Other methods that start from a seed (node [9] or clique [10]) and expand until a certain score such as cut ratio or conductance is decreased fail to identify all existing communities in a social network. This is because the number of outgoing edges from a community is often many times greater than the number of edges within the community, as can be seen around community B encircled with a dashed line in Fig. 1. Most density based methods [13] are inapplicable to community detection in social networks for the same reason. Communities in social networks have rather denser overlapped regions when compared to non-overlapped portions of communities [14], [15].

In [16], [17] the idea of partitioning edges, instead of nodes, into communities has been explored. It allows a node with multiple edges to be assigned to multiple communities. These methods assume that the links are homogeneous, i.e., two individuals are connected via a single functionality or interest. This assumption violates the observed statistics that the likelihood of an edge between a pair of nodes increases with the number of communities they share [14]. There also exist several model based approaches to community detection, including the block stochastic model [18] and another based on non-negative matrix factorization (BigClam) [19]. In these methods a given graph is considered to be a realization of the proposed statistical model. These methods are usually driven by a certain objective function. BigClam in particular takes care of the statistics as mentioned earlier in this paragraph. Additionally, BigClam has smaller time complexity as compared to other existing non-negative matrix factorization methods, because of improvement in objective function from $l_2$ norm to log-likelihood. As stated in [19], BigClam "achieves near linear running time". However, as evident from the results in Table 4, it is still unable to produce results within a reasonable time for very large networks. In addition, BigClam also requires the number of communities to be given as input.

Local optimization methods, on the other hand, achieve a near optimum solution by optimizing a fitness function defined on parameters that describe local topological configurations. Reducing the community detection problem in social networks into a local optimization problem is more convincing. This is intuitive as the process of community formation is initiated by a participating individual, in a manner more local than global. Success of such local optimization based approaches depends heavily on the considerations made while constructing the fitness functions. A recent method, for example, defines the local fitness score to be the fraction of neighbors of a node that are within its community [20]. Several such fitness functions have already been defined in the literature [9], [21], [22], [23].

Label propagation algorithms (LPAs) in particular start by assigning each node a unique label and then propagate labels, ensuring that a node receives the one that the maximum number of its neighbors share [21], [24], [25], [26]. COPRA [21] modified the classical LPA [27] such that each node can retain multiple labels in order to find overlapped community structure. In SLPA [24] the occurrence frequency of labels received over consecutive iterations for each node is maintained while a sender (neighboring) node sends the most probable label. Such methodology also helps a node to decide upon its membership strength in each of its communities based on the probability information of received labels. While maintaining that each node must only hold labels that the majority of its neighbors share, the LPAs emphasize the largest possible communities for each node, ignoring the small well connected communities among a minority of its neighbors. In a real life scenario, however, one usually forms a small community with one's family members and close relatives, while many more larger ones with

school/college classmates. In Fig. 1, for example, for the node marked a in the overlapped region of the two communities A and B, only a very small fraction of its neighbors, i.e., only five of 16, are in community B. LPAs, in this case, will assign only label A to node a, resulting in two disjoint communities. Algorithm DEMON [28] extracts the local network for each node, applies a label propagation algorithm to each of them, and finally finds the union of obtained communities to get overlapped community structure. The algorithm however suffers from the same limitation as an LPA.

Local spectral clustering based methods have also found application in overlapped community detection [29], [30]. These methods usually require an upper bound on the number of communities as input. They usually first approximately embed the graph in $d \ll n$ dimensions (where $n$ is the number of nodes) using spectral clustering. Following this, the points in low dimensional space are clustered using simpler existing clustering methods. However, computation of eigenvalues/eigenvectors for spectral clustering is computationally expensive. Efforts to parallelize computation in MapReduce in [30] still show limited application in terms of scalability.

In [22], [31], the problem of overlapped community detection in social networks has been addressed using a game theoretic framework, where the dynamics of community formation have been captured as a strategic game. Here, each node, a selfish agent in disguise, selects the communities to join or leave, based on its definition of utility. Utility is usually a combination of gain and loss functions. In [22], for example, increase in modularity has been formulated as the gain function, whereas the number of communities a node joins is the input parameter to the loss function. There are other methods that solve the community detection problem for social networks based on cost-benefit trade-off [23]. They mostly add or remove nodes iteratively from a community, or merge communities, in order to improve the benefits and reduce the costs incurred to a node. Many approaches among these impose the number of communities a node participates in as a restriction [18], [21], [22], [20], [32], which is not the case in real networks [14].

Although the aforementioned methods are simple and fast, they mostly find disjoint clusters. The ones that find overlapped clusters are mostly computationally demanding, and still restrictive. This makes them inapplicable to large scale real networks. FOCS, on the other hand, is a fast algorithm that evolves on the basis of some locally computed scores to discover overlapped communities. It scales well over large sized social networks. It additionally gains in speed via simultaneous selection of multiple near-best communities rather than merely the best. This helps to save a number of iterations. Moreover, the communities detected by the method are not limited to a particular hierarchical level, rather are inclusive of all meaningful communities in the given network. Furthermore, the method is deterministic, i.e., the results are not dependent on the sequence in which the nodes are considered. This is a problem in [9], [21], [22], [23], [25], [31].

3 METHOD

3.1 Problem Definition

We are given an undirected, unweighted graph $G(V, E)$. The graph is assumed to be simple (without self loops or parallel edges).

The problem of community detection is to find a family of subgraphs $S = \{S_i \mid S_i \subseteq V\}$ such that any node $v_j$ in a subgraph $S_i$ is more connected in the subgraph $S_i$ than in another subgraph $S_j'$. Here, $S_j' = (S_k \mid v_j \notin S_k \wedge S_k \in S)$ is any subgraph in family $S$ not containing node $v_j$. Each subgraph $S_i \in S$ is a community.

For each node $v_j$, $\forall j \in \{1, 2, \ldots, |V|\}$, let $S(v_j) = \{S_i \mid v_j \in S_i \wedge S_i \in S\}$ be the collection of communities containing node $v_j$. Further, let $S'(v_j) = S - S(v_j)$ be the collection of communities not containing node $v_j$. If each node $v_j$ belongs to exactly 1, or no community at all, i.e., $|S(v_j)| \le 1$, then it is called disjoint clustering, overlapped clustering otherwise. The FOCS algorithm, proposed in this paper, explores overlapped clusters in a given graph.

3.2 Connectedness

As has already been mentioned in Section 3.1, for each node $v_j$, $\forall j \in \{1, 2, \ldots, |V|\}$, $v_j$ is more connected to any community in $S(v_j)$ than any of the communities in $S'(v_j)$. Consequently, we say, $v_j$ is equally well connected to all the communities in $S(v_j)$. This derives the working principle for FOCS.

Let $N(v_j)$ be the set of neighbors of a node $v_j \in V$. Or,

$N(v_j) = \{v_k \mid (v_j, v_k) \in E\}.$ (1)

Now, let $N_i(v_j)$ be the within community neighborhood of node $v_j$ defined for community $S_i \in S(v_j)$ as follows:

$N_i(v_j) = \{v_k \mid (v_j, v_k) \in E \wedge v_k \in S_i\}.$ (2)

FOCS defines connectedness of a node with respect to its community as the ratio of the size of its within community neighborhood to the size of the community minus 1. An individual, thus, is considered to be well connected within its community if it has connections to most of the nodes in the community (apart from itself). The community connectedness score $\tilde{\zeta}_i^j$, thus, assigned to each node $v_j$ in each community $S_i \in S$ is,

$\tilde{\zeta}_i^j = \dfrac{|N_i(v_j)|}{|S_i| - 1}.$ (3)

Further, to ensure that a node in any community has at least $K$ neighbors within the community [33], Equation (3) has been modified to define community connectedness score $\zeta_i^j$ as follows:

$\zeta_i^j = \dfrac{|N_i(v_j)| - K + 1}{|S_i| - K}$, if $|N_i(v_j)| > K$; and 0, otherwise. (4)

Reasonably, if $K$ is assigned a very large value, small but dense communities will be missed out. On the other hand, a very small value for $K$ allows discovery of sparser large communities and insignificant small communities. It is found that the algorithm is not sensitive to low values of $K$ and performs consistently well over networks of varying sizes with $K = 2$. Fig. 2 shows the variation in statistics of detected communities when FOCS is applied on the Amazon network, with increasing values of $K$, and OVL
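The two scores of Equations (3) and (4) in Section 3.2 can be sketched in a few lines. This is an illustrative sketch only: the adjacency-dictionary representation, the function names, and the toy graph are assumptions, and the guards against a zero denominator are additions that the equations themselves do not state.

```python
def raw_connectedness(adj, community, node):
    """Equation (3): |N_i(v_j)| / (|S_i| - 1)."""
    if len(community) <= 1:          # guard added for this sketch
        return 0.0
    inside = len(adj[node] & community)   # |N_i(v_j)|
    return inside / (len(community) - 1)


def community_connectedness(adj, community, node, K=2):
    """Equation (4): (|N_i(v_j)| - K + 1) / (|S_i| - K) if |N_i(v_j)| > K, else 0."""
    inside = len(adj[node] & community)   # |N_i(v_j)|
    if inside <= K or len(community) <= K:  # second test is a guard added here
        return 0.0
    return (inside - K + 1) / (len(community) - K)


# Hypothetical toy graph: a 5-clique on nodes 0..4, plus node 5 tied to 0 and 1.
adj = {v: {u for u in range(5) if u != v} for v in range(5)}
adj[5] = {0, 1}
adj[0].add(5)
adj[1].add(5)
S = {0, 1, 2, 3, 4, 5}

score_hub = community_connectedness(adj, S, 0)     # linked to every other member
score_member = community_connectedness(adj, S, 2)  # linked to 4 of 5 other members
score_fringe = community_connectedness(adj, S, 5)  # only K = 2 neighbors inside
```

With K = 2, the fringe node scores 0 because it does not have more than K neighbors inside the community, which is the effect the modification in Equation (4) is meant to have.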

(discussed later in Section 3.3.4) set to 0.6. It can be observed that a community structure ceases to exist in the extremely sparse graph resulting from setting $K > 5$. In general $K = 2$ is a reasonable choice for community detection unless the networks being analyzed are highly dense, in which case $K$ can be set larger.

Fig. 2. Change in community statistics when input parameter to FOCS, K, is varied (with OVL set to 0.6), simulated on Amazon network [34].

The algorithm also defines the neighborhood connectedness score $\xi_i^j$ for a node $v_j$ with respect to its community $S_i$ as the ratio of the size of its within community neighborhood to the size of its (overall) neighborhood:

$\xi_i^j = |N_i(v_j)| / |N(v_j)|.$ (5)

This score emphasizes the fraction of the neighborhood of node $v_j$ that is present within the community $S_i$. It must be noted that the community connectedness score decides the belongingness of a node to its community, whereas the neighborhood connectedness score only defines the interest of a node in joining a new community. Table 1 describes the status of a node in its community on the basis of both the scores.

3.3 The Algorithm

The driving principle for FOCS is that communities are initiated by individuals, and influenced by their neighbors and neighboring communities. A node attracts its neighboring individuals to be a part of its community. Those that find enough connectivity may choose to stay. The communities then expand further as the process is iterated by the newly added members.

3.3.1 Initial Communities

Initially every node $v_i$, $\forall i \in \{1, 2, \ldots, |V|\}$, that has at least $K$ neighbors, builds a community $S_i$ with its neighbors. The number of communities thus is equal to the number of nodes with degree greater than $K$. In this way each node becomes a part of the communities initiated by itself and by its neighbors as well, allowing overlap between the communities at the initiation. This approach further helps a node participating in multiple communities to selectively stay in more than one community based on high connectedness scores (and leave the rest), simultaneously.

Let the initial community structure be denoted as $S^0$. Further, let $Added_i = \{v_k \mid v_k \in N(v_i) \wedge v_k \in S_i\}$, $\forall S_i \in S^0$, be defined and referred to as the set of peripheral nodes of $S_i$, initially. The algorithm henceforth iterates over two phases: leave phase and expand phase. Let each iteration comprising these two phases be referred to as a stage. Also, let the community structure obtained after a certain stage $l$ be denoted as $S^l$. It is important to note that it is always the peripheral nodes for any community that either leave or expand in the stages following.

3.3.2 Leave Phase

In this phase a node leaves some of its communities when it finds itself not sufficiently connected in those. Every node $v_j$ is assigned two scores $\zeta_i^j$ and $\xi_i^j$ as defined in Equations (4) and (5) respectively. As a result, we obtain a list of community connectedness scores $\langle \zeta_i \rangle_j = \{\zeta_i^j \mid v_j \in S_i \wedge S_i \in S^l\}$ and neighborhood connectedness scores $\langle \xi_i \rangle_j = \{\xi_i^j \mid v_j \in S_i \wedge S_i \in S^l\}$ for each node $v_j$ participating in any community in the community structure $S^l$. A node is considered to be sufficiently connected in its community if it has a community connectedness score greater than a certain cut-off score. This cut-off here is referred to as stay cut-off, and is computed from the list of its community connectedness scores. In this paper, a method similar to the first step in bucket sort has been used for fast determination of stay cut-off.

For each node $v_j \in V$, the entire range of scores, which lie between 0 and 1 by definition, is divided into $\max(20, N(v_j))$ number of buckets of equal sizes. The initial count of number of scores that fall in each bucket is set to 0. The count for a bucket is incremented when a score in the list $\langle \zeta_i \rangle_j$ falls within its range. Once done, the rightmost bucket having count greater than 0 is marked. From there, the bucket list is scanned towards the left until either we have found a bucket that has a count less than or equal to that of the marked bucket and the count of the bucket to its left is greater than or equal to that of the current one, or we have reached the leftmost bucket. Fig. 3 illustrates with an example the marked and the chosen bucket. The lower bound of this bucket is chosen as the stay cut-off $\zeta_j^{cutoff}$ for $v_j$.

The proposed cut-off selection method has been chosen after observing the score distributions. It helps in selecting

TABLE 1
Status of a Node in a Community on Basis of Neighborhood Connectedness and Community Connectedness Scores

Neighborhood connectedness score | Community connectedness score: low | Community connectedness score: high
low | less interest and low belongingness (not this community) | less interest but high belongingness (strong community with few neighbors)
high | high interest but low belongingness (not accepted by community) | high interest and high belongingness (node is central to the community)
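The bucket-based selection of the stay cut-off described in Section 3.3.2 can be sketched as follows. This is one possible reading of the prose, not the authors' code: the function name, the 0-based bucket indices, the fixed default of 20 buckets, and the choice to begin the leftward scan at the bucket just left of the marked one are all assumptions. The example scores are hypothetical and were picked so that the bucket counts resemble the situation narrated in the Fig. 3 caption.

```python
def bucket_cutoff(scores, num_buckets=20):
    """Return a cut-off in [0, 1] chosen by scanning bucket counts leftwards.

    scores: connectedness scores of one node, each in [0, 1].
    num_buckets: assumed fixed here; the text uses max(20, N(v_j)).
    """
    counts = [0] * num_buckets
    for s in scores:
        # A score of exactly 1.0 is placed in the last bucket.
        idx = min(int(s * num_buckets), num_buckets - 1)
        counts[idx] += 1

    occupied = [i for i, c in enumerate(counts) if c > 0]
    if not occupied:
        return 0.0
    marked = occupied[-1]  # rightmost bucket with a non-zero count

    chosen = 0  # fall back to the leftmost bucket if no bucket qualifies
    for b in range(marked - 1, 0, -1):
        # Stop at a bucket no fuller than the marked one whose left
        # neighbour is at least as full as the bucket itself.
        if counts[b] <= counts[marked] and counts[b - 1] >= counts[b]:
            chosen = b
            break
    return chosen / num_buckets  # lower bound of the chosen bucket


# Hypothetical scores giving counts 3, 1, 2, 4, 2 in buckets 10..14.
scores = [0.51, 0.52, 0.53, 0.57, 0.61, 0.62,
          0.66, 0.67, 0.68, 0.69, 0.71, 0.72]
cutoff = bucket_cutoff(scores)
```

In this example the scan skips bucket 13 (fuller than the marked bucket 14) and bucket 12 (its left neighbour is emptier), and stops at bucket 11, so the cut-off is the lower bound of bucket 11.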

communities with near-best connectivities, unlike those resulting from other simpler alternatives such as mean, median or percentage threshold as cut-off.

Fig. 3. An illustrative example showing the selection of bucket for given distribution of counts of scores. The rightmost bucket with count greater than 0 is bucket 14, marked with an arrow. Next, in the scan towards left, bucket 12 has count as low as that of the marked bucket but the bucket to its left has a lower count. So, moving to the next, bucket 11 has count lower than that of 14 and the bucket to its left, i.e., bucket 10, has count greater than this bucket. So bucket 11 is the one chosen.

Now, for all communities $S_i \in S^l$, a peripheral node $v_k \in Added_i$ leaves $S_i$ if its community connectedness score $\zeta_i^k$ is lower than its stay cut-off $\zeta_k^{cutoff}$. Removal of only peripheral nodes ensures that nodes that form the core of a community are never eliminated. However, any community with less than $K$ nodes remaining is considered insignificant and is eliminated. Computation of scores and removal of peripheral nodes is performed recursively for the entire community structure till no node leaves any community.

Algorithm 1. Fast Overlapped Community Search
Input: $G = (V, E)$: input graph; $K$: minimum connections for a node within a community; OVL: maximum allowed overlap between communities
Output: $S = \{S_i \mid S_i \subseteq V$ and $S_i$ is a community$\}$
Auxiliary Variables: $n = |V|$; $N(v)$ = neighbors of node $v$; $Added_i$ = nodes added to community $S_i$ in last round

procedure PREFERREDCOMMUNITIES(G, K, OVL)
    $S = \emptyset$
    InitializeCommunities(G, K, S)
    expand ← 1
    while expand do
        leave ← 1
        while leave do
            leave ← 0
            LeaveCommunities(S, K, OVL, leave)
        end while
        expand ← 0
        ExpandCommunities(S, expand)
    end while
    return S
end procedure

function INITIALIZECOMMUNITIES(G, K, S)
    /* Community of each node is initialized by the node $v \in V$ and its neighbors $N(v)$ if $|N(v)| \ge K$ */
    for each $i \in \{1, 2, \ldots, n\}$ do
        if $|N(v_i)| \ge K$ then
            $S_i = \{v_i\} \cup N(v_i)$
            $Added_i$ ← $N(v_i)$
            $S = S \cup S_i$
        else
            $S_i$ = NULL, $Added_i$ = NULL
        end if
    end for
end function

function LEAVECOMMUNITIES(S, K, OVL, leave)
    /* In each community $S_i$ node $v \in S_i$ leaves $S_i$ if its community connectedness score is less than stay cut-off. Updated communities of size less than K are deleted */
    Eliminate near-duplicate community $S_i$ if $\exists S_j \in S$, $j \ne i$ such that $c(S_i, S_j) >$ OVL (see Equation (6)), $\forall S_i \in S$
    Compute community connectedness scores $\langle \zeta_i \rangle_j$ and neighborhood connectedness scores $\langle \xi_i \rangle_j$ (see Equations (4) and (5) respectively), $\forall v_j \in S_i$, $\forall S_i \in S$
    Compute stay cut-off $\zeta_i^{cutoff}$ (refer text 3.3.2), $\forall v_i \in V$
    for each community $S_i \in S$ do
        $S_i = S_i - \{v_k\}$ if $\zeta_i^k < \zeta_k^{cutoff}$, $\forall v_k \in Added_i$
        if $S_i$ updated in previous step then
            if $|S_i| \le K$ then
                $S = S - \{S_i\}$ /* Community $S_i$ is deleted */
            else
                leave ← 1
            end if
        end if
    end for
end function

function EXPANDCOMMUNITIES(S, expand)
    /* For each community $S_i$, each adjacent $u \in N(v)$ of each node $v \in Added_i$ is included in $S_i$ if $u$ is not in $S_i$ and it builds up a neighborhood connectedness score greater than its join cut-off */
    Compute join cut-off $\xi_j^{cutoff}$ (refer text 3.3.3), $\forall v_j \in V$
    for each community $S_i \in S$ do
        $Nowadded_i = \emptyset$
        for each $u_k \in N(v_j)$, $\forall v_j \in Added_i$ do /* For each node added to $S_i$ in the last round */
            if $\xi_i^k > \xi_k^{cutoff}$ and $u_k \notin S_i$ then
                $S_i = S_i \cup \{u_k\}$
                $Nowadded_i = Nowadded_i \cup \{u_k\}$
            end if
        end for
        $Added_i = Nowadded_i$
        if $|Added_i| \ge 1$ then
            expand ← 1
        end if
    end for
end function

3.3.3 Expand Phase

After the leave phase, the idea of extending a community to its neighboring nodes is pursued. So, in each community $S_i$ the peripheral nodes $Added_i$ include each neighboring node $v_j$, if the following conditions hold:

• the node is not already included,
• the node has high interest in joining this community.

High interest in joining a new community is depicted via a high neighborhood connectedness score, $\xi_i^j$. A node is taken to have a high neighborhood connectedness score when the score is greater than its join cut-off, $\xi_j^{cutoff}$. The join cut-off $\xi_j^{cutoff}$ is computed from the list of neighborhood connectedness scores $\langle \xi_i \rangle_j$ in a way similar to the stay cut-off.

When a community expands, most of its nodes become less connected. This is because an existing node is able to connect to very few of the newly included nodes. Consequently, the maximum of the community connectedness scores decreases for all nodes. On the other hand, the number of edges (friends for each node) in that community increases. A node is understood to have interest in joining a new community if it has at least as many connections (friends) in the one concerned as it had in the

communities in the previous stage. At this point, the neighborhood connectedness score helps in the decision of a node to participate in a new community and prevents expansion of the already discovered communities into sparser subgraphs.

After this, $Added_i$ is set to contain the newly added nodes of $S_i$ (or $\emptyset$ if no new node was added), for either removal or expansion in the next stage. Expansion of only peripheral nodes, on one hand, allows for re-inclusion of removed nodes, and on the other hand takes care that nodes which do not fit in the community are not repetitively added and later removed during the leave phase. It also helps FOCS converge.

After the expand phase, the community structure so obtained at this stage is passed as input to the leave phase of the next stage. At a certain stage, all the peripheral nodes of some particular communities are removed during the leave phase. These communities cannot expand further in later stages. However, such a community still contributes to the list of connectedness scores maintained for its nodes. FOCS stops when, in a stage, there are no peripheral nodes remaining in all existing communities.

3.3.4 Duplication Removal

Overlapped community detection algorithms allow almost all nodes in a network to be a part of multiple communities. This is because each initial community is allowed to expand and grow to become near-to-duplicate communities. Such near-to-duplicate community pairs $(C, C')$ are identified via the similarity measure defined as follows [36]:

$c(C, C') = \dfrac{|C \cap C'|}{\min(|C|, |C'|)}.$ (6)

Duplication removal is performed during each stage, before passing communities to the leave phase and after every iteration within it. Duplication removal is essential from two viewpoints: (i) it prevents the score distribution from being undesirably skewed; (ii) with a number of near-to-duplicate communities removed, the computation time is also reduced.

Duplication removal, in FOCS, takes a parameter OVL as input. OVL sets a threshold for the maximum overlap allowed between two communities, before they can be identified as near-duplicates. The smaller of the two communities $S_i$ and $S_j$ is deleted when the similarity measure $c(S_i, S_j)$ crosses this threshold. An OVL = 1 implies elimination of a duplicate community when it is exactly identical to another. Reasonably, OVL must be set to $\ge 0.5$ and less than 1 for early identification of duplicates in the community structure. We have taken OVL = 0.6 in our work. However, we have experimented with different values of OVL and observed stability in output in qualitative terms when set in the range 0.5 to 0.7. Fig. 5 shows the change in community statistics with variation in OVL, when FOCS is simulated on the Amazon network [34].

Fig. 5. Change in community statistics when input parameter to FOCS, OVL, is varied, simulated on Amazon network [34].

Given an underlying static graph $G = (V, E)$, the proposed algorithm, Fast Overlapped Community Search (FOCS), is stated in Algorithm 1. Fig. 4 shows how the detected communities evolve over stages with the execution of FOCS on the dolphin network [37].

Fig. 4. Fast Overlapped Community Search applied to the Dolphin network. Circles and lines represent nodes and edges of the network respectively. Each translucently shaded and bordered region enclosing nodes represents a community. Nodes that are not color-filled do not belong to any community. These communities evolved with OVL set to 0.5 and K = 2.

3.4 Complexity Analysis

For a given undirected, unweighted graph $G(V, E)$, let $n = |V|$ be the number of nodes and let $m = |E|$ be the number of edges. During the initialization of communities, the entire adjacency list is scanned once so that a node forms a community if it has more than $K$ neighbors. A scan of the adjacency list requires time $O(n + m)$. However, in most
to include nodes irrespective of their existence in other com- cases the network is connected and the required time is in
munities. Thus, at each phase certain communities may OðmÞ. Henceforth, the initial communities consecutively

shrink and expand through removal or expansion of peripheral nodes in the community, respectively. It must be noted that, nodes that are peripheral in current stage will not be so in the next stage.

Let $\lambda$ be the average number of communities per node (refer to Theorem A.1 for derivation of upper bound of $\lambda$). Then, a total of $n\lambda$ leaves or expansions will take place in the algorithm. The leave phase requires computation of scores $\zeta_j^i$ and $\xi^i$ for node $v_j$ in community $S_i \in S$, where $S$ is the community structure. Each such computation involves comparison of the community members of $S_i$ against the adjacency list corresponding to $v_j$ to get the total number of adjacents of $v_j$ within the community. Considering the average community length to be $\lambda$ with $n$ number of communities, such computation takes $O(\lambda)$ time. So $n\lambda$ number of computations take $O(n\lambda^2)$ time. Computation of the cut-off scores $\zeta_j^{cutoff}$ and $\xi^{cutoff}$ for each node $v_j$ takes time $n \cdot c$ units, where $c$ is a constant indicating the number of buckets. So, this takes $O(n)$ time. The elimination of nodes after the computation of scores in the leave phase requires a complete scan of the community structure which is achieved in $O(n\lambda)$ time. Thus, total time taken over all leave phases is in $O(n\lambda^2 + n + n\lambda) = O(n\lambda^2)$.

TABLE 2
Comparison of Time Complexity for Various Existing Methods with FOCS

| Algorithm | Time Complexity |
|---|---|
| CFinder [11] | $O(\exp(n))$ |
| Game [22] | $O(m^2)$ |
| MOSES [18] | $O(en^2)$ |
| LFM [9] | $O(n^2)$ |
| OSLOM [38] | $O(n^2)$ |
| COPRA [21] | $O(vm \log(vm/n))$ |
| LinkComm [16] | $O(nK_{max}^2)$ |
| DEMON [28] | $O(nK_{max}^{3-\alpha})$ |
| BigClam [19] | $O(cn + m)$ |
| GCE [10] | $O(mh)$ |
| SLPA [25] | $O(m)$ |
| FOCS | $O(n + m) \simeq O(m)$, since $m > n$ |

m = No. of edges in the network
n = No. of nodes in the network
$K_{max}$ = Maximum degree for any node in the network
$\alpha$ : Network has power law degree distribution with $p_k = k^{-\alpha}$
c = Sublinear term in k, the number of communities (exact relation is not clearly stated in [19])
h = No. of cliques
e = No. of edges to be expanded
v = Maximum number of communities a node can participate in
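The dominant step in the leave-phase analysis above is counting how many neighbors of a node $v_j$ lie inside a community $S_i$. A minimal Python sketch of that counting step is given below. The dict-of-sets graph layout, the function names, and the toy graph are illustrative assumptions, not the authors' implementation, and the sketch stops short of the score formulas themselves, which are defined earlier in the paper.

```python
from typing import Dict, Hashable, List, Set, Tuple

# Hypothetical graph layout for this sketch: node -> set of neighbors.
Graph = Dict[Hashable, Set[Hashable]]


def neighbors_inside(graph: Graph, node: Hashable, community: Set[Hashable]) -> int:
    """Count the neighbors of `node` that are members of `community`."""
    adjacency = graph[node]
    # Iterate over the smaller set and probe the larger one by hashing.
    if len(adjacency) <= len(community):
        small, large = adjacency, community
    else:
        small, large = community, adjacency
    return sum(1 for u in small if u in large)


def count_all_memberships(
    graph: Graph, communities: List[Set[Hashable]]
) -> Dict[Tuple[int, Hashable], int]:
    """Return {(community index, node): internal neighbor count} for every membership."""
    counts: Dict[Tuple[int, Hashable], int] = {}
    for i, members in enumerate(communities):
        for v in members:
            counts[(i, v)] = neighbors_inside(graph, v, members)
    return counts


# Toy graph (assumed for illustration): triangle a-b-c plus a pendant node d on c.
toy_graph: Graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
toy_communities = [{"a", "b", "c"}, {"c", "d"}]
internal = count_all_memberships(toy_graph, toy_communities)
```

Each (community, node) membership is visited once, and each count touches at most as many elements as the community holds, which is the $n\lambda$ memberships times $O(\lambda)$ work per membership that the analysis sums to $O(n\lambda^2)$.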
In the expansion phase, for each peripheral node in community $S_i$, its adjacency list is scanned against $S_i$ to ensure that an adjacent node to be included does not already exist in $S_i$. This takes $O(n\lambda^2)$ time. Further, for each such adjacent node, computation of $\xi^i$ requires comparison of the adjacency list of this adjacent node against $S_i$, summing up to $O(n\lambda^3)$ time for all expansion phases combined. Now, duplication removal involves comparison of each community $S_i$ against other communities of the peripheral nodes. So, for each of the $n$ communities, there are $\lambda$ peripheral nodes (over all stages), each having $\lambda - 1$ other communities to be compared with, and comparison takes $O(\lambda)$ time, making it $O(n\lambda^3)$ time for duplication removal.

Thus, FOCS takes $O(m + n\lambda^2 + n\lambda^3 + n\lambda^3) = O(m + n\lambda^3)$ time across initialization, leave phase, expansion phase and duplication removal. From Theorem A.1, Appendix A it is clear that $\lambda$ takes a constant value irrespective of the size of the network (with OVL = 0.6, the value used in this work, the bound is $1/(1 - 0.6) = 2.5$). Thereby, time complexity for FOCS is $O(n + m)$. Apart from being linearly scalable like some other overlapped community detection algorithms, the proposed algorithm has faster implementation owing to its simplicity and the flags maintained that keep from recomputing scores for nodes and/or communities that do not undergo any change across phases and stages.

Table 2 shows a comparison of the time complexity for various existing methods that have been used for comparison with FOCS.

4 EMPIRICAL RESULTS
The performance of a community detection algorithm can be evaluated both on real and simulated networks. However, the simulated networks do not capture some of the important characteristics associated with community structure in real networks.

Consider the LFR benchmark graphs in this regard [9]. LFR benchmark graphs are a family of artificially simulated graphs which allow communities to overlap. These graphs demonstrate some important features of real networks such as the scale-free property and the power law distribution of community sizes. However, LFR assigns, for each node, an equal number of its neighbors to different clusters such that sums of the number of neighbors in different communities for the nodes closely follow the degree distribution. This is not the case in real networks (as one can see in Fig. 6, the sums are higher than degrees in several cases). Moreover, the standard deviation of the number of neighbors in different communities for a particular node roughly increases with its degree and correspondingly increasing number of communities it is participating in (see Fig. 7). Thus, the fraction of neighbors participating in different communities of a node will be far from equal, unlike in the case of LFR graphs where they are close to equal. Further, in case of LFR graphs a node is assigned to either a single community or an equal number of multiple communities, which makes them all the more unrealistic because the distribution of number of communities per node in real networks follows a heavy tailed power law distribution [14]. For all these reasons we evaluate the performance of FOCS over the real networks only.

It is argued that modularity tends to produce larger communities, and imposes a limit to resolution [39]. Again, it has been shown that modularity follows the same pattern over different classes of networks [40], and is thus unable to follow the divergent community structures in different real networks. Therefore, the normalized mutual information (NMI) between the detected and the ground-truth communities is used for the purpose of performance evaluation [9].

To test the performance of FOCS on large-scale real networks, FOCS algorithm is evaluated on 7 real networks of large size. All the networks are undirected and unweighted. Find below short descriptions for all considered networks.

Fig. 6. Distribution of sum of number of neighbors in multiple communities for nodes of corresponding networks.

Fig. 7. Distribution of standard deviation of number of neighbors in multiple communities for nodes of corresponding networks.

Amazon. It is the undirected network of Amazon product co-purchasing. Here, the product categories are hierarchically nested and thus the corresponding network inherently organizes into overlapping community structure. The products in the same ground-truth community share a common function [34].

DBLP. It is the scientific collaboration network of DBLP computer science, where two authors are connected if they have published at least one paper together. Here, publication venues, i.e., journals and conferences, are used as proxies for ground-truth research communities. In such a network members are related to each other pertaining to areas of research, and thus highly overlapping community structure is natural to be observed [34].

YouTube. It is a video-sharing web site. Users can form friendship with each other and thus YouTube also depicts a social network. Ground-truth communities considered are groups explicitly formed by users [34].

LiveJournal. It is a free on-line blogging community where users declare friendship with each other. Ground-truth communities are groups explicitly created by users based on common interest topics, affiliations, and geographical regions. Other users in the network then join some of these communities. Communities belong to one of the categories: culture, entertainment, expression, fandom, life/style, life/support, gaming, sports, student life and technology [34].

Orkut. It is a free on-line social network where users form friendship with each other. Ground-truth communities are defined on a basis similar to that of LiveJournal [34].

Yeast PPIN. The yeast interaction network is collected and combined from three different sources: Y2H-Union containing 2,930 interactions [41], 2,770 interactions from [42], and only the positive examples, i.e., top 58 interactions from [43]. Redundancies and self-loops are removed, resulting in a network of 2,705 interactions among 1,966 proteins. The set of protein complexes considered as true community set is CYC2008 collected from [44]. From the complexes in CYC2008 the proteins that are not in the interaction dataset are removed, followed by elimination of complexes containing two or fewer protein subunits. Following the filtration process 137 out of 408 original complexes remain.

Human PPIN. The human protein interactome is the PCDq dataset collected from Results of computational analysis section [45]. It provides for both the interaction network and the complexes (both experimentally verified and computationally predicted using DBClus). The complete interaction network with 32,198 interactions among 9,268 proteins is used. The human protein complex dataset contains 1,078 complexes constituting of 3,759 proteins. Among these, only complexes that belong to either category I or category II are considered. These are the complexes with high number of proteins experimentally verified (for details check [45]). Further, complexes of size 2 or less are filtered out, resulting in a total of 1,221 complexes formed of 4,325 proteins.

The size of the networks ranges from hundreds of thousands to millions of nodes and over a hundred million edges. The number of ground-truth communities, community sizes and average node membership for the communities too range over a large scale. Table 3 provides the specifics.

TABLE 3
Dataset Statistics

| Networks | #Nodes | #Edges | D | Dmax | C | S | Smax | M | Mmax | F | F2+ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Amazon | 0.3M | 0.9M | 5.53 | 549 | 151K | 19.38 | 53.5K | 8.74 | 170 | 0.94 | 0.91 |
| DBLP | 0.3M | 1M | 6.62 | 343 | 13.5K | 53.41 | 7.6K | 2.27 | 124 | 0.82 | 0.35 |
| YouTube | 1.1M | 3M | 5.27 | 28.8K | 8.4K | 13.5 | 3K | 0.1 | 173 | 0.04 | 0.02 |
| LiveJournal | 4M | 34.7M | 17.35 | 14.8K | 0.3M | 22.31 | 0.18M | 1.6 | 579 | 0.27 | 0.18 |
| Orkut | 3M | 117.2M | 76.3 | 33.3K | 6.3M | 14.16 | 9.1K | 29 | 2.6K | 0.75 | 0.70 |
| Yeast PPIN | 1.9K | 2.7K | 3.95 | 90 | 137 | 5.38 | 21 | 0.28 | 5 | 0.23 | 0.03 |
| Human PPIN | 9.3K | 32.2K | 6.95 | 342 | 1078 | 4.5 | 32 | 0.52 | 21 | 0.41 | 0.06 |

D: average degree, Dmax: maximum degree, C: number of communities, S: average community size, Smax: maximum community size, M: number of communities per node on an average, Mmax: maximum number of communities any node participated in, F: fraction of nodes that participated in at least one community, F2+: fraction of nodes that participated in more than one community. K denotes a thousand and M denotes a million.

Protein complexes are coherent groups of proteins that bind at the same time and place, to perform a particular function. A single protein is known to often bind with multiple sets of proteins at different times and locations for different functions, thereby resulting in overlapped complexes in the protein interactome. The protein interactomes available till date, though incomplete, are expected to closely follow the complete interactome structure. The interactomes collected are the most complete available.

Table 4 reports the execution time taken by the various algorithms on the considered networks. Table 5 summarizes the results with each cell representing the NMI between the detected and the ground-truth communities.

TABLE 4
Comparison of Time Taken in Detection of Communities by FOCS and by Seven of the Existing Algorithms

Each cell gives #Communities/Time Taken.

| Networks | MOSES | GCE | OSLOM | COPRA | SLPA | LinkComm | BigClam | FOCS |
|---|---|---|---|---|---|---|---|---|
| Amazon | 30.2K/160 s | 25.9K/10 s | 18.7K/711 s | 8.4K/1183 s | 30.5K/456 s | 61.5K/14 s | 151K/1.18 h | 20.9K/2 s |
| DBLP | 46.4K/273 s | 22.6K/16 s | 22.2K/21 m | 14.9K/180 s | 22.2K/578 s | 78.4K/34 s | 39.6K/33 m | 24.2K/2 s |
| YouTube | 8K/1.9 h | - | - | 12K/238 s | 39.9K/104 m | 5.1K/1.5 h | 8K/1.4 h | 7K/52 s |
| LiveJournal | - | - | - | - | - | - | - | 0.2M/312 s |
| Orkut | - | - | - | - | - | - | - | 0.2M/48 m |
| Yeast PPIN | 76/18 s | 92/0 s | 74/3 s | 86/0 s | 243/1 s | 159/0 s | 137/1 s | 32/0 s |
| Human PPIN | 106/30 s | 284/4 s | 206/60 s | 26/1 s | 337/3 s | 436/1 s | 1,078/57 s | 114/0 s |

The blanks in the table denote that the method was allowed to run for 4 hours before any result was generated, after which it was terminated. h, m, and s denote hour(s), minute(s) and second(s) respectively. K denotes a thousand and M denotes a million.

FOCS is compared with seven widely used overlapped community detection algorithms, namely Greedy clique expansion (GCE) [10], MOSES [18], OSLOM [38], COPRA [21], SLPA [25], Link Communities (LinkComm) [16] and BigClam [19]. Greedy clique expansion expands cliques greedily to include edges such that within community edge density is improved. MOSES employs stochastic block model based community detection. OSLOM finds communities based on the difference between modularity of a candidate community and that of the same set of nodes in a randomly generated network. In COPRA each node updates its belonging coefficient and decides on its set of community labels by averaging that of its neighbors in synchronous fashion. SLPA propagates community labels between nodes such that a listener node receives and saves the most probable label among those sent by its neighbors where each neighboring node sends a label with probability proportional to its occurrence frequency in memory over multiple iterations. Link Communities, on the other hand, performs agglomerative hierarchical clustering where similarity between nodes is a function of the commonalities in their respective neighborhoods. BigClam employs non-negative matrix factorization method along with block stochastic gradient descent to optimize the model likelihood of explaining the links in network based on communities the nodes participate in.

The original implementations have been used for each of the listed algorithms. Further, they have been executed having their parameters set to the default values, except for GCE, where minimum cluster size is changed to 3 instead of 4. Additionally for LinkComm, SLPA, and COPRA, the communities of size 2 or less are filtered out. COPRA also requires one to set the maximum number of communities a node can participate in as an input parameter, $v$, which was tested for values starting from 2, increasing by 1 each time until the results became worse. The results reported are those with $v$ set to values that yielded community structure with close match to the number of ground-truth communities for different networks. Similarly, results from SLPA depend heavily on the probability threshold parameter, $r$, which was tested for $r \in [0.01, 0.5]$ and chosen such that the number of output communities was close to that reported

for the corresponding ground-truth communities. Tables 4 and 5 report results for COPRA with $v$ set to 9, 4, 3, 2, and 2 for Amazon, DBLP, YouTube, Yeast PPIN, and Human PPIN respectively. Results reported for SLPA are those with $r$ set to 0.01, 0.05, 0.01, 0.5, and 0.05 for these datasets respectively. For BigClam, either number of communities, or a range of number of communities to be tested is required as input. We tested with number of communities equal to that appearing in ground truth communities for the networks, as well as for a range encompassing outputs from other algorithms. The number of communities which yielded the best result was noted, and the time shown in Table 4 is for simulation with this exact number of communities as input parameter.

Blanks in Tables 4 and 5 show that GCE and OSLOM could not produce results for dataset YouTube within four hours, while none of the methods except FOCS scaled to datasets LiveJournal and Orkut in the same time. COPRA and SLPA, however, faced memory limitations much earlier for datasets LiveJournal and Orkut. Other algorithms, including LFM [9], DEMON [28], and game-theoretic [22], are eliminated from comparison as they could not produce results even after four hours of execution for any of the social network datasets. The game-theoretic algorithm, in contradiction to its claim, does not converge.

TABLE 5
Comparison of NMI between Ground-Truth Communities and Communities Detected by FOCS and by Seven of the Existing Algorithms

| Networks | MOSES | GCE | OSLOM | COPRA | SLPA | LinkComm | BigClam | FOCS |
|---|---|---|---|---|---|---|---|---|
| Amazon | 0.2239 | 0.2164 | 0.1851 | 0.2076 | 0.1208 | 0.2558 | 0.2421 | 0.2075 |
| DBLP | 0.153 | 0.1374 | 0.1276 | 0.1484 | 0.1191 | 0.2112 | 0.1448 | 0.2135 |
| YouTube | 0.0127 | - | - | 0.0150 | 0.0025 | 0.0161 | 0.0008 | 0.0225 |
| LiveJournal | - | - | - | - | - | - | - | 0.0307 |
| Orkut | - | - | - | - | - | - | - | 0.0611 |
| Yeast PPIN | 0.1064 | 0.1322 | 0.0481 | 0.1236 | 0.0502 | 0.1148 | 0.004 | 0.1284 |
| Human PPIN | 0.0793 | 0.0481 | 0.0744 | 0.2510 | 0.1305 | 0.1106 | 0.0328 | 0.2471 |

The blanks in the table denote that the method was allowed to run for 4 hours before any result was generated, after which it was terminated.

The results depict significant gain in terms of execution time as compared to the other algorithms. Interestingly, it does not come at the cost of performance. For all networks except Amazon and PPIN networks, FOCS outperforms the other methods. Communities serving as ground-truth for Amazon have very high overlap (about 91 percent nodes participate in two or more communities as can be seen in Table 3). Thus, NMI values for Amazon mostly conform with methods that yield very high number of overlapping communities. LinkComm, though efficient in detecting most of the communities correctly, does not scale well with increasing network sizes. BigClam performs well with input for number of communities set equal to that in ground-truth communities except in the case of DBLP network. It performs competitively only for the case of Amazon dataset, but requires large amount of time. COPRA produces strong results for the PPIN networks which mostly have disjoint communities, with very few nodes participating in overlaps between communities. Fig. 8 shows the runtime of FOCS versus the number of edges, $m$, as compared to all the seven existing algorithms considered. The figure depicts run time for five of the social network datasets considered, with an inline figure depicting runtime for Amazon, DBLP and YouTube only. None of the algorithms except FOCS scales well to the two largest social network datasets considered, within the given time and memory constraints.

Fig. 8. Runtime of FOCS compared to seven of the existing algorithms for five of the social network datasets [34] with increasing number of edges, m.

5 CONCLUSION
Social networks are complex and large. Fast Overlapped Community Search (FOCS) explores communities rapidly by selecting only those where all nodes are locally well connected. The community connectedness and neighborhood connectedness scores, which are computed for each node throughout the algorithm, reflect real world community properties. These make the algorithm applicable to real networks of varying sizes. Users are free to set the maximum allowed overlap between any two communities, and the minimum number of neighbors that a node should have, to determine its membership in any community.

One of the limitations of FOCS is that the maximum number of communities that can be detected by this method is equal to the number of nodes in a network. In social networks, however, as can be seen in Orkut [14], the number of communities can in fact be double the number of nodes. This happens when a node is allowed to create multiple communities. We try to address this issue in our future work. Further, we would like to extend the method to work with weighted and/or directed networks, dynamic networks, etc.
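The maximum allowed overlap that users set acts through the similarity of Equation (6): the smaller community of a pair is dropped once the similarity reaches OVL. The Python sketch below is one way to express that rule. The function names, the scan over all pairs (the paper restricts comparisons to communities of peripheral nodes), and the use of `>=` for "crosses this threshold" (chosen so that OVL = 1 still triggers removal when the similarity equals 1) are assumptions made for illustration, not the authors' implementation.

```python
from typing import Hashable, List, Set


def similarity(c1: Set[Hashable], c2: Set[Hashable]) -> float:
    """Equation (6): |C ∩ C'| / min(|C|, |C'|)."""
    if not c1 or not c2:
        return 0.0
    return len(c1 & c2) / min(len(c1), len(c2))


def remove_near_duplicates(
    communities: List[Set[Hashable]], ovl: float = 0.6
) -> List[Set[Hashable]]:
    """Keep a community only if its similarity to every kept one stays below `ovl`.

    Simplifying assumption: communities are visited from largest to smallest,
    so whenever a pair reaches the threshold the smaller one is discarded.
    """
    kept: List[Set[Hashable]] = []
    for candidate in sorted(communities, key=len, reverse=True):
        if all(similarity(candidate, other) < ovl for other in kept):
            kept.append(candidate)
    return kept


# Hypothetical toy input: two communities overlap heavily with the largest one.
example = [{1, 2, 3, 4, 5}, {3, 4, 5, 6}, {7, 8, 9}, {1, 2, 3}]
survivors = remove_near_duplicates(example, ovl=0.6)
```

In the toy input, `{3, 4, 5, 6}` shares three of its four nodes with the largest community (similarity 0.75) and `{1, 2, 3}` is fully contained in it (similarity 1), so both are removed at OVL = 0.6, the value used in this work.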

APPENDIX A

Theorem A.1. Given OVL as the maximum allowed overlap between any two communities for a network, the average number of communities per node $\lambda$ is maximally bounded by $\frac{1}{1 - \mathrm{OVL}}$.

Proof. We are given an undirected, unweighted network represented by graph $G(V, E)$ where $V$ is the set of vertices and $E$ is the set of edges. Let $n = |V|$. In order to find the upper bound for $\lambda$, one needs to assign maximum number of communities to each of the nodes in the network. As per the proposed algorithm FOCS, a maximum of $n$ communities can be formed. Consequently, each node $v \in V$ may belong to all $n$ communities, resulting in 100 percent overlap (or, similarity as defined in Equation (6)) between any two communities. We need to assign maximum number of communities per node such that the overlap between any pair of communities is constrained by the given overlap threshold OVL. First, we prove by induction that if each node belongs to $\lambda = n - s$ communities on average, where $s \in \mathbb{N}$, the set of natural numbers, the minimum achievable maximum of overlap between all pairs of communities is, say, $O_{maxmin} = \frac{n-s-1}{n-s}$. So, the induction statement is

$$P(s) : \lambda(s) = n - s \Rightarrow O_{maxmin}(s) = \frac{n-s-1}{n-s}.$$

Basis: $P(1)$ holds
When $s = 1$, $\lambda(1) = n - 1$, i.e., on an average each node belongs to $n - 1$ communities. So, we need to remove in total $n$ nodes combined from $n$ communities to achieve $\lambda(1)$ (from the scenario where each node belonged to all $n$ communities). In order for minimum achievable maximum of overlap between all pairs of communities, a unique node is removed from each of the $n$ communities. Thus, each community now has $n - 1$ nodes and there are exactly $n - 2$ nodes in the overlapped region between any pair of communities (since for each of the pair, both the nodes removed belonged to overlapped region). This results in an overlap of $\frac{n-2}{n-1}$. On the other hand, $O_{maxmin}(1) = \frac{n-1-1}{n-1} = \frac{n-2}{n-1}$. It must be noted that in the current situation, for a community pair, if the same node belonging to the overlapped region is removed from both communities, the overlap for this pair remains 100 percent which is higher than $O_{maxmin}(1)$. And, if more than one node is removed from the same community, it again results in a 100 percent overlap between this community and the community with no node removed. Hence, $P(1)$ holds.

Inductive Step: $P(k)$ holds $\Rightarrow$ $P(k + 1)$ holds
We assume that $P(k)$ holds, meaning that when average number of communities per node, $\lambda(k) = n - k$, the minimum achievable maximum of overlap between all pairs of communities is $O_{maxmin}(k) = \frac{n-k-1}{n-k}$ with each community containing $n - k$ nodes and $n - k - 1$ nodes belonging to overlapped region between every pair.

For $s = k + 1$, $\lambda(k + 1) = n - (k + 1) = n - k - 1$, i.e., each node belongs to $n - k - 1$ communities on average. Now, for condition of minimum achievable maximum of overlap between all community pairs, a unique node is removed from each of the $n$ communities (from the scenario where $P(k)$ is true) such that for any community pair, one of the nodes removed belonged to overlapped region while the other did not. Thus, number of nodes in each community becomes $n - k - 1$, as a node reduces, and number of nodes in the overlapped region between the pair is reduced by exactly one. This results in an overlap of $\frac{n-k-1-1}{n-k-1} = \frac{n-k-2}{n-k-1}$. On the other hand, $O_{maxmin}(k + 1) = \frac{n-(k+1)-1}{n-(k+1)} = \frac{n-k-2}{n-k-1}$, thereby showing that $P(k + 1)$ holds.

Thus, $P(s)$ holds for all natural numbers $s$. Now, given maximum allowed overlap OVL for a network, we want to find out the maximum value for average number of communities per node, $\lambda(s)$, which is when $\mathrm{OVL} = O_{maxmin}(s)$. This ensures that the community structure as a whole exists with no community pair having an overlap greater than $O_{maxmin}(s)$. So, we have

$$\mathrm{OVL} = O_{maxmin}(s) = \frac{n-s-1}{n-s} = 1 - \frac{1}{n-s}, \quad \text{or,} \quad s = n - \frac{1}{1 - \mathrm{OVL}}. \qquad (7)$$

Now, the average number of communities per node, $\lambda(s)$, in this case is given by $n - s$. So, we have

$$\lambda(s) = n - s = n - \left(n - \frac{1}{1 - \mathrm{OVL}}\right) = \frac{1}{1 - \mathrm{OVL}}. \qquad (8)$$

Hence, the theorem follows. $\square$

REFERENCES
[1] K. Dasgupta, R. Singh, B. Viswanathan, D. Chakraborty, S. Mukherjea, A. A. Nanavati, and A. Joshi, “Social ties and their relevance to churn in mobile telecom networks,” in Proc. 11th Int. Conf. Extending Database Technol.: Adv. Database Technol., 2008, pp. 668–677.
[2] J. Xu and H. Chen, “Criminal network analysis and visualization,” Commun. ACM, vol. 48, no. 6, pp. 100–107, 2005.
[3] S. Fortunato, “Community detection in graphs,” Phys. Rep., vol. 486, no. 3, pp. 75–174, 2010.
[4] M. Plantie and M. Crampes, “Survey on social community detection,” in Social Media Retrieval. New York, NY, USA: Springer, 2013, pp. 65–85.
[5] J. Xie, S. Kelley, and B. K. Szymanski, “Overlapping community detection in networks: The State-of-the-art and comparative study,” ACM Comput. Surveys, vol. 45, no. 4, pp. 43, 2013.
[6] S. E. Schaeffer, “Graph clustering,” Comput. Sci. Rev., vol. 1, no. 1, pp. 27–64, 2007.
[7] M. Rosvall and C. T. Bergstrom, “Maps of random walks on complex networks reveal community structure,” Proc. Nat. Acad. Sci., vol. 105, no. 4, pp. 1118–1123, 2008.
[8] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” J. Statistical Mech.: Theory Experiment, vol. 2008, no. 10, p. P10008, 2008.
[9] A. Lancichinetti, S. Fortunato, and J. Kertesz, “Detecting the overlapping and hierarchical community structure in complex networks,” New J. Phys., vol. 11, no. 3, p. 033015, 2009.
[10] C. Lee, F. Reid, A. McDaid, and N. Hurley, “Detecting highly overlapping community structure by greedy clique expansion,” ArXiv e-prints, Feb. 2010.
[11] G. Palla, I. Derenyi, I. Farkas, and T. Vicsek, “Uncovering the overlapping community structure of complex networks in nature and society,” Nature, vol. 435, no. 7043, pp. 814–818, 2005.

[12] T. Evans, “Clique graphs and overlapping communities,” J. Statis- [38] A. Lancichinetti, F. Radicchi, J. J. Ramasco, and S. Fortunato,
tical Mech.: Theory Experiment, vol. 2010, no. 12, p. P12037, 2010. “Finding statistically significant communities in networks,” PloS
[13] N. Mishra, R. Schreiber, I. Stanton, and R. E. Tarjan, “Finding One, vol. 6, no. 4, p. e18961, 2011.
strongly knit clusters in social networks,” Internet Math., vol. 5, [39] S. Fortunato and M. Barthelemy, “Resolution limit in community
no. 1/2, pp. 155–174, 2008. detection,” Proc. Nat. Acad. Sci., vol. 104, no. 1, pp. 36–41, 2007.
[14] J. Yang and J. Leskovec, “Structure and overlaps of communities [40] J. Leskovec, K. J. Lang, and M. Mahoney, “Empirical comparison
in networks,” CoRR, vol. abs/1205.6228, 2012. of algorithms for network community detection,” in Proc. 19th Int.
[15] S. L. Feld, “The focused organization of social ties,” Am. J. Sociol., Conf. World Wide Web, 2010, pp. 631–640.
vol. 86, pp. 1015–1035, 1981. [41] H. Yu, P. Braun, M. A. Yıldırım, I. Lemmens, K. Venkatesan,
[16] Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann, “Link communities J. Sahalie, T. Hirozane-Kishikawa, F. Gebreab, N. Li, N. Simonis,
reveal multiscale complexity in networks,” Nature, vol. 466, T. Hao, J. F. Rual, A. Dricot, A. Vazquez, R. R. Murray, C. Simon,
no. 7307, pp. 761–764, 2010. L. Tardivo, S. Tam, N. Svrzikapa, C. Fan, A. S. de Smet, A. Motyl,
[17] T. Evans and R. Lambiotte, “Line graphs, link partitions, and M. E. Hudson, J. Park, X. Xin, M. E. Cusick, T. Moore, C. Boone,
overlapping communities,” Phys. Rev. E, vol. 80, no. 1, pp. 016105, M. Snyder, F. P. Roth, A. L. Barabasi, J. Tavernier, D E. Hill, and
2009. M. Vidal, “High-quality binary protein interaction map of the yeast
[18] A. McDaid and N. Hurley, “Detecting highly overlapping com- interactome network,” Science, vol. 322, no. 5898, pp. 104–110, 2008.
munities with model-based overlapping seed expansion,” in Proc. [42] K. Tarassov, V. Messier, C. R. Landry, S. Radinovic, M. M. S.
Int. Conf. Adv. Social Netw. Anal. Mining, 2010, pp. 112–119. Molina, I. Shames, Y. Malitskaya, J. Vogel, H. Bussey, and S. W.
Michnick, “An in vivo map of the yeast protein interactome,” Science, vol. 320, no. 5882, pp. 1465–1470, 2008.
[19] J. Yang and J. Leskovec, “Overlapping community detection at scale: A nonnegative matrix factorization approach,” in Proc. 6th ACM Int. Conf. Web Search Data Mining, 2013, pp. 587–596.
[20] R. Narayanam and Y. Narahari, “A game theory inspired, decentralized, local information based algorithm for community detection in social graphs,” in Proc. 21st Int. Conf. Pattern Recognition, 2012, pp. 1072–1075.
[21] S. Gregory, “Finding overlapping communities in networks by label propagation,” New J. Phys., vol. 12, no. 10, p. 103018, 2010.
[22] W. Chen, Z. Liu, X. Sun, and Y. Wang, “A Game-theoretic framework to identify overlapping communities in social networks,” Data Mining Knowl. Discovery, vol. 21, no. 2, pp. 224–240, 2010.
[23] J. Baumes, M. Goldberg, and M. Magdon-Ismail, “Efficient identification of overlapping communities,” in Proc. IEEE Int. Conf. Intell. Security Informat., 2005, pp. 27–36.
[24] J. Xie and B. K. Szymanski, “Community detection using a neighborhood strength driven label propagation algorithm,” in Proc. IEEE Network Sci. Workshop, 2011, pp. 188–195.
[25] J. Xie and B. K. Szymanski, “Towards linear time overlapping community detection in social networks,” in Proc. 16th Pacific-Asia Conf. Adv. Knowl. Discovery Data Mining, 2012, pp. 25–36.
[26] J. Xie and B. K. Szymanski, “LabelRank: A stabilized label propagation algorithm for community detection in networks,” in Proc. IEEE 2nd Netw. Sci. Workshop, 2013, pp. 138–143.
[27] U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Phys. Rev. E, vol. 76, no. 3, p. 036106, 2007.
[28] M. Coscia, G. Rossetti, F. Giannotti, and D. Pedreschi, “Demon: A local-first discovery method for overlapping communities,” in Proc. 18th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2012, pp. 615–623.
[29] M. Magdon-Ismail and J. Purnell, “SSDE-cluster: Fast overlapping clustering of networks using sampled spectral distance embedding and gmms,” in Proc. IEEE 3rd Int. Conf. Privacy, Security, Risk Trust IEEE 3rd Int. Conf. Social Comput., 2011, pp. 756–759.
[30] S. Tsironis, M. Sozio, M. Vazirgiannis, and L.-E. Poltechnique, “Accurate spectral clustering for community detection in Mapreduce,” Frontiers of Network Analysis: Methods, Models, and Applications. Lake Tahoe, NIPS Workshop, 2013.
[31] H. Alvari, S. Hashemi, and A. Hamzeh, “Discovering overlapping communities in social networks: A novel Game-theoretic approach,” AI Commun., vol. 26, no. 2, pp. 161–177, 2013.
[32] F. Bonchi, A. Gionis, and A. Ukkonen, “Overlapping correlation clustering,” in Proc. IEEE 11th Int. Conf. Data Mining, 2011, pp. 51–60.
[33] S. B. Seidman, “Network structure and minimum degree,” Social Netw., vol. 5, no. 3, pp. 269–287, 1983.
[34] J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” in Proc. ACM SIGKDD Workshop Mining Data Semantics, 2012, p. 3.
[35] M. Bastian, S. Heymann, and M. Jacomy, “Gephi: An open source software for exploring and manipulating networks,” in Proc. Int. AAAI Conf. Weblogs Social Media, 2009, vol. 8, pp. 361–362.
[36] J. Baumes, M. K. Goldberg, M. S. Krishnamoorthy, M. Magdon-Ismail, and N. Preston, “Finding communities by clustering a graph into overlapping subgraphs,” in Proc. IADIS Int. Conf. Appl. Comput., 2005, vol. 5, pp. 97–104.
[37] D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson, “The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations,” Behavioral Ecol. Sociobiol., vol. 54, no. 4, pp. 396–405, 2003.
[43] J. P. Miller, R. S. Lo, A. Ben-Hur, C. Desmarais, I. Stagljar, W. S. Noble, and S. Fields, “Large-scale identification of yeast integral membrane protein interactions,” Proc. Nat. Acad. Sci. United States of America, vol. 102, no. 34, pp. 12123–12128, 2005.
[44] S. Pu, J. Wong, B. Turner, E. Cho, and S. J. Wodak, “Up-to-date catalogues of yeast protein complexes,” Nucleic Acids Res., vol. 37, no. 3, pp. 825–831, 2009.
[45] S. Kikugawa, K. Nishikata, K. Murakami, Y. Sato, M. Suzuki, M. Altaf-Ul-Amin, S. Kanaya, and T. Imanishi, “PCDq: Human protein complex database with quality index which summarizes different levels of evidences of protein complexes predicted from H-invitational Protein-protein interactions integrative dataset,” BMC Syst. Biol., vol. 6, no. Suppl 2, p. S7, 2012.

Sanghamitra Bandyopadhyay received the PhD degree in computer science from ISI. She is currently a professor at the Indian Statistical Institute, Kolkata, India. She has authored/co-authored more than 250 technical articles and published five authored and edited books. Her research interests include computational biology and bioinformatics, soft and evolutionary computation, pattern recognition, and data mining. She is a fellow of NASI and INAE, India, and received several prestigious awards including the Humboldt Fellowship from Germany, ICTP Senior Associate, Trieste, Italy, and the Shanti Swarup Bhatnagar Prize in Engineering Science. She is a senior member of the IEEE.

Garisha Chowdhary received the BTech and MTech degrees in computer science from Biju Pattnaik University of Technology and Jadavpur University, respectively. Currently, she is a senior research fellow in the Machine Intelligence Unit of the Indian Statistical Institute, Kolkata, India. Her research interests include machine learning and complex network analysis.

Debarka Sengupta received the BTech and PhD degrees in computer science and engineering from West Bengal University of Technology and Jadavpur University, respectively. He was a research fellow in the Machine Intelligence Unit of the Indian Statistical Institute from March 2009 to March 2013. Currently, he is a postdoctoral fellow in the Computational and Systems Biology Group, Genome Institute of Singapore. His research interests include computational biology, functional genomics, and machine learning.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
