Incremental Community Detection in Distributed Dynamic Graph

Tariq Abughofa, Ahmed A. Harby, Haruna Isah, Farhana Zulkernine
School of Computing, Queen’s University, Kingston, ON, Canada
abughofa@queensu.ca, Ahmed.harby@queensu.ca, h.isah@unb.ca, Farhana.zulkernine@queensu.ca

Abstract— Community detection is an important research topic in graph analytics that has a wide range of applications. A variety of static community detection algorithms and quality metrics were developed in the past few years. However, most real-world graphs are not static and often change over time. In the case of streaming data, communities in the associated graph need to be updated either continuously or whenever new data streams are added to the graph, which poses a much greater challenge in devising good community detection algorithms for maintaining dynamic graphs over streaming data. In this paper, we propose an incremental community detection algorithm for maintaining a dynamic graph over streaming data. The contributions of this study include (a) the implementation of a Distributed Weighted Community Clustering (DWCC) algorithm, (b) the design and implementation of a novel Incremental Distributed Weighted Community Clustering (IDWCC) algorithm, and (c) an experimental study to compare the performance of our IDWCC algorithm with the DWCC algorithm. We validate the functionality and efficiency of our framework in processing streaming data and performing large in-memory distributed dynamic graph analytics. The results demonstrate that our IDWCC algorithm performs up to three times faster than the DWCC algorithm for a similar accuracy.

Keywords—Distributed graph processing, dynamic graphs, streaming data, weighted community clustering

I. INTRODUCTION

Distributed processing of large-scale graphs has gained considerable attention in the last decade [1]. This is mainly due to the (i) unprecedented increase in the size of graph data such as the Web-based social media networks, (ii) evolution of systems for processing massive graph data such as Pregel [2] and GraphX [3], and (iii) huge increase in the number of applications that utilize graph data such as traffic and social network analysis [4]. According to Heidari et al. [5], a typical graph processing system executes graph algorithms such as graph traversal over a graph dataset across five different logical phases, which include reading graph data, pre-processing, partitioning, computation, and error handling. Regardless of the size and type of framework or algorithms used, Heidari et al. reported [5] that large-scale graph data can be processed in offline, online, or real-time mode. Offline processing is the popular mode and is achieved by loading the graph dataset in memory from disk storage and processing it. Online processing allows users to update, maintain, and re-process the graph data automatically with new values either periodically or based on user-defined events. Real-time processing is similar to online processing except it also enables instant incremental updates to be made to the graph data. Thus, it requires the computation to be done immediately after changes happen to the data, and the updated analytics results are returned with very short delays.

Several graph processing frameworks utilize static partitioning, which means that they consider the graph and the processing environment to remain unchanged [1][2][5]. However, most real-world graphs are dynamic as they change over time with new data producing new vertices and edges that need to be merged into existing graphs. The changes in dynamic graphs are further complicated by the need for real-time guarantees for applications such as real-time disease spreading and anomaly detection. Traditional static graph analytics approaches face a major limitation in meeting this demand [6]. Dynamic graph scenarios require novel online or real-time graph update and analytics algorithms since the traditional offline graph analytics approaches require first the whole graph to be updated with the new data, and then analytics algorithms to be applied to the whole graph, which is extremely computation-intensive and hence impractical.

Dynamic graph [7] updates can be node-grained or edge-grained. In node-grained dynamic graphs, new nodes or vertices are simultaneously added to the graph with all their incident edges. An example of such graphs is a network of scientific papers and their references. Once a paper is published, all the papers that it references are known as well and no new references (connections) are added later. In edge-grained dynamic graphs, new edges are added or removed for already existing vertices. Social networks are a good example of these graphs: people add new friends and "un-friend" old ones all the time. Thus, the assumption of knowing all the connections of a person when we add them to the graph is not viable. In these networks, the sequence of adding new edges is important and influences the evolution of the graph structure.

Recently, the problem of distributed processing of large dynamic graphs has gained considerable attention [7]. Several traditional graph operations such as the k-core decomposition [8-11], partitioning [12], and maximal clique computation [13] have been extended to support dynamic graphs. However, many graph processing frameworks do not support several graph operations in the context of dynamic graphs [4]. One such operation is community detection in graphs, which is the process of identifying groups of nodes that are highly connected among themselves and sparsely connected to the rest of the graph [14]. Such groups are referred to in the literature as "communities" and occur in various types of graphs. Several research studies on networks modeling real-world phenomena have shown that the networks are organized according to community structure and their structures evolve with time [15]. Therefore, community detection within large-

XXX-X-XXXX-XXXX-X/XX/$XX.00 ©2021 IEEE


scale graphs has become an important research problem [7][16][17]. It helps to discover new structural properties about the graph that cannot be found otherwise, such as identification of the highly influential nodes known as community centroids [18]. It is also used for targeted marketing [14], distributed graph management [9][10], uncovering tightly connected entities in a graph [7], and finding major sub-graphs indicating special relationships that are generally obscured by the complex structure of the original graph [19].

Metrics for shaping communities often follow two approaches, either by maximizing the internal density of the communities by including heavily connected nodes into the community, or by reducing inter-community connectivity by removing weak connections among different communities [20]. Most of the existing community detection algorithms involve heavy computation and hence are time-consuming [21]. As the graphs being operated on become larger, the ability to process them in memory on a single machine becomes infeasible due to both time and memory constraints [14][17]. In dynamic graphs, the problem becomes more complex because the data keeps changing and the communities need to be adjusted by reapplying the solution to the whole graph every time the data changes [15]. With streaming data, communities need to be updated continuously or whenever a new micro-batch (too large a batch size will lead to poor generalization, so micro-batches are needed to provide some basic intuition) of streaming data gets added to the graph. This poses a much greater challenge in devising a good community detection algorithm for dynamic graphs over streaming data.

In this paper, we propose an incremental community detection algorithm as a solution to the community detection problem for large dynamic graphs over streaming data. It gradually propagates new incoming data in the graph and adjusts the existing communities. The contributions of this study are as follows.
• We implemented the Distributed Weighted Community Clustering (DWCC) algorithm using Apache Spark [22] and GraphX [3][23] in Scala on a multi-cluster environment. DWCC was proposed by Saltz et al. [14], who implemented it on the Pregel platform for static data.
• We conducted an extensive performance study of the DWCC algorithm to identify the costly operations to optimize the processing time and memory consumption.
• Based on the results of the above study, we developed a novel Incremental Distributed WCC (IDWCC) algorithm for undirected and unweighted node-grained dynamic distributed graphs. IDWCC applies the Weighted Community Clustering (WCC) optimization technique to add new vertices from the streaming data to the most suitable communities in an existing distributed graph. We implemented the algorithm in Scala using GraphX to work with Spark Streaming. To the best of our knowledge, this is the first node-grained incremental distributed community detection algorithm.
• We experimentally validated both DWCC and IDWCC algorithms and compared their performances using real-world datasets with ground-truth communities. The evaluation addresses the performance, quality, and applicability aspects.

The remainder of this paper is organized as follows. We outline the existing solutions for the community detection problem and explain the WCC metric in Section II. In Section III we describe our implementation of the DWCC algorithm using Spark and GraphX and then propose the IDWCC algorithm. Next, we present a complexity analysis and experimental evaluation of the two algorithms in Section IV. Section V presents a case study of WCC optimization in dynamic graphs for product recommendations. Finally, we conclude this study and outline further improvements in Section VI.

II. BACKGROUND & TERMINOLOGY

Community detection is a widely studied problem [24]. It is one of the most relevant topics in the field of graph data processing due to its importance in many fields such as biology, social networks, or network traffic analysis [20]. In this section, we present a brief literature review of some of the work in this area and explain the key concepts behind the WCC optimization technique.

A. Literature Review

A variety of community detection algorithms have been developed based on different graph update strategies during the past few years. Label Propagation [25][26] is one of the most popular community detection methods, which is implemented in GraphX [3]. This algorithm chooses the community of the current node using the labels of its neighboring nodes. Initially, each node is assigned a unique label, and at every iteration of the algorithm, each node adopts the label that most of its neighbors have. As the labels propagate through the network, densely connected groups of nodes form a consensus on their labels. At the end of the algorithm, nodes having the same labels are grouped as communities. Another popular community detection method based on random walks is Infomap [27]. Finding community structure in networks using Infomap is equivalent to solving an information flow problem. Rosvall and Bergstrom [27] exemplified this by making a map of science, based on how information flows among scientific journals through citations. A detailed survey and guided tour through the main aspects of community detection methods and their applications have been outlined by Harenberg et al. [28] and Fortunato et al. [29] respectively.

Many centralized community-detection methods have been proposed in the literature; however, recent dramatic growth in real-world network size requires community detection to be performed in a distributed environment [30]. Apart from the huge sizes, modern networks are characterized by high dynamics, which challenges the efficiency of community detection algorithms [31]. These challenges have led to several research solutions on distributed community detection in both static and dynamic graphs. Hung et al. [32] modeled community detection on edge-labeled graphs as a tensor decomposition problem and proposed a fast, accurate, and scalable distributed system for community detection in large
static graphs based on the Spark framework. Clementi et al. [30] introduced a dynamic community detection framework that relies on the Label Propagation algorithm [26][27]. However, the framework was evaluated using randomly generated networks rather than real-world graphs.

Recently, Jian et al. [31] designed an algorithm based on the Label Propagation method [26][27] that can incrementally detect communities over distributed and dynamic graphs. According to Jian et al., besides detecting high-quality communities, the algorithm can incrementally update the detected communities after a batch of edge insertion and deletion operations. The algorithm was implemented using the MapReduce model. The evaluation results on real-world datasets show that the algorithm can detect communities incrementally with a running time that is sublinear in the number of changed edges. What is not clear from the evaluation, however, is how the quality of the communities was measured for the real-world datasets.

Several metrics such as modularity and conductance have been proposed as indicators of the quality of a community in a graph [19]. Modularity is considered the most prominent quality measure for community detection [24][33]. It prioritizes communities based on their internal edge density. One of the most popular algorithms based on modularity optimization is the Louvain algorithm, which is presented in detail by Blondel et al. [33]. This algorithm is a greedy optimization that can be used for weighted graphs. The algorithm starts with each vertex as its own community. Then it progresses in an iterative manner where each iteration consists of two phases. The first phase calculates the gain in modularity (see Eq. 1) obtained by adding each vertex to each neighboring community, and moves the vertex to the community that produces the highest gain. In the second phase, the communities found in the first phase are aggregated into the vertices of a new, smaller graph.

(1)

This gain in modularity ΔQ when a node i is moved into a community C is calculated using Eq. 1. Σ_in is the sum of the weights of the links inside C, Σ_tot is the sum of the weights of the links incident to nodes in C, k_i is the sum of the weights of the links incident to node i, k_i,in is the sum of the weights of the links from i to nodes in C, and m is the sum of the weights of all the links in the network.

More recently another metric called the Weighted Community Clustering (WCC) was introduced by Prat-Pérez et al. [20] to evaluate the quality of communities based on their density in terms of triangles. Unlike Louvain, WCC optimization does not consider edge weights in the computations. The WCC metric ensures that communities are cohesive, structured, and well defined. It is used in the Scalable Community Detection (SCD) algorithm [17] for detecting communities in undirected unweighted graphs of unprecedented size in a short execution time. A distributed version of the algorithm based on the vertex-centric paradigm was developed later by Saltz et al. [14] on the Pregel platform [2]. This approach performs well on static graphs of over one billion edges. However, most real-world graphs are not static but often change over time. The changes are usually represented as streaming networks where data need to be added to a network incrementally in real-time while updating the graph community structure [7]. Therefore, a solution is needed to add new data and update communities in distributed dynamic graphs in a multi-cluster environment for streaming data.

For incremental community detection, many modularity-based solutions have been proposed but very few solutions exist for node-grained graphs. Shang et al. [36] introduced an algorithm that depends on the Louvain algorithm for detecting an initial community structure as well as the communities for new vertices. Pan et al. [37] developed a method for edge-grained graphs. The problem with this method is that it assumes the edges are added in a certain order. As a result, it cannot properly handle node-grained graphs, where the edges are added simultaneously, and gives poor performance [7]. A recent method called the Node-Grained Incremental (NGI) community detection based on modularity optimization was proposed by Yin et al. [7] for node-grained dynamic graphs. However, it was implemented only for centralized, not distributed, processing.

In this paper, we propose an incremental community detection algorithm for large distributed dynamic graphs on a multi-cluster environment based on the WCC optimization technique. The WCC optimization algorithm is explained in detail by Prat-Pérez et al. [14][17]. In this section, we summarize the fundamental concepts of the WCC metric, its applications in community detection in large graphs, and the processing steps, namely pre-processing and partitioning.

B. WCC

Prat-Pérez et al. [20][19] first introduced the metric called Weighted Community Clustering (WCC) to evaluate the quality of community partitioning based on the distribution of triangles in the graph. The WCC optimization approach constructs triangles of vertices in the graph to measure the density of vertices. WCC optimization has gained a lot of attention due to its lower computational complexity, as it does not consider edge weights in the computations, and because it demonstrates superior results over other commonly used metrics like modularity [17].

Given a graph G(V, E) composed of a set of vertices V and a set of edges E, t(x, V) denotes the number of triangles that pass through the vertex x and link it to neighboring vertices in the set V (the triangle count for x), and vt(x, V) denotes the number of neighboring vertices that close at least one triangle with x. Both values are defined for each vertex in the graph. Given a community C in graph G, t(x, C) and vt(x, C) are the same as the previous measurements considering the vertices inside C only. Based on these four measurements, the WCC value for a vertex x in a community C can be calculated using Eq. 2 as explained by Prat-Pérez et al. [17].

(2)

The WCC value for the whole graph is calculated from the average of the WCC of all the vertices in all the communities in the graph as described in Eq. 3.
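To make the WCC definitions above concrete, the following is a minimal single-machine Python sketch. Python is used here only for illustration; this is not the Scala/GraphX implementation described in this paper. The sketch assumes the per-vertex form commonly attributed to Prat-Pérez et al., WCC(x, C) = t(x, C)/t(x, V) · vt(x, V)/(|C \ {x}| + vt(x, V \ C)), with WCC(x, C) = 0 when t(x, V) = 0, and averages it over all vertices to score a partition. The function names, the adjacency-dictionary representation, and the reading of vt(x, S) as "neighbors of x in S that share at least one common neighbor with x" are assumptions of this sketch, not details confirmed by the text.

```python
from itertools import combinations


def triangles(adj, x, allowed):
    """t(x, S): triangles through x whose other two vertices both lie in `allowed`."""
    nbrs = [u for u in adj[x] if u in allowed]
    return sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])


def triangle_vertices(adj, x, allowed):
    """vt(x, S), as assumed here: neighbors of x in `allowed` that share
    at least one common neighbor with x anywhere in the graph."""
    return sum(1 for u in adj[x] if u in allowed and adj[u] & adj[x])


def wcc_vertex(adj, x, community):
    """Assumed per-vertex WCC of x with respect to `community` (a set containing x)."""
    vertices = set(adj)
    t_all = triangles(adj, x, vertices)
    if t_all == 0:
        return 0.0  # x closes no triangle at all
    t_in = triangles(adj, x, community)
    vt_all = triangle_vertices(adj, x, vertices)
    vt_out = triangle_vertices(adj, x, vertices - community)
    return (t_in / t_all) * (vt_all / (len(community - {x}) + vt_out))


def wcc_partition(adj, partition):
    """Whole-graph WCC: average of the per-vertex values over all communities."""
    total = sum(wcc_vertex(adj, x, c) for c in partition for x in c)
    return total / len(adj)


# Hypothetical toy graph: two triangles joined by the bridge edge 3-4.
# Adjacency values must be sets and the graph must have no self-loops.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(wcc_partition(toy, [{1, 2, 3}, {4, 5, 6}]))
print(wcc_partition(toy, [{1, 2, 3, 4, 5, 6}]))
```

On this toy graph, splitting at the bridge scores 1.0, while placing all six vertices in a single community scores roughly 0.4. This matches the intuition that WCC rewards triangle-dense groups and penalizes communities that span a bridge.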
(3)

Prat-Pérez et al. [20] introduced a set of basic properties that any community cohesion metric for social networks should fulfill. These properties include (i) clustering coefficient, defined as the probability that two neighbors of a given individual are also neighbors themselves [24], (ii) the dynamics of community formation, (iii) presence of a bridge, an edge which, if removed from the graph, creates two separate connected components, (iv) presence of a cut vertex, a node whose removal splits the graph into two or more connected components, and (v) presence of a clique, a vertex connected to another vertex with an edge which forms a maximal clique. The authors further proved that WCC is a good candidate to distinguish communities in social networks. In terms of the clustering coefficient, they discovered that WCC reacts to the internal structure of the communities, and in particular, to the presence of triangles. Regarding the appearance of a new node in a community, WCC was found to have a better value for a node with fewer connections if the node was included in the community. It has, however, a better value for a node with many connections if the node was kept outside the community. They also discovered that WCC was resistant to bridges, and an optimal community in social networks cannot contain a bridge. Finally, WCC was found to be able to separate communities into two cliques.

As stated before, several metrics such as modularity and conductance have been proposed as indicators of the quality of a community in a graph. However, we chose the WCC metric and its optimization method to be the basis of our distributed dynamic graph community detection algorithm because of its performance, increasing popularity in the graph processing community, and potential in ensuring that communities are cohesive, structured, and well-defined [20]. WCC provides a good trade-off between performance and quality [14][16][17]. In addition, the optimization process of WCC can be distributed easily; the calculations of the best movement and the WCC value for each vertex can be done locally, and thus the computations can be executed in parallel. To the best of our knowledge, it is the most efficient solution for community detection in large-scale graphs.

III. SYSTEM DESIGN

WCC is used in the SCD algorithm [17] for community detection in centralized graphs. A distributed version of the algorithm exists for static distributed graphs, which was implemented by Saltz et al. [14] in Java for the Graph processing engine. In this paper, we propose an Incremental Distributed Weighted Community Clustering (IDWCC) algorithm for detecting communities incrementally in a distributed dynamic graph that is continuously updated from streaming data. Communities help in clustering very large graph data on a distributed infrastructure for better management and fast processing of analytical queries.

We validate the algorithm using our existing multi-level streaming data processing framework. The framework uses Spark, GraphX, and GraphFrames to create and maintain a dynamic distributed graph. Since the implementation of the community detection algorithm based on WCC using these tools did not exist, we implemented one using Scala, GraphX, and GraphFrames for distributed processing on Spark.

We describe the three basic steps of the WCC optimization algorithm. Then we illustrate the Spark implementation of the Distributed WCC (DWCC) algorithm for a static distributed graph. Finally, we explain and demonstrate our IDWCC for detecting communities incrementally in dynamic distributed graphs.

A. Partitioning

In this step, we compute an initial partition of the graph. First, the vertices are sorted by their clustering coefficients in descending order. Then the vertices are iterated over, and for each vertex x not previously visited, we create a new community C that contains x and all its neighbors that were not visited before. The algorithm requires the following conditions to be met in an initial partition.
• Every community should contain a single center vertex and a set of border vertices connected to the center vertex.
• The center vertex should be the vertex with the highest clustering coefficient in the community.
• Given a center vertex x and a border vertex y in a community, the clustering coefficient of x must be higher than the clustering coefficient of any neighbor z of y that is the center of its own community.

In the final step, the initial partition is improved iteratively using a hill-climbing method. The execution stops when no further improvements to the global WCC can be achieved, or when a predefined number of iterations do not provide any significant improvement as specified by a threshold. Next, we will discuss our distributed implementation of WCC optimization for GraphX. The proposed IDWCC algorithm is explained after that.

The Pregel API in GraphX helps in executing the partitioning of a distributed graph while respecting all the initial partitioning conditions. It performs an iterative execution process in which dynamic vertices keep broadcasting changes in their communities to their neighbors while receivers update their communities depending on the change notifications they receive from the neighbors until no further adjustments are needed.

Computing the improvement of the global WCC using Eq. 3 requires the computation of the internal triangles of each community of the graph, which makes it inefficient to compute all possible movements of each vertex. Prat-Pérez et al. [17] present a heuristic for calculating the WCC improvement caused by moving a single vertex to a new community using the statistics about the vertex and its neighboring communities. The heuristic, presented in Eq. 4, gives an approximate value and does not require the computation of the internal triangles of each community. Instead, it depends on calculating the following statistics: d_in: the number of edges that connect the vertex v to the vertices inside the community C where it is moving, d_out: the number of edges that connect v to the vertices outside C, b: the number of edges that are in the boundary of C, δ: the edge density of C, r: the number of vertices in C, and w: the clustering coefficient of the graph. We use the same heuristic due to its efficiency. Since this
computation occurs independently within each vertex, all vertices may perform their movements simultaneously, meaning that this part of the algorithm can be distributed effectively on multiple compute nodes to be executed in parallel to improve the performance of the algorithm.

(4)

Prat-Pérez et al. [17] described Θ1, Θ2, and Θ3 as the WCC improvements of the vertices in C that are connected to x, the vertices in C that are not connected to x, and the vertices v respectively, where v represents the set of vertices to be added to community C.

B. Optimization

We implement DWCC optimization for Apache Spark using its distributed in-memory graph structure, GraphX. The implementation is somewhat influenced by the existing graph processing libraries in Spark and the properties of the GraphX structure. We calculated the execution time for each small step of DWCC as shown in Figure 2. Based on these calculations, we developed an algorithm that works in three phases. First, it merges the batch with the maintained evolving graph, updates the vertex statistics, and optimizes the graph. Second, it assigns the new vertices to initial communities. Finally, it optimizes the WCC metric to generate better communities.

As a first step, a new graph G* = (V*, E*) is generated from the newly arrived batch δ*. The produced graph is then merged with the full graph to produce G_{t+1} = (V_t ∪ V*, E_t ∪ E*) as demonstrated in Fig. 1.

Figure 1. Merging G* with G_t.

We identify a set of vertices which we call the border vertices. These vertices exist in both G_t and G* and are endpoints of the edges that connect the newly arriving batch with the old graph. Let us denote this set as V_b = V_t ∩ V*. We refer to the rest of the vertices in the new graph, which are not part of the border vertices, as the inner vertices V_n = V* \ V_b. The problem with the border vertices is that they have already been assigned to communities in G_t. But since they have new connections, they are likely to belong to different communities. We isolate each of these vertices in its own community in the full graph G_{t+1}. The merge phase also calculates t(x, V_{t+1}) and vt(x, V_{t+1}) for each vertex x in G_{t+1}. To perform the calculations efficiently, we recognize three possible situations. (a) Statistics of the old vertices stay the same as they were for the previous micro-batch t. (b) The inner vertices need to calculate the statistics. (c) The border vertices have new connections and thus might belong to new triangles and need to update their statistics. The definition of the stream batches, which is presented in Eq. 7, is important for updating the statistics of the border vertices as it assures that the graph holds the following conditions. Let us denote the set of triangles that pass through a vertex x in graph G as T_{x,G} and the set of vertices that form at least one triangle with x as VT_{x,G}. Then the following holds true.

where A is the set of vertices that are neighbors of x and form triangles with it in both G_t and G*. Based on these statements, the statistics for the border vertices are calculated as follows.

(5)

(6)

Using these two measurements we can compute the local clustering coefficient for each vertex and the global clustering coefficient w, which is needed to calculate WCC′_I. At the end of this phase, we optimize the graph in the same way as it is done for DWCC to reduce the memory consumption and the processing required in the succeeding phases, which is a relatively cheap operation (see Fig. 2).

Algorithm 1: Partitioning
Let P be the set of communities generated at the last micro-batch;
S ← sortByClusteringCoefficients(V_{t+1});
for all v in S do
    if notVisited(v) then
        markAsVisited(v);
        if v ∈ V* then
            C ← {v};
        else
            C ← P.getCommunity(v);
        for all u in neighbors(v) do
            if notVisited(u) then
                markAsVisited(u);
                if u ∈ V* then
                    C.add(u);
        P.add(C)

We choose communities for the vertices that appear in the new batch. These vertices include the inner vertices V_n, which have no communities assigned to them yet, and the border vertices V_b, which were removed from their communities during the previous phase. We use the same algorithm as used in DWCC (see Algorithm 1), but we limit it to the above-mentioned sets of vertices only. Hence, every vertex in the new batch chooses the vertex with the highest clustering coefficient that does not belong to a community of another vertex as its community center.

Algorithm 2 follows the same steps as its counterpart Algorithm 1, the DWCC algorithm. However, it includes two
optimizations since it is the most expensive processing phase in terms of computations:
• Calculation of the community movements is still done on all the vertices, but we drop calculating the value of WCC in each iteration.
• We use a fixed number of iterations rather than using more iterations when good WCC improvement appears.

This might result in missing community movements that can have a good impact on WCC. However, as we process subsequent micro-batches, all the vertices start changing their communities again and any previous changes that were missed are subsequently recovered. This way, the degradation of WCC over time is avoided.

Algorithm 2: Partitioning optimization
Let P be the initial partition;
iteration ← 1;
repeat
    M ← ∅;
    for all v in V do
        M.add(bestMovement(v, P));
    P ← applyMovements(M, P);
    iteration ← iteration + 1;
until iteration > maxIterations;

C. Preprocessing

This phase aims to calculate the t(x, V) and vt(x, V) values for each vertex of the graph. After these measurements are calculated, a graph optimization, described in the Optimization section, is performed by removing edges that do not close any triangles.

The Triangle Count algorithm in GraphX¹ requires the graph to be canonical, which means that the graph should ensure the following:
• Free from self-edges (edges with the same vertex as a source and a destination).
• All its edges are oriented (the source vertex has a greater number of directly connected triangles than the destination vertex based on a pre-defined comparison method).
• Has no duplicate edges.

The cleaning is done using the subgraph API provided by GraphX. We keep the calculated statistics, namely the triangle count and the vertex degree, for later use. We took advantage of the fact that GraphX supports property graphs and hence we can save these statistics as properties of the graph vertices.

D. Implementation

• Step 2: Graph Restructuring (preprocessing): Move vertices and remove redundant edges as a part of graph restructuring, optimizing, and cleaning.
• Step 3: Iterative Partitioning (initial partitioning): Use broadcasting protocol for dynamic vertices to add/modify communities on distributed graphs.
• Step 4: WCC Optimization (partition refinement): Compute the WCC metric or improvement for the graph.

We investigate and calculate the execution time of each of the small steps of DWCC and propose changes to the DWCC algorithm to define the IDWCC for node-grained dynamic distributed graphs. The IDWCC algorithm has many steps similar to those of DWCC. However, it has optimizations that avoid repeated calculations and consequently reduce the memory, data movement, and computational costs without sacrificing the quality of the result.

In this paper, we limit our scope of interest to dynamic graphs that satisfy two properties. First, the graph progresses over a window of time in which a small batch of vertices and their edges are added. These edges connect the new vertices to each other and to the full graph generated from the last micro-batch. Second, the edges are equal in value, i.e., the edges are not weighted or directed. We denote the graph from the previous iteration as G_t = (V_t, E_t) where V_t and E_t are its sets of vertices and edges at time t respectively. Let us refer to the vertices in the newly arriving batch as V* and the edges as E*. We define a micro-batch from the stream of a node-grained dynamic graph d as follows:

(7)

Based on the cost of each step and the above-mentioned graph properties, we developed the IDWCC algorithm that works in three phases. First, it merges a micro-batch of the streaming data with the maintained evolving graph, updates the vertex statistics, and optimizes the graph. Second, it assigns the new vertices to the initial communities. Finally, it optimizes the WCC metric to generate better communities.

Table 1. Properties of the test graphs

Data Source    Vertices     Edges
Amazon         334,863      925,872
DBLP           317,080      1,049,866
YouTube        1,134,890    2,987,624

IV. VALIDATION AND RESULTS
A. Experimental Setup
We modify certain steps of the DWCC algorithm which A distributed multi-cluster environment was used for our
incur high computational cost to make the algorithm more experiments applying Spark, Spark Streaming, and GraphX to
scalable so that we can apply it to distributed dynamic graphs. implement the dynamic distributed graph. Eight identical
The key steps of the DWCC algorithm are as follows. machines were used, each having 8 cores 2.10 GHz Intel Xeon
• Step 1: Vertex Statistics (preprocessing): Count the 64-bit CPU, 30 GB of RAM, and 300 GB of disk space to host
triangles of vertices to identify communities and keep the and perform computations on the distributed graph data
statistics of triangle count and the degrees of vertices. structure. We installed Apache Spark v2.2.0 on all the
1
https://spark.apache.org/docs/latest/graphx-programming-
guide.html#triangle-counting
machines. Both DWCC and IDWCC algorithms were implemented using Scala 2.11 (https://github.com/TariqAbughofa/incremental_distributed_wcc).

B. Data Source
We used a set of real-life undirected graphs that have ground-truth communities. We took these graphs from the SNAP data repository (http://snap.stanford.edu/). The selected graphs and some statistics about them are presented in Table 1. Using multiple graphs for the experiments allowed us to compare the results for different graph sizes. In addition, it let us experiment easily with different sizes of micro-batches. After loading a graph from the experimental sets, we ensured that it was clean and free from duplicates and self-edges. We also sorted the edges by source vertex, which made the graph canonical from the start and ready for processing.

C. Complexity Analysis
We compare the complexity of a sequential implementation of our incremental algorithm to its static counterpart when both are applied to detect communities in a dynamic graph. Let n be the number of vertices and m the number of edges in the graph. We assume that the average degree d of the graph at any point of time t is d = m/n and that real graphs (graphs found in the real world with millions of vertices and edges) have a quasi-linear relation between vertices and edges, O(m) = O(n·log n). Under these assumptions, the complexity of the static WCC optimization algorithm, as calculated in Prat-Pérez et al. [17], is O(m·log n).
Now we calculate the complexity of a centralized version of the sequential incremental WCC optimization algorithm. For the first phase, we do not consider the complexity of merging the graphs since the operation is necessary for any dynamic graph. That leaves the cost of computing the triangles and degrees for the vertices in the new batch as follows:

O(m*·d + m*) = O(m*·log n_{t+1})

The second phase requires sorting the vertices based on the local clustering coefficient. However, the vertices are already sorted from processing a previous micro-batch. Hence, the cost is only for organizing the new vertices in the right order, which requires sorting the new vertices and then executing a full scan of the vertices in the worst case: O(n* + n_t) = O(n_t). For the third phase, let α be the number of iterations required to find the best possible communities, which is a constant. In each iteration, we compute in the worst case d+1 movements for each vertex of type WCC'_I, which has a cost of O(1). That makes the total cost as follows:

O(n·(d+1)) = O(m)

Next, we apply all the movements, whose number is equal to the number of vertices, so it costs O(n_{t+1}). We also need to update, for each iteration in the second phase, the statistics w, c_out, d_in, and d_out for each vertex and community, which has a cost of O(m_{t+1}). We sum all the costs to get the full cost of this phase: O(α·(m_{t+1} + n_{t+1} + m_{t+1})) = O(m_{t+1}). The full cost of the algorithm is the sum of the costs of the three phases: O(m*·log n_{t+1} + n_t + m_{t+1}). Since m* << n_{t+1}, then m*·log n_{t+1} << n_{t+1}·log n_{t+1} < m_{t+1}. The cost can be simplified to become O(m_{t+1}). This cost is much smaller than O(m_{t+1}·log n_{t+1}), the cost of applying the static algorithm on the whole graph during the t-th iteration.

D. Experimental Details and Results
We conducted the following three sets of experiments.
• Experiment a: We measured the efficiency of the optimizations used in IDWCC by comparing the execution time of each step of IDWCC with its counterpart in DWCC.
• Experiment b: We compared the quality of the results of DWCC, IDWCC, and SCD.
• Experiment c: Finally, we compared the quality of the results and the execution time of the DWCC and IDWCC algorithms for a dynamic distributed graph by executing the algorithms while updating the graph in real time from a data stream. We executed 5 iterations of the DWCC and IDWCC algorithms based on the recommendations of Prat-Pérez et al. [19], the authors of the original WCC optimization algorithm.
We experimented with different graph sizes to compare the performance of the algorithms for different sizes of graphs and micro-batches of the streaming data. Initially, we constructed the graph from a bulk of static data already downloaded from the selected data sources. For the bulk data, the first two columns in Table 2 show the number of vertices and edges added to the graph respectively. Next, we appended new data to the existing distributed graph from the streaming data sources, one micro-batch at a time. The number of edges added from the data streams and the corresponding number of micro-batches used for the updates are shown in columns 3 and 4 of Table 2 respectively. We used fewer micro-batches for the smaller bulk graphs and more micro-batches for the larger bulk graphs to limit the resources used in each iteration.

Table 2. Initial sizes and updated sizes of the test graphs
Graph         Bulk Vertices   Bulk Edges    Stream Edges   # of Micro-batches
Amazon        258,464         576,718       349,154        10
DBLP          253,119         852,754       197,112        10
YouTube       903,959         2,666,836     320,788        10
LiveJournal   768,792         13,997,342    20,683,847     30

Experiment a shows the benefits of the optimization techniques we applied to IDWCC, as shown in the algorithm listing. For both DWCC and IDWCC, we measured the time that each step took to execute on the full Amazon graph and aggregated the step times to compute the total execution time, as shown in Fig. 2. The results show that IDWCC has a shorter execution time than the DWCC algorithm. The vertex statistics step, as defined in the implementation part of Section III, is the one responsible for calculating the triangle count and degree of each vertex.
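The vertex statistics step can be illustrated outside Spark with a small pure-Python sketch. This is not the authors' GraphX implementation, and the function names (`build_adjacency`, `vertex_stats`, `add_micro_batch`) are hypothetical. It assumes a node-grained micro-batch in which every new edge has at least one newly arrived endpoint, so only the endpoints of the new edges (the new vertices and the border vertices they attach to) can change their triangle count or degree.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets; self-edges and duplicate edges are dropped."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def vertex_stats(adj, vertices):
    """Return {x: (triangle_count, degree)} for the requested vertices."""
    stats = {}
    for x in vertices:
        nbrs = adj[x]
        # Every triangle through x is an edge between two neighbours of x;
        # the sum visits each such edge twice, hence the division by two.
        closed = sum(len(nbrs & adj[y]) for y in nbrs) // 2
        stats[x] = (closed, len(nbrs))
    return stats


def add_micro_batch(adj, stats, new_edges):
    """Merge a micro-batch and refresh statistics only where they can change.

    Assumption (node-grained stream): each new edge has at least one
    newly arrived endpoint, so untouched old vertices keep their values.
    """
    touched = set()
    for u, v in new_edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
            touched.update((u, v))
    stats.update(vertex_stats(adj, touched))
    return touched


# Bulk graph: one triangle (1, 2, 3) plus a pendant edge (3, 4).
adj = build_adjacency([(1, 2), (2, 3), (1, 3), (3, 4)])
stats = vertex_stats(adj, list(adj))

# Micro-batch: new vertex 5 attaches to border vertices 3 and 4.
touched = add_micro_batch(adj, stats, [(5, 3), (5, 4)])
print(sorted(touched))
print(stats)
```

In this toy graph the incremental refresh recomputes only vertices 3, 4, and 5, and the stored statistics agree with a full recomputation over all vertices. If a stream could add edges between two old vertices, the refresh set would also have to include their common neighbours.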
[Figure 2: bar chart. Y-axis: Execution Time (Seconds), 0 to 40. X-axis: Vertex Statistics, Graph Optimization, Initial Partition, Apply Movements, Calculate WCC. Series: DWCC, IDWCC. Bar labels: 37.48, 21.88, 15.84, 8.28, 8.3.]
Figure 2. Execution time for each step in DWCC vs IDWCC

In Fig. 2, we notice that the execution time of the vertex statistics step was reduced in IDWCC by almost three times compared to DWCC, as we counted the triangles and degrees only for the new vertices and the border vertices. The graph restructuring step had no change in execution time as no adjustments were made to the vertices. The iterative partitioning step had a small decrease in execution time caused by the way we altered this stage. The biggest gain was expected in memory consumption, as shown later. Finally, the WCC optimization step, which calculates the WCC metric, an expensive operation, was eliminated completely from the IDWCC algorithm (hence no IDWCC column in Fig. 2), resulting in a great reduction in the computational cost.

[Figure 3: bar chart. Y-axis: Global WCC, 0 to 0.4. X-axis: Amazon, DBLP, YouTube. Series: SCD, DWCC, IDWCC.]
Figure 3. Global WCC: SCD vs DWCC vs IDWCC

Experiment b aimed at comparing the quality of the results of the distributed graph community detection algorithms, DWCC and IDWCC. The existing implementation of the SCD algorithm already proved that it had better quality than many other centralized community detection algorithms while having faster execution [17]. These studies compared the quality of the results of SCD to that of Louvain [33] and Infomap [27]. Therefore, instead of repeating similar experiments and comparing the effectiveness of WCC optimization to that of other approaches such as Louvain and Infomap, we compared the quality of the communities produced by SCD, DWCC, and IDWCC. We measure the quality of the communities by calculating the global WCC on the full test graph after appending data from the last micro-batch. The results are displayed in Fig. 3, which shows that both DWCC and IDWCC produced good WCC values. These WCC values show only up to a 5% decrease from their SCD counterparts. On top of that, IDWCC gives slightly better results than the DWCC algorithm. We do not show the results for the LiveJournal graph as both DWCC and IDWCC failed to process the whole stream. In both cases, the computational needs exceeded the available memory resources.

Figure 4. Comparing WCC values (DWCC vs IDWCC) for (a) Amazon; (b) DBLP; (c) YouTube; (d) LiveJournal

Figure 5. Comparing WCC execution time (DWCC vs IDWCC) for (a) Amazon; (b) DBLP; (c) YouTube; (d) LiveJournal

Experiment c was designed to demonstrate the quality of the results and the efficiency of the IDWCC algorithm compared to the DWCC algorithm. For each graph and each micro-batch in the stream as described in Table 2, we merged the micro-batch with the full graph. Then we applied both DWCC and IDWCC on the modified graph. In the streaming context,
DWCC finds the communities for the whole graph again, whereas our IDWCC finds communities only for the new vertices and reflects these changes on the old vertices.
For each micro-batch, we compared the global WCC values of the resulting graph generated by IDWCC and DWCC in addition to the execution times. Figs. 4 and 5 show the results of the experiments with streaming data. Missing data points indicate that the algorithm was not able to continue processing the graph due to insufficient memory. Fig. 4 clearly demonstrates that IDWCC produces communities with global WCC values that are very close to the ones produced by DWCC. We can even see that the results start to be better than those of DWCC in later iterations. Regarding the execution time, IDWCC was two to three times faster than DWCC. For the large LiveJournal graph, we see that both algorithms failed to continue with the available computational resources. However, IDWCC continued for 7 micro-batches before it crashed, while DWCC could only process up to 3 micro-batches. This shows that IDWCC has significantly lower memory consumption than DWCC. The experiments show that our dynamic graph processing framework with IDWCC is capable of maintaining graphs up to 100 million edges while updating them with streaming data in under 50 seconds.

V. CASE STUDY

We further examined the communities produced by IDWCC in a real-world application of dynamic graphs. As a case study, we chose the scenario of generating recommendations for product purchases using the Amazon streaming dataset [39]. The metadata for this graph is available on the SNAP website [38] and contains the titles of the products.

Table 3. Examples of communities produced by IDWCC on Amazon products data.
Community #    Example 1                                   Example 2
community #1   Gulliver's Travels                          Robinson Crusoe: His Life and Strange Surprising Adventures
               Science Fiction Classics of H.G. Wells      Treasure Island
               Swiss Family Robinson                       Gulliver's Travels
               The War of the Worlds                       The Swiss Family Robinson
               Anne of Avonlea                             Robinson Crusoe: Life and Strange Surprising Adventures
community #2   Merchant of Venice                          Hamlet: The New Variorum Edition
               The Merchant of Venice                      Hamlet
               Macbeth                                     The Merchant of Venice
               Othello: The Applause Shakespeare Library   A Midsummer Night's Dream
               Much Ado About Nothing                      Othello
community #3   1984                                        To Kill a Mockingbird
               A Separate Peace                            John Knowles's a Separate Peace
               Lord of the Flies                           Joseph Heller's Catch-22
               Romeo and Juliet                            The Grapes of Wrath
               1984                                        1984

The product recommendation problem aims to find products that are usually bought together in order to suggest them to the users. This graph represents a network of products. Each vertex represents a product, and each edge connects two products if they are frequently purchased together. Therefore, the communities in the graph constructed using the Amazon data would represent similar products that are frequently purchased together and can be used to provide recommendations about products to the customer. We ran the IDWCC algorithm on the graph and chose three communities, which are reported in Table 3. Table 3 has 10 randomly selected vertices from each community. In the case of the first community, we see that it is formed of classic novels. The second community consists of Shakespearean literature. Finally, the third one is mostly political and allegorical novels. We observe that the algorithm can make a good selection of the relations in the graphs to give meaningful communities.

VI. CONCLUSION
Detecting communities in very large graphs poses computational challenges, and existing algorithms do not offer a feasible solution for large dynamic graphs [7]. In this paper, we propose the IDWCC algorithm as a solution to the community detection problem for large dynamic distributed graphs that need to be updated continuously from streaming data. We demonstrate the efficacy of our solution by implementing a prototype using cutting-edge Spark, Spark Streaming, and GraphX to create and maintain a large in-memory distributed graph over a multi-cluster distributed infrastructure. We begin by conducting a study on the use of the Weighted Community Clustering (WCC) metric to detect communities with Apache Spark. Next, we present the implementation of a Distributed WCC (DWCC) optimization using Apache Spark and GraphX to detect communities in static graphs. Finally, we propose and demonstrate a novel Incremental Distributed WCC (IDWCC) algorithm for detecting communities in large node-grained dynamic distributed graphs. IDWCC improves the DWCC optimization by assigning the newly added vertices to the most suitable communities in a distributed graph and by optimizing some of the computational steps. The algorithm is implemented in Scala using the in-memory GraphX structure and is executed on a distributed multi-cluster environment using Apache Spark. The experiments showed that IDWCC outperforms DWCC for large dynamic graphs. IDWCC produced the same or better WCC values compared to DWCC. It was also two to three times faster than DWCC. The memory consumption was lower in IDWCC as well. To the best of our knowledge, IDWCC is the best performing incremental community detection algorithm for node-grained dynamic distributed graphs. We also demonstrated and validated the usability of dynamic community detection using IDWCC for a real-life e-commerce use case scenario of product recommendations using Amazon product data.
As future work, we would like to study the stability of the quality of IDWCC over a long period of time, with the goal of assessing the need to apply the DWCC optimization periodically on the full graph to maintain high accuracy of the results in the case of result degradation. We also aim to address the memory consumption problem of the IDWCC algorithm, which causes a bottleneck when computing new communities for each vertex. We plan to further optimize the iteration phase by limiting the number of vertices for which we update the communities in each iteration. This may be done by using
statistics calculated from the previous iterations. We only addressed undirected unweighted node-grained dynamic graphs in this research. We would like to extend our framework to work with edge-grained dynamic graphs and explore the effect of the edge weights in the community detection process.

ACKNOWLEDGMENT
The authors wish to thank the Queen’s University Centre for Advanced Computing (CAC) for providing access to computing resources to run our experiments.

REFERENCES
[1] Aridhi, S. and Nguifo, E. M., "Big graph mining: Frameworks and techniques," Big Data Research, vol. 6, pp. 1-10, 2016.
[2] Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., & Czajkowski, G., "Pregel: a system for large-scale graph processing," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, 2010: ACM, pp. 135-146.
[3] Gonzalez, J. E., Xin, R. S., Dave, A., Crankshaw, D., Franklin, M. J., & Stoica, I., "GraphX: Graph Processing in a Distributed Dataflow Framework," in OSDI, 2014, vol. 14, pp. 599-613.
[4] Aridhi, S., Montresor, A., & Velegrakis, Y., "BLADYG: A graph processing framework for large dynamic graphs," Big Data Research, vol. 9, pp. 9-17, 2017.
[5] Heidari, S., Simmhan, Y., Calheiros, R. N., and Buyya, R., "Scalable Graph Processing Frameworks: A Taxonomy and Open Challenges," ACM Computing Surveys (CSUR), vol. 51, no. 3, p. 60, 2018.
[6] Sengupta, D., Sundaram, N., Zhu, X., Willke, T. L., Young, J., Wolf, M., & Schwan, K. (2016, August). Graphin: An online high performance incremental graph processing framework. In European Conference on Parallel Processing (pp. 319-333). Springer, Cham.
[7] Yin, S., Chen, S., Feng, Z., Huang, K., He, D., Zhao, P., & Yang, M. Y. (2016, November). Node-grained incremental community detection for streaming networks. In 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI) (pp. 585-592). IEEE.
[8] Aksu, H., Canim, M., Chang, Y. C., Korpeoglu, I., & Ulusoy, Ö. (2014). "Distributed k-core view materialization and maintenance for large dynamic graphs," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 10, pp. 439-452, 2014.
[9] Aridhi, S., Brugnara, M., Montresor, A., & Velegrakis, Y., "Distributed k-core decomposition and maintenance in large dynamic graphs," in Proceedings of the 10th ACM International Conference on Distributed and Event-based Systems, 2016: ACM, pp. 161-168.
[10] Li, R. H., Yu, J. X., & Mao, R., "Efficient core maintenance in large dynamic graphs," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 10, pp. 2453-2465, 2014.
[11] Sariyüce, A. E., Gedik, B., Jacques-Silva, G., Wu, K. L., & Çatalyürek, Ü. V., "Streaming algorithms for k-core decomposition," Proceedings of the VLDB Endowment, vol. 6, no. 6, pp. 433-444, 2013.
[12] Sakouhi, C., Aridhi, S., Guerrieri, A., Sassi, S., & Montresor, A., "DynamicDFEP: a distributed edge partitioning approach for large dynamic graphs," in Proceedings of the 20th International Database Engineering & Applications Symposium, 2016: ACM, pp. 142-147.
[13] Xu, Y., Cheng, J., Fu, A. W. C., & Bu, Y., "Distributed maximal clique computation," in Big Data (BigData Congress), 2014 IEEE International Congress on, 2014: IEEE, pp. 160-167.
[14] Saltz, M., Prat-Pérez, A., & Dominguez-Sal, D. (2015, May). "Distributed community detection with the wcc metric," in Proceedings of the 24th International Conference on World Wide Web, 2015: ACM, pp. 1095-1100.
[15] Rossetti, G., & Cazabet, R., "Community discovery in dynamic networks: a survey," ACM Computing Surveys (CSUR), vol. 51, no. 2, p. 35, 2018.
[16] Prat-Pérez, A., Dominguez-Sal, D., Brunat, J. M., & Larriba-Pey, J. L., "Put three and three together: Triangle-driven community detection," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 10, no. 3, p. 22, 2016.
[17] Prat-Pérez, A., Dominguez-Sal, D., & Larriba-Pey, J. L., "High quality, scalable and parallel community detection for large real graphs," in Proceedings of the 23rd international conference on World wide web, 2014: ACM, pp. 225-236.
[18] Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., & Hwang, D. U., "Complex networks: Structure and dynamics," Physics reports, vol. 424, no. 4-5, pp. 175-308, 2006.
[19] Wang, T., Yang, B., Gao, J., Yang, D., Tang, S., Wu, H., ... & Pei, J., "Mobileminer: a real world case study of data mining in mobile communication". In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data (pp. 1083-1086).
[20] Prat-Pérez, A., Dominguez-Sal, D., Brunat, J. M., & Larriba-Pey, J. L., "Shaping communities out of triangles," in Proceedings of the 21st ACM international conference on Information and knowledge management, 2012: ACM, pp. 1677-1681.
[21] Basuchowdhuri, P., Sikdar, S., Nagarajan, V., Mishra, K., Gupta, S., & Majumder, S., "Fast detection of community structures using graph traversal in social networks," Knowledge and Information Systems, pp. 1-31, 2017.
[22] Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., ... & Stoica, I. (2016). Apache spark: a unified engine for big data processing. Communications of the ACM, 59(11), 56-65.
[23] Xin, R. S., Gonzalez, J. E., Franklin, M. J., & Stoica, I., "Graphx: A resilient distributed graph system on spark," in First International Workshop on Graph Data Management Experiences and Systems, 2013: ACM, p. 2.
[24] Fortunato, S., "Community detection in graphs," Physics Reports, vol. 486, no. 3–5, pp. 75-174, 2010.
[25] Raghavan, U. N., Albert, R., & Kumara, S., "Near linear time algorithm to detect community structures in large-scale networks," Physical review E, vol. 76, no. 3, p. 036106, 2007.
[26] Zhu, X., & Ghahramani, Z., "Learning from labeled and unlabeled data with label propagation," 2002.
[27] Rosvall, M., & Bergstrom, C. T., "Maps of random walks on complex networks reveal community structure," Proceedings of the National Academy of Sciences, vol. 105, no. 4, pp. 1118-1123, 2008.
[28] Harenberg, S., Bello, G., Gjeltema, L., Ranshous, S., Harlalka, J., Seay, R., & Samatova, N. (2014). "Community detection in large-scale networks: a survey and empirical evaluation," Wiley Interdisciplinary Reviews: Computational Statistics, vol. 6, no. 6, pp. 426-439, 2014.
[29] Fortunato, S., & Hric, D., "Community detection in networks: A user guide," Physics Reports, vol. 659, pp. 1-44, 2016.
[30] Clementi, A., Di Ianni, M., Gambosi, G., Natale, E., & Silvestri, R., "Distributed community detection in dynamic graphs," Theoretical Computer Science, vol. 584, pp. 19-41, 2015.
[31] Jian, X., Lian, X., & Chen, L., "On Efficiently Detecting Overlapping Communities over Distributed Dynamic Graphs," in 2018 IEEE 34th International Conference on Data Engineering (ICDE), 2018: IEEE, pp. 1328-1331.
[32] Hung, S. C., Araujo, M., & Faloutsos, C., "Distributed community detection on edge-labeled graphs using spark," in 12th International Workshop on Mining and Learning with Graphs (MLG), 2016, vol. 113.
[33] Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E., "Fast unfolding of communities in large networks," Journal of statistical mechanics: theory and experiment, vol. 2008, no. 10, p. P10008, 2008.
[34] Bagrow, J. P., "Communities and bottlenecks: Trees and treelike networks have high modularity," Physical Review E, vol. 85, no. 6, p. 066118, 2012.
[35] Fortunato, S., & Barthelemy, M., "Resolution limit in community detection," Proceedings of the National Academy of Sciences, vol. 104, no. 1, pp. 36-41, 2007.
[36] Shang, J., Liu, L., Xie, F., Chen, Z., Miao, J., Fang, X., & Wu, C. (2014). A real-time detecting algorithm for tracking community structure of dynamic networks. arXiv preprint arXiv:1407.2683.
[37] Pan, G., Zhang, W., Wu, Z., & Li, S., "Online community detection for large complex networks," PloS one, vol. 9, no. 7, p. e102799, 2014.
[38] "Supplemental Nutrition Assistance Program (SNAP) official website" https://www.fns.usda.gov/snap/supplemental-nutrition-assistance-program
[39] "Amazon streaming dataset" http://s3.amazonaws.com/aws-publicdatasets/
