Incremental Community Detection in Distributed Dynamic Graph

Tariq Abughofa, Ahmed A. Harby, Haruna Isah, Farhana Zulkernine
School of Computing, Queen's University, Kingston, ON, Canada
abughofa@queensu.ca, Ahmed.harby@queensu.ca, h.isah@unb.ca, Farhana.zulkernine@queensu.ca
Abstract— Community detection is an important research topic in graph analytics that has a wide range of applications. A variety of static community detection algorithms and quality metrics were developed in the past few years. However, most real-world graphs are not static and often change over time. In the case of streaming data, communities in the associated graph need to be updated either continuously or whenever new data streams are added to the graph, which poses a much greater challenge in devising good community detection algorithms for maintaining dynamic graphs over streaming data. In this paper, we propose an incremental community detection algorithm for maintaining a dynamic graph over streaming data. The contributions of this study include (a) the implementation of a Distributed Weighted Community Clustering (DWCC) algorithm, (b) the design and implementation of a novel Incremental Distributed Weighted Community Clustering (IDWCC) algorithm, and (c) an experimental study to compare the performance of our IDWCC algorithm with the DWCC algorithm. We validate the functionality and efficiency of our framework in processing streaming data and performing large in-memory distributed dynamic graph analytics. The results demonstrate that our IDWCC algorithm performs up to three times faster than the DWCC algorithm for a similar accuracy.

Keywords—Distributed graph processing, dynamic graphs, streaming data, weighted community clustering

I. INTRODUCTION

Distributed processing of large-scale graphs has gained considerable attention in the last decade [1]. This is mainly due to (i) the unprecedented increase in the size of graph data, such as Web-based social media networks, (ii) the evolution of systems for processing massive graph data, such as Pregel [2] and GraphX [3], and (iii) the huge increase in the number of applications that utilize graph data, such as traffic and social network analysis [4]. According to Heidari et al. [5], a typical graph processing system executes graph algorithms such as graph traversal over a graph dataset across five logical phases: reading graph data, pre-processing, partitioning, computation, and error handling. Regardless of the size and type of framework or algorithms used, Heidari et al. [5] reported that large-scale graph data can be processed in offline, online, or real-time mode. Offline processing is the most popular mode and is achieved by loading the graph dataset into memory from disk storage and processing it. Online processing allows users to update, maintain, and re-process the graph data automatically with new values, either periodically or based on user-defined events. Real-time processing is similar to online processing except that it also enables instant incremental updates to be made to the graph data. Thus, it requires the computation to be done immediately after changes happen to the data, and the updated analytics results are returned with very short delays.

Several graph processing frameworks utilize static partitioning, which means that they consider the graph and the processing environment to remain unchanged [1][2][5]. However, most real-world graphs are dynamic as they change over time, with new data producing new vertices and edges that need to be merged into existing graphs. The changes in dynamic graphs are further complicated by the need for real-time guarantees in applications such as real-time disease spread and anomaly detection. Traditional static graph analytics approaches face a major limitation in meeting this demand [6]. Dynamic graph scenarios require novel online or real-time graph update and analytics algorithms, since traditional offline graph analytics approaches require the whole graph to be updated with the new data first, and the analytics algorithms to then be applied to the whole graph, which is extremely computation-intensive and hence impractical.

Dynamic graph updates [7] can be node-grained or edge-grained. In node-grained dynamic graphs, new nodes or vertices are added to the graph together with all their incident edges. An example of such graphs is a network of scientific papers and their references: once a paper is published, all the papers that it references are known, and no new references (connections) are added later. In edge-grained dynamic graphs, new edges are added or removed between already existing vertices. Social networks are a good example of these graphs: people add new friends and "un-friend" old ones all the time. Thus, the assumption that all the connections of a person are known when we add them to the graph is not viable. In these networks, the sequence in which new edges are added is important and influences the evolution of the graph structure.

Recently, the problem of distributed processing of large dynamic graphs has gained considerable attention [7]. Several traditional graph operations such as k-core decomposition [8-11], partitioning [12], and maximal clique computation [13] have been extended to support dynamic graphs. However, many graph processing frameworks do not support several graph operations in the context of dynamic graphs [4]. One such operation is community detection in graphs, which is the process of identifying groups of nodes that are highly connected among themselves and sparsely connected to the rest of the graph [14]. Such groups are referred to in the literature as "communities" and occur in various types of graphs. Several research studies on networks modeling real-world phenomena have shown that the networks are organized according to community structure and that their structures evolve with time [15]. Therefore, community detection within large-
Figure 1. Merging G* with G_t.

Algorithm 1: Partitioning
1: Let P be a set of communities generated at the last micro-batch;
2: S ← sortByClusteringCoefficients(V_{t+1});
3: for all v in S do
4:   if notVisited(v) then
5:     markAsVisited(v);
6:     if v ∈ V* then
7:       C ← {v};
8:     else
9:       C ← P.getCommunity(v);
10:    for all u in neighbors(v) do
11:      if notVisited(u) then
12:        markAsVisited(u);
13:        if u ∈ V* then
14:          C.add(u);
15:    P.add(C)
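As a concrete illustration, the greedy pass of Algorithm 1 can be sketched as a sequential Python function. This is a single-machine simplification for exposition, not the paper's distributed Scala/GraphX implementation; the argument names (`coeff`, `neighbors`, `v_star`, `prev`) are illustrative stand-ins for the precomputed vertex statistics and the previous partition.

```python
def partition(vertices, coeff, neighbors, v_star, prev):
    """Sequential sketch of Algorithm 1.

    coeff    -- precomputed clustering coefficient per vertex (line 2)
    v_star   -- set of newly arrived vertices V*
    prev     -- community id of each old vertex from the last micro-batch
    Returns a mapping: community id -> set of member vertices.
    """
    visited = set()
    communities = {}
    # Visit vertices in descending order of clustering coefficient (line 2).
    for v in sorted(vertices, key=lambda x: -coeff[x]):
        if v in visited:
            continue
        visited.add(v)
        if v in v_star:
            cid = v           # a new vertex opens its own community (line 7)
        else:
            cid = prev[v]     # an old vertex keeps its community (line 9)
        members = communities.setdefault(cid, set())
        members.add(v)
        for u in neighbors.get(v, ()):
            # Unvisited NEW neighbors join v's community (lines 10-14).
            if u not in visited and u in v_star:
                visited.add(u)
                members.add(u)
    return communities
```

For a toy batch in which every vertex is new (a triangle {1, 2, 3} plus a pendant vertex 4), the triangle vertices with coefficient 1.0 are visited first, so the triangle ends up in one community and the pendant vertex in its own.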
We identify a set of vertices which we call the border vertices. These vertices exist in both G_t and G*, and are part of the edges that connect the newly arriving batch with the old graph. Let us denote this set as V_b = V_t ∩ V*. We refer to the rest of the vertices in the new graph, which are not part of the border vertices, as the inner vertices V_n = V* \ V_b. The problem with the border vertices is that they have already been assigned to communities in G_t. However, since they have new connections, they are likely to belong to different communities. We therefore isolate each of these vertices in its own community in the full graph G_{t+1}. The merge phase also calculates t(x, V_{t+1}) and vt(x, V_{t+1}) for each vertex x in G_{t+1}. To perform the calculations efficiently, we recognize three possible situations. (a)

We choose communities for the vertices that appear in the new batch. These vertices include the inner vertices V_n, which have no communities assigned to them yet, and the border vertices V_b, which were removed from their communities during the previous phase. We use the same algorithm as in DWCC (see Algorithm 1), but we limit it to the above-mentioned sets of vertices only. Hence, every vertex in the new batch chooses the vertex with the highest clustering coefficient that does not belong to a community of another vertex as its community center.
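The border and inner vertex sets defined above reduce to two set operations per micro-batch. A minimal Python sketch (the function name is illustrative, not from the paper's implementation):

```python
def split_batch_vertices(v_t, v_star):
    """Split the vertices seen in a micro-batch into border vertices,
    which already exist in G_t and must be re-examined, and inner
    vertices, which are entirely new and have no community yet."""
    v_b = v_t & v_star    # V_b = V_t intersect V*
    v_n = v_star - v_b    # V_n = V* \ V_b
    return v_b, v_n
```

For example, with V_t = {1, 2, 3} and a batch touching vertices {3, 4, 5}, vertex 3 is a border vertex while 4 and 5 are inner vertices.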
Statistics of the old vertices stay the same as they were for the

Algorithm 2 follows the same steps as its counterpart, Algorithm 1 of the DWCC algorithm. However, it includes two optimizations, since this is the most expensive processing phase in terms of computations:
• Calculation of the community movements is still done on all the vertices, but we drop calculating the value of WCC in each iteration.
• We use a fixed number of iterations rather than running additional iterations as long as a good WCC improvement appears.
This might result in missing community movements that could have a good impact on WCC. However, as we process subsequent micro-batches, all the vertices start changing their communities again, and any previous changes that were missed are subsequently recovered. This way, the degradation of WCC over time is avoided.

Algorithm 2: Partitioning optimization
1: Let P be the initial partition;
2: iteration ← 1;
3: repeat
4:   M ← ∅;
5:   for all v in V do
6:     M.add(bestMovement(v, P));
7:   P ← applyMovements(M, P);
8:   iteration ← iteration + 1;
9: until iteration > maxIterations;

C. Preprocessing
This phase aims to calculate the t(x, V) and vt(x, V) values for each vertex of the graph. After these measurements are calculated, a graph optimization, which is stated in the optimization section, is performed by removing edges that do not close any triangles.
The Triangle Count algorithm in GraphX¹ requires the graph to be canonical, which means that the graph should ensure the following:
• Free from self-edges (edges with the same vertex as a source and a destination).
• All its edges are oriented (the source vertex has a greater number of directly connected triangles than the destination vertex based on a pre-defined comparison method).
• Has no duplicate edges.
The cleaning is done using the subgraph API provided by GraphX. We keep the calculated statistics, namely the triangle count and the degree of each vertex, for later use. We took advantage of the fact that GraphX supports property graphs, and hence we can save these statistics as properties of the graph vertices.

D. Implementation
We modify certain steps of the DWCC algorithm which incur high computational cost to make the algorithm more scalable, so that we can apply it to distributed dynamic graphs. The key steps of the DWCC algorithm are as follows.
• Step 1: Vertex Statistics (preprocessing): Count the triangles of vertices to identify communities and keep the statistics of triangle count and the degrees of vertices.
• Step 2: Graph Restructuring (preprocessing): Move vertices and remove redundant edges as a part of graph restructuring, optimizing, and cleaning.
• Step 3: Iterative Partitioning (initial partitioning): Use a broadcasting protocol for dynamic vertices to add/modify communities on distributed graphs.
• Step 4: WCC Optimization (partition refinement): Compute the WCC metric or improvement for the graph.
We investigate and calculate the execution time of each of the small steps of DWCC and propose changes to the DWCC algorithm to define the IDWCC for node-grained dynamic distributed graphs. The IDWCC algorithm has many steps similar to the DWCC. However, it has optimizations that avoid repeated calculations and consequently reduce the memory, data movement, and computational costs without sacrificing the quality of the result.
In this paper, we limit our scope of interest to dynamic graphs that satisfy two properties. First, the graph progresses over a window of time in which a small batch of vertices and their edges are added. These edges connect the new vertices to each other and to the full graph generated from the last micro-batch. Second, the edges are equal in value, i.e., the edges are not weighted or directed. We denote the graph from the previous iteration as G_t = (V_t, E_t), where V_t and E_t are its sets of vertices and edges at time t respectively. Let us refer to the vertices in the newly arriving batch as V* and the edges as E*. We define a micro-batch from the stream of a node-grained dynamic graph d as follows:

(7)

Based on the cost of each step and the above-mentioned graph properties, we developed the IDWCC algorithm that works in three phases. First, it merges a micro-batch of the streaming data with the maintained evolving graph, updates the vertex statistics, and optimizes the graph. Second, it assigns the new vertices to the initial communities. Finally, it optimizes the WCC metric to generate better communities.

Table 1. Properties of the test graphs

Data Sources    Vertices     Edges
Amazon          334,863      925,872
DBLP            317,080      1,049,866
YouTube         1,134,890    2,987,624

IV. VALIDATION AND RESULTS

A. Experimental Setup
A distributed multi-cluster environment was used for our experiments, applying Spark, Spark Streaming, and GraphX to implement the dynamic distributed graph. Eight identical machines were used, each having an 8-core 2.10 GHz Intel Xeon 64-bit CPU, 30 GB of RAM, and 300 GB of disk space to host and perform computations on the distributed graph data structure. We installed Apache Spark v2.2.0 on all the
¹ https://spark.apache.org/docs/latest/graphx-programming-guide.html#triangle-counting
machines. Both the DWCC and IDWCC algorithms were implemented using Scala 2.11².

B. Data Source
We used a set of different real-life undirected graphs that have ground-truth communities. We took these graphs from the SNAP data repository³. The selected graphs and some statistics about them are presented in Table 1. The use of multiple graphs for the experiments allowed us to compare the results for different graph sizes. In addition, it gave us the ability to experiment easily with different sizes of micro-batches. After loading a graph from the experimental sets, we ensured that it is clean and free from duplicates and self-edges. We also sorted the edges by the source vertices, which made the graph canonical from the start and ready for processing.

C. Complexity Analysis
We compare the complexity of a sequential implementation of our incremental algorithm to its static counterpart when both are applied to detect communities in a dynamic graph. Let n be the number of vertices and m the number of edges in the graph. We assume that the average degree of the graph at any point of time t is d = m/n, and that real-world graphs (graphs found in practice, with millions of vertices and edges) have a quasi-linear relation between vertices and edges, O(m) = O(n·log n). Under these assumptions, the complexity of the static WCC optimization algorithm, as calculated in Prat-Pérez et al. [17], is O(m·log n).

Now we calculate the complexity of a centralized version of the sequential incremental WCC optimization algorithm. For the first phase, we do not consider the complexity of merging the graphs, since the operation is necessary for any dynamic graph. That leaves the cost of computing the triangles and degrees for the vertices in the new batch as follows:

O(m*·d + m*) = O(m*·log n_{t+1})

The second phase requires sorting the vertices based on the local clustering coefficient. However, the vertices are already sorted from processing a previous micro-batch. Hence, the cost is only for organizing the new vertices in the right order, which requires sorting the new vertices and then executing a full scan of the vertices in the worst case: O(n* + n_t) = O(n_t). For the third phase, let α be the number of iterations required to find the best possible communities, which is a constant. In each iteration, we compute in the worst case d+1 movements for each vertex, where evaluating the WCC improvement of a movement has a cost of O(1). That makes the total cost as follows:

O(n·(d+1)) = O(m)

Next, we apply all the movements, which are at most equal to the number of vertices, so this costs O(n_{t+1}). We also need to update, for each iteration of this phase, the statistics w, c_out, d_in, and d_out for each vertex and community, which has a cost of O(m_{t+1}). We sum all the costs to get the full cost of this phase: O(α·(m_{t+1} + n_{t+1} + m_{t+1})) = O(m_{t+1}). The full cost of the algorithm is the sum of the costs of the three phases: O(m*·log n_{t+1} + n_t + m_{t+1}). Since m* << n_{t+1}, then m*·log n_{t+1} << n_{t+1}·log n_{t+1} < m_{t+1}. The cost can therefore be simplified to O(m_{t+1}). This cost is much smaller than O(m_{t+1}·log n_{t+1}), the cost of applying the static algorithm on the whole graph during the t-th iteration.

D. Experimental Details and Results
We conducted the following three sets of experiments.
• Experiment a: We computed the efficiency of the optimizations used in IDWCC by comparing the execution time of each step of IDWCC with its counterpart in DWCC.
• Experiment b: We compared the quality of the results of DWCC, IDWCC, and SCD.
• Experiment c: Finally, we compared the quality of the results and the execution time of the DWCC and IDWCC algorithms for a dynamic distributed graph by executing the algorithms while updating the graph in real time from a data stream. We executed 5 iterations of the DWCC and IDWCC algorithms based on the recommendations of Prat-Pérez et al. [19], the authors of the original WCC optimization algorithm.

We experimented with different graph sizes to compare the performance of the algorithms for different sizes of graphs and micro-batches of the streaming data. Initially, we constructed the graph from a bulk of static data already downloaded from the selected data sources. For the bulk data, the first two columns in Table 2 show the number of vertices and edges added to the graph respectively. Next, we appended new data to the existing distributed graph from the streaming data sources using micro-batches of data at a time. The number of edges added from the data streams and the corresponding numbers of micro-batches used for the updates are shown in columns 3 and 4 of Table 2 respectively. Note that we chose to use fewer micro-batches with the smaller bulk graphs, and more micro-batches for the larger bulk graphs, to limit the use of resources in each iteration.

Table 2. Initial sizes and updated sizes of the test graphs

               Bulk Vertices   Bulk Edges    Stream Edges   # of Micro-batches
Amazon         258,464         576,718       349,154        10
DBLP           253,119         852,754       197,112        10
YouTube        903,959         2,666,836     320,788        10
LiveJournal    768,792         13,997,342    20,683,847     30

Experiment a shows the benefits of the optimization techniques we applied to our IDWCC, as shown in the algorithm listing. For both DWCC and IDWCC, we calculated the time that each step took to execute on the full Amazon graph and aggregated them to compute the total execution time, as shown in Fig. 2. The results show that IDWCC has a shorter execution time compared to the DWCC algorithm. The vertex statistics step, as defined in the implementation part in Section III, is the one responsible for calculating each vertex's triangle count and degree.
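Collecting the three phase costs derived above gives the per-batch cost of the incremental algorithm. A summary in LaTeX, using the same symbols as the text; the final simplification assumes, as stated in the analysis, that m* << n_{t+1} and that O(m) = O(n log n):

```latex
\underbrace{O\!\left(m^{*}\log n_{t+1}\right)}_{\text{merge + vertex statistics}}
\;+\;
\underbrace{O\!\left(n_{t}\right)}_{\text{partitioning}}
\;+\;
\underbrace{O\!\left(m_{t+1}\right)}_{\text{WCC optimization}}
\;=\; O\!\left(m_{t+1}\right),
\quad\text{since}\quad
m^{*}\log n_{t+1} \ll n_{t+1}\log n_{t+1} < m_{t+1}.
```

This compares favorably with the O(m_{t+1}·log n_{t+1}) cost of re-running the static algorithm on the whole graph at each iteration.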
² https://github.com/TariqAbughofa/incremental_distributed_wcc
³ http://snap.stanford.edu/
results than the DWCC algorithm. We do not show the results for the LiveJournal graph, as both DWCC and IDWCC failed

[Figure: bar charts comparing DWCC and IDWCC, showing execution time in seconds and global WCC (0 to 0.4) for the Amazon, DBLP, and YouTube graphs.]