Incremental Algorithms For Closeness Centrality
Depts. 1 Biomedical Informatics, 2 Computer Science and Engineering, 3 Electrical and Computer Engineering
The Ohio State University
Email: [email protected], [email protected], [email protected], [email protected]
Abstract—Centrality metrics have been shown to be highly correlated with the importance and the load of the nodes in network traffic. In this work, we provide fast incremental algorithms for closeness centrality computation. Our algorithms efficiently compute the closeness centrality values upon changes in network topology, i.e., edge insertions and deletions. We show that the proposed techniques are efficient on many real-life networks, especially on small-world networks, which have a small diameter and a spike-shaped shortest-distance distribution. We experimentally validate the efficiency of our algorithms on large-scale networks and show that they can update the closeness centrality values of 1.2 million authors in the temporal DBLP-coauthorship network 460 times faster than it would take to recompute them from scratch.

Keywords-closeness centrality; dynamic networks; small-world networks

Figure 1. A toy network with eight nodes (a–h), three consecutive edge insertions/deletions (ah, fh, and ab, respectively), and the corresponding closeness centrality values.
I. INTRODUCTION

Centrality metrics, such as closeness or betweenness, quantify how central a node is in a network. They have been successfully used to carry out analyses for various purposes, such as structural analysis of knowledge networks [14, 18], power grid contingency analysis [7], quantifying importance in social networks [12], analysis of covert networks [9], and even finding the best store locations in cities [15]. Several works on rapid computation of these metrics exist in the literature. The algorithm with the best time complexity to compute centrality metrics [2] is believed to be asymptotically optimal [8]. Research has focused either on approximation algorithms for computing centrality metrics [3, 4, 13] or on high-performance computing techniques [11, 19]. Today, the networks one needs to analyze can be quite large and dynamic, and better analysis techniques are always required.

In a dynamic and streaming network, ensuring the correctness of the centralities is a challenging task [5, 10]. Furthermore, for some applications involving a static network, such as the contingency analysis of power grids and robustness evaluation of networks, we need to know how the centrality values change when the network topology is modified by an adversary or by outer effects such as natural disasters, so that we can be prepared and take proactive measures. As Figure 1 shows, the effect of a local topology modification is usually global. To quantify these effects and find exact centrality scores, existing algorithms are not efficient enough to be used in practice. Novel, incremental algorithms are essential to quickly evaluate the effects of topology modifications on centrality values.

Our main contributions are incremental algorithms which efficiently update the closeness centralities upon edge insertions and deletions. Compared with the existing algorithms, our algorithms have a low memory footprint, which makes them practical and applicable to very large graphs. For random edge insertions/deletions to the Wikipedia users' communication graph, we reduced the centrality (re)computation time from 2 days to 16 minutes. And for the real-life temporal DBLP coauthorship network, we reduced the time from 1.3 days to 4.2 minutes.

The rest of the paper is organized as follows: Section II introduces the notation and the closeness centrality metric. Our algorithms are explained in detail in Section III. Related work is given in Section IV. An experimental analysis is given in Section V. Section VI concludes the paper.

II. BACKGROUND

Let G = (V, E) be a network modeled as a simple graph with n = |V| vertices and m = |E| edges, where each node is represented by a vertex in V, and a node-node interaction is represented by an edge in E. Let ΓG(v) be the set of vertices which are connected to v.

A graph G′ = (V′, E′) is a subgraph of G if V′ ⊆ V and E′ ⊆ E. A path is a sequence of vertices such that there exists an edge between each consecutive vertex pair. A path between two vertices s and t is denoted by s ⇝ t (or s ⇝P t if a specific path P with endpoints s and t is mentioned). Two vertices u, v ∈ V are connected if
there is a path between u and v. If all vertex pairs in G are connected, we say that G is connected. Otherwise, G is disconnected, and each maximal connected subgraph of G is a connected component, or a component, of G. We use dG(u, v) to denote the length of the shortest path between two vertices u, v in a graph G. If u = v, then dG(u, v) = 0. If u and v are disconnected, then dG(u, v) = ∞.

Given a graph G = (V, E), a vertex v ∈ V is called an articulation vertex if the graph G−v (obtained by removing v) has more connected components than G. Similarly, an edge e ∈ E is called a bridge if G−e (obtained by removing e from E) has more connected components than G. G is biconnected if it is connected and it does not contain an articulation vertex. A maximal biconnected subgraph of G is a biconnected component.

A. Closeness Centrality

Given a graph G, the farness of a vertex u is defined as

    far[u] = Σ_{v∈V, dG(u,v)≠∞} dG(u, v).

And the closeness centrality of u is defined as

    cc[u] = 1 / far[u].    (1)

If u cannot reach any vertex in the graph, cc[u] = 0.

For a sparse unweighted graph G = (V, E), the complexity of cc computation is O(n(m + n)) [2]. For each vertex s ∈ V, Algorithm 1 executes a Single-Source Shortest Paths (SSSP) computation, i.e., it initiates a breadth-first search (BFS) from s, computes the distances to the other vertices and far[s], the sum of the distances which are different from ∞. As the last step, it computes cc[s]. Since a BFS takes O(m + n) time, and n SSSPs are required in total, the complexity follows.

Algorithm 1: CC: Basic centrality computation
    Data: G = (V, E)
    Output: cc[.]
    for each s ∈ V do
        ▷ SSSP(G, s) with centrality computation
        Q ← empty queue
        d[v] ← ∞, ∀v ∈ V \ {s}
        Q.push(s), d[s] ← 0
        far[s] ← 0
        while Q is not empty do
            v ← Q.pop()
            for all w ∈ ΓG(v) do
                if d[w] = ∞ then
                    Q.push(w)
                    d[w] ← d[v] + 1
                    far[s] ← far[s] + d[w]
        cc[s] ← 1 / far[s]
    return cc[.]

III. MAINTAINING CENTRALITY

Many real-life networks are scale-free. The diameters of these networks grow proportionally to the logarithm of the number of nodes. That is, even with hundreds of millions of vertices, the diameter is small, and when the graph is modified with minor updates, it tends to stay small. Combining this with the power-law degree distribution of scale-free networks, we obtain the spike-shaped shortest-distance distribution shown in Figure 2. We use work filtering with level differences and utilization of special vertices to exploit these observations and reduce the centrality computation time. In addition, we apply SSSP hybridization to speed up each SSSP computation.

Figure 2. The probability that the distance between two (connected) vertices is equal to x, for four social and web networks (amazon0601, soc-sign-epinions, web-Google, and web-NotreDame).

A. Work Filtering with Level Differences

For efficient maintenance of the closeness centrality values in case of an edge insertion/deletion, we propose a work filter which reduces the number of SSSPs in Algorithm 1 and the cost of each SSSP by utilizing the level differences. Level-based filtering detects the unnecessary updates and filters them out. Let G = (V, E) be the current graph and uv be an edge to be inserted to G. Let G′ = (V, E ∪ {uv}) be the updated graph. The centrality definition in (1) implies that for a vertex s ∈ V, if dG(s, t) = dG′(s, t) for all t ∈ V, then cc[s] = cc′[s]. The following theorem is used to detect such vertices and filter their SSSPs.

Theorem 1: Let G = (V, E) be a graph and u and v be two vertices in V s.t. uv ∉ E. Let G′ = (V, E ∪ {uv}). Then cc[s] = cc′[s] if and only if |dG(s, u) − dG(s, v)| ≤ 1.

Proof: If s is disconnected from u and v, uv's insertion will not change cc[s]. Hence, cc[s] = cc′[s]. If s is only connected to one of u and v in G, the difference |dG(s, u) − dG(s, v)| is ∞, and cc[s] needs to be updated by using the new, larger connected component containing s. When s is connected to both u and v in G, we investigate the edge insertion in three cases, as shown in Figure 3:

Case 1, dG(s, u) = dG(s, v): Assume that the path s ⇝P u–v ⇝P′ t is a shortest s ⇝ t path in G′ containing uv. Since dG(s, u) = dG(s, v), there exists a shorter path s ⇝P″ v ⇝P′ t with one less edge. Hence, ∀t ∈ V, dG(s, t) = dG′(s, t).

Case 2, |dG(s, u) − dG(s, v)| = 1: Let dG(s, u) < dG(s, v). Assume that s ⇝P u–v ⇝P′ t is a shortest path in G′ containing uv. Since dG(s, v) = dG(s, u) + 1, there exists another path s ⇝P″ v ⇝P′ t with the same length. Hence, ∀t ∈ V, dG(s, t) = dG′(s, t).

Case 3, |dG(s, u) − dG(s, v)| > 1: Let dG(s, u) < dG(s, v). The path s ⇝ u–v in G′ is shorter than the shortest s ⇝ v path in G, since dG(s, v) > dG(s, u) + 1. Hence, ∀t ∈ V \ {v}, dG′(s, t) ≤ dG(s, t) and dG′(s, v) < dG(s, v), i.e., an update on cc[s] is necessary.
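As a concrete reference for the computation that Theorem 1 lets us skip, the per-source BFS of Algorithm 1 can be sketched in Python (the dictionary-based graph representation and the function name are our illustrative choices, not the paper's C implementation):

```python
from collections import deque

def closeness_centrality(adj):
    """Algorithm 1: one BFS per source vertex.

    adj maps each vertex to a list of its neighbors.
    cc[s] = 1 / far[s], where far[s] sums the finite distances from s;
    cc[s] = 0 if s cannot reach any other vertex.
    """
    cc = {}
    for s in adj:
        dist = {s: 0}       # d[.]; vertices absent from dist are at distance ∞
        far = 0             # far[s]
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:          # first visit: d[w] = d[v] + 1
                    dist[w] = dist[v] + 1
                    far += dist[w]
                    queue.append(w)
        cc[s] = 1.0 / far if far > 0 else 0.0
    return cc
```

On the toy path a–b–c, this yields cc[b] = 1/2 and cc[a] = cc[c] = 1/3; the overall cost is one O(m + n) BFS per source, giving the O(n(m + n)) bound above.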
Figure 3. Three cases of edge insertion: when an edge uv is inserted to the graph G, for each vertex s, one of the following holds: (1) dG(s, u) = dG(s, v), (2) |dG(s, u) − dG(s, v)| = 1, or (3) |dG(s, u) − dG(s, v)| > 1.

Although Theorem 1 yields a filter only in case of edge insertions, the following corollary, which is used for edge deletions, easily follows.

Corollary 2: Let G = (V, E) be a graph and u and v be two vertices in V s.t. uv ∈ E. Let G′ = (V, E \ {uv}). Then cc[s] = cc′[s] if and only if |dG′(s, u) − dG′(s, v)| ≤ 1.

With this corollary, the work filter can be implemented for both edge insertions and deletions. The pseudocode of the update algorithm in case of an edge insertion is given in Algorithm 2. When an edge uv is inserted/deleted, to employ the filter, we first compute the distances from u and v to all other vertices. Then, we filter out the vertices satisfying the statement of Theorem 1.

Algorithm 2: Simple work filtering
    Data: G = (V, E), cc[.], uv
    Output: cc′[.]
    G′ ← (V, E ∪ {uv})
    du[.] ← SSSP(G, u)    ▷ distances from u in G
    dv[.] ← SSSP(G, v)    ▷ distances from v in G
    for each s ∈ V do
        if |du[s] − dv[s]| ≤ 1 then
            cc′[s] ← cc[s]
        else
            ▷ use the computation in Algorithm 1 with G′
    return cc′[.]

B. Utilization of Special Vertices

We exploit some special vertices to speed up the incremental closeness centrality computation further. We leverage the articulation vertices and identical vertices in networks. Although it has been previously shown that articulation vertices in real social networks are limited and yield an unbalanced shattering [17], we present the related techniques here to give a complete view.

1) Filtering with biconnected components: Our filter can be assisted by maintaining a biconnected component decomposition (BCD) of G = (V, E). A BCD is a partitioning Π of E where Π(e) is the component of each edge e ∈ E. When uv is inserted to G and G′ = (V, E′ = E ∪ {uv}) is obtained, we check whether

    {Π(uw) : w ∈ ΓG(u)} ∩ {Π(vw) : w ∈ ΓG(v)}

is empty or not: if the intersection is not empty, there will be only one element in it, cid, which is the id of the biconnected component of G′ containing uv (otherwise Π is not a valid BCD). In this case, Π′(e) is set to Π(e) for all e ∈ E, and Π′(uv) is set to cid. If there is no biconnected component containing both u and v, i.e., if the intersection above is empty, we construct Π′ from scratch and set cid = Π′(uv). Π can be computed in linear, O(m + n), time [6]. Hence, the cost of BCD maintenance is negligible compared to the cost of updating closeness centrality. Details can be found in [16].

2) Filtering with identical vertices: Our preliminary analyses show that real-life networks can contain a significant number of identical vertices with the same or a similar neighborhood structure. We investigate two types of identical vertices.

Definition 3: In a graph G, two vertices u and v are type-I-identical if and only if ΓG(u) = ΓG(v).

Definition 4: In a graph G, two vertices u and v are type-II-identical if and only if {u} ∪ ΓG(u) = {v} ∪ ΓG(v).

Both types induce equivalence relations, since they are reflexive, symmetric, and transitive. Hence, all the classes they form are disjoint. Let u, v ∈ V be two identical vertices. One can see that for any vertex w ∈ V \ {u, v}, dG(u, w) = dG(v, w). Then the following is true.

Corollary 5: Let I ⊆ V be a vertex class containing type-I or type-II identical vertices. Then the closeness centrality values of all the vertices in I are equal.

C. SSSP Hybridization

The spike-shaped distribution given in Figure 2 can also be exploited for SSSP hybridization. Consider the execution of Algorithm 1: while executing an SSSP with source s, for each vertex pair {u, v}, u is processed before v if and only if dG(s, u) < dG(s, v). That is, Algorithm 1 consecutively uses the vertices with distance k to find the vertices with distance k + 1. Hence, it visits the vertices in a top-down manner. SSSP can also be performed in a bottom-up manner. That is to say, after all distance-k (level-k) vertices are found, the vertices whose levels are unknown can be processed to see if they have a neighbor at level k. The top-down variant is expected to be much cheaper for small k values. However, it can be more expensive for the upper levels, where there are far fewer unprocessed vertices remaining. Following the idea of Beamer et al. [1], we hybridize the SSSPs. While processing the nodes at an SSSP level, we simply compare the number of edges that need to be processed for each variant and choose the cheaper one.
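A minimal Python sketch of the edge-insertion update with the simple work filter of Algorithm 2 and Theorem 1 (the helper names and dictionary graph representation are illustrative; the BCD, identical-vertex, and hybridization refinements described above are deliberately omitted):

```python
from collections import deque

def bfs_distances(adj, s):
    # Single-source BFS levels; unreachable vertices are absent from the map.
    dist = {s: 0}
    queue = deque([s])
    while queue:
        x = queue.popleft()
        for w in adj[x]:
            if w not in dist:
                dist[w] = dist[x] + 1
                queue.append(w)
    return dist

def insert_edge_and_update(adj, cc, u, v):
    """Algorithm 2 (simple work filtering): insert uv and return cc'[.].

    Vertices s with |dG(s,u) - dG(s,v)| <= 1 keep their old value
    (Theorem 1); an SSSP is re-run only for the remaining vertices.
    Mutates adj in place to G' = (V, E + uv).
    """
    inf = float('inf')
    du = bfs_distances(adj, u)      # du[.]: distances from u in G
    dv = bfs_distances(adj, v)      # dv[.]: distances from v in G
    adj[u].append(v)
    adj[v].append(u)
    new_cc = {}
    for s in adj:
        du_s, dv_s = du.get(s, inf), dv.get(s, inf)
        # du_s == dv_s also keeps s when it is disconnected from both u and v.
        if du_s == dv_s or abs(du_s - dv_s) <= 1:
            new_cc[s] = cc[s]       # filtered: no distance from s has changed
        else:
            dist = bfs_distances(adj, s)   # as in Algorithm 1, but on G'
            far = sum(dist.values())
            new_cc[s] = 1.0 / far if far > 0 else 0.0
    return new_cc
```

On the path a–b–c–d, inserting ad turns the graph into a 4-cycle; b and c pass the level test and keep their values, only a and d are recomputed, and every vertex ends with cc = 1/4.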
IV. RELATED WORK

To the best of our knowledge, there are only two works on maintaining centrality in dynamic networks, and both are interested in betweenness centrality. Lee et al. proposed the QUBE framework, which uses a BCD and updates the betweenness centrality values in case of edge insertions and deletions in the network [10]. Unfortunately, the performance of QUBE is only reported on small graphs (less than 100K edges) with very low edge density. In other words, it only performs significantly well on small graphs with a tree-like structure having many small biconnected components.

Green et al. proposed a technique to update the betweenness centrality scores, rather than recomputing them from scratch, upon edge insertions (it can be extended to edge deletions) [5]. The idea is to store the whole data structure used by the previous computation. However, as the authors stated, it takes O(n² + nm) space to store all the required values. Compared to their work, our algorithms are much more practical since their memory footprint is linear.

V. EXPERIMENTAL RESULTS

We implemented the algorithms in C and compiled them with gcc v4.6.2 with the optimization flags -O2 -DNDEBUG. The graphs are kept in the compressed row storage (CRS) format. The experiments are run sequentially on a computer with two Intel Xeon E5520 CPUs clocked at 2.27GHz and equipped with 48GB of main memory.

For the experiments, we used 10 networks from the UFL Sparse Matrix Collection¹ and also extracted the coauthor network from the current set of DBLP papers. Properties of the graphs are summarized in Table I. They are from different application areas, such as social networks (hep-th, PGPgiantcompo, astro-ph, cond-mat-2005, soc-sign-epinions, loc-gowalla, amazon0601, wiki-Talk, DBLP-coauthor) and web networks (web-NotreDame, web-Google). The graphs are listed by increasing number of edges, and a distinction is made between the small graphs (with less than 500K edges) and the large graphs (with more than 500K edges).

¹ http://www.cise.ufl.edu/research/sparse/matrices/

Table I. The graphs used in the experiments. Column Org. shows the initial closeness computation time of CC and Best is the best update time we obtain in case of streaming data.

    name               |V|       |E|       Org. (s)   Best (s)   Speedup
    hep-th             8.3K      15.7K     1.41       0.05       29.4
    PGPgiantcompo      10.6K     24.3K     4.96       0.04       111.2
    astro-ph           16.7K     121.2K    14.56      0.36       40.5
    cond-mat-2005      40.4K     175.6K    77.90      2.87       27.2
    geometric mean                                               43.5
    soc-sign-epinions  131K      711K      778        6.25       124.5
    loc-gowalla        196K      950K      2,267      53.18      42.6
    web-NotreDame      325K      1,090K    2,845      53.06      53.6
    amazon0601         403K      2,443K    14,903     298        50.0
    web-Google         875K      4,322K    65,306     824        79.2
    wiki-Talk          2,394K    4,659K    175,450    922        190.1
    DBLP-coauthor      1,236K    9,081K    115,919    251        460.8
    geometric mean                                               99.8

Although the filtering techniques can reduce the update cost significantly in theory, their practical effectiveness depends on the underlying structure of G. Since the diameters of the social networks are small, the range of the shortest distances is small. Furthermore, the distribution of these distances is unimodal. When the distance with the peak (mode) is combined with the ones on its right and left, they cover a significant fraction of the pairs (56% for web-NotreDame, 65% for web-Google, 79% for amazon0601, and 91% for soc-sign-epinions). We expect the filtering procedure to have a significant impact on social networks because of their structure. Besides, that specific structure is also important for the SSSP hybridization.

A. Handling topology modifications

To assess the effectiveness of our algorithms, we need to know when each edge is inserted to/deleted from the graph. Our datasets from the UFL collection do not have this information. To conduct our experiments on these datasets, we delete 1,000 edges from a graph, chosen randomly in the following way: a vertex u ∈ V is selected randomly (uniformly), and a vertex v ∈ ΓG(u) is selected randomly (uniformly). Since we do not want to change the connectivity of the graph (having disconnected components can make our algorithms much faster, and it would not be fair to CC), we discard uv if it is a bridge. If this is not the case, we delete it from G and continue. We construct the initial graph by deleting these 1,000 edges. Each edge is then re-inserted one by one, and our algorithms are used to recompute the closeness centrality scores after each insertion.

In addition to the random insertion experiments, we also evaluated our algorithms on a real temporal dataset of the DBLP coauthor graph². In this graph, there is an edge between two authors if they published a paper together. We used the publication dates as timestamps and constructed the initial graph with the papers published before January 1, 2013. We used the coauthorship edges of the later papers for edge insertions. Although we used insertions in our experiments, a deletion is a very similar process which should give comparable results.

² http://www.informatik.uni-trier.de/~ley/db/

In addition to CC, we configure our algorithms in four different ways: CC-B only uses BCD, CC-BL uses BCD and filtering with levels, CC-BLI uses all three work filtering techniques including identical vertices, and CC-BLIH uses all the techniques described in this paper, including the SSSP hybridization.

Table II presents the results of the experiments. The second column, CC, shows the time to run the full base
algorithm for computing the closeness centrality values on the original version of the graph. Columns 3–6 of the table present the absolute runtimes (in seconds) of the centrality computation algorithms. The next four columns, 7–10, give the speedups achieved by each configuration. For instance, on the average, updating the closeness values by using CC-B on PGPgiantcompo is 11.5 times faster than running CC.

by filtering using level differences. Therefore, level filtering is more useful for the graphs having characteristics similar to small-world networks.

[Figure: distribution of Pr(X = 0), Pr(X = 1), and Pr(X > 1).]

equal to |dG(u, w) − dG(v, w)|. By using 1,000 uv edges, we
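The quantity in the figure above appears to be the level difference X = |dG(u, w) − dG(v, w)| for a candidate edge uv, which by Theorem 1 decides whether vertex w's SSSP can be filtered. A Python sketch (our illustrative code, not the paper's implementation) that estimates this distribution over a set of uv edges:

```python
from collections import deque

def bfs_distances(adj, s):
    # BFS levels from s; unreachable vertices are absent from the map.
    dist = {s: 0}
    queue = deque([s])
    while queue:
        x = queue.popleft()
        for w in adj[x]:
            if w not in dist:
                dist[w] = dist[x] + 1
                queue.append(w)
    return dist

def level_difference_distribution(adj, edges):
    """Fractions of vertices w with X = 0, X = 1, and X > 1,
    where X = |dG(u, w) - dG(v, w)|, over the given uv edges.
    By Theorem 1, the X <= 1 mass is the fraction of filtered SSSPs."""
    inf = float('inf')
    counts = {'X=0': 0, 'X=1': 0, 'X>1': 0}
    total = 0
    for u, v in edges:
        du = bfs_distances(adj, u)
        dv = bfs_distances(adj, v)
        for w in adj:
            a, b = du.get(w, inf), dv.get(w, inf)
            if a == b:                 # also covers w disconnected from both
                counts['X=0'] += 1
            elif abs(a - b) == 1:
                counts['X=1'] += 1
            else:
                counts['X>1'] += 1
            total += 1
    return {k: c / total for k, c in counts.items()}
```

For example, on a 4-cycle a–b–c–d–a with the candidate chord ac, vertices b and d have X = 0 while a and c have X = 2, so half of the SSSPs would be filtered.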