
Blogel: A Block-Centric Framework for Distributed Computation on Real-World Graphs

Da Yan∗1, James Cheng∗2, Yi Lu∗3, Wilfred Ng#4

∗ Department of Computer Science and Engineering, The Chinese University of Hong Kong
  {1 yanda, 2 jcheng, 3 ylu}@cse.cuhk.edu.hk
# Department of Computer Science and Engineering, The Hong Kong University of Science and Technology
  4 [email protected]

ABSTRACT

The rapid growth in the volume of many real-world graphs (e.g., social networks, web graphs, and spatial networks) has led to the development of various vertex-centric distributed graph computing systems in recent years. However, real-world graphs from different domains have very different characteristics, which often create bottlenecks in vertex-centric parallel graph computation. We identify three such important characteristics from a wide spectrum of real-world graphs, namely (1) skewed degree distribution, (2) large diameter, and (3) (relatively) high density. Among them, only (1) has been studied by existing systems, but many real-world power-law graphs also exhibit the characteristics of (2) and (3). In this paper, we propose a block-centric framework, called Blogel, which naturally handles all three adverse graph characteristics. Blogel programmers may think like a block and develop efficient algorithms for various graph problems. We propose parallel algorithms to partition an arbitrary graph into blocks efficiently, and block-centric programs are then run over these blocks. Our experiments on large real-world graphs verify that Blogel is able to achieve orders of magnitude performance improvements over the state-of-the-art distributed graph computing systems.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing [email protected]. Articles from this volume were invited to present their results at the 40th International Conference on Very Large Data Bases, September 1st-5th 2014, Hangzhou, China.
Proceedings of the VLDB Endowment, Vol. 7, No. 14
Copyright 2014 VLDB Endowment 2150-8097/14/10.

1. INTRODUCTION

Due to the growing need to deal with massive graphs in various graph analytic and graph mining applications, many distributed graph computing systems have emerged in recent years, including Pregel [11], GraphLab [10], PowerGraph [4], Giraph [1], GPS [15], and Mizan [8]. Most of these systems adopt the vertex-centric model proposed in [11], which promotes the philosophy of "thinking like a vertex" that makes the design of distributed graph algorithms more natural and easier. However, the vertex-centric model has largely ignored the characteristics of real-world graphs in its design and can hence suffer from severe performance problems.

We investigated a broad spectrum of real-world graphs and identified three characteristics of large real-world graphs, namely (1) skewed degree distribution (common for power-law and scale-free graphs such as social networks and web graphs), (2) (relatively) high density (common for social networks, mobile phone networks, SMS networks, some web graphs, and the cores of most large graphs), and (3) large diameter (common for road networks, terrain graphs, and some large web graphs). These three characteristics are particularly adverse to vertex-centric parallelization as they are often the major causes of one or more of the following three performance bottlenecks: skewed workload distribution, heavy message passing, and impractically many rounds of computation.

Let us first examine the performance bottleneck created by skewed degree distribution. The vertex-centric model assigns each vertex together with its adjacency list to a machine, but neglects the difference in the number of neighbors among different vertices. As a result, for graphs with skewed degree distribution, it creates unbalanced workload distribution that leads to a long elapsed running time due to waiting for the last worker to complete its job. For example, the maximum degree of the BTC RDF graph used in our experiments is 1,637,619, and thus a machine holding such a high-degree vertex needs to process many incoming messages and send out many outgoing messages to its neighbors, causing imbalanced workload among different machines.

Some existing systems proposed techniques for better load balancing [4, 15], but they do not reduce the overall workload. However, for many real-world graphs, including power-law graphs such as social networks and mobile phone networks, the average vertex degree is large. Also, most large real-world graphs have a high-density core (e.g., the k-core [16] and k-truss [19] of these graphs). Higher density implies heavier message passing for vertex-centric systems. We show that the heavy communication workload due to high density can also be eliminated by our new computing model.

For processing graphs with a large diameter δ, the message (or neighbor) propagation paradigm of the vertex-centric model often leads to algorithms that require O(δ) rounds (also called supersteps) of computation. For example, a single-source shortest path algorithm in [11] takes 10,789 supersteps on a USA road network. Apart from spatial networks, some large web graphs also have large diameters (from a few hundred to thousands). For example, the vertex-centric system in [14] takes 2,450 rounds for computing strongly connected components on a web graph.

To address the performance bottlenecks created by real-world graphs in vertex-centric systems, we propose a block-centric graph-parallel abstraction, called Blogel. Blogel is conceptually as simple as Pregel but works in coarser-grained graph units called blocks. Here, a block refers to a connected subgraph of the graph, and message exchanges occur among blocks.

Blogel naturally addresses the problem of skewed degree distribution since most (or all) of the neighbors of a high-degree vertex v are inside v's block, and they are processed by sequential in-memory algorithms without any message passing. Blogel also solves the heavy communication problem caused by high density,
since the neighbors of many vertices are now within the same block, and hence they do not need to send/receive messages to/from each other. Finally, Blogel effectively handles large-diameter graphs, since messages now propagate in the much larger unit of blocks instead of single vertices, and thus the number of rounds is significantly reduced. Also, since the number of blocks is usually orders of magnitude smaller than the number of vertices, the workload of a worker is significantly less than that of a vertex-centric algorithm.

A central issue to Blogel is whether we can partition an arbitrary input graph into blocks efficiently. We propose a graph Voronoi diagram based partitioner, which is a fully distributed algorithm, while we also develop more effective partitioners for graphs with additional information available. Our experiments show that our partitioning algorithms are efficient and effective.

We present a user-friendly and flexible programming interface for Blogel, and illustrate that programming in Blogel is easy by designing algorithms for a number of classic graph problems in the Blogel framework. Our experiments on large real-world graphs with up to hundreds of millions of vertices and billions of edges, and with different characteristics, verify that our block-centric system is orders of magnitude faster than the state-of-the-art vertex-centric systems [1, 10, 4]. We also demonstrate that Blogel can effectively address the performance bottlenecks caused by the three adverse characteristics of real-world graphs.

The rest of the paper is organized as follows. Section 2 reviews related systems. Section 3 illustrates the merits of block-centric computing. Section 4 gives an overview of the Blogel framework, Section 5 presents Blogel's programming interface, and Section 6 discusses algorithm design in Blogel. We describe our partitioners in Section 7 and present performance results in Section 8. Finally, we conclude our paper in Section 9.

2. BACKGROUND AND RELATED WORK

We first define some basic graph notations and discuss the storage of a graph in distributed systems.

Notations. Given an undirected graph G = (V, E), we denote the neighbors of a vertex v ∈ V by Γ(v); if G is directed, we denote the in-neighbors (out-neighbors) of v by Γin(v) (Γout(v)). Each v ∈ V has a unique integer ID, denoted by id(v). The diameter of G is denoted by δ(G), or simply δ when G is clear from the context.

Graph storage. We consider a shared-nothing architecture where data is stored in a distributed file system (DFS), such as Hadoop's DFS (HDFS). We assume that a graph is stored as a distributed file in HDFS, where each line records a vertex and its adjacency list. A distributed graph computing system involves a cluster of k workers, where each worker wi holds and processes a batch of vertices in its main memory. Here, "worker" is a general term for a computing unit, and a machine can have multiple workers in the form of threads/processes. A job is processed by a graph computing system in three phases: (1) loading: each worker loads a portion of vertices from HDFS into main memory; the workers then exchange vertices through the network (by hashing over vertex ID) so that each worker wi finally holds all and only those vertices assigned to wi; (2) iterative computing: in each iteration, each worker processes its own portion of vertices sequentially, while different workers run in parallel and exchange messages; (3) dumping: each worker writes the output of all its processed vertices to HDFS. Most existing graph-parallel systems follow this procedure.

Pregel's computing model. Pregel [11] is designed based on the bulk synchronous parallel (BSP) model. It distributes vertices to workers, where each vertex is associated with its adjacency list. A program in Pregel implements the compute() function and proceeds in iterations (called supersteps). In each superstep, the program calls compute() for each active vertex. The compute() function performs the user-specified task for a vertex v, such as processing v's incoming messages (sent in the previous superstep), sending messages to other vertices (for the next superstep), and making v vote to halt. A halted vertex is reactivated if it receives a message in a subsequent superstep. The program terminates when all vertices vote to halt and there is no pending message for the next superstep.

Pregel also supports message combiners. For example, if there are k numerical messages in worker wi to be sent to a vertex v in worker wj, and only the sum of the message values matters, then the user can implement a combine() function to sum up the message values and deliver only the sum to v in wj, thus reducing the number of messages to be buffered and transmitted. Pregel also supports aggregators, which are useful for global communication. Each vertex can provide a value to an aggregator in compute() in a superstep. The system aggregates those values and makes the aggregated result available to all vertices in the next superstep.

Since Google's Pregel is proprietary, many open-source systems have been developed based on Pregel's computing model, such as Giraph [1] and GPS [15]. In particular, GPS uses a technique called large adjacency list partitioning to handle high-degree vertices.

Vertex placement. Vertex placement rules that are more sophisticated than hashing were studied in [17], which aims at minimizing the number of cross-worker edges while ensuring that workers hold approximately the same number of vertices. However, their method requires expensive preprocessing while the performance gain is limited (e.g., only a speedup of 18%–39% for running PageRank).

Giraph++. A recent system, Giraph++ [18], proposed a graph-centric programming paradigm that opens up the block structure to users. However, Giraph++'s programming paradigm is still vertex-centric since it does not allow a block to have its own state like a vertex. Instead, each block is treated as two sets of vertices, internal ones and boundary ones. As such, Giraph++ does not support block-level communication, i.e., message passing is still from vertices to vertices, rather than from blocks to blocks, which is more efficient for solving some graph problems. For example, an algorithm for computing connected components is used to demonstrate Giraph++ programming in [18], in which vertex IDs are passed among internal vertices and boundary vertices. We implement an algorithm for the same problem in our block-centric framework in Section 3, and show that it is much simpler and more efficient to simply exchange block IDs between blocks directly. Moreover, since Giraph++ is Java-based, the intra-block computation inevitably incurs (de)serialization cost for each vertex accessed; in contrast, the intra-block computation in Blogel is simply a main-memory algorithm without any additional cost. Finally, Giraph++ extends METIS [7] for graph partitioning, which is an expensive method. In Section 8, we will show that graph partitioning and computing is much more efficient in Blogel than in Giraph++.

GRACE. Another recent system, GRACE [20], which works in a single-machine environment, also applies graph partitioning to improve performance. GRACE enhances vertex-centric computing with a scheduler that controls the order of vertex computation in a block. This can be regarded as a special case of block-centric computing, where the intra-block computing logic is totally defined by the scheduler. Although this model relieves users of the burden of implementing the intra-block computing logic, it is not as expressive as our block-centric paradigm. For example, GRACE does not allow a block to have its own state and does not support block-level communication. We remark that Blogel and GRACE have different focuses: GRACE attempts to improve main memory bandwidth utilization in a single-machine environment, while Blogel
aims to reduce the computation and communication workload in a distributed environment.

GraphLab. Unlike Pregel's synchronous model and message passing paradigm, GraphLab [10] adopts a shared memory abstraction and supports asynchronous execution. It hides the communication details from programmers: when processing a vertex, users can directly access its own field as well as the fields of its adjacent edges and vertices. GraphLab maintains a global scheduler, and workers fetch vertices from the scheduler for processing, possibly adding the neighbors of these vertices into the scheduler. Asynchronous computing may decrease the workload for some algorithms, but incurs extra cost due to data locking/unlocking. A recent version of GraphLab, called PowerGraph [4], partitions the graph by edges instead of by vertices, so that the edges of a high-degree vertex are handled by multiple workers. Accordingly, an edge-centric Gather-Apply-Scatter (GAS) computing model is used. However, GraphLab is less expressive than Pregel, since a vertex cannot communicate with a non-neighbor, and graph mutation is not supported.

3. MERITS OF BLOCK-CENTRIC COMPUTING: A FIRST EXAMPLE

We first use an example to illustrate the main differences in programming and performance between the block-centric model and the vertex-centric model, by considering the Hash-Min algorithm of [13] for finding connected components (CCs). We will show the merits of the block-centric model for processing large real-world graphs with the three characteristics discussed in Section 1.

Given a CC C, we denote the set of vertices of C by V(C), and define the ID of a CC C to be cc(v) = min{id(u) : u ∈ V(C)}. Hash-Min computes cc(v) for each vertex v ∈ V. The idea is to broadcast the smallest vertex ID seen so far by each vertex v, denoted by min(v).

For the vertex-centric model, in superstep 1, each vertex v sets min(v) to be the smallest ID among id(v) and id(u) of all u ∈ Γ(v), broadcasts min(v) to all its neighbors, and votes to halt. In each later superstep, each vertex v receives messages from its neighbors; let min* be the smallest ID received; if min* < min(v), v sets min(v) = min* and broadcasts min* to its neighbors. All vertices vote to halt at the end of a superstep. When the process converges, min(v) = cc(v) for all v.

Next we discuss the implementation of Hash-Min in the block-centric framework. Let us assume that vertices are already grouped into blocks by our partitioners (to be discussed in Section 7), which guarantee that vertices in a block are connected. Each vertex belongs to a unique block, and let block(v) be the ID of the block that v belongs to. Let ℬ be the set of blocks, and for each block B ∈ ℬ, let V(B) be the set of vertices of B, and id(B) be the integer ID of B. We define Γ(B) = ∪_{v∈V(B)} {block(u) : u ∈ Γ(v)}. Thus, we obtain a block-level graph, 𝒢 = (ℬ, ℰ), where ℰ = {(Bi, Bj) : Bi ∈ ℬ, Bj ∈ Γ(Bi)}. We then simply run Hash-Min on 𝒢, where blocks in ℬ broadcast the smallest block ID that they have seen. Similar to the vertex-centric algorithm, each block B maintains a field min(B), and when the algorithm converges, all vertices v with the same min(block(v)) belong to one CC.

We now analyze why the block-centric model is more appealing than the vertex-centric model. First, we consider the processing of graphs with skewed degree distribution, where we compare G with 𝒢. Most real-world graphs with skewed degree distribution consist of a giant CC and many small CCs. In this case, our partitioner computes a set of roughly even-sized blocks from the giant CC, while each small CC forms a block. Let b be the average number of vertices in a block and dmax be the maximum vertex degree in G. The vertex-centric model works on G, and hence a high-degree vertex may send/receive O(dmax) messages each round, causing skewed workload among the workers. On the contrary, the block-centric model works on 𝒢, and a high-degree vertex involves at most O(n/b) messages each round, where n is the number of vertices in the giant CC. For power-law graphs, dmax can approach n, and n/b can be a few orders of magnitude smaller than dmax for a reasonable setting of b (e.g., b = 1000).

In addition, as long as the number of blocks is sufficiently larger than the number of workers, the block-centric model can achieve a rather balanced workload with a simple greedy algorithm for block-to-worker assignment, described in Section 4.

            |                | Computing Time | Total Msg #   | Superstep #
BTC         | Vertex-Centric | 28.48 s        | 1,188,832,712 | 30
            | Block-Centric  | 0.94 s         | 1,747,653     | 6
Friendster  | Vertex-Centric | 120.24 s       | 7,226,963,186 | 22
            | Block-Centric  | 2.52 s         | 19,410,865    | 5
USA Road    | Vertex-Centric | 510.98 s       | 8,353,044,435 | 6,262
            | Block-Centric  | 1.94 s         | 270,257       | 26

Figure 1: Overall Performance of Hash-Min

Figure 1 reports the overall performance (elapsed computing time, total number of messages sent, and total number of supersteps) of the vertex-centric and block-centric Hash-Min algorithms on three graphs: an RDF graph BTC, a social network Friendster, and a USA road network. The details of these graphs can be found in Section 8. Figure 2 reports the performance (running time and number of messages) of Hash-Min for each superstep on BTC and Friendster.

BTC has a skewed degree distribution, as a few vertices have degree over a million but the average degree is only 4.69. The vertex-centric program is severely affected by the skewed workload due to high-degree vertices in the first few supersteps, and a large number of messages are passed as shown in Figure 2(a). The number of messages starts to drop significantly only in superstep 4, when the few extremely high-degree vertices become inactive. The subsequent supersteps involve fewer and fewer messages since the smallest vertex IDs have been seen by most vertices, and hence only a small fraction of the vertices remain active. However, it still takes 30 supersteps to complete the job. On the contrary, our block-centric program has balanced workload from the beginning and uses only 6 supersteps. Overall, as Figure 1 shows, the block-centric program uses 680 times fewer messages and 30 times less computing time.

Next, we discuss the processing of graphs with high density. Social networks and mobile phone networks often have relatively higher average degree than other real-world graphs. For example, Friendster has an average degree of 55.06 while the USA road network has an average degree of only 2.44 (see Figure 7), which implies that a vertex of Friendster can send/receive 22.57 times more messages than that of the USA road network. The total number of messages in each superstep of the vertex-centric Hash-Min is bounded by O(|E|), while that of the block-centric Hash-Min is bounded by O(|ℰ|). Since |ℰ| is generally significantly smaller than |E|, the block-centric model is much cheaper.

As Figure 1 shows, the vertex-centric Hash-Min uses 372 times more messages than the block-centric Hash-Min on Friendster, resulting in 48 times longer computing time. Figure 2(b) further shows that the number of messages and the elapsed computing time per superstep of the vertex-centric program are significantly larger than those of the block-centric program. Note that the heavier message passing and longer computing time are not primarily due to skewed workload, because the maximum degree in Friendster is only 5,214.

Let δ(G) and δ(𝒢) be the diameters of G and 𝒢, respectively. The vertex-centric Hash-Min takes O(δ(G)) supersteps, while the block-centric Hash-Min takes only O(δ(𝒢)) supersteps. For real-world graphs
with a large diameter, such as road networks, condensing a large region (i.e., a block) of G into a single vertex in 𝒢 allows distant points to be reached within a much smaller number of hops. For example, in the USA road network, there may be thousands of hops between Washington and Florida; but if we condense each state into a block, then the two states are just a few hops apart.

The adverse effect of large diameter on the vertex-centric model can be clearly demonstrated by the USA road network. As shown in Figure 1, the vertex-centric Hash-Min takes 6,262 supersteps, while the block-centric Hash-Min runs on a block-level graph with a much smaller diameter and uses only 22 supersteps. The huge number of supersteps of the vertex-centric Hash-Min also results in 263 times longer computing time and 30,908 times more messages being passed than the block-centric algorithm.

(a) Per-Superstep Performance over BTC

Superstep             1            2            3            4           5           6          7        8        9       ...  29      30
Vertex-Centric Time   7.24 s       6.82 s       6.62 s       2.86 s      2.34 s      0.17 s     0.13 s   0.10 s   0.09 s  ...  0.08 s  0.10 s
Vertex-Centric Msg #  393,175,048  349,359,723  320,249,398  78,466,694  44,961,718  1,884,460  530,278  128,602  61,727  ...  4       0
Block-Centric Time    0.29 s       0.12 s       0.10 s       0.15 s      0.12 s      0.15 s
Block-Centric Msg #   1,394,408    294,582      55,775       2,848       40          0

(b) Per-Superstep Performance over Friendster

Superstep             1              2              3              4              5            6          7          8        ...  21      22
Vertex-Centric Time   26.86 s        27.64 s        27.86 s        26.97 s        8.96 s       0.43 s     0.15 s     0.11 s   ...  0.18 s  0.15 s
Vertex-Centric Msg #  1,725,523,081  1,719,438,065  1,717,496,808  1,636,980,454  416,289,356  8,780,258  1,484,531  587,275  ...  1       0
Block-Centric Time    0.53 s         1.53 s         0.25 s         0.10 s         0.06 s
Block-Centric Msg #   6,893,957      6,892,723      5,620,051      4,134          0

Figure 2: Performance of Hash-Min Per Superstep

4. SYSTEM OVERVIEW

We first give an overview of the Blogel framework. Blogel supports three types of jobs: (1) vertex-centric graph computing, where a worker is called a V-worker; (2) graph partitioning, which groups vertices into blocks, where a worker is called a partitioner; (3) block-centric graph computing, where a worker is called a B-worker.

Figure 3 illustrates how the three types of workers operate. Figure 3(a) shows that each V-worker loads its own portion of vertices, performs vertex-centric graph computation, and then dumps the output of the processed vertices (marked as grey) to HDFS. Figure 3(b) shows block-centric graph computing, where we assume that every block contains two vertices. Specifically, the vertices are first grouped into blocks by the partitioners, which dump the constructed blocks to HDFS. These blocks are then loaded by the B-workers for block-centric computation.

[Figure 3: Operating Logic of Different Types of Workers; (a) V-workers, (b) Partitioners and B-workers]

Blocks and B-workers. In both vertex-centric graph computing and graph partitioning, when a vertex v sends a message to a neighbor u ∈ Γ(v), the worker to which the message is to be sent is identified by hashing id(u). We now consider block-centric graph computing. Suppose that the vertex-to-block assignment and block-to-worker assignment are already given; we define the block of a vertex v by block(v) and the worker of a block B by worker(B). We also define worker(v) = worker(block(v)). Then, the ID of a vertex v in a B-worker is given by a triplet trip(v) = ⟨id(v), block(v), worker(v)⟩. Thus, a B-worker can obtain the worker and block of a vertex v by checking its ID trip(v).

Similar to a vertex in Pregel, a block in Blogel also has a compute() function. We use B-compute() and V-compute() to denote the compute() function of a block and a vertex, respectively. A block has access to all its vertices, and can send messages to any block B or vertex v as long as worker(B) or worker(v) is available. Each B-worker maintains two message buffers, one for exchanging vertex-level messages and the other for exchanging block-level messages. A block also has a state indicating whether it is active, and may vote to halt.

Blogel computing modes. Blogel operates in one of the following three computing modes, depending on the application:

• B-mode. In this mode, Blogel only calls B-compute() for all its blocks, and only block-level message exchanges are allowed. A job terminates when all blocks have voted to halt and there is no pending message for the next superstep.

• V-mode. In this mode, Blogel only calls V-compute() for all its vertices, and only vertex-level message exchanges are allowed. A job terminates when all vertices have voted to halt and there is no pending message for the next superstep.

• VB-mode. In this mode, in each superstep, a B-worker first calls V-compute() for all its vertices, and then calls B-compute() for all its blocks. If a vertex v receives a message at the beginning of a superstep, v is activated along with its block B = block(v), and B will call its B-compute() function. A job terminates only if all vertices and blocks have voted to halt and there is no pending message for the next superstep.

Partitioners. Blogel supports several types of pre-defined partitioners. Users may also implement their own partitioners in Blogel. A partitioner loads each vertex v together with Γ(v). If the partitioning algorithm supports only undirected graphs but the input graph G is directed, partitioners first transform G into an undirected graph by making each edge bi-directed. The partitioners then compute the vertex-to-block assignment (details in Section 7).

Block assignment. After graph partitioning, block(v) is computed for each vertex v. The partitioners then compute the block-to-worker assignment. This is actually a load balancing problem:

DEFINITION 1 (LOAD BALANCING PROBLEM [9]). Given k workers w1, . . ., wk, and n jobs, let J(i) be the set of jobs assigned to worker wi and cj be the cost of each job j. The load of worker wi is defined as Li = Σ_{j∈J(i)} cj. The goal is to minimize L = max_i Li.

In our setting, each job corresponds to a block B, whose cost is given by the number of vertices in B. We use the 4/3-approximation algorithm given in [5], which first sorts the jobs by cost in non-increasing order, and then scans through the sorted jobs and assigns each job to the worker with the smallest load.
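This greedy step is simple enough to sketch directly. The standalone C++ fragment below illustrates the idea (sort blocks by size in non-increasing order, then repeatedly give the next block to the currently least-loaded worker); it is only an illustration of the 4/3-approximation just described, not Blogel's actual master-side code, and all function and variable names here are our own.

```cpp
// Sketch: greedy block-to-worker assignment (longest-processing-time rule).
// block_size[b] = number of vertices in block b; k = number of workers (k >= 1).
// Returns assignment[b] = worker that block b is assigned to.
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

std::vector<int> greedy_assign(const std::vector<long long>& block_size, int k) {
    // Order block indices by cost, non-increasing.
    std::vector<int> order(block_size.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return block_size[a] > block_size[b];
    });
    // Min-heap of (current load, worker id): the top is the least-loaded worker.
    typedef std::pair<long long, int> Entry;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > workers;
    for (int w = 0; w < k; ++w) workers.push(Entry(0, w));
    // Scan blocks in sorted order and assign each to the least-loaded worker.
    std::vector<int> assignment(block_size.size());
    for (int b : order) {
        Entry e = workers.top(); workers.pop();
        assignment[b] = e.second;
        e.first += block_size[b];
        workers.push(e);
    }
    return assignment;
}
```

With a binary heap, the assignment loop runs in O(n log k) time after the O(n log n) sort, so this step is negligible compared with the partitioning itself.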

The block-to-worker assignment is computed as follows. (1) To get the block sizes, each partitioner groups its vertices into blocks and sends the number of vertices in each block to the master. The master then aggregates the numbers sent by the partitioners to obtain the global number of vertices for each block. (2) The master then computes the block-to-worker assignment using the greedy algorithm described earlier, and broadcasts the assignment to each partitioner. (3) Each partitioner sets worker(v) for each of its vertices v according to the received block-to-worker assignment (note that block(v) is already computed).

Triplet ID of neighbors. So far we only compute trip(v) for each vertex v. However, in block-centric computing, if a vertex v needs to send a message to a neighbor u ∈ Γ(v), it needs to first obtain the worker holding u from trip(u). Thus, we also compute Γ̂(v) = {trip(u) : u ∈ Γ(v)} for each vertex v as follows. Each worker wi constructs a look-up table LTi→j locally for every worker wj in the cluster: for each vertex v in wi, and for each neighbor u ∈ Γ(v), trip(v) is added to LTi→j, where j = hash(id(u)), i.e., u is on worker wj. Then, wi sends LTi→j to each wj, and each worker merges the received look-up tables into one look-up table. Now, a vertex u on worker wj can find trip(v) for each neighbor v ∈ Γ(u) from the look-up table of wj to construct Γ̂(u).

Till now, each partitioner wi still holds only those vertices v with hash(id(v)) = i. After Γ̂(v) is constructed for each vertex v, the partitioners exchange vertices according to the block-to-worker assignment. Each partitioner wi then dumps its vertices to an HDFS file, which is later loaded by the corresponding B-worker wi during block-centric computing. Each vertex v has an extra field, content(v), that keeps additional information such as edge weight and edge direction during data loading. It is used along with trip(v) and Γ̂(v) to format v's output line during data dumping.

5. PROGRAMMING INTERFACE

Similar to Pregel, writing a Blogel program involves subclassing a group of predefined classes, with the template arguments of each base class properly specified. Figure 4(a) shows the base classes of Blogel. We illustrate their usage by showing how to implement the Hash-Min algorithm described in Section 3.

(a) Base Classes:
  Vertex<IDType, ValueType, MsgType>
  Block<VertexType, BValueType, BMsgType>
  Combiner<MsgType>
  Aggregator<AValueType, PartialType, FinalType>
  VWorker<VertexType>
  BWorker<BlockType>
  Predefined partitioner classes
(b) Vertex Sets and Block Sets: [diagram of a BWorker's vertex_set (vertices 0-8) grouped into blocks of size 3, 4 and 2, whose vertex fields point to positions 0, 3 and 7 of the vertex_set]

Figure 4: Programming Interface of Blogel

5.1 Vertex-Centric Interface

In the vertex-centric model, each vertex v maintains an integer ID id(v), an integer field cc(v) and a list of neighbors Γ(v). Thus, in the Vertex class, we specify IDType as integer, ValueType as a user-defined type for holding both cc(v) and Γ(v), and MsgType as integer since Hash-Min sends messages in the form of vertex IDs. We then define a class CCVertex to inherit this class, and implement the compute() function using the Hash-Min algorithm in Section 3. The Vertex class also has another template argument (not shown in Figure 4) for specifying the vertex-to-worker assignment function over IDType, for which a hash function is specified by default.

To run the Hash-Min job, we inherit the VWorker<CCVertex> class, and implement two functions: (1) load_vertex(line), which specifies how to parse a line from the input file into a vertex object; (2) dump_vertex(vertex, HDFSWriter), which specifies how to dump a vertex object to the output file. VWorker's run() function is then called to start the job. BWorker and the partitioner classes also have these three functions, though the logic of run() is different.

5.2 Block-Centric Interface

For the block-centric model, in the Vertex class, we specify IDType as a triplet for holding trip(v), ValueType as a list for holding Γ̂(v), while MsgType can be any type since the algorithm works in B-mode and there is no vertex-level message exchange. We then define a class CCVertex that inherits this class with an empty V-compute() function. In the Block class, we specify VertexType as CCVertex, BValueType as a block-level adjacency list (i.e., Γ(B)), and BMsgType as integer since the algorithm sends messages in the form of block IDs. We then define a class CCBlock that inherits this Block class, and implement the B-compute() function using the logic of block-centric Hash-Min. Finally, we inherit the BWorker<CCBlock> class, implement the vertex loading/dumping functions, and call run() to start the job.

A Block object also has the following fields: (1) an integer block ID, (2) an array of the block's vertices, denoted by vertex, and (3) the number of vertices in the block, denoted by size.

A BWorker object also contains an array of blocks, denoted by block_set, and an array of vertices, denoted by vertex_set. As shown in Figure 4(b), the vertices in a B-worker's vertex_set are grouped by blocks; and for each block B in the B-worker's block_set, B's vertex field is actually a pointer to the first vertex of B's group in the B-worker's vertex_set. Thus, a block can access its vertices in B-compute() as vertex[0], vertex[1], . . ., vertex[size − 1].

A subclass of BWorker also needs to implement an additional function block_init(), which specifies how to set the user-specified field BValueType for each block. After a B-worker loads all its vertices into vertex_set, it scans vertex_set to construct its block_set, where the fields vertex and size of each block are automatically set. Then, before the block-centric computing begins, block_init() is called to set the user-specified block field. For our block-centric Hash-Min algorithm, in block_init(), each block B constructs Γ(B) from Γ(v) of all v ∈ V(B).

5.3 Global Interface

In Blogel, each worker is a computing process. We now look at a number of fields and functions that are global to each process. They can be accessed in both V-compute() and B-compute().

Message buffers and combiners. Each worker maintains two message buffers, one for exchanging vertex-level messages and the other for exchanging block-level messages. A combiner is associated with each message buffer, and it is not defined by default. To use a combiner, we inherit the Combiner class and implement the combine() function, which specifies the combiner logic. When a vertex or a block calls the send_msg(target, msg) function of the message buffer, combine() is called if the combiner is defined.
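To make the interface concrete, the following self-contained C++ sketch shows the Hash-Min compute() logic of Section 3 together with a min-combiner, in the spirit of the classes above. In the real program this logic would live in CCVertex::compute() and a Combiner subclass; here the framework pieces (message delivery, voting to halt) are reduced to plain data and return values so that the fragment stands alone, and every type and member name not quoted from the paper is our own assumption.

```cpp
// Sketch of the vertex-centric Hash-Min logic and its combiner.
#include <algorithm>
#include <climits>
#include <utility>
#include <vector>

struct CCValue {                  // ValueType: holds cc(v) and Γ(v)
    int cc;                       // min(v): smallest vertex ID seen so far
    std::vector<int> nbs;         // adjacency list Γ(v), as vertex IDs
};

struct Outbox { std::vector<std::pair<int, int> > msgs; };  // (target, msg)

// One call of "compute()" for vertex 'id' in the given superstep.
// Returns true if the vertex broadcast a new minimum (i.e., it did useful work
// before voting to halt), false if it only voted to halt.
bool hashmin_compute(int id, CCValue& val, int superstep,
                     const std::vector<int>& messages, Outbox& out) {
    int min_id;
    if (superstep == 1) {
        // Superstep 1: take the minimum over id(v) and all neighbor IDs.
        min_id = id;
        for (int u : val.nbs) min_id = std::min(min_id, u);
    } else {
        // Later supersteps: take the minimum over the received IDs.
        min_id = INT_MAX;
        for (int m : messages) min_id = std::min(min_id, m);
        if (min_id >= val.cc) return false;   // no improvement: just vote to halt
    }
    val.cc = min_id;
    for (int u : val.nbs)                      // broadcast the new minimum
        out.msgs.push_back(std::make_pair(u, min_id));
    return true;
}

// Combiner logic: of all messages headed to the same target vertex, only the
// smallest ID matters, so combine() keeps the minimum.
void min_combine(int& old_msg, int new_msg) {
    if (new_msg < old_msg) old_msg = new_msg;
}
```

The same combiner logic also serves the block-centric Hash-Min of Section 3, with block IDs in place of vertex IDs.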

Aggregator. An aggregator works as follows. In each superstep, a vertex/block's V-compute()/B-compute() function may call aggregate(value), where value is of type AValueType. After a worker calls V-compute()/B-compute() for all vertices/blocks, the aggregated object maintained by its local aggregator (of type PartialType) is sent to the master. When the master obtains the locally aggregated objects from all workers, it calls master_compute() to aggregate them into a global object of type FinalType. This global object is then broadcast to all the workers so that it is available to every vertex/block in the next superstep.

To define an aggregator, we subclass the Aggregator class to include a field recording the aggregated values, denoted by agg, and implement aggregate(value) and master_compute(). An object of this class is then assigned to the worker.

Blogel also supports other useful global functions such as graph mutation (user-defined functions to add/delete vertices/edges), and the terminate() function, which can be called to terminate the job immediately. In addition, Blogel maintains the following global fields which are useful for implementing the computing logic: (1) the ID of the current superstep, which also indicates the number of supersteps executed so far; (2) the total number of vertices among all workers at the beginning of the current superstep, whose value may change due to graph mutation; (3) the total number of active vertices among all workers at the beginning of the current superstep.

6. APPLICATIONS

We apply Blogel to solve four classic graph problems: Connected Components (CCs), Single-Source Shortest Path (SSSP), Reachability, and PageRank. In Sections 3 and 5, we have discussed how Blogel computes CCs with the Hash-Min logic in both the vertex-centric model and the B-mode of the block-centric model. We now discuss the solutions to the other three problems in Blogel.

6.1 Single-Source Shortest Path

Given a graph G = (V, E), where each edge (u, v) ∈ E has length ℓ(u, v), and a source s ∈ V, SSSP computes a shortest path from s to every other vertex v ∈ V, denoted by SP(s, v).

Vertex-centric algorithm. We first discuss the vertex-centric algorithm, which is similar to Pregel's SSSP algorithm [11]. Each vertex v has two fields: ⟨prev(v), dist(v)⟩ and Γout(v), where prev(v) is the vertex preceding v on SP(s, v) and dist(v) is the length of SP(s, v). Each u ∈ Γout(v) is associated with ℓ(v, u). Initially, only s is active with dist(s) = 0, while dist(v) = ∞ for any other vertex v. In superstep 1, s sends a message ⟨s, dist(s) + ℓ(s, u)⟩ to each u ∈ Γout(s), and votes to halt. In superstep i (i > 1), if a vertex v receives messages ⟨w, d(w)⟩ from any of v's in-neighbors w, then v finds the in-neighbor w* such that d(w*) is the smallest among all d(w) received. If d(w*) < dist(v), v updates ⟨prev(v), dist(v)⟩ = ⟨w*, d(w*)⟩, and sends a message ⟨v, dist(v) + ℓ(v, u)⟩ to each out-neighbor u ∈ Γout(v). Finally, v votes to halt.

Let hop(s, v) be the number of hops of SP(s, v), and L = max_{v∈V} hop(s, v). The vertex-centric algorithm runs for O(L) supersteps, and in each superstep, at most one message is sent along each edge. Thus, the total workload is O(L(|V| + |E|)).

Block-centric algorithm. Our block-centric solution operates in VB-mode. Each vertex maintains the same fields as in the vertex-centric algorithm, and blocks do not maintain any information. In each superstep, V-compute() is first executed for all vertices, where a vertex v finds w* from the incoming messages as in the vertex-centric algorithm. However, now v votes to halt only if d(w*) ≥ dist(v). Otherwise, v updates ⟨prev(v), dist(v)⟩ = ⟨w*, d(w*)⟩ but stays active. Then, B-compute() is executed, where each block B collects all its active vertices v into a priority queue Q (with dist(v) as the key), and makes these vertices vote to halt. B-compute() then runs Dijkstra's algorithm on B using Q, which removes the vertex v ∈ Q with the smallest value of dist(v) from Q for processing each time. The out-neighbors u ∈ Γ(v) are updated as follows. For each u ∈ V(B), if dist(v) + ℓ(v, u) < dist(u), we update ⟨prev(u), dist(u)⟩ to be ⟨v, dist(v) + ℓ(v, u)⟩, and insert u into Q with key dist(u) if u ∉ Q, or update dist(u) if u is already in Q. For each u ∉ V(B), a message ⟨v, dist(v) + ℓ(v, u)⟩ is sent to u. B votes to halt when Q becomes empty. In the next superstep, if a vertex u receives a message, u is activated along with its block, and the block-centric computation repeats.

Compared with the vertex-centric algorithm, this algorithm saves a significant amount of communication cost since there is no message passing among vertices within each block. In addition, messages propagate from s in the unit of blocks, and thus the algorithm requires far fewer than O(L) supersteps.

For both the vertex-centric and block-centric algorithms, we apply a combiner as follows. Given a set of messages from a worker, {⟨w1, d(w1)⟩, ⟨w2, d(w2)⟩, . . ., ⟨wk, d(wk)⟩}, to be sent to a vertex u, the combiner combines them into a single message ⟨w*, d(w*)⟩ such that d(w*) is the smallest among all d(wi) for 1 ≤ i ≤ k.

6.2 Reachability

Given a directed graph G = (V, E), a source vertex s and a destination vertex t, the problem is to decide whether there is a directed path from s to t in G. We can perform a bidirectional breadth-first search (BFS) from s and t, and check whether the two BFSs meet at some vertex. We assign each vertex v a 2-bit field tag(v), where the first bit indicates whether s can reach v and the second bit indicates whether v can reach t.

Vertex-centric algorithm. We first set tag(s) = 10, tag(t) = 01; and for any other vertex v, tag(v) = 00. Only s and t are active initially. In superstep 1, s sends its tag 10 to all v ∈ Γout(s), and t sends its tag 01 to all v ∈ Γin(t). They then vote to halt. In superstep i (i > 1), a vertex v computes the bitwise-OR of all messages it receives, which results in tag*. If tag* = 11, or if the bitwise-OR of tag* and tag(v) is 11, v sets tag(v) = 11 and calls terminate() to end the program, since the two BFSs now meet at v; otherwise, if tag(v) = 00, v sets tag(v) = tag* and either sends tag* to all u ∈ Γout(v) if tag* = 10, or sends tag* to all u ∈ Γin(v) if tag* = 01. Finally, v votes to halt.

Note that if we set t to be a non-existent vertex ID (e.g., −1), the algorithm becomes BFS from s. We now analyze the cost of doing BFS. Let Vh be the set of vertices that are h hops away from s. We can prove by induction that, in superstep i, only those vertices in Vi both receive messages (from the vertices in Vi−1) and send messages to their out-neighbors. If a vertex in Vj (j < i) receives a message, it simply votes to halt without sending messages, while all vertices in Vj (j > i) remain inactive as they are not reached yet. Thus, the total workload is O(|E| + |V|).

Block-centric algorithm. This algorithm operates in VB-mode. In each superstep, V-compute() is first called, where each vertex v receives messages and updates tag(v) as in the vertex-centric algorithm. If tag(v) is updated, v remains active; otherwise, v votes to halt. Then, B-compute() is called, where each block B collects all its active vertices with tag 10 (01) into a queue Qs (Qt). If an active vertex with tag 11 is found, B calls terminate(). Otherwise, B performs a forward BFS using Qs as follows. A vertex v is dequeued from Qs each time, and the out-neighbors u ∈ Γout(v) are updated as follows. For each out-neighbor u ∈ V(B), if tag(u) = 00, tag(u) is set to 10 and u is enqueued; if tag(u) = 01 or 11, tag(u) is set to 11 and terminate() is called. For each out-neighbor u ∉ V(B), a message 10 is sent to u. Then a backward BFS using Qt is performed in a similar way. Finally, B votes to halt. In the next superstep, if a vertex u receives a message, u is activated along with its block, and the block-centric computation repeats.
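The in-block part of this computation is ordinary sequential code. The self-contained C++ sketch below illustrates the forward-BFS step of B-compute() just described; the backward BFS over Qt is symmetric (it follows in-neighbors and propagates tag 01). The structs, the outbox and the met flag stand in for Blogel's vertex array, send_msg() and terminate(), and all names here are our own illustrative assumptions rather than the system's actual API.

```cpp
// Sketch of the forward BFS inside a block for the reachability algorithm.
// Tags are encoded as 2-bit integers: 00 = 0, 01 = 1, 10 = 2, 11 = 3.
#include <queue>
#include <utility>
#include <vector>

struct RVertex {
    int tag;                          // 00, 01, 10 or 11
    std::vector<int> out_local;       // out-neighbors inside V(B), as positions in the block
    std::vector<int> out_remote;      // out-neighbors in other blocks, as global vertex IDs
};

struct BlockState {
    std::vector<RVertex> vertex;                 // B's vertices
    std::vector<std::pair<int, int> > outbox;    // (global ID, tag) messages to other blocks
    bool met;                                    // true once the two BFSs meet (terminate())
};

void forward_bfs(BlockState& B, std::queue<int>& Qs) {
    while (!Qs.empty() && !B.met) {
        int v = Qs.front(); Qs.pop();
        for (int u : B.vertex[v].out_local) {    // u ∈ V(B): update tag(u) locally
            if (B.vertex[u].tag == 0) {          // 00 -> 10, and continue the BFS in-block
                B.vertex[u].tag = 2;
                Qs.push(u);
            } else if (B.vertex[u].tag == 1 || B.vertex[u].tag == 3) {
                B.vertex[u].tag = 3;             // forward and backward BFS meet at u
                B.met = true;                    // the real code would call terminate() here
            }
        }
        for (int u : B.vertex[v].out_remote)     // u ∉ V(B): send tag 10 to u's block
            B.outbox.push_back(std::make_pair(u, 2));
    }
}
```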

6.3 PageRank

Given a directed graph G = (V, E), the problem is to compute the PageRank of each vertex v ∈ V. Let pr(v) be the PageRank of v. Pregel's PageRank algorithm [11] works as follows. In superstep 1, each vertex v initializes pr(v) = 1/|V| and distributes pr(v) to its out-neighbors by sending each one pr(v)/|Γout(v)|. In superstep i (i > 1), each vertex v sums up the received PageRank values, denoted by sum, and computes pr(v) = 0.15/|V| + 0.85 × sum. It then distributes pr(v)/|Γout(v)| to each out-neighbor. A combiner is used, which aggregates the messages to be sent to the same destination vertex into a single message that equals their sum.

PageRank loss. Conceptually, the total amount of PageRank remains 1, with 15% held evenly by the vertices, and 85% redistributed among the vertices by propagating along the edges. However, if there exists a sink page v (i.e., Γout(v) = ∅), pr(v) is not distributed to any other vertex in the next superstep and the value simply gets lost. Therefore, in [11]'s PageRank algorithm, the total amount of PageRank decreases as the number of supersteps increases.

Let V0 = {v ∈ V : Γout(v) = ∅} be the set of sink vertices, i.e., vertices with out-degree 0. We ran [11]'s PageRank algorithm on two web graphs (listed in Figure 7) and computed the PageRank loss: (1) WebUK, which has 9.08% of its vertices being sink vertices, and (2) WebBase, which has 23.41% of its vertices being sink vertices. We found that the algorithm converges in 88 supersteps with 17% PageRank loss on WebUK, and in 79 supersteps with 34% PageRank loss on WebBase. As we shall see shortly, such a large PageRank loss reveals a problem that must be addressed.

Vertex-centric algorithm. A common solution to the PageRank loss problem is to make each sink vertex v ∈ V0 link to all the vertices in the graph¹, i.e., to distribute pr(v)/|V| to each vertex. Intuitively, this models the behavior of a random surfer: if the surfer arrives at a sink page, it picks another URL at random and continues surfing. Since |V| is usually large, pr(v)/|V| is small and the impact of v on the PageRank of other vertices is negligible.

Compared with the standard PageRank definition above, Pregel's PageRank algorithm [11] changes the relative importance order of the vertices. To illustrate, we compute PageRank on the four web graphs from the SNAP database² using the algorithm in [11] and the standard PageRank definition, and compare the ordered vertex lists obtained. The results in Figure 5 show that the two vertex lists have a large Kendall tau distance. For example, the graph Google has over 2 million vertex pairs whose order of PageRank magnitude is reversed from the standard definition.

            |V|      |E|        Kendall Tau Distance
BerkStan    685,230  7,600,595  834,804,094
Google      875,713  5,105,039  2,185,214,827
NotreDame   325,729  1,497,134  1,008,151,095
Stanford    281,903  2,312,497  486,171,631

Figure 5: Impact of PageRank Loss

Obviously, materializing Γout(v) = V for each v ∈ V0 is unacceptable both in space and in communication cost. We propose an aggregator-based solution. In compute(), if v's out-degree is 0, v provides pr(v) to an aggregator that computes agg = Σ_{v∈V0} pr(v). The PageRank of v is now updated by pr(v) = 0.15/|V| + 0.85 × (sum + agg/|V|), where sum is compensated with agg/|V| = Σ_{v∈V0} pr(v)/|V|.

Let pr_i(v) be the PageRank of v in superstep i. Then, PageRank computation stops if |pr_i(v) − pr_{i−1}(v)| < ε/|V| for all v ∈ V (note that the average PageRank is 1/|V|). We set ε to 0.01 throughout this paper.

We also implement this stop condition using an aggregator that performs logical AND. In compute(), each vertex v provides 'true' to the aggregator if |pr_i(v) − pr_{i−1}(v)| < ε/|V|, and 'false' otherwise. Moreover, if v finds that the aggregated value of the previous superstep is 'true', it votes to halt directly without doing PageRank computation.

Block-centric algorithm. In a web graph, each vertex is a web page with a URL (e.g., cs.stanford.edu/admissions), and we can naturally group all vertices with the same host name (e.g., cs.stanford.edu) into a block. Kamvar et al. [6] proposed to initialize the PageRank values by exploiting this block structure, so that PageRank computation can converge faster. Note that though a different initialization is used, the PageRank values still converge to a unique stable state according to the Perron-Frobenius theorem.

The implementation of the algorithm of [6] in the Blogel framework consists of two jobs. The first job operates in B-mode. Before computation, in block_init(), each block B first computes the local PageRank of each v ∈ V(B), denoted by lpr(v), by a single-machine PageRank algorithm with B as input. Block B then constructs Γ(B) from Γ(v) of all v ∈ V(B) using [6]'s approach, which assigns a weight to each out-edge. Finally, B-compute() computes the BlockRank of each block B ∈ ℬ, denoted by br(B), on 𝒢 = (ℬ, ℰ) using a logic similar to the vertex-centric PageRank, except that BlockRank is distributed to out-neighbors proportionally to the edge weights. The second job operates in V-mode, which initializes pr(v) = lpr(v) · br(block(v)) [6], and performs the standard PageRank computation on G.

¹ http://www.google.com/patents/US20080075014
² http://snap.stanford.edu/data/

7. PARTITIONERS

Efficient computation of blocks that give balanced workload is crucial to the performance of Blogel. We have discussed the logic of partitioners, such as the computation of the block-to-worker assignment, in Section 4. We have also seen a URL-based partitioner for web graphs in Section 6.3, where the vertex-to-block assignment is determined by the host names extracted from URLs. In this section, we introduce two other types of partitioners, with an emphasis on computing the vertex-to-block assignment.

[Figure 6: Partitioners (Best Viewed in Colors); (a) Graph Voronoi Diagram, (b) 2D Partitioner]

7.1 Graph Voronoi Diagram Partitioner

We first review the Graph Voronoi Diagram (GVD) [3] of an undirected unweighted graph G = (V, E). Given a set of source vertices s1, s2, . . . , sk ∈ V, we define a partition of V: {VC(s1), VC(s2), . . ., VC(sk)}, where a vertex v is in VC(si) only if si is closer to v (in terms of the number of hops) than any other source.
Ties are broken arbitrarily. The set V C(si ) is called the Voronoi vertices. We call this step as subgraph Hash-Min.
cell of si , and the Voronoi cells of all sources form the GVD of G. There are six parameters, (psamp , δmax , bmax , f , γ, pmax ),
Figure 6(a) illustrates the concept of GVD, where source vertices that need to be specified for the partitioner. We show that these
are marked with solid circles. Consider vertex v in Figure 6(a), it is parameters are intuitively easy to set. In particular, we found that
at least 2, 3 and 5 hops from the red, green and blue sources. Since the following setting of the parameters work well for most large
the red source is closer to v, v is assigned to the Voronoi cell of the real-world graphs. The sampling rate psamp decides the number of
red source. All the vertices in Figure 6(a) are partitioned into three blocks, and usually a small value as 0.1% is a good choice. Note
Voronoi cells, where the vertices with the same color belong to the that psamp cannot be too small in order not to create very large
same Voronoi cell. blocks. As for the stopping parameters, γ is usually set as 90%,
The GVD computation can be easily implemented in the vertex- and pmax as 10% with f = 2, so that there will not be too many
centric computing model, by performing multi-source BFS. Specif- rounds of multi-source BFS. The bound on the number of super-
ically, in superstep 1, each source s sets block(s) = s and broad- steps, δmax , is set to be a tolerable number such as 50, but for
casts it to the neighbors; for each non-source vertex v, block(v) is small-diameter graphs (e.g., social networks) we can set a smaller
unassigned. Finally, the vertex votes to halt. In superstep i (i > 1), δmax such as 10 since the number of supersteps needed for such
if block(v) is unassigned, v sets block(v) to an arbitrary source graphs is small. We set bmax to be 100 times that of the expected
received, and broadcasts block(v) to its neighbors before voting to block size (e.g., bmax = 100, 000 when psamp = 0.1%) for most
halt. Otherwise, v votes to halt directly. When the process converges, we have block(v) = si for each v ∈ VC(si).

The multi-source BFS has linear workload, since each vertex broadcasts a message to its neighbors only when block(v) is assigned, and thus the total number of messages exchanged by all vertices is bounded by O(|E|). However, we may obtain some huge Voronoi cells (or blocks), which are undesirable for load balancing. We remark that block sizes can be aggregated at the master in a similar manner as when we compute the block-to-worker assignment in Section 4.

Our GVD partitioner works as follows, where we use a parameter bmax to limit the maximum block size. Initially, each vertex v samples itself as a source with probability psamp. Then, multi-source BFS is performed to partition the vertices into blocks. If the size of a block is larger than bmax, we set block(v) unassigned for every vertex v in the block (and reactivate v). We then perform another round of source sampling and multi-source BFS on those vertices v with block(v) unassigned, using a higher sampling probability. Here, we increase psamp by a factor of f (f ≥ 1) after each round in order to decrease the chance of obtaining an over-sized block. This process is repeated until a stop condition is met.

We check two stop conditions, and the process described above stops as soon as either condition is met: (1) let A(i) be the number of active vertices at the beginning of the i-th round of multi-source BFS; the process stops if A(i)/A(i−1) > γ, where γ ≤ 1 is a parameter determining when multi-source BFS is no longer effective in assigning blocks; or (2) the process stops if psamp > pmax, where pmax is the maximum allowed sampling rate (recall that psamp is multiplied by f after each round).

Moreover, to prevent multi-source BFS from running for too many supersteps, which may happen if the graph diameter δ is large and the sampling rate psamp is small, we include a user-specified parameter δmax to bound the maximum number of supersteps, i.e., each vertex votes to halt in superstep δmax during multi-source BFS.

When the above process terminates, there may still be some vertices not assigned to any block. This can happen because multi-source BFS works well on large CCs, as a larger CC tends to contain more sampled sources, but the sampling is ineffective for handling small CCs. For example, consider the extreme case where the graph is composed of isolated vertices. Since each round of multi-source BFS assigns block IDs to only around psamp·|V| vertices, it takes around 1/psamp rounds (i.e., 1000 rounds if psamp = 0.1%) to assign block IDs to all vertices, which is inefficient.

Our solution to assigning blocks for small CCs is the Hash-Min algorithm, which marks each small CC as a block using only a small number of supersteps. Specifically, after the rounds of multi-source BFS terminate, if there still exists any unassigned vertex, we run Hash-Min on the subgraph induced by these unassigned vertices.

The default parameter settings are the same for all types of graphs, except that for spatial networks the block size is already limited by δmax and hence we simply set bmax to ∞. We will further demonstrate in our experiments that the above settings work effectively for different types of real-world graphs.
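To make the control flow above concrete, the following is a minimal single-process sketch of the sampling/BFS loop with the two stop conditions and the over-sized-block check. It only illustrates the logic described here and is not Blogel's distributed implementation: the names, parameter values and toy graph are hypothetical, the BFS runs sequentially instead of as supersteps over workers, and the final pass simply labels leftover components by a plain BFS where Blogel would run Hash-Min.

```cpp
#include <algorithm>
#include <cstdio>
#include <queue>
#include <random>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // adjacency lists

// One round of multi-source BFS: every unassigned vertex becomes a source
// with probability pSamp; sources then claim unassigned vertices level by
// level, for at most deltaMax levels (mimicking the superstep bound).
void multiSourceBFS(const Graph& g, std::vector<int>& block, double pSamp,
                    int deltaMax, std::mt19937& rng) {
    std::bernoulli_distribution sample(pSamp);
    std::queue<int> frontier;
    for (int v = 0; v < (int)g.size(); ++v)
        if (block[v] == -1 && sample(rng)) { block[v] = v; frontier.push(v); }
    for (int step = 0; step < deltaMax && !frontier.empty(); ++step) {
        std::queue<int> next;
        for (; !frontier.empty(); frontier.pop()) {
            int u = frontier.front();
            for (int w : g[u])
                if (block[w] == -1) { block[w] = block[u]; next.push(w); }
        }
        frontier.swap(next);
    }
}

int main() {
    // Toy graph: two triangles joined by an edge, plus an isolated vertex.
    Graph g = {{1, 2}, {0, 2}, {0, 1, 3}, {2, 4, 5}, {3, 5}, {3, 4}, {}};
    std::vector<int> block(g.size(), -1);        // -1 means "unassigned"
    double pSamp = 0.3, f = 2.0, pMax = 0.9, gamma = 0.9;  // toy values
    int bMax = 4, deltaMax = 10;
    std::mt19937 rng(7);

    long long prevActive = -1;
    while (true) {
        long long active = std::count(block.begin(), block.end(), -1);
        if (active == 0) break;
        // Stop condition (1): the previous round barely shrank the active set.
        if (prevActive > 0 && (double)active / prevActive > gamma) break;
        // Stop condition (2): the sampling rate exceeded the allowed maximum.
        if (pSamp > pMax) break;
        prevActive = active;
        multiSourceBFS(g, block, pSamp, deltaMax, rng);
        // Dissolve over-sized blocks so their vertices are re-sampled later.
        std::vector<int> size(g.size(), 0);
        for (int b : block) if (b >= 0) ++size[b];
        for (int v = 0; v < (int)g.size(); ++v)
            if (block[v] >= 0 && size[block[v]] > bMax) block[v] = -1;
        pSamp *= f;  // raise the sampling rate for the next round
    }
    // Leftover small CCs: Blogel runs Hash-Min here; this sketch just labels
    // each remaining component by a plain BFS from an arbitrary vertex.
    for (int v = 0; v < (int)g.size(); ++v) {
        if (block[v] != -1) continue;
        block[v] = v;
        std::queue<int> q;
        q.push(v);
        while (!q.empty()) {
            int u = q.front(); q.pop();
            for (int w : g[u])
                if (block[w] == -1) { block[w] = v; q.push(w); }
        }
    }
    for (int v = 0; v < (int)g.size(); ++v)
        std::printf("block(%d) = %d\n", v, block[v]);
    return 0;
}
```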
7.2 2D Partitioner

In many spatial networks, vertices are associated with (x, y)-coordinates. Blogel provides a 2D partitioner to partition such graphs. A 2D partitioner associates each vertex v with an additional field (x, y), and it consists of two jobs.

The first job is vertex-centric and works as follows: (1) each worker samples a subset of its vertices with probability psamp and sends the sample to the master; (2) the master first partitions the sampled vertices into nx slots by their x-coordinates, and then each slot is further partitioned into ny slots by the y-coordinates. Each resulting slot is a super-block. Figure 6(b) shows a 2D partitioning with nx = ny = 3. This partitioning is broadcast to the workers, and each worker assigns each of its vertices to a super-block according to the vertex coordinates. Finally, vertices are exchanged according to the super-block-to-worker assignment computed by the master, with Γ̂(v) constructed for each vertex v as described in Section 4.

Since a super-block may not be connected, we perform a second, block-centric job, where workers run BFS over their super-blocks to break them into connected blocks. Each worker marks the blocks it obtains with IDs 0, 1, .... To obtain globally unique block IDs, each worker wi sends the number of blocks it has, denoted by |Bi|, to the master, which computes for each worker wj the prefix sum sumj = Σ_{i<j} |Bi| and sends sumj to wj. Each worker wj then adds sumj to the block IDs of its blocks, and hence each block obtains a unique block ID. Finally, Γ̂(v) is updated for each vertex v.

The 2D partitioner has parameters (psamp, nx, ny). The setting of psamp is similar to that of the GVD partitioner. To set nx and ny, we use δ(G) ≈ O(√(max{nx, ny})) as a guideline, since the diameter of G is critical to the performance of block-centric algorithms.
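The two steps that are easiest to get wrong here, mapping a vertex's coordinates to its super-block and turning per-worker block IDs into globally unique ones via prefix sums, can be illustrated with the following self-contained sketch. It is only a sketch under assumed inputs (the slot boundaries, block counts and function names are hypothetical), not Blogel's actual code; in Blogel the |Bi| values travel through messages between workers and the master rather than living in one address space.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Map a coordinate to a slot index given sorted slot boundaries
// (boundaries[k] is the upper end of slot k; the last slot is open-ended).
static int slotOf(double value, const std::vector<double>& boundaries) {
    return (int)(std::lower_bound(boundaries.begin(), boundaries.end(), value)
                 - boundaries.begin());
}

int main() {
    // Hypothetical slot boundaries computed by the master from a sample,
    // giving nx = 3 columns and ny = 3 rows of super-blocks.
    std::vector<double> xCut = {10.0, 20.0}, yCut = {5.0, 15.0};
    int ny = (int)yCut.size() + 1;

    // Super-block of a vertex = (x slot, y slot) flattened to one index.
    double vx = 13.7, vy = 2.1;  // coordinates of some vertex
    int superBlock = slotOf(vx, xCut) * ny + slotOf(vy, yCut);
    std::printf("super-block = %d\n", superBlock);

    // Second job: each worker wi reports its local block count |Bi|;
    // the prefix sums let worker wj add sumj to its local block IDs to
    // obtain globally unique block IDs.
    std::vector<int> blockCount = {4, 2, 5};   // |B0|, |B1|, |B2|
    std::vector<int> prefix(blockCount.size(), 0);
    for (size_t j = 1; j < blockCount.size(); ++j)
        prefix[j] = prefix[j - 1] + blockCount[j - 1];
    for (size_t j = 0; j < blockCount.size(); ++j)
        std::printf("worker %zu: local block b gets global ID b + %d\n",
                    j, prefix[j]);
    return 0;
}
```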
8. EXPERIMENTS

We compare the performance of Blogel with Giraph 1.0.0, Giraph++^3 and GraphLab 2.2 (which includes the features of PowerGraph [4]). We ran our experiments on a cluster of 16 machines, each with 24 processors (two Intel Xeon E5-2620 CPUs) and 48GB RAM. One machine is used as the master and runs only one worker, while the other 15 machines act as slaves running multiple workers. The connectivity between any pair of nodes in the cluster is 1Gbps.

^3 https://issues.apache.org/jira/browse/GIRAPH-818
Domain | Data | Type | |V| | |E| | AVG Deg | Max Deg
Web Graphs | WebUK | directed | 133,633,040 | 5,507,679,822 | 41.21 | 22,429
Web Graphs | WebBase | directed | 118,142,155 | 1,019,903,190 | 8.63 | 3,841
Social Networks | Friendster | undirected | 65,608,366 | 3,612,134,270 | 55.06 | 5,214
Social Networks | LiveJournal | directed | 10,690,276 | 224,614,770 | 21.01 | 1,053,676
RDF | BTC | undirected | 164,732,473 | 772,822,094 | 4.69 | 1,637,619
Spatial Networks | USA Road | undirected | 23,947,347 | 58,333,344 | 2.44 | 9
Spatial Networks | Euro Road | undirected | 18,029,721 | 44,826,904 | 2.49 | 12

Figure 7: Datasets

For the experiments of Giraph, we use the multi-threading feature added by Facebook, and thus each worker refers to a computing thread. However, Giraph++ is built on top of an earlier Giraph version by Yahoo!, which does not support multi-threading, and we therefore run multiple workers (mapper tasks) per machine. We make all the Giraph++ code used in our experiments public^4.
We used seven large real-world datasets from four different domains, as shown in Figure 7: (1) web graphs: WebUK^5 and WebBase^6; (2) social networks: Friendster^7 and LiveJournal^8; (3) RDF graph: BTC^9 (a graph converted from the Billion Triple Challenge 2009 RDF dataset [2]); and (4) road networks: USA and Euro^10. Among them, WebUK, LiveJournal and BTC have skewed degree distribution; WebUK, Friendster and LiveJournal have average degree relatively higher than other large real-world graphs; USA and Euro, as well as WebUK, have a large diameter.

For a graph of a certain size, we need a certain amount of computing resources (i.e., workers) to achieve good performance. However, the performance does not further improve if we increase the number of workers per slave machine beyond that amount, since the increased overhead of inter-machine communication outweighs the increased computing power. In the experiments, we run 10 workers per slave machine for WebUK and WebBase, 8 for Friendster, 2 for LiveJournal, and 4 for BTC, USA and Euro; these settings exhibit good performance.
8.1 Blogel Implementation

We make Blogel open-source. All the system source code, as well as the source code of the applications discussed in this paper, can be found at http://www.cse.cuhk.edu.hk/blogel.

Blogel is implemented in C++ as a group of header files, and users only need to include the necessary base classes and implement the application logic in their subclasses. Blogel communicates with HDFS through libhdfs, a JNI-based C API for HDFS. Each worker is simply an MPI process, and communications are implemented using MPI communication primitives. While one may deploy Blogel with any Hadoop and MPI version, we use Hadoop 1.2.1 and MPICH 3.0.4 in our experiments. All programs are compiled using GCC 4.4.7 with the -O2 option enabled.
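As an illustration of this "include a base class, override the compute logic" pattern, the fragment below sketches what an application subclass typically looks like in a Pregel-style C++ framework. The class and method names (Vertex, compute, the inbox/outbox driver) are hypothetical placeholders chosen for readability; they are not Blogel's actual API, and the message passing is simulated in a single process so that the sketch stays self-contained. The example application is Hash-Min connected components, the algorithm used repeatedly in this paper.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// A toy, single-process stand-in for a Pregel-style framework: applications
// subclass Vertex and override compute(). These names are hypothetical
// placeholders and are NOT Blogel's actual API.
struct Vertex {
    int id = 0;
    int value = 0;
    std::vector<int> neighbors;
    virtual ~Vertex() = default;
    // Returns true if the vertex wants to send its value to its neighbors.
    virtual bool compute(int superstep, const std::vector<int>& msgs) = 0;
};

// Application logic: Hash-Min connected components -- each vertex keeps the
// smallest vertex ID it has seen and propagates improvements.
struct HashMinVertex : Vertex {
    bool compute(int superstep, const std::vector<int>& msgs) override {
        int best = value;
        for (int m : msgs) best = std::min(best, m);
        bool improved = best < value;
        value = best;
        return superstep == 0 || improved;   // broadcast on start or improvement
    }
};

int main() {
    // Toy graph: a path 0-1-2 and a separate pair 3-4.
    std::vector<std::vector<int>> adj = {{1}, {0, 2}, {1}, {4}, {3}};
    std::vector<HashMinVertex> verts(adj.size());
    for (int i = 0; i < (int)adj.size(); ++i) {
        verts[i].id = i;
        verts[i].value = i;
        verts[i].neighbors = adj[i];
    }
    std::vector<std::vector<int>> inbox(adj.size()), outbox(adj.size());
    for (int superstep = 0;; ++superstep) {
        bool anyMessage = false;
        for (auto& v : verts) {
            if (superstep > 0 && inbox[v.id].empty()) continue;  // halted
            if (v.compute(superstep, inbox[v.id])) {
                for (int nb : v.neighbors) {
                    outbox[nb].push_back(v.value);
                    anyMessage = true;
                }
            }
        }
        if (!anyMessage) break;              // all vertices halted, no messages
        inbox.swap(outbox);
        for (auto& box : outbox) box.clear();
    }
    for (auto& v : verts)
        std::printf("vertex %d -> component %d\n", v.id, v.value);
    return 0;
}
```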
Blogel makes the master a worker, and fault recovery can be implemented by a script as follows. The script runs a Blogel job in a loop, where each job runs for at most ∆ supersteps before dumping the intermediate results to HDFS. Meanwhile, the script also monitors the cluster condition. If a machine is down, the script kills the current job and restarts another job that loads the latest intermediate results from HDFS. Fault tolerance is achieved by the data replication in HDFS.

8.2 Performance of Partitioners

We first report the performance of Blogel's partitioners.

^4 https://issues.apache.org/jira/browse/GIRAPH-902
^5 http://law.di.unimi.it/webdata/uk-union-2006-06-2007-05
^6 http://law.di.unimi.it/webdata/webbase-2001
^7 http://snap.stanford.edu/data/com-Friendster.html
^8 http://konect.uni-koblenz.de/networks/livejournal-groupmemberships
^9 http://km.aifb.kit.edu/projects/btc-2009/
^10 http://www.dis.uniroma1.it/challenge9/download.shtml

Performance of GVD partitioners. Figure 8 shows the performance of the GVD partitioners, along with the parameters we used. The web graphs can be partitioned simply based on URLs, but we also apply GVD partitioners to them for the purpose of comparison.

The partitioning parameters are set according to the heuristics given in Section 7. However, if there is a giant block in the result, which is usually produced in the phase when Hash-Min is run, we check why the multi-source BFS phase terminated. If it terminated because the sampling rate increased beyond pmax, we decrease f and increase pmax and γ to allow more rounds of multi-source BFS to be run, in order to break the giant block into smaller ones. We also increase bmax to relax the maximum block size constraint during multi-source BFS, which may generate some relatively larger blocks (still bounded by bmax) but reduces the chance of producing a giant block (not bounded by bmax). We slightly adjusted the parameters for WebUK, WebBase and BTC in this way, while all other datasets work well with the default setting described in Section 7.

Recall from Section 4 that besides block-to-worker assignment, partitioners also compute the vertex-to-block assignment (denoted by "Block Assign." in Figure 8), construct Γ̂(v) (denoted by "Triplet ID Neighbors"), and exchange vertices according to the block-to-worker assignment (denoted by "Vertex Exchange"). We report the computing time for "Block Assign.", "Triplet ID Neighbors" and "Vertex Exchange" in Figure 8. We also report the data loading/dumping time and the total computation time of the GVD partitioners. As for the vertex-to-block assignment, (1) for multi-source BFS, we report the number of rounds, the total number of supersteps taken by all these rounds, and the average time of a superstep; (2) for running Hash-Min on the subgraphs, we report the number of supersteps and the average time of a superstep.

As Figure 8 shows, the partitioning is very efficient, as the computing time is comparable to the graph loading/dumping time. In fact, within the overall computing time, a significant amount of time is spent on constructing Γ̂(v) and on vertex exchange, while the vertex-to-block assignment computation is highly efficient.

Figure 9 shows the number of blocks and vertices assigned to each worker. For example, WebUK is partitioned over workers 0 to 150, and each worker contains x blocks and y vertices, where 16,781 ≤ x ≤ 16,791 and 884,987 ≤ y ≤ 884,988. Thus, we can see that the GVD partitioners achieve a very balanced block-to-worker assignment, which also shows that the greedy algorithm described in Section 4 is effective. The workload is relatively less balanced only for BTC, where we have 13 larger blocks distributed over workers 0 to 12. This is mainly caused by the few vertices in BTC with very high degree (the maximum degree is 1.6M). We remark that probably no general-purpose partitioning algorithm is effective on BTC, if a balanced partitioning exists for BTC at all.

Performance of 2D partitioners. Since the vertices in USA and Euro have 2D coordinates, we also run 2D partitioners on them, with (psamp, nx, ny) = (1%, 20, 20). Figure 10(a) shows the quality of 2D partitioning, reporting the number of super-blocks, blocks and vertices in each worker. The number of blocks/vertices per worker obtained by 2D partitioning is not as even as that obtained by GVD partitioning. However, the shapes of the super-blocks are very regular, resulting in fewer edges crossing super-blocks and a smaller diameter for the block-level graph, and thus block-centric algorithms can run faster.

Figures 10(b) and 10(c) further show that 2D partitioning is also more efficient than GVD partitioning. For example, for USA, job 1 of 2D partitioning takes 26.92 seconds (58% spent on loading and dumping) and job 2 takes 13.29 seconds (93.5% spent on loading and dumping), which is still much shorter than the time taken by the GVD partitioner (9.06 + 38.56 + 7.77 = 55.39 seconds).
Data | psamp | δmax | bmax | f | pmax | γ | Load | MS-BFS Rounds | MS-BFS Steps | MS-BFS Avg/Step | Hash-Min Steps | Hash-Min Avg/Step | Block Assign. | Triplet ID Neighbors | Vertex Exchange | Compute Total | Dump
WebUK | 0.1% | 30 | 500,000 | 1.6 | 100% | 20% | 135.28 s | 12 | 291 | 0.71 s | 8 | 0.22 s | 8.31 s | 111.00 s | 200.62 s | 714.68 s | 403.63 s
WebBase | 0.1% | 30 | 500,000 | 2 | 100% | 10% | 36.20 s | 6 | 180 | 0.50 s | 70 | 0.16 s | 9.24 s | 40.49 s | 43.76 s | 248.31 s | 50.43 s
Friendster | 0.1% | 10 | 100,000 | 2 | 90% | 10% | 71.89 s | 2 | 13 | 4.14 s | 12 | 0.14 s | 2.38 s | 55.09 s | 70.45 s | 204.36 s | 223.93 s
LiveJournal | 0.1% | 10 | 100,000 | 2 | 90% | 10% | 35.08 s | 7 | 64 | 0.23 s | 9 | 0.20 s | 0.87 s | 6.14 s | 8.56 s | 36.19 s | 13.71 s
BTC | 0.1% | 20 | 500,000 | 2 | 95% | 10% | 29.08 s | 3 | 60 | 0.60 s | 31 | 0.80 s | 2.47 s | 12.52 s | 25.92 s | 112.53 s | 39.54 s
USA Road | 0.1% | 50 | ∞ | 2 | 90% | 10% | 9.06 s | 4 | 166 | 0.07 s | 19 | 0.04 s | 3.08 s | 3.35 s | 6.23 s | 38.56 s | 7.77 s
Euro Road | 0.1% | 50 | ∞ | 2 | 90% | 10% | 7.17 s | 4 | 171 | 0.07 s | 17 | 0.04 s | 1.90 s | 2.78 s | 4.88 s | 30.96 s | 5.84 s

Figure 8: Performance of Graph Voronoi Diagram Partitioners (parameters; loading time; multi-source BFS rounds, total supersteps, and average time per superstep; Hash-Min supersteps and average time per superstep; block assignment, triplet ID neighbor construction, and vertex exchange time; total computation time; dumping time)
(a) Per-Worker Statistics
Data | Super-Block # | Block # | Vertex #
USA | 6–7 | 247–606 | 374,440–409,255
Euro | 6–7 | 466–872 | 283,695–304,775

(b) Job 1 Performance
Data | Load | Partitioning Slots | Block Assign. | Triplet ID Neighbors | Vertex Exchange | Compute Total | Dump
USA | 7.49 s | 1.75 s | 0.03 s | 3.70 s | 5.82 s | 11.31 s | 8.12 s
Euro | 6.02 s | 1.43 s | 0.03 s | 3.21 s | 4.73 s | 9.40 s | 5.69 s

(c) Job 2 Performance
Data | Load | Partitioning Blocks | Triplet ID Neighbors | Compute Total | Dump
USA | 1.37 s | 0.50 s | 0.37 s | 0.87 s | 11.05 s
Euro | 1.08 s | 0.30 s | 0.38 s | 0.69 s | 5.98 s

Figure 10: Performance of 2D Partitioners
Data | Worker | Block # | Vertex #
WebUK | 0–150 | 16,781–16,791 | 884,987–884,988
WebBase | 0–150 | 19,329–19,341 | 782,398–782,399
Friendster | 0–120 | 614–618 | 542,217–542,218
LiveJournal | 0 | 5,066 | 344,848
LiveJournal | 1–30 | 5,181–5,182 | 344,847–344,848
BTC | 0 | 1 | 2,673,201
BTC | 1–12 | 1 | 1,337,773–1,588,121
BTC | 13–120 | 5,902–5,928 | 1,337,679–1,337,680
USA Road | 0–60 | 4,762–4,763 | 392,579–392,580
Euro Road | 0–60 | 3,564–3,569 | 295,569–295,570

Figure 9: # of Blocks/Vertices Per Worker (GVD Partitioner)

Phase | WebBase | LiveJournal | BTC | USA | Euro
Coarsen | 3547 s | 1234 s | 4463 s | 228 s | 184 s
Partition | 395 s | 836 s | 2360 s | 8 s | 5 s
Uncoarsen | 1414 s | 205 s | 1064 s | 171 s | 156 s
IdRecode | 94 s | 24 s | 107 s | 25 s | 24 s
Total | 5450 s | 2299 s | 7994 s | 432 s | 369 s

Figure 11: Partitioning Performance of Giraph++

 | WebUK | WebBase | Friendster | LiveJournal | BTC | USA | Euro
Runtime | 4863 s | 1373 s | 3547 s | 205 s | 1394 s | 127 s | 92 s

Figure 12: Partitioning Performance of LDG
Comparison with existing partitioning methods. One of the most widely used graph partitioning algorithms is METIS [7] (e.g., GRACE [20] uses METIS to partition the input graph). However, METIS ran out of memory on the large graphs of Figure 7 on our platform. To address the scalability problem of METIS, Giraph++ [18] proposed a graph coarsening method that reduces the size of the input graph so that METIS can run on the smaller graph. Here, a vertex in the coarsened graph corresponds to a set of connected vertices in the original graph. The partitioning algorithm of Giraph++ consists of four phases: (1) graph coarsening; (2) graph partitioning (using METIS); (3) graph uncoarsening, which projects the block information back onto the original graph; and (4) ID recoding, which relabels the vertex IDs so that the worker of a vertex v can be obtained by hashing v's new ID. Note that ID recoding is not required in Blogel, since the worker ID of each vertex v is stored in its triplet ID trip(v). This approach retains the original vertex IDs, so that Blogel's graph computing results require no ID re-projection. Blogel's graph partitioning is also more user-friendly, since it requires only one partitioner job, while Giraph++'s graph partitioning consists of a sequence of over 10 Giraph/MapReduce/METIS jobs.

Figure 11 shows the partitioning performance of Giraph++, where we ran as many workers per machine as possible (without running out of memory). We did not obtain results for WebUK and Friendster, since graph coarsening ran out of memory even when each slave machine runs only one worker. Figure 11 shows that the partitioning time of Giraph++ is much longer than that of our GVD partitioner. For example, while our GVD partitioner partitions WebBase in 334.94 seconds (see the breakdown in Figure 8), Giraph++ uses 5450 seconds. In general, our GVD partitioner is tens of times faster than Giraph++'s METIS partitioning algorithm.

Recently, Stanton and Kliot [17] proposed a group of algorithms to partition large graphs, the best one being a semi-streaming algorithm called Linear (Weighted) Deterministic Greedy (LDG). We also ran LDG, and the results are presented in Figure 12. We can see that LDG is many times slower than our GVD partitioner (reported in Figure 8), though LDG is much faster than Giraph++'s new METIS partitioning algorithm.

8.3 Partitioner Scalability

We now study the scalability of our GVD partitioner. We first test the partitioning scalability using two real graphs: BTC, with skewed degree distribution, and USA, with a large graph diameter. We partition both graphs using a varying number of slave machines (i.e., 6, 9, 12 and 15), and study how the partitioning performance scales with the amount of computing resources. We report the results in Figure 13. For the larger graph BTC, we need a certain amount of computing resources to achieve good performance. For example, when the number of slave machines increases from 6 to 9, the partitioning time of BTC improves by 25.7%, from 189.2 seconds to 140.52 seconds. However, the performance does not further improve if we increase the number of machines beyond 12. This is because the increased overhead of inter-machine communication outweighs the increased computing power. For the relatively smaller graph USA, the performance does not change much with a varying number of slave machines, since the computing power is sufficient even with only 6 slaves.

To test the scalability of our GVD partitioner as the graph size increases, we generate random graphs using the PreZER algorithm [12]. We set the average degree to 20 and vary |V| over 25M, 50M, 75M and 100M. Figure 14 shows the scalability results, where all 16 machines in our cluster are used. The partitioning time increases almost linearly with |V|, which verifies that our GVD partitioner scales well with graph size. Moreover, even for a graph with |V| = 100M (i.e., |E| ≈ 2B), the partitioning is done in only 164.13 seconds. In contrast, even for the smallest graph with |V| = 25M, Giraph++'s new METIS partitioning algorithm cannot finish in 24 hours.
Graph | Phase | 6 machines | 9 machines | 12 machines | 15 machines
BTC | Load | 60.35 s | 38.50 s | 32.00 s | 28.71 s
BTC | Compute | 189.20 s | 140.52 s | 119.53 s | 115.52 s
BTC | Dump | 59.13 s | 52.96 s | 49.65 s | 53.66 s
BTC | Total | 308.69 s | 231.98 s | 201.18 s | 197.90 s
USA | Load | 15.40 s | 13.01 s | 10.37 s | 6.57 s
USA | Compute | 40.05 s | 41.32 s | 40.40 s | 44.74 s
USA | Dump | 6.46 s | 4.66 s | 4.46 s | 3.95 s
USA | Total | 61.91 s | 58.99 s | 55.23 s | 55.26 s

Figure 13: GVD Partitioner Scalability on Real Graphs

Phase | |V| = 25M | |V| = 50M | |V| = 75M | |V| = 100M (avg deg = 20)
Load | 11.44 s | 29.05 s | 42.10 s | 74.31 s
Compute | 45.32 s | 88.76 s | 127.32 s | 164.13 s
Dump | 22.26 s | 41.64 s | 58.86 s | 76.84 s
Total | 79.02 s | 159.45 s | 228.29 s | 315.28 s

Figure 14: GVD Partitioner Scalability on Random Graphs
LDG cannot be used in Blogel’s VB-mode and B-mode, since a
We now report the performance of various graph computing sys-
partition obtained by LDG is not guaranteed to be connected.
tems for computing CC, SSSP, reachability, and PageRank. We run
The results show that though running V-mode, block-centric com-
the vertex-centric algorithms of Blogel (denoted by V-centric), as
puting is still significantly faster than vertex-centric computing (i.e.,
well as the block-centric algorithm of Blogel (denoted by B-GVD,
V-Centric, Giraph and GraphLab). Note that B-GVD and B-URL
B-2D or B-URL depending on which partitioner is used). Note that
are also running block-centric V-mode in Figure 16(b). Thus, the
B-2D applies to road networks only, B-URL applies to web graphs
result also reveals that our GVD partitioner leads to more efficient
only, while B-GVD applies to general graphs. We compare with
distributed computing than LDG. B-URL is comparable with B-
Giraph, GraphLab, and the graph-centric system Giraph++. For
LDG on WebUK, but is significantly faster than B-LDG on Web-
GraphLab, we use its synchronous mode since this paper focuses
Base. The superior performance of B-GVD and B-URL is mainly
on synchronous computing model. For Giraph++, we do not re-
because the GVD and URL partitioners achieve greater reduction
port the results for WebUK and Friendster since Giraph++ failed to
in the number of cross-worker edges than LDG, which results in
partition these large graphs.
less number of messages exchanged through the network.
Figure 15(a) shows the results of CC computation on three rep-
We also notice that the number of supersteps of the block-centric
resentative graphs: BTC (skewed degree distribution), Friendster
algorithm is more than that of the vertex-centric algorithm, which
(relatively high average degree), and USA (large diameter). We ob-
is mainly due to the fact that the PageRank initialization formula
tain the following observations. First, V-centric is generally faster
of [6], i.e., pr(v) = lpr(v) · br(block(v)), is not effective. We may
than Giraph and GraphLab, which shows that Blogel is more ef-
improve the algorithm by specifying another initialization formula,
ficient than existing systems even for vertex-centric computing.
but this is not the focus of this paper.
Second, B-GVD (or B-2D) is tens of times faster than V-centric,
Another kind of PageRank algorithm is adopted in Giraph++’s
which shows the superiority of our block-centric computing. Fi-
paper [18], which is based on the accumulative iterative update ap-
nally, B-GVD (or B-2D) is 1–2 orders of magnitude faster than
proach of [22]. To make a fair comparison with Giraph++, we
Giraph++; this is because Blogel’s block-centric algorithm works
also developed a Blogel vertex-centric counterpart, and a block-
in B-mode where blocks communicates with each other directly,
centric counterpart that runs in VB-mode. As Figure 16(c) shows,
while Giraph++’s graph-centric paradigm does not support B-mode
Blogel’s block-centric computing (i.e., B-GVD or B-URL) is sig-
and communication still occurs between vertices.
nificantly faster than its vertex-centric counterpart (i.e., V-Centric)
Figure 15(b) shows the results of SSSP computation on two weighted
when accumulative iterative update is applied. We can only com-
road network graphs. We see that both B-GVD and B-2D are orders
pare with Giraph++ on WebBase since Giraph++ failed to partition
of magnitude faster than V-Centric, which can be explained by the
WebUK. Figure 16(c) shows that while Giraph++ is twice faster
huge difference in the number of supersteps taken by the differ-
than V-Centric, it is still much slower than B-GVD and B-URL.
ent models. Giraph++ is also significantly faster than V-Centric,
but it is still much slower than our B-2D. This result verifies that
our block-centric model can effectively deal with graphs with large
diameter. The result also shows that 2D partitioner allows more 9. CONCLUSIONS
efficient block-centric parallel computing than GVD partitioner for We presented a block-centric framework, called Blogel, and showed
spatial networks. that Blogel is significantly faster than existing distributed graph
Figure 15(c) shows the results of reachability computation on the computing systems [1, 10, 4, 18], for processing large graphs with
small-diameter graph, LiveJoural, and the large-diameter graphs, adverse graph characteristics such as skewed degree distribution,
WebUK and USA. We set the source s to be a vertex that can reach high average degree, and large diameter. We also showed that Blo-
most of the vertices in the input graph and set t = −1, which means gel’s partitioners generate high-quality blocks and are much faster
that the actual computation is BFS from s. As Figure 15(c) shows, than the state-of-the-art graph partitioning methods [17, 18].

1991
(a) Performance of Hash-Min (CC)
Data | System | Load | Compute | Step # | Dump
BTC | V-Centric | 24.22 s | 28.48 s | 30 | 8.36 s
BTC | B-GVD | 7.58 s | 0.94 s | 6 | 6.16 s
BTC | Giraph | 70.26 s | 94.54 s | 30 | 17.14 s
BTC | Giraph++ | 102.01 s | 101.29 s | 5 | 25.64 s
BTC | GraphLab | 105.48 s | 83.1 s | 30 | 19.03 s
Friendster | V-Centric | 82.68 s | 120.24 s | 22 | 1.55 s
Friendster | B-GVD | 16.08 s | 2.52 s | 6 | 2.31 s
Friendster | Giraph | 95.88 s | 248.29 s | 22 | 6.37 s
Friendster | GraphLab | 188.59 s | 77.0 s | 22 | 7.57 s
USA Road | V-Centric | 5.98 s | 510.98 s | 6262 | 0.57 s
USA Road | B-GVD | 1.47 s | 13.95 s | 164 | 1.00 s
USA Road | B-2D | 1.41 s | 1.94 s | 26 | 1.07 s
USA Road | Giraph | 14.07 s | 9518.99 s | 6262 | 2.14 s
USA Road | Giraph++ | 16.81 s | 24.00 s | 12 | 2.54 s
USA Road | GraphLab | 18.27 s | 2982.3 s | 6262 | 3.41 s

(b) Performance of SSSP
Data | System | Load | Compute | Step # | Dump
USA Road | V-Centric | 7.21 s | 2832.26 s | 10789 | 1.81 s
USA Road | B-GVD | 1.87 s | 118.75 s | 751 | 2.46 s
USA Road | B-2D | 1.65 s | 11.29 s | 59 | 2.58 s
USA Road | Giraph | 16.68 s | 11116.90 s | 10789 | 4.54 s
USA Road | Giraph++ | 18.29 s | 80.53 s | 39 | 2.88 s
USA Road | GraphLab | 19.38 s | 9293 s | 10789 | 4.02 s
Euro Road | V-Centric | 5.47 s | 708.56 s | 6210 | 1.01 s
Euro Road | B-GVD | 1.61 s | 68.56 s | 440 | 1.74 s
Euro Road | B-2D | 1.26 s | 8.86 s | 55 | 1.85 s
Euro Road | Giraph | 12.51 s | 12712.06 s | 6210 | 0.05 s
Euro Road | Giraph++ | 18.57 s | 87.73 s | 30 | 3.59 s
Euro Road | GraphLab | 16.48 s | 3231.3 s | 6210 | 3.17 s

(c) Performance of Reachability
Data | System | Load | Compute | Step # | Dump
WebUK | V-Centric | 71.89 s | 144.29 s | 664 | 0.89 s
WebUK | B-GVD | 100.72 s | 41.58 s | 71 | 2.64 s
WebUK | Giraph | 530.78 s | 1078.06 s | 664 | 13.75 s
WebUK | GraphLab | 435.95 s | 424.2 s | 664 | 15.12 s
LiveJournal | V-Centric | 8.93 s | 6.87 s | 19 | 0.37 s
LiveJournal | B-GVD | 5.54 s | 4.84 s | 6 | 0.65 s
LiveJournal | Giraph | 37.77 s | 17.89 s | 19 | 1.6 s
LiveJournal | Giraph++ | 17.50 s | 29.12 s | 4 | 16.13 s
LiveJournal | GraphLab | 16.46 s | 8.9 s | 19 | 2.15 s
USA Road | V-Centric | 5.18 s | 389.37 s | 6263 | 0.45 s
USA Road | B-GVD | 2.31 s | 33.31 s | 246 | 0.73 s
USA Road | B-2D | 1.51 s | 4.02 s | 26 | 1.40 s
USA Road | Giraph | 16.27 s | 5866.19 s | 6263 | 2.64 s
USA Road | Giraph++ | 18.43 s | 34.43 s | 18 | 2.51 s
USA Road | GraphLab | 18.86 s | 1558.6 s | 6263 | 3.56 s

Figure 15: Performance of CC, SSSP and Reachability Computation
(a) Performance of BlockRank Computation
Data | System | Load | Compute lpr(v) & br(b) | Dump
WebUK | B-GVD | 105.10 s | 44.98 s | 385.45 s
WebUK | B-URL | 124.52 s | 17.39 s | 395.93 s
WebBase | B-GVD | 23.93 s | 16.68 s | 53.79 s
WebBase | B-URL | 20.27 s | 40.32 s | 45.36 s

(b) Performance of PageRank Computation
Data | System | Load | Step # | Per-Step Time | Dump
WebUK | V-Centric | 71.37 s | 89 | 29.99 s | 4.16 s
WebUK | B-LDG | 104.59 s | 89 | 24.65 s | 2.04 s
WebUK | B-GVD | 47.21 s | 95 | 18.63 s | 6.59 s
WebUK | B-URL | 64.62 s | 93 | 25.96 s | 7.08 s
WebUK | Giraph | 163.99 s | 89 | 53.74 s | 22.35 s
WebUK | GraphLab | 245.62 s | 89 | 48.43 s | 16.34 s
WebBase | V-Centric | 20.81 s | 80 | 16.23 s | 2.77 s
WebBase | B-LDG | 28.30 s | 80 | 9.64 s | 3.62 s
WebBase | B-GVD | 11.51 s | 90 | 4.99 s | 6.92 s
WebBase | B-URL | 6.39 s | 84 | 2.86 s | 5.15 s
WebBase | Giraph | 61.41 s | 80 | 12.67 s | 16.30 s
WebBase | GraphLab | 79.91 s | 80 | 20.04 s | 14.92 s

(c) Performance of Giraph++'s Version
Data | System | Load | Step # | Per-Step Time | Dump
WebUK | V-Centric | 111.45 s | 92 | 31.16 s | 1.99 s
WebUK | B-GVD | 113.63 s | 92 | 16.89 s | 4.94 s
WebUK | B-URL | 120.74 s | 92 | 9.30 s | 4.87 s
WebBase | V-Centric | 30.74 s | 92 | 16.53 s | 2.54 s
WebBase | B-GVD | 25.38 s | 92 | 4.00 s | 4.61 s
WebBase | B-URL | 25.83 s | 92 | 1.20 s | 4.93 s
WebBase | Giraph++ | 149.49 s | 92 | 7.02 s | 26.47 s

(d) # of Blocks/Vertices Per Worker (URL Partitioning)
Data | Worker | Block # | Vertex #
WebUK | 0–150 | 1,693–1,697 | 884,987–884,988
WebBase | 0–150 | 4,930–4,931 | 782,398–782,399

Figure 16: Performance of PageRank Computation
For future work, we plan to define a class of algorithms similar to PPA [21] for the block-centric computing model.

Acknowledgments. We thank the reviewers for giving us many constructive comments, with which we have significantly improved our paper. This work was partially done when the first author was at HKUST. This research is supported in part by SHIAE Grant No. 8115048 and HKUST Grant No. FSGRF14EG31.

10. REFERENCES
[1] C. Avery. Giraph: Large-scale graph processing infrastructure on Hadoop. Proceedings of the Hadoop Summit, Santa Clara, 2011.
[2] J. Cheng, Y. Ke, S. Chu, and C. Cheng. Efficient processing of distance queries in large graphs: a vertex cover approach. In SIGMOD Conference, pages 457–468, 2012.
[3] M. Erwig and F. Hagen. The graph Voronoi diagram with applications. Networks, 36(3):156–163, 2000.
[4] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI, pages 17–30, 2012.
[5] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal of Applied Mathematics, 17(2):416–429, 1969.
[6] S. Kamvar, T. Haveliwala, C. Manning, and G. Golub. Exploiting the block structure of the web for computing PageRank. Stanford University Technical Report, 2003.
[7] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998.
[8] Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. Mizan: a system for dynamic load balancing in large-scale graph processing. In EuroSys, pages 169–182, 2013.
[9] J. Kleinberg and E. Tardos. Algorithm Design. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2005.
[10] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning in the cloud. PVLDB, 5(8):716–727, 2012.
[11] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD Conference, pages 135–146, 2010.
[12] S. Nobari, X. Lu, P. Karras, and S. Bressan. Fast random graph generation. In EDBT, pages 331–342, 2011.
[13] V. Rastogi, A. Machanavajjhala, L. Chitnis, and A. D. Sarma. Finding connected components in map-reduce in logarithmic rounds. In ICDE, pages 50–61, 2013.
[14] S. Salihoglu and J. Widom. Computing strongly connected components in Pregel-like systems. Stanford University Technical Report.
[15] S. Salihoglu and J. Widom. GPS: a graph processing system. In SSDBM, page 22, 2013.
[16] S. B. Seidman. Network structure and minimum degree. Social Networks, 5:269–287, 1983.
[17] I. Stanton and G. Kliot. Streaming graph partitioning for large distributed graphs. In KDD, pages 1222–1230, 2012.
[18] Y. Tian, A. Balmin, S. A. Corsten, S. Tatikonda, and J. McPherson. From "think like a vertex" to "think like a graph". PVLDB, 7(3):193–204, 2013.
[19] J. Wang and J. Cheng. Truss decomposition in massive networks. PVLDB, 5(9):812–823, 2012.
[20] W. Xie, G. Wang, D. Bindel, A. J. Demers, and J. Gehrke. Fast iterative graph computation with block updates. PVLDB, 6(14):2014–2025, 2013.
[21] D. Yan, J. Cheng, K. Xing, Y. Lu, W. Ng, and Y. Bu. Pregel algorithms for graph connectivity problems with performance guarantees. PVLDB, 7(14), 2014.
[22] Y. Zhang, Q. Gao, L. Gao, and C. Wang. Accelerate large-scale iterative computation through asynchronous accumulative updates. In Workshop on Scientific Cloud Computing, pages 13–22. ACM, 2012.