
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs

Joseph E. Gonzalez Yucheng Low Haijie Gu


Carnegie Mellon University Carnegie Mellon University Carnegie Mellon University
[email protected] [email protected] [email protected]
Danny Bickson Carlos Guestrin
Carnegie Mellon University University of Washington
[email protected] [email protected]

Abstract

Large-scale graph-structured computation is central to tasks ranging from targeted advertising to natural language processing and has led to the development of several graph-parallel abstractions including Pregel and GraphLab. However, the natural graphs commonly found in the real world have highly skewed power-law degree distributions, which challenge the assumptions made by these abstractions, limiting performance and scalability.

In this paper, we characterize the challenges of computation on natural graphs in the context of existing graph-parallel abstractions. We then introduce the PowerGraph abstraction which exploits the internal structure of graph programs to address these challenges. Leveraging the PowerGraph abstraction we introduce a new approach to distributed graph placement and representation that exploits the structure of power-law graphs. We provide a detailed analysis and experimental evaluation comparing PowerGraph to two popular graph-parallel systems. Finally, we describe three different implementation strategies for PowerGraph and discuss their relative merits with empirical evaluations on large-scale real-world problems demonstrating order of magnitude gains.

1 Introduction

The increasing need to reason about large-scale graph-structured data in machine learning and data mining (MLDM) presents a critical challenge. As the sizes of datasets grow, statistical theory suggests that we should apply richer models to eliminate the unwanted bias of simpler models, and extract stronger signals from data. At the same time, the computational and storage complexity of richer models coupled with rapidly growing datasets have exhausted the limits of single machine computation.

The resulting demand has driven the development of new graph-parallel abstractions such as Pregel [30] and GraphLab [29] that encode computation as vertex-programs which run in parallel and interact along edges in the graph. Graph-parallel abstractions rely on each vertex having a small neighborhood to maximize parallelism and effective partitioning to minimize communication. However, graphs derived from real-world phenomena, like social networks and the web, typically have power-law degree distributions, which implies that a small subset of the vertices connects to a large fraction of the graph. Furthermore, power-law graphs are difficult to partition [1, 28] and represent in a distributed environment.

To address the challenges of power-law graph computation, we introduce the PowerGraph abstraction which exploits the structure of vertex-programs and explicitly factors computation over edges instead of vertices. As a consequence, PowerGraph exposes substantially greater parallelism, reduces network communication and storage costs, and provides a new highly effective approach to distributed graph placement. We describe the design of our distributed implementation of PowerGraph and evaluate it on a large EC2 deployment using real-world applications. In particular our key contributions are:

1. An analysis of the challenges of power-law graphs in distributed graph computation and the limitations of existing graph-parallel abstractions (Sec. 2 and 3).

2. The PowerGraph abstraction (Sec. 4) which factors individual vertex-programs.

3. A delta caching procedure which allows computation state to be dynamically maintained (Sec. 4.2).

4. A new fast approach to data layout for power-law graphs in distributed environments (Sec. 5).

5. A theoretical characterization of network and storage (Theorem 5.2, Theorem 5.3).

6. A high-performance open-source implementation of the PowerGraph abstraction (Sec. 7).

7. A comprehensive evaluation of three implementations of PowerGraph on a large EC2 deployment using real-world MLDM applications (Sec. 6 and 7).

2 Graph-Parallel Abstractions

A graph-parallel abstraction consists of a sparse graph G = {V, E} and a vertex-program Q which is executed in parallel on each vertex v ∈ V and can interact (e.g., through shared-state in GraphLab, or messages in Pregel) with neighboring instances Q(u) where (u, v) ∈ E. In contrast to more general message passing models, graph-parallel abstractions constrain the interaction of vertex-programs to a graph structure, enabling the optimization of data-layout and communication. We focus our discussion on Pregel and GraphLab as they are representative of existing graph-parallel abstractions.

2.1 Pregel

Pregel [30] is a bulk synchronous message passing abstraction in which all vertex-programs run simultaneously in a sequence of super-steps. Within a super-step each program instance Q(v) receives all messages from the previous super-step and sends messages to its neighbors in the next super-step. A barrier is imposed between super-steps to ensure that all program instances finish processing messages from the previous super-step before proceeding to the next. The program terminates when there are no messages remaining and every program has voted to halt. Pregel introduces commutative associative message combiners which are user defined functions that merge messages destined to the same vertex. The following is an example of the PageRank vertex-program implemented in Pregel. The vertex-program receives the single incoming message (after the combiner) which contains the sum of the PageRanks of all in-neighbors. The new PageRank is then computed and sent to its out-neighbors.

Message combiner(Message m1, Message m2) :
  return Message(m1.value() + m2.value());

void PregelPageRank(Message msg) :
  float total = msg.value();
  vertex.val = 0.15 + 0.85 * total;
  foreach(nbr in out_neighbors) :
    SendMsg(nbr, vertex.val / num_out_nbrs);

2.2 GraphLab

GraphLab [29] is an asynchronous distributed shared-memory abstraction in which vertex-programs have shared access to a distributed graph with data stored on every vertex and edge. Each vertex-program may directly access information on the current vertex, adjacent edges, and adjacent vertices irrespective of edge direction. Vertex-programs can schedule neighboring vertex-programs to be executed in the future. GraphLab ensures serializability by preventing neighboring program instances from running simultaneously. The following is an example of the PageRank vertex-program implemented in GraphLab. The GraphLab vertex-program directly reads neighboring vertex values to compute the sum.

void GraphLabPageRank(Scope scope) :
  float accum = 0;
  foreach (nbr in scope.in_nbrs) :
    accum += nbr.val / nbr.nout_nbrs();
  vertex.val = 0.15 + 0.85 * accum;

By eliminating messages, GraphLab isolates the user defined algorithm from the movement of data, allowing the system to choose when and how to move program state. By allowing mutable data to be associated with both vertices and edges, GraphLab allows the algorithm designer to more precisely distinguish between data shared with all neighbors (vertex data) and data shared with a particular neighbor (edge data).

2.3 Characterization

While the implementations of MLDM vertex-programs in GraphLab and Pregel differ in how they collect and disseminate information, they share a common overall structure. To characterize this common structure and differentiate between vertex and edge specific computation we introduce the GAS model of graph computation.

The GAS model represents three conceptual phases of a vertex-program: Gather, Apply, and Scatter. In the gather phase, information about adjacent vertices and edges is collected through a generalized sum over the neighborhood of the vertex u on which Q(u) is run:

    \Sigma \leftarrow \bigoplus_{v \in \mathrm{Nbr}[u]} g\big(D_u, D_{(u,v)}, D_v\big),    (2.1)

where D_u, D_v, and D_{(u,v)} are the values (program state and meta-data) for vertices u and v and edge (u, v). The user defined sum ⊕ operation must be commutative and associative and can range from a numerical sum to the union of the data on all neighboring vertices and edges.

The resulting value Σ is used in the apply phase to update the value of the central vertex:

    D_u^{\mathrm{new}} \leftarrow a(D_u, \Sigma).    (2.2)

Finally the scatter phase uses the new value of the central vertex to update the data on adjacent edges:

    \forall v \in \mathrm{Nbr}[u] : \; D_{(u,v)} \leftarrow s\big(D_u^{\mathrm{new}}, D_{(u,v)}, D_v\big).    (2.3)

The fan-in and fan-out of a vertex-program is determined by the corresponding gather and scatter phases. For instance, in PageRank, the gather phase only operates on in-edges and the scatter phase only operates on out-edges. However, for many MLDM algorithms the graph edges encode ostensibly symmetric relationships, like friendship, in which both the gather and scatter phases touch all edges. In this case the fan-in and fan-out are equal. As we will show in Sec. 3, the ability for graph-parallel abstractions to support both high fan-in and fan-out computation is critical for efficient computation on natural graphs.
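To make the three phases concrete, the following self-contained C++ sketch runs Eqs. 2.1–2.3 for PageRank on a tiny in-memory graph. It is illustrative only: the graph, constants, and types are made up and none of this is the Pregel, GraphLab, or PowerGraph API. The gather/sum step realizes the generalized sum of Eq. 2.1 over in-neighbors, apply is Eq. 2.2, and scatter is a no-op because PageRank stores no edge data.

// Minimal single-machine sketch of the GAS phases for PageRank.
// Everything here (types, graph, constants) is illustrative.
#include <cstdio>
#include <vector>

struct Vertex { double rank = 1.0; std::vector<int> in_nbrs, out_nbrs; };

int main() {
  // Tiny directed graph: 0->1, 0->2, 1->2, 2->0
  std::vector<Vertex> g(3);
  auto add_edge = [&](int u, int v) { g[u].out_nbrs.push_back(v); g[v].in_nbrs.push_back(u); };
  add_edge(0, 1); add_edge(0, 2); add_edge(1, 2); add_edge(2, 0);

  for (int iter = 0; iter < 10; ++iter) {
    std::vector<double> next(g.size());
    for (size_t u = 0; u < g.size(); ++u) {
      // Gather + sum (Eq. 2.1): generalized sum over the in-neighborhood.
      double acc = 0.0;
      for (int v : g[u].in_nbrs) acc += g[v].rank / g[v].out_nbrs.size();
      // Apply (Eq. 2.2): compute the new value of the central vertex.
      next[u] = 0.15 + 0.85 * acc;
      // Scatter (Eq. 2.3): PageRank keeps no edge data, so nothing to do here.
    }
    for (size_t u = 0; u < g.size(); ++u) g[u].rank = next[u];
  }
  for (size_t u = 0; u < g.size(); ++u) std::printf("rank[%zu] = %f\n", u, g[u].rank);
  return 0;
}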

Figure 1: The in and out degree distributions of the Twitter follower network plotted in log-log scale. (a) Twitter in-degree; (b) Twitter out-degree.

GraphLab and Pregel express GAS programs in very different ways. In the Pregel abstraction the gather phase is implemented using message combiners and the apply and scatter phases are expressed in the vertex-program. Conversely, GraphLab exposes the entire neighborhood to the vertex-program and allows the user to define the gather and apply phases within their program. The GraphLab abstraction implicitly defines the communication aspects of the gather/scatter phases by ensuring that changes made to the vertex or edge data are automatically visible to adjacent vertices. It is also important to note that GraphLab does not differentiate between edge directions.

3 Challenges of Natural Graphs

The sparsity structure of natural graphs presents a unique challenge to efficient distributed graph-parallel computation. One of the hallmark properties of natural graphs is their skewed power-law degree distribution [16]: most vertices have relatively few neighbors while a few have many neighbors (e.g., celebrities in a social network). Under a power-law degree distribution the probability that a vertex has degree d is given by:

    P(d) \propto d^{-\alpha},    (3.1)

where the exponent α is a positive constant that controls the "skewness" of the degree distribution. Higher α implies that the graph has lower density (ratio of edges to vertices), and that the vast majority of vertices are low degree. As α decreases, the graph density and number of high degree vertices increase. Most natural graphs typically have a power-law constant around α ≈ 2. For example, Faloutsos et al. [16] estimated that the inter-domain graph of the Internet has a power-law constant α ≈ 2.2. One can visualize the skewed power-law degree distribution by plotting the number of vertices with a given degree in log-log scale. In Fig. 1, we plot the in and out degree distributions of the Twitter follower network demonstrating the characteristic linear power-law form.

While power-law degree distributions are empirically observable, they do not fully characterize the properties of natural graphs. While there has been substantial work (see [27]) in more sophisticated natural graph models, the techniques in this paper focus only on the degree distribution and do not require any other modeling assumptions.

The skewed degree distribution implies that a small fraction of the vertices are adjacent to a large fraction of the edges. For example, one percent of the vertices in the Twitter web-graph are adjacent to nearly half of the edges. This concentration of edges results in a star-like motif which presents challenges for existing graph-parallel abstractions:

Work Balance: The power-law degree distribution can lead to substantial work imbalance in graph-parallel abstractions that treat vertices symmetrically. Since the storage, communication, and computation complexity of the Gather and Scatter phases is linear in the degree, the running time of vertex-programs can vary widely [36].

Partitioning: Natural graphs are difficult to partition [26, 28]. Both GraphLab and Pregel depend on graph partitioning to minimize communication and ensure work balance. However, in the case of natural graphs both are forced to resort to hash-based (random) partitioning which has extremely poor locality (Sec. 5).

Communication: The skewed degree distribution of natural graphs leads to communication asymmetry and consequently bottlenecks. In addition, high-degree vertices can force messaging abstractions, such as Pregel, to generate and send many identical messages.

Storage: Since graph-parallel abstractions must locally store the adjacency information for each vertex, each vertex requires memory linear in its degree. Consequently, high-degree vertices can exceed the memory capacity of a single machine.

Computation: While multiple vertex-programs may execute in parallel, existing graph-parallel abstractions do not parallelize within individual vertex-programs, limiting their scalability on high-degree vertices.
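To give a feel for how concentrated the edges become, the short C++ sketch below samples vertex degrees directly from the Zipf form of Eq. 3.1 and reports the share of total degree held by the top one percent of vertices, mirroring the Twitter observation above. The parameters (number of vertices, maximum degree, α) are arbitrary illustrative choices, not values from the paper's datasets.

// Estimate the degree share of the top 1% of vertices under P(d) ∝ d^(-α) (Eq. 3.1).
// Illustrative sketch: n, dmax, and alpha are arbitrary.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <functional>
#include <random>
#include <vector>

int main() {
  const int n = 1000000, dmax = 100000;
  const double alpha = 2.0;
  std::mt19937_64 rng(42);

  // Discrete Zipf sampler over degrees 1..dmax built from the weights d^(-alpha).
  std::vector<double> w(dmax);
  for (int d = 1; d <= dmax; ++d) w[d - 1] = std::pow((double)d, -alpha);
  std::discrete_distribution<int> zipf(w.begin(), w.end());

  std::vector<long long> deg(n);
  long long total = 0;
  for (int i = 0; i < n; ++i) { deg[i] = zipf(rng) + 1; total += deg[i]; }

  std::sort(deg.begin(), deg.end(), std::greater<long long>());
  long long top = 0;
  for (int i = 0; i < n / 100; ++i) top += deg[i];  // top 1% of vertices
  std::printf("alpha=%.2f: top 1%% of vertices hold %.1f%% of total degree\n",
              alpha, 100.0 * top / total);
  return 0;
}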

4 PowerGraph Abstraction

To address the challenges of computation on power-law graphs, we introduce PowerGraph, a new graph-parallel abstraction that eliminates the degree dependence of the vertex-program by directly exploiting the GAS decomposition to factor vertex-programs over edges. By lifting the Gather and Scatter phases into the abstraction, PowerGraph is able to retain the natural "think-like-a-vertex" philosophy [30] while distributing the computation of a single vertex-program over the entire cluster.

PowerGraph combines the best features from both Pregel and GraphLab. From GraphLab, PowerGraph borrows the data-graph and shared-memory view of computation, eliminating the need for users to architect the movement of information. From Pregel, PowerGraph borrows the commutative, associative gather concept. PowerGraph supports both the highly-parallel bulk-synchronous Pregel model of computation as well as the computationally efficient asynchronous GraphLab model of computation.

Like GraphLab, the state of a PowerGraph program factors according to a data-graph with user defined vertex data Dv and edge data D(u,v). The data stored in the data-graph includes both meta-data (e.g., urls and edge weights) as well as computation state (e.g., the PageRank of vertices). In Sec. 5 we introduce vertex-cuts which allow PowerGraph to efficiently represent and store power-law graphs in a distributed environment. We now describe the PowerGraph abstraction and how it can be used to naturally decompose vertex-programs. Then in Sec. 5 through Sec. 7 we discuss how to implement the PowerGraph abstraction in a distributed environment.

4.1 GAS Vertex-Programs

Computation in the PowerGraph abstraction is encoded as a state-less vertex-program which implements the GASVertexProgram interface (Fig. 2) and therefore explicitly factors into the gather, sum, apply, and scatter functions. Each function is invoked in stages by the PowerGraph engine following the semantics in Alg. 1. By factoring the vertex-program, the PowerGraph execution engine can distribute a single vertex-program over multiple machines and move computation to the data.

interface GASVertexProgram(u) {
  // Run on gather_nbrs(u)
  gather(Du, D(u,v), Dv) → Accum
  sum(Accum left, Accum right) → Accum
  apply(Du, Accum) → Du^new
  // Run on scatter_nbrs(u)
  scatter(Du^new, D(u,v), Dv) → (D(u,v)^new, Accum)
}

Figure 2: All PowerGraph programs must implement the stateless gather, sum, apply, and scatter functions.

Algorithm 1: Vertex-Program Execution Semantics
  Input: Center vertex u
  if cached accumulator au is empty then
    foreach neighbor v in gather_nbrs(u) do
      au ← sum(au, gather(Du, D(u,v), Dv))
    end
  end
  Du ← apply(Du, au)
  foreach neighbor v in scatter_nbrs(u) do
    (D(u,v), ∆a) ← scatter(Du, D(u,v), Dv)
    if av and ∆a are not Empty then av ← sum(av, ∆a)
    else av ← Empty
  end

During the gather phase the gather and sum functions are used as a map and reduce to collect information about the neighborhood of the vertex. The gather function is invoked in parallel on the edges adjacent to u. The particular set of edges is determined by gather_nbrs which can be none, in, out, or all. The gather function is passed the data on the adjacent vertex and edge and returns a temporary accumulator (a user defined type). The result is combined using the commutative and associative sum operation. The final result au of the gather phase is passed to the apply phase and cached by PowerGraph.

After the gather phase has completed, the apply function takes the final accumulator and computes a new vertex value Du which is atomically written back to the graph. The size of the accumulator au and complexity of the apply function play a central role in determining the network and storage efficiency of the PowerGraph abstraction and should be sub-linear and ideally constant in the degree.

During the scatter phase, the scatter function is invoked in parallel on the edges adjacent to u, producing new edge values D(u,v) which are written back to the data-graph. As with the gather phase, scatter_nbrs determines the particular set of edges on which scatter is invoked. The scatter function returns an optional value ∆a which is used to dynamically update the cached accumulator av for the adjacent vertex (see Sec. 4.2).

In Fig. 3 we implement the PageRank, greedy graph coloring, and single source shortest path algorithms using the PowerGraph abstraction. In PageRank the gather and sum functions collect the total value of the adjacent vertices, the apply function computes the new PageRank, and the scatter function is used to activate adjacent vertex-programs if necessary. In graph coloring the gather and sum functions collect the set of colors on adjacent vertices, the apply function computes a new color, and the scatter function activates adjacent vertices if they violate the coloring constraint. Finally, in single source shortest path (SSSP), the gather and sum functions compute the shortest path through each of the neighbors, the apply function returns the new distance, and the scatter function activates affected neighbors.

PageRank:
  // gather_nbrs: IN_NBRS
  gather(Du, D(u,v), Dv):
    return Dv.rank / #outNbrs(v)
  sum(a, b): return a + b
  apply(Du, acc):
    rnew = 0.15 + 0.85 * acc
    Du.delta = (rnew - Du.rank) / #outNbrs(u)
    Du.rank = rnew
  // scatter_nbrs: OUT_NBRS
  scatter(Du, D(u,v), Dv):
    if (|Du.delta| > ε) Activate(v)
    return delta

Greedy Graph Coloring:
  // gather_nbrs: ALL_NBRS
  gather(Du, D(u,v), Dv):
    return set(Dv)
  sum(a, b): return union(a, b)
  apply(Du, S):
    Du = min c where c ∉ S
  // scatter_nbrs: ALL_NBRS
  scatter(Du, D(u,v), Dv):
    // Nbr changed since gather
    if (Du == Dv) Activate(v)
    // Invalidate cached accum
    return NULL

Single Source Shortest Path (SSSP):
  // gather_nbrs: ALL_NBRS
  gather(Du, D(u,v), Dv):
    return Dv + D(v,u)
  sum(a, b): return min(a, b)
  apply(Du, new_dist):
    Du = new_dist
  // scatter_nbrs: ALL_NBRS
  scatter(Du, D(u,v), Dv):
    // If changed activate neighbor
    if (changed(Du)) Activate(v)
    if (increased(Du))
      return NULL
    else return Du + D(u,v)

Figure 3: The PageRank, graph-coloring, and single source shortest path algorithms implemented in the PowerGraph abstraction. Both the PageRank and single source shortest path algorithms support delta caching in the gather phase.
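As a rough illustration of how a factored vertex-program plugs into the execution semantics of Alg. 1, the C++ sketch below specializes the gather/sum/apply/scatter interface of Fig. 2 to PageRank and executes it on a single machine, including the cached-accumulator check. The engine loop, graph representation, and type names are simplified assumptions and not the actual PowerGraph implementation.

// Single-machine sketch of Fig. 2's interface and Alg. 1's execution semantics,
// specialized to PageRank. Types, graph layout, and the engine loop are illustrative.
#include <cmath>
#include <cstdio>
#include <optional>
#include <vector>

using Accum = double;                                   // user-defined accumulator type
struct VertexData { double rank = 1.0, delta = 0.0; };

struct Graph {
  std::vector<VertexData> vdata;
  std::vector<std::vector<int>> in_nbrs, out_nbrs;
};

struct PageRankProgram {                                // the stateless vertex-program
  Accum gather(const Graph& g, int /*u*/, int v) const {
    return g.vdata[v].rank / g.out_nbrs[v].size();
  }
  Accum sum(Accum a, Accum b) const { return a + b; }
  void apply(Graph& g, int u, Accum acc) const {
    double rnew = 0.15 + 0.85 * acc;
    g.vdata[u].delta = (rnew - g.vdata[u].rank) / g.out_nbrs[u].size();
    g.vdata[u].rank = rnew;
  }
  // Returns the optional delta used to correct the neighbor's cached accumulator.
  std::optional<Accum> scatter(Graph& g, int u, int v, std::vector<int>& active) const {
    if (std::abs(g.vdata[u].delta) > 1e-3) active.push_back(v);
    return g.vdata[u].delta;
  }
};

// Alg. 1: run the vertex-program on u, skipping the gather if an accumulator is cached.
void execute_vertex(Graph& g, const PageRankProgram& prog, int u,
                    std::vector<std::optional<Accum>>& cache, std::vector<int>& active) {
  if (!cache[u]) {
    Accum a = 0.0;
    for (int v : g.in_nbrs[u]) a = prog.sum(a, prog.gather(g, u, v));
    cache[u] = a;
  }
  prog.apply(g, u, *cache[u]);
  for (int v : g.out_nbrs[u]) {                         // PageRank scatters on out-edges
    auto da = prog.scatter(g, u, v, active);
    if (da && cache[v]) cache[v] = prog.sum(*cache[v], *da);   // delta caching (Sec. 4.2)
    else cache[v].reset();                              // otherwise force a full re-gather
  }
}

int main() {
  Graph g;
  g.vdata.resize(3);
  g.in_nbrs  = {{2}, {0}, {0, 1}};   // edges: 0->1, 0->2, 1->2, 2->0
  g.out_nbrs = {{1, 2}, {2}, {0}};
  PageRankProgram prog;
  std::vector<std::optional<Accum>> cache(3);
  std::vector<int> active = {0, 1, 2};
  for (int step = 0; step < 10 && !active.empty(); ++step) {
    std::vector<int> next;
    for (int u : active) execute_vertex(g, prog, u, cache, next);
    active.swap(next);
  }
  for (int u = 0; u < 3; ++u) std::printf("rank[%d] = %f\n", u, g.vdata[u].rank);
  return 0;
}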

4.2 Delta Caching

In many cases a vertex-program will be triggered in response to a change in a few of its neighbors. The gather operation is then repeatedly invoked on all neighbors, many of which remain unchanged, thereby wasting computation cycles. For many algorithms [2] it is possible to dynamically maintain the result of the gather phase au and skip the gather on subsequent iterations.

The PowerGraph engine maintains a cache of the accumulator au from the previous gather phase for each vertex. The scatter function can optionally return an additional ∆a which is atomically added to the cached accumulator av of the neighboring vertex v using the sum function. If ∆a is not returned, then the neighbor's cached av is cleared, forcing a complete gather on the subsequent execution of the vertex-program on the vertex v. When executing the vertex-program on v the PowerGraph engine uses the cached av if available, bypassing the gather phase.

Intuitively, ∆a acts as an additive correction on top of the previous gather for that edge. More formally, if the accumulator type forms an abelian group (it has a commutative and associative sum (+) and an inverse (−) operation), then we can define (shortening gather to g):

    \Delta a = g\big(D_u, D_{(u,v)}^{\mathrm{new}}, D_v^{\mathrm{new}}\big) - g\big(D_u, D_{(u,v)}, D_v\big).    (4.1)

In the PageRank example (Fig. 3) we take advantage of the abelian nature of the PageRank sum operation. For graph coloring the set union operation is not abelian and so we invalidate the accumulator.
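The following minimal C++ check illustrates the identity in Eq. 4.1 for PageRank: when one in-neighbor's rank changes, adding the returned ∆a to the cached accumulator produces the same value as a complete re-gather. All numbers are made up for illustration.

// Numeric check of Eq. 4.1 for PageRank: updating a cached accumulator with
// Δa = g(new) − g(old) matches a full re-gather.
#include <cstdio>

int main() {
  // Vertex u has three in-neighbors; g(...) = rank(v) / #outNbrs(v).
  double rank[3]   = {1.00, 0.50, 0.25};
  double outdeg[3] = {4, 2, 1};

  double a_u = 0.0;                       // cached accumulator from the last gather
  for (int v = 0; v < 3; ++v) a_u += rank[v] / outdeg[v];

  // Neighbor 1 changes its rank; its scatter returns the additive correction Δa.
  double old_rank = rank[1];
  rank[1] = 0.80;
  double delta = rank[1] / outdeg[1] - old_rank / outdeg[1];
  double cached = a_u + delta;            // sum(a_u, Δa), skipping the gather

  double full = 0.0;                      // what a complete re-gather would produce
  for (int v = 0; v < 3; ++v) full += rank[v] / outdeg[v];

  std::printf("cached+delta = %.6f, full gather = %.6f\n", cached, full);
  return 0;
}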

4.3 Initiating Future Computation

The PowerGraph engine maintains a set of active vertices on which to eventually execute the vertex-program. The user initiates computation by calling Activate(v) or Activate_all(). The PowerGraph engine then proceeds to execute the vertex-program on the active vertices until none remain. Once a vertex-program completes the scatter phase it becomes inactive until it is reactivated.

Vertices can activate themselves and neighboring vertices. Each function in a vertex-program can only activate vertices visible in the arguments to that function. For example the scatter function invoked on the edge (u, v) can only activate the vertices u and v. This restriction is essential to ensure that activation events are generated on machines on which they can be efficiently processed.

The order in which activated vertices are executed is up to the PowerGraph execution engine. The only guarantee is that all activated vertices are eventually executed. This flexibility in scheduling enables PowerGraph programs to be executed both synchronously and asynchronously, leading to different tradeoffs in algorithm performance, system performance, and determinism.

4.3.1 Bulk Synchronous Execution

When run synchronously, the PowerGraph engine executes the gather, apply, and scatter phases in order. Each phase, called a minor-step, is run synchronously on all active vertices with a barrier at the end. We define a super-step as a complete series of GAS minor-steps. Changes made to the vertex data and edge data are committed at the end of each minor-step and are visible in the subsequent minor-step. Vertices activated in each super-step are executed in the subsequent super-step.

The synchronous execution model ensures a deterministic execution regardless of the number of machines and closely resembles Pregel. However, the frequent barriers and inability to operate on the most recent data can lead to an inefficient distributed execution and slow algorithm convergence. To address these limitations PowerGraph also supports asynchronous execution.

4.3.2 Asynchronous Execution

When run asynchronously, the PowerGraph engine executes active vertices as processor and network resources become available. Changes made to the vertex and edge data during the apply and scatter functions are immediately committed to the graph and visible to subsequent computation on neighboring vertices.

By using processor and network resources as they become available and making any changes to the data-graph immediately visible to future computation, an asynchronous execution can more effectively utilize resources and accelerate the convergence of the underlying algorithm. For example, the greedy graph-coloring algorithm in Fig. 3 will not converge when executed synchronously but converges quickly when executed asynchronously. The merits of asynchronous computation have been studied extensively in the context of numerical algorithms [4]. In [18, 19, 29] we demonstrated that asynchronous computation can lead to both theoretical and empirical gains in algorithm and system performance for a range of important MLDM applications.

Unfortunately, the behavior of the asynchronous execution depends on the number of machines and the availability of network resources, leading to non-determinism that can complicate algorithm design and debugging. Furthermore, for some algorithms, like statistical simulation, the resulting non-determinism, if not carefully controlled, can lead to instability or even divergence [17].

To address these challenges, GraphLab automatically enforces serializability: every parallel execution of vertex-programs has a corresponding sequential execution. In [29] it was shown that serializability is sufficient to support a wide range of MLDM algorithms. To achieve serializability, GraphLab prevents adjacent vertex-programs from running concurrently using a fine-grained locking protocol which requires sequentially grabbing locks on all neighboring vertices. Furthermore, the locking scheme used by GraphLab is unfair to high degree vertices.

PowerGraph retains the strong serializability guarantees of GraphLab while addressing its limitations. We address the problem of sequential locking by introducing a new parallel locking protocol (described in Sec. 7.4) which is fair to high degree vertices. In addition, the PowerGraph abstraction exposes substantially more fine-grained (edge-level) parallelism, allowing the entire cluster to support the execution of individual vertex-programs.

4.4 Comparison with GraphLab / Pregel

Surprisingly, despite the strong constraints imposed by the PowerGraph abstraction, it is possible to emulate both GraphLab and Pregel vertex-programs in PowerGraph. To emulate a GraphLab vertex-program, we use the gather and sum functions to concatenate all the data on adjacent vertices and edges and then run the GraphLab program within the apply function. Similarly, to express a Pregel vertex-program, we use the gather and sum functions to combine the inbound messages (stored as edge data) and concatenate the list of neighbors needed to compute the outbound messages. The Pregel vertex-program then runs within the apply function, generating the set of messages which are passed as vertex data to the scatter function where they are written back to the edges.

In order to address the challenges of natural graphs, the PowerGraph abstraction requires the size of the accumulator and the complexity of the apply function to be sub-linear in the degree. However, directly executing GraphLab and Pregel vertex-programs within the apply function leads the size of the accumulator and the complexity of the apply function to be linear in the degree, eliminating many of the benefits on natural graphs.

5 Distributed Graph Placement

The PowerGraph abstraction relies on the distributed data-graph to store the computation state and encode the interaction between vertex-programs. The placement of the data-graph structure and data plays a central role in minimizing communication and ensuring work balance.

Figure 4: (a) An edge-cut and (b) vertex-cut of a graph into three parts. Shaded vertices are ghosts and mirrors respectively.

A common approach to placing a graph on a cluster of p machines is to construct a balanced p-way edge-cut (e.g., Fig. 4a) in which vertices are evenly assigned to machines and the number of edges spanning machines is minimized. Unfortunately, the tools [23, 31] for constructing balanced edge-cuts perform poorly [1, 28, 26] on power-law graphs. When the graph is difficult to partition, both GraphLab and Pregel resort to hashed (random) vertex placement. While fast and easy to implement, hashed vertex placement cuts most of the edges:

Theorem 5.1. If vertices are randomly assigned to p machines then the expected fraction of edges cut is:

    \mathbb{E}\left[\frac{|\text{Edges Cut}|}{|E|}\right] = 1 - \frac{1}{p}.    (5.1)

For a power-law graph with exponent α, the expected number of edges cut per vertex is:

    \mathbb{E}\left[\frac{|\text{Edges Cut}|}{|V|}\right] = \left(1 - \frac{1}{p}\right) \mathbb{E}\left[D[v]\right] = \left(1 - \frac{1}{p}\right) \frac{h_{|V|}(\alpha - 1)}{h_{|V|}(\alpha)},    (5.2)

where h_{|V|}(\alpha) = \sum_{d=1}^{|V|-1} d^{-\alpha} is the normalizing constant of the power-law Zipf distribution.

Proof. An edge is cut if both vertices are randomly assigned to different machines. The probability that both vertices are assigned to different machines is 1 − 1/p.

Every cut edge contributes to storage and network overhead since both machines maintain a copy of the adjacency information and, in some cases [20], a ghost (local copy) of the vertex and edge data. For example in Fig. 4a we construct a three-way edge-cut of a four vertex graph resulting in five ghost vertices and all edge data being replicated. Any changes to vertex and edge data associated with a cut edge must be synchronized across the network. For example, using just two machines, a random cut will cut roughly half the edges, requiring |E|/2 communication.

Figure 5: The communication pattern of the PowerGraph abstraction when using a vertex-cut: (1) gather, (2) accumulator (partial sum) sent from each mirror to the master, (3) apply, (4) updated vertex data sent to the mirrors, (5) scatter. The gather function runs locally on each machine and then one accumulator is sent from each mirror to the master. The master runs the apply function and then sends the updated vertex data to all mirrors. Finally the scatter phase is run in parallel on mirrors.

5.1 Balanced p-way Vertex-Cut

By factoring the vertex-program along the edges in the graph, the PowerGraph abstraction allows a single vertex-program to span multiple machines. In Fig. 5 a single high degree vertex-program has been split across two machines with the gather and scatter functions running in parallel on each machine and the accumulator and vertex data being exchanged across the network.

Because the PowerGraph abstraction allows a single vertex-program to span multiple machines, we can improve work balance and reduce communication and storage overhead by evenly assigning edges to machines and allowing vertices to span machines. Each machine only stores the edge information for the edges assigned to that machine, evenly distributing the massive amounts of edge data. Since each edge is stored exactly once, changes to edge data do not need to be communicated. However, changes to vertex data must be copied to all the machines the vertex spans, thus the storage and network overhead depend on the number of machines spanned by each vertex.

We minimize storage and network overhead by limiting the number of machines spanned by each vertex. A balanced p-way vertex-cut formalizes this objective by assigning each edge e ∈ E to a machine A(e) ∈ {1, ..., p}. Each vertex then spans the set of machines A(v) ⊆ {1, ..., p} that contain its adjacent edges. We define the balanced vertex-cut objective:

    \min_{A} \; \frac{1}{|V|} \sum_{v \in V} |A(v)|    (5.3)

    \text{s.t.} \quad \max_{m} \; |\{e \in E \mid A(e) = m\}| < \lambda \frac{|E|}{p},    (5.4)

where the imbalance factor λ ≥ 1 is a small constant. We use the term replicas of a vertex v to denote the |A(v)| copies of the vertex v: each machine in A(v) has a replica of v. Because changes to vertex data are communicated to all replicas, the communication overhead is also given by |A(v)|. The objective (Eq. 5.3) therefore minimizes the average number of replicas in the graph and as a consequence the total storage and communication requirements of the PowerGraph engine.

For each vertex v with multiple replicas, one of the replicas is randomly nominated as the master which maintains the master version of the vertex data. All remaining replicas of v are then mirrors and maintain a local cached read-only copy of the vertex data (e.g., Fig. 4b). For instance, in Fig. 4b we construct a three-way vertex-cut of a graph yielding only 2 mirrors. Any changes to the vertex data (e.g., the apply function) must be made to the master which is then immediately replicated to all mirrors.

Vertex-cuts address the major issues associated with edge-cuts in power-law graphs. Percolation theory [3] suggests that power-law graphs have good vertex-cuts. Intuitively, by cutting a small fraction of the very high degree vertices we can quickly shatter a graph. Furthermore, because the balance constraint (Eq. 5.4) ensures that edges are uniformly distributed over machines, we naturally achieve improved work balance even in the presence of very high-degree vertices.

The simplest method to construct a vertex-cut is to randomly assign edges to machines. Random (hashed) edge placement is fully data-parallel, achieves nearly perfect balance on large graphs, and can be applied in the streaming setting. In the following theorem, we relate the expected normalized replication factor (Eq. 5.3) to the number of machines and the power-law constant α.

Theorem 5.2 (Randomized Vertex Cuts). A random vertex-cut on p machines has an expected replication:

    \mathbb{E}\left[\frac{1}{|V|} \sum_{v \in V} |A(v)|\right] = \frac{p}{|V|} \sum_{v \in V} \left(1 - \left(1 - \frac{1}{p}\right)^{D[v]}\right),    (5.5)

where D[v] denotes the degree of vertex v. For a power-law graph the expected replication (Fig. 6a) is determined entirely by the power-law constant α:

    \mathbb{E}\left[\frac{1}{|V|} \sum_{v \in V} |A(v)|\right] = p - \frac{p}{h_{|V|}(\alpha)} \sum_{d=1}^{|V|-1} \left(\frac{p-1}{p}\right)^{d} d^{-\alpha},    (5.6)

where h_{|V|}(\alpha) = \sum_{d=1}^{|V|-1} d^{-\alpha} is the normalizing constant of the power-law Zipf distribution.

Proof. By linearity of expectation:

    \mathbb{E}\left[\frac{1}{|V|} \sum_{v \in V} |A(v)|\right] = \frac{1}{|V|} \sum_{v \in V} \mathbb{E}\left[|A(v)|\right].    (5.7)

The expected replication E[|A(v)|] of a single vertex v can be computed by considering the process of randomly assigning the D[v] edges adjacent to v. Let the indicator Xi denote the event that vertex v has at least one of its edges on machine i. The expectation E[Xi] is then:

    \mathbb{E}[X_i] = 1 - P(\text{v has no edges on machine } i)    (5.8)
                    = 1 - \left(1 - \frac{1}{p}\right)^{D[v]}.    (5.9)

The expected replication factor for vertex v is then:

    \mathbb{E}\left[|A(v)|\right] = \sum_{i=1}^{p} \mathbb{E}[X_i] = p\left(1 - \left(1 - \frac{1}{p}\right)^{D[v]}\right).    (5.10)

Treating D[v] as a Zipf random variable:

    \mathbb{E}\left[\frac{1}{|V|} \sum_{v \in V} |A(v)|\right] = \frac{p}{|V|} \sum_{v \in V} \left(1 - \mathbb{E}\left[\left(\frac{p-1}{p}\right)^{D[v]}\right]\right),    (5.11)

and taking the expectation under P(d) = d^{-α} / h_{|V|}(α):

    \mathbb{E}\left[\left(1 - \frac{1}{p}\right)^{D[v]}\right] = \frac{1}{h_{|V|}(\alpha)} \sum_{d=1}^{|V|-1} \left(1 - \frac{1}{p}\right)^{d} d^{-\alpha}.    (5.12)
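The following C++ sketch evaluates the closed form in Eq. 5.6 for a few cluster sizes. The values of |V| and α are illustrative choices; the point is only to show how the expected replication factor of a random vertex-cut is computed from the power-law constant.

// Evaluate the expected replication factor of a random vertex-cut (Eq. 5.6)
// for a power-law graph. |V|, α, and the machine counts are illustrative.
#include <cmath>
#include <cstdio>

int main() {
  const long long nv = 1000000;   // |V|
  const double alpha = 2.0;

  // h_|V|(α) = Σ_{d=1}^{|V|-1} d^(-α), the Zipf normalizing constant.
  double hval = 0.0;
  for (long long d = 1; d < nv; ++d) hval += std::pow((double)d, -alpha);

  for (int p : {8, 16, 32, 64, 128}) {
    double s = 0.0;
    for (long long d = 1; d < nv; ++d)
      s += std::pow((p - 1.0) / p, (double)d) * std::pow((double)d, -alpha);
    double replication = p - (p / hval) * s;   // Eq. 5.6
    std::printf("p=%3d machines: expected replication factor %.2f\n", p, replication);
  }
  return 0;
}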

Figure 6: (a) Expected replication factor for different power-law constants. (b) The ratio of the expected communication and storage cost of random edge-cuts to random vertex-cuts as a function of the number of machines. This graph assumes that edge data and vertex data are the same size.

Table 1: (a) A collection of real-world graphs. (b) Randomly constructed ten-million vertex power-law graphs with varying α. Smaller α produces denser graphs.

(a) Real-world graphs
  Graph              |V|       |E|
  Twitter [24]       41M       1.4B
  UK [7]             132.8M    5.5B
  Amazon [6, 5]      0.7M      5.2M
  LiveJournal [12]   5.4M      79M
  Hollywood [6, 5]   2.2M      229M

(b) Synthetic graphs
  α      # Edges
  1.8    641,383,778
  1.9    245,040,680
  2.0    102,838,432
  2.1    57,134,471
  2.2    35,001,696

While lower α values (more high-degree vertices) imply a higher replication factor (Fig. 6a), the effective gains of vertex-cuts relative to edge-cuts (Fig. 6b) actually increase with lower α. In Fig. 6b we plot the ratio of the expected costs (communication and storage) of random edge-cuts (Eq. 5.2) to the expected costs of random vertex-cuts (Eq. 5.6), demonstrating order of magnitude gains.

Finally, the vertex-cut model is also highly effective for regular graphs since, in the event that a good edge-cut can be found, it can be converted to a better vertex-cut:

Theorem 5.3. For a given edge-cut with g ghosts, any vertex-cut along the same partition boundary has strictly fewer than g mirrors.

Proof of Theorem 5.3. Consider the two-way edge-cut which cuts the set of edges E′ ⊆ E and let V′ be the set of vertices in E′. The total number of ghosts induced by this edge partition is therefore |V′|. If we then select and delete arbitrary vertices from V′ along with their adjacent edges until no edges remain, then the set of deleted vertices corresponds to a vertex-cut in the original graph. Since at most |V′| − 1 vertices may be deleted, there can be at most |V′| − 1 mirrors.

5.2 Greedy Vertex-Cuts

We can improve upon the randomly constructed vertex-cut by de-randomizing the edge-placement process. The resulting algorithm is a sequential greedy heuristic which places the next edge on the machine that minimizes the conditional expected replication factor. To construct the de-randomization we consider the task of placing the (i + 1)-th edge after having placed the previous i edges. Using the conditional expectation we define the objective:

    \arg\min_{k} \; \mathbb{E}\left[\sum_{v \in V} |A(v)| \;\middle|\; A_i, \, A(e_{i+1}) = k\right],    (5.13)

where A_i is the assignment for the previous i edges. Using Theorem 5.2 to evaluate Eq. 5.13 we obtain the following edge placement rules for the edge (u, v):

Case 1: If A(u) and A(v) intersect, then the edge should be assigned to a machine in the intersection.

Case 2: If A(u) and A(v) are not empty and do not intersect, then the edge should be assigned to one of the machines from the vertex with the most unassigned edges.

Case 3: If only one of the two vertices has been assigned, then choose a machine from the assigned vertex.

Case 4: If neither vertex has been assigned, then assign the edge to the least loaded machine.

Because the greedy heuristic is a de-randomization it is guaranteed to obtain an expected replication factor that is no worse than random placement and in practice can be much better. Unlike the randomized algorithm, which is embarrassingly parallel and easily distributed, the greedy algorithm requires coordination between machines. We consider two distributed implementations:

Coordinated: maintains the values of A_i(v) in a distributed table. Each machine runs the greedy heuristic and periodically updates the distributed table. Local caching is used to reduce communication at the expense of accuracy in the estimate of A_i(v).

Oblivious: runs the greedy heuristic independently on each machine. Each machine maintains its own estimate of A_i with no additional communication.
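A single-machine C++ sketch of the four placement rules above is given below. It is a simplified illustration, not the coordinated or oblivious implementation: in particular, Case 2 approximates "the vertex with the most unassigned edges" by preferring the vertex with fewer placed replicas, since the sketch does not track remaining degree, and ties are broken by machine load.

// Sketch of the greedy vertex-cut heuristic (Cases 1-4) for placing one edge.
#include <algorithm>
#include <cstdio>
#include <iterator>
#include <set>
#include <vector>

struct Placer {
  int p;                                        // number of machines
  std::vector<long long> load;                  // edges per machine
  std::vector<std::set<int>> A;                 // A(v): machines spanned by vertex v

  Placer(int num_machines, int num_vertices)
      : p(num_machines), load(num_machines, 0), A(num_vertices) {}

  int least_loaded(const std::set<int>& candidates) const {
    int best = -1;
    for (int m : candidates)
      if (best < 0 || load[m] < load[best]) best = m;
    return best;
  }

  int place(int u, int v) {
    std::set<int> both;
    std::set_intersection(A[u].begin(), A[u].end(), A[v].begin(), A[v].end(),
                          std::inserter(both, both.begin()));
    int m;
    if (!both.empty()) {                         // Case 1: assign within the intersection
      m = least_loaded(both);
    } else if (!A[u].empty() && !A[v].empty()) { // Case 2: pick a machine from one endpoint
      m = least_loaded(A[u].size() >= A[v].size() ? A[v] : A[u]);  // approximation, see lead-in
    } else if (!A[u].empty() || !A[v].empty()) { // Case 3: only one endpoint already placed
      m = least_loaded(A[u].empty() ? A[v] : A[u]);
    } else {                                     // Case 4: neither placed -> least loaded machine
      m = 0;
      for (int i = 1; i < p; ++i) if (load[i] < load[m]) m = i;
    }
    A[u].insert(m); A[v].insert(m); ++load[m];
    return m;
  }
};

int main() {
  Placer placer(4, 6);
  int edges[][2] = {{0,1},{0,2},{0,3},{1,2},{2,3},{3,4},{4,5},{0,5}};
  for (auto& e : edges)
    std::printf("edge (%d,%d) -> machine %d\n", e[0], e[1], placer.place(e[0], e[1]));
  return 0;
}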

Figure 7: (a) The actual replication factor on 32 machines. (b) The effect of partitioning on runtime.

Figure 8: (a, b) Replication factor and runtime of graph ingress for the Twitter follower network as a function of the number of machines for random, oblivious, and coordinated vertex-cuts.

In Fig. 8a, we compare the replication factor of both heuristics against random vertex-cuts on the Twitter follower network. We plot the replication factor as a function of the number of machines (EC2 instances described in Sec. 7) and find that random vertex-cuts match the predicted replication given in Theorem 5.2. Furthermore, the greedy heuristics substantially improve upon random placement with an order of magnitude reduction in the replication factor, and therefore communication and storage costs. For a fixed number of machines (p = 32), we evaluated (Fig. 7a) the replication factor of the two heuristics on five real-world graphs (Tab. 1a). In all cases the greedy heuristics out-perform random placement, while doubling the load time (Fig. 8b). The Oblivious heuristic achieves a compromise by obtaining a relatively low replication factor while only slightly increasing runtime.

6 Abstraction Comparison

In this section, we experimentally characterize the dependence on α and the relationship between fan-in and fan-out by using the Pregel, GraphLab, and PowerGraph abstractions to run PageRank on five synthetically constructed power-law graphs. Each graph has ten-million vertices and an α ranging from 1.8 to 2.2. The graphs were constructed by randomly sampling the out-degree of each vertex from a Zipf distribution and then adding out-edges such that the in-degree of each vertex is nearly identical. We then inverted each graph to obtain the corresponding power-law fan-in graph. The density of each power-law graph is determined by α and therefore each graph has a different number of edges (see Tab. 1b).

We used the GraphLab v1 C++ implementation from [29] and added instrumentation to track network usage. As of the writing of this paper, public implementations of Pregel (e.g., Giraph) were unable to handle even our smaller synthetic problems due to memory limitations. Consequently, we used Piccolo [32] as a proxy implementation of Pregel since Piccolo naturally expresses the Pregel abstraction and provides an efficient C++ implementation with dynamic load-balancing. Finally, we used our implementation of PowerGraph described in Sec. 7. All experiments in this section are evaluated on an eight node Linux cluster. Each node consists of two quad-core Intel Xeon E5620 processors with 32 GB of RAM and is connected via 1-GigE Ethernet. All systems were compiled with GCC 4.4. GraphLab and Piccolo used random edge-cuts while PowerGraph used random vertex-cuts. Results are averaged over 20 iterations.

6.1 Computation Imbalance

The sequential component of the PageRank vertex-program is proportional to out-degree in the Pregel abstraction and in-degree in the GraphLab abstraction. Alternatively, PowerGraph eliminates this sequential dependence by distributing the computation of individual vertex-programs over multiple machines. Therefore we expect highly-skewed (low α) power-law graphs to increase work imbalance under the Pregel (fan-in) and GraphLab (fan-out) abstractions but not under the PowerGraph abstraction, which evenly distributes high-degree vertex-programs. To evaluate this hypothesis we ran eight "workers" per system (64 total workers) and recorded the vertex-program time on each worker.

In Fig. 9a and Fig. 9b we plot the standard deviation of worker per-iteration runtimes, a measure of work imbalance, for power-law fan-in and fan-out graphs respectively. Higher standard deviation implies greater imbalance. While lower α increases work imbalance for GraphLab (on fan-in) and Pregel (on fan-out), the PowerGraph abstraction is unaffected in either edge direction.

6.2 Communication Imbalance

Because GraphLab and Pregel use edge-cuts, their communication volume is proportional to the number of ghosts: the replicated vertex and edge data along the partition boundary. If one message is sent per edge, Pregel's combiners ensure that exactly one network message is transmitted for each ghost. Similarly, at the end of each iteration GraphLab synchronizes each ghost and thus the communication volume is also proportional to the number of ghosts. PowerGraph on the other hand uses vertex-cuts and only synchronizes mirrors after each iteration. The communication volume of a complete iteration is therefore proportional to the number of mirrors induced by the vertex-cut. As a consequence we expect that PowerGraph will reduce communication volume.

Figure 9: Synthetic Experiments: Work Imbalance and Communication. (a, b) Standard deviation of worker computation time across 8 distributed workers for each abstraction on power-law fan-in and fan-out graphs. (c, d) Bytes communicated per iteration for each abstraction on power-law fan-in and fan-out graphs.

In Fig. 9c and Fig. 9d we plot the bytes communicated per iteration for all three systems under power-law fan-in and fan-out graphs. Because Pregel only sends messages along out-edges, Pregel communicates more on power-law fan-out graphs than on power-law fan-in graphs. On the other hand, GraphLab and PowerGraph's communication volume is invariant to power-law fan-in and fan-out since neither considers edge direction during data-synchronization. However, PowerGraph communicates significantly less than GraphLab which is a direct result of the efficacy of vertex-cuts. Finally, PowerGraph's total communication increases only marginally on the denser graphs and is the lowest overall.

6.3 Runtime Comparison

PowerGraph significantly out-performs GraphLab and Pregel on low α graphs. In Fig. 10a and Fig. 10b we plot the per iteration runtime for each abstraction. In both cases the overall runtime performance closely matches the communication overhead (Fig. 9c and Fig. 9d) while the computation imbalance (Fig. 9a and Fig. 9b) appears to have little effect. The limited effect of imbalance is due to the relatively lightweight nature of the PageRank computation and we expect more complex algorithms (e.g., statistical inference) to be more susceptible to imbalance. However, when greedy (coordinated) partitioning is used we see an additional 25% to 50% improvement in runtime.

Figure 10: Synthetic Experiments: Runtime. (a, b) Per iteration runtime of each abstraction on synthetic power-law fan-in and fan-out graphs.

7 Implementation and Evaluation

In this section, we describe and evaluate our implementation of the PowerGraph system. All experiments are performed on a 64 node cluster of Amazon EC2 cc1.4xlarge Linux instances. Each instance has two quad-core Intel Xeon X5570 processors with 23GB of RAM, and is connected via 10 GigE Ethernet. PowerGraph was written in C++ and compiled with GCC 4.5.

We implemented three variations of the PowerGraph abstraction. To demonstrate their relative implementation complexity, we provide the line counts, excluding common support code:

Bulk Synchronous (Sync): A fully synchronous implementation of PowerGraph as described in Sec. 4.3.1. [600 lines]

Asynchronous (Async): An asynchronous implementation of PowerGraph which allows arbitrary interleaving of vertex-programs (Sec. 4.3.2). [900 lines]

Asynchronous Serializable (Async+S): An asynchronous implementation of PowerGraph which guarantees serializability of all vertex-programs (equivalent to "edge consistency" in GraphLab). [1600 lines]

In all cases the system is entirely symmetric with no single coordinating instance or scheduler. Each instance is given the list of other machines and starts by reading a unique subset of the graph data files from HDFS. TCP connections are opened with other machines as needed to build the distributed graph and run the engine.

7.1 Graph Loading and Placement

The graph structure and data are loaded from a collection of text files stored in a distributed file-system (HDFS) by all instances in parallel. Each machine loads a separate subset of files (determined by hashing) and applies one of the three distributed graph partitioning algorithms to place the data as it is loaded. As a consequence partitioning is accomplished in parallel and data is immediately placed in its final location. Unless specified, all experiments were performed using the oblivious algorithm. Once computation is complete, the final vertex and edge data are saved back to the distributed file-system in parallel.
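The C++ sketch below illustrates the loading scheme just described under stated assumptions: file names are hashed to decide which machine parses them, and each parsed edge is hashed to a machine, corresponding to the random (hashed) vertex-cut placement. The file names, hash choices, and edge format are illustrative; this is not the PowerGraph ingress code or its HDFS interface.

// Sketch of parallel graph ingress: each machine loads the files that hash to it
// and assigns each edge to a machine as it is parsed (random vertex-cut placement).
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

int owner(const std::string& name, int p) {              // which machine loads this file
  return (int)(std::hash<std::string>{}(name) % p);
}

int edge_machine(long long src, long long dst, int p) {  // random (hashed) edge placement
  return (int)(std::hash<long long>{}(src * 1000003LL + dst) % p);
}

int main() {
  const int p = 4, my_id = 2;                             // this machine's id in the cluster
  std::vector<std::string> files = {"part-00000", "part-00001", "part-00002", "part-00003"};

  for (const auto& f : files) {
    if (owner(f, p) != my_id) continue;                   // another machine loads this file
    // A real loader would stream "src dst" pairs out of f; here we fake two edges.
    long long edges[][2] = {{1, 2}, {2, 3}};
    for (auto& e : edges)
      std::printf("%s: edge (%lld,%lld) placed on machine %d\n",
                  f.c_str(), e[0], e[1], edge_machine(e[0], e[1], p));
  }
  return 0;
}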

In Fig. 7b, we evaluate the performance of a collection of algorithms varying the partitioning procedure. Our simple partitioning heuristics are able to improve performance significantly across all algorithms, decreasing runtime and memory utilization. Furthermore, the runtime scales linearly with the replication factor: halving the replication factor approximately halves runtime.

7.2 Synchronous Engine (Sync)

Our synchronous implementation closely follows the description in Sec. 4.3.1. Each machine runs a single multi-threaded instance to maximally utilize the multi-core architecture. We rely on background communication to achieve computation/communication interleaving. The synchronous engine's fully deterministic execution makes it easy to reason about programmatically and minimizes the effort needed for tuning and performance optimizations.

In Fig. 11a and Fig. 11b we plot the runtime and total communication of one iteration of PageRank on the Twitter follower network for each partitioning method. To provide a point of comparison (Tab. 2), the Spark [37] framework computes one iteration of PageRank on the same graph in 97.4s on a 50 node, 100 core cluster [35]. PowerGraph is therefore between 3-8x faster than Spark on a comparable number of cores. On the full cluster of 512 cores, we can compute one iteration in 3.6s.

The greedy partitioning heuristics improve both performance and scalability of the engine at the cost of increased load-time. The load times for random, oblivious, and coordinated placement were 59, 105, and 239 seconds respectively. While greedy partitioning heuristics increased load-time by up to a factor of four, they still improve overall runtime if more than 20 iterations of PageRank are performed. In Fig. 11c we plot the runtime of each iteration of PageRank on the Twitter follower network. Delta caching improves performance by avoiding unnecessary gather computation, decreasing total runtime by 45%. Finally, in Fig. 11d we evaluate weak-scaling: the ability to scale while keeping the problem size per processor constant. We run SSSP (Fig. 3) on synthetic power-law graphs (α = 2), with ten-million vertices per machine. Our implementation demonstrates nearly optimal weak-scaling and requires only 65s to solve a 6.4B edge graph.

Figure 11: Synchronous Experiments. (a, b) Synchronous PageRank scaling on the Twitter graph. (c) The PageRank per iteration runtime on the Twitter graph with and without delta caching. (d) Weak scaling of SSSP on synthetic graphs.

7.3 Asynchronous Engine (Async)

We implemented the asynchronous PowerGraph execution model (Sec. 4.3.2) using a simple state machine for each vertex which can be either: INACTIVE, GATHER, APPLY or SCATTER. Once activated, a vertex enters the gathering state and is placed in a local scheduler which assigns cores to active vertices, allowing many vertex-programs to run simultaneously and thereby hiding communication latency. While arbitrary interleaving of vertex-programs is permitted, we avoid data races by ensuring that individual gather, apply, and scatter calls have exclusive access to their arguments.

We evaluate the performance of the Async engine by running PageRank on the Twitter follower network. In Fig. 12a, we plot throughput (number of vertex-program operations per second) against the number of machines. Throughput increases moderately with both the number of machines as well as improved partitioning. We evaluate the gains associated with delta caching (Sec. 4.2) by measuring throughput as a function of time (Fig. 12b) with caching enabled and with caching disabled. Caching allows the algorithm to converge faster with fewer operations. Surprisingly, when caching is disabled, the throughput increases over time. Further analysis reveals that the computation gradually focuses on high-degree vertices, increasing the computation/communication ratio.

We evaluate the graph coloring vertex-program (Fig. 3) which cannot be run synchronously since all vertices would change to the same color on every iteration. Graph coloring is a proxy for many MLDM algorithms [17]. In Fig. 12c we evaluate weak-scaling on synthetic power-law graphs (α = 2) with five-million vertices per machine and find that the Async engine performs nearly optimally. The slight increase in runtime may be attributed to an increase in the number of colors due to increasing graph size.

Figure 12: Asynchronous Experiments. (a) Number of user operations (gather/apply/scatter) issued per second by Dynamic PageRank as the number of machines is increased. (b) Total number of user ops with and without caching plotted against time. (c) Weak scaling of the graph coloring task using the Async engine and the Async+S engine. (d) Proportion of non-conflicting edges across time on an 8 machine, 40M vertex instance of the problem. The green line is the rate of conflicting edges introduced by the lack of consistency (peak 236K edges per second) in the Async engine. When the Async+S engine is used no conflicting edges are ever introduced.
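A minimal C++ sketch of the per-vertex state machine described above is given below (INACTIVE → GATHER → APPLY → SCATTER → INACTIVE). It is a single-threaded illustration under stated assumptions: the worker threads, core assignment, and network layer of the real engine are omitted, and the class and function names are invented for the example.

// Sketch of the per-vertex state machine used by the asynchronous engine.
#include <cstdio>
#include <queue>
#include <vector>

enum class State { INACTIVE, GATHER, APPLY, SCATTER };

struct Engine {
  std::vector<State> state;
  std::queue<int> scheduler;                 // local queue of activated vertices

  explicit Engine(int n) : state(n, State::INACTIVE) {}

  void activate(int v) {
    // An already-active vertex is not scheduled again.
    if (state[v] == State::INACTIVE) { state[v] = State::GATHER; scheduler.push(v); }
  }

  // One scheduler step: advance a single vertex through its GAS phases.
  void step() {
    if (scheduler.empty()) return;
    int v = scheduler.front(); scheduler.pop();
    state[v] = State::GATHER;  std::printf("vertex %d: gather\n", v);
    state[v] = State::APPLY;   std::printf("vertex %d: apply\n", v);
    state[v] = State::SCATTER; std::printf("vertex %d: scatter\n", v);
    state[v] = State::INACTIVE;              // becomes inactive until reactivated
  }
};

int main() {
  Engine e(4);
  e.activate(1); e.activate(3); e.activate(1);   // duplicate activation is absorbed
  while (!e.scheduler.empty()) e.step();
  return 0;
}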

[Figure 11 omitted: (a) Twitter PageRank Runtime, (b) Twitter PageRank Comms, (c) Twitter PageRank Delta Cache, (d) SSSP Weak Scaling.]

Figure 11: Synchronous Experiments. (a,b) Synchronous PageRank scaling on the Twitter graph. (c) The PageRank per-iteration runtime on the Twitter graph with and without delta caching. (d) Weak scaling of SSSP on synthetic graphs.
[Figure 12 omitted: (a) Twitter PageRank Throughput, (b) Twitter PageRank Delta Cache, (c) Coloring Weak Scaling, (d) Coloring Conflict Rate.]

Figure 12: Asynchronous Experiments. (a) Number of user operations (gather/apply/scatter) issued per second by dynamic PageRank as the number of machines is increased. (b) Total number of user ops with and without caching plotted against time. (c) Weak scaling of the graph coloring task using the Async engine and the Async+S engine. (d) Proportion of non-conflicting edges over time on an 8-machine, 40M-vertex instance of the problem. The green line is the rate of conflicting edges introduced by the lack of consistency (peak 236K edges per second) in the Async engine. When the Async+S engine is used, no conflicting edges are ever introduced.
We evaluate the scalability and computational efficiency of the Async+S engine on the graph coloring task. We observe in Fig. 12c that the amount of achieved parallelism does not increase linearly with the number of vertices. Because the density (i.e., contention) of power-law graphs increases super-linearly with the number of vertices, we do not expect the amount of serializable parallelism to increase linearly.

In Fig. 12d, we plot the proportion of edges that satisfy the coloring condition (both endpoints have different colors) for both the Async and the Async+S engines. While the Async engine quickly satisfies the coloring condition for most edges, the remaining 1% take 34% of the runtime. We attribute this behavior to frequent races on tightly connected vertices. In contrast, the Async+S engine makes more uniform progress. If we examine the total number of user operations, we find that the Async engine does more than twice the work of the Async+S engine.
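For concreteness, the quantity tracked in Fig. 12d is just the fraction of edges whose endpoints currently differ in color; a small sketch of that check (illustrative, single-machine):

// Sketch: fraction of edges whose endpoints already have different colors,
// i.e., the coloring-condition proportion plotted in Fig. 12d.
#include <cstdio>
#include <utility>
#include <vector>

double non_conflicting_fraction(const std::vector<std::pair<int,int>>& edges,
                                const std::vector<int>& color) {
  if (edges.empty()) return 1.0;
  int ok = 0;
  for (const auto& e : edges)
    if (color[e.first] != color[e.second]) ++ok;   // edge satisfies the coloring condition
  return static_cast<double>(ok) / edges.size();
}

int main() {
  std::vector<std::pair<int,int>> edges = {{0,1}, {1,2}, {2,0}, {2,3}};
  std::vector<int> color = {0, 1, 0, 0};            // edge (2,3) still conflicts
  std::printf("%.2f of edges are non-conflicting\n",
              non_conflicting_fraction(edges, color));
  return 0;
}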
[Figure 13 omitted: (a) ALS Throughput, (b) ALS Convergence.]

Figure 13: (a) The throughput of ALS measured in millions of user operations per second. (b) Training error (lower is better) as a function of running time for the ALS application.

Finally, we evaluate the Async and the Async+S engines on a popular machine learning algorithm: Alternating Least Squares (ALS). The ALS algorithm has a number of variations which allow it to be used in a wide range of applications including user personalization [38] and document semantic analysis [21]. We apply ALS to the Wikipedia term-document graph, consisting of 11M vertices and 315M edges, to extract a mixture-of-topics representation for each document and term. The number of topics d is a free parameter that determines the computational complexity, O(d³), of each vertex-program. In Fig. 13a, we plot the ALS throughput on the Async engine and the Async+S engine. While the throughput of the Async engine is greater, the gap between the engines shrinks as d increases and computation dominates the consistency overhead. To demonstrate the importance of serializability, we plot in Fig. 13b the training error, a measure of solution quality, for both engines. We observe that while the Async engine has greater throughput, the Async+S engine converges faster.
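To illustrate where the O(d³) term comes from: each ALS vertex-program gathers the d×d normal-equation terms from its neighbors' latent factors and solves the resulting dense system in apply. The sketch below is illustrative only (plain Gaussian elimination stands in for a proper Cholesky solve, and the data-graph plumbing is omitted); it is not the PowerGraph ALS implementation.

// Sketch of a single ALS vertex update: gather the d x d normal equations from
// neighboring factors, then solve them in apply (the O(d^3) step).
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Solve A x = b by Gaussian elimination, O(d^3); fine here since A is SPD.
Vec solve(Mat A, Vec b) {
  int d = static_cast<int>(b.size());
  for (int k = 0; k < d; ++k)
    for (int i = k + 1; i < d; ++i) {
      double f = A[i][k] / A[k][k];
      for (int j = k; j < d; ++j) A[i][j] -= f * A[k][j];
      b[i] -= f * b[k];
    }
  Vec x(d);
  for (int i = d - 1; i >= 0; --i) {
    double s = b[i];
    for (int j = i + 1; j < d; ++j) s -= A[i][j] * x[j];
    x[i] = s / A[i][i];
  }
  return x;
}

// One ALS "vertex-program": neighbors carry factors x_u and edge ratings r_uv.
Vec als_update(const std::vector<Vec>& nbr_factors, const Vec& ratings,
               int d, double lambda) {
  Mat A(d, Vec(d, 0.0));
  Vec b(d, 0.0);
  for (int i = 0; i < d; ++i) A[i][i] = lambda;          // regularization
  for (size_t u = 0; u < nbr_factors.size(); ++u)        // gather: sum x_u x_u^T and r*x_u
    for (int i = 0; i < d; ++i) {
      b[i] += ratings[u] * nbr_factors[u][i];
      for (int j = 0; j < d; ++j) A[i][j] += nbr_factors[u][i] * nbr_factors[u][j];
    }
  return solve(A, b);                                     // apply: the O(d^3) solve
}

int main() {
  int d = 2;
  std::vector<Vec> nbrs = {{1.0, 0.5}, {0.2, 1.0}};
  Vec ratings = {4.0, 2.0};
  Vec x = als_update(nbrs, ratings, d, 0.05);
  std::printf("new factor: [%f, %f]\n", x[0], x[1]);
  return 0;
}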
The complexity of the Async+S engine is justified by the necessity for serializability in many applications (e.g., ALS). Furthermore, serializability adds predictability to the nondeterministic asynchronous execution. For example, even graph coloring may not terminate on dense graphs unless serializability is ensured.
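To see how that can happen, consider two adjacent vertices whose programs run at the same time: each reads the other's old color, both switch to the same new color, and the conflict survives every round. A toy, deterministic illustration of that race (not PowerGraph code):

// Toy illustration of the race that can keep non-serializable coloring from terminating:
// two adjacent vertices read each other's color simultaneously, both pick the smallest
// color different from what they read, and end up conflicting again. Under serializable
// execution one of them would observe the other's new color first and the conflict would
// resolve.
#include <cstdio>

int smallest_color_not(int neighbor_color) {
  return neighbor_color == 0 ? 1 : 0;   // smallest color != the neighbor's
}

int main() {
  int color_a = 0, color_b = 0;         // adjacent vertices start in conflict
  for (int round = 0; round < 3; ++round) {
    // Both vertex-programs run "simultaneously": each sees the other's OLD color.
    int seen_by_a = color_b, seen_by_b = color_a;
    color_a = smallest_color_not(seen_by_a);
    color_b = smallest_color_not(seen_by_b);
    std::printf("round %d: a=%d b=%d (still conflicting: %s)\n",
                round, color_a, color_b, color_a == color_b ? "yes" : "no");
  }
  return 0;
}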
7.5 Fault Tolerance

Like GraphLab and Pregel, PowerGraph achieves fault-tolerance by saving a snapshot of the data-graph. The synchronous PowerGraph engine constructs the snapshot between super-steps, and the asynchronous engine suspends execution to construct the snapshot. An asynchronous snapshot using GraphLab's snapshot algorithm [29] can also be implemented. The checkpoint overhead, typically a few seconds for the largest graphs we considered, is small relative to the running time of each application.

PageRank             Runtime   |V|    |E|    System
Hadoop [22]          198s      –      1.1B   50x8
Spark [37]           97.4s     40M    1.5B   50x2
PowerGraph (Sync)    3.6s      40M    1.5B   64x8

Triangle Count       Runtime   |V|    |E|    System
Hadoop [36]          423m      40M    1.4B   1636x?
PowerGraph (Sync)    1.5m      40M    1.4B   64x16

LDA                  Tok/sec   Topics        System
Smola et al. [34]    150M      1000          100x8
PowerGraph (Async)   110M      1000          64x16

Table 2: Relative performance of PageRank, triangle counting, and LDA on similar graphs. PageRank runtime is measured per iteration. Both PageRank and triangle counting were run on the Twitter follower network, and LDA was run on Wikipedia. The systems are reported as number of nodes by number of cores.

7.6 MLDM Applications

In Tab. 2 we provide comparisons of the PowerGraph system with published results on similar data for PageRank, triangle counting [36], and collapsed Gibbs sampling for the LDA model [34]. The PowerGraph implementations of PageRank and triangle counting are one to two orders of magnitude faster than the published results. For LDA, the state-of-the-art solution is a heavily optimized system designed for this specific task by Smola et al. [34]. In contrast, PowerGraph is able to achieve comparable performance using only 200 lines of user code.
8 Related Work

The vertex-cut approach to distributed graph placement is related to work [9, 13] in hypergraph partitioning. In particular, a vertex-cut problem can be cast as a hypergraph-cut problem by converting each edge to a vertex and each vertex to a hyper-edge, as sketched below. However, existing hypergraph partitioning tools can be very time-intensive. While our cut objective is similar to the "communication volume" objective, the streaming vertex-cut setting described in this paper is novel. Stanton et al. [35] developed several heuristics for streaming edge-cuts but do not consider the vertex-cut problem.
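A minimal sketch of that reduction (the data structures are illustrative only): each graph edge becomes a hypergraph vertex, and each graph vertex becomes a hyperedge spanning its incident edges, so partitioning the hypergraph vertices with a small hyperedge cut corresponds to a vertex-cut with few replicated vertices.

// Sketch: cast a vertex-cut problem as a hypergraph-cut problem. Each edge of the
// graph becomes a hypergraph vertex (identified by its index), and each graph vertex
// becomes a hyperedge containing its incident edges.
#include <cstdio>
#include <utility>
#include <vector>

int main() {
  std::vector<std::pair<int,int>> edges = {{0,1}, {0,2}, {1,2}, {2,3}};
  int num_vertices = 4;

  // hyperedge[v] = indices of the edges incident to graph vertex v.
  std::vector<std::vector<int>> hyperedge(num_vertices);
  for (int e = 0; e < (int)edges.size(); ++e) {
    hyperedge[edges[e].first].push_back(e);
    hyperedge[edges[e].second].push_back(e);
  }

  for (int v = 0; v < num_vertices; ++v) {
    std::printf("hyperedge for vertex %d:", v);
    for (int e : hyperedge[v]) std::printf(" e%d", e);
    std::printf("\n");
  }
  return 0;
}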
Several systems [8, 22] have proposed generalized sparse matrix-vector multiplication as a basis for graph-parallel computation. These abstractions operate on commutative, associative semi-rings and therefore also have generalized gather and sum operations. However, they do not support the more general apply and scatter operations or mutable edge-data, and they are based on a strictly synchronous model in which all computation is run in every iteration.
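As a concrete, simplified illustration of that model (a generic sketch, not any particular system's API), one synchronous step is a sparse matrix-vector product over a semiring: with (min, +) it performs a Bellman-Ford-style shortest-path relaxation, and with (+, ×) it is the core of a PageRank iteration.

// Sketch of graph computation as generalized sparse matrix-vector multiplication over
// a semiring: y[v] = "sum" over in-edges (u,v) of "times"(w_uv, x[u]).
#include <cstdio>
#include <functional>
#include <limits>
#include <vector>

struct Edge { int src, dst; double w; };

std::vector<double> semiring_spmv(const std::vector<Edge>& edges, int n,
                                  const std::vector<double>& x, double zero,
                                  const std::function<double(double,double)>& plus,
                                  const std::function<double(double,double)>& times) {
  std::vector<double> y(n, zero);
  for (const Edge& e : edges) y[e.dst] = plus(y[e.dst], times(e.w, x[e.src]));
  return y;
}

int main() {
  const double inf = std::numeric_limits<double>::infinity();
  std::vector<Edge> edges = {{0,1,2.0}, {0,2,5.0}, {1,2,1.0}};
  std::vector<double> dist = {0.0, inf, inf};              // distances from vertex 0
  for (int iter = 0; iter < 2; ++iter) {                   // Bellman-Ford style relaxation
    auto relaxed = semiring_spmv(edges, 3, dist, inf,
        [](double a, double b) { return a < b ? a : b; },  // "plus"  = min
        [](double a, double b) { return a + b; });         // "times" = +
    for (int v = 0; v < 3; ++v) dist[v] = relaxed[v] < dist[v] ? relaxed[v] : dist[v];
  }
  std::printf("dist: %g %g %g\n", dist[0], dist[1], dist[2]);
  return 0;
}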
While we discuss Pregel and GraphLab in detail, there are other similar graph-parallel abstractions. Closely related to Pregel is BPGL [20], which implements a synchronous traveler model. Alternatively, Kineograph [11] presents a graph-parallel framework for time-evolving graphs which mixes features from both GraphLab and Piccolo. Pujol et al. [33] present a distributed graph database but do not explicitly consider the power-law structure. Finally, Kyrola et al. [25] present GraphChi: an efficient single-machine, disk-based implementation of the GraphLab abstraction. Impressively, it is able to significantly out-perform large Hadoop deployments on many graph problems while using only a single machine, performing one iteration of PageRank on the Twitter graph in only 158s (PowerGraph: 3.6s). The techniques described in GraphChi can be used to add out-of-core storage to PowerGraph.

9 Conclusions and Future Work

The need to reason about large-scale graph-structured data has driven the development of new graph-parallel abstractions such as GraphLab and Pregel. However, graphs derived from real-world phenomena often exhibit power-law degree distributions, which are difficult to partition and can lead to work imbalance and substantially increased communication and storage.

To address these challenges, we introduced the PowerGraph abstraction, which exploits the Gather-Apply-Scatter model of computation to factor vertex-programs over edges, splitting high-degree vertices and exposing greater parallelism in natural graphs. We then introduced vertex-cuts and a collection of fast greedy heuristics to substantially reduce the storage and communication costs of large distributed power-law graphs. We theoretically related the power-law constant to the communication and storage requirements of the PowerGraph system and empirically evaluated our analysis by comparing against GraphLab and Pregel. Finally, we evaluated the PowerGraph system on several large-scale problems using a 64-node EC2 cluster, demonstrating scalability, efficiency, and, in many cases, order-of-magnitude gains over published results.

We are actively using PowerGraph to explore new large-scale machine learning algorithms. We are beginning to study how vertex replication and data-dependencies can be used to support fault-tolerance without checkpointing. In addition, we are exploring ways to support time-evolving graph structures. Finally, we believe that many of the core ideas in the PowerGraph abstraction can have a significant impact on the design and implementation of graph-parallel systems beyond PowerGraph.
Acknowledgments

This work is supported by the ONR Young Investigator Program grant N00014-08-1-0752, the ARO under MURI W911NF0810242, the ONR PECASE-N00014-10-1-0672, and the National Science Foundation grant IIS-0803333, as well as the Intel Science and Technology Center for Cloud Computing. Joseph Gonzalez is supported by a Graduate Research Fellowship from the NSF. We would like to thank Alex Smola, Aapo Kyrola, Lidong Zhou, and the reviewers for their insightful guidance.

References

[1] Abou-Rjeili, A., and Karypis, G. Multilevel algorithms for partitioning power-law graphs. In IPDPS (2006).
[2] Ahmed, A., Aly, M., Gonzalez, J., Narayanamurthy, S., and Smola, A. J. Scalable inference in latent variable models. In WSDM (2012), pp. 123–132.
[3] Albert, R., Jeong, H., and Barabási, A. L. Error and attack tolerance of complex networks. Nature 406 (2000), pp. 378–482.
[4] Bertsekas, D. P., and Tsitsiklis, J. N. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.
[5] Boldi, P., Rosa, M., Santini, M., and Vigna, S. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In WWW (2011), pp. 587–596.
[6] Boldi, P., and Vigna, S. The WebGraph framework I: Compression techniques. In WWW (2004), pp. 595–601.
[7] Bordino, I., Boldi, P., Donato, D., Santini, M., and Vigna, S. Temporal evolution of the UK web. In ICDM Workshops (2008), pp. 909–918.
[8] Buluç, A., and Gilbert, J. R. The Combinatorial BLAS: design, implementation, and applications. IJHPCA 25, 4 (2011), 496–509.
[9] Catalyurek, U., and Aykanat, C. Decomposing irregularly sparse matrices for parallel matrix-vector multiplication. In IRREGULAR (1996), pp. 75–86.
[10] Chandy, K. M., and Misra, J. The drinking philosophers problem. ACM Trans. Program. Lang. Syst. 6, 4 (Oct. 1984), 632–646.
[11] Cheng, R., Hong, J., Kyrola, A., Miao, Y., Weng, X., Wu, M., Yang, F., Zhou, L., Zhao, F., and Chen, E. Kineograph: taking the pulse of a fast-changing and connected world. In EuroSys (2012), pp. 85–98.
[12] Chierichetti, F., Kumar, R., Lattanzi, S., Mitzenmacher, M., Panconesi, A., and Raghavan, P. On compressing social networks. In KDD (2009), pp. 219–228.
[13] Devine, K. D., Boman, E. G., Heaphy, R. T., Bisseling, R. H., and Catalyurek, U. V. Parallel hypergraph partitioning for scientific computing. In IPDPS (2006).
[14] Dijkstra, E. W. Hierarchical ordering of sequential processes. Acta Informatica 1 (1971), 115–138.
[15] Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S., Qiu, J., and Fox, G. Twister: A runtime for iterative MapReduce. In HPDC (2010).
[16] Faloutsos, M., Faloutsos, P., and Faloutsos, C. On power-law relationships of the internet topology. ACM SIGCOMM Computer Communication Review 29, 4 (1999), 251–262.
[17] Gonzalez, J., Low, Y., Gretton, A., and Guestrin, C. Parallel Gibbs sampling: From colored fields to thin junction trees. In AISTATS (2011), vol. 15, pp. 324–332.
[18] Gonzalez, J., Low, Y., and Guestrin, C. Residual splash for optimally parallelizing belief propagation. In AISTATS (2009), vol. 5, pp. 177–184.
[19] Gonzalez, J., Low, Y., Guestrin, C., and O'Hallaron, D. Distributed parallel inference on large factor graphs. In UAI (2009).
[20] Gregor, D., and Lumsdaine, A. The Parallel BGL: A generic library for distributed graph computations. POOSC (2005).
[21] Hofmann, T. Probabilistic latent semantic indexing. In SIGIR (1999), pp. 50–57.
[22] Kang, U., Tsourakakis, C. E., and Faloutsos, C. Pegasus: A peta-scale graph mining system implementation and observations. In ICDM (2009), pp. 229–238.
[23] Karypis, G., and Kumar, V. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput. 48, 1 (1998), 96–129.
[24] Kwak, H., Lee, C., Park, H., and Moon, S. What is Twitter, a social network or a news media? In WWW (2010), pp. 591–600.
[25] Kyrola, A., Blelloch, G., and Guestrin, C. GraphChi: Large-scale graph computation on just a PC. In OSDI (2012).
[26] Lang, K. Finding good nearly balanced cuts in power law graphs. Tech. Rep. YRL-2004-036, Yahoo! Research Labs, Nov. 2004.
[27] Leskovec, J., Kleinberg, J., and Faloutsos, C. Graph evolution: Densification and shrinking diameters. ACM Trans. Knowl. Discov. Data 1, 1 (Mar. 2007).
[28] Leskovec, J., Lang, K. J., Dasgupta, A., and Mahoney, M. W. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics 6, 1 (2008), 29–123.
[29] Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., and Hellerstein, J. M. Distributed GraphLab: A framework for machine learning and data mining in the cloud. PVLDB (2012).
[30] Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J., Horn, I., Leiser, N., and Czajkowski, G. Pregel: a system for large-scale graph processing. In SIGMOD (2010).
[31] Pellegrini, F., and Roman, J. Scotch: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs. In HPCN Europe (1996), pp. 493–498.
[32] Power, R., and Li, J. Piccolo: building fast, distributed programs with partitioned tables. In OSDI (2010).
[33] Pujol, J. M., Erramilli, V., Siganos, G., Yang, X., Laoutaris, N., Chhabra, P., and Rodriguez, P. The little engine(s) that could: scaling online social networks. In SIGCOMM (2010), pp. 375–386.
[34] Smola, A. J., and Narayanamurthy, S. An architecture for parallel topic models. PVLDB 3, 1 (2010), 703–710.
[35] Stanton, I., and Kliot, G. Streaming graph partitioning for large distributed graphs. Tech. Rep. MSR-TR-2011-121, Microsoft Research, Nov. 2011.
[36] Suri, S., and Vassilvitskii, S. Counting triangles and the curse of the last reducer. In WWW (2011), pp. 607–614.
[37] Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. Spark: Cluster computing with working sets. In HotCloud (2010).
[38] Zhou, Y., Wilkinson, D., Schreiber, R., and Pan, R. Large-scale parallel collaborative filtering for the Netflix prize. In AAIM (2008), pp. 337–348.