Distributed Graph Analytics: Programming, Languages, and Their Compilation
Unnikrishnan Cheramangalath • Rupesh Nasre • Y. N. Srikant
Y. N. Srikant, Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
We divide this book into three parts: programming, languages, and their compilation. The first part deals with manual parallelization of graph algorithms. It uncovers the various parallelization patterns that arise when dealing with graphs. The second part exploits these patterns to provide language constructs with which a graph algorithm can be specified. A programmer can work solely with those language constructs, without worrying about their implementation. The implementation, which is the heart of the third part, is handled by a compiler, which can specialize code generation for a backend device. We present a few suggestive results on different platforms, which justify the theory and practice covered in the book. Together, the three parts provide the essential ingredients for creating a high-performing graph application.
While this book is pitched at a graduate or advanced undergraduate level as a specialized elective in universities, to make the material accessible, we have also included a brief background on elementary graph algorithms, parallel computing, and GPUs. It is possible to read most of the chapters independently if the reader is familiar with the basics of parallel programming and graph algorithms. To highlight recent advances, we also discuss dynamic graph algorithms, which pose new challenges at the algorithmic and language levels, as well as for code generation. To make the discussion more concrete, we present a case study using Falcon, a domain-specific language for graph algorithms. The book ends with a section on future directions, which contains several pointers to topics that seem promising for future research.
We believe this book provides you wings to explore the exciting area of distributed graph analytics with zeal, and that it will help practitioners scale their graph algorithms to new heights.
We would like to especially thank Ralf Gerstner, Executive Editor of Computer
Science at Springer-Verlag for his encouragement and patient advice.
Chapter 1
Introduction to Graph Analytics
Examples of different kinds of graph analysis that are used in various applications
are path analysis, connectivity analysis, clustering, etc. Path analysis is used to find
the shortest distance between pairs of nodes in a graph and is useful in optimizing
transportation networks, supply and distribution chains, etc. Connectivity analysis
helps in determining whether parts of networks are disconnected or connected by
too few links, and is useful in power and water grid analysis, telecommunication
network analysis, etc. Clustering helps in identifying groups of interacting people
in a social network. Other types of analyses that use graph analytics are community
analysis and centrality analysis.
While graphs in traditional applications are small or can fit in the memory
of a single desktop or server box, social applications of graph analytics usually
involve massive graphs that cannot fit into the memory of a single computer. Such
graphs can be handled using a distributed computation. The graphs are partitioned
and distributed across computing nodes of a distributed computing system, and
computation happens in parallel on all the nodes. Efficient implementation of
graph analytics on such platforms requires a deep knowledge of the hardware,
threads, processes, inter-process communication, and memory management. Tuning
a given graph algorithm for such platforms is therefore quite cumbersome, since
the programmer needs to worry not only about what the algorithmic processing is,
but also about how the processing would be implemented. Programming on these
platforms using traditional languages along with message passing interface (MPI)
and tools is therefore challenging and error-prone. There is little debugging support
either.
To exploit the computational power of a single machine, graph analytic
algorithms are parallelized using multiple threads. Such a multicore processing
approach can improve the efficiency of the underlying computation. Unlike in
a distributed system, all the threads in a multicore system share a common
memory. A multicore system is programmed using OpenMP and libraries such as
pthreads.
Apart from distributed systems and multicore processors, Graphical Processing
Units (GPUs) are also widely used for general purpose high performance com-
puting. GPUs have their own cores and memory, and are connected to a CPU
via a PCI-express bus. Due to massive parallelism available on GPUs, threads are
organized into a hierarchy of warps and thread blocks. All the threads within a
warp execute in single-instruction multiple data (SIMD) fashion. A group of warps
constitutes a thread block, which is assigned to a multi-processor for execution.
GPUs are programmed using programming languages such as CUDA and OpenCL.
Graph algorithms contain enough parallelism to keep thousands of GPU cores
busy.
As noted above, one needs to use different programming paradigms for paral-
lelizing a given application on different kinds of hardware. As data sizes grow, it is
imperative to combine the benefits of these hardware types to achieve best results.
Thus, both the graph data as well as the associated computation would be split across
multi-core CPUs and GPUs, operating in a distributed manner. It is an enormous
effort for the programmer to optimize the computation for three different platforms
in multiple languages. Debugging such a parallel code at large scale is also often
practically infeasible.
This problem can be addressed by graph analytic frameworks and domain
specific languages (DSL) for graph analytics. A DSL hides the hardware details
from a programmer and allows her/him to concentrate only on the algorithmic
logic. A DSL compiler should generate efficient code from the input DSL program.
This book takes a deep look at various programming frameworks, domain specific
languages, and their compilers for graph analytics for distributed systems with CPUs
and GPUs.
A similar definition holds for undirected graphs as well, but obviously the edges are
not directed. A graph is simple if it does not have a loop and there are no multiple
edges between two vertices. A graph is complete if every pair of distinct vertices is
connected by a unique edge.
For undirected graphs, the degree of a vertex is the number of edges incident on
the vertex. For a directed graph, the indegree of a vertex v is the number of edges
having vertex v as their heads. Similarly, the outdegree of a vertex v is the number
of edges having vertex v as their tail.
A graph Gs (Vs , Es ) is a subgraph of the graph G (V, E) if Vs ⊆ V and Es ⊆ E.
Gs (Vs , Es ) is a proper subgraph of the graph G (V,E) if Vs ⊂ V or Es ⊂ E. A
directed graph is strongly connected if there is at least one directed path between
any pair of distinct vertices. A strongly connected component (SCC) of a directed
graph G is a maximal strongly connected subgraph of G. All SCCs are disjoint
and none of the SCCs can be extended with more vertices and edges from G, still
retaining strong connectivity. An undirected graph is connected if there is at least
one path between any pair of distinct vertices. A connected component (CC) of an
undirected graph G is a maximal connected subgraph of G. All CCs are disjoint,
and none of the CCs can be extended with more vertices and edges from G, still
retaining connectivity.
Distance between two vertices is the length of the shortest path between them.
The diameter of a graph is defined as the largest distance between any pair of
vertices. A cycle of a graph G(V,E) is a set E0 ⊆ E that forms a path such that
the first and the last vertices of the path are the same. A graph is cyclic if it has at
least one cycle; otherwise it is acyclic. A tree is an undirected, connected and acyclic
graph. For a tree, |V| = |E| + 1. A rooted tree is a tree which has a distinguished
vertex called the root, and all edges are oriented away from the root. Usually, a
rooted tree is referred to as a tree and the root is identified, with the directions of the
edges not being explicitly mentioned. A spanning tree for an undirected connected
graph G is a tree that connects all the vertices of G. If G is disconnected, then we talk
about a spanning forest for G, with one spanning tree for each connected component
of G.
An undirected graph is bipartite if the vertices in the graph can be divided into two disjoint and independent subsets such that for every edge, the two endpoints do not belong to the same subset. That is, a graph G(V, E) is bipartite if

∃ P, Q ⊆ V : P ∪ Q = V ∧ P ∩ Q = ∅ ∧ ∀ e : (u, v) ∈ E, (u ∈ P ∧ v ∈ Q) ∨ (u ∈ Q ∧ v ∈ P)   (1.2)
Graphs are stored in memory in various formats. We describe these formats below.
The adjacency matrix Arr of a graph G(V, E) is a square matrix of size |V|×|V| such
that:
Arr[i][j] = weight(e(i, j)), if e(i, j) ∈ E
Arr[i][j] = ∞, if i ≠ j ∧ e(i, j) ∉ E
Arr[i][j] = 0, if i = j ∧ e(i, j) ∉ E
Table 1.1 shows the adjacency matrix representation for the graph in Fig. 1.2b.
For undirected graphs, the corresponding adjacency matrix is symmetric, that is,
Arr[i][j] = Arr[j][i].
The Compressed Sparse Row (CSR) format represents an unweighted graph using
a table of two rows and a weighted graph with a table of three rows. Table 1.2
shows the CSR representation for the directed graph in Fig. 1.2b. The third row
named weight stores the edge weight for each edge, and has as many elements as
the number of edges in the graph. The first row, named index, stores offsets into the other two rows and has |V| + 1 entries. The second row stores the destination
vertex (head) of the edges in sorted order. Elements in the row weight, between the
offsets index[i] and index[i + 1] store weights of all the edges with vi as the tail
vertex. For example, in Table 1.2, edge weights for edges having vertex v1 as tail
start at index 3 in row weight, and there are two such edges (index[v2 ] −index[v1 ] =
5 − 3 = 2). It is to be noted that if a vertex vi is isolated (no edges connected to
it), then index[vi] = index[vi+1]. The last entry in the row index is always equal to |E| and is useful in computing the number of edges emanating from the last vertex vlast. Unweighted graphs do not need the weight row. Undirected graphs can
be stored in CSR format by storing each edge in both the directions. The space
complexity of the CSR format is O(V + E). This storage representation is very
efficient for sparse graphs, i.e., |E| ≪ |V|².
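As an illustration, the following minimal C++ sketch builds the CSR arrays for a small hypothetical weighted directed graph (not the graph of Fig. 1.2b; the array contents are assumptions chosen only for demonstration) and iterates over the out-edges of every vertex.

```cpp
#include <cstdio>
#include <vector>

// A minimal CSR sketch for a hypothetical weighted directed graph with
// 4 vertices and 5 edges. index has |V|+1 entries; dest and weight have |E|.
int main() {
    std::vector<int> index  = {0, 2, 3, 5, 5};   // index[i]..index[i+1]-1 are v_i's edges
    std::vector<int> dest   = {1, 2, 2, 0, 3};   // destination (head) of each edge
    std::vector<int> weight = {4, 1, 7, 2, 9};   // weight of each edge

    int V = static_cast<int>(index.size()) - 1;
    for (int u = 0; u < V; ++u) {
        // Edges with tail u occupy positions index[u] .. index[u+1]-1.
        for (int e = index[u]; e < index[u + 1]; ++e)
            std::printf("edge %d -> %d (weight %d)\n", u, dest[e], weight[e]);
    }
    return 0;
}
```

Note that vertex 3 is isolated in this sketch, so index[3] = index[4] = |E| = 5, as described above.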
A comparison of the graph storage formats is given in Table 1.4, where Access Time is the time required to check the presence or absence of an edge.
When a graph is too large to fit in the memory of a single computer, distributed
computation is required. The graph is partitioned into subgraphs and these are
distributed across the computing nodes of a cluster (distributed computing system).
More formally, a graph G(V, E) is partitioned into k subgraphs G1 , G2 , . . . , Gk ,
where k ≥ 2. The vertices and edges of the subgraphs Gi can be chosen in several
ways. The best partition must have an almost equal number of vertices in each subgraph, and the number of inter-partition edges must be minimized. Finding such a partition
is an NP-complete problem. Each subgraph is processed in parallel on a different
node of the computing cluster. Communication between subgraphs in the nodes
happens through message passing.
Every vertex belongs to a master subgraph which stores the updated properties
of that vertex. At the boundary of the subgraphs, edges span subgraphs. The edge
e : u → v of a subgraph Gi is called a remote edge if it connects to a different
subgraph Gj with the master subgraphs of the vertices u and v being Gi and Gj
respectively. Information propagation via a remote edge results in communication
between the nodes storing Gi and Gj . Latency of such a communication is typically
much higher than that in in-memory processing on a CPU or GPU. There is good
work balance among computing nodes when each subgraph requires similar amount
of computation. Graph partitioning methods should ensure work balance and also
minimize the amount of communication between subgraphs, by minimizing the
number of remote edges. This is a hard problem and all graph partitioning strategies
rely on heuristics in trying to achieve this goal.
Many frameworks have adopted random partitioning. Here, each vertex of the
graph is considered in turn, and is randomly assigned to a node of the computing
cluster. This method achieves good balance in the number of vertices per partition,
but there is no control on either the number of remote edges or the number of edges
in each partition. The two methods of graph partitioning that provide more control
on these issues are vertex-cut and edge-cut methods. In vertex-cut partitioning, the edges of the graph are divided among the partitions, and a vertex whose edges fall in different partitions is replicated in each of those partitions. In edge-cut partitioning, the vertices are divided among the partitions, and an edge whose endpoints lie in different partitions becomes a remote edge.
Fig. 1.3 Graph partitioning: vertex-cut (b) and edge-cut (c) for graph in (a) for k = 3
Note that vertex 3 is shared among three partitions and whenever information at
vertex 3 is updated in any one of the three partitions, it must be communicated to
vertex 3 in the other two partitions.
If the threads within a warp of a GPU kernel follow different execution paths based on runtime values of conditions in the program, performance degrades. This is called warp divergence. A program written for a GPU should have very few divergent conditional blocks to reduce warp divergence.
Due to the differences in the hardware architecture and programming aspects of
CPU and GPU, separate programs are required to be written for the same algorithm.
A programmer should know the details of the hardware in order to obtain the best
performance. This is a challenging task for a programmer who has expertise in algorithm design but not in tuning a program for specific hardware.
Parallel and distributed systems can be programmed using high-level programming libraries. MPI is a de facto standard for distributed processing, and OpenMP is popular for multi-core processing. CUDA and OpenCL are widely used frameworks for GPU programming.
1.6.1 OpenMP
Program 1.1 shows parallel matrix addition using OpenMP. The pragma in Line 21 creates a team of 3 independently running threads. The for loop (Lines 22–25) computes the sum of the array elements arr1[i][j] and arr2[i][j], and stores it in arr3[i][j]. Since the number of iterations of the outer loop (8) is not a multiple of the number of threads (3), the iterations are assigned to the threads as 3, 3, and 2. That is, threads 0 and 1 execute 3 iterations each, and thread 2 executes 2 iterations. The three threads end up computing the sums in 3, 3, and 2 rows of the variable arr3, respectively.
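A minimal OpenMP sketch in the spirit of Program 1.1 (not its exact code; the matrix dimensions and input values are assumptions) is shown below. With the usual static schedule, the 8 iterations of the outer loop are split 3, 3, 2 among the three threads, as described above.

```cpp
#include <cstdio>
#include <omp.h>

#define N 8   // rows: matches the 8 outer-loop iterations mentioned above
#define M 8   // columns: an assumed value

int main() {
    static int arr1[N][M], arr2[N][M], arr3[N][M];

    for (int i = 0; i < N; ++i)            // assumed input values
        for (int j = 0; j < M; ++j) {
            arr1[i][j] = i + j;
            arr2[i][j] = i * j;
        }

    // A team of 3 threads shares the iterations of the outer loop.
    #pragma omp parallel for num_threads(3)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < M; ++j)
            arr3[i][j] = arr1[i][j] + arr2[i][j];

    std::printf("arr3[%d][%d] = %d\n", N - 1, M - 1, arr3[N - 1][M - 1]);
    return 0;
}
```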
1.6.2 CUDA
GPUs have separate memories from the CPU they are associated with. CUDA
provides library functions for allocating memory on GPU, copying data to and
from the GPU, making kernel calls, etc. It extends the C++ language with additional keywords. The CPU is called the host and the GPU is called the device. The keyword __global__ precedes the definition of functions which are executed in parallel on the GPU (device) and called from the CPU (host). Such a function is called a CUDA kernel. A CUDA kernel is launched with X thread blocks and N threads per block, where the maximum values of X and N depend on the specific GPU device. CUDA kernel calls are asynchronous
and the host continues execution after a kernel call. A barrier synchronization
function cudaDeviceSynchronize() forces the host to wait for the device to complete
its operation.
Program 1.2 performs matrix addition on GPU using the CUDA kernel Matrix-
Add (Lines 3–6). In the main program, first the arrays are allocated on CPU and
GPU using malloc() and cudaMalloc() functions respectively. This is followed by
reading of the two input arrays. Even though the matrices are two dimensional, they
are being stored into one dimensional arrays in row-major order. Single dimensional
arrays are easier to copy from the host to the device and vice-versa. The array
elements are then copied from CPU (host) to GPU (device) using the cudaMemcpy()
function. The last argument in the cudaMemcpy() function specifies the direction of
data transfer as host to device (see Lines 26, 27).
Then the CUDA kernel MatrixAdd is called, which performs the operation
d_arr3[i][j] = d_arr1[i][j] + d_arr2[i][j], such that 0 ≤ i, j < 1024, but
on corresponding one dimensional arrays. Since a separate thread is created for
computing each element of d_arr3, each thread must know which element of
d_arr3 it must compute and which elements of d_arr1 and d_arr2 must be used
to compute it. This index is computed in the variable i (Line 4) using the variables blockIdx.x and blockDim.x, which store the block number and the number of threads per block respectively. The variable threadIdx.x stores the thread-id within
the block. In this example, computation of i is simple because a separate thread is
created for computing each element of d_arr3. Finally, the result stored in the array
d_arr3 is copied from device to host (see Line 33).
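The sketch below mirrors the structure described above; it is not the book's Program 1.2, and the input values and the block size of 1024 threads are assumptions. It shows allocation with malloc()/cudaMalloc(), host-to-device copies, an asynchronous kernel launch, synchronization, and the device-to-host copy of the result.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 1024

// Each thread computes one element; the 2-D matrices are stored as 1-D arrays
// in row-major order, as the text describes.
__global__ void MatrixAdd(const int *d_arr1, const int *d_arr2, int *d_arr3) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < N * N)
        d_arr3[i] = d_arr1[i] + d_arr2[i];
}

int main() {
    size_t bytes = (size_t)N * N * sizeof(int);
    int *h_arr1 = (int *)std::malloc(bytes);
    int *h_arr2 = (int *)std::malloc(bytes);
    int *h_arr3 = (int *)std::malloc(bytes);
    for (int i = 0; i < N * N; ++i) { h_arr1[i] = 1; h_arr2[i] = 2; }   // assumed inputs

    int *d_arr1, *d_arr2, *d_arr3;
    cudaMalloc(&d_arr1, bytes); cudaMalloc(&d_arr2, bytes); cudaMalloc(&d_arr3, bytes);
    cudaMemcpy(d_arr1, h_arr1, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_arr2, h_arr2, bytes, cudaMemcpyHostToDevice);

    // 1024 threads per block, enough blocks to cover all N*N elements.
    MatrixAdd<<<(N * N + 1023) / 1024, 1024>>>(d_arr1, d_arr2, d_arr3);
    cudaDeviceSynchronize();                         // the launch is asynchronous

    cudaMemcpy(h_arr3, d_arr3, bytes, cudaMemcpyDeviceToHost);
    std::printf("h_arr3[0] = %d\n", h_arr3[0]);      // prints 3

    cudaFree(d_arr1); cudaFree(d_arr2); cudaFree(d_arr3);
    std::free(h_arr1); std::free(h_arr2); std::free(h_arr3);
    return 0;
}
```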
1.6.3 OpenCL
1.6.4 Thrust
Thrust [4] is a CUDA library similar to the C++ Standard Template Library, and it eases the programming of parallel applications. The Thrust library has data-parallel primitives such as sort, reduce, scan, etc. These primitives can be used to implement complex parallel applications. The Thrust library avoids explicit cudaMemcpy() and cudaMalloc() calls for memory transfer and memory allocation respectively. This makes programming GPUs simpler. The program shown in Algorithm 1.4 performs matrix addition using the Thrust library. The program applies the plus transformation to each pair of elements from the arrays h_arr1 and h_arr2. The result is stored in the vector d_arr3.
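A minimal sketch of such a Thrust program is shown below (not the book's Algorithm 1.4; the vector size and values are assumptions). Assigning a host_vector to a device_vector performs the host-to-device transfer, replacing explicit cudaMalloc()/cudaMemcpy() calls.

```cpp
#include <cstdio>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

// Element-wise "matrix" addition with Thrust; the matrices are flattened
// into vectors, and the sizes/values are chosen only for illustration.
int main() {
    const int n = 1 << 20;
    thrust::host_vector<int> h_arr1(n, 1), h_arr2(n, 2);

    // Copying a host_vector into a device_vector does the host-to-device
    // transfer implicitly.
    thrust::device_vector<int> d_arr1 = h_arr1, d_arr2 = h_arr2, d_arr3(n);

    // Apply the plus functor to corresponding elements.
    thrust::transform(d_arr1.begin(), d_arr1.end(), d_arr2.begin(),
                      d_arr3.begin(), thrust::plus<int>());

    int first = d_arr3[0];               // element access copies device to host
    std::printf("d_arr3[0] = %d\n", first);
    return 0;
}
```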
1.6.5 MPI
The Message Passing Interface (MPI) [5] is a message passing library standard put forth by the MPI Forum. Several publicly available implementations of MPI have been built according to this specification. The MPI library is used for communication between the nodes of a distributed system, with each computing node having private memory. It primarily focuses on point-to-point message passing between processes.
MPI also supports collective operations, such as Barrier, Broadcast, Scatter, Gather,
and reduction (add, multiply, min, max, etc.) on a group of processes:
1. The function MPI_Barrier(comm) blocks the caller until all the processes
have reached this routine. Then, they are all free to proceed with their own
executions. The parameter comm is the group of processes to be synchronized.
This routine is used to synchronize processes to ensure that all of them reach a
predetermined point in computation before proceeding further.
2. MPI_Bcast(&buffer, count, datatype, root, comm) is the
broadcast routine that broadcasts count number of data of type datatype
from the variable buffer in the process with rank as root, to all the processes
of the group comm.
3. The routine MPI_Scatter(&sendbuf, sendcnt, sendtype,
&recvbuf, recvcnt, recvtype, root, comm)
is meant to distribute distinct messages (sendcnt in number), from the buffer
sendbuf, and from a single source process root, to each process in the
group comm. The messages will be received in recvbuf of each process.
MPI_Gather(...) is the reverse operation of Scatter. It collects distinct
messages from each process in the group in a single destination process.
4. MPI_Reduce(&sendbuf, &recvbuf, count, datatype, op,
root, comm)
applies a reduction operation (op) on the data in all the processes in the group
comm, and places the result in one process root. op can be one of the many
predefined operators mentioned above or can be provided by the programmer.
Figures 1.5 and 1.6 show the operation of these four routines.
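The hedged sketch below exercises some of these routines; the computation (summing the process ranks) is chosen only for illustration. MPI_Reduce combines the values at the root process, MPI_Bcast distributes the result to the group, and MPI_Barrier synchronizes all processes.

```cpp
#include <cstdio>
#include <mpi.h>

// Each process contributes its rank; the sum is computed at root 0 and then
// broadcast back to every process in MPI_COMM_WORLD.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int sendbuf = rank, sum = 0;
    MPI_Reduce(&sendbuf, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Bcast(&sum, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);          // all processes synchronize here

    std::printf("process %d of %d: sum of ranks = %d\n", rank, size, sum);

    MPI_Finalize();
    return 0;
}
```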
The MPI library is quite vast, and [5] contains a good description of the routines provided in it.
Graphs can model several real-world phenomena. For instance, we can view
various webpages as nodes and hyperlinks as directed edges to represent the world
wide web as a graph. Such a modeling can help perform various graph computations
on the web. For instance, the pagerank algorithm is a popular graph algorithm,
which is used to rank the webpages. Alternatively, the web-graph can be used to
find clusters of webpages which link one another. This can help in categorizing the
webpages into various topics.
Graphs can also naturally model road networks. The Single Source Shortest Path
(SSSP) computation is often used to find the shortest distance between two end-
points in a map. The Breadth First Search (BFS) algorithm is used in social network
analysis to find the distance (hop) between two persons. The triangle counting
algorithm is used in community detection. The connected component algorithm is
used to cluster a graph into different subgraphs.
Graphs pose challenges for parallelization due to their inherent input-centric nature.
We discuss those challenges below.
Graphs and their properties have been discussed in Sect. 1.2. Some properties, such as the diameter and the variance in degree, are very important to graph algorithms and their execution on different types of hardware. Differences in these properties can make a program perform poorly or very well for input graphs of different types (road, random, R-MAT). Table 1.5 shows a comparison of different graph types based on these properties.
In a road network graph, a vertex represents a junction of two or more roads
and an edge represents a road connecting two junctions. Road network graphs have
a very high diameter and a low variance in degree. A random graph is created
from a set of vertices by randomly adding edges between the vertices. The Erdős–Rényi model [6] assigns equal probability to all graphs with |V| vertices and |E| edges.
Real world graphs of social networks such as Twitter can have multiple edges
between the same pair of vertices. Such graphs could be unipartite like people in
a closed community group, bipartite like in a movie-actor or author-publication
database and possibly multipartite. Such graphs follow the power-law degree
distribution with very few vertices having very high degree (indegree or outde-
gree) [7]. As an example, in a Twitter network, a node corresponding to a celebrity
who has a large following will have a very high degree. Social network graphs
have a very low diameter unlike a road network. This is called the small-world phenomenon [8]. R-MAT [9] graphs follow the power-law distribution and can be generated synthetically. The R-MAT graph generator recursively subdivides the
adjacency matrix of a graph into four equal-sized partitions, and distributes edges
to these partitions with unequal probabilities a, b, c, d such that (a + b + c +
d) = 1.
A hypergraph G(V, E) is a graph where V is the set of vertices, and E is a set of non-empty subsets of V called hyper-edges or edges. A k-uniform hypergraph has every edge e ∈ E containing exactly k vertices. A 2-uniform hypergraph is an ordinary graph. Hypergraphs have applications in game theory, data mining, etc.
Algorithm 1.6 shows the pseudo-code for computing the pagerank of each vertex
in a graph.3 After reading the graph the pagerank of each vertex is initialized to
1 ÷ |V | (Line 12) in a parallel loop. Then, in the while loop the pagerank value
(pr) of each vertex is updated using a call to the function pagerank (Lines 1–9). The
foreach..in parallel statements in Lines 17, 12, and 3 are parallel loop
statements, in which each iteration of the loop is executed in parallel.
This algorithm updates the pagerank value of a vertex p by pulling values of the
vertices t ∈ V such that there is an edge t → p ∈ E of the graph (Line 3). In such
an implementation, the variable val can be updated without an atomic operation. But
if the pagerank value is computed by a push model, where for all edges p → t the
pagerank value of the vertex p is added to the pagerank value of vertex t, then the
forall loop should add val using an atomic operation. This is due to the irregular
3 See Sect. 2.3.1 of Chap. 2 for a detailed explanation of the pagerank algorithm.
nature of the graph algorithm. Push and Pull versions of parallel pagerank algorithm
are discussed in Sect. 3.2.5.
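The sketch below illustrates the pull-style update described above, using OpenMP and a CSR layout of in-edges. It is not the book's Algorithm 1.6: the tiny graph, the out-degree array, and the fixed iteration count are assumptions. Each vertex writes only its own rank, so no atomic operation is needed; a push-style variant would instead add contributions into neighbours' entries and would require atomic additions.

```cpp
#include <cstdio>
#include <vector>
#include <omp.h>

// Pull-style PageRank update: vertex p reads the ranks of its in-neighbours t
// (edges t -> p) and updates only its own entry.
int main() {
    const double d = 0.85;                        // damping factor
    std::vector<int> in_index = {0, 1, 3, 4};     // |V| + 1 entries
    std::vector<int> in_src   = {2, 0, 2, 1};     // tails of the in-edges
    std::vector<int> outdeg   = {1, 1, 2};        // out-degree of each vertex
    int V = (int)in_index.size() - 1;

    std::vector<double> pr(V, 1.0 / V), newpr(V, 0.0);

    for (int iter = 0; iter < 10; ++iter) {
        #pragma omp parallel for
        for (int p = 0; p < V; ++p) {
            double val = 0.0;
            for (int e = in_index[p]; e < in_index[p + 1]; ++e) {
                int t = in_src[e];
                val += pr[t] / outdeg[t];         // pull from in-neighbour t
            }
            newpr[p] = (1.0 - d) + d * val;       // no atomic needed here
        }
        pr.swap(newpr);
    }

    for (int p = 0; p < V; ++p)
        std::printf("pr[%d] = %f\n", p, pr[p]);
    return 0;
}
```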
Synchronous execution involves a barrier after the forall call to the function
pagerank() (Line 17). This ensures that the program waits for the parallel call to
the function pagerank() to finish. But if the algorithm is executed asynchronously,
there will be no barrier. Such a feature is useful when the computation happens using
CPU and GPU. The host (CPU) calls the device (GPU) function pagerank, and the
host continues execution without waiting for the device to finish the execution of
the pagerank() function.
Programming graph analytics targeting a distributed system with CPUs and GPUs
is very challenging. The programmer must handle graph partitioning, parallel
computation, dynamic memory management, and communication between nodes.
Multi-core CPUs and GPUs follow MIMD and SIMT architectures respectively.
The graph storage format should maximize cache locality and coalesced access
to obtain high throughput. Thread management for CPUs and GPUs is very
challenging. Programs written in native languages such as C or CUDA with libraries
such as OpenMP and MPI will be very large. Such programs will be difficult
to understand and modify, and are error-prone. The complexity of programming
graph analytics for distributed systems can be reduced by high level programming
abstractions.
Programming abstractions are used in various domains. Well known examples are
VHDL for hardware design, HTML for web programming, and MATLAB for
scientific computations. Such domain-specific frameworks or languages make the
job of coding and debugging much easier than in the usual programming languages.
More specifically, abstractions for heterogeneous hardware, graphs and distributed
computation will help in programming graph analytics applications. This increases
productivity and reliability.
1.9.1 Frameworks
In the BSP model of programming [13], input data is partitioned on the multiple
compute nodes. Each compute node computes the algorithm locally, and subse-
quently, communicates with other nodes at the end of computation. Programs in the
BSP model are written as a sequence of iterations called supersteps. Each superstep
consists of the following three phases:
1. Computation: Each compute node performs local computation independently on
its partitioned data and is unaware of execution on other nodes. Nodes may
produce data that needs to be sent to other nodes in the communication phase.
2. Communication: In the communication phase, data is exchanged as requested in
the previous computation phase.
3. Synchronization: There is an implicit barrier at the end of communication phase.
Each node waits for data which was transferred in communication phase to be
available.
The BSP model is used to simplify the programming in a distributed environ-
ment. In the context of graph processing, each vertex behaves as a computing node,
and vertices communicate among each other through edges. Pregel is an example of
such a framework [14] based on the BSP model. The MapReduce() framework on
the Hadoop distributed file system (HDFS) for storing data is adaptable for graph
analytics. Giraph [15] is an example of such a framework that combines the benefits
of Pregel and HDFS.
The Gather-Apply-Scatter model supports both the BSP model and the asynchronous model of execution.
Algorithm 1.8 shows the pseudo code of the execution model. In the gather phase, an active vertex u invokes the algorithm-specific gather function on its adjacent vertices. The result is computed using the sum function. The sum function of an algorithm should be commutative and associative. The computed result is then scattered along the edges connected to the vertex u. Gonzalez et al. [11] implement such an abstraction.
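The following sketch (not the book's Algorithm 1.8; the graph and the SSSP instantiation are assumptions) shows one way the gather, sum, apply, and scatter steps can fit together for shortest paths, where sum is min.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Gather-Apply-Scatter sketch instantiated for SSSP on a hypothetical graph:
// gather reads dist of an in-neighbour plus the edge weight, sum is min
// (commutative and associative), apply updates the vertex, and scatter
// activates out-neighbours when the value changed.
struct Edge { int to; int w; };

int main() {
    const int INF = 1 << 29;
    std::vector<std::vector<Edge>> out = {{{1, 4}, {2, 1}}, {{2, 2}}, {}};
    std::vector<std::vector<Edge>> in  = {{}, {{0, 4}}, {{0, 1}, {1, 2}}};
    std::vector<int> dist = {0, INF, INF};              // source is vertex 0
    // The source's out-neighbours start active; afterwards a vertex activates
    // its out-neighbours only when its own value changes.
    std::vector<bool> active = {false, true, true};

    bool any_active = true;
    while (any_active) {
        any_active = false;
        std::vector<bool> next(dist.size(), false);
        for (int u = 0; u < (int)dist.size(); ++u) {
            if (!active[u]) continue;
            int acc = dist[u];                           // gather + sum (min)
            for (const Edge &e : in[u])
                acc = std::min(acc, dist[e.to] + e.w);
            bool changed = acc < dist[u];
            dist[u] = acc;                               // apply
            if (changed)                                 // scatter
                for (const Edge &e : out[u]) { next[e.to] = true; any_active = true; }
        }
        active.swap(next);
    }
    for (int v = 0; v < (int)dist.size(); ++v)
        std::printf("dist[%d] = %d\n", v, dist[v]);
    return 0;
}
```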
Graph analytics is efficient on GPUs. Implementations of graph algorithms
written in languages such as C++/CUDA/OpenCL are efficient, but involve many
details of hardware, thread management, and communication. Many frameworks
have been developed over the last decade to support graph analytics on GPUs and
make programming easier. Such frameworks usually support multiple-GPUs but
only on a single machine. If the GPUs are connected with a peer-access capability,
the communication overhead between GPUs on a multi-GPU machine is very low.
Otherwise, it is very high. The communication cost between GPUs on different
nodes of a cluster is very high compared to that between CPUs. Therefore, proper graph partitioning assumes great significance on such systems. LonestarGPU [18] and IrGL [19] support cautious morph graph algorithms (a subset of dynamic graph
algorithms) on GPUs. Totem [20] and Gluon [21] are graph analytics frameworks
for multi-GPU machines and GPU cluster respectively.
Graph analytics frameworks lack support for semantic checks of programs. They also lack the higher level of abstraction offered by a domain specific language (DSL). It is relatively more difficult to program algorithms using frameworks than using graph DSLs. In a graph DSL, elementary data items such as vertex, edge, and graph are provided as data types with semantics for operations on each data type. Further, DSLs come with parallel and synchronization constructs, and also data types such as Collection and Set which are necessary to implement even elementary
algorithms. The syntax and semantic violations are caught by the DSL compiler.
All this makes programming graph analytics easier and less error prone, thereby
increasing productivity.
Several DSLs for graph analytics on multi-core CPUs have been proposed in the
past. Some of the Graph DSLs such as Green-Marl [22] and Elixir [23] support
only static graphs4 with mutable edge and vertex properties. Green-Marl has been
extended with support for CPU clusters [24]. Lighthouse [25] extended Green-Marl
for Nvidia-GPUs. Graph DSLs targeting single machines with a multicore-CPU,
multi-GPUs, and distributed systems with multi-core and multi-GPU configurations
have been reported in Gluon [21], Falcon [26], and DH-Falcon [27].
In order to demonstrate the benefits of DSLs, Algorithm 1.9 shows the code for
the single source shortest path algorithm (Bellman–Ford SSSP algorithm) written in
the Falcon DSL. While this code is quite compact, the same algorithm implemented
in a framework will be lengthier, more complex, and harder to debug. The important
function is SSSP in which hgraph is declared as a Graph type variable, a property
dist is added to its vertices (Line 9), the input graph is read (Line 10), and dist
of all vertices except the source is initialized to a very large value (source gets
zero value). The variable changed keeps track of any changes made to the shortest
distance property, dist. The while loop exits when no changes to dist occur for any vertex (Line 16). Within the while loop, the shortest distance is updated in parallel for each vertex by calling the function relaxgraph. This function updates the shortest distance property dist of a vertex using an atomic function MIN. It also updates the variable changed. More details of the SSSP algorithm are provided in the next two
chapters.
Chapter 2
Graph Algorithms and Applications
This chapter provides a discussion of various sequential graph algorithms and issues
in their implementations. After describing fundamental algorithms like traversals,
shortest paths, etc., more specialized algorithms such as betweenness centrality,
page rank, etc. follow. The chapter ends with a focused discussion of applications
of graph analytics in different domains such as graph mining and graph databases.
2.1 Introduction
1 Ideally, this should be O(|V | + |E|), but we use this shorter notation when the context is clear.
In this section we discuss elementary graph algorithms which have many appli-
cations. For simplicity, explanation of an algorithm considers only the sequential
execution. However, parallelization and distributed executions of these algorithms
are very important considering the huge graphs on which they operate. They will
BFS is one of the simplest graph traversal techniques [28]. The algorithm takes
as input a directed or undirected graph G(V, E) and a source vertex src (see
Algorithm 2.1). The traversal explores the edges of G to find all the vertices that
are reachable from the source vertex src in a level-order fashion. It computes a path
with the smallest number of edges to all the vertices reachable from src. A few
applications of BFS include:
• Web crawlers to build the index
• Social networks to find members with a distance k from a person
• GPS to find nearby locations.
The algorithm initializes the distance (dist) of all the vertices to infinity and
predecessor (pred) in the path to NULL (Lines 2–5). However, the dist value of
the source vertex src is made zero. While dist is useful in finding the shortest
distance (in terms of number of edges from the src vertex), pred is used to find
the actual shortest path. The variable changed is made False at the beginning of
the while() loop. The outermost forall loop iterates over all the vertices in the
graph, while the innermost forall loop iterates over the neighbours of each vertex
(outnbrs) of the graph. The BFS distance dist of a vertex t is updated to p.dist +
1 and its predecessor t.pred is made p, if the following two conditions are satisfied
(see Lines 11–15):
1. There is an edge p → t
2. t.dist >p.dist + 1
Whenever such an update happens, the variable changed takes the value True. If
no update happens in a particular iteration of the while() loop, the exit condition
is satisfied (Line 18) and the algorithm terminates.
Figure 2.1 shows the working of the BFS traversal on an undirected graph with
six vertices.2 Each iteration of the loop is shown as a subfigure. The algorithm halts
after five iterations, because there is no update in the fifth iteration. For a graph
with n vertices, BFS requires O(n²) time with an adjacency matrix representation.
It may be noted that the BFS algorithm presented above has little resemblance to the
BFS algorithms presented in text books, which invariably use a queue of vertices.
However, it must be pointed out that Algorithm 2.1 discovers all paths of length k
from the vertex src, before discovering paths of length k + 1, and hence is breadth-
first in nature. It is more suitable for parallel implementations than the one that uses
queues.
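A minimal OpenMP sketch of this queue-free BFS is shown below. The CSR arrays describe a small hypothetical graph (not the book's Fig. 2.1), and the unsynchronized distance updates are tolerable only because the outer loop repeats until a fixpoint is reached.

```cpp
#include <cstdio>
#include <vector>
#include <omp.h>

// Queue-free BFS in the style of Algorithm 2.1: relax every edge repeatedly
// until no dist value changes. Hypothetical graph: 0->1, 0->2, 1->3, 2->3.
int main() {
    const int INF = 1 << 29;
    std::vector<int> index = {0, 2, 3, 4, 4};   // |V| + 1 entries
    std::vector<int> nbr   = {1, 2, 3, 3};      // out-neighbours (heads)
    int V = (int)index.size() - 1, src = 0;

    std::vector<int> dist(V, INF), pred(V, -1);
    dist[src] = 0;

    bool changed = true;
    while (changed) {
        changed = false;
        // dist/pred updates may race between threads; every written value is a
        // valid upper bound, and the outer loop re-checks until a fixpoint.
        #pragma omp parallel for reduction(||: changed)
        for (int p = 0; p < V; ++p) {
            if (dist[p] == INF) continue;
            for (int e = index[p]; e < index[p + 1]; ++e) {
                int t = nbr[e];
                if (dist[t] > dist[p] + 1) {
                    dist[t] = dist[p] + 1;
                    pred[t] = p;
                    changed = true;
                }
            }
        }
    }
    for (int v = 0; v < V; ++v)
        std::printf("vertex %d: dist = %d, pred = %d\n", v, dist[v], pred[v]);
    return 0;
}
```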
The DFS algorithm (Algorithm 2.2) visits each vertex of a graph exactly once
starting from a source vertex [29]. The algorithm upon visiting a vertex, selects one
of its neighbours and visits that vertex deeper. It also timestamps the vertices, start
indicating the time at which the vertex was entered (visited) for the first time, and
etime indicating the time at which the vertex was exited. Colors are used to indicate
the state of a vertex: white indicating that the vertex has not yet been visited, gray
during the time (etime − start), and black at etime and later. The timestamps are
useful to prove certain properties of DFS.
It initializes each vertex with properties color=white, start and etime to −1,
and pred=−1 (see Lines 3–6). Then each unexplored vertex (i.e color=white) is
selected and DFS traversal is initiated (see Lines 7–11). In the traversal function
Traverse() (Lines 13–24), for the (argument) vertex p, its start time is noted and
color is made gray. This denotes that vertex p is now visited. Then all the unexplored neighbours (outnbrs) of the vertex p are visited in a depth-first manner recursively (Lines 16–21) by calling the function Traverse() (Line 19). When the recursive call for a vertex returns, the etime of the vertex is marked, and the color of the vertex is made black. This denotes that there are no more vertices to visit through vertex p.
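A compact recursive sketch of this traversal, with colors, timestamps, and predecessors, is given below; the adjacency lists describe a small hypothetical directed graph, not the graph of Fig. 2.2, and it is not Algorithm 2.2 verbatim.

```cpp
#include <cstdio>
#include <vector>

// Recursive DFS with colors, start/exit timestamps, and predecessors.
enum Color { WHITE, GRAY, BLACK };

std::vector<std::vector<int>> adj = {{1, 3}, {2}, {0}, {2}};
std::vector<Color> color(adj.size(), WHITE);
std::vector<int> start_t(adj.size(), -1), etime(adj.size(), -1), pred(adj.size(), -1);
int timestamp = 0;

void Traverse(int p) {
    start_t[p] = ++timestamp;            // vertex p is entered (visited)
    color[p] = GRAY;
    for (int t : adj[p])
        if (color[t] == WHITE) {         // go deeper through unexplored neighbours
            pred[t] = p;
            Traverse(t);
        }
    etime[p] = ++timestamp;              // nothing more to visit through p
    color[p] = BLACK;
}

int main() {
    for (int v = 0; v < (int)adj.size(); ++v)
        if (color[v] == WHITE)           // one DFS tree per unexplored region
            Traverse(v);
    for (int v = 0; v < (int)adj.size(); ++v)
        std::printf("vertex %d: start = %d, etime = %d, pred = %d\n",
                    v, start_t[v], etime[v], pred[v]);
    return 0;
}
```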
Figure 2.2 shows the DFS traversal on a directed graph.3 The start/etime of the
vertices are updated, and the color is updated to gray and then to black. The edge
explored in each step is shown as a dashed line and the iterations are shown in
Fig. 2.2a–h. DFS induces a spanning forest on a graph. The edges after the DFS
traversal of a directed graph can be classified as tree, forward, back or cross edges:
• An edge p → t is a tree edge if vertex t was unexplored when it was visited by
traversing p → t. All tree edges belong to the spanning forest.
• An edge t → p is a back edge if it connects vertex t to an ancestor p in the
spanning forest.4
• An edge p → t is a forward edge if it is not a tree edge and connects vertex p to
a proper descendant t in the spanning forest.
• All other edges are classified as cross edges.
The edge v0 → v3 is a tree edge, edge v1 → v3 is a back edge and v2 → v4 is
a cross edge in Fig. 2.2h. DFS on an undirected graph results in only tree and back
edges.
The DFS algorithm is used in finding connected and strongly connected com-
ponents, solving puzzles, finding biconnectivity in graphs, and solving a host of
other problems related to graphs. As an example, Algorithm 2.3 shows how DFS
can be used to find connected components of an undirected graph. Figure 2.3
shows connected components of a disconnected graph. Assuming that DFS on the disconnected graph in Fig. 2.3 begins at vertex 1, the order of visiting the vertices with DFS would be 1, 7, 8, 2 (forming component 0); 3, 9, 4, 10 (forming component 1); and 5, 6, 11, 12 (forming component 2). The connected components found are always independent of the vertex from which DFS is started.
DFS is often used in gaming simulations where each choice or action leads
to another, yielding a choice tree. It traverses the choice tree until it discovers
an optimal solution path (e.g., win). The choice tree may not be built explicitely.
For a graph with n vertices, DFS requires O(n2 ) time with an adjacency matrix
representation.
Fig. 2.3 Example of connected components. Each dashed circle contains one connected compo-
nent. The given graph is the union of the three components
path and predecessor vertex in the shortest path respectively for each vertex. dist and
pred are initialized to ∞ and NULL respectively for all the vertices (see Lines 2–5).
However, dist value of the source vertex src is initialized to zero.
The dist value of vertices is reduced by traversing all the edges in the graph
multiple times (see Lines 7–20) until a fixpoint is reached. At the beginning of each
iteration, the variable changed is set to False. The variable changed is set to True
whenever dist of any vertex is reduced during an iteration of the while loop (see
Line 14). The shortest distance value of a vertex t is reduced when it is cheaper to go
to vertex p, and then traverse the edge p → t (if there is one), than using the already
computed path to t. That is, t.dist > (p.dist +weightedge(p,t )) (Lines 11–15). This
triggers one more iteration of the while loop. The exit condition is satisfied when
there is no modification to dist value of any vertex in an iteration (see Line 19). This
marks the end of the algorithm. The shortest path to a vertex p can be computed
using pred, by simply traversing backwards from any vertex p.
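A sequential sketch of this relaxation-to-fixpoint scheme over an edge list is shown below; it is a simplified rendering of Algorithm 2.4, not its exact code, and the weighted graph is hypothetical.

```cpp
#include <cstdio>
#include <vector>

// Bellman-Ford-style SSSP: sweep over all edges until no dist value changes.
struct Edge { int p, t, w; };

int main() {
    const int INF = 1 << 29;
    std::vector<Edge> edges = {{0, 1, 5}, {0, 2, 2}, {2, 1, 1}, {1, 3, 3}};
    int V = 4, src = 0;

    std::vector<int> dist(V, INF), pred(V, -1);
    dist[src] = 0;

    bool changed = true;
    while (changed) {
        changed = false;
        for (const Edge &e : edges) {
            if (dist[e.p] != INF && dist[e.t] > dist[e.p] + e.w) {
                dist[e.t] = dist[e.p] + e.w;   // relax the edge p -> t
                pred[e.t] = e.p;
                changed = true;
            }
        }
    }
    // The shortest path to a vertex is recovered by following pred backwards.
    for (int v = 0; v < V; ++v)
        std::printf("dist[%d] = %d, pred[%d] = %d\n", v, dist[v], v, pred[v]);
    return 0;
}
```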
There are many algorithms for finding the shortest paths. The popular ones are
the Bellman–Ford algorithm, Dijkstra's algorithm, and the Δ-stepping algorithm.
The Bellman–Ford algorithm can detect negative cycles in a graph and it has
a complexity of O(m.n), where m and n are the number of edges and vertices
respectively. Dijkstra’s algorithm works for graphs with non-negative edge weights
only, and has a complexity of O(m log(n)) (with a binary heap) where m is
the number of edges. The Δ-stepping algorithm is very efficient for graphs with
large diameters. Algorithm 2.4 is a variant of the Bellman–Ford algorithm. SSSP
computation is often applied to automatically obtain driving directions between
locations in Maps. It is used extensively as a part of many other algorithms, such as
Betweenness Centrality computation.
Figure 2.4 shows an example of SSSP computation with Algorithm 2.4 on a
graph with source vertex S. The dist and pred values are shown inside the circles
corresponding to vertices as (dist/pred) in the figure.
Apart from SSSP, it may be necessary to compute shortest paths between every
pair of distinct vertices in a directed graph. Algorithm 2.5 achieves this and this
algorithm is also called Floyd’s Algorithm or Floyd-Warshall Algorithm [31] for
computing all pairs shortest paths. The principle of this algorithm can be enunciated
using the following equations and Fig. 2.5:
Fig. 2.4 SSSP computation illustration on an example graph (text inside a vertex indicates the current distance and the current predecessor)

Ak[i, j] = min(Ak−1[i, j], Ak−1[i, k] + Ak−1[k, j]), for k = 1, 2, . . . , n
A0[i, j] = 0 if i = j; weight(i, j) if e(i, j) ∈ E; ∞ otherwise
Where, Ak [i, j ] is the cost of the shortest path from vertex i to vertex j without
going through any vertex numbered higher than k. In going from vertex i to vertex
k and then from vertex k to vertex j , we do not go through vertex k or any vertex
numbered higher than k.5 The cost here is as in SSSP computation (sum of edge
costs). Computation of matrix A can be performed as shown in Algorithm 2.5, where
each row of A is attached to the corresponding vertex of G as a property. Floyd’s
algorithm has a time complexity of O(n3 ) for a graph with n nodes. Figures 2.6
and 2.7 show an input graph and the distance matrices. Examples of computing the
entries in the matrices are provided below.
5 Reaching vertex k once and leaving vertex k once is obviously permitted here.
Fig. 2.6 Input graph: vertices v1, v2, v3 with weighted edges v1 → v2 (2), v2 → v3 (3), v3 → v1 (5), and v3 → v2 (8)
Fig. 2.7 Distance matrices (rows and columns ordered v1, v2, v3):
A0 = [0 2 ∞; ∞ 0 3; 5 8 0]
A1 = [0 2 ∞; ∞ 0 3; 5 7 0]
A2 = [0 2 5; ∞ 0 3; 5 7 0]
A3 = [0 2 5; 8 0 3; 5 7 0]
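A direct sketch of the recurrence on this three-vertex example (updating the matrix in place, a standard simplification) is shown below; it reproduces the final matrix A3 above.

```cpp
#include <cstdio>
#include <vector>

// Floyd-Warshall sketch: after iteration k, A[i][j] holds the cheapest i -> j
// path that uses only the first k vertices as intermediates. The initial
// matrix corresponds to the example graph of Fig. 2.6.
int main() {
    const int INF = 1 << 29;
    int n = 3;
    std::vector<std::vector<int>> A = {
        {0,   2, INF},
        {INF, 0, 3},
        {5,   8, 0}};

    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                if (A[i][k] + A[k][j] < A[i][j])
                    A[i][j] = A[i][k] + A[k][j];

    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j)
            std::printf("%4d", A[i][j]);     // prints the rows of A3
        std::printf("\n");
    }
    return 0;
}
```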
6 This algorithm exhibits more parallelism than that exhibited by Tarjan’s algorithm.
Fig. 2.8 (a) DFS from vertex V4 in G; (b) transpose GT of the graph; (c) spanning tree of the DFS on GT; (d) strongly connected components of the original graph G
This divide-and-conquer algorithm computes the forward and backward reachable sets of a pivot vertex using BFS traversals on the graph G and its transpose, GT. It is based on the
following lemma from [34]:
Let G = (V, E) be a directed graph and let i ∈ V be a vertex in G. Then Vf wd (i)∩
Vbwd (i) is a unique SCC in G. Moreover, for every other SCC s in G, exactly one
of the following holds: (1) s is contained in Vfwd(i) \ (Vfwd(i) ∩ Vbwd(i)); (2) s is contained in Vbwd(i) \ (Vfwd(i) ∩ Vbwd(i)); or (3) s is contained in V \ (Vfwd(i) ∪ Vbwd(i)).
Algorithm 2.7: Divide and Conquer Algorithm for Finding Strongly Con-
nected Components
1 SCC(Graph G, Vertexset P, Vertexsetcollection StrongCompSet) {
2 // P is the vertex set on which forward and backward searches are performed.
3 // StrongCompSet is the collection of strongly connected components,
4 // represented by their respective vertex sets.
5 if ( P is empty ){
6 return;
7 //End of recursion.
8 }
9 Select v uniformly at random from P ;
10 // v is the pivot vertex from which forward and backward searches are made.
11 Vf wd = Fwd-Reachable(G, P , v); // This is a BFS on G, starting from v.
12 Vbwd = Fwd-Reachable(GT , P , v); // This is a BFS on GT , starting from v.
13 scc = Vf wd ∩ Vbwd ;
14 StrongCompSet = StrongCompSet ∪ scc;
15 SCC(G, Vf wd \ scc, StrongCompSet); // Recursive call.
16 SCC(G, Vbwd \ scc, StrongCompSet); // Recursive call.
17 SCC(G, VG \ (Vf wd ∪ Vbwd ), StrongCompSet); // Recursive call. VG is the vertex set
of G.
18 }
which is the subgraph with vertices {V1 , V2 , V5 } and edges between these vertices.
The two SCCs are shown in Fig. 2.8d.
The SCC algorithm is used as a preprocessing step in many algorithms to form
clusters of vertices. It has applications in model checking, vector code generation in
compilers, and analysis of transportation networks.
A directed graph G is weakly connected if the undirected graph G', obtained after discarding the edge directions in G, is connected. The weakly connected components (WCC) of G are the connected components of G' and can be found by running DFS (Algorithm 2.2) on G'. It may be noted that while the WCC of the graph in Fig. 2.8 is the original graph without the directions of the edges, the SCCs of the same graph are different. This algorithm
is used to find disconnected clusters in graphs, usually in a preprocessing step of
other algorithms.
In this section, we deal with only connected undirected graphs. A graph can have
more than one spanning tree. The cost of a spanning tree is the sum of the weights
of all the edges in the spanning tree. A spanning tree which has the minimum cost is
called a Minimum Spanning Tree (MST). A weighted graph can have more than one
MST with the same weight but with different sets of edges. MSTs are not necessarily
rooted trees, i.e., they may not have a vertex identified as a root.
There are several algorithms to compute MST, such as Prim’s [35], Kruskal’s [36]
and Boruvka’s [37]. Prim’s algorithm follows a greedy strategy. It starts from
an arbitrary vertex, considered as a component, and adds the minimum weight
edge incident on the component from the non-component vertices. The component
thus grows, and this step of choosing the minimum weight edge from the current
component is repeated. When the component contains all the vertices, the set of
edges considered during the processing forms the MST.
Kruskal’s algorithm to find an MST of a graph G is as follows: Sort the edges
of G by their edge weights. Add the edges to the MST7 (initially empty) in the
increasing order of edge weights, such that adding an edge does not lead to a cycle.
Stop once |V | − 1 edges are added to the tree or all the edges are considered.
Kruskal’s algorithm uses a Union-Find data structure for storing the intermediate
forest.
7 In intermediate stages of the algorithm, we may have a forest and not a tree. A tree is obtained only at the end, when the input graph is connected.
Algorithm 2.8 shows Boruvka’s MST algorithm [37, 38] which also uses a
Union-Find data structure for storing disjoint sets. Each set is identified with an
identifier, called parent (say). The operations on the data structure are union(A,
B) and find(x). The union(A,B) operation makes a destructive8 union of sets
A and B into a single set, and sets the parent of both A and B to the same value. It
is important to note that the parent of a set is a representative of the set and is not
necessarily the parent of all the vertices of the tree which the set represents. The
find(x) operation returns the parent of the set to which the element x belongs.
The set variable is initialized as a Union-Find data structure where each point is a
separate set (see Line 4). The set of edges in the MST is initialized to the null set
(see Line 6) and mst_cost is initialized to zero (see Line 7).
The while loop (Lines 8–16) computes the MST. The edge e : p ↔ t is added
to the MST if e connects two disconnected subtrees (p.parent ≠ t.parent) and the
weight of e is the minimum of the edges that connect the two subtrees (Line 10).
This edge is added to the MST (Line 13) and the two subsets are unified to form a
single subset (Line 14). The while loop terminates when set contains exactly one
component (assuming that the input graph is connected).
Boruvka’s algorithm has a time complexity of O(m log n), where n and m are
the number of vertices and number of edges (respectively) in the graph. Boruvka’s
algorithm is easier to parallelize than either Prim’s or Kruskal’s algorithms. MST
computation is an important and basic problem with diverse applications.
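A compact sketch of this scheme with a simple Union-Find (the find() and unite() helpers below correspond to the find and union operations described above) is shown next; it is not Algorithm 2.8 verbatim, and the edge list is a hypothetical connected undirected graph.

```cpp
#include <cstdio>
#include <vector>

// Boruvka-style MST: in each round every component picks its minimum-weight
// outgoing edge, the chosen edges are added to the MST, and the endpoints'
// components are unified in the Union-Find structure.
struct Edge { int u, v, w; };

std::vector<int> parent;

int find(int x) {                        // representative (parent) of x's set
    while (parent[x] != x) x = parent[x] = parent[parent[x]];
    return x;
}

void unite(int a, int b) { parent[find(a)] = find(b); }   // union(A, B)

int main() {
    int V = 4;
    std::vector<Edge> edges = {{0, 1, 4}, {0, 2, 1}, {1, 2, 3}, {1, 3, 2}, {2, 3, 5}};
    parent.resize(V);
    for (int i = 0; i < V; ++i) parent[i] = i;   // each vertex is its own set

    int mst_cost = 0, components = V;
    while (components > 1) {
        std::vector<int> best(V, -1);    // min outgoing edge per component
        for (int e = 0; e < (int)edges.size(); ++e) {
            int cu = find(edges[e].u), cv = find(edges[e].v);
            if (cu == cv) continue;      // endpoints already in the same subtree
            if (best[cu] == -1 || edges[e].w < edges[best[cu]].w) best[cu] = e;
            if (best[cv] == -1 || edges[e].w < edges[best[cv]].w) best[cv] = e;
        }
        for (int c = 0; c < V; ++c) {
            if (best[c] == -1) continue;
            const Edge &e = edges[best[c]];
            if (find(e.u) != find(e.v)) {     // may already have been added
                unite(e.u, e.v);
                mst_cost += e.w;
                --components;
                std::printf("added edge (%d, %d) with weight %d\n", e.u, e.v, e.w);
            }
        }
    }
    std::printf("MST cost = %d\n", mst_cost);
    return 0;
}
```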
Fig. 2.9 MST computation using Boruvka's algorithm: (a) input graph, (b) iteration 1, (c) iteration 2 — MST computed
sink vertices of edges with v as the source vertex are removed from Vm . MIS is
one of the important problems in the study of graph algorithms, because several
other fundamental graph problems can be reduced to MIS with a constant round
complexity overhead. Maximal matching, vertex and edge coloring, vertex cover,
and approximation of maximum matching are some of them.
Finding a maximum independent set (MXMIS) for a graph is a computationally
hard problem. Good approximation algorithms exist for finding MXMIS in planar
graphs. The MXMIS problem is closely related to cliques and vertex cover in
graphs. MXMIS computation is useful in genetic algorithms, neural networks, and
location theory.
This section describes some of the important problems that use graphs and graph
related algorithms. The list is only a sampler and is not exhaustive. The description
is intended to provide only an introduction to the problems.
The Pagerank algorithm [39, 40] is used to index webpages, and decide the quality
of a webpage. Webpages are given values based on the number of incoming links to
a page and weight of each linking page (i.e., source of the link). It is also used for
ranking text for entity relevance in natural language processing.
The page rank of a page or website A is calculated by the formula given below:

PR(A) = (1 − d) + d × (PR(T1)/C(T1) + · · · + PR(Tn)/C(Tn))
T1 to Tn correspond to the websites accessed by the user. PR(A) is the page rank
of A. PR(T1 ) to PR(Tn ) are the page ranks of the pages which have urls linking to
A. C(T1 ) to C(Tn ) are the number of links on the sites T1 to Tn respectively. The
parameter d in the computation is called the damping factor and is typically set to
0.85. The damping factor is the probability at each page, that a “random surfer” will
get bored and request another random page.
Algorithm 2.10 shows the pseudo-code to compute the page ranks of webpages.
The vertices of the graph G in the pagerank algorithm are the websites accessed by
the user. The incoming urls from a webpage A to webpage B is represented as the
edge A → B. Similarly, the edge B → A represents the outgoing urls to a website
A from B. The page rank of a vertex is high when it has more incoming edges. The
function pagerank() (Lines 2–10) updates the page rank of a vertex using the number
of incoming edges to it and the outdegree of the source vertex of the incoming edge.
The function computePR() (Lines 11–20) calls the function pagerank() (Line 16) MAX_ITR times. This heuristic indexes webpages efficiently.
Graph coloring [41] is the assignment of colors to the vertices or edges of a graph. A
coloring of a graph such that no two adjacent vertices share the same color is called
a vertex coloring of the graph. Similarly, an edge coloring assigns a color to each
edge so that no two adjacent edges share the same color. A coloring using at most k
colors is called k-coloring. Graph coloring has applications in process scheduling,
register allocation phase of a compiler, and also in pattern matching. Theoretically,
it is an NP-Complete problem for general graphs, and many heuristics have been
proposed in literature for this problem. Greedy vertex coloring is one such heuristic
and it is described in Algorithm 2.12.
The greedy algorithm [42] considers the vertices of the graph in a specific order
that is already chosen, say, v0 , v1 , . . . , vn−1 . Colors are numbered as 1, 2, . . . .
Vertex vi is assigned the lowest numbered available color that is not used by its
predecessors among v0 , v1 , . . . , vi−1 , which are also neighbors of vi . The quality
of the computed coloring depends on the chosen ordering of vertices. While there
exists an ordering of vertices which when used with the just described greedy
coloring leads to an optimal number of colors, the number of colors can be much
larger than the optimal for arbitrary orderings. One popular ordering of vertices is
the ordering by their degree: largest degree first. Another is to choose a vertex v
of minimum degree, find the ordering for the subgraph with v removed recursively,
and then place v last in the ordering. Degeneracy of the graph is the largest degree
d encountered during the execution of this algorithm as the degree of a removed
vertex. This order is called smallest degree last and coloring with this ordering uses
at most d + 1 colors. Figure 2.10 shows an example of coloring using the greedy
strategy with different orders.
Fig. 2.10 Greedy coloring of a crown graph with different vertex orders: (a) an order for which two colors C1 and C2 suffice (colors shown in brackets beside the vertices); (b) the largest-degree-first order v3 v7 v2 v6 v1 v5 v0 v4, for which four colors are required
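A minimal sketch of the greedy coloring just described is given below; the adjacency lists and the vertex ordering are hypothetical.

```cpp
#include <cstdio>
#include <vector>

// Greedy vertex coloring: process vertices in a given order and assign each
// the lowest-numbered color not used by its already-colored neighbours.
int main() {
    std::vector<std::vector<int>> adj = {{1, 2}, {0, 2}, {0, 1, 3}, {2}};
    std::vector<int> order = {0, 1, 2, 3};        // the chosen vertex ordering
    int V = static_cast<int>(adj.size());

    std::vector<int> color(V, 0);                 // 0 means "not yet colored"
    for (int v : order) {
        std::vector<bool> used(V + 2, false);
        for (int u : adj[v])
            if (color[u] != 0) used[color[u]] = true;   // colors of neighbours
        int c = 1;
        while (used[c]) ++c;                      // lowest available color
        color[v] = c;
    }
    for (int v = 0; v < V; ++v)
        std::printf("vertex %d gets color C%d\n", v, color[v]);
    return 0;
}
```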
2.3.4 K-Core
The K-core of a graph is the maximal subgraph in which every vertex has degree at least K. It can be computed by repeatedly removing vertices with degree less than K; the resulting graph is either the K-core or empty. The algorithm has applications in areas such as the study of the clustering structure of social network graphs, network analysis, and computational biology.
For every pair of vertices (u, v) of an undirected weighted graph, there could be
many shortest paths between u and v. For each vertex x, the Vertex Betweenness
Centrality (BC) [44] is the number of shortest paths that pass through x.9 With this
definition, it is unbounded. To make it bounded, it is normalized as:

BC(v) = Σ_{s ≠ v ≠ t} SPst(v) / SPst    (2.1)

where SPst is the number of shortest paths between s and t, and SPst(v) is the number of those paths that pass through v.
Fig. 2.11 Betweenness centrality examples: (a) a given undirected graph (1); (b) graph (1) with vertex BC values in brackets; (c) a given undirected graph (2); (d) graph (2) with vertex BC values in brackets; (e) graph (3) with edge BC values on the edges; (f) graph (3) with clusters based on the BC range 12–11; (g) graph (4) with edge BC values on the edges; (h) graph (4) with clusters based on the BC range 24–24
11 Sometimes, BC(e) is divided by two because the algorithms traverse each shortest path twice,
Graph mining [46] finds application in areas such as Bioinformatics, social network
analysis, and simulation studies. Some of the problems of interest are the detection
of abnormal subgraphs, edges, or vertices in a graph object. Extraction of commu-
nities from graphs, and detecting patterns in a graph such as power-law distribution
subgraphs and small-diameter subgraphs, are problems of interest in data mining.
The following section describes the application of community detection in bio-
informatics.
Graphs are often used in bioinformatics [47] for describing the processes in a cell.
In such graphs, vertices are genes or proteins, and edges may describe protein-
protein interaction, or gene regulation. The idea is to find sets of genes that have
a biological meaning. One possibility is to compute graph-theoretically relevant
sets of vertices and then determine if they are also biologically meaningful. While
connected components provide a simple means of achieving this, they have been
found to be not so effective. A more advanced idea is to perform graph clustering:
find subgraphs that have a high edge density. Figure 2.12 shows an example of graph
clustering.
An edge between different clusters could be on several shortest paths from one
cluster to another, compared to an edge inside a cluster, because there are more
alternative paths inside a cluster. Betweenness centrality of an edge is a good
measure of this feature. Removing edges with the highest betweenness centrality
from the graph yields good clusters.
NoSQL databases may incorporate several types of data, including graphs. Graph
databases, which are a part of NoSQL databases, use graph structures enhanced
with properties to represent and store data [48]. Answering queries on a graph
database requires execution of several graph algorithms on the graph data in the
database. Some of the well known graph algorithms, such as SSSP, BFS, DFS,
SCC, Betweenness Centrality, Page rank, etc., may be already provided in the
query language. An example of a graph database is Neo4j with the query language
Cypher [49].
Text graphs are used in Natural Language Processing and text mining. In text graphs,
the textual information is represented using graphs. The vertices may represent
concepts in the text, such as paragraphs, sentences, phrases, words, or syllables.
An edge between two vertices (concepts) may represent co-occurrence of the two
concepts in a window over the text, syntactic relationship, or semantic relationship.
Chapter 3
Efficient Parallel Implementation of Graph Algorithms
This chapter provides some insights into the issues in programming parallel algorithms.
Parallelism, atomicity, push and pull styles of computation, topology-driven versus
data-driven algorithms, and vertex-based, edge-based and worklist-based computation
are some of the important considerations in designing parallel implementations. Parallel
algorithms for different types of problems including graph search, connected
components, union-find algorithms, and betweenness centrality computation are
considered in detail. Finally, a description of the functioning of graph analytics on
distributed systems is provided.
3.1 Introduction
Important graph algorithms and their sequential execution were discussed in the
previous chapter. This chapter discusses the challenges involved in implementing
graph algorithms on heterogeneous distributed systems. Graph processing can be
carried out on a multi-core CPU, or on accelerators (e.g., GPUs) connected to a
host CPU, or on a distributed system with CPUs and GPUs. Accelerators typically have
memory that is physically separate from the host CPU memory, and the two
have their own address spaces. Data needs to be transferred from the host CPU
to the accelerator and vice-versa. The CPU and the GPU follow different execution
models with the CPU following a multiple instruction multiple data (MIMD) model,
whereas GPUs follow a single instruction multiple thread (SIMT) execution model.
A CPU has fewer computing cores and a much larger volatile memory in comparison
with a GPU. Table 3.1 shows the major differences between CPU and GPU devices, for
a typical configuration. As seen in the table, a CPU has a few tens of powerful
(high-frequency) cores. In contrast, a GPU has a few thousand low-frequency cores.
Volatile memory of a CPU can be a few hundred GB or even a few TB. In contrast,
current GPUs have a limited memory of a few tens of GB. Massive graph analytics
requires distributed computer systems (computer clusters) as the graph is too big
to fit in a single device. Each node of such a cluster may have a multi-core CPU
and several GPUs. Most clusters are located in a single large box, and their nodes are
connected with a very high-speed interconnection network.
3.2.1 Parallelism
The parallel graph algorithms presented in this chapter parallelize the outermost
loop (they do not exploit nested parallelism). That is, the outermost foreach
statement is run in parallel mode, and any nested foreach statements are
run in sequential mode. This is quite common in auto-parallelizing compilers
as well. Exploiting nested parallelism requires efficient scheduling of iterations that
balance work among threads. Parallelization of regular nested loops is based on
thread and block based mapping. In this case, the number of iterations of an inner
loop does not vary across the iterations of the outer loop. However, irregular nested
loops do not benefit from this simple strategy. In the case of irregular nested loops,
the use of thread-based mapping on the outer-loop may cause warp divergence (i.e.,
different threads are assigned different amounts of work), while the use of block-
based mapping will lead to uneven block utilization, which in turn may cause GPU
under-utilization.
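As an illustration of outer-loop parallelism, the hypothetical C++/OpenMP sketch below parallelizes only the foreach over vertices and keeps the loop over neighbours sequential; the varying lengths of the inner loops are exactly what causes the imbalance issues described above, and dynamic scheduling only mitigates them.

```cpp
#include <omp.h>
#include <vector>

// One step of a generic vertex-centric computation with outer-loop parallelism:
// the foreach over vertices runs in parallel, while the foreach over the
// neighbours of a vertex runs sequentially inside the owning thread.
void oneStep(const std::vector<std::vector<int>>& adj,
             const std::vector<int>& in, std::vector<int>& out) {
    const int n = static_cast<int>(adj.size());
    #pragma omp parallel for schedule(dynamic, 64)
    for (int v = 0; v < n; ++v) {          // parallel outer foreach
        int acc = 0;
        for (int w : adj[v])               // sequential inner foreach
            acc += in[w];
        out[v] = acc;                      // each thread writes only its own v
    }
}
```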
3.2.2 Atomicity
The execution pattern of a graph algorithm may vary for graphs of even the same
size (|V | and |E|). Prediction of the runtime control flow and execution time from
the size of the graph is impossible. In a parallel implementation of a graph algorithm,
two or more threads may try to update the same memory location at the same time.
Such a race condition is handled by atomic operations provided by the hardware,
which create critical sections that are executed by only one thread at a time. A
few variants of the atomic operations relevant to graph algorithms are shown in
Table 3.2. The atomic language construct used to create such critical sections
is implemented using atomic operations provided by the hardware. The atomic
operations listed in the table are typically supported by both modern CPUs and GPUs.
All the atomic operations return the value stored in the memory location before the
operation was initiated.
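As a small CPU-side illustration, the sketch below uses C++ std::atomic (the GPU counterparts are the CUDA atomic intrinsics); as with the primitives listed in Table 3.2, the old value of the memory location is made available to the caller.

```cpp
#include <atomic>

// Atomic fetch-and-add and compare-and-swap (CAS) on a shared integer.
// Both expose the value held before the update, which lock-free graph
// kernels typically rely on.
int atomicDemo(std::atomic<int>& x) {
    int old = x.fetch_add(5);              // returns the value before adding 5
    int expected = old + 5;
    // CAS: store 42 only if x still equals 'expected'; on failure, 'expected'
    // is overwritten with the value actually found in x.
    x.compare_exchange_strong(expected, 42);
    return old;
}
```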
3.2.3 An Example
Figure 3.1 shows a sample directed graph with four vertices and five edges.
Figure 3.2 shows one of the possible parallel execution patterns of the SSSP
computation in which, in each iteration, all the vertices are processed in parallel
on a multi-core processor. A parallel version of the SSSP algorithm is discussed in
detail in a later section in Algorithm 3.5. Assume that the outermost forall loop
(Line 10 in Algorithm 3.5) is run in parallel. The vertex V0 is considered as the
source vertex. The execution sequence depends on the way threads are scheduled
on the cores in addition to the graph topology. The distance of the source vertex
is initialized to zero. The value is set to infinity (∞) for all the other vertices. The
distance is reduced in each iteration until a fixpoint is reached (see Fig. 3.2). The
two edges V1 → V2 and V3 → V2 have the same sink vertex V2 and different
source vertices (V1 and V3 ). When the two source vertices try to reduce the distance
of the vertex V2 at the same time using two different threads, there is a possible race
condition. Assuming that both the vertices have passed the condition for update,
V3 may write 175 into V2 followed by V1 again “updating” V2 to 300 (wrong
distance)! A parallel implementation should use an atomic operation provided by
the programming language to avoid such race conditions. The computation may be
incorrect if atomic operations are not used in parallel implementations.2
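One common way to make such an update safe is an atomic minimum built from compare-and-swap. The following C++ sketch is one possible formulation and is not taken from any specific framework.

```cpp
#include <atomic>

// Atomically set dist to min(dist, candidate) and report whether it changed.
// Built from compare-and-swap, so two threads relaxing edges into the same
// vertex (such as V1 -> V2 and V3 -> V2 above) cannot lose the smaller value.
bool atomicMin(std::atomic<int>& dist, int candidate) {
    int old = dist.load();
    while (candidate < old) {
        if (dist.compare_exchange_weak(old, candidate))
            return true;                   // we installed the smaller distance
        // on failure, 'old' is refreshed with the current value; retry
    }
    return false;
}
```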
[Fig. 3.1: a sample directed graph with four vertices V0–V3 and five weighted edges (weights 20, 75, 100, 100, and 400). Fig. 3.2: one possible parallel execution trace of the SSSP computation on this graph, starting from the initialization step in which only the source distance is zero.]
Graphs in the real world vary in different aspects. Parallel implementations of graph
algorithms do not always perform very well for all types of graphs and on all
platforms. For example, road network graphs have large diameters, and the amount
of parallelism available in them is less for topology-driven algorithms (e.g., BFS,
see Algorithm 3.3). For road networks, such algorithms may perform very poorly
on GPUs, but may perform reasonably on multi-core CPUs. Social network graphs
follow the power-law degree distribution, and the difference in the maximum and
the minimum degrees of the graph is very high. Data-driven algorithms such as
minimum spanning tree computation perform well on GPUs and CPUs for many
types of graphs, but not for social network graphs.
Random graphs follow uniform degree distribution and have a low diameter, and
many parallel algorithms of both topology-driven and data-driven varieties perform
well for these graphs on both GPUs and multi-core CPUs. The type of algorithm
(topology-driven or data-driven) and the type of graph must be borne in mind while
implementing parallel algorithms.
3 The damping factor in Algorithms 3.1 and 3.2 is the probability, at each page, that a “random surfer”
will get bored and request another random page.
Algorithm 3.3 shows the pseudo code for topology-driven parallel breadth-first
search on a (directed or undirected) graph. Breadth-first traversal repeatedly visits
all the nodes (in parallel, one thread per node) in an increasing order of distance
(i.e., number of edges from the source vertex). The if program block (Lines 11–
15, Algorithm 3.3) updates the distance of the vertex t for an edge p → t ∈ E.
The if block needs to be atomic to avoid race conditions. For example, consider
the graph in Fig. 3.1. Ignore the weights on the edges. Let the source node be V0 ,
and assume that V0 first updates V1 and then V0 and V1 race to update V2 . Without
an atomic block, it is possible that both V0 and V1 pass the test in the if condition,
V0 first updates V2 to 1 and then V1 updates V2 to 2! This is clearly wrong. With an
atomic block, if V0 passes the test first, it also gets to update V2 first and then V1
fails the test. However, if V1 passes the test first and updates V2 to 2, V0 will still
update V2 to 1 when it gets its chance.
In the worst case, the number of iterations of the algorithm equals the maximum
BFS distance value. In the initial iteration, the BFS distance of the neighbours of the
source vertex is reduced from ∞ to one. The running time of Algorithm 3.3 on road
networks is higher than that on social networks or random graphs of the same size,
because road network graphs have a high diameter. A lock-free (no atomic block)
version of BFS is presented in Algorithm 5.5 in Sect. 5.4.1.
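For illustration, a minimal level-synchronous, topology-driven BFS on a multi-core CPU might be sketched as follows. It is not Algorithm 3.3 itself; it avoids atomics only because, in this level-synchronous formulation, every thread that updates level[t] in a given round writes the same value.

```cpp
#include <omp.h>
#include <limits>
#include <vector>

// Level-synchronous, topology-driven BFS: in every round all vertices are
// inspected (one thread per vertex) and frontier vertices relax their edges.
std::vector<int> bfs(const std::vector<std::vector<int>>& adj, int src) {
    const int INF = std::numeric_limits<int>::max();
    std::vector<int> level(adj.size(), INF);
    level[src] = 0;
    bool changed = true;
    for (int cur = 0; changed; ++cur) {
        changed = false;
        #pragma omp parallel for reduction(||:changed)
        for (int p = 0; p < static_cast<int>(adj.size()); ++p) {
            if (level[p] != cur) continue;       // p is not on the frontier
            for (int t : adj[p]) {
                if (level[t] == INF) {           // benign race: every writer
                    level[t] = cur + 1;          // stores the same value cur+1
                    changed = true;
                }
            }
        }
    }
    return level;
}
```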
Algorithm 3.4 presents a data-driven variant of BFS, which has better work-
efficiency compared to its topology-driven counterpart. It works with a variable
number of threads, T , and therefore scales better than the topology-driven algorithm
as the number of vertices increases. The number of threads ( num_threads) is fixed
at the beginning of the execution and does not change throughout the execution of
the algorithm. For a graph G = (V , E), V is partitioned into disjoint subsets such
that
\bigcup_{i=0}^{num\_threads - 1} V_i = V
BFS levels of vertices are recorded in the array bfs_level. Each thread numbered
i exclusively owns partition Vi , and handles the queue Qi . The queues Qi (i =
0, . . . , num_threads-1) store the frontier of vertices as BFS progresses. All these
queues are initialized to ∅, except the one corresponding to the thread k for which Vk contains the
node src, the source node from which BFS begins (see loop at Line 6).
The loop at Line 17 advances the frontier of BFS by collecting and distributing
the nodes adjacent to each node removed from the various queues owned by the
threads. Among the nodes collected by thread i, the nodes present in the partition Vj
are pushed into the queue Qi,j (see the loops at Lines 22 and 25). The loop at
Line 34 forms the new queue for each thread Qi , and assigns the BFS level to each
node inserted into the queue Qi (see Line 39). The enclosing while loop at Line 16
exits only when all the queues Qi are empty simultaneously.
It is easy to see that this BFS algorithm is data-driven; there are active queues
(Qi ) and threads operate on these queues. By using an appropriate number of
threads, the number of threads that are inactive can be reduced to acceptable levels.
An improved version of this algorithm that is adaptive in the number of threads is
presented in [55].
[Fig. 3.3: a sample weighted graph (source vertex s; edge weights include 5, 10, 18, 40, 80, 100, and 115) used as the input for the worklist-based SSSP computation of Algorithm 3.6; its iterations are traced in Table 3.4.]
Algorithm 3.6 shows the pseudo code for a worklist based SSSP computation. This
is a data-driven implementation. The variables, distance and predecessor of all the
vertices are initialized as in Algorithm 3.5 and distance of the source vertex s is
then reset to zero. Computation proceeds using two worklists, current and next,
which store a subset of the vertices of the graph. The source vertex s is added to the
worklist current (Line 8). The computation takes place in the while loop (Line 9–
21), which exits when the shortest distances to all the reachable vertices have been
computed. In the while loop, each vertex u in the worklist current is considered
(Line 10) and its outgoing edges are processed in the inner loop (Lines 11–17). For
each edge u → v ∈ E considered for processing, the distance reduction is done as
in Algorithm 3.5. If the distance of the vertex v is reduced, it is added to the worklist
next (Line 15), which contains the vertices to be processed in the next iteration of
the while loop. Once all the elements in current are processed, the worklists current and
next are swapped (Line 19), making next empty. After the swap, the size of current
is the number of elements that were added to next during the last execution of the
loop in Lines 10–18. If no elements were added to next, the size of current after the
swap is zero; this is the fixpoint of the computation and the algorithm terminates.
The above algorithm has a computational
complexity of O(V + E). Table 3.4 shows the iterations of Algorithm 3.6 for the
input graph in Fig. 3.3.
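The control structure of Algorithm 3.6 can be sketched sequentially as follows; a parallel version would process the elements of current concurrently and would use an atomic minimum for the distance update, as discussed earlier. Duplicate entries in next are possible but harmless for correctness.

```cpp
#include <limits>
#include <vector>

struct Edge { int to; long long weight; };

// Worklist-based SSSP: current holds the vertices to process in this round,
// next collects the vertices whose distance was reduced.
std::vector<long long> ssspWorklist(const std::vector<std::vector<Edge>>& adj,
                                    int s) {
    const long long INF = std::numeric_limits<long long>::max();
    std::vector<long long> dist(adj.size(), INF);
    dist[s] = 0;
    std::vector<int> current{ s }, next;
    while (!current.empty()) {
        for (int u : current) {
            for (const Edge& e : adj[u]) {
                long long d = dist[u] + e.weight;
                if (d < dist[e.to]) {        // relax u -> e.to
                    dist[e.to] = d;
                    next.push_back(e.to);    // process e.to in the next round
                }
            }
        }
        current.swap(next);
        next.clear();                        // next becomes the empty worklist
    }
    return dist;
}
```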
The Δ-stepping SSSP algorithm works in the following way. Initially, for each vertex v in the
graph, two sets heavy and light are computed, where:
heavy(v) = \{ (v, w) \in E : weight(v, w) > \Delta \}
light(v) = \{ (v, w) \in E : weight(v, w) \leq \Delta \}
Then the distance of all the vertices is made ∞ (Lines 9–13). The core of the
algorithm starts by relaxing the distance value of source vertex s in Line 14 with
a distance value of zero. This adds the source vertex to bucket zero (Line 21).
Then the algorithm enters the while loop in Lines 17–29, processing buckets in an
increasing order of index value i, starting from zero.
An important feature of the algorithm is that, once the processing of bucket B[i]
is over, no more elements will be added to the bucket B[i], when the buckets are
processed with increasing values of index i. A bucket B[i] is processed in the while
loop (Lines 19–24). At first, for every vertex v in bucket B[i], all edges v → w
∈ light(v) are considered and the pair (w, distance(v) + weight(v, w)) is added to
the Set Req. This is followed by all the elements of B[i] being added to the Set S
(Line 21) and B[i] being made empty (Line 22). Then the Relax() function is called for all
elements in Set Req. This adds new elements to multiple buckets. It can add a
vertex w to a bucket B[k] where k ≥ i. The index of the bucket to which the vertex w is added is
⌊(dist[v] + weight(v, w)) ÷ Δ⌋.
The vertex w is added to the bucket B[i] if distance[w] ≥ i × Δ and there can be
an element (w, x) ∈ Req where x < i × Δ. Here x = distance(v) + weight(v, w)
for an edge v → w. Once the bucket B[i] becomes empty after a few iterations,
the program exits the while loop (Lines 19–24). Now all edges v → w ∈ heavy(v)
are considered and the pair (w, distance(v) + weight(v, w)) is added to the Set
Req (Line 26). The edges in Set heavy have weight > Δ, and this makes x > i × Δ
for every (w, x) ∈ Req. So new elements will be added to buckets B[j] with j > i when
the Set Req is relaxed (Line 27). Now the value of i is incremented by one (Line 28)
and the algorithm starts processing bucket B[i + 1]. The algorithm terminates when all
the buckets B[i], i ≥ 0, are empty. The performance of the algorithm depends on
the input graph and on the value of the positive parameter Δ. For
a graph G(V, E) with random edge weights in (0, 1] and maximum node degree d,
the sequential Δ-stepping algorithm has a time complexity of O(|V| + |E| + d × P),
where P is the maximum SSSP distance of the graph. So, this algorithm has a running
time which is linear in |V| and |E|.
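The two places where Δ enters the algorithm, classifying edges as light or heavy and choosing the bucket of a relaxed vertex, can be captured by the small helpers below. This is only a sketch; the bucket data structure and the Relax() loop are omitted.

```cpp
#include <cmath>

struct Edge { int to; double weight; };

// An edge is light if its weight is at most Delta, heavy otherwise.
bool isLight(const Edge& e, double Delta) { return e.weight <= Delta; }

// A vertex whose tentative distance is d belongs to bucket floor(d / Delta).
int bucketIndex(double d, double Delta) {
    return static_cast<int>(std::floor(d / Delta));
}
```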
In the last few sections, several parallel algorithms were presented. The following
sections present a few more examples of important parallel graph algorithms.
performs linear work and runs in O(log³ n) time with high probability, and is shown
in Algorithms 3.11 and 3.12.
The algorithm partitions a graph into clusters, with each cluster being a tentative
connected component (see Line 8, Algorithm 3.11), contracts the clusters into single
nodes and creates a smaller graph after contraction (see Line 10, Algorithm 3.12),
and repeats these steps until all the components are finalized. Function Parallel_CC
(Line 39, Algorithm 3.12) recursively calls itself to perform these steps. The factor
β, which is passed to function Decompose, is the parameter of an exponential
probability distribution (with mean 1/β) and controls the diameter of the clusters
found by Decompose, and hence also the number of inter-cluster edges. Larger values of β generate
clusters of smaller diameter and therefore result in a larger number of inter-cluster edges.
Function Decompose is the heart of the algorithm.
In the foreach statement in Line 10 of function Decompose, a random number
δ is generated for each vertex in parallel, based on an exponential distribution with
a mean value of (1/β). The foreach statement in Line 16 of function Decompose
computes in parallel a start value for each vertex based on the value of δmax
computed in Line 13 of function Decompose. It is to be noted that start values may
not be integers and there may be several vertices with start values in a given range,
say x and x + 1, with x being an integer.
Line 19 of function Decompose begins the main loop of the function which
iterates until all the vertices of the graph are visited (O(log n) iterations will be
performed). Each iteration begins by collecting all the vertices which are not yet
visited (v.compnum == ∞) and whose start values are less than round+1 (round is
initialized to zero), into a set called Frontier (see Line 20, Algorithm 3.11). The
vertices in Frontier begin growing either the old cluster (to which they belong) or a
new cluster (if they have just got into Frontier), in the loop at Line 27 of function
Decompose, by taking one step, in which each vertex v looks at its neighbours
in parallel. If the neighbour is still unvisited (v.compnum == ∞), and some other
vertex has not annexed it already,5 vertex v adds this neighbour to its own cluster
by numbering the neighbour with its own component number. These new recruits
are added to NewFrontier which becomes the Frontier for the next iteration with
value round+1. After the clustering operation, Decompose returns a map of vertex
labels and their component numbers, created by function CreateCCmap (see Line 1,
Algorithm 3.11).
After calling Decompose once from function Parallel_CC, the clusters are
contracted into single vertices in function Contract (see Line 10, Algorithm 3.12).
New vertices and edges are created in the new graph and then the data structure
for the new graph is created in Line 36 of Algorithm 3.12. If the new graph is
5 This can happen due to concurrent processing of neighbours by vertices and hence requires a CAS operation.
graph. This is necessary to provide features to traverse neighbouring vertices and edges. Its details
are not shown in the book.
not empty (see Line 42, Algorithm 3.12), function Parallel_CC is called again on
the reduced new graph, and a new component map is collected. The old map and
the new map are combined to yield a new map in the function Relabel (see Line 1,
Algorithm 3.12), which is called towards the end of Parallel_CC. The combined
new map is returned by Parallel_CC.
Figures 3.4a–d and 3.5a–i show how an undirected graph is processed to compute
connected components using algorithm Parallel_CC. In Round = 0, vertices v1 and
v4 become eligible (according to their start values which are not shown) and are
placed in Frontier. They start inspecting their neighbours in parallel. v1 grows the
ball around it by including vertex v2 into its cluster and both these vertices are given
the same component number (compnum) of 1. Similarly, vertices v3 , v4 and v5 are
given the same component number of 4. Edge (v2 , v3 ) is now an inter-component
edge. In Round = 1, vertices v2 , v3 , v5 , v7 and v9 become eligible based on their
start values and start parallel inspection of their respective neighbours. v2 and v3
have no unvisited neighbours (see Fig. 3.4b). Vertices v5 and v7 have a common
unvisited neighbour, v6 . Only one of them can include v6 into their component and
this choice being arbitrary, v5 wins. Similarly, vertices v7 and v9 have a common
unvisited vertex v8 , and in this case v7 wins in the arbitrary choice (decided by the
atomic CAS operation), and includes v8 in its cluster. The clusters formed at the end
of Round 1 are shown in Fig. 3.4c. The map of vertex labels and their component
numbers (still tentative) are shown in Fig. 3.4d. No more rounds are possible and
now the graph with four clusters is contracted.
The contracted graph is shown in Fig. 3.5a. It has four nodes corresponding to
four clusters found so far. The process explained above is applied to this contracted
graph and finally one component is found as shown in Fig. 3.5g, h. All the vertices
of the original graph form a single component and the component number is 7 (it
could have been any other label as well).
\delta_{st}(v) = \frac{\sigma_{st}(v)}{\sigma_{st}}    (3.2)
While finding shortest paths in unweighted undirected graphs may be carried out
by a cheaper BFS strategy, computing pairwise dependencies for each vertex is still
[Figs. 3.4a–d and 3.5a–i (discussed in the text): the nine-vertex input graph v1–v9 with its vertex labels; the Frontier (black circles) and the clusters grown in Round 0 and Round 1; the CCmap after Round 1 (v1 and v2 map to component 1, v3–v6 to 4, v7 and v8 to 7, and v9 to 9); the successively contracted (reduced) graphs and their CCmaps; and the final CCmap, in which there is only one connected component and every vertex label maps to component number 7.]
The key to fast computation is the recurrence (3.5) that enables independent
computation of fractional pairwise dependency contribution by each vertex after
shortest path computation. The parallel algorithm now runs as follows. A BFS run
concurrently from each vertex v of the graph computes the distance, predecessors,
and number of shortest paths (the arrays v.distance[], v.predecessor[] and v.σ [],
respectively) from v. Then the fractional contributions of each vertex to BC(v)
are summed up to produce BC(v). The details are shown in Algorithm 3.13. This
algorithm is based on [63]. Using fine grained locks and lock-free data structures as
discussed in [63] yields even better performance.
Algorithm 3.13 processes each vertex of the graph sequentially, but the BFS and
other operations in each iteration are parallel algorithms. It uses an array of stacks
indexed by the BFS level to store the vertices visited at each level. This hierarchy
is used to compute the partial pairwise dependencies in a bottom-up manner. After
initializations (Lines 3–19), parallel BFS begins. Newly visited vertices are pushed
onto the stack at the appropriate level (Line 24), and distance, predecessor and σ
(number of shortest paths) values are updated (Line 31). The atomic sections are
necessary since global variables level, Stack_Array, and σ are being manipulated by
several concurrent operations. Operations beginning at Line 41 show the update of
δ and BC values in a bottom-up manner using Stack_Array and level.
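For concreteness, the bottom-up accumulation for a single source s might be sketched as below. It assumes the BFS phase has already filled the per-level stacks (stackByLevel), the predecessor lists (pred) and the shortest-path counts (sigma); for every vertex w popped from the deepest level upwards it applies the standard Brandes-style recurrence δ(v) += (σ(v)/σ(w)) · (1 + δ(w)) to each predecessor v of w, and then adds δ(w) to BC(w).

```cpp
#include <vector>

// Bottom-up accumulation of partial dependencies for one BFS source s.
// stackByLevel[l] holds the vertices discovered at BFS level l.
void accumulate(const std::vector<std::vector<int>>& stackByLevel,
                const std::vector<std::vector<int>>& pred,
                const std::vector<double>& sigma,
                int s, std::vector<double>& BC) {
    std::vector<double> delta(sigma.size(), 0.0);
    for (int l = static_cast<int>(stackByLevel.size()) - 1; l >= 0; --l) {
        for (int w : stackByLevel[l]) {
            for (int v : pred[w])            // every shortest-path predecessor
                delta[v] += (sigma[v] / sigma[w]) * (1.0 + delta[w]);
            if (w != s) BC[w] += delta[w];   // the source accumulates nothing
        }
    }
}
```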
Figure 2.11a, b are reproduced in Fig. 3.6 for convenience. The trees produced
by BFS initiated at each vertex along with partial δ values are shown in Fig. 3.7. Of
course, tree structure is provided for better understanding and is not needed for the
computation—level information is sufficient. Consider BC computation for vertex 1
in Figs. 3.6a and 3.7a. To begin with, vertex 1 is pushed onto Stack_Array[0]. Then,
neighbours of vertex 1, vertices 6 and 3 are pushed onto Stack_Array[1], indicating
that they would be processed after advance of BFS by one level. Vertex 1 is put on
predecessor lists of both vertices 6 and 3. Distance and σ values of both vertices
6 and 3 are updated. This ends BFS from vertex 1 for level 0. Vertices 3 and 6
are processed concurrently at level 1 (available from Stack_Array[1]) in a similar manner.
[Fig. 3.6 Graph (1) of Fig. 2.11 reproduced: (a) the given undirected graph; (b) the same graph with vertex BC values in brackets.]
Sequential algorithms for disjoint set union and find operations have been well
known for several years [64, 65]. They are used routinely in applications such
as Kruskal’s algorithm for computing minimum spanning trees. However, parallel
versions of union and find operations are not as efficient as their sequential counterparts.
Simple approaches that turn these sequential algorithms into parallel ones by using
locks on all update operations limit performance.
provide efficient implementations of concurrent Union-Find algorithms which use
fine grained locks. Concurrent wait-free versions of these operations are reported
in [67] and more recently in [68]. These are randomized concurrent algorithms
and their performance is estimated to be very good with certain assumptions on
randomization. Even though their performance may be reasonable in practice,
sufficient experimentation with these algorithms has still not been carried out to
verify the claims of theory. The following sections describe versions of wait-free
Union-Find algorithms as described in [68]. They may be replaced by available
implementations such as the ones in Galois [66] with almost no change in the
structure of applications using them.
The collection of sets that are dealt with in these algorithms are assumed to be
disjoint, for example, collection of disjoint sets of vertices used in Kruskal’s and
Boruvka’s MST algorithms. Each set in the collection is maintained as a tree with
the root being the representative of the set (“ID” of the set). Parent pointers are
used to implement the tree, with parent of the root pointing to itself. The MakeSet
operation creates a collection of singleton sets. A find operation on a leaf of a tree
Fig. 3.7 Parallel BC computation: BFS level-by-level traversal from each node of the graph in Fig. 3.6a, one tree per source vertex. Partial δ values are shown beside the nodes in parentheses.
Table 3.5 Summary of pred, dist, and σ computations for vertex 1 in Fig. 3.6a

Vertex processed | BFS level | Operations | Stack_Array (SA) contents | Predecessor (Pred) set | Distance (dist) | σ
(none) | – | Push 1 onto SA | SA[level=0] = 1 | – | – | –
1 | 0 | Push 6, 3 onto SA; push 1 onto Pred | SA[level=1] = 6, 3 | Pred[6]=1, Pred[3]=1 | dist[6]=1, dist[3]=1 | σ[6]=1, σ[3]=1
3 | 1 | Push 4, 2 onto SA; push 3 onto Pred | SA[level=2] = 4, 2 | Pred[4]=3, Pred[2]=3 | dist[4]=2, dist[2]=2 | σ[4]=1, σ[2]=1
6 | 1 | Push 5 onto SA | SA[level=2] = 4, 2, 5 | Pred[5]=6 | dist[5]=2 | σ[5]=1
4 | 2 | Nothing to do since neighbours 3, 5 are already in SA | – | – | – | –
2 | 2 | Nothing to do since neighbour 3 is already in SA | – | – | – | –
5 | 2 | Push 7 onto SA | SA[level=3] = 7 | Pred[7]=5 | dist[7]=3 | σ[7]=1
7 | 3 | Nothing to do since neighbour 5 is already in SA | – | – | – | –
traverses parent pointers till the root and returns the ID of the root. A union operation
on two sets A and B combines the two trees corresponding to the two sets by making
the root with the lesser ID a child of the root with the greater ID.
This assumes that the IDs of the elements have already been assigned to
the elements and that they will not be changed any time. Elements may be assigned
their IDs either as a total order or as a random permutation of a total order, from
the range 1, . . . , N, with N being the total number of elements (such as number of
vertices in a graph).
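A sequential sketch of this representation is shown below (with path compression added in find for efficiency); the wait-free versions of [68] replace the plain parent-pointer writes with CAS, but the overall structure is the same.

```cpp
#include <numeric>
#include <vector>

// Disjoint sets as trees of parent pointers; the root (its own parent) is the
// ID of the set.
struct UnionFind {
    std::vector<int> parent;
    explicit UnionFind(int n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0);   // MakeSet: singleton sets
    }
    int find(int x) {                                  // walk to the root
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];             // path compression
            x = parent[x];
        }
        return x;
    }
    bool unite(int a, int b) {                         // union by greater ID
        a = find(a); b = find(b);
        if (a == b) return false;
        if (a < b) parent[a] = b; else parent[b] = a;  // lesser root becomes child
        return true;
    }
};
```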
Finding connected components using Union-Find is almost trivial. Start with each
vertex as a component and consider each edge in parallel; if the endpoints of the edge
belong to different components, merge the two components into one. This sequence
is repeated until no more Union operations are possible. Algorithm 3.18 describes
this procedure.
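Using the UnionFind sketch above, the edge loop of this procedure may be written as follows; in a parallel implementation the loop over edges runs concurrently and unite operates on the parent pointers with CAS.

```cpp
#include <utility>
#include <vector>

// Connected components: start with singleton sets and merge across every edge.
// Returns, for each vertex, the root ID of its component.
std::vector<int> connectedComponents(
        int n, const std::vector<std::pair<int, int>>& edges) {
    UnionFind uf(n);                       // UnionFind from the previous sketch
    for (auto [u, v] : edges)              // in parallel: one edge per thread
        uf.unite(u, v);
    std::vector<int> comp(n);
    for (int v = 0; v < n; ++v) comp[v] = uf.find(v);
    return comp;
}
```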
3.6 Graph Analytics on Distributed Systems
The Gather-Apply-Scatter model, Vertex API based BSP model, and worklist
model for distributed execution were discussed in Sect. 1.9.1. These models are
implemented for CPU clusters. Minimizing the communication between subgraphs,
maintaining proper work-balance etc. are required in a distributed computation.
These are hard problems and heuristics are used to achieve suitable approximations
in practice. The MapReduce framework is used for a rich set of large scale
computations. However, the execution model of this framework provides inferior performance
for graph analytics; the model is found more suitable for applications such as SQL query
processing.
In a BSP model of execution a vertex may be active or inactive. An active vertex
performs computation, communication, and processes the received data. If there
is a mismatch in the amount of computation and communication among subgraphs,
load-imbalance may occur. Load balancing in graph analytics is improved
with dynamic strategies. Such a scheme does not rely fully on the input graph
structure; it looks at how the computation behaves at runtime. Based on the
runtime behaviour, graph partitions are modified by moving vertices and associated
properties from one subgraph to another. Frameworks have considered sizes of the
incoming and the outgoing messages, and time of each superstep in taking decisions
at runtime [72].
There exist works which improve upon the Gather-Apply-Scatter model
of execution. In one extension, a hybrid vertex-cut and edge-cut partitioning with
a separate communication engine is used. Another extension uses an agent-graph model
to partition graphs, and uses scatter-agents and combine-agents to reduce communication
overhead [73]. Domain Specific Languages (DSLs) for graph analytics
exist on CPUs. They differ in their data structures and language constructs [74–
76]. DSLs which support incremental dynamic algorithms have also been
implemented [26, 27]. DSLs with a different way of programming, where the algorithm
(what to compute) and the schedule (how to compute it) are specified separately, have been
proposed recently. Separate languages are provided for the algorithm and the schedule [76].
Fault tolerance is an important requirement in graph analytics. Frameworks
which reduce the overhead of fault tolerance restart the computation from
scratch on a failure [77]. There also exist frameworks which exploit check-pointing, resulting in
only a small performance overhead [78], and frameworks that perform replication-based
fault tolerance [79]. Recent frameworks running on a CPU cluster, in which
the computation restarts only from a safe point, guarantee correct output [80].
Graph analytics on clusters in which each node has a CPU and a GPU has also been
explored. It needs to be noted that there is a major difference in architectural style between the CPU and the GPU.
Frameworks which support multi-GPU machines have been developed in the
past [82]. An advantage of using a multi-GPU machine is its power efficiency as well
as its cost-per-watt rating. Multi-GPU machines allow the computation to be largely
partitioned across the GPUs, with the CPU only coordinating across devices. The DSL
which supports CPU and GPU clusters also supports multi-GPU machines [26].
Chapter 4
Graph Analytics Frameworks
Frameworks take away the drudgery of routine tasks in programming graph analytic
applications. This chapter describes in some detail the different models of execution
that are used in graph analytics, such as BSP, Map-Reduce, asynchronous execution,
GAS, Inspector-Executor, and Advance-Filter-Compute. It also provides a glimpse
of different existing frameworks on multi-core CPUs, GPUs, and distributed
systems.
4.1 Introduction
with several graph analytics frameworks. While all the graph analytics frameworks
provide such a functionality at the macroscopic level, they differ in multiple aspects.
Typically, a framework supports either synchronous execution [14], or asynchronous
execution [17], or both [11]. This respectively means that the workers (threads or
processes) execute parallel steps and coordinate to wait for others to finish each
step, or run completely independently of other threads, or use a combination of
the two. Frameworks also differ in the programming styles that they support. Gather-
Apply-Scatter (GAS) [11], inspector-executor [66], advance-filter-compute [12] and
Map-Reduce [83] are among the well known styles that have been proposed
in the past for graph analytics. Each style dictates a manner in which the underlying
computation must be encoded. The framework enforces explicit control-flow for
programming algorithms. Nuances of a particular computing system are handled
internally by the framework (e.g., graph partitioning and communication in a
distributed setup).
Graph algorithm implementation on GPUs started with handwritten codes. Effi-
cient implementations of algorithms such as Breadth First Search (BFS) and Single
Source Shortest Path (SSSP) on GPUs were proposed several years ago [84, 85].
The BFS implementation from Merrill et al. [57] is novel and efficient; it
exploits worklist-based processing to improve the work-efficiency of the implementation.
Efficient implementations of other algorithms on GPUs such as n-body
simulation [86], betweenness centrality [87], data flow analysis [88] and control-
flow analysis [89] were also proposed later. Efficient graph analytics frameworks
targeting GPUs [12, 90] and multi-GPU machines [20, 91] are available now.
In this section we discuss various popular execution models for graph algorithms
and the frameworks and DSLs that provide them.
A brief description of BSP was provided in Sect. 1.9.1.1. The BSP model performs
computations in supersteps [13]. The reader may recall that each superstep
consists of three parts: Computation, Communication and Barrier Synchronization.
A brief description of important frameworks that follow BSP is provided below.
4.3.1.1 Pregel
Pregel [14] is a proprietary graph analytics framework from Google that follows the
BSP model. The input graph G(V, E) can have a set of mutable properties associated
with vertices and edges. In each superstep, vertices carry out the computation in
parallel. The algorithmic logic must be programmed in the API functions of Pregel
(see Algorithm 4.1). A vertex modifies the mutable properties of its neighbouring
vertices and edges, sends messages to vertices, receives messages from vertices,
and if required changes the topology of the graph. A typical example for a mutable
property is the vertex property dist in the SSSP computation. All active vertices
perform computation, and all the vertices are set active initially.
A vertex deactivates itself by calling the VoteToHalt() API function and it gets
reactivated automatically when a message is received from another vertex. Once all
the vertices call the VoteToHalt() function and no message is sent across vertices, the
algorithm terminates. Algorithmic logic of the computation is required to be coded
in the Compute() function. The message, vertex and edge data types are specified
as templates. Different template types may be chosen for different algorithms. A
programmer is required to override the Compute() function, which will be run
on all the active vertices in each superstep, and code the algorithmic logic of the
computation in it.
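As an illustration only (Pregel itself is proprietary and its exact API differs), a vertex-centric SSSP compute step in this style could look roughly as follows; Vertex, sendMessage and voteToHalt are hypothetical stand-ins for the corresponding Pregel concepts.

```cpp
#include <algorithm>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical vertex-centric class in the spirit of Pregel's Compute() API.
// value holds the tentative SSSP distance; messages carry candidate distances.
struct Vertex {
    bool isSource = false;
    double value = std::numeric_limits<double>::infinity();
    std::vector<std::pair<Vertex*, double>> outEdges;   // (target, edge weight)

    void compute(const std::vector<double>& messages) {
        double minDist = isSource ? 0.0
                                  : std::numeric_limits<double>::infinity();
        for (double m : messages) minDist = std::min(minDist, m);
        if (minDist < value) {                 // distance improved this superstep
            value = minDist;
            for (auto& [target, weight] : outEdges)
                sendMessage(target, value + weight);   // delivered next superstep
        }
        voteToHalt();   // reactivated automatically if a message arrives later
    }

    // Stand-ins for runtime services a Pregel-like system would provide.
    void sendMessage(Vertex* /*target*/, double /*distance*/) {}
    void voteToHalt() {}
};
```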
4.3.1.2 GPS
The GPS [95] framework follows the BSP model of execution and incorporates
incremental modifications to Pregel. GPS repartitions the vertices of the graph
across the nodes of a distributed system during the computation, based on message-
sending patterns. It tries to co-locate on the same node two vertices that frequently send
messages to each other during the computation. The GPS framework
provides the Master.Compute() function in addition to the Vertex.Compute()
function of Pregel. The Master.Compute() function is called at the beginning of
each superstep. The Master class has access to merged global objects, and its Compute()
function can update global objects.
4.3.1.3 Pregelix
Pregelix [96] uses a set-oriented, iterative dataflow approach for the implementation
of the Pregel programming model. Pregelix combines the messages and vertex states
in a computation into a tuple. A query evaluation technique is then used to execute the
user program. Pregelix models the Pregel semantics as a query plan and implements
the semantics as an iterative dataflow of relational operators. Message exchange is
modelled as a join followed by a group-by operation. Pregelix offers a set of alternative
physical evaluation strategies for different workloads.
4.3.1.4 Ligra
Apache Giraph [99] is an open source framework based loosely on the Pregel model.
It also supports programming in the Map-Reduce model. Apache Giraph is extended
to handle large scale graphs in [15]. The framework was evaluated for graphs
with more than 6 Billion edges. Apache Hama [100] is a distributed computing
framework based on the BSP model. It follows a Pregel-like computation model for
graph analytics. The SSSP algorithm in Hama is shown in Algorithm 4.2.
Algorithm 4.2 works on a graph G(V, E). The computation happens on messages
in the compute() function. In the first superstep, the value (distance) of each vertex
is set to INFINITY (see Lines 2–4). The variable minDist is local for each vertex.
The value of minDist is set to zero for StartVertex and INFINITY for all other
vertices in each iteration (see Line 5). The getValue() function returns the value
of the vertex. In the following supersteps each vertex attains the smallest value
of the messages received. If the received message value for a vertex is less than
its current value, the vertex updates its value using the setValue() function. This
reduces the vertex value. If the value gets reduced for a vertex u, then a message
with a value u.value + e.weight is sent to all the vertices v, where u → v ∈ E,
using the SendMessage() function. At the end of each iteration, the vertex makes
itself inactive by calling voteToHalt() function. If a vertex receives a message, it gets
reactivated for the next superstep. The number of messages sent over the network is
reduced using the function combine(), which takes the minimum value of messages
for each destination vertex and sends only one message with the minimum value.
Computation stops when no message is sent in a superstep denoting a fixed point.
4.3.1.6 Totem
The BSP execution model is used in GPUs and multi-GPU machines also.
Totem [20] is a framework for graph processing on a heterogeneous, multi-GPU
system. It follows the BSP execution model. It supports running algorithms in
a distributed fashion among the multi-core CPU and the multiple GPUs of a multi-GPU
machine. The graph object is partitioned and each subgraph is stored on the devices
used for computation. Totem stores graphs in the Compressed Sparse Row (CSR)
format. It partitions graphs in a way similar to edge-cut partitioning.
Totem uses the buffers outbox and inbox on each device for communication. The
outbox buffer is allocated with space for each remote vertex, while the inbox buffer
has an entry for each local vertex that is a remote vertex in another subgraph on a
different device. Totem partitions a graph onto multiple devices, with less storage
overhead. It aggregates boundary edges (edges whose vertices belong to different
master devices) to reduce communication overhead. It sorts the vertex ids in the
inbox buffer to have better cache locality. Totem has inbuilt benchmarks which the
user can select with a numerical identifier. A user can also specify how many GPUs
to use, the percentage of the graph that should go to the GPU, etc., as command line
arguments. Such heterogeneous computing is useful as some algorithms do not
perform well on GPUs alone. Examples are SSSP, BFS etc., on GPU for road
networks. This happens as road networks have a large diameter and hence less
parallelism is possible. A user can dictate that an algorithm should be executed
either on CPU or on GPU, based on the type of input and/or algorithm.
The basic structure of the Totem framework is shown in Algorithm 4.3, where
a Totem benchmark defines the parameters in the totem_config class. A new
benchmark should use this template and place the algorithmic logic properly in the
functions in the struct totem_config. The Totem execution engine will take care of the rest of the execution.
4.3.2 Map-Reduce
Hadoop [103] supports Map-Reduce processing of graphs and uses the Hadoop
distributed file system (HDFS) for storing data. HaLoop [83] is a framework which
follows the Map-Reduce pattern with support for iterative computation and with
better caching and scheduling methods. Twister [104] is also a framework which
follows the Map-Reduce model of execution. Pregel-like systems can outperform
Map-Reduce systems in graph analytic applications.
The execution model of GraphLab is shown in Algorithm 4.4. The data graph
G(V, E, D)2 (Line 3) of GraphLab stores the program state. A programmer can
associate data with each vertex and edge based on the algorithm requirement. The
update function (Line 6) of GraphLab takes as input, a vertex v and its scope Sv
(data stored in v, its adjacent vertices and edges). It returns modified scope Sv and a
set of vertices T’ which require further processing. The set T’ is added to the set T
(Line 7), so that it can be processed in the upcoming iteration. The algorithm terminates
when T becomes empty (Line 4).
ASPIRE [107], KLA [108], CoRAL [109] and Wang et al. [110] describe other
examples of asynchronous frameworks.
2V is the vertex set, E is the edge set and D is the data table.
Very large graphs do not fit in RAM and hence, external memory algorithms
partition the graphs into smaller chunks such that each chunk fits into memory and
can be processed. The computed values are then used to process the next chunk, and
so on. The GraphChi [58] framework processes large graphs using a single machine,
with the graph being split into parts, called shards, loading shards one by one into
RAM and then processing each shard. It uses Gauss–Seidel iterative computation
based on a parallel sliding window. Such a framework is useful in the absence
of distributed clusters. The parallel sliding window technique has also been evaluated
with elementary graph algorithms [111]. The Turbo-Graph framework [112]
proposes the pin-and-slide execution model for parallel graph analytics using
external memory. It supports graph analytics on multi-core CPU with FlashSSD
I/O devices. The I/O operations and computations are performed asynchronously
to provide more efficiency. TurboGraph has been shown to outperform GraphChi.
TurboGraph is extended for external memory algorithms on CPU clusters [113]. The
extension supports multiple levels of parallel and overlapped processing for efficient
usage of multi-core CPUs, hard-disks, and network.
4.3.6 Inspector-Executor
remote data, and the received copy of remote data in the current iteration. After the
computation, the values are copied to the local copy of the accumulated remote data.
Galois [66] uses the inspector-executor execution model for topology-driven
graph algorithms targeting multi-core CPUs. It supports mutation (morphing)
of graph objects via cautious speculative execution.3 Galois uses a data-centric
formulation of algorithms called operator formulation. Galois defines:
• Active Elements: the vertices or edges where computation needs to be performed
at a particular instance of program execution.
• Neighborhood: the vertices or edges which are read or written by active elements
in a particular instance of execution.
• Ordering of the active elements present at a particular instance of program
execution.
In unordered algorithms, active elements can be processed in any order (e.g., Delaunay Mesh
Refinement, Minimum Spanning Tree), whereas in ordered algorithms,
elements are processed in a particular order (e.g., Δ-Stepping SSSP).
Galois uses a worklist based execution model, where all the active elements
are stored in a worklist and they are processed either in ordered or unordered
fashion using a foreach operator. During the processing of active elements, new
active elements are created, and these will be processed in the following rounds of
computation. Computation stops when all the active elements have been processed
and no more active elements are being created, denoting a fixed point.
Algorithm 4.9 shows the pseudo-code for the Δ-Stepping SSSP implementation
in Galois. Galois uses an order-by-integer-metric (OBIM) bucket for the Δ-stepping
implementation of SSSP, as declared in Line 26. The operator in the InitialProcess
structure reduces the distance of the neighbours of the source vertex and adds them to the
OBIM buckets (Lines 19–21). This function is called from the SSSP class in Line 29.
Then the parallel for_each_local iterator of Galois calls the operator of the Process
structure in Line 30, which is defined in Lines 9–11. The
parallel iterator completes once all buckets are free and the SSSP distances of all the
vertices are computed. The relaxNode() and relaxEdge() functions are not shown
in the algorithm. They are used to reduce the distance values of vertices as done in
other SSSP algorithms.
Galois also supports mutation of graph objects using cautious morph imple-
mentations (Delaunay Mesh Refinement and Delaunay Triangulation) and also
algorithms based on mesh networks. Galois does not support multiple graph objects.
Programming a new benchmark in Galois requires much effort, as understanding
the C++ library and parallel iterators is more difficult compared to a DSL based
approach.
3 Morph algorithms can be classified as cautious if they read all the elements of their neighborhood before modifying any of them.
4.3.7 Advance-Filter-Compute
This model defines advance, filter, and compute primitives which operate on
frontiers in different ways. A frontier is a subset of edges or vertices of the graph
which is actively involved in the computation. The Gunrock [116] framework
follows this execution model. Gunrock provides data-centric abstraction for graph
operations at a higher level which makes programming graph algorithms easy.
Gunrock has a set of APIs to express a wide range of graph processing primitives
and targets Nvidia GPUs. Each operation in this model can be of the following
types:
• An advance operation creates a new frontier using the current frontier by visiting
the neighbors of the current frontier. This operation can be used in algorithms
such as SSSP and BFS which activate subsets of neighbouring vertices.
• The filter operation produces a new frontier using the current frontier, but the new
frontier will be a subset of the current frontier. An algorithm which uses such a
primitive is the Δ-Stepping SSSP.
• The compute operation processes all the elements in the current frontier using a
programmer defined computation function and generates a new frontier.
The SSSP computation in Gunrock is shown in Algorithm 4.10. It starts with a
call to SET_PROBLEM_DATA() (Lines 1–6) which initializes the distance dist to
∞ and predecessor preds to NULL for all the vertices. This is followed by the dist of the
root node being set to 0. Then the root node is inserted into the worklist frontier.
The computation happens in the while loop (Lines 20–24) with consecutive calls
to the functions ADVANCE (Line 21), FILTER (Line 22) and PRIORITYQUEUE
(Line 23). The ADVANCE function with the call to UPDATEDIST (Lines 7–10),
reduces the distance of the destination vertex d_id of the edge e_id using the value
dist[s_id]+weight[e_id] where s_id is the source vertex of the edge. All the updated
vertices are added to the frontier for processing in the next iterations. Then the
ADVANCE function calls SETPRED (Lines 11–14) which sets the predecessor in
the shortest path of vertices from the root node. The FILTER function removes
redundant vertices from the frontier using a call to REMOVEREDUNDANT. This
reduces the size of the worklist frontier which will be processed in the next iteration
of the while loop. Computation stops when frontier.size becomes zero.
Programs can be specified in Gunrock as a series of bulk-synchronous steps.
Gunrock also looks at GPU specific optimizations such as kernel fusion. Gunrock
provides load balance on irregular graphs where the degree of the vertices in the
frontier can vary a lot. This variance is very high in graphs which follow the
power-law distribution. Instead of assigning one thread to each vertex, Gunrock
loads the neighbor list offsets into the shared memory, and then uses a Cooperative
Thread Array (CTA) to process operations on the neighbor list edges. Gunrock also
provides vertex-cut partitioning, so that neighbours of a vertex can be processed by
multiple threads. Gunrock uses a priority queue based execution model for SSSP
implementation. Gunrock was able to get good performance using the execution
model and optimizations (mentioned above) on a single GPU device.
Modern CPUs are equipped with multiple cores and follow the MIMD architecture. Each CPU
has a large volatile memory, and a barrier across all threads can be implemented
quite easily, as the CPU follows the MIMD model. In comparison, GPU devices have
a massively parallel architecture and follow the SIMT execution model. A GPU
device is attached to a CPU as a separate computing unit. For example, the
Nvidia K-80 GPU has 4992 cores, 24 GB device memory and a base clock rate
of 560 MHz. GPUs are also used for General Purpose computing (GPGPU), apart
from their extensive usage in graphics platforms. Graph algorithms are irregular,
require atomic operations, and can result in thread divergence when executed on a
streaming multiprocessor (SM). Threads are scheduled in multiple thread blocks.
Each thread block is assigned a streaming multi-processor (SM). The Nvidia K-80
GPU has 26 SMs, each with 192 streaming processors (SP) (thus, 26 × 192 = 4992
cores).
A barrier for threads within a thread block is possible in GPU as threads in a
thread block are scheduled on the same SM. A global barrier across all threads in a
CUDA kernel which spans multiple thread blocks is not available in a GPU directly.
This needs to be implemented in software [117] and has higher overheads. This will
force each thread in a thread block to process multiple elements using a for loop,
so that the total number of threads is not huge. Before a computation, data needs
to be copied from the non-volatile storage of a computer to the volatile memory
of its CPU, and then to the volatile memory of the GPU. Writing an efficient GPU
program requires a deep knowledge of the GPU architecture, so that the algorithm
can be implemented with less thread divergence, fewer atomic operations, coalesced
access etc. The performance issues of graph analytics on GPU are explored in [118].
Ringo [119] is a system for analysis of large graphs (with hundreds of millions
of edges) on a multi-core CPU. Ringo has a tight integration between graph and
table processing and efficient conversions between graphs and tables. It supports
various types of graphs. Ringo runs on a multi-core CPU machine with a large main
memory, and outperforms distributed systems on all input graphs.
4.4.2.1 IrGL
4.4.2.2 Lonestar-GPU
The LonestarGPU [122] framework supports mutation of graph objects and imple-
mentation of cautious morph algorithms. It has implementations of several cautious
morph algorithms like Delaunay Mesh Refinement, Survey Propagation, Boruvka’s-
MST and Points-to-Analysis. Boruvka’s-MST algorithm implementation uses the
Union-Find data structure. LonestarGPU also has implementations of algorithms
like SSSP, BFS, Connected Components etc., with and without using worklists.
LonestarGPU does not provide any API based programming style.
4.4.2.3 MapGraph
MapGraph [123] is an open source framework which uses the vertex centric Gather-
Apply-Scatter model of execution and targets single GPUs. It uses compressed
sparse row (CSR) format to store graph objects. MapGraph decides the scheduling
policy at runtime based on the number of active vertices and the size of the adjacency
lists for the active vertices at each superstep.
The Apply phase in the GAS execution model is parallel and well optimized.
The MapGraph framework optimizes the Gather and Scatter phases of the GAS
execution model. It uses dynamic scheduling and two-phase decomposition to
achieve the same. Dynamic scheduling uses the CTA-based scheduling strategy
based on the degree of an active vertex. An active vertex is assigned to a CTA,4 and
each thread of the CTA processes only one neighbor of the vertex when the vertex
degree is large (like in social network graphs). Scan-based scattering uses a prefix
sum operation to compute the starting and ending points in the column-indices array.
Then an entire CTA is assigned to gather the referenced neighbors from the column-
indices array using the scatter vector. Warp-based scattering does a coarse-grained
redistribution of the scattering workloads. These three policies are used to obtain
better load balance and improved memory access patterns. Dynamic scheduling is
very efficient for BFS and SSSP. But it may lead to imbalanced workloads among
CTAs.
The Puffin [124] framework proposes a novel data representation, which meets
programming needs with minimum storage space requirements. The framework
overlaps communication and computation using novel runtime strategies. The
runtime system of Puffin divides the tasks and manages the order of execution
of different kinds of tasks. Puffin provides both vertex-centric and edge-centric
programming models that are used in graph analytics. Tigr [125] transforms graphs
for GPU-friendly computation. It transforms irregular graphs into more regular
ones by changing the topology without graph partitioning. The split transformation
of Tigr splits vertices with very high degree iteratively until the degree of each
vertex is within a predefined limit. The virtual split transformation of Tigr adds a
virtual layer on top of the input graph. The virtualization separates the programming
abstraction from the input graph. The computation tasks are scheduled at the virtual
layer of the transformed graph. The actual value propagation is carried to the input
graph. GraphBLAS extension for GPUs is called GraphBLAS Template Library
(GBTL). An effort to standardize GraphBLAS for GPU is reported in [126]. A few
works [127, 128] also focus on graph analytics on a single GPU.
A programmer is required to write only sequential C++ code with these APIs. Medusa
provides a programming model called the Edge-Message-Vertex (EMV) model. It
provides APIs for processing vertices, edges, or messages on GPUs. A programmer
can implement an algorithm using these APIs, which are presented in
Table 4.1. The vertex and edge APIs can also send messages to neighbouring
vertices. Medusa programs require user-defined data structures and implementation
of Medusa API for an algorithm. The Medusa framework automatically converts
the Medusa API code into CUDA code. The API of Medusa hide most of the
CUDA specific details. The generated CUDA code is then compiled and linked with
Medusa libraries. Medusa runtime system is responsible for running programmer-
written codes (with Medusa API) in parallel on GPUs.
Algorithm 4.12 shows the pagerank algorithm implementation using Medusa
API. The pagerank algorithm is defined in Lines 26–31. It consists of three user-
defined functions: SendRank (Lines 2–7), which operates on an EdgeList; a vertex API
UpdateVertex (Lines 9–13), which operates over the vertices; and a Combiner()
function. The Combiner() function combines the message values received from
the Edgelist operator, which sends messages using the sendMsg function (Line 6).
The Combiner() operation type is defined as addition (Line 36) and message type
as float (Line 37) in the main() function. The main() function also defines
the number of iterations for pagerank() function as 30 (Line 39) and then the
pagerank() function is called using Medusa::Run() (Line 41). The main() function
in Medusa code initializes algorithm-specific parameters like msgtype, aggregator
function, number of GPUs, number of iterations etc. It then loads the graph onto
GPU(s) and calls the Medusa::Run function which consists of the main kernel. After
the kernel finishes its execution, the result is copied using the Dump_Result function
(Line 42).
The SendRank API takes an EdgeList el and a vertex v as arguments, and
computes a new value for v.rank. This value is sent to all the neighbours of the
vertex v stored in Edgelist el. The value sent using the sendMsg function is then
aggregated using the Combiner() function (Line 29) which is defined as the sum of
the values received. The UpdateVertex API then updates the pagerank using
the standard equation to compute the pagerank of a vertex (Line 12).
4.4.3.2 Lux
Lux initializes the vertex properties for an iteration by running the init() function
on all the vertices. The properties of the vertices from the previous iteration (v_old)
are passed as immutable inputs to the init() function. The compute() function takes an
edge e(u,v) and its properties, and the properties of the vertex u from the previous
iteration (u_old) as inputs, and updates the properties of the vertex v. The input
properties are immutable in the compute() function. The order of processing the
edges is non-deterministic. The compute function is executed concurrently on many
vertices. As the last step of an iteration, the update() function is called on every
vertex and updates are committed. The Lux runtime exits when no vertex properties
are updated in an iteration, denoting a fixed point. Algorithms 4.14 and 4.15 present
examples of programming the Pagerank algorithm in Lux.
Lux is a distributed framework. It partitions a large graph to subgraphs for
different compute nodes. Lux stores the vertex updates in the zero-copy memory.
The zero-copy memory is shared among all GPUs on a compute node. The GPUs
store mutable vertex properties in the shared zero-copy memory, which can be
directly loaded by other GPUs. It makes use of the GPU shared memory to store
subgraphs. The partially-shared design of Lux reduces the memory for vertex
updates and substantially reduces the communication between compute nodes. Lux
uses edge-cut partitioning to create subgraphs with almost equal number of edges
in each subgraph. It guarantees coalesced accesses using its partitioning method,
which increases the throughput. Lux uses compressed sparse row (CSR) format to
store graphs, and its runtime performs fast dynamic repartitioning which achieves
efficient load-balancing. Workload imbalance is detected by monitoring runtime
performance and then recomputing the partitioning in order to restore balance.
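The following CUDA fragment is a minimal sketch, independent of Lux's actual implementation, of how zero-copy (mapped, pinned) host memory for vertex updates could be allocated so that kernels on any GPU of the node can load and store it directly; the function name, the property type, and the use of a float per vertex are assumptions.

#include <cuda_runtime.h>

// Sketch: allocate mapped (zero-copy) pinned memory for per-vertex updates.
float *alloc_zero_copy_updates(int num_vertices, float **d_ptr)
{
    float *h_updates = NULL;
    size_t bytes = (size_t)num_vertices * sizeof(float);
    cudaHostAlloc((void **)&h_updates, bytes,
                  cudaHostAllocMapped | cudaHostAllocPortable);
    // Device-side alias of the same buffer; kernels can read and write it
    // without an explicit cudaMemcpy, at the cost of interconnect traffic.
    cudaHostGetDevicePointer((void **)d_ptr, h_updates, 0);
    return h_updates;
}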
Partitioning methods for distributed graph analytics have been compared in [131].
Different policies include balanced vertices, balanced edges and balanced ver-
tices and edges. The 2D block partitioning uses the adjacency matrix which
is divided into blocks in both dimensions, and assigns one or more blocks to
each node. CheckerBoard Vertex-Cuts (BVC), Cartesian Vertex-Cuts (CVC) and
Jagged Vertex-Cuts (JVC) are examples of 2D partitioning. Vertex-cut and edge-
cut partitioning methods have already been discussed in Sect. 1.4. Distributed
execution involves computation across different compute nodes in parallel, and
communication between nodes to exchange information. Efficiency is achieved
if there is load balance and less communication overhead. As the problem of
optimal graph partitioning is NP-hard, we rely on heuristics. We compare different
distributed graph analytics frameworks in this section.
5 https://spark.apache.org.
represented as vertex and edge property collections. The vertex collection properties
are keyed by vertex identifiers.
The Zorro [134] system provides efficient fault tolerance for distributed graph
analytics frameworks. Zorro was integrated into PowerGraph. Zorro opportunisti-
cally exploits vertex replication available in the framework to rebuild the state of
failed servers. The cost involved in rebuilding the state is low. The Zorro
system quickly recovers over 99% of the graph state when a few servers fail, and
between 87 and 92% when half of the cluster fails. In experiments with failing
cluster nodes, Zorro produces results with little to no inaccuracy.
PGX.D [135] uses a fast cooperative context-switching mechanism. It has a low-
overhead, bandwidth-efficient communication framework with remote data-pulling
patterns. It reduces communication volume and provides workload balance by
applying selective ghost nodes, edge partitioning, and edge chunking transparently.
There are a few other distributed graph analytics frameworks for CPU clusters,
such as those in [136–138].
Confined recovery has been adopted in distributed frameworks for CPUs where
rollback is not required on fault [109, 133].
Gluon and Phoenix follow the BSP execution model. Gluon-Async extends
Gluon for asynchronous execution in distributed heterogeneous graph analyt-
ics [141]. The execution model of Gluon-Async is lock-free, non-blocking, and
asynchronous, and is named bulk-asynchronous parallel (BASP). It combines the
advantages of BSP models and asynchronous execution. The computation happens
in supersteps. The individual hosts do not wait for the completion of the round from
other hosts. Each host sends messages and receives available messages and moves
to the next round.
Chapter 5
GPU Architecture and Programming
Challenges
5.1 Introduction
Specialized software running on a general purpose computer may not give the required
performance for different application domains. A general purpose computer may
not be able to harness the parallelism available in the software due to several
reasons, such as the lack of a sufficient number of CPU functional units, the
inability to process vectors, slow synchronization, slow memory management, etc.
This issue led to the design of hardware accelerators. Hardware acceleration is
implemented with dedicated hardware for a specific domain or application. The memory
management unit (MMU) and the floating point arithmetic co-processor are examples of
elementary hardware accelerators. Graphics Processing Units (GPUs), Field Programmable
Gate Arrays (FPGAs) and Application Specific Integrated Circuits (ASICs) are
examples of complex hardware accelerators. Many application domains benefit
from hardware acceleration. Table 5.1 describes available hardware accelerators
for different application domains. Hardware acceleration is critical for achieving
high performance. Dedicated hardware is energy-efficient as it consumes less power
compared to a software-based solution on a general purpose computer.
Flynn has classified computer architectures [142] into the four classes mentioned
below:
• Single Instruction Single Data (SISD)
• Single Instruction Multiple Data (SIMD)
• Multiple Instruction Single Data (MISD)
• Multiple Instruction Multiple Data (MIMD)
GPU devices follow the SIMT architecture. GPU devices provide massive
parallelism with a huge number of cores (typically more than a thousand) which
are hierarchically clustered. GPU devices are attached as accelerators to host multi-
core CPUs (the main CPU of the computing unit), and data needs to be copied between
the volatile memory of the CPU and the GPU. A GPU device cannot directly access the
nonvolatile memory or secondary storage of the host device. A GPU device consists
to different IOHs, then the data will have to go through the system memory and
the communication overhead is high. Otherwise, DMA transfer is possible between
the GPUs connected to the same IOH, and the overhead is lower. If two GPUs
G1 and G2 are connected to the same IOH, G1 (G2 ) can read or write to the
memory of G2 (G1 respectively). This needs to be enabled by calling the API
functions (provided by CUDA) before the communication happens. DMA transfer
between GPUs on the same IOH is known as peer-access. A sample CUDA code
to enable peer-access between GPUs on a machine is shown in Algorithm 5.1. The
function call (cudaDeviceEnablePeerAccess()) in the program will be successful
(i.e., returns cudaSuccess) if the GPUs are connected on the same IOH. Then
onwards, communication between GPUs on the same IOH will happen using DMA.
Peer access and normal access in a multi-GPU machine are shown in Fig. 5.1a and b,
respectively.
Fig. 5.1 (a) Peer access and (b) normal access between GPU0 and GPU1, each with its own memory, over the PCI Express bus
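A minimal sketch of enabling peer access along the lines of Algorithm 5.1 is given below. The device numbers 0 and 1 and the helper function name are assumptions; cudaDeviceCanAccessPeer() is queried first so that the code degrades gracefully when the GPUs hang off different IOHs.

#include <cuda_runtime.h>
#include <cstdio>

void enable_peer_access(int dev0, int dev1)
{
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, dev0, dev1);   // can dev0 access dev1?
    cudaDeviceCanAccessPeer(&can10, dev1, dev0);   // can dev1 access dev0?
    if (can01 && can10) {
        cudaSetDevice(dev0);
        cudaDeviceEnablePeerAccess(dev1, 0);       // returns cudaSuccess on the same IOH
        cudaSetDevice(dev1);
        cudaDeviceEnablePeerAccess(dev0, 0);
        // From now on, cudaMemcpyPeer() and direct loads/stores between the two
        // GPUs use DMA instead of staging through the system memory.
    } else {
        printf("Peer access not available between GPU %d and GPU %d\n", dev0, dev1);
    }
}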
GPUs have a massively parallel architecture with very high computing power
compared to that of multi-core CPUs, but the amount of volatile memory available
in a GPU (tens of GB) is much smaller than that in a multi-core CPU (hundreds
of GB). The high computing power of GPUs is utilized for general purpose
computing in various domains. Prior art comparing relative performance of GPUs
and multi-core CPUs has concluded that GPU devices outperform multi-core CPUs
in application domains such as graph analytics [26], machine learning [147], image
processing [148], to name a few. The performance of graph analytics applications
on a GPU depends on graph properties such as diameter and degree distribution
of the graph object. General purpose computing on GPU is very efficient for
benchmarks when there is minimal communication between a device and its host
during the computation. Otherwise, there is significant runtime overhead and the
performance suffers. Such processing may require a special hardware architecture
(hardware acceleration) to improve the performance [149]. A typical example of
such a hardware accelerator is an integrated GPU, which reduces the communication
overhead between CPU and GPU.
The performance of Nvidia GPUs is best harnessed when they are programmed with
CUDA. A CUDA programmer is required to take care of memory management, data
transfer between the CPU and the GPU, thread management, etc. Programming in CUDA
is therefore even more challenging than programming in high-level languages such
as C or C++. The CUDA programming model follows the syntax of C++ with
additional keywords and library functions. The keywords and functions have been
Table 5.3 CUDA—important keywords and functions for CPU (host) and GPU (device)
Item Type Description
__global__ Keyword Function run on device, called from host
__device__ Keyword Function or variable accessible from device
__syncthreads() Keyword Barrier for a thread block.
cudaMemcpy Function Transfer Data between host and device
cudaMemcpyFromSymbol Function Copy data from device to host
cudaMemcpyToSymbol Function Copy data from host to device
cudaMalloc Function Allocate memory on device
cudaDeviceSynchronize Function Barrier for the CUDA kernel (called from host)
added to differentiate between CPU aspects and GPU aspects, such as the CPU
memory and the GPU memory, functions executed on the CPU and the GPU,
etc. The library functions for thread and memory management are also present in
CUDA. Table 5.3 lists some of the important keywords and functions in CUDA.
There are CUDA libraries such as Thrust [4] and CUB1 which provide STL-like
high-level abstractions for GPGPU. Good performance is not always guaranteed
when special libraries are used for computation.
A parallel device (GPU) function called from the host (CPU) is called a CUDA
kernel. A CUDA kernel is called with thread blocks and threads-per-block, with
both having three dimensions x, y, and z. One dimensional indexing, where values
are specified only for the x dimension, is often used in practice. A one-dimensional
grid consisting of p thread blocks with q threads per block will have p × q threads
running the CUDA kernel. Each kernel thread is identified by a unique identifier
which is calculated by the formula given below (shown for a one-dimensional grid
for simplicity):
id = blockIdx.x * blockDim.x + threadIdx.x
Here, the variables blockIdx.x and threadIdx.x represent the thread block number
and the thread number within the thread block respectively, and blockDim.x and
gridDim.x represent the number of threads in a block and the number of thread
blocks in a kernel, respectively. There is a limit on the maximum number of thread
blocks and threads in a thread block specific to each device. GPU devices with
1 https://nvlabs.github.io/cub/.
Fig. 5.2 Nvidia GPU, with each SM having 192 SPs (six warps of 32 threads each)
compute capability2 greater than two have the maximum number of threads in a
thread block set to 1024. The number of thread blocks differs with GPU devices,
with values such as 64×1024 thread blocks or more per kernel possible. In Tesla
K20 GPU, the maximum dimension values for a thread block are (1024, 1024, 64)
and the maximum dimension size for a grid is (2147483647, 65535, 65535) for the
triple (x, y, z).
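Since these limits differ from device to device, they can be queried at runtime. The short sketch below uses the standard cudaGetDeviceProperties() call of the CUDA runtime; the choice of device 0 is an assumption.

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);             // properties of device 0
    printf("Max threads per block : %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions  : (%d, %d, %d)\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dimensions   : (%d, %d, %d)\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Number of SMs         : %d\n", prop.multiProcessorCount);
    return 0;
}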
The GPU device computing core hierarchy is shown in Fig. 5.2. The GPU device
consists of r = p × q streaming processors (SPs), where p is the number of
streaming multiprocessors (SMs) and q is the number of SPs in each SM. The
individual thread blocks of a CUDA kernel are assigned to an SM by the CUDA
warp-scheduler.3 The threads in the thread block are executed as a collection of
warps (thirty two threads) in SIMT fashion. Once a thread block is assigned to
an SM, the thread block cannot migrate to a different SM. The CUDA runtime
scheduler will assign another thread block to an SM once the SM has completed
the computation of the previously assigned thread block. The process repeats until
all the thread blocks of a CUDA kernel have executed on various SMs in the order
decided by the CUDA runtime scheduler.
CUDA kernel calls are asynchronous. This means, after launching a kernel,
the next CPU code continues to execute while the GPU kernel is being executed
concurrently. The cudaDeviceSynchronize() function call needs to be added in the
source program after the kernel call for synchronous execution (alternatively, one
may use events or synchronize using waiting loops). The cudaDeviceSynchronize()
function call acts as a barrier for the host, and the host waits for completion of the
kernel execution on the device before going to the next instruction.
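The following small sketch illustrates this behaviour; the kernel, its arguments, and the surrounding host function are illustrative assumptions.

__global__ void scale(float *a, float factor, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) a[id] *= factor;
}

void run(float *d_a, int n)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_a, 2.0f, n);  // returns immediately; the kernel runs asynchronously
    // ... unrelated CPU work can overlap with the kernel here ...
    cudaDeviceSynchronize();                   // the host blocks until the kernel has finished
}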
The CUDA runtime supports a barrier for threads within a single thread block
using the library function __syncthreads(). A programmer is required to place the
__syncthreads() function call at a proper place so that all the threads within a thread
block execute the function call. If it is placed inside a conditional code, all the
threads may not execute the function call. This may lead to threads not meeting
at a program point, resulting in a hung-kernel.
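A minimal sketch of the hazard and its usual fix is shown below; the kernels and data are assumptions.

// Hazardous: threads with id >= n skip the barrier, so the threads that do
// reach __syncthreads() may wait forever and the kernel hangs.
__global__ void barrier_in_branch(int *data, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) {
        data[id] *= 2;
        __syncthreads();
    }
}

// Safe: every thread of the block executes the barrier.
__global__ void barrier_outside_branch(int *data, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) data[id] *= 2;
    __syncthreads();
}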
2 The compute capability is the “feature set” (both hardware and software features) of the device.
As devices and CUDA versions get more and more features, the compute capability also increases.
3 A warp is just a group of threads and its size is set by the hardware, usually as 32.
The CUDA runtime does not support a global-barrier for all the threads of
a kernel. This is in contrast to the multi-core parallel libraries (e.g., OpenMP),
which provide such global barriers. A global-barrier for a CUDA kernel needs to be
implemented in software. A lock-free software implementation of the global barrier
has been devised [117]. A requirement in the implementation of a global-barrier
is that a CUDA kernel must be called with the number of thread blocks less than
or equal to the number of SMs on the particular GPU. This will require the CUDA
kernel to be called multiple times from the host, as the host calls the kernel with
fewer threads. However, this is necessary to prevent deadlocks. Imagine a situation
in which several thread blocks are mapped to a single SM and the active block has a
global barrier, over which it waits for completion. This leads to a deadlock since the
active thread block cannot be preempted and it does not exit without the other thread
blocks reaching the barrier. Other thread blocks cannot execute without preemption
of the active block.
A sample code for a CUDA kernel barrier is shown in Algorithm 5.2. The basic
idea is to assign a synchronization variable to each thread block. Thread block i will
be assigned location Ain [i], and thread zero of thread block i will be responsible
for setting Ain [i]. N threads of thread block zero (N is the number of thread blocks)
busy-wait on Ain , thread j checking Ain [j ]. The __syncthreads() call at Line 13
ensures synchronization of all the threads in thread block zero.
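A lock-free device-side barrier along these lines can be sketched as follows. This is an illustration of the idea, not the exact code of Algorithm 5.2; Ain and Aout are global arrays with one slot per thread block, the strided busy-wait loop generalizes the one-thread-per-slot scheme described above, and the kernel must be launched with no more thread blocks than SMs.

__device__ void global_barrier(volatile int *Ain, volatile int *Aout, int goal)
{
    int bid = blockIdx.x;
    int nblocks = gridDim.x;
    __threadfence();                       // make prior writes visible to other blocks
    if (threadIdx.x == 0) Ain[bid] = goal; // announce that this block has arrived
    if (bid == 0) {
        // Threads of block 0 busy-wait; thread j watches Ain[j], Ain[j + blockDim.x], ...
        for (int j = threadIdx.x; j < nblocks; j += blockDim.x)
            while (Ain[j] != goal) { /* spin */ }
        __syncthreads();                   // all arrivals observed by block 0
        for (int j = threadIdx.x; j < nblocks; j += blockDim.x)
            Aout[j] = goal;                // release the waiting blocks
    }
    if (threadIdx.x == 0)
        while (Aout[bid] != goal) { /* spin */ }
    __syncthreads();                       // release all threads of this block
}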
Algorithm 5.3 contains a CUDA kernel named kernel (Lines 4–14) and a device
function devfun (Lines 1–3) called from kernel. Consider a GPU device with 192
cores per SM and a CUDA kernel launched with 960 threads-per-block. The warp-
scheduler assigns a thread block of 960 threads to each SM (assuming that the
number of total threads is more than the total number of SPs in the GPU). The
threads in a thread block are executed as a collection of warps (32 threads per warp)
which follow the SIMT architecture. The number of warps in the thread block is
thirty (960÷32). An SM has only 192 cores and at a time only six (192÷32) warps
can run on an SM. A new warp is possibly loaded into the SM upon termination of
a running warp, since usually, there is no preemption of a warp. The thread block
finishes execution when all the thirty warps finish execution.
The threads within a single warp follow the same control flow, unlike in the
MIMD model followed on a multi-core CPU. The program in Algorithm 5.3 con-
tains a conditional block (see Lines 6–13). The CUDA compiler produces predicated
instructions for the true and false blocks of statements. All the instructions will be
executed but for those instructions for which the predicate is false, the effect is
that of executing a NULL (or NOP) instruction. In the current example, even if
one thread in a warp satisfies the condition (id < size && B[id] == true), all
the warp-threads enter the conditional block, but all except that one will not
do anything useful. In general, if some threads satisfy a condition and the others
do not, all the threads end up executing both the true and the false code blocks,
with loss of performance. This is known as warp divergence. The best situation is
when all the warp-threads satisfy the condition, or when none of the warp-threads
satisfies the condition. In such a case, all the threads execute only one block of
instructions and overheads of warp divergence are zero. With large nested if-then
blocks, overheads of warp divergence (execution of dummy instructions) can reduce
performance by up to 32 times. The lack of warp divergence in a CUDA kernel is
a necessary (not sufficient) condition for optimal throughput. Conditional blocks
are the primary candidates for warp divergence. Good performance also demands
warps to have coalesced memory access patterns. Scattered access patterns result
in differences in access latency among threads, which increases the execution time.
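As an illustration (not the code of Algorithm 5.3), the kernel below has a data-dependent branch: if the entries of B are scattered, threads of the same warp take different paths and both branch bodies are serialized. Note also that the indexing id = blockIdx.x * blockDim.x + threadIdx.x keeps the accesses to B and out coalesced within a warp.

__global__ void classify(const bool *B, int *out, int size)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= size) return;            // guard against out-of-range threads
    if (B[id]) {
        // taken by some threads of the warp ...
        out[id] = 2 * id + 1;
    } else {
        // ... and the rest take this path; the warp executes both bodies,
        // with the inactive threads predicated off in each.
        out[id] = 2 * id;
    }
}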
GPU memory hierarchy is shown in Fig. 5.3. The SMs in the GPU consist of
registers, local memory, shared memory, and L1 cache. The shared memory should
be used effectively by a CUDA kernel for throughput. The program in Algorithm 5.4
shows the computation of sum of all the elements in an array, where the number of
threads in the thread block is 1024. As many thread blocks as necessary to take care
of the whole array ptr are launched in the main program (see Line 23).
The addition of values of elements for each thread block is carried out using
the GPU shared memory. Each thread initializes one location of the shared array
blockcount in Line 7. Summing of the elements of the shared array blockcount
takes place in the loop (see Line 10), which makes log2 (1024) = 10 iterations.
In the first iteration (i = 2), thread 0 sums up blockcount[0] and blockcount[1],
thread 2 sums up blockcount[2] and blockcount[3], thread 4 sums up blockcount[4]
and blockcount[5], and so on, and leave the results in blockcount[0], blockcount[2],
blockcount[4], etc., respectively. In the second iteration (i = 4), threads 0, 4, 8, . . . ,
only are active. Thread 0 sums up blockcount[0] and blockcount[2], thread 4
sums up blockcount[4] and blockcount[6], thread 8 sums up blockcount[8] and
blockcount[10], etc., and leave the results in blockcount[0], blockcount[4], block-
count[8], etc., respectively. Similarly, in the third iteration (i = 8), sum of the
elements blockcount[0..7] is computed into blockcount[0], sum of the elements
blockcount[8..15] is computed into blockcount[8], etc. Finally, the sum of all the
elements of blockcount will be available in blockcount[0].
The barrier for a thread block is achieved using __syncthreads() library function.
The local value computed by an SM for each thread block is added atomically to the
global device memory location for the variable reduxsum. The above computation
reduces the number of atomic operations and global memory accesses. This results
in good performance and throughput.
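The kernel below is a sketch of the reduction just described (Algorithm 5.4 itself is not reproduced). The names ptr, reduxsum, and the block size of 1024 follow the description above; the guard for the tail of the array is an added assumption.

__global__ void block_sum(const int *ptr, int n, int *reduxsum)
{
    __shared__ int blockcount[1024];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    // Each thread copies one element into shared memory (0 beyond the array end).
    blockcount[tid] = (gid < n) ? ptr[gid] : 0;
    __syncthreads();
    // log2(1024) = 10 iterations: in iteration i, threads whose id is a multiple
    // of i add in the element i/2 positions away, exactly as described above.
    for (int i = 2; i <= blockDim.x; i *= 2) {
        if (tid % i == 0)
            blockcount[tid] += blockcount[tid + i / 2];
        __syncthreads();
    }
    // One atomic addition per thread block instead of one per element.
    if (tid == 0) atomicAdd(reduxsum, blockcount[0]);
}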
The performance of topology driven graph algorithms (e.g., BFS and SSSP) varies
considerably for different graphs with the same numbers of vertices and edges.
The properties of a graph object which affect the performance of topology-driven
algorithms are diameter, variance in degree distribution, and the number of active
elements (edges or vertices). The dependence on the above properties in GPUs is
due to warp-based execution and the massively parallel architecture of the GPU. Graph
algorithms have irregular access patterns and, if care is not taken, this forces the use
of atomic operations in most graph algorithms, which can be expensive.
Atomic operations can be avoided even in cases where two or more threads
write to the memory location simultaneously, provided all the threads write the
same value [32]. The level based BFS Algorithm is a candidate for atomic-free
implementation (see Algorithm 5.54 ). Each vertex is typically processed by one
thread (see Line 15) which sequentially examines its neighbours (see Line 3).
The function BFS is called only on vertices with their dist value equal to the
value of variable level (see Line 17). To begin with only the source vertex has
its dist (distance from source, measured by the number of edges) set to zero and
all others set to infinity. Variable level is also set to zero. Unexamined vertices
get their dist values updated when their level is reached (see Line 3). The global
variable changed is set to one if the dist value is updated in one or more vertices (see
Line 3). The computation reaches a fixed point once the BFS distance values of all
the vertices reachable from the source vertex are computed. In the next invocation, no
thread (vertex) will set the changed variable to one, marking termination.
4 A CUDA program equivalent to this algorithm must handle all the details of storing and accessing
the graph, and other aspects. These are deferred until the chapter on compilation of Falcon.
It is possible that several threads (vertices) launched with the same value of level
try to update the dist value of a vertex simultaneously. However, since all of them
write the same value, correctness is not sacrificed. Therefore, the algorithm does not
require any atomic operations (hardware guarantees atomicity of a single write for
primitive data types).
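A sketch of such an atomic-free, level-synchronous BFS kernel is given below (it is not the exact code of Algorithm 5.5). The graph is assumed to be in CSR form with arrays row_offsets and col_indices; dist, level, and changed follow the description above.

__global__ void bfs_level(const int *row_offsets, const int *col_indices,
                          int *dist, int level, int nvertices, int *changed)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= nvertices || dist[v] != level) return;   // only vertices on the current level
    for (int e = row_offsets[v]; e < row_offsets[v + 1]; e++) {
        int nbr = col_indices[e];
        if (dist[nbr] > level + 1) {
            dist[nbr] = level + 1;   // several threads may write the same value; no atomics needed
            *changed = 1;
        }
    }
}

// Host-side driver (sketch): reset changed, launch once per level,
// and stop when no distance was updated in an invocation.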
BFS computation can create warp divergence, due to the fact that the number
of active vertices is small during the initial and the final stages (only the source
vertex is active at the beginning). However, all the threads would be launched
with many having nothing to do. This happens due to the conditional statement in
Line 15. Those threads whose vertices satisfy the condition will execute the BFS call
statement and others will do nothing. The number of threads (vertices) which satisfy
the condition in the same warp and the number of neighbours for those threads
(vertices) in the BFS computation are dependent purely on the topology of the input
graph.
The pseudo-code for the edge based SSSP computation from Algorithm 3.7 is
reproduced in a slightly modified form in Algorithm 5.6. Edge based computation
results in fine-grained parallelism where computation per thread is very small. In
vertex based SSSP computation shown in Algorithm 3.5, each vertex is processed
by a thread, which in turn, sequentially processes all its outgoing neighbours.
This results in coarse-grained parallelism with more work per thread. However,
in general, edge based processing results in less warp divergence than vertex based
processing.
Warp divergence is present in vertex based algorithms, which iterate over vertices
and their neighbours. This can be eliminated by iterating over edges. The perfor-
mance effects will be visible when the variation in degree distribution is very high.
Other elementary algorithms such as connected components, minimum spanning
tree etc., do not follow a topology driven processing as in algorithms such as BFS
and SSSP. Diameter of a graph does not affect the performance of these benchmarks.
These algorithms involve graph contraction and can be implemented using the
disjoint set union-find data structure.
Worklist based implementation of algorithms does not always perform well
on GPUs. The poor performance in such cases is due to atomic addition of
elements to the worklist and the lack of parallelism in such implementations to
utilize a massively parallel GPU device. The Δ-stepping (bucket-based) implementation
of elementary algorithms is very efficient on multi-core CPUs but not on GPUs. The
parallelism is not enough to utilize the more than a thousand cores available in GPUs,
but is good enough for the fewer than a hundred cores of a multi-core CPU.
Chapter 6
Dynamic Graph Algorithms
Dynamic graph algorithms compute the graph properties from the previous set of
values. Typical operations in dynamic graph algorithms are insertion and deletion of
edges and vertices, and the query for property values relevant to the algorithm. The
efficiency of a dynamic algorithm depends on the data structure used to implement
it. This chapter provides a glimpse into this exciting area in graph analytics.
6.1 Introduction
Real world graphs change their topology over time. Social network graphs get
modified with addition and deletion of edges and vertices. A typical example is the
twitter network graph,1 with users as vertices and “following” relationship between
users as edges. Road network graphs where vertices represent junctions and edges
represent roads, get updated by the addition and deletion of roads. The change
in topology in road network graphs happens over a longer period of time when
compared to social network graphs. The edge weights of road network graphs which
represent expected travelling time from one junction (source vertex) to another
junction (destination vertex) change frequently based on traffic congestion. Traffic
congestion results in change in shortest paths multiple times over a day. The shortest
distance route changes over a long period of time when a road gets added or deleted.
The deletion of an edge happens in situations such as a bridge getting damaged.
The movie-actor or author-publication bipartite graphs change by the addition of
vertices and edges. Algorithms such as Delaunay Mesh Refinement (DMR) and
Delaunay Triangulation (DT) used in computational geometry are also dynamic
where the mesh of triangles is refined by the addition of points, and addition/deletion
of edges.
1 https://snap.stanford.edu/data/twitter-2010.html.
Fig. 6.1 Shortest distance path (v0 → v4): (a) Del_Edge(v0 → v1), (b) original graph, (c) Add_Edge(v3 → v4)
Fig. 6.2 Connected components: (a) after Del_Edge(e1), (b) original graph, (c) after Add_Edge(e2)
Figure 6.2 illustrates how the connected components of a graph change on deletion
of the edge e1 (see Fig. 6.2a, which shows three connected components) and on
addition of the edge e2 (see Fig. 6.2c, which shows a single connected component).
There are optimizations possible with deletion and addition of edges for the Single
Source Shortest Path (SSSP) problem. Deletion of an edge e is considered here
as increasing its weight to ∞. Addition of an edge e with weight w is considered
here as reducing the weight of e from ∞ to w (provided there is no edge with
same source and destination vertex currently in the graph object with weight less
than w). Dynamic single destination shortest path (SDSP) algorithms using priority
queues which compute the shortest path from current values have been reported
in literature [152]. SSSP computation on a graph G can be simply performed by
computing SDSP on the reverse or transpose graph of G (all edges reversed), with
the source marked as the destination. It is also possible to make simple modifications
to the SDSP algorithm in Algorithm 6.1 to directly compute SSSP.
The incremental and decremental SDSP algorithms store properties related to
shortest path computation. Let T ⊆ G be an acyclic graph which contains all the
vertices of G and all the edges belonging to all the single destination shortest paths.
T is an acyclic graph (rather than a tree) when multiple shortest paths exist for the same
vertex; otherwise it is a tree. The algorithms use the data structures mentioned below.
The first three data structures are assumed to be maintained dynamically and should
be available when updates need to be applied:
1. spe is a boolean array of size |E|: spe[i] = 1 if edgei ∈ T, otherwise 0.
2. spd is an array of size |V |: spd[i] = shortest path distance from vi to the
destination vertex.
3. vcnt is an array of size |V |: vcnt[i] = count of edges e : vi → vj which are part
of T, i.e., spe[e] = 1.
4. Q is a set of vertices.
5. pqueue is a priority queue implemented using a heap.
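As a host-side illustration (an assumption, not code from the book), these data structures could be declared as follows, with pqueue as a min-heap of (distance, vertex) pairs:

#include <vector>
#include <set>
#include <queue>
#include <utility>

struct SDSPState {
    std::vector<char> spe;   // spe[i] = 1 if edge i belongs to T, the shortest-path DAG
    std::vector<int>  spd;   // spd[v] = shortest distance from v to the destination
    std::vector<int>  vcnt;  // vcnt[v] = number of outgoing edges of v that are in T
    std::set<int>     Q;     // affected vertices collected before recomputation
    std::priority_queue<std::pair<int, int>,
                        std::vector<std::pair<int, int> >,
                        std::greater<std::pair<int, int> > > pqueue; // min-heap of (spd, vertex)
};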
The pseudo code for the incremental dynamic SDSP computation is shown in
Algorithm 6.1 where edge weight is increased for a single edge. This can handle
deletion of an edge as well. The function takes as arguments, the edge e : v1 → v2,
whose weight is increased, edges belonging to T (spe), the distance value for each
vertex to the destination (spd ), the number of edges belonging to T from each
vertex (vcnt), and the graph object (G). The Heap object used for the SDSP (Single
Destination Shortest Path) recomputation is pqueue, which stores vertices of the
graph object. The algorithm simply returns if the edge e does not belong to T (see
Line 4). The increase in the weight does not affect the shortest path in this case. If
the edge e belongs to T, value vcnt[v1] is decreased by one. The shortest path values
of the vertices need to be recomputed if and only if there are no other outgoing edges
from v1 which belong to T, (that is if vcnt[v1] becomes zero). The function returns
immediately, if vcnt[v1] is greater than zero (see Line 6).
The loop at Line 8 identifies and accumulates all the vertices affected by the
edge whose weight is increased, and places these vertices in the set Q. These are
the vertices connected to the source of the deleted edge directly or indirectly by the
shortest path tree edges. It also unmarks all the edges in the shortest path tree that
are connected to the vertices in Q. The next loop at Line 19 tries to compute tentative
values for the shortest paths of the vertices in Q, and places them in the Heap object
pqueue. Then the while-loop at Line 25 uses the Heap object pqueue to consider
one vertex at a time in the order of increasing distance from the destination and
recomputes the distances. It also recomputes the shortest path tree in the affected
region simultaneously.
The vertex v1 is added to the set Q. The spd value of all the vertices added to
the queue is made ∞ (Line 9). The shortest path from vertex v1 to the destination
vertex needs to be recomputed now. The other affected vertices are now added to Q
(see Lines 8–18). The shortened distance needs to be recomputed for each vertex t
such that t → v ∈ T , v ∈ Q, and vcnt[t] == 1 (as discussed for vertex v1). If
vcnt[t] − 1 is zero, t is considered as affected by the increase in weight of the edge
v1 → v2, and is added to Q.
The elements in Q are then processed (see Lines 19–24). For an element u ∈ Q,
all the outgoing edges u → v are considered and if spd[v] + weight (u → v) is
greater than spd[u], spd[u] is modified to spd[v] + weight (u → v). If the distance
value is reduced from ∞ for u ∈ Q, it is added to the heap pqueue.
The elements in pqueue are then processed (see Lines 25–39). The vertex v with
minimum distance value is deleted from pqueue. Then all the vertices t such that
the edge t → v ∈ G.E are considered and the distance value of t is reduced if its
current value is greater than dist[v] + weight (t → v). If the value is decreased,
then the vertex t is added to pqueue using the adjust function. The adjust function
decreases the value of the vertex t if it is already in pqueue. Otherwise, it adds t to
pqueue. T and the number of outgoing edges in T for each vertex are then updated
in a loop (see Lines 33–38). The process continues until pqueue becomes empty.
Figure 6.3 shows an input graph in Fig. 6.3a and the modified graph in Fig. 6.3b on
deletion of the edge v2 → v0 . To compute single destination shortest path (SDSP),
the destination vertex is taken as v0 for the example. The edges which are part of
the SDSP tree T are shown in bold face in both the subfigures.
The edge v2 → v0 is a part of T and the vertex v2 has Vcnt value of one. So the
vertex v2 is added to the set Q. All the incoming edges to v2 are now processed. The
vertices v4 and v3 are also added to Q, as the Vcnt value becomes zero for both the
vertices. Vertex v1 is not considered as the edge v1 → v2 is not a part of T. The set
Q now contains the vertices v2 , v4 and v3 , but they need not be processed in that
order (Q is a set). The distance field of all the elements in Q is set to ∞.
The vertices in Q are now considered in any arbitrary order, say, in the order v3 ,
v4 and v2 (this order illustrates the example better). The outgoing edges of v3 are
considered first. Edge v3 → v2 contributes nothing since v2 is ∞. Edge v3 → v0
makes the tentative value of spd[v3] as 45, since ∞ > spd[v0] + 45 = 0 +
45 = 45. Thus, (v3 , 45) is added to the heap pqueue. Now vertex v4 is considered.
As before, the edge v4 → v2 yields nothing. The edge v4 → v1 makes spd[v4] equal to
spd[v1] + 25 = 10 + 25 = 35, tentatively. Therefore, (v4, 35) is now added to the
heap, pqueue. Vertex v2 is processed next. Edge v2 → v1 sets spd[v2] as spd[v1] +
20 = 10 + 20 = 30, tentatively, and (v2, 30) is also added to the heap, pqueue.
The elements of the heap pqueue are processed in increasing order of the distance
component (spd[]) of its elements. The first element to be processed is (v2, 30). Its
in_neighbour v3 (with spd[v3] as 45) gets spd[v3] reduced to 35 due to the edge
v3 → v2, because spd[v2] + 5 = 30 + 5 = 35 < 45. The entry (v3, 45) in the heap
gets modified to (v3 , 35). The other two in_neighbours v1 and v4 are not affected.
While processing out_neighbours, the edge v2 → v1 is set as a tree edge.
The element (v4 , 35) is removed from pqueue and processed next. There are no
in_neighbours to be processed. The out_neighbour loop marks the edge v4 → v1 as
a tree edge. Now the last element (v3 , 35) is removed from the heap and processed.
Again, there are no in_neighbours to be processed. The out_neighbour loop marks
the edge v3 → v2 as a tree edge. The algorithm now terminates.
Table 6.1 shows the initial and final values of Vcnt and SPdist for all the vertices.
The algorithm is efficient when it accepts all the edges in a single batch. The set
Q can be populated with the source vertices of all the deleted edges right in the first
step. The rest of the algorithm remains the same.
Modifying the incremental SDSP algorithm to compute SSSP is simple. The
destination vertex of the deleted edge is added to Q, instead of the source vertex
(v1 is replaced by v2 in Algorithm 6.1). The out_neighbours are processed first,
followed by in_neighbours in the two loops at Lines 8 and 19, and also in the two
loops at Lines 27 and 33.
As an example of computing SSSP incrementally, Fig. 6.4 shows graphs with
shortest paths before and after deleting the edge v0 → v2 . The edges belonging to T
are shown in bold face. The source vertex is taken as v0 . Table 6.2 shows the initial
and final values of Vcnt and SPdist for all the vertices.
The decremental version of SDSP allows for insertion of a new edge with a weight
w, which is treated as reducing the weight of the edge from ∞ (non-existent edge) to
w. The pseudo-code for the decremental SDSP algorithm is shown in Algorithm 6.2.
If the new edge v1 → v2 does not change the distance of vertex v1 to the destination
(see Line 4), the processing halts. If the new edge creates an alternate shortest path
from v1 to destination, it is marked as a shortest path edge and the path counter
vcnt[v1 ] is incremented (see Line 7). Otherwise, distance of v1 is updated and the
heap pqueue is initialized with v1.
The While-loop at Line 12 removes a minimum distance vertex at a time from the
heap pqueue and processes its out_neighbours and then its in_neighbours. The for-
loop at Line 15 processing out_neighbours of vertex v, either adds an alternative
shortest path passing through v, the edge v → t, and t, or it ignores the edge v → t.
The for-loop at Line 24 processing in_neighbours of vertex v, updates the shortest
path counter of t, if passing through t, the edge t → v, and v, is cheaper than taking
an alternative path through t bypassing v, and adds t to the heap, pqueue. Otherwise,
the edge t → v is marked as part of the shortest path, if the shortest path passing
through t, the edge t → v, and v is an alternative shortest path (see Line 25).
As an example, consider adding the edge v2 → v0 with weight 15, back to the graph
in Fig. 6.3b. Vertex v2 gets added to the heap, pqueue, along with a distance of 15.
While processing the out_neighbour v0 of v2 , the edge v2 → v0 gets marked as a
shortest path edge. There is no effect on v1 . While processing its in_neighbour v3 ,
the shortest path from v3 reduces and it gets added to the heap, pqueue, along with
a distance of 20. Similarly, v4 also gets added to the heap along with a distance
of 30. In the next two iterations of the while-loop, the two edges, v3 → v2 and
The pseudo-code for parallel computation of DMR is shown in Algorithm 6.5. The
algorithm first computes the set of all bad triangles in the input delaunay triangulated
mesh (Line 5). The set of all bad triangles is stored in the dynamic collection
(varying size) bad_triangles. The algorithm re-triangulates all the elements in
the Collection object bad_triangles in parallel (Lines 7–15). The cavity for
each bad triangle contains a collection of its surrounding triangles which needs
6.5 Challenges in Implementing Dynamic Algorithms
The change in graph topology due to addition and deletion of vertices and edges
demands efficient utilization of memory. Deletion of edges and vertices demands
efficient garbage collection, which is the programmer's responsibility in unmanaged
languages such as C and C++. Addition of edges requires the graph object to
increase its size at runtime. This requires reallocation of the graph memory so that
more edges and vertices can be accommodated at runtime. Compilers such as gcc
and nvcc provide library functions for reallocation of memory. Efficient utilization
is possible only when deleted memory locations (occupied by deleted edges and
vertices) are freed and some compaction is performed so that garbage memory is
minimal. This is crucial when the data sizes are huge, but requires more effort from
the programmer.
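As a simple host-side illustration (the names and the doubling growth policy are assumptions), an edge array can be grown on demand with realloc(), so that repeated insertions have amortized constant cost:

#include <stdlib.h>

struct EdgeArray {
    int *dst;        // destination vertex of each edge
    int *weight;     // weight of each edge
    int  size;       // number of edges currently stored
    int  capacity;   // number of edges the arrays can hold
};

// Append an edge, reallocating the arrays when the capacity is exhausted.
void add_edge(struct EdgeArray *ea, int dst, int weight)
{
    if (ea->size == ea->capacity) {
        ea->capacity = (ea->capacity == 0) ? 16 : 2 * ea->capacity;
        ea->dst    = (int *)realloc(ea->dst,    ea->capacity * sizeof(int));
        ea->weight = (int *)realloc(ea->weight, ea->capacity * sizeof(int));
    }
    ea->dst[ea->size]    = dst;
    ea->weight[ea->size] = weight;
    ea->size++;
}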
Most elementary dynamic graph algorithms require different data structures such as
Queue, Heap, Forest of trees, etc. Efficient parallel implementation of these data
structures is very challenging. The performance benefit of parallel implementation
of elementary dynamic graph algorithms using these data structures is still debat-
able. There are not many parallel implementations of dynamic algorithms available
in the literature. Algorithms such as DMR and DT require a lock for the entire parallel
region (global barrier). Such a feature is missing in GPU devices, and needs to be
Chapter 7
Falcon: A Domain Specific Language for Graph Analytics
The domain-specific language Falcon is presented in this chapter. The data types
and statements of Falcon that support easy programming of graph analytics
applications are described. To drive home the point that Falcon programs can
be very efficient, code generation mechanisms used in the Falcon compiler are
delineated with examples.
7.1 Introduction
Graph analytics frameworks provide abstractions which require the algorithmic logic
to be written inside the API functions specific to each framework. The programmer
may still be required to handle dynamic memory allocation and thread management.
Domain Specific Languages (DSLs) provide a higher level of abstraction with
special constructs and data types specific to the domain. A DSL program is
closer to the pseudo-code of an algorithm. This eases programming and enhances
productivity while producing code for applications specific to the domain. Of
course, this benefit is often at the cost of generality, and one may not be able
to implement algorithms outside that domain in the corresponding DSL. SQL for
database applications and HTML for web programming are examples of DSLs. This
chapter discusses DSLs for graph analytics with emphasis on the Falcon graph
DSL [27].
The novel features of the Falcon DSL and its compiler are listed below:
• Support for heterogeneous hardware: CPU and GPU
• Support for programming graph analytics on distributed heterogeneous systems
• Support for dynamic graph algorithms
• Support to represent and process meshes as graphs
Figure: the Falcon intermediate representation is specialized based on the TARGET argument: TARGET=0 (GPU), TARGET=1 (CPU), TARGET=2 (multi-GPU), TARGET=3 (CPU cluster), TARGET=4 (GPU cluster), TARGET=5 (GPU+CPU cluster); TARGET ≤ 2 corresponds to the single-machine backends.
7.2 Overview
Falcon extends the C programming language with additional data types, prop-
erties for data types, and constructs for parallel execution and synchronization. A
programmer inputs a Falcon program (DSL code) and one or more command-
line arguments to the Falcon compiler. The command-line arguments include
TARGET architecture, type of the DSL program (static or dynamic), dimensions
of vertices in the graph objects (supports up to three dimensional coordinates),
etc. Falcon programs are explicitly parallel. The Falcon compiler also supports
parallel reduction operations. An important implication of extending a general-
purpose language is that the programmer can continue to write general programs
with the new DSL as well.
An SSSP program in Falcon is shown in Algorithm 7.1. The Graph object graph
is added with a vertex property dist of type int in Line 8. The addPointProperty()
function is used to add a new property to each vertex in the graph object. The graph
object is read in Line 9. The dist property of each vertex is initialized to a sufficiently
large value in Line 11 using the foreach parallel construct. The foreach
parallel construct operates on the Graph object graph using the iterator points.
It is converted to parallel code by the code generator of the Falcon compiler.
For example, OpenMP and CUDA code are generated for multi-core CPU and GPU
respectively.
The dist property of the source vertex is initialized to zero in Line 12. SSSP
computation then happens in the while loop (Lines 13–17) until a fixed point
is reached. After initializing changed, the dist property values of the vertices are
reduced by calling the relaxgraph() function using a parallel foreach statement.
The relaxgraph() function takes a Point object p and a Graph object graph, as
arguments. For all the edges p → t ∈ G, the dist value of the vertex t is reduced
using the MIN() function, provided dist value of t is greater than the sum of dist
value of the vertex p and weight of the edge p → t. If the dist value of the vertex
t is reduced, the variable changed is set to one. The Falcon compiler relies on
coarse-grained parallelism and does not support nested parallelism (see discussion
in Sect. 3.2.1). The foreach statement inside the relaxgraph function (Line 3)
is converted to a sequential for loop in the generated code as it is nested under
the foreach statement which calls the relaxgraph() function in Line 15. The
MIN() function is converted to atomic-min() function of GPU or CPU based on the
target hardware. The semantics of the MIN(arg1, arg2, changed) library function of
Falcon is as follows:
atomic if (arg1 > arg2) {arg1 = arg2; changed = 1;}
If the value of the variable changed is still zero in Line 16, it implies that
distances have stabilized and have not changed for any vertex in the last iteration.
The fixed-point computation terminates in this case. Otherwise, the while loop
continues for another iteration.
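One plausible GPU lowering of this semantics (an illustration, not necessarily the exact code emitted by the Falcon compiler) uses atomicMin and inspects the returned old value to decide whether to set changed:

__device__ void falcon_MIN(int *arg1, int arg2, int *changed)
{
    int old = atomicMin(arg1, arg2);   // atomically: *arg1 = min(*arg1, arg2); returns the old *arg1
    if (arg2 < old)                    // the stored value was actually reduced
        *changed = 1;
}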
The data types provided by Falcon are listed in Table 7.4. Falcon provides
Point, Edge, Graph, Set and Collection data types.
A Point data type stores the vertices of a graph object, and can have up to three
dimensions with coordinate values stored in the fields x, y and z respectively.2
An Edge data type represents an edge and consists of source, and destination
vertices, and also the weight of the edge, all of which can be accessed using src, dst,
and weight fields of the Edge data type.
A Graph data type consists of vertices (Point) and edges (Edge). Additional
properties can be associated with Graph, Point and Edge objects using the API
functions of the Falcon data types. These functions are explained in Sect. 7.4.
A Set data type is implemented as a union-find data structure whose size (in
terms of the number of primitive elements such as Point or Edge) cannot change at
runtime. The two operations on a Set data type are the well-known operations on
a union-find data structure, viz., find a primitive element in the set and perform
a union of one subset with another disjoint subset. The parallel union-find data
structure is explained in detail in Sect. 3.5.5.1. Falcon expects that union() and
find() operations should not be called inside the same function because this may
give rise to race conditions leading to incorrect results.
A Collection data type is implemented as a dynamic data structure. The size
of a Collection object may be modified at runtime by add and del operations.
The parallel implementation of the add and del operations uses atomic operations
so that its sequential consistency is preserved.
The API functions of the Graph data type are shown in Table 7.5. The read function
reads a graph object from an input file. The addEdgeProperty and addPointProperty
functions add a new property to the edges and vertices respectively, of a graph
object. The addProperty function adds a new property to a graph object. The npoints
and nedges fields of the Graph data type store the number of vertices and edges
respectively, of a graph object.
The important fields of the Edge data type are shown in Table 7.6. The src and
dst fields store the source and destination vertices of an edge. The weight field stores
the weight of an edge. The del function is used to delete an edge. Note that this
function is different from the graph.delEdge() function. The former operates on an
Edge and the latter operates on a Graph data type.
The important fields and functions of the Point data type are shown in
Table 7.7. The Point data type can store int or float values and it can represent
up to three dimensional coordinates. The values in each dimension can be accessed
using the fields x, y and z respectively.
Table 7.5 Important API functions of the Falcon Graph data type
Function Description
graph.read() Read a Graph object
graph.addPointProperty() Add a new property to each vertex
graph.addEdgeProperty() Add a new property to each edge
graph.addProperty() Add a new property to the Graph
graph.getWeight() Get weight of an edge
graph.addPoint() Add a new vertex
graph.addEdge() Add a new edge
graph.delPoint() Delete a vertex
graph.delEdge() Delete an edge
Table 7.7 Important fields and functions of the Point data type
Field Type Description
x, y, z var Stores Point coordinates in each dimension
getOutDegree() Function Returns number of outgoing edges of a Point
getInDegree() Function Returns number of incoming edges of a Point
Falcon provides mutual exclusion via the single statement. The parallel con-
structs are the foreach statement and the parallel sections statement.
The Falcon compiler supports reduction operations and does not support nested
parallelism.
The single statement operates on either one item or a collection of items
(see Table 7.8). The else block is optional. The first variant of the single
statement is used to lock one item. In the second variant, a thread tries to get a lock
on a Collection object coll given as the argument. The lock on a collection
of elements is required to implement dynamic algorithms where all the shared
data among threads (e.g., a set of neighbor nodes) needs to be locked before
participating threads can process the elements in the Collection object. A thread
succeeds if it is able to lock all the elements in the Collection object. In both the
variants, the thread that succeeds in acquiring a lock executes the code following it.
The optional else block is executed by threads that fail to acquire the lock.
The foreach statement is an important parallel execution construct in
Falcon. The foreach statement processes a set of elements in parallel. The
foreach statement on objects of Collection or Set data type does not
have an iterator. The condition and advance_expression are optional for both the
variants of foreach statement (see Table 7.9). The foreach statement with
the optional condition expression makes sure that only the elements of the object
that satisfy the condition are processed.
The Graph data type has the iterators points and edges to process all the
vertices and edges respectively. It also has the fields npoints and nedges which
contain the number of vertices and number of edges of the graph, respectively.
The iterator pptyname is generated automatically by the Falcon compiler when
a new property pptyname is added to a Graph object using the addProperty()
function. For example, an iterator by the name triangle will be added to a Graph
object when a property with name triangle is added to the graph object using the
addProperty() function.
Figure 7.2 shows the high level view of the code generation process for single
machines adopted by the Falcon compiler. As an example, Algorithm 7.4 shows
the optimized SSSP Falcon program. The program has three functions reset(),
update(), and relaxgraph(). The Graph object graph is augmented with three
Point properties dist, olddist and uptd of types int, int and bool, respectively
(see Lines 18–20). Since we have explained SSSP in multiple contexts, only the
points to be noted in the context of Falcon are mentioned here.
An important point to note in Algorithm 7.4 is that the code uses the graph only
on the CPU. The programmer does not need to worry about two graph copies, or the
data transfer. All of these are taken care of automatically by the Falcon compiler,
as explained in later sections.
Only the vertices which have the uptd property true will execute statements
inside the relaxgraph() function as the foreach statement has a filter based on the
condition (t.uptd == true) (see Line 27). The update() function checks whether dist
value of each vertex has been reduced by relaxgraph() function using the condition
t.dist < t.olddist ∀t ∈ graph.vertices. The uptd property of a vertex is set to true
if its dist value is reduced. Only such vertices participate in the next invocation of
the relaxgraph() function. The foreach statement in Line 4 is not parallel as it
is nested inside the foreach statement in Line 27 which calls the relaxgraph()
function. The computation in the while-loop is the usual fixed-point computation.
The library function MIN sets the variable changed to one, if reduction in dist value
occurs, which results in the while-loop iterating once more. Otherwise, after the
check at Line 28, the while-loop exits.
Algorithm 7.1 does not have a condition for the foreach statement which calls
the relaxgraph kernel. This leads to |E| number of atomic MIN operations in each
call to the relaxgraph function. Thus Algorithm 7.1 reaches a fixed point in a smaller
number of iterations than Algorithm 7.4. Algorithm 7.4 iterates through
those outgoing edges of the vertices v ∈ G.V which have uptd property True.
During the first iteration only the outgoing edges of the source vertex participate in
the computation. This number increases as the wavefront of vertices advances. The
advantage of Algorithm 7.4 is that the number of atomic MIN operations is smaller
than that of Algorithm 7.1, leading to better performance, but with more iterations
required to reach the fixed point.
The code generator of the Falcon compiler generates parallel C++ and CUDA
codes for multi-core CPU and GPU respectively. The graph object is represented
in Compressed Sparse Row (CSR) format by default, and Edge List format is also
supported. The C++ classes HGraph and GGraph are used to store a graph object
on multi-core CPU and GPU respectively, and both the classes inherit from the
class Graph. The Graph class has a field named extra of type void* which is
used to allocate extra properties added to a graph object using addPointProperty,
addEdgeProperty, and addProperty methods. The Falcon compiler performs type
checking and reports errors when there is mismatch in operators for domain-specific
constructs. The Point and Edge variables are converted to integers.
The Point data is stored as a Union data type with float and int fields (fpe
and ipe respectively). Point objects are stored in an array named points of size |V |.
The fields of the Edge data type are stored in CSR format using two arrays edges
and index, of sizes 2 × |E| and (|V| + 1) respectively. Each edge stores two values, the destination vertex and the edge weight. The source vertex is computed from the index array.
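To make this layout concrete, the following is a minimal C++ sketch of the storage scheme just described. The field names fpe, ipe, edges, index, and npoints follow the description above; the type names union_ep and CSRGraph and the helper function are illustrative assumptions, not the declarations actually emitted by the Falcon compiler.

// Sketch of the CSR storage described above (type names are illustrative).
union union_ep {            // a stored value is either a float or an int
  float fpe;
  int   ipe;
};

struct CSRGraph {
  int npoints, nedges;
  union_ep *points;         // |V| entries, one per vertex
  union_ep *edges;          // 2*|E| entries: (destination, weight) per edge
  int      *index;          // |V|+1 entries: index[t] is the first edge of vertex t
};

// Sum of the outgoing edge weights of vertex t, to show how the arrays are indexed.
inline int sum_out_weights(const CSRGraph &g, int t) {
  int degree = g.index[t + 1] - g.index[t];       // number of outgoing edges of t
  int start  = 2 * g.index[t];                    // first slot of t's edges in edges[]
  int sum = 0;
  for (int e = 0; e < degree; ++e) {
    // g.edges[start + 2*e].ipe is the destination vertex of the e-th outgoing edge.
    sum += g.edges[start + 2 * e + 1].ipe;        // its edge weight
  }
  return sum;
}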
A Graph object consists of collections of vertices and edges. The Set data type
implementation uses the Union-Find data structure.4 The Collection data type
has been implemented using the Thrust library for GPU [4] and the Galois Worklist5
for CPU [66].
The number and types of extra properties vary across algorithms. The Falcon
compiler generates code which allocates memory for the properties specified in the
input program. Algorithm 7.5 shows the generated code for allocating the properties
dist, olddist and uptd on multi-core CPU (Lines 6–11) and GPU (Lines 12–19). The
properties are allocated in the extra field of the Graph class using the functions
malloc() and cudaMalloc() on CPU and GPU respectively. The extra field of type
void* points to GPU (device) memory that cannot be accessed directly from the CPU (host). Therefore, the structure is allocated on the device first, its property arrays are then allocated through a temporary host-side copy tmp, and tmp is finally copied back to the device memory pointed to by extra.
4 The details of implementation of Union-Find in Falcon are available in [26] and https://www.
csa.iisc.ac.in/~srikant/papers-theses/Unnikrishnan-PhD-thesis.pdf.
5 https://iss.oden.utexas.edu/?p=projects/galois.
Algorithm 7.5: SSSP: Generated Code for Extra Property Allocation on CPU
and GPU
1 struct struct_hgraph {
2 int *dist, *olddist;
3 bool *uptd;
4 };
5 struct struct_hgraph tmp;
6 alloc_extra_graphcpu(HGraph &graph) {
7 graph.extra = (struct struct_hgraph *) malloc(sizeof(struct struct_hgraph));
8 ((struct struct_hgraph *)(graph.extra))→dist = (int *)
malloc(sizeof(int)*graph.npoints);
9 ((struct struct_hgraph *)(graph.extra))→olddist = (int *)
malloc(sizeof(int)*graph.npoints);
10 ((struct struct_hgraph *)(graph.extra))→uptd = (bool *)
malloc(sizeof(bool)*graph.npoints);
11 }
12 alloc_extra_graph(GGraph &graph) {
13 cudaMalloc((void **) &(graph.extra), sizeof(struct struct_hgraph));
14 cudaMemcpy(&tmp, (struct struct_hgraph *)(graph.extra), sizeof(struct
struct_hgraph), cudaMemcpyDeviceToHost);
15 cudaMalloc((void **) &(tmp.dist), sizeof(int)* graph.npoints);
16 cudaMalloc((void **) &(tmp.olddist), sizeof(int)* graph.npoints);
17 cudaMalloc((void **) &(tmp.uptd), sizeof(bool)* graph.npoints);
18 cudaMemcpy(graph.extra, &tmp, sizeof(struct struct_hgraph),
cudaMemcpyHostToDevice);
19 }
Within the generated code, a Point variable is thus an integer vertex identifier. The number of outgoing edges from a vertex t is found using (index[t + 1] − index[t]). The starting offset for the outgoing edges from vertex t is 2 × index[t]. The GPU code for the relaxgraph() function in SSSP is shown in Algorithm 7.6, with the last two lines showing the CUDA kernel call to relaxgraph() from the host.
The code generated for (parallel) call to the relaxgraph() function for CPU
is shown in Algorithm 7.7. The relaxgraph() function call inside the foreach
statement is converted to an OpenMP parallel for loop by the Falcon compiler.
The variable TOT_CPU carries a value equal to the number of cores in the CPU.
Algorithm 7.6: SSSP: GPU Version of the Code Generated for relaxgraph()
and the Function Call from CPU
1 #define t ((( struct struct_hgraph *)(graph.extra)))
2 __global__ void relaxgraph(GGraph graph, int x) {
3 int id = blockIdx.x * blockDim.x + threadIdx.x + x;
4 if( id <graph.npoints && t→uptd[id] == true ){
5 int falcft0 = graph.index[id];
6 int falcft1 = graph.index[id + 1] - graph.index[id];
7 for (int falcft2 = 0; falcft2 <falcft1; falcft2++) {
8 int ut0 = 2 * (falcft0 + falcft2); //edge index
9 int ut1 = graph.edges[ut0].ipe; //dest point
10 int ut2 = graph.edges[ut0 + 1].ipe; // edge weight
11 GMIN( t->dist[ut1], t->dist[id] + ut2, changed);
12 }
13 }
14 }
15 int flcBlocks=(graph.npoints / TPB + 1)>MAXBLKS ? MAXBLKS : (graph.npoints / TPB
+ 1);
16 for (int kk = 0;kk<graph.npoints;kk += TPB * flcBlocks)
17 relaxgraph <<< flcBlocks, TPB >>>(graph, kk);
Algorithm 7.7: CPU Version of the Code Generated for relaxgraph() and Its
Call
1 #define t ((( struct struct_hgraph *)(graph.extra)))
2 void relaxgraph(int id, HGraph &graph) {
3 if( id <graph.npoints && t→uptd[id] == true ){
4 int falcft0 = graph.index[id];
5 int falcft1 = graph.index[id + 1] - graph.index[id];
6 for (int falcft2 = 0; falcft2 <falcft1; falcft2++) {
7 int ut0 = 2 * (falcft0 + falcft2); //edge index
8 int ut1 = graph.edges[ut0].ipe; //dest point
9 int ut2 = graph.edges[ut0 + 1].ipe; // edge weight
10 HMIN( t->dist[ut1], t->dist[id] + ut2, changed);
11 }
12 }
13 }
14 #pragma omp parallel for num_threads(TOT_CPU)
15 for (int i = 0; i <graph.npoints; i++) relaxgraph(i, graph);
Algorithm 7.9: Code Generated for Line 26: { changed = 0; } in Algorithm 7.4
1 int falcvt3 = 0;
2 cudaMemcpyToSymbol(changed, &(falcvt3), sizeof(int), 0, cudaMemcpyHostToDevice);

The single statement, when used with only one element, is converted to compare-and-swap based code for CPU and GPU. When it is used with a collection of items, each thread tries to lock possibly overlapping elements. If two or more threads try to lock collections of elements, say A and B, with common elements among them (i.e., A ∩ B ≠ ∅), the thread with the lowest thread-id is given priority in locking its collection object. This is ensured by the Falcon code generator. The lock on a collection of elements requires a global barrier, and the Falcon compiler implements a global barrier for GPU in software (see Algorithm 5.2 in Chap. 5).
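As a concrete illustration of the compare-and-swap lowering for a single statement over one element, the following is a minimal CUDA/C++ sketch. The lock array, its 0/1 encoding, and the helper names are assumptions made for this example; they are not the code actually emitted by Falcon.

// Sketch of CAS-based element locking for a single statement (illustrative only).
#define NPOINTS 1024                          // assumed upper bound on |V| for the sketch
__device__ int lockarr_d[NPOINTS];            // 0 = free, 1 = taken (assumed encoding)

__device__ bool try_single(int v) {
  // atomicCAS returns the old value; the thread that observes 0 wins element v
  // and may execute the body of the single statement for it.
  return atomicCAS(&lockarr_d[v], 0, 1) == 0;
}

// CPU counterpart using a GCC/Clang builtin.
static int lockarr_h[NPOINTS];
static bool try_single_cpu(int v) {
  return __sync_bool_compare_and_swap(&lockarr_h[v], 0, 1);
}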
The CUDA kernel RSUM0, which performs the reduction, is called from the host. RSUM0 uses a single-dimensional array reduxarr stored in the shared memory private to each streaming multiprocessor (SM). The size of the array is 1024, and the CUDA kernel is called from the host (CPU) with the number of threads per block TPB ≤ 1024. The weights of the edges whose marked value is true are added to reduxarr by each thread in the thread block (see Line 9). The values in the elements of the array reduxarr are added to reduxarr[0] using a sequential for loop (see Lines 13–16). The local value computed by each thread block (available in reduxarr[0]) is then added to the global device (GPU) variable dmstcost (see Line 17). The final value of dmstcost is copied to the host variable mstcost (see Lines 21–29).
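The following CUDA sketch illustrates the reduction pattern just described: a per-block shared array, a sequential fold into element 0 by thread 0, and an atomic add into a device global. The names marked, weight, and dmstcost follow the description above, while the kernel name and argument layout are assumptions; the actual generated RSUM0 kernel differs in detail.

// Illustrative block-level reduction (not the generated RSUM0 kernel).
__device__ int dmstcost_d;                         // global device accumulator

__global__ void reduce_mst_cost(const bool *marked, const int *weight, int nedges) {
  __shared__ int reduxarr[1024];                   // one slot per thread, TPB <= 1024
  int tid = threadIdx.x;
  int id  = blockIdx.x * blockDim.x + tid;

  reduxarr[tid] = (id < nedges && marked[id]) ? weight[id] : 0;  // per-thread contribution
  __syncthreads();

  if (tid == 0) {
    for (int i = 1; i < blockDim.x; ++i)           // sequential fold into reduxarr[0]
      reduxarr[0] += reduxarr[i];
    atomicAdd(&dmstcost_d, reduxarr[0]);           // add the block-local sum to the global
  }
}
// The host finally copies dmstcost_d into the host variable (e.g., with cudaMemcpyFromSymbol).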
The Falcon compiler supports distributed graph analytics on CPU clusters, on GPU clusters (using the GPUs of each machine, or both the CPUs and the GPUs of each machine), and on multi-GPU machines. A programmer needs to write only a single program in Falcon; with appropriate command line arguments, the Falcon compiler generates programs for the different distributed systems. The generated code uses Message Passing Interface (MPI) library calls for communication, and C++ (respectively CUDA) code for the CPU (respectively GPU) device.
Real-world graphs are often very large and cannot be processed on a single machine. Typical examples are social network graphs such as those of Twitter and Facebook. Such graphs are partitioned into multiple subgraphs, and each subgraph is processed on a separate machine. Distributed graph analytics in Falcon follows the BSP model (see Sect. 1.9.1.1), which involves a series of supersteps, with each superstep having three phases: parallel computation, communication, and synchronization. Graph analytics is efficient when there is a proper balance between computation and communication. Computation is balanced if each subgraph requires a similar computation time during each superstep. Communication time is minimal if the number of edges across subgraphs is minimal. Optimal graph partitioning, which balances partition sizes and minimizes the edge-cut across partitions, is an NP-hard problem, so practical systems rely on heuristics (see Sect. 1.4); a simple such heuristic is sketched below.
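As a minimal illustration of one such heuristic (not necessarily the partitioner Falcon uses), the following C++ sketch assigns each vertex to a machine by a modulo hash of its identifier and places every edge on the machine that owns its source vertex, yielding an edge-cut partitioning.

// Illustrative hash-based edge-cut partitioning (names and policy are assumptions).
#include <vector>

struct Edge { int src, dst, weight; };

inline int owner(int vertex, int nmachines) {       // modulo hash: owning machine of a vertex
  return vertex % nmachines;
}

std::vector<std::vector<Edge>> partition(const std::vector<Edge> &edges, int nmachines) {
  std::vector<std::vector<Edge>> parts(nmachines);
  for (const Edge &e : edges)
    parts[owner(e.src, nmachines)].push_back(e);    // dst may be a remote vertex here
  return parts;
}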
A distributed system requires a different graph storage strategy than a shared
memory system. The Point data type is associated with a global vertex id (GID)
in the graph. The same vertex can have a different remote vertex id (RID) in different
subgraphs. Each vertex is also associated with a local vertex id (LID) in exactly
one subgraph. Table 7.11 shows the partitioning of a graph with 15 vertices across three machines and the remapping of GIDs to RIDs and LIDs. The GIDs of the vertices are v0, ..., v14.

Table 7.11 Conversion of GID to LID and RID on three machines M1, M2, M3
M1: v0 (l0), v1 (l1), v2 (l2), v3 (l3), v4 (l4), v6 (r5), v8 (r6), v9 (r7), v11 (r8), v13 (r9)
M2: v5 (l0), v6 (l1), v7 (l2), v8 (l3), v9 (l4), v2 (r5), v4 (r6), v11 (r7), v13 (r8)
M3: v10 (l0), v11 (l1), v12 (l2), v13 (l3), v14 (l4), v3 (r5), v5 (r6), v7 (r7)
Vertex offsets: M1 (machine 0): (0, 5, 8, 10); M2 (machine 1): (0, 5, 7, 9); M3 (machine 2): (0, 5, 6, 8)

The first partition, on machine M1, has ten vertices, of which v0, v1, v2, v3, v4 are local to M1 (renamed l0, ..., l4), v6, v8, v9 are remote and located on machine M2 (renamed on M1 as r5, r6, r7), and v11, v13 are also remote and located on machine M3 (renamed as r8, r9). The offsets array (the last row in Table 7.11) shows the vertex offsets for each machine. For example, the second block of offsets belongs to M2; within that block, offsets 0, 5, and 7 indicate the starting offsets of the local vertices of M2, of the remote vertices located on M1, and of the remote vertices located on M3, respectively. The last entry of each offset block (9 in the block for M2) is the total number of vertices (local + remote) on that machine. It must also be noted that the number of remote vertices on a machine can be computed from the offset array. For example, offset[1, 2] − offset[1, 1] gives the number of remote vertices on machine M2 which are local vertices on machine M1 (two in this example: v2 and v4).6
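A minimal sketch of how such an offset array can be consulted is given below. The flattened single-array layout and the ordering of the remote groups (by owning machine, skipping the machine itself) are assumptions consistent with the example above; the helper names are illustrative.

// Illustrative queries on the per-machine offset blocks described above.
// offsets is assumed flattened row-major, one block of (nmachines + 1) entries per machine,
// e.g. for three machines: {0,5,8,10,  0,5,7,9,  0,5,6,8}.
inline int offset_at(const int *offsets, int nmachines, int machine, int slot) {
  return offsets[machine * (nmachines + 1) + slot];
}

// Number of vertices on machine m that are remote copies of vertices local to machine r.
// Slot 0 starts the local vertices; slots 1..nmachines-1 start the remote groups.
inline int remote_count(const int *offsets, int nmachines, int m, int r) {
  int slot = (r < m) ? (r + 1) : r;                 // map owning machine r to its slot on m
  return offset_at(offsets, nmachines, m, slot + 1) -
         offset_at(offsets, nmachines, m, slot);
}
// Example from Table 7.11: remote_count(offsets, 3, 1, 0) = 7 - 5 = 2 (v2 and v4 on M2).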
In the edge-list format, each subgraph stores the outgoing edges with their LID
as the source vertex. A vertex becomes remote when the destination vertex of an
edge in the subgraph is not local to that subgraph. The roles of the source vertex and the destination vertex are interchanged when the graph is stored in reverse edge-list format. Each edge is stored in exactly one subgraph, as Falcon uses edge-cut partitioning. After every parallel computation phase, each subgraph gathers the updated properties of its local vertices from the subgraphs in which those vertices appear as remote vertices. Distributed computation requires vertex properties to be updated only using
commutative and associative operations. A programmer views a graph as a single
object. Graph partitioning and distributed code generation are taken care of by the
Falcon compiler.
Code generated by the Falcon compiler for distributed systems contains
calls to MPI library functions such as MPI_Isend(), MPI_Recv(), MPI_Init(), MPI_Comm_rank(), etc. Important MPI library functions often present in the
generated code are listed in Table 7.12. The generated code is compiled with a
native machine compiler and the associated libraries (gcc, nvcc, etc.). Distributed execution with N processes assigns to each process a unique integer value called its rank, where 0 ≤ rank < N. Processes are identified by their ranks, and communication between processes happens by specifying the ranks of the sender and the receiver.
Locking of elements in distributed systems is not supported by MPI. The
Falcon compiler implements it in software using MPI library functions. The
single construct uses distributed locking and it is transparent to the programmer.
The number of processes with which the same binary is executed on multiple
machines is specified as a command line argument during execution.

6 The offset array has been implemented as a single-dimensional array in Falcon for efficiency reasons.

Table 7.12 MPI library functions used by the Falcon code generator
Function          Operation
MPI_Init          Initializes the MPI execution environment
MPI_Comm_rank     Used to find the rank of a process
MPI_Comm_size     Finds the number of processes involved in program execution
MPI_Isend         Sends data from one process to another process
MPI_Recv          Receives data from a process
MPI_Get_count     Returns the number of data elements received by the most recent MPI_Recv() operation
MPI_Barrier       Barrier for all the processes
MPI_Finalize      All the processes call this function before program termination

Algorithm 7.11 shows the code for initializing a distributed system with a GPU. The total number of
processes and the rank of the process are stored in the global variables FALCsize
and FALCrank respectively. The value of FALCrank acts as a unique identifier for each process. It must be noted that, unlike threads, each process in a distributed system has its own copy of every variable, at a different memory location. The values are obtained
by calls to the functions MPI_Comm_size() (see Line 6) and MPI_Comm_rank()
(see Line 7) respectively. Communication is required for synchronizing subgraph
properties. The variables FALCrecvbuff and FALCsendbuff (of type FALCbuffer)
are used to receive and send data. Lines 12–19 allocate the buffers when the target is a GPU device. The dist property in the SSSP computation is an example of a property
to be communicated among processes running on different machines.
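For reference, a minimal MPI initialization along the lines described above might look as follows. This is a sketch, not the generated Algorithm 7.11; the buffer type and size are placeholders.

// Sketch of distributed initialization with a GPU target (illustrative, not Algorithm 7.11).
#include <mpi.h>
#include <cuda_runtime.h>

int FALCsize, FALCrank;                        // number of processes and this process's rank
char *FALCsendbuff, *FALCrecvbuff;             // communication buffers (placeholder type)

void falc_init(int argc, char **argv, size_t bufbytes) {
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &FALCsize);    // total number of processes
  MPI_Comm_rank(MPI_COMM_WORLD, &FALCrank);    // unique identifier of this process
  // With a GPU target, the send/receive buffers are allocated in device memory.
  cudaMalloc((void **)&FALCsendbuff, bufbytes);
  cudaMalloc((void **)&FALCrecvbuff, bufbytes);
}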
Line 1 of Algorithm 7.4 declares a global variable changed. Algorithm 7.12 shows
the generated code for different types of distributed systems (based on the value of
the command line argument TARGET). For a target system with CPU and GPU
devices, the variable changed will be duplicated to two copies, one each on CPU
and GPU (Line 3, Algorithm 7.12).
Algorithm 7.13 shows how the Falcon code generator performs global variable
allocation. It checks global variables which are read (use) or written (def) inside
(possibly nested) foreach statements (parallel regions). Declaration statements
appropriate to the target system (one for each device of the target) are generated
for such global variables. Such global variables will be manipulated by several
processes simultaneously and will require synchronization.
Algorithm 7.15: CPU Cluster: Code for Synchronizing the Global Variable
Changed
1 #define MCW MPI_COMM_WORLD
2 #define MSI MPI_STATUS_IGNORE
3 MPI_Status status[8]; // 0 ≤ rank, NPARTS < 8
4 MPI_Request request[8];
5 int falctv4;
6 for (int i = 0; i <NPARTS; i++) {
7 if (i != rank) MPI_Isend(&changed, 1, MPI_INT, i, messageno, MCW, &request[i]);
8 }
9 for (int i = 0; i <NPARTS; i++) {
10 if (i != rank) MPI_Recv(&falctv4, 1, MPI_INT, i, messageno, MCW, MSI);
11 changed = changed + falctv4;
12 }
13 if (changed == 0) break;
The value of the variable changed sent by each of the other processes is collected in the temporary variable falctv4, using the synchronous MPI_Recv() function. The received value is used to update the variable changed (Line 11). If the value of the variable changed is zero, the break statement is executed (Line 13).7
The code in Algorithm 7.15 is for a CPU cluster with eight machines. The code
for a GPU cluster will have additional code for copying variables from and to the
GPU. The Falcon compiler uses the OpenMPI library with CUDA-aware MPI support for multi-GPU machines.
7 This statement is generated if it is a part of the source code as in the SSSP or other algorithms.
Falcon implements the distributed Union-Find Set data type on top of the Set implementation for single machines [26]. The distributed Set uses the first process (rank = 0) for collecting union() requests from all the other processes. The process
with rank zero performs the union() operation and sends the updated parent value to
all the other processes involved (see Algorithm 7.16). This seems to be adequate for
small clusters that are tightly coupled. A more complex union would be warranted
for large clusters.
The Collection data type supports duplicate elements. The add() function
of the Collection data type is overloaded and supports adding elements to
a Collection object with no duplicate elements. This avoids sending the
same data of remote nodes to the corresponding master nodes multiple times.
We now explain the code generation for single and foreach statements.
The distributed single statement over the vertices is implemented in two phases, using a per-vertex lock value initialized to MAX_INT. In the first phase, processes with rank greater than zero send all the lock values which have been modified from the initial value of MAX_INT to the process with rank zero (P0). The process P0 collects the lock values from the remote nodes and sets the lock value of each point to the smallest rank value received. The process P0 then sends the modified lock
value back to each remote node. In the second phase, the remote nodes receive the
final lock value for each vertex from process P0 , and update the lock value of each
of their vertices. The node which obtains the lock for vertex i executes the body of
the single statement for vertex i.
The pseudo code for the distributed single statement is shown in Algorithm 7.19.
Function fun() is duplicated into two versions, fun1() and fun2(). The function
fun1() has {Program segment 1} and ends with a CAS operation replacing the single
statement that acquires local locks. Function fun2() begins with a CAS operation
replacing the single statement that operates on the acquired lock (if any). Function
fun2() also has stmt_block{} of the single statement and ends with Program
segment 2. The calling portion of function fun() contains the code for acquiring the
global lock for each vertex. Such an approach is used in the Boruvka's MST implementation, wherein different threads race to operate on a component.
Algorithm 7.20: CPU Cluster: Prologue and Epilogue Code for relaxgraph
Call
1 // prefix-code
2 #pragma omp parallel for num_threads(FALC_THREADS)
3 for (int i = graph.nlocalpoints; i <graph.nremotepoints; i++) {
4 tempdist[i] = ((struct struct_graph *)(graph.extra)) ->dist[i];
5 }
6 #pragma omp parallel for num_threads(FALC_THREADS)
7 for (int i = 0; i <graph.nlocaledges; i++) {
8 relaxgraph(i, graph);
9 }
10 // suffix-code
11 for (int kk = 1; kk <FALCsize; kk++) {
12 #pragma omp parallel for num_threads(FALC_THREADS)
13 for (int i = graph.offset[kk - 1]; i <graph.offset[kk]; i++) {
14 addto_sendbuff(i, graph, FALCsendsize, FALCsendbuff, kk - 1);
15 }
16 }
The relaxgraph() function of the SSSP program (see Algorithms 7.1 and 7.4)
updates the dist value of the destination vertex (t) of an edge e : p → t using a
MIN function and the vertex t could be a remote vertex for a subgraph Gi ⊂ G. The
Falcon compiler generates prologue and epilogue codes for a parallel foreach
statement, if the statement body updates a remote vertex. Such a code fragment
using OpenMP is shown in Algorithm 7.20.
The code first copies the current dist values of the remote vertices to a temporary buffer tempdist (Line 4). Remote vertices have indices in the range [Gi.localpoints, Gi.remotepoints] for any Gi ⊂ G. The relaxgraph() function is called after the copy operation (Line 8). The number of remote vertices belonging to the remote node (machine) with index kk is (offset[kk] − offset[kk − 1]). The addto_sendbuff() function (Line 14) checks the condition (tempdist[i] != dist[i]) and adds the remote vertices of the subgraph Gkk which satisfy this condition to FALCsendbuff[kk]. FALCsendbuff[kk] contains a set of tuples (dist, local_id) for the subgraph Gkk, where local_id is the local vertex id (LID) of the vertex in the subgraph where this remote vertex is a local vertex. These buffer values are then sent to the respective remote nodes. Node p receives (dist, local_id) pairs from all the remote nodes and updates its own dist values by taking the MIN of its own value and the received remote values. Algorithm 7.21 shows how this is achieved using MPI library calls.
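A minimal sketch of such a receive-and-update step is shown below. The tuple layout, message tag, and use of MPI_Probe are assumptions for illustration; this is not the book's Algorithm 7.21.

// Illustrative receive-and-MIN update of the dist property (not Algorithm 7.21).
#include <mpi.h>
#include <algorithm>
#include <vector>

struct DistUpdate { int dist; int local_id; };   // assumed (dist, local_id) tuple layout

void receive_dist_updates(int *dist, int nparts, int rank, int tag) {
  for (int p = 0; p < nparts; ++p) {
    if (p == rank) continue;
    MPI_Status st;
    MPI_Probe(p, tag, MPI_COMM_WORLD, &st);      // wait for the message from process p
    int nbytes;
    MPI_Get_count(&st, MPI_BYTE, &nbytes);       // size of the incoming message in bytes
    std::vector<DistUpdate> buf(nbytes / sizeof(DistUpdate));
    MPI_Recv(buf.data(), nbytes, MPI_BYTE, p, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    for (const DistUpdate &u : buf)              // MIN is commutative and associative
      dist[u.local_id] = std::min(dist[u.local_id], u.dist);
  }
}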
Chapter 8
Experiments, Evaluation and Future Directions
8.1 Introduction
The GPU has 4352 streaming processors (SPs) with 68 streaming multiprocessors
(SMs) and 64 SPs per SM. It also has 11 GB of volatile memory and SPs run at a
frequency of 1350 MHz.
The inputs used for evaluation consist of random graphs, RMAT graphs, and
road graphs. The inputs are listed in Tables 8.1, 8.2, and 8.3 (M denotes Million).
The random graphs were created using the graph generator tool available in the
Galois [66] framework. The RMAT graphs were created using the GTGraph tool
with default values for parameters a, b, c, d [154] which decide the graph sparsity
and edge distribution. The road network graphs are USA road networks available in
public domain [155]. The benchmarks include Breadth-First Search (BFS), Single
Source Shortest Paths (SSSP), Connected Components using Union-Find (CC), and
Boruvka’s Minimum Spanning Tree (MST) algorithms programmed in Falcon.
The baseline running times for SSSP and BFS are listed in Table 8.4 for GPU and
in Table 8.5 for CPU. The baseline BFS and SSSP algorithms use the atomic MIN
library function. Both the algorithms use only one vertex (dist) property to store
the BFS and the SSSP distances. The maximum BFS distance for the R-MAT graphs is ∞ as these graphs are not fully connected. The number inside parentheses (after ∞) for the R-MAT graphs is the number of iterations taken to reach the fixed point. The main kernel, which computes the BFS and the SSSP distances, runs over all the vertices in the graph object. The optimized version of the Falcon program for BFS uses a
level-based traversal with no atomic operations. The one for SSSP uses three vertex
properties dist, olddist and updated, and one additional kernel.
The number of iterations of the BFS kernel for R-MAT graphs was below ten.
The average number of vertices that participated in an iteration is very high due to
the low diameter of these graphs. This is sufficient to keep the 4352 GPU cores utilized.
The four road network graphs have high diameters (BFS distance) which range
from 2878 to 6261. This leads to less parallelism per iteration as the number of
iterations for the BFS and the SSSP computation is directly related to the BFS
distance. The running times for road network graphs are high on GPU and CPU
as given in Tables 8.4 and 8.5 respectively. Road network graphs perform poorly on the GPU, and Δ-stepping computation is more suitable for road network graphs on the CPU.
The CUDA runtime performs warp-voting on the SIMT Nvidia GPU, which helps in reducing the running time of level-based BFS. The level-based BFS computation has a conditional block with the condition (t.dist == level) in the kernel. If none of the threads in a warp satisfies the condition, all thirty-two threads of the warp finish execution immediately. The naïve BFS algorithm has no such condition and uses an atomic MIN operation, which leads to its high running time. The SSSP computation also benefits from warp-voting with the condition (t.updated == True). This condition makes sure that
only the vertices whose distance was reduced in the previous iteration take part in
the current iteration.
The speedups of the optimized algorithms over the naïve algorithms on random graphs and RMAT graphs are shown for GPU and CPU in Figs. 8.1 and 8.2 respectively. The speedup depends on the graph topology, the presence of atomic operations, the way threads are scheduled at runtime, etc. We observe that the GPU programs are much faster than the CPU versions. The multi-core CPU used for evaluation has only 8 computing cores, while the number of SPs in the GPU is 4352. The performance is likely to be better on multi-core CPUs with more cores.
OpenMP based code is not always good for a multi-core CPU. However, the Δ-stepping algorithm benefits considerably. Figure 8.3 shows the speedup of the Δ-stepping algorithm over naïve Falcon code for random and RMAT graphs with the BFS and SSSP benchmarks.
The road networks exhibit poor performance when implemented using OpenMP for the BFS and SSSP computations. In contrast, the Δ-stepping algorithm runs extremely fast for road networks [26] on the CPU. Figure 8.4a shows the speedup of optimized Falcon OpenMP code over naïve Falcon code. The Δ-stepping SSSP and BFS are much faster than the respective naïve programs, as shown in Fig. 8.4b. Figure 8.5 shows the speedup of optimized Falcon BFS and SSSP over naïve BFS and SSSP respectively.
Fig. 8.1 GPU: speedup of BFS and SSSP (optimized Falcon code) over naïve Falcon code. (a) Random graphs. (b) RMAT graphs.
Fig. 8.2 CPU: speedup of BFS and SSSP (optimized Falcon code) over naïve Falcon code (random and RMAT graphs).
Fig. 8.3 CPU: speedup of Δ-stepping BFS and SSSP Falcon code over naïve Falcon code. (a) Random graphs. (b) RMAT graphs.
Fig. 8.4 CPU: speedup of BFS and SSSP (optimized Falcon code) over naïve Falcon code on USA road network inputs. (a) Optimized OpenMP code. (b) Δ-stepping code.
Fig. 8.5 GPU: (road input) speedup of Falcon-optimized over Falcon-naïve. (a) SSSP. (b) BFS.
The running times of the CC and the MST algorithms for CPU (8-core) are shown in
Table 8.6 and for GPU are shown in Table 8.7. The GPU exploits parallelism well.
The road network graphs are not affected by the diameter in these benchmarks, as
path compression in the Union-Find algorithm happens quickly. These algorithms
are not traversal algorithms like SSSP and BFS.
8.2.4 Summary
• The CPU cluster used for the experiments had sixteen machines. Each machine
of the cluster consisted of two CPU sockets with 12 Intel Haswell cores operating
at 2.5 GHz and 128 GB volatile memory.
• The GPU cluster had eight nodes, each with an Nvidia-Tesla K40 GPU card with
2880 cores and 12 GB volatile memory. Each node had a 12 core multi-core CPU
as its host.
The benchmarks were run in a distributed fashion with the graph inputs partitioned
across the machines or devices. Larger graph inputs were used. The inputs
used for evaluation are shown in Table 8.8. They consist of RMAT, web, and social
network graphs. Falcon programs follow the BSP execution model.
8.3.1 Multi-GPU
Eight GPUs were organized into two clusters with four GPUs in each cluster. The
GPUs in each cluster can communicate directly (without involving the host). This
feature is enabled in CUDA using the peer-access API functions.
Table 8.9 shows the running times of different benchmarks (SSSP, BFS, CC) on the multi-GPU machine with an increasing number of GPUs. The values of |E|/|V| are 23, 35, and 35 for the UK-2005, UK-2007 and Twitter inputs respectively. This is much higher than the value of ten used in the RMAT inputs. The values shown in the table are the best among edge-based and vertex-based computations. The Twitter input has a high variance in vertex degree, and it was observed that vertex-based computation is up to 7 times slower than edge-based computation. Variance in the degree distribution results in thread divergence on GPUs. The Falcon compiler used the OpenMPI library with CUDA-aware MPI to reduce the communication overhead between GPUs on the same machine.
The computation time is higher on CPUs than on the massively parallel GPUs, and the total time spent on communication in CPU clusters is much smaller than the computation time [27].
The CPU+GPU cluster uses both the CPU and the GPU of each machine for computation. The BSP model imposes a barrier after computation and communication. The computation time is higher on the CPUs, while the communication time is lower for CPU-CPU communication than for GPU-GPU communication. This imbalance between communication and computation makes the CPU+GPU cluster slower than all the other distributed machine combinations.
Table 8.10 Running time (in seconds) of RMAT graphs on eight GPUs or four CPUs and GPUs
Input      Algorithm   Multi-GPU   GPU cluster   GPU+CPU cluster
rmat100M   BFS         1.2         3.7           8.4
rmat100M   SSSP        5.0         9.3           31.6
rmat100M   CC          1.1         2.7           81.3
rmat200M   BFS         1.8         6.9           16.7
rmat200M   SSSP        10.9        18.1          66.9
rmat200M   CC          2.4         5.7           17.4
8.3.5 Results
Table 8.10 compares the running times of multi-GPU, GPU cluster and CPU+GPU
cluster. The multi-GPU machine with peer-access capability is always faster. The
running time on the GPU cluster increases considerably due to the communication
overhead. The GPU+CPU cluster takes the longest due to the imbalance between the computation and communication times of the CPU and the GPU. Detailed results can be found in [27].
8.4 Future Directions in Distributed Graph Analytics
Graph analytics has witnessed rapid growth in the recent past. This is a clear indicator of its importance, and is due to more and more applications being modeled using graphs. Entities and their relationships can be naturally modeled as graph structures. Well-known algorithms can then be readily applied to these problems. Scalability has been, currently is, and will continue to be at the forefront of graph analytics research. Several aspects of graph processing are either still unexplored or are in their infancy. We envision that the following areas will witness acceleration in the days to come.
Future systems will need to generate code for an individual device, a subset of the devices, or all the network devices for the same graph application. Depending upon the device characteristics, their performance on graph processing, and hardware availability, users may want to mix-and-match devices against a performance criterion, which could be execution time, energy efficiency, performance per watt per dollar, etc. Domain-agnostic and domain-specific languages need to evolve APIs and constructs to help users achieve such a complex goal. Compilers also need to be able to generate a variety of backend codes from the same algorithmic specification [158].
While graphs can be categorized based on their structural characteristics, very large
graphs exhibit heterogeneity also within themselves. For instance, some vertices
have low clustering coefficient, while some have it very high, and many hover
around average. Most of the graph processing systems treat different parts of the
graph in a uniform manner. However, different parts may need different types of processing, so that a graph operator can be executed efficiently based on the structural patterns in those parts [159]. Processing that adapts to the subgraph structure would therefore be very valuable, and it can be separately optimized, either for an algorithmic pattern or for a specific backend device. In a distributed setup, such processing relies on adaptive graph partitioning, which can divide a graph across network nodes in a non-uniform manner and still achieve an overall performance improvement, due to processing optimized for different subgraph patterns or for a device. Again, this performance could be in terms of execution time, energy efficiency, or other systemic criteria.
While there have been works supporting dynamic and streaming graph updates [162, 163], focused research on the topic would uncover common patterns across algorithms, and a classification of graph algorithms based on these patterns. For instance, incremental processing is computationally easier than decremental processing in the case of shortest paths. In contrast, decremental processing is easier in the case of vertex coloring. It is also unclear how to optimize these dynamic updates for different kinds
of graphs, or for different subgraphs wherein they are applied. Choosing a backend
device for certain kinds of dynamic updates would be an altogether new area of
research.
Several machine learning algorithms have been cast as graph algorithms, and systems specialized for such processing have been developed. However, long-running graph algorithms which churn a lot of data provide a great opportunity to learn about the graph processing itself! Thus, one can learn about patterns of traversals, updates of attributes, or even which devices are bottlenecks for which operations [52, 166, 167]. Such learning can be very helpful in specializing the processing for a certain kind of graph, a subgraph, or a certain device characteristic. A heterogeneous system handling a variety of graphs and running different kinds of algorithms would be an ideal testbed for scheduling graph operations. We expect such learning on unstructured data to become mainstream as data continue to grow.
References
15. A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, S. Muthukrishnan, One trillion edges: graph
processing at facebook-scale, in Proceedings of the VLDB Endowment (2015), pp. 1804–1815
16. Y. Low, D. Bickson, G.J. Gonzalez, C. Guestrin, A. Kyrola, J.M. Hellerstein, Graphlab: a
new parallel framework for machine learning, in Conference on Uncertainty in Artificial
Intelligence (UAI), UAI’10 (AUAI Press, Arlington, 2010), pp. 340–349. http://dl.acm.org/
citation.cfm?id=3023549.3023589
17. Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, J.M. Hellerstein, Distributed
GraphLab: a framework for machine learning and data mining in the cloud, in Proceedings of
the VLDB Endowment (2012), pp. 716–727
18. M. Burtscher, R. Nasre, K. Pingali, A quantitative study of irregular programs on GPUs, in
IEEE International Symposium on Workload Characterization (IISWC) (IEEE, Piscataway,
2012), pp. 141–151
19. S. Pai, K. Pingali, A compiler for throughput optimization of graph algorithms on GPUs,
in Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented
Programming, Systems, Languages, and Applications, OOPSLA (ACM, New York, 2016),
pp. 1–19
20. A. Gharaibeh, L. Beltrão Costa, E. Santos-Neto, M. Ripeanu, A yoke of oxen and a thousand
chickens for heavy lifting graph processing, in Proceedings of the 21st International Con-
ference on Parallel Architectures and Compilation Techniques, PACT ’12 (ACM, New York,
2012), pp. 345–354
21. R. Dathathri, G. Gill, L. Hoang, H.-V. Dang, A. Brooks, N. Dryden, M. Snir, K. Pingali,
Gluon: a communication-optimizing substrate for distributed heterogeneous graph analytics,
in Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design
and Implementation (PLDI) (ACM, New York, 2018), pp. 752–768
22. S. Hong, H. Chafi, E. Sedlar, K. Olukotun, Green-Marl: a DSL for easy and efficient graph
analysis, in Proceedings of the Seventeenth International Conference on Architectural Sup-
port for Programming Languages and Operating Systems, ASPLOS XVII (ACM, New York,
2012), pp. 349–362
23. D. Prountzos, R. Manevich, K. Pingali, Elixir: a system for synthesizing concurrent graph
programs, in Proceedings of the ACM International Conference on Object Oriented Pro-
gramming Systems Languages and Applications, OOPSLA ’12 (ACM, New York, 2012), pp.
375–394. https://doi.org/10.1145/2384616.2384644
24. S. Hong, S. Salihoglu, J. Widom, K. Olukotun, Simplifying scalable graph processing with a
domain-specific language, in Proceedings of Annual IEEE/ACM International Symposium on
Code Generation and Optimization, CGO ’14 (ACM, New York, 2014), pp. 208–218
25. G. Shashidhar, R. Nasre, LightHouse: an automatic code generator for graph algorithms on
GPUs, in Languages and Compilers for Parallel Computing, ed. by C. Ding, J. Criswell,
P. Wu (Springer, Cham 2017), pp. 235–249
26. U. Cheramangalath, R. Nasre, Y.N. Srikant, Falcon: a graph manipulation language for
heterogeneous systems. ACM Trans. Archit. Code Optim. 12(4), 54:1–54:27 (2015). https://
doi.org/10.1145/2842618
27. U. Cheramangalath, R. Nasre, Y.N. Srikant, DH-Falcon: a language for large-scale graph
processing on distributed heterogeneous systems, in 2017 IEEE International Conference on
Cluster Computing (CLUSTER) (IEEE, Piscataway, 2017), pp. 439–450
28. E. Moore, in The Shortest Path Through a Maze. Bell Telephone System. Technical
publications. Monograph (Bell Telephone System, 1959). https://books.google.co.in/books?
id=IVZBHAAACAAJ
29. R. Tarjan, Depth-first search and linear graph algorithms. SIAM J. Comput. 1(2), 146–160
(1972)
30. R.E. Bellman, On a routing problem. Q. Appl. Math. 16, 87–90 (1958)
31. R.W. Floyd, Algorithm 97: shortest path. Commun. ACM 5(6), 345–345 (1962). https://doi.
org/10.1145/367766.368168
32. T.H. Cormen, C. Stein, R.L. Rivest, C.E. Leiserson, Introduction to Algorithms, 2nd edn. (McGraw-Hill, New York, 2001)
33. R. Tarjan, Depth-first search and linear graph algorithms. SIAM J. Comput. 1(2), 146–160
(1972). https://doi.org/10.1137/0201010
34. D. Coppersmith, L. Fleischer, B. Hendrickson, A. Pinar, A divide-and-conquer algorithm for
identifying strongly connected components, Tech. Rep., Ernest Orlando Lawrence Berkeley
National Laboratory, Berkeley, 2003. https://escholarship.org/uc/item/1hx5n2df
35. R. Prim, Shortest connection networks and some generalizations. Bell System Technol. J. 36,
1389–1401 (1957)
36. J. Kruskal, On the shortest spanning tree of a graph and the traveling salesman problem. Proc.
Am. Math. Soc. 7, 48–50 (1956)
37. O. Borůvka, O jistém problému minimálním (About a certain minimal problem). Práce Mor. Přírodověd. Spol. 3, 37–58 (1926) (in Czech, German summary)
38. Wikipedia Contributors, Borůvka’s algorithm—wikipedia, the free encyclopedia (2019).
https://en.wikipedia.org/w/index.php?title=Bor%C5%AFvka%27s_algorithm&oldid=
903222379. Accessed 27 Oct 2019
39. J.J. Whang, A. Lenharth, I.S. Dhillon, K. Pingali, Scalable data-driven pagerank: algorithms,
system issues, and lessons learned, in European Conference on Parallel Processing (Springer,
Berlin, 2015), pp. 438–450
40. Wikipedia Contributors, Pagerank—Wikipedia, the free encyclopedia (2019). https://en.
wikipedia.org/w/index.php?title=PageRank&oldid=907975070. Accessed 11 Aug 2019
41. Wikipedia Contributors, Graph coloring—Wikipedia, the free encyclopedia (2019). https://en.
wikipedia.org/w/index.php?title=Graph_coloring&oldid=908422192. Accessed 11 Aug 2019
42. Wikipedia Contributors, Greedy coloring—Wikipedia, the free encyclopedia (2019). https://
en.wikipedia.org/w/index.php?title=Greedy_coloring&oldid=908906078. Accessed 11 Aug
2019
43. Wikipedia Contributors, Degeneracy (graph theory)—Wikipedia, the free encyclope-
dia (2019). https://en.wikipedia.org/w/index.php?title=Degeneracy_(graph_theory)&oldid=
908159787. Accessed 11 Aug 2019
44. Wikipedia Contributors, Betweenness centrality—Wikipedia, the free encyclopedia (2019).
https://en.wikipedia.org/w/index.php?title=Betweenness_centrality&oldid=905972092.
Accessed 11 Aug 2019
45. U. Brandes, A faster algorithm for betweenness centrality. J. Math. Soc. 25(2), 163–177
(2001)
46. D. Chakrabarti, C. Faloutsos, Graph mining: Laws, generators, and algorithms. ACM
Comput. Surv. 38(1) (2006). https://doi.org/10.1145/1132952.1132954
47. S. Parthasarathy, S. Tatikonda, D. Ucar, A survey of graph mining techniques for biological
datasets, in Managing and Mining Graph Data. Advances in Database Systems, ed. by
C. Aggarwal, H. Wang (Springer, Boston, 2010)
48. C. Vicknair, M. Macias, Z. Zhao, X. Nan, Y. Chen, D. Wilkins, A comparison of a graph
database and a relational database: a data provenance perspective, in Proceedings of the 48th
Annual Southeast Regional Conference, ACM SE ’10 (ACM, New York, 2010), pp. 42:1–42:6.
https://doi.org/10.1145/1900008.1900067
49. M. Needham, A.E. Hodler, Graph Algorithms: Practical Examples in Apache Spark and
Neo4j (O’Reilly Media, Sebastopol, 2019)
50. F. Rousseau, E. Kiagias, M. Vazirgiannis, Text categorization as a graph classification
problem, in Proceedings of the 53rd Annual Meeting of the Association for Computational
Linguistics (2015), pp. 1702–1712
51. F. Rousseau, Graph-of-words: mining and retrieving text with networks of features, Ph.D
Thesis, Ecole Polytechnique Laboratoire d’Informatique de l’X (LIX), 2015
52. R. Mihalcea, Graph-based ranking algorithms for sentence extraction, applied to text summa-
rization, in Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions,
ACLdemo ’04 (Association for Computational Linguistics, Stroudsburg, 2004). http://dx.doi.
org/10.3115/1219044.1219064
71. G. Malewicz, M.H. Austern, A.J. Bik, J.C. Dehnert, I. Horn, N. Leiser, G. Czajkowski,
Pregel: a system for large-scale graph processing, in Proceedings of the 2010 ACM SIGMOD
International Conference on Management of Data, SIGMOD ’10 (ACM, New York, 2010),
pp. 135–146. https://doi.org/10.1145/1807167.1807184
72. Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, P. Kalnis, Mizan: a System
for dynamic load balancing in large-scale graph processing, in Proceedings of the 8th ACM
European Conference on Computer Systems, EuroSys ’13 (ACM, New York, 2013), pp. 169–
182. https://doi.org/10.1145/2465351.2465369
73. Y. Jie, T. Guangming, M. Zeyao, S. Ninghui, Graphine: programming graph-parallel compu-
tation of large natural graphs for multicore clusters. IEEE Trans. Parallel Distrib. Syst. 27(6),
1647–1659 (2016)
74. S. Hong, H. Chafi, E. Sedlar, K. Olukotun, Green-Marl: a DSL for easy and efficient graph
analysis, in Proceedings of the Seventeenth International Conference on Architectural Sup-
port for Programming Languages and Operating Systems, ASPLOS XVII (ACM, New York,
2012), pp. 349–362. https://doi.org/10.1145/2150976.2151013
75. D. Prountzos, R. Manevich, K. Pingali, Elixir: a System for synthesizing concurrent graph
programs, in Proceedings of the ACM International Conference on Object Oriented Pro-
gramming Systems Languages and Applications, OOPSLA ’12 (ACM, New York, 2012), pp.
375–394. https://doi.org/10.1145/2384616.2384644
76. Y. Zhang, M. Yang, R. Baghdadi, S. Kamil, J. Shun, S. Amarasinghe, GraphIt: a high-
performance graph DSL. Proc. ACM Program. Lang. 2, 121:1–121:30 (2018). https://doi.
org/10.1145/3276491
77. X. Zhu, W. Chen, W. Zheng, X. Ma, Gemini: a computation-centric distributed graph
processing system, in Proceedings of the 12th USENIX Conference on Operating Systems
Design and Implementation, OSDI’16 (USENIX Association, Berkeley, 2016), pp. 301–316.
http://dl.acm.org/citation.cfm?id=3026877.3026901
78. B. Shao, H. Wang, Y. Li, Trinity: a distributed graph engine on a memory cloud, in
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data,
SIGMOD ’13 (ACM, New York, 2013), pp. 505–516. https://doi.org/10.1145/2463676.
2467799
79. R. Chen, Y. Yao, P. Wang, K. Zhang, Z. Wang, H. Guan, B. Zang, H. Chen, Replication-based
fault-tolerance for large-scale graph processing. IEEE Trans. Parallel Distrib. Syst. 29(7),
1621–1635 (2018).
80. R. Dathathri, G. Gill, L. Hoang, K. Pingali, Phoenix: a substrate for resilient distributed graph
analytics, in Proceedings of the Twenty-Fourth International Conference on Architectural
Support for Programming Languages and Operating Systems, ASPLOS ’19 (ACM, New York,
2019), pp. 615–630. https://doi.org/10.1145/3297858.3304056
81. R. Dathathri, G. Gill, L. Hoang, H.-V. Dang, A. Brooks, N. Dryden, M. Snir, K. Pingali,
Gluon: a communication-optimizing substrate for distributed heterogeneous graph analytics.
SIGPLAN Not. 53(4), 752–768 (2018). https://doi.org/10.1145/3296979.3192404
82. A. Gharaibeh, L. Beltrão Costa, E. Santos-Neto, M. Ripeanu, A yoke of oxen and a thousand
chickens for heavy lifting graph processing, in Proceedings of the 21st International Con-
ference on Parallel Architectures and Compilation Techniques, PACT ’12 (ACM, New York,
2012), pp. 345–354. https://doi.org/10.1145/2370816.2370866
83. Y. Bu, B. Howe, M. Balazinska, M.D. Ernst, HaLoop: efficient iterative data processing
on large clusters. Proc. VLDB Endow. 3(1–2), 285–296 (2010). http://dx.doi.org/10.14778/
1920841.1920881
84. P. Harish, P.J. Narayanan, Accelerating large graph algorithms on the GPU Using CUDA, in
Proceedings of the 14th International Conference on High Performance Computing, HiPC’07
(Springer, Berlin, 2007), pp. 197–208. http://dl.acm.org/citation.cfm?id=1782174.1782200
85. P. Harish, V. Vineet, P.J. Narayanan, Large graph algorithms for massively multithreaded
architectures, Tech. Rep., 2009
86. M. Burtscher, K. Pingali, An efficient CUDA implementation of the tree-based Barnes hut
N-body algorithm, in GPU Gems, ed. by W.-M.W. Hwu (Elsevier, Amsterdam, 2011)
87. A.E. Sariyüce, K. Kaya, E. Saule, U.V. Çatalyürek, Betweenness centrality on GPUs and
heterogeneous architectures, in Proceedings of the 6th Workshop on General Purpose
Processor Using Graphics Processing Units, GPGPU-6 (ACM, New York, 2013), pp. 76–
85. https://doi.org/10.1145/2458523.2458531
88. M. Méndez-Lojo, A. Mathew, K. Pingali, Parallel inclusion-based points-to analysis, in
Proceedings of the ACM International Conference on Object Oriented Programming Systems
Languages and Applications, OOPSLA ’10 (ACM, New York, 2010), pp. 428–443. https://
doi.org/10.1145/1869459.1869495
89. T. Prabhu, S. Ramalingam, M. Might, M. Hall, EigenCFA: accelerating flow analysis with
GPUs, in Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles
of Programming Languages, POPL ’11 (ACM, New York, 2011), pp. 511–522. https://doi.
org/10.1145/1926385.1926445
90. M. Kulkarni, M. Burtscher, C. Cascaval, K. Pingali, Lonestar: a suite of parallel irregular
programs, in Proceedings of the IEEE International Symposium on Performance Analysis of
Systems and Software (IEEE, Piscataway, 2009), pp. 65–76
91. Z. Jia, Y. Kwon, G. Shipman, P. McCormick, M. Erez, A. Aiken, A distributed multi-GPU
system for fast graph processing. Proc. VLDB Endow. 11(3), 297–310 (2017). https://doi.org/
10.14778/3157794.3157799
92. Harshvardhan, A. Fidel, N.M. Amato, L. Rauchwerger, The STAPL parallel graph library, in
Languages and Compilers for Parallel Computing, ed. by H. Kasahara, K. Kimura (Springer,
Berlin, 2013), pp. 46–60
93. Z. Shang, J.X. Yu, Z. Zhang, TuFast: a lightweight parallelization library for graph analytics,
in Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE)
(IEEE, Piscataway, 2019), pp. 710–721
94. The Boost Graph Library, User Guide and Reference Manual (Addison-Wesley Longman
Publishing, Boston, 2002)
95. S. Salihoglu, J. Widom, GPS: A graph processing system, in Proceedings of the 25th
International Conference on Scientific and Statistical Database Management, SSDBM (ACM,
New York, 2013), pp. 22:1–22:12. https://doi.org/10.1145/2484838.2484843
96. Y. Bu, V. Borkar, J. Jia, M.J. Carey, T. Condie, Pregelix: Big(Ger) graph analytics on
a dataflow engine. Proc. VLDB Endow. 8(2), 161–172 (2014). http://dx.doi.org/10.14778/
2735471.2735477
97. J. Shun, G.E. Blelloch, Ligra: a lightweight graph processing framework for shared memory,
in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, PPoPP ’13 (ACM, New York, 2013), pp. 135–146. https://doi.org/10.1145/
2442516.2442530
98. J. Shun, L. Dhulipala, G.E. Blelloch, Smaller and faster: parallel processing of compressed
graphs with Ligra+, in 2015 Data Compression Conference (2015), pp. 403–412
99. M. Han, K. Daudjee, Giraph unchained: barrierless asynchronous parallel execution in Pregel-
like graph processing systems. Proc. VLDB Endow. 8(9), 950–961 (2015). https://doi.org/10.
14778/2777598.2777604
100. K. Siddique, Z. Akhtar, Y. Kim, Y.-S. Jeong, E.J. Yoon, Investigating Apache Hama: a
bulk synchronous parallel computing framework. J. Supercomput. 73(9), 4190–4205 (2017).
https://doi.org/10.1007/s11227-017-1987-9
101. K. Lee, L. Liu, K. Schwan, C. Pu, Q. Zhang, Y. Zhou, E. Yigitoglu, P. Yuan, Scaling iterative
graph computations with GraphMap, in Proceedings of the International Conference for High
Performance Computing, Networking, Storage and Analysis, SC ’15 (2015), pp. 1–12
102. Y. Cheng, F. Wang, H. Jiang, Y. Hua, D. Feng, Z. Wang, LCC-graph: a high-performance
graph-processing framework with low communication costs, in 2016 IEEE/ACM 24th
International Symposium on Quality of Service (IWQoS) (2016), pp. 1–10
103. T. White, Hadoop: The Definitive Guide, 1st edn. (O’Reilly Media, Sebastopol, 2009)
104. J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, G. Fox, Twister: a runtime
for iterative mapreduce, in Proceedings of the 19th ACM International Symposium on High
Performance Distributed Computing, HPDC ’10 (ACM, New York, 2010), pp. 810–818.
https://doi.org/10.1145/1851476.1851593
121. A. Davidson, S. Baxter, M. Garland, J.D. Owens, Work-efficient parallel GPU methods for
single source shortest paths, in Proceedings of the 2014 IEEE 28th International Symposium
on Parallel and Distributed Processing IPDPS 2014 (IEEE, Piscataway, 2014)
122. R. Nasre, M. Burtscher, K. Pingali, Morph algorithms on GPUs, in Proceedings of the 18th
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’13
(ACM, New York, 2013), pp. 147–156. https://doi.org/10.1145/2442516.2442531
123. Z. Fu, M. Personick, B. Thompson, MapGraph: a high level API for fast development of
high performance graph analytics on GPUs, in Proceedings of Workshop on GRAph Data
Management Experiences and Systems, GRADES’14 (ACM, New York, 2014), pp. 2:1–2:6.
https://doi.org/10.1145/2621934.2621936
124. P. Zhao, X. Luo, J. Xiao, X. Shi, H. Jin, Puffin: graph processing system on multi-GPUs,
in 2017 IEEE 10th Conference on Service-Oriented Computing and Applications (SOCA)
(IEEE, Piscataway, 2017), pp. 50–57
125. A.H. Nodehi Sabet, J. Qiu, Z. Zhao, Tigr: transforming irregular graphs for GPU-friendly
graph processing, in Proceedings of the Twenty-Third International Conference on Archi-
tectural Support for Programming Languages and Operating Systems, ASPLOS ’18 (ACM,
New York, 2018), pp. 622–636. https://doi.org/10.1145/3173162.3173180
126. P. Zhang, M. Zalewski, A. Lumsdaine, S. Misurda, S. McMillan, GBTL-CUDA: graph
algorithms and primitives for GPUs, in 2016 IEEE International Parallel and Distributed
Processing Symposium Workshops (IPDPSW) (2016), pp. 912–920
127. D. Sengupta, S.L. Song, K. Agarwal, K. Schwan, GraphReduce: processing large-scale
graphs on accelerator-based systems, in Proceedings of the International Conference for High
Performance Computing, Networking, Storage and Analysis, SC ’15 (2015), pp. 1–12
128. W. Zhong, J. Sun, H. Chen, J. Xiao, Z. Chen, C. Cheng, X. Shi, Optimizing graph processing
on GPUs. IEEE Trans Parallel Distrib. Syst. 28(4), 1149–1162 (2017)
129. J. Zhong, B. He, Medusa: Simplified Graph Processing on GPUs. IEEE Trans. Parallel Distrib.
Syst. 25(6), 1543–1552 (2014). https://doi.org/10.1109/TPDS.2013.111
130. A. Aiken, M. Bauer, S. Treichler, Realm: an event-based low-level runtime for distributed
memory architectures, in 2014 23rd International Conference on Parallel Architecture and
Compilation Techniques (PACT) (2014), pp. 263–275
131. G. Gill, R. Dathathri, L. Hoang, K. Pingali, A study of partitioning policies for graph analytics
on large-scale distributed platforms. Proc. VLDB Endow. 12(4), 321–334 (2018). https://doi.
org/10.14778/3297753.3297754
132. Y. Simmhan, A. Kumbhare, C. Wickramaarachchi, S. Nagarkar, S. Ravi, C. Raghavendra,
V. Prasanna, Goffish: a sub-graph centric framework for large-scale graph analytics, in Euro-
Par 2014 Parallel Processing, ed. by F. Silva, I. Dutra, V. Santos Costa (Springer, Cham,
2014), pp. 451–462
133. J.E. Gonzalez, R.S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, I. Stoica, GraphX: graph
processing in a distributed dataflow framework, in Proceedings of the 11th USENIX Con-
ference on Operating Systems Design and Implementation, OSDI’14 (USENIX Association,
Berkeley, 2014), pp. 599–613. http://dl.acm.org/citation.cfm?id=2685048.2685096
134. M. Pundir, L. M. Leslie, I. Gupta, R.H. Campbell, Zorro: zero-cost reactive failure recovery
in distributed graph processing, in Proceedings of the Sixth ACM Symposium on Cloud
Computing, SoCC ’15 (ACM, New York, 2015), pp. 195–208. https://doi.org/10.1145/
2806777.2806934
135. S. Hong, S. Depner, T. Manhardt, J. Van Der Lugt, M. Verstraaten, H. Chafi, PGX.D: a
fast distributed graph processing engine, in Proceedings of the International Conference for
High Performance Computing, Networking, Storage and Analysis, SC ’15 (IEEE, Piscataway,
2015), pp. 1–12
136. D. Yan, J. Cheng, Y. Lu, W. Ng, Blogel: a block-centric framework for distributed computa-
tion on real-world graphs. Proc. VLDB Endow. 7(14), 1981–1992 (2014). https://doi.org/10.
14778/2733085.2733103
137. S.M. Faisal, S. Parthasarathy, P. Sadayappan, Global graphs: a middleware for large scale
graph processing, in 2014 IEEE International Conference on Big Data (Big Data) (IEEE,
Piscataway, 2014), pp. 33–40
138. Y. Zhao, K. Yoshigoe, M. Xie, S. Zhou, R. Seker, J. Bian, LightGraph: lighten communication
in distributed graph-parallel processing, in 2014 IEEE International Congress on Big Data
(IEEE, Piscataway, 2014), pp. 717–724
139. R. Dathathri, G. Gill, L. Hoang, H.-V. Dang, A. Brooks, N. Dryden, M. Snir, K. Pingali,
Gluon: a communication-optimizing substrate for distributed heterogeneous graph analytics,
in Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design
and Implementation, PLDI 2018 (ACM, New York, 2018), pp. 752–768. https://doi.org/10.
1145/3192366.3192404
140. R. Dathathri, G. Gill, L. Hoang, K. Pingali, Phoenix: a substrate for resilient distributed graph
analytics, in Proceedings of the Twenty-Fourth International Conference on Architectural
Support for Programming Languages and Operating Systems (ACM, New York, 2019), pp.
615–630. https://doi.org/10.1145/3297858.3304056
141. R. Dathathri, G. Gill, L. Hoang, V. Jatala, K. Pingali, V.K. Nandivada, H. Dang, M. Snir,
Gluon-Async: a bulk-asynchronous system for distributed and heterogeneous graph analytics,
in 2019 28th International Conference on Parallel Architectures and Compilation Techniques
(PACT) (2019), pp. 15–28
142. M. Flynn, Some computer organizations and their effectiveness. IEEE Trans. Comput. 100(9),
948–960 (1972)
143. Nvidia GK-110B Specifications (2013). https://www.nvidia.com/content/dam/en-zz/
Solutions/Data-Center/tesla-product-literature/TeslaK80-datasheet.pdf
144. Nvidia GTX-870M Specifications (2014). https://www.geforce.com/hardware/notebook-
gpus/geforce-gtx-870m/specifications
145. Nvidia GTX-1080 Specifications (2017). https://images-eu.ssl-images-amazon.com/images/
I/91las2p%2BDnS.pdf
146. Nvidia GK-110B Specifications (2018). https://www.nvidia.com/content/dam/en-zz/
Solutions/design-visualization/productspage/quadro/quadro-desktop/quadro-volta-gv100-
data-sheet-us-nvidia-704619-r3-web.pdf
147. A. Coates, B. Huval, T. Wang, D.J. Wu, A.Y. Ng, B. Catanzaro, Deep learning with
COTS HPC systems, in Proceedings of the 30th International Conference on International
Conference on Machine Learning, ICML’13, vol. 28 (2013), pp. III-1337–III-1345. JMLR.
org, http://dl.acm.org/citation.cfm?id=3042817.3043086
148. A. Eklund, P. Dufort, D. Forsberg, S.M. LaConte, Medical image processing on the GPU—past, present and future. Med. Image Anal. 17(8), 1073–1094 (2013)
149. Y. Go, M. Jamshed, Y. Moon, C. Hwang, K. Park, APUNet: revitalizing GPU as packet
processing accelerator, in Proceedings of the 14th USENIX Conference on Networked Systems
Design and Implementation, NSDI’17 (USENIX Association, Berkeley, 2017), pp. 83–96.
http://dl.acm.org/citation.cfm?id=3154630.3154638
150. M. Thorup, Dynamic graph algorithms with applications, in Proceedings of the 7th Scandi-
navian Workshop on Algorithm Theory, SWAT ’00 (Springer, Berlin, 2000), pp. 1–9. http://dl.acm.org/citation.cfm?id=645900.672593
151. G. Ramalingam, T. Reps, On the computational complexity of dynamic graph prob-
lems. Theor. Comput. Sci. 158(1–2), 233–277 (1996). https://doi.org/10.1016/0304-3975(95)00079-8
152. L.S. Buriol, M.G.C. Resende, M. Thorup, Speeding up dynamic shortest-path algorithms.
INFORMS J. Comput. 20(2), 191–204 (2008). https://doi.org/10.1287/ijoc.1070.0231
153. S.-W. Cheng, T.K. Dey, J.R. Shewchuk, Delaunay Mesh Generation (CRC Press, Boca Raton,
2012)
154. D.A. Bader, K. Madduri, GTgraph: A Synthetic Graph Generator Suite (Atlanta, 2006)
155. Ninth DIMACS Implementation Challenge—Shortest Paths (2006). http://www.dis.uniroma1.it/challenge9/download.shtml
156. P. Boldi, M. Santini, S. Vigna, A large time-aware web graph. SIGIR Forum 42(2), 33–38
(2008). https://doi.org/10.1145/1480506.1480511
157. H. Kwak, C. Lee, H. Park, S. Moon, What is Twitter, a social network or a news media?,
in Proceedings of the 19th International Conference on World Wide Web, WWW ’10 (ACM,
New York, 2010), pp. 591–600. https://doi.org/10.1145/1772690.1772751
158. B. Gogoi, U. Cheramangalath, R. Nasre, Custom code generation for a graph DSL, in
Proceedings of the 13th Annual Workshop on General Purpose Processing Using Graphics
Processing Unit, GPGPU’20 (ACM, New York, 2020), pp. 51–60. https://doi.org/10.1145/3366428.3380772
159. R. Chen, J. Shi, Y. Chen, H. Chen, PowerLyra: differentiated graph computation and
partitioning on skewed graphs, in Proceedings of the Tenth European Conference on
Computer Systems, EuroSys ’15 (ACM, New York, 2015), pp. 1:1–1:15. https://doi.org/10.1145/2741948.2741970
160. A. Mukkara, N. Beckmann, M. Abeydeera, X. Ma, D. Sanchez, Exploiting locality in graph
analytics through hardware-accelerated traversal scheduling, in 2018 51st Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO) (2018), pp. 1–14
161. S. Khoram, J. Zhang, M. Strange, J. Li, Accelerating graph analytics by co-optimizing storage
and access on an FPGA-HMC platform, in Proceedings of the 2018 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, FPGA ’18 (Association for Computing
Machinery, New York, 2018), pp. 239–248. https://doi.org/10.1145/3174243.3174260
162. G. Malhotra, H. Chappidi, R. Nasre, Fast dynamic graph algorithms, in Languages and
Compilers for Parallel Computing, ed. by L. Rauchwerger (Springer, Cham, 2017), pp. 262–
277
163. U.A. Acar, D. Anderson, G.E. Blelloch, L. Dhulipala, Parallel batch-dynamic graph connec-
tivity, in The 31st ACM Symposium on Parallelism in Algorithms and Architectures, SPAA
’19 (Association for Computing Machinery, New York, 2019), pp. 381–392. https://doi.org/10.1145/3323165.3323196
164. Z. Li, B. Zhang, S. Ren, Y. Liu, Z. Qin, R.S.M. Goh, M. Gurusamy, Performance modelling
and cost effective execution for distributed graph processing on configurable VMs, in
Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid
Computing, CCGrid ’17 (IEEE Press, New York, 2017), pp. 74–83. https://doi.org/10.1109/CCGRID.2017.85
165. S.D. Pollard, S. Srinivasan, B. Norris, A performance and recommendation system for
parallel graph processing implementations: work-in-progress, in Companion of the 2019
ACM/SPEC International Conference on Performance Engineering, ICPE ’19 (Association
for Computing Machinery, New York, 2019), pp. 25–28. https://doi.org/10.1145/3302541.3313097
166. T.D. Bui, S. Ravi, V. Ramavajjala, Neural graph learning: training neural networks using
graphs, in Proceedings of the Eleventh ACM International Conference on Web Search and
Data Mining, WSDM ’18 (Association for Computing Machinery, New York, 2018), pp. 64–
71. https://doi.org/10.1145/3159652.3159731
167. R. Al-Rfou, B. Perozzi, D. Zelle, DDGK: learning graph representations for deep divergence
graph kernels, in The World Wide Web Conference, WWW ’19 (Association for Computing
Machinery, New York, 2019), pp. 37–48. https://doi.org/10.1145/3308558.3313668
Index