Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable
There has been significant recent interest in parallel graph processing due to the need to quickly analyze
the large graphs available today. Many graph codes have been designed for distributed memory or external
memory. However, today even the largest publicly-available real-world graph (the Hyperlink Web graph with
over 3.5 billion vertices and 128 billion edges) can fit in the memory of a single commodity multicore server.
Nevertheless, most experimental work in the literature reports results on much smaller graphs, and the ones
for the Hyperlink graph use distributed or external memory. Therefore, it is natural to ask whether we can
efficiently solve a broad class of graph problems on this graph in memory.
This paper shows that theoretically-efficient parallel graph algorithms can scale to the largest publicly-
available graphs using a single machine with a terabyte of RAM, processing them in minutes. We give im-
plementations of theoretically-efficient parallel algorithms for 20 important graph problems. We also present
the interfaces, optimizations, and graph processing techniques that we used in our implementations, which
were crucial in enabling us to process these large graphs quickly. We show that the running times of our
implementations outperform existing state-of-the-art implementations on the largest real-world graphs. For
many of the problems that we consider, this is the first time they have been solved on graphs at this scale. We
have made the implementations developed in this work publicly-available as the Graph Based Benchmark
Suite (GBBS).
CCS Concepts: • Computing methodologies → Shared memory algorithms;
Additional Key Words and Phrases: Parallel graph algorithms, parallel graph processing
ACM Reference format:
Laxman Dhulipala, Guy E. Blelloch, and Julian Shun. 2021. Theoretically Efficient Parallel Graph Algorithms
Can Be Fast and Scalable. ACM Trans. Parallel Comput. 8, 1, Article 4 (April 2021), 70 pages.
https://doi.org/10.1145/3434393
A conference version of this paper appeared in the 30th Symposium on Parallelism in Algorithms and Architectures
(2018) [53]; in this version we give significantly more detail on the interface and algorithms.
Authors' addresses: L. Dhulipala and J. Shun, MIT CSAIL, 32 Vassar Street, Cambridge, MA 02139; email: {laxman,
jshun}@mit.edu; G. E. Blelloch, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213; email:
[email protected].
1 INTRODUCTION
Today, the largest publicly-available graph, the Hyperlink Web graph, consists of over 3.5 billion
vertices and 128 billion edges [107]. This graph presents a significant computational challenge
for both distributed and shared memory systems. Indeed, very few algorithms have been applied
to this graph, and those that have often take hours to run [85, 99, 154], with the fastest times
requiring between 1 and 6 minutes using a supercomputer [145, 146]. In this paper, we show that a
wide range of fundamental graph problems can be solved quickly on this graph, often in minutes,
on a single commodity shared-memory machine with a terabyte of RAM.1 For example, our k-core
implementation takes under 3.5 minutes on 72 cores, whereas Slota et al. [146] report a running
time of about 6 minutes for approximate k-core on a supercomputer with over 8000 cores. They
also report that they can identify the largest connected component on this graph in 63 seconds,
whereas we can identify all connected components in just 25 seconds. Another recent result by
Stergiou et al. [147] solves connectivity on the Hyperlink 2012 graph in 341 seconds on a 1000
node cluster with 12000 cores and 128TB of RAM. Compared to this result, our implementation is
13.6x faster on a system with 128x less memory and 166x fewer cores. However, we note that they
are able to process a significantly larger private graph that we would not be able to fit into our
memory footprint. A more complete comparison between our work and existing work, including
both distributed and disk-based systems [51, 78, 85, 99, 154], is given in Section 8.
Importantly, all of our implementations have strong theoretical bounds on their work and depth.
There are several reasons that algorithms with good theoretical guarantees are desirable. For one,
they are robust as even adversarially-chosen inputs will not cause them to perform extremely
poorly. Furthermore, they can be designed on pen-and-paper by exploiting properties of the prob-
lem instead of tailoring solutions to the particular dataset at hand. Theoretical guarantees also
make it likely that the algorithm will continue to perform well even if the underlying data changes.
Finally, careful implementations of algorithms that are nearly work-efficient can perform much
less work in practice than work-inefficient algorithms. This reduction in work often translates to
faster running times on the same number of cores [52]. We note that most running times that have
been reported in the literature on the Hyperlink Web graph use parallel algorithms that are not
theoretically-efficient.
In this paper, we present implementations of parallel algorithms with strong theoretical bounds
on their work and depth for connectivity, biconnectivity, strongly connected components, low-
diameter decomposition, graph spanners, maximal independent set, maximal matching, graph
coloring, breadth-first search, single-source shortest paths, widest (bottleneck) path, betweenness
centrality, PageRank, spanning forest, minimum spanning forest, k-core decomposition, approxi-
mate set cover, approximate densest subgraph, and triangle counting. We describe the program-
ming interfaces, techniques, and optimizations used to achieve good performance on graphs with
billions of vertices and hundreds of billions of edges and share experimental results for the Hyper-
link 2012 and Hyperlink 2014 Web crawls, the largest and second largest publicly-available graphs,
as well as several smaller real-world graphs at various scales. Some of the algorithms we describe
are based on previous results from Ligra, Ligra+, and Julienne [52, 136, 141], and other papers
on efficient parallel graph algorithms [32, 77, 142]. However, most existing implementations were
changed significantly in order to be more memory efficient. Several algorithm implementations
for problems like strongly connected components, minimum spanning forest, and biconnectiv-
ity are new, and required implementation techniques to scale that we believe are of independent
interest. We also had to extend the compressed representation from Ligra+ [141] to ensure that
our graph primitives for mapping, filtering, reducing and packing the neighbors of a vertex were
theoretically-efficient. We note that using compression techniques is crucial for representing the
symmetrized Hyperlink 2012 graph in 1TB of RAM, as storing this graph in an uncompressed for-
mat would require over 900GB to store the edges alone, whereas the graph requires 330GB in our
compressed format (less than 1.5 bytes per edge). We show the running times of our algorithms
on the Hyperlink 2012 graph as well as their work and depth bounds in Table 1. To make it easy to
build upon or compare to our work in the future, we describe the Graph Based Benchmark Suite
(GBBS), a benchmark suite containing our problems with clear I/O specifications, which we have
made publicly-available.2
1 These machines are roughly the size of a workstation and can be easily rented in the cloud (e.g., on Amazon EC2).
2 https://github.com/ParAlg/gbbs.
Table 1. Running Times (in seconds) of Our Algorithms on the Symmetrized Hyperlink 2012
Graph, Where (1) is the Single-Thread Time, (72h) is the 72-core Time Using Hyper-threading,
and (SU) is the Parallel Speedup
We present an experimental evaluation of all of our implementations, and in almost all cases,
the running times we report are faster than any previously reported numbers on any machine, even
much larger supercomputers. We are also able to apply our algorithms to the largest publicly-
available graph, in many cases for the first time in the literature, using a reasonably modest ma-
chine. Most importantly, our implementations are based on reasonably simple algorithms with
strong bounds on their work and depth. We believe that our implementations are likely to scale to
larger graphs and lead to efficient algorithms for related problems.
2 RELATED WORK
Parallel Graph Algorithms. Parallel graph algorithms have received significant attention since
the start of parallel computing, and many elegant algorithms with good theoretical bounds have
been developed over the decades (e.g., [5, 25, 45, 63, 84, 89, 98, 109, 111, 112, 121, 125, 135, 148]). A
major goal in parallel graph algorithm design is to find work-efficient algorithms with polylogarith-
mic depth. While many suspect that work-efficient algorithms may not exist for all parallelizable
graph problems, as inefficiency may be inevitable for problems that depend on transitive closure,
many problems that are of practical importance do admit work-efficient algorithms [88]. For these
problems, which include connectivity, biconnectivity, minimum spanning forest, maximal inde-
pendent set, maximal matching, and triangle counting, giving theoretically-efficient implementa-
tions that are simple and practical is important, as the amount of parallelism available on modern
systems is still modest enough that reducing the amount of work done is critical for achieving good
performance. Aside from intellectual curiosity, investigating whether theoretically-efficient graph
algorithms also perform well in practice is important, as theoretically-efficient algorithms are less
vulnerable to adversarial inputs than ad-hoc algorithms that happen to work well in practice.
Unfortunately, some problems that are not known to admit work-efficient parallel algorithms
due to the transitive-closure bottleneck [88], such as strongly connected components (SCC) and
single-source shortest paths (SSSP) are still important in practice. One method for circumvent-
ing the bottleneck is to give work-efficient algorithms for these problems that run in depth pro-
portional to the diameter of the graph—as real-world graphs have low diameter, and theoreti-
cal models of real-world graphs predict a logarithmic diameter, these algorithms offer theoretical
guarantees in practice [33, 131]. Other problems, like k-core, are P-complete [7], which rules out
polylogarithmic-depth algorithms for them unless P = NC [73]. However, even k-core admits an
algorithm with strong theoretical guarantees on its work that is efficient in practice [52].
Parallel Graph Processing Frameworks. Motivated by the need to process very large graphs,
there have been many graph processing frameworks developed in the literature (e.g., [71, 97, 101,
115, 136] among many others). We refer the reader to [105, 152] for surveys of existing frame-
works. Several recent graph processing systems evaluate the scalability of their implementations
by solving problems on massive graphs [52, 85, 99, 145, 147, 154]. All of these systems report
running times either on the Hyperlink 2012 or the Hyperlink 2014 graph, two web crawls re-
leased by the WebDataCommons that are the largest and second largest publicly-available graphs
respectively. We describe these recent systems and give a detailed comparison of how our imple-
mentations compare to these existing solutions in Section 8.
Benchmarking Parallel Graph Algorithms. There are a surprising number of existing bench-
marks of parallel graph algorithms. SSCA [15] specifies four graph kernels, which include gen-
erating graphs in adjacency list format, subgraph extraction, and graph clustering. The Problem
Based Benchmark Suite (PBBS) [139] is a general benchmark of parallel algorithms that includes
6 problems on graphs: BFS, spanning forest, minimum spanning forest, maximal independent set,
maximal matching, and graph separators. The PBBS benchmarks are problem-based in that they
are defined only in terms of the input and output without any specification of the algorithm used
to solve the problem. We follow the style of PBBS in this paper of defining the input and output
requirements for each problem. The Graph Algorithm Platform (GAP) Benchmark Suite [22] spec-
ifies 6 kernels: BFS, SSSP, PageRank, connectivity, betweenness centrality, and triangle counting.
Several recent benchmarks characterize the architectural properties of parallel graph algo-
rithms. GraphBIG [113] describes 12 applications, including several problems that we consider, like
k-core and graph coloring (using the Jones-Plassmann algorithm), but also problems like depth-
first search, which are difficult to parallelize, as well as dynamic graph operations. CRONO [4]
implements 10 graph algorithms, including all-pairs shortest paths, exact betweenness central-
ity, traveling salesman, and depth-first search. LDBC [81] is an industry-driven benchmark that
selects 6 algorithms that are considered representative of graph processing including BFS, and
several algorithms based on label propagation.
Unfortunately, all of the existing graph algorithm benchmarks we are aware of restrict their eval-
uation to small graphs, often on the order of tens or hundreds of millions of edges, with the largest
graphs in the benchmarks having about two billion edges. As real-world graphs are frequently sev-
eral orders of magnitude larger than this, evaluation on such small graphs makes it hard to judge
whether the algorithms or results from a benchmark scale to terabyte-scale graphs. This paper
provides a problem-based benchmark in the style of PBBS for fundamental graph problems, and
evaluates theoretically-efficient parallel algorithms for these problems on the largest real-world
graphs, which contain hundreds of billions of edges.
3 PRELIMINARIES
3.1 Graph Notation
We denote an unweighted graph by G (V , E), where V is the set of vertices and E is the set of
edges in the graph. A weighted graph is denoted by G = (V , E, w ), where w is a function which
maps an edge to a real value (its weight). The number of vertices in a graph is n = |V |, and the
number of edges is m = |E|. Vertices are assumed to be indexed from 0 to n − 1. We call these
unique integer identifiers for vertices vertex IDs. For undirected graphs we use N (v) to denote
the neighbors of vertex v and d (v) to denote its degree. For directed graphs, we use N − (v) and
N + (v) to denote the in and out-neighbors of a vertex v, and d − (v) and d + (v) to denote its in and
out-degree, respectively. We use distG (u, v) to refer to the shortest path distance between u and v
in G. We use diam(G) to refer to the diameter of the graph, or the longest shortest path distance
between any vertex s and any vertex v reachable from s. Given an undirected graph G = (V , E) the
density of a set S ⊆ V , or D (S ), is equal to |E (S )|/|S |, where E (S ) are the edges in the induced subgraph
on S. Δ is used to denote the maximum degree of the graph. We assume that there are no self-edges
or duplicate edges in the graph. We refer to graphs stored as a list of edges as being stored in the
edgelist format and the compressed-sparse column and compressed-sparse row formats as CSC
and CSR respectively.
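To make these storage formats concrete, the following self-contained C++ sketch shows an uncompressed, unweighted CSR representation and how an edgelist maps onto it. The struct and function names are ours for illustration and are not the GBBS types.

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Illustrative uncompressed CSR: the neighbors of vertex v are stored in
    // edges[offsets[v] .. offsets[v+1]).
    struct CsrGraph {
      std::vector<uint64_t> offsets;  // length n + 1
      std::vector<uint32_t> edges;    // length m, concatenated adjacency lists
    };

    // Build a CSR graph from an edgelist (pairs of endpoints), assuming the input
    // is already symmetrized and contains no self-edges or duplicate edges.
    CsrGraph FromEdgelist(uint32_t n,
                          const std::vector<std::pair<uint32_t, uint32_t>>& el) {
      CsrGraph g;
      g.offsets.assign(n + 1, 0);
      for (auto [u, v] : el) g.offsets[u + 1]++;                          // degree counts
      for (uint32_t i = 0; i < n; i++) g.offsets[i + 1] += g.offsets[i];  // prefix sum
      g.edges.resize(el.size());
      std::vector<uint64_t> cursor(g.offsets.begin(), g.offsets.end() - 1);
      for (auto [u, v] : el) g.edges[cursor[u]++] = v;                    // place edges
      return g;
    }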
3.3 Model of Computation
We analyze our algorithms in the binary-forking model [31]. In this model, a thread can perform a
fork instruction that creates two child threads which execute in parallel; the forking thread then
suspends until both of its children terminate (execute an end instruction). A computation starts with a single
root thread and finishes when that root thread ends. Processes can perform reads and writes to
the shared memory, as well as the testAndSet instruction. This model supports what is often
referred to as nested parallelism. If the root thread never does a fork, it is a standard sequential
program.
A computation can be viewed as a series-parallel DAG in which each instruction is a vertex,
sequential instructions are composed in series, and the forked threads are composed in parallel.
The work of a computation is the number of vertices and the depth is the length of the longest
path in the DAG. As is standard with the RAM model, we assume that the memory locations and
registers have at most O (log M ) bits, where M is the total size of the memory used.
Model Variants. We augment the binary-forking model described above with two atomic instruc-
tions that are used by our algorithms, fetchAndAdd and priorityWrite, and refer to the model
with these instructions as the FA-BF and PW-BF variants of the binary-forking model, respec-
tively. We abbreviate the basic binary-forking model with only the testAndSet instruction as the
BF model. Note that the basic binary-forking model includes a testAndSet, as this instruction is
necessary to implement joining tasks in a parallel scheduler (see for example [9, 38]), and since
all modern multicore architectures include the testAndSet instruction.
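For concreteness, the following minimal C++ sketch shows one way these three atomic primitives can be realized with std::atomic on current multicores; the function names mirror the primitives used in the paper, but the code below is our own illustration rather than the GBBS implementation.

    #include <atomic>

    // testAndSet: atomically set the flag and return true iff it was previously unset.
    inline bool testAndSet(std::atomic<bool>* flag) {
      return !flag->exchange(true, std::memory_order_acq_rel);
    }

    // fetchAndAdd: atomically add inc to *cell and return the value held beforehand.
    inline long fetchAndAdd(std::atomic<long>* cell, long inc) {
      return cell->fetch_add(inc, std::memory_order_acq_rel);
    }

    // priorityWrite: atomically replace *cell with val if val has higher priority
    // (here, "higher priority" means smaller); return true iff the write happened.
    inline bool priorityWrite(std::atomic<long>* cell, long val) {
      long cur = cell->load(std::memory_order_relaxed);
      while (val < cur) {
        if (cell->compare_exchange_weak(cur, val)) return true;
        // on failure, cur is reloaded with the current value and the loop retries
      }
      return false;
    }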
Parallel Primitives. The following parallel procedures are used throughout the paper. Scan takes as
input an array A of length n, an associative binary operator ⊕, and an identity element ⊥ such that
⊥ ⊕ x = x for any x, and returns the array (⊥, ⊥ ⊕ A[0], ⊥ ⊕ A[0] ⊕ A[1], . . . , ⊥ ⊕ A[0] ⊕ · · · ⊕ A[n − 2])
as well as the overall sum ⊥ ⊕ A[0] ⊕ · · · ⊕ A[n − 1]. Scan can be done in O (n) work and O (log n)
depth (assuming ⊕ takes O (1) work) [84]. Reduce takes an array A and a monoid (⊥, ⊕) and returns
the sum of the elements in A with respect to the monoid, ⊥ ⊕ A[0] ⊕ · · · ⊕ A[n − 1]. Filter takes an
array A and a predicate f and returns a new
array containing a ∈ A for which f (a) is true, in the same order as in A. Reduce and filter can
both be done in O (n) work and O (log n) depth (assuming ⊕ and f take O (1) work). Finally, the
PointerJump primitive takes an array P of parent pointers which represent a directed rooted
forest (i.e., P[v] is the parent of vertex v) and returns an array R where R[v] is the root of the
directed tree containing v. This primitive can be implemented in O (n) work, and O (log n) depth
whp in the BF model [31].
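The following sequential C++ reference implementation spells out the input/output behavior of scan, reduce, and filter described above (names ours); the parallel versions used by our implementations achieve the stated work and depth bounds, which this sketch does not attempt to match.

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Exclusive scan: prefixes[i] = id ⊕ A[0] ⊕ ... ⊕ A[i-1]; also returns the total.
    template <class T, class F>
    std::pair<std::vector<T>, T> Scan(const std::vector<T>& A, T id, F op) {
      std::vector<T> prefixes(A.size());
      T sum = id;
      for (std::size_t i = 0; i < A.size(); i++) {
        prefixes[i] = sum;
        sum = op(sum, A[i]);
      }
      return {prefixes, sum};
    }

    // Reduce: returns id ⊕ A[0] ⊕ ... ⊕ A[n-1] with respect to the monoid (id, op).
    template <class T, class F>
    T Reduce(const std::vector<T>& A, T id, F op) {
      T sum = id;
      for (const T& x : A) sum = op(sum, x);
      return sum;
    }

    // Filter: keeps the elements satisfying pred, preserving their relative order.
    template <class T, class P>
    std::vector<T> Filter(const std::vector<T>& A, P pred) {
      std::vector<T> out;
      for (const T& x : A) if (pred(x)) out.push_back(x);
      return out;
    }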
4 INTERFACE
In this section we describe the high-level graph processing interface used by our algorithm imple-
mentations in GBBS, and explain how the interface is integrated into our overall system architec-
ture. The interface is written in C++ and extends the Ligra, Ligra+, and Julienne frameworks [52,
136, 141] with additional functional primitives over graphs and vertices that are parallel by default.
In what follows, we provide descriptions of these functional primitives, as well as the parallel cost
bounds obtained by our implementations.
System Overview. The GBBS library, which underlies our benchmark algorithm implementations,
is built as a number of layers, which we illustrate in Figure 1.3 We use a shared-memory approach
to parallel graph processing in which the entire graph is stored in the main memory of a single
multicore machine. Our codes exploit nested parallelism using scheduler-agnostic parallel prim-
itives, such as fork-join and parallel-for loops. Thus, they can easily be compiled to use different
parallel runtimes such as Cilk, OpenMP, TBB, and also a custom work-stealing scheduler imple-
mented by the authors [27]. Theoretically, our algorithms are implemented and analyzed in the
binary-forking model [31] (see Section 3.3). Our interface makes use of several new types which
are defined in Table 2. We also define these types when they are first used in the text.
3 A brief version of this interface was presented by the authors and their collaborators in [59].
Fig. 1. System architecture of GBBS. The core interfaces are the vertexSubset (Section 4.2), bucketing (Sec-
tion 4.3), vertex (Section 4.4), and graph interfaces (Section 4.5). These interfaces utilize parallel primitives and
routines from ParlayLib [27]. Parallelism is implemented using a parallel runtime system—Cilk, OpenMP,
TBB, or a homegrown scheduler based on the Arora-Blumofe-Plaxton deque [10] that we implemented
ourselves—and can be swapped using a command line argument. The vertex and graph interfaces use a
compression library that mediates access to the underlying graph, which can either be compressed or un-
compressed (see Section 4.1).
Data Types. One of the primary data types used in GBBS is the vertexSubset data type, which
represents a subset of vertices in the graph. Conceptually, a vertexSubset can either be sparse
(represented as a collection of vertex IDs) or dense (represented as a boolean array or bit-vector
of length n). A T vertexSubset is a generic vertexSubset, where each vertex is augmented with a
value of type T.
Primitives. We use four primitives defined on vertexSubset, which we illustrate in Figure 2.
vertexMap takes a vertexSubset and applies a user-defined function f over each vertex. This
Fig. 2. The core primitives in the vertexSubset interface used by GBBS, including the type definition of each
primitive and the cost bounds. We use vset as an abbreviation for vertexSubset in the figure. A vertexSubset
is a representation of a set of vertex IDs, which are unique integer identifiers for vertices. If the input
vertexSubset is augmented, the user-defined functions supplied to vertexMap and vertexFilter take a pair
of the vertex ID and augmented value as input, and the addToSubset primitive takes a sequence of vertexID
and augmented value pairs.
primitive makes it easy to apply user-defined logic over vertices in a subset in parallel without
worrying about the state of the underlying vertexSubset (i.e., whether it is sparse or dense). We
also provide a specialized version of the vertexMap primitive, vertexMapVal through which the
user can create an augmented vertexSubset. vertexFilter takes a vertexSubset and a user-defined
predicate P and keeps only vertices satisfying P in the output vertexSubset. Finally, addToSubset
takes a vertexSubset and a sequence of unique vertex identifiers not already contained in the sub-
set, and adds these vertices to the subset. Note that this function mutates the supplied vertexSubset.
This primitive is implemented in O (1) amortized work by representing a sparse vertexSubset us-
ing a resizable array. The worst case depth of the primitive is O (log n) since the primitive scans at
most O (n) vertex IDs in parallel.
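As a concrete (sequential) model of these primitives, the sketch below represents a sparse, unaugmented vertexSubset as a resizable array of vertex IDs and gives reference versions of vertexMap, vertexFilter, and addToSubset. It is illustrative only and is not the GBBS implementation, which also supports dense and augmented subsets and runs the loops in parallel.

    #include <cstdint>
    #include <vector>

    using VertexId = uint32_t;

    // Sparse, unaugmented vertexSubset: a resizable array of vertex IDs.
    struct VertexSubset {
      std::vector<VertexId> ids;
    };

    // vertexMap: apply the user-defined function f to every vertex in the subset.
    template <class F>
    void VertexMap(const VertexSubset& vs, F f) {
      for (VertexId v : vs.ids) f(v);
    }

    // vertexFilter: keep only the vertices satisfying the predicate p.
    template <class P>
    VertexSubset VertexFilter(const VertexSubset& vs, P p) {
      VertexSubset out;
      for (VertexId v : vs.ids) if (p(v)) out.ids.push_back(v);
      return out;
    }

    // addToSubset: append vertices assumed not already present; with a resizable
    // array this costs O(1) amortized work per added vertex, as discussed above.
    void AddToSubset(VertexSubset* vs, const std::vector<VertexId>& extra) {
      vs->ids.insert(vs->ids.end(), extra.begin(), extra.end());
    }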
Fig. 3. The bucketing interface used by GBBS, including the type definition of each primitive and the cost
bounds. The bucketing structure represents a dynamic mapping from a set of identifiers to a set of buck-
ets. The total number of identifiers is denoted by n. † denotes that a bound holds in expectation, and ‡
denotes that a bound holds whp. We define the semantics of each operation in the text below.
The updateBuckets primitive updates the bktids for multiple identifiers by supplying the bucket structure
and a sequence of identifier and bktdest pairs.
The costs for using the bucket structure can be summarized by the following theorem from [52]:
Theorem 4.1. When there are n identifiers, T total buckets, K calls to updateBuckets, each of
which updates a set Si of identifiers, and L calls to nextBucket, parallel bucketing takes
O (n + T + Σi |Si |) expected work and O ((K + L) log n) depth whp.
We refer to the Julienne paper [52] for more details about the bucketing interface and its imple-
mentation. We note that the implementation is optimized for the case where only a small number
of buckets are processed, which is typically the case in practice.
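To fix the semantics used by the theorem above, the following sequential C++ sketch models the bucketing structure (makeBuckets with increasing order, nextBucket, and updateBuckets). The class and member names are ours; the Julienne/GBBS structure is parallel, lazily materializes buckets, and meets the stated cost bounds, none of which this reference model attempts to do.

    #include <cstdint>
    #include <functional>
    #include <map>
    #include <unordered_map>
    #include <unordered_set>
    #include <utility>
    #include <vector>

    using Id = uint32_t;
    using BucketId = uint32_t;
    constexpr BucketId kNullBkt = UINT32_MAX;

    // Sequential reference model of the bucketing interface (increasing order).
    class Buckets {
     public:
      Buckets(Id n, std::function<BucketId(Id)> bkt_of) {
        for (Id i = 0; i < n; i++) {
          BucketId b = bkt_of(i);
          if (b == kNullBkt) continue;  // identifiers mapped to the null bucket are not stored
          buckets_[b].insert(i);
          where_[i] = b;
        }
      }

      // nextBucket: extract and return the contents of the smallest non-empty bucket.
      std::pair<BucketId, std::vector<Id>> NextBucket() {
        if (buckets_.empty()) return {kNullBkt, {}};
        auto it = buckets_.begin();
        BucketId bkt = it->first;
        std::vector<Id> contents(it->second.begin(), it->second.end());
        for (Id id : contents) where_.erase(id);
        buckets_.erase(it);
        return {bkt, contents};
      }

      // updateBuckets: move each identifier to its destination bucket (bktdest).
      void UpdateBuckets(const std::vector<std::pair<Id, BucketId>>& moves) {
        for (auto [id, dest] : moves) {
          auto cur = where_.find(id);
          if (cur != where_.end()) {  // remove from the identifier's current bucket
            buckets_[cur->second].erase(id);
            if (buckets_[cur->second].empty()) buckets_.erase(cur->second);
          }
          if (dest == kNullBkt) { where_.erase(id); continue; }
          buckets_[dest].insert(id);
          where_[id] = dest;
        }
      }

     private:
      std::map<BucketId, std::unordered_set<Id>> buckets_;  // bucket id -> members
      std::unordered_map<Id, BucketId> where_;              // identifier -> current bucket
    };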
Fig. 4. The core vertex interface used by GBBS, including the type definition of each primitive and the cost
bounds for our implementations on uncompressed graphs. Note that for directed graphs, each of the neigh-
borhood operators has two versions, one for the in-neighbors and one for the out-neighbors of the vertex.
The cost bounds for the primitives on compressed graphs are identical assuming the compression block size
is O (log n) (note that for compressed graphs, i-th has work and depth proportional to the compression block
size of the graph in the worst case). The cost bounds shown here assume that the user-defined functions
supplied to map, reduce, scan, count, filter, pack, and iterate all cost O (1) work to evaluate. dit is the
number of times the function supplied to iterate returns true. nghlist is an abstract type for the neighbors
of a vertex, and is used by the vertex-vertex operators. The edge type is a triple (u, v, wuv ) where the first
two entries are the ids of the endpoints, and the last entry is the weight of the edge. l and h are the degrees
of the smaller and larger degree vertices supplied to a vertex-vertex operator, respectively.
4.5.1 Graph Operators. The graph operators, their types, and the cost bounds provided by our
implementation are shown in the top half of Figure 5. The interface provides primitives for query-
ing the number of vertices and edges in the graph (numVertices and numEdges), and for fetching
the vertex object for the i-th vertex (getVertex).
filterGraph. The filterGraph primitive takes as input a graph G (V , E), and a boolean function
P over edges specifying edges to preserve. filterGraph removes all edges in the graph where
P (u, v, wuv ) = false, and returns a new graph containing only edges where P (u, v, wuv ) = true.
The filterGraph primitive is useful for our triangle counting algorithm, which requires directing
the edges of an undirected graph to reduce overall work.
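As a sketch of this use case, the code below shows the effect of such a filter on an edgelist: it keeps only the copy of each undirected edge directed from its lower-degree endpoint to its higher-degree endpoint (ties broken by vertex ID), which is the kind of orientation used to bound the work of triangle counting. The names and the edgelist-based representation are ours, not the GBBS API.

    #include <cstdint>
    #include <utility>
    #include <vector>

    using VertexId = uint32_t;
    struct WeightedEdge { VertexId u, v; double w; };

    // Mirrors filterGraph specialized to the predicate "keep (u, v) only if u ranks
    // below v", where vertices are ranked by (degree, ID).
    std::vector<WeightedEdge> DirectByDegree(const std::vector<WeightedEdge>& edges,
                                             const std::vector<uint64_t>& degree) {
      auto rank = [&](VertexId x) { return std::make_pair(degree[x], x); };
      std::vector<WeightedEdge> out;
      for (const WeightedEdge& e : edges) {
        if (rank(e.u) < rank(e.v)) out.push_back(e);  // keep the low-to-high copy only
      }
      return out;
    }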
Fig. 5. The core graph interface used by GBBS, including the type definition of each primitive and the
cost bounds for our implementations on uncompressed graphs. vset is an abbreviation for vertexSubset
when providing a type definition. Note that for directed graphs, the interface provides two versions of each
vertexSubset operator, one for the in-neighbors and one for the out-neighbors of the vertex. The edge type is
a triple (u, v, wuv ) where the first two entries are the ids of the endpoints, and the last entry is the weight of
the edge. The vertexSubset operators can take both unaugmented and augmented vertexSubsets as input, but
ignore the augmented values in the input. U is the vertexSubset supplied as input to a vertexSubset operator.
For the src-based primitives, U ′ ⊆ U is the set of vertices that are matched by the condition function (see
the text below). The cost bounds for the primitives on compressed graphs are identical assuming the com-
pression block size is O (log n). The cost bounds shown here assume that the user-defined functions supplied
to the vertexSubset operators all cost O (1) work to evaluate. † denotes that a bound holds in expectation,
and ‡ denotes that a bound holds whp.
packGraph. The interface also provides a primitive over edges called packGraph which oper-
ates similarly to filterGraph, but works in-place and mutates the underlying graph. packGraph
takes as input a graph G (V , E), and a boolean function P over the edges specifying edges to pre-
serve. packGraph mutates the input graph to remove all edges that do not satisfy the predicate.
This primitive is used by the biconnectivity (Algorithm 9), strongly connected components (Algo-
rithm 11), maximal matching (Algorithm 13), and minimum spanning forest (Algorithm 10) algo-
rithms studied in this paper.
extractEdges. The extractEdges primitive takes as input a graph G (V , E), and a boolean
function P over edges which specifies edges to extract, and returns an array containing all edges
where P (u, v, wuv ) = true. This primitive is useful in algorithms studied in this paper such as
Fig. 6. Illustration of srcCount and nghCount primitives. The input is illustrated in Panel (1), and consists of
a graph and a vertexSubset, with vertices in the vertexSubset illustrated in green. The green edges are edges
for which the user-defined predicate, P, returns true. Panel (2) and Panel (3) show the results of applying
srcCount and nghCount, respectively. In Panel (2), the cond function C returns true for both vertices in
the input vertexSubset. In Panel (3), the condition function C only returns true for v 2 , v 4 , and v 5 , and false
for v 0 , v 1 , v 3 , and v 6 . The output is an augmented int vertexSubset, illustrated in red, where each source
(neighbor) vertex v s.t. C (v) = true has an augmented value containing the number of incident edges where
P returns true.
maximal matching (Algorithm 13) and minimum spanning forest (Algorithm 10) where it is used
to extract subsets of edges from a CSR representation of the graph, which are then processed using
an algorithm operating over edgelists (edge tuples stored in an array).
contractGraph. Lastly, the contractGraph primitive takes a graph and an integer cluster
labeling L, i.e., a mapping from vertices to cluster ids, and returns the graph G ′ = (V ′, E ′ ) where
E ′ = {(L(u), L(v)) | (u, v) ∈ E}, with any duplicate edges or self-loops removed. V ′ is V with all ver-
tices with no incident edges in E ′ removed. This primitive is used by the connectivity (Algorithm 7)
and spanning forest (Algorithm 8) algorithms studied in this paper. The primitive can naturally be
generalized to weighted graphs by specifying how to reweight parallel edges (e.g., by averaging, or
taking a minimum or maximum), although this generalization is not necessary for the algorithms
studied in this paper.
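The following sequential sketch spells out this relabel-and-deduplicate semantics on an edgelist (names ours). It returns only the contracted edge set E ′; the GBBS primitive is parallel and hash-based, as noted below, and also drops vertices left without incident edges.

    #include <algorithm>
    #include <cstdint>
    #include <set>
    #include <utility>
    #include <vector>

    using VertexId = uint32_t;
    using Edge = std::pair<VertexId, VertexId>;

    // E' = {(L(u), L(v)) | (u, v) in E, L(u) != L(v)}, with duplicates removed.
    // Edges are normalized to (min, max) since the graph is undirected.
    std::vector<Edge> Contract(const std::vector<Edge>& edges,
                               const std::vector<VertexId>& L) {
      std::set<Edge> dedup;
      for (auto [u, v] : edges) {
        VertexId cu = L[u], cv = L[v];
        if (cu != cv) dedup.insert({std::min(cu, cv), std::max(cu, cv)});
      }
      return std::vector<Edge>(dedup.begin(), dedup.end());
    }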
Implementations and Cost Bounds. filterGraph, packGraph, and extractEdges are imple-
mented by invoking filter and pack on each vertex in the graph in parallel. The overall work and
depth bounds follow from the fact that every edge is processed once by each endpoint, and all vertices
are filtered (packed) in parallel. contractGraph can be implemented in O (n + m) expected work
and O (log n) depth whp in the BF model using semisorting [31, 74]. In practice, contractGraph is
implemented using parallel hashing [137], and we refer the reader to [140] for the implementation
details.
4.5.2 VertexSubset Operators. The second part of the graph interface consists of a set of oper-
ators over vertexSubsets. At a high level, each of these primitives takes as input a vertexSubset,
applies a given user-defined function over the edges neighboring the vertexSubset, and outputs a
vertexSubset. The primitives include the edgeMap primitive from Ligra, as well as several exten-
sions and generalizations of the edgeMap primitive.
edgeMap. The edgeMap primitive takes as input a graph G (V , E), a vertexSubset U , and two
boolean functions F and C. edgeMap applies F to (u, v) ∈ E such that u ∈ U and C (v) = true (call
this subset of edges Ea ), and returns a vertexSubset U ′ where v ∈ U ′ if and only if (u, v) ∈ Ea and
F (u, v) = true. Our interface defines the edgeMap primitive identically to Ligra. This primitive is
used in many of the algorithms studied in this paper.
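The sequential reference code below pins down this input/output behavior on an adjacency-list graph: every edge out of the input frontier whose target satisfies C is passed to F, and targets for which F returns true form the (duplicate-free) output frontier. The names are ours; the GBBS edgeMap is parallel and uses the blocked, direction-optimized implementation discussed in Section 7.

    #include <cstdint>
    #include <vector>

    using VertexId = uint32_t;
    using Frontier = std::vector<VertexId>;  // a sparse vertexSubset

    template <class F, class C>
    Frontier EdgeMap(const std::vector<std::vector<VertexId>>& adj,
                     const Frontier& frontier, F f, C c) {
      std::vector<bool> emitted(adj.size(), false);
      Frontier out;
      for (VertexId u : frontier) {
        for (VertexId v : adj[u]) {
          if (c(v) && f(u, v) && !emitted[v]) {
            emitted[v] = true;  // each vertex appears at most once in the output
            out.push_back(v);
          }
        }
      }
      return out;
    }

In BFS, for instance, F would attempt to visit v (e.g., with a testAndSet on a visited flag or parent entry) and C would check that v is still unvisited.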
edgeMapData. The edgeMapData primitive works similarly to edgeMap, but returns an aug-
mented vertexSubset. Like edgeMap, it takes as input a graph G (V , E), a vertexSubset U , a
function F returning a value of type R option, and a boolean function C. edgeMapData ap-
plies F to (u, v) ∈ E such that u ∈ U and C (v) = true (call this subset of edges Ea ), and returns
a R vertexSubset U ′ where (v, r ) ∈ U ′ (r is the augmented value associated with v) if and only if
(u, v) ∈ Ea and F (u, v) = Some(r ). The primitive is only used in the weighted breadth-first search
algorithm in this paper, where the augmented value is used to store the distance to a vertex at the
start of a computation round (Algorithm 2).
srcReduce and srcCount. The srcReduce primitive takes as input a graph G (V , E) and a
vertexSubset U , a map function M over edges returning values of type R, a boolean function C,
and a monoid A over values of type R, and returns a R vertexSubset. srcReduce applies M to each
(u, v) ∈ E s.t. u ∈ U and C (u) = true (let Mu be the set of values of type R from applying M to edges
incident to u), and returns a R vertexSubset U ′ containing (u, r ) where r is the result of reducing
all values in Mu using the monoid A.
The srcCount primitive is a specialization of srcReduce, where R = int, the monoid A is (0, +),
and the map function is specialized to a boolean (predicate) function P over edges. This primitive
is useful for building a vertexSubset where the augmented value for each vertex is the number of
incident edges satisfying some condition. srcCount is used in our parallel approximate set cover
algorithm (Algorithm 15).
srcPack. The srcPack primitive is defined similarly to srcCount, but also removes edges that
do not satisfy the given predicate. Specifically, it takes as input a graph G (V , E), a vertexSubset
U , and two boolean functions, P, and C. For each u ∈ U where C (u) = true, the function applies
P to all (u, v) ∈ E and removes edges that do not satisfy P. The function returns an augmented
vertexSubset containing all sources u where C (u) = true. Each of these vertices is
augmented with an integer value storing the new degree of the vertex after applying the pack.
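The following sequential reference sketch (names ours) shows the srcCount behavior that srcPack extends: for each source u in the input subset with C(u) = true, it counts the incident edges satisfying P and emits (u, count) into an augmented subset; srcPack would additionally remove the failing edges and report the new degree instead.

    #include <cstdint>
    #include <utility>
    #include <vector>

    using VertexId = uint32_t;

    template <class C, class P>
    std::vector<std::pair<VertexId, uint64_t>> SrcCount(
        const std::vector<std::vector<VertexId>>& adj,
        const std::vector<VertexId>& subset, C cond, P pred) {
      std::vector<std::pair<VertexId, uint64_t>> out;  // augmented int vertexSubset
      for (VertexId u : subset) {
        if (!cond(u)) continue;
        uint64_t cnt = 0;
        for (VertexId v : adj[u]) if (pred(u, v)) cnt++;
        out.push_back({u, cnt});
      }
      return out;
    }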
nghReduce and nghCount. The nghReduce primitive is defined similarly to srcReduce
above, but aggregates the results for neighbors of the input vertexSubset. It takes as input a graph
G (V , E), a vertexSubset U , a map function M over edges returning values of type R, a boolean
function C, a monoid A over values of type R, and lastly an update function T from values of type
R to O option. It returns a O vertexSubset. This function performs the following logic: M is applied
to each edge (u, v) where u ∈ U and C (v) = true in parallel (let the resulting values of type R be
Mv ). Next, the mapped values for each such v are reduced in parallel using the monoid A to obtain
a single value, Rv . Finally, T is called on the pair (v, Rv ) and the vertex and augmented value pair
(v, o) is emitted to the output vertexSubset if and only if T returns Some(o). nghReduce is used
in our PageRank algorithm (Algorithm 19).
The nghCount primitive is a specialization of nghReduce, where R = int, the monoid A is
(0, +), and the map function is specialized to a boolean (predicate) function P over edges. ngh-
Count is used in our k-core (Algorithm 16) and approximate densest subgraph (Algorithm 17)
algorithms.
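The sketch below (names ours, sequential) fixes the nghCount semantics: edges out of the input subset are grouped by their target neighbor v with C(v) = true, the edges satisfying P are counted per neighbor, and the update function T decides whether (v, value) is emitted. The option type is modeled by T returning a bool and writing its output through a pointer; the GBBS implementation instead uses the parallel histogram technique of Section 7.1.

    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    using VertexId = uint32_t;

    template <class C, class P, class T>
    std::vector<std::pair<VertexId, uint64_t>> NghCount(
        const std::vector<std::vector<VertexId>>& adj,
        const std::vector<VertexId>& subset, C cond, P pred, T update) {
      std::map<VertexId, uint64_t> counts;             // neighbor -> #satisfying edges
      for (VertexId u : subset)
        for (VertexId v : adj[u])
          if (cond(v) && pred(u, v)) counts[v]++;
      std::vector<std::pair<VertexId, uint64_t>> out;  // augmented output vertexSubset
      for (auto [v, c] : counts) {
        uint64_t value;
        if (update(v, c, &value)) out.push_back({v, value});  // T returned Some(value)
      }
      return out;
    }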
Implementations and Cost Bounds. Our implementation of edgeMap in this paper is based on
the edgeMapBlocked primitive introduced in Section 7.2. The same primitive is used to implement
edgeMapData.
The src- primitives (srcReduce, srcCount, and srcPack) are relatively easy to implement.
These implementations work by iterating over the vertices in the input vertexSubset in parallel,
applying the condition function C, and then applying a corresponding vertex primitive on the
incident edges. The work for source operators is O (|U | + Σu ∈U ′ d (u)), where U ′ ⊆ U consists of
all vertices u ∈ U where C (u) = true, and the depth is O (log n) assuming that the boolean functions
and monoid cost O (1) work to apply.
The ngh- primitives are somewhat trickier to implement compared to the src- primi-
tives, since these primitives require performing non-local reductions at the neighboring endpoints
of edges. Both nghReduce and nghCount can be implemented by first writing out all neighbors
of the input vertexSubset satisfying C to an array, A (along with their augmented values). A has
size at most O (Σu ∈U d (u)). The next step applies a work-efficient semisort (e.g., [74]) to store all
pairs of neighbor and value keyed by the same neighbor contiguously. The final step is to apply a
prefix sum over the array, combining values keyed by the same neighbor using the reduction op-
eration defined by the monoid, and to use a prefix sum and map to build the output vertexSubset,
augmented with the final value in the array for each neighbor. The overall work is proportional to
semisorting and applying prefix-sums on arrays of size |A|, which is O (Σu ∈U d (u)) work in expecta-
tion, and the depth is O (log n) whp [31, 74]. In practice, our implementations use the work-efficient
histogram technique described in Section 7.1 for both nghReduce and nghCount.
Optimizations. We observe that for ngh- operators there is a potential to achieve speedups by
applying the direction-optimization technique proposed by Beamer for the BFS problem [21] and
applied to other problems by Shun and Blelloch [136]. Recall that this technique maps over all
vertices v ∈ V , and for those where C (v) = true, scans over the in-edges (v, u, wvu ) applying F to
edges where u is in the input vertexSubset until C (v) is no longer true. We can apply the same
technique for nghReduce and nghCount by performing a reduction over the in-neighbors of all
vertices satisfying C (v). This optimization can be applied without an increase in the theoretical
cost of the algorithm whenever the number of edges incident to the input vertexSubset is a constant
fraction of m. The advantage is that the direction-optimized version runs in O (n) space and per-
forms inexpensive reads over the in-neighbors, whereas the more costly semisort or histogram
based approach runs in O (Σu ∈U d (u)) space and requires performing multiple writes per incident
edge.
5 BENCHMARK
In this section we describe the I/O specifications of our benchmark. We discuss related work and
present the theoretically-efficient algorithm implemented for each problem in Section 6. We mark
implementations based on prior work with a †, although in many of these cases, the implementa-
tions were still significantly modified to improve scalability on large compressed graphs.
O (k )-Spanner
Input: G = (V , E), an undirected, unweighted graph, and an integer stretch factor, k.
Output: H ⊆ E, a set of edges such that for every u, v ∈ V connected in G, distH (u, v) ≤ O (k ) ·
distG (u, v).
Low-Diameter Decomposition†
Input: G = (V , E), an undirected graph, 0 < β < 1.
Output: L, a mapping from each vertex to a cluster ID representing a (O (β ), O ((log n)/β )) de-
composition. A (β, d )-decomposition partitions V into C 1 , . . . , Ck such that:
• The shortest path between two vertices in Ci using only vertices in Ci is at most d.
• The number of edges (u, v) where u ∈ Ci , v ∈ C j , j ≠ i is at most βm.
Connectivity†
Input: G = (V , E), an undirected graph.
Output: L, a mapping from each vertex to a unique label for its connected component.
Spanning Forest†
Input: G = (V , E), an undirected graph.
Output: T , a set of edges representing a spanning forest of G.
Biconnectivity
Input: G = (V , E), an undirected graph.
Output: L, a mapping from each edge to the label of its biconnected component.
Maximal Matching†
Input: G = (V , E), an undirected graph.
Output: E ′ ⊆ E, a set of edges such that no two edges in E ′ share an endpoint and all edges in
E \ E ′ share an endpoint with some edge in E ′ .
Graph Coloring†
Input: G = (V , E), an undirected graph.
Output: C, a mapping from each vertex to a color such that for each edge (u, v) ∈ E, C (u) ≠ C (v),
using at most Δ + 1 colors.
k-core†
Input: G = (V , E), an undirected graph.
Output: D, a mapping from each vertex to its coreness value. Section 6.4 provides the definition
of k-cores and coreness values.
Triangle Counting†
Input: G = (V , E), an undirected graph.
Output: TG , the total number of triangles in G. Each unordered (u, v, w ) triangle is counted once.
PageRank†
Input: G = (V , E), an undirected graph.
Output: P, a mapping from each vertex to its PageRank value after a single iteration of PageRank.
6 ALGORITHMS
In this section, we give self-contained descriptions of all of the theoretically efficient algorithms
implemented in our benchmark and discuss related work. We cite the original papers that our
algorithms are based on in Table 1. We assume m = Ω(n) when stating cost bounds in this section.
Pseudocode Conventions. The pseudocode for many of the algorithms makes use of our graph
processing interface, as well as the atomic primitives testAndSet, fetchAndAdd, and prior-
ityWrite. Our graph processing interface is defined in Section 4 and the atomic primitives are
defined in Section 3. We use _ as a wildcard to bind values that are not used. We use anonymous
functions in the pseudocode for conciseness, and adopt a syntax similar to how anonymous func-
tions are defined in the ML language. An anonymous function is introduced using the fn keyword.
For example,
fn(u, v, wuv ) : edge → return Rank[v]
is an anonymous function taking a triple representing an edge, and returning the Rank of the vertex
v. We drop type annotations when the argument types are clear from context. The option type,
E option, provides a distinction between some value of type E (Some(e)) and no value (None).
Types used by our algorithms are also summarized in Table 2. We use the array initializer notation
A[0, . . . , e) = value to denote an array consisting of e elements all initialized to value in parallel.
We use standard functional sequence primitives, such as map and filter on arrays. Assuming that
the user-defined map and filter functions cost O (1) work to apply, these primitives cost O (n) work
and O (log n) depth on a sequence of length n. We use the syntax ∀i ∈ [s, e) as shorthand for a
parallel loop over the indices [s, . . . , e). For example, ∀i ∈ [0, e), A[i] = i · A[i] updates the i-th
value of A[i] to i · A[i] in parallel for 0 ≤ i < e.
be emitted in the output vertexSubset (Line 7), and otherwise returns false (Line 8). Finally, at the
end of a round the algorithm increments the value of the current distance on Line 17.
Both the GeneralizedBFS and BFS algorithms run in O (m) work and O (diam(G) log n) depth
on the BF model. We note that emitting a shortest-path tree from a subset of vertices instead of dis-
tances can be done using nearly identical code, with the only differences being that (i) the algorithm
will store a Parents array instead of a Distances array, and (ii) the Update function will set the par-
ent of a vertex d to s upon a successful testAndSet. The main change we made to this algorithm
compared to the Ligra implementation was to improve the cache-efficiency of the edgeMap imple-
mentation using edgeMapBlocked, the block-based version of edgeMap described in Section 7.
ALGORITHM 2: wBFS
1: Distance[0, . . . , n) ← ∞
2: Relaxed[0, . . . , n) ← false
3: procedure GetBucketNum(v) return Distance[v]
4: procedure Cond(v) return true
5: procedure Update(s, d, w sd )
6: newDist ← Distance[s] + w sd
7: oldDist ← Distance[d]
8: res ← None
9: if newDist < oldDist then
10: if testAndSet(&Relaxed[d]) then ▷ first writer this round
11: res ← Some(oldDist) ▷ store and return the original distance
12: priorityWrite(&Distance[d], newDist, <)
13: return res
14: procedure Reset(v, oldDist)
15: Relaxed[v] ← false
16: newDist ← Distance[v]
17: return B.getBucket(oldDist, newDist)
18: procedure wBFS(G (V , E, w ), src)
19: Distance[src] ← 0
20: B ← makeBuckets(|V |, GetBucketNum, increasing)
21: (bktId, bktContents) ← B.nextBucket()
22: while bktId ≠ nullbkt do
23: Moved ← edgeMapData(G, bktContents, Update, Cond)
24: NewBuckets ← vertexMapVal(Moved, Reset)
25: B.updateBuckets(NewBuckets)
26: (bktId, bktContents) ← B.nextBucket()
27: return Distance
this edge is the first visitor to d during this round by performing a testAndSet on the Relaxed
array, emitting d, and the old distance to d in the output vertexSubset if so.
The next step in the round applies a vertexMapVal on the augmented vertexSubset Moved.
The map function first resets the Relaxed flag for each vertex (Line 15), and then computes the
new bucket each relaxed vertex should move to using the getBucket primitive (Line 17). The out-
put is an augmented vertexSubset NewBuckets, containing vertices and their destination buckets
(Line 24). The last step updates the buckets for all vertices in NewBuckets (Line 25). The algorithm
runs in O (m) work in expectation and O (diam(G) log n) depth whp on the PW-BF model, as ver-
tices use priorityWrite to write the minimum distance to a neighboring vertex in each round.
The main change we made to this algorithm was to improve the cache-efficiency of edgeMapData
using the block-based edgeMapBlocked algorithm described in Section 7.
General-Weight SSSP
The General-Weight SSSP problem is to compute a mapping with the shortest path distance be-
tween the source vertex and every reachable vertex on a graph with general (positive and negative)
edge weights. The mapping should return a distance of ∞ for unreachable vertices. Furthermore,
if the graph contains a negative weight cycle reachable from the source, the mapping should set
the distance to all vertices in the cycle and vertices reachable from it to −∞.
Our implementation for this problem is the classic Bellman-Ford algorithm [49]. Algorithm 3
shows pseudocode for our frontier-based version of Bellman-Ford.
ALGORITHM 3: Bellman-Ford
1: Relaxed[0, . . . , n) ← false
2: Distance[0, . . . , n) ← ∞
3: procedure Cond(v)
4: return true
5: procedure ResetFlags(v)
6: Relaxed[v] ← false
7: procedure Update(s, d, w sd )
8: newDist ← Distance[s] + w sd
9: if newDist < Distance[d] then
10: priorityWrite(&Distance[d], newDist, <)
11: if !Relaxed[d] then
12: return testAndSet(&Relaxed[d]) ▷ to ensure d is only added once to the next frontier
13: return false
14: procedure BellmanFord(G (V , E, w ), src)
15: F ← vertexSubset({src})
16: Distance[src] ← 0
17: round ← 0
18: while |F | > 0 do
19: if round = n then ▷ only applied if a negative weight cycle is reachable from src
20: R ← GeneralizedBFS(G, F ) ▷ defined in Algorithm 1
21: In parallel, set Distance[u] ← −∞ for u ∈ R s.t. R[u] ≠ ∞
22: return Distance
23: F ← edgeMap(G, F , Update, Cond)
24: vertexMap(F , ResetFlags)
25: round ← round + 1
26: return Distance
The algorithm runs over a number of rounds. The initial frontier, F , consists of just the source vertex, src (Line 15). In each
round, the algorithm applies edgeMap over F to produce a new frontier of vertices that had their
shortest path distance decrease, and updates F to be this new frontier. The map function supplied
to edgeMap (Lines 7–13) tests whether the distance to a neighbor can be decreased, and uses a pri-
orityWrite to atomically lower the distance (Line 10). Emitting a neighbor to the next frontier
is done using a testAndSet on Relaxed, an array of flags indicating whether the vertex had its
current shortest path distance decrease (Line 12). Finally, at the end of a round the algorithm resets
the flags for all vertices in F (Line 24). After k rounds, the algorithm has correctly computed the
distances to vertices that are within k hops from the source. Since any vertex is at most n − 1 hops
away from the source, if the number of rounds in the algorithm reaches n, we know that the in-
put graph contains a negative weight cycle. The algorithm identifies vertices reachable from these
cycles using the GeneralizedBFS algorithm (Algorithm 1) to compute all vertices reachable from
the current frontier (Line 20). It sets the distance to the vertices with distances that are not ∞ (i.e.,
reachable from a negative weight cycle) to −∞ (Line 21).
For inputs without negative-weight cycles, the algorithm runs in O (diam(G)m) work and
O (diam(G) log n) depth on the PW-BF model. If the graph contains a negative-weight cycle, the
algorithm runs in O (nm) work and O (n log n) depth on the PW-BF model. The main change we
made to this algorithm compared to the Ligra implementation was to add a GeneralizedBFS im-
plementation and invoke it in the case where the algorithm detects a negative weight cycle. We
also improve its cache-efficiency by using the block-based version of edgeMap, edgeMapBlocked,
which we describe in Section 7.
ζr (v) = 1/σrv + Σw ∈D r (v ) σvw · (1/σr w )
where D r (v) is the set of all descendent vertices through v, i.e., w ∈ V where a shortest path from
r to w passes through v. These final scores can be converted to the dependency scores by first
subtracting 1/σrv and then multiplying by σrv , since
σrv · Σw ∈D r (v ) σvw /σr w = Σw ∈D r (v ) σr w (v)/σr w .
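As a small numerical check (the numbers are ours): suppose σrv = 2 and v has a single descendant w with σvw = 1 and σr w = 4, so that σr w (v) = σrv · σvw = 2 shortest paths from r to w pass through v. Then ζr (v) = 1/2 + 1 · (1/4) = 3/4; subtracting 1/σrv = 1/2 leaves 1/4, and multiplying by σrv = 2 gives 1/2 = σr w (v)/σr w , which is exactly the dependency contribution of w to v.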
Next, we discuss the main difference between our implementation and that of Ligra. The Ligra
implementation is based on using edgeMap with a map function that uses the fetchAndAdd
primitive to update the number of shortest paths (σrv ) in the forward phase, and to update the
inverted dependencies (ζr (v)) in the reverse phase. The Ligra implementation thus combines the
generation of the next BFS frontier with aggregating the number of shortest paths passing through
a vertex in the first phase, or the inverted dependency contribution of the vertex in the second
phase by using the fetchAndAdd primitive. In our implementation, we observed that for certain
graphs, especially those with skewed degree distribution, using a fetchAndAdd to sum up the
contributions incurs a large amount of contention, and significant speedups (in our experiments,
up to 2x on the Hyperlink2012 graph) can be obtained by (i) separating the computation of the next
frontier from the computation of the σrv and δr (v) values in the two phases and (ii) performing
the computation of σrv and δr (v) using the pull-based approach described below.
The pseudocode for our betweenness centrality implementation is shown in Algorithm 4. The
algorithm runs in two phases. The first phase (Lines 26–31) computes a BFS tree rooted at the
source vertex r using a nghMap with Update and Cond defined identically to the BFS algorithm
in Algorithm 1. After computing the new BFS frontier, F , the algorithm maps over the vertices
in it using a vertexMap (Line 28), and applies the AggregatePathContributions procedure
for each vertex. This procedure (Lines 9–11) performs a reduction over all in-neighbors of the
vertex to pull path-scores from vertices that are completed, i.e. Completed[v] = true (Line 11). The
algorithm then applies a second vertexMap over F to mark these vertices as completed (Line 29).
The frontier is then saved for use in the second phase (Line 30). At the end of the first phase we
reset the Status values (Line 32).
The second phase (Lines 33–37) processes the saved frontiers level by level in reverse order. It
first extracts a saved frontier (Line 37). It then applies a vertexMap over the frontier applying
the AggregateDependencies procedure for each vertex (Line 35). This procedure (Lines 14–16)
performs a reduction over all out-neighbors of the vertex to pull the inverted dependency scores
over completed neighbors. Finally, the algorithm applies a second vertexMap to mark the vertices
in it as completed (Line 36). After all frontiers have been processed, the algorithm finalizes the
dependency scores by first subtracting the inverted NumPaths value, and then multiplying by the
NumPaths value (Line 38).
O (k )-Spanner
Computing graph spanners is a fundamental problem in combinatorial graph algorithms and graph
theory [120]. A graph H is a k-spanner of a graph G if ∀u, v ∈ V connected by a path, distG (u, v) ≤
distH (u, v) ≤ k · distG (u, v) (equivalently, such a subgraph is called a spanner with stretch k). The
spanner problem studied in this paper is to compute an O (k ) spanner for a given k.
Sequentially, classic results give elegant constructions of (2k − 1)-spanners using O (n^{1+1/k} )
edges, which are essentially the best possible assuming the girth conjecture [149]. In this paper, we
implement the spanner algorithm recently proposed by Miller, Peng, Xu, and Vladu (MPXV) [110].
The construction results in an O (k )-spanner with expected size O (n^{1+1/k} ), and runs in O (m) work
and O (k log n) depth on the BF model.
The MPXV spanner algorithm (Algorithm 5) uses the low-diameter decomposition (LDD) al-
gorithm, which will be described in Section 6.2. It takes as input a parameter k which controls
the stretch of the spanner. The algorithm first computes an LDD with β = log n/(2k ) (Line 3). The
stretch of each LDD cluster is O (k ) whp, and so the algorithm includes all tree edges generated
by the LDD in the spanner (Line 4). The algorithm handles inter-cluster edges by taking a single
ALGORITHM 5: O (k )-Spanner
1: procedure Spanner(G (V , E), k)
2: β ← log n/(2k )
3: (Clusters, Parents) ← LDD(G (V , E), β ) ▷ see Algorithm 6
4: ELDD ← {(i, Parents[i]) | i ∈ [0, n) and Parents[i] ≠ ∞} ▷ tree edges used in the LDD
5: I ← one inter-cluster edge for each pair of adjacent clusters in Clusters
6: return ELDD ∪ I
inter-cluster edge between a boundary vertex and each neighboring cluster (Line 5). Our imple-
mentation uses a parallel hash table to select a single inter-cluster edge between two neighboring
clusters.
We note that this procedure is slightly different than the procedure in the MPXV paper, which
adds a single edge between every boundary vertex of a cluster and each adjacent cluster. Our
algorithm only adds a single edge between two clusters, while the MPXV algorithm may add
multiple parallel edges between two clusters. Their argument bounding the stretch to O (k ) for an
edge spanning two clusters is still valid for our modified algorithm, since the endpoints can be
first routed to the cluster centers, and then to the single edge that was selected between the two
clusters.
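One way to express this selection is with a table keyed by unordered cluster pairs; the sequential sketch below (names ours) keeps the first inter-cluster edge seen for each pair of adjacent clusters, which is sufficient for the stretch argument above. Our implementation performs the same selection with a parallel hash table.

    #include <algorithm>
    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    using VertexId = uint32_t;
    using Edge = std::pair<VertexId, VertexId>;

    std::vector<Edge> SelectInterClusterEdges(const std::vector<Edge>& edges,
                                              const std::vector<VertexId>& cluster) {
      std::map<Edge, Edge> chosen;  // (cluster pair) -> representative edge
      for (const Edge& e : edges) {
        VertexId cu = cluster[e.first], cv = cluster[e.second];
        if (cu == cv) continue;                           // skip intra-cluster edges
        Edge key = {std::min(cu, cv), std::max(cu, cv)};
        chosen.emplace(key, e);                           // keeps the first edge seen
      }
      std::vector<Edge> out;
      for (const auto& kv : chosen) out.push_back(kv.second);
      return out;
    }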
distribution by randomly permuting the vertices in parallel (Line 15) and dividing the vertices in
the permutation into O (log n/β ) many batches (Line 16). After partitioning the vertices, the LDD
algorithm performs a sequence of rounds, where in each round all vertices that are not already
clustered in the next batch are added as new cluster centers. Each cluster then tries to acquire
unclustered vertices adjacent to it (thus increasing its radius by 1). This procedure is sometimes
referred to as ball-growing in the literature [12, 35, 111].
The first step in the ball-growing loop extracts newClusters, which is a vertexSubset of vertices
in the i-th batch that are not yet clustered (Line 22). Next, the algorithm applies a vertexMap to
update the Clusters and Visited status of the new clusters (Line 23). The new clusters are then added
to the current LDD frontier using the addToSubset primitive (Line 24). On Line 25, the algorithm
uses edgeMap to traverse the out edges of the current frontier and non-deterministically acquire
unvisited neighboring vertices. The condition and map functions supplied to edgeMap are defined
similarly to the ones in BFS.
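The following sequential C++ sketch mirrors the round structure described above: vertices are randomly permuted, consumed in batches, and in each round unclustered vertices from the next batch become centers before every cluster frontier grows by one hop. The geometric batch schedule is a simplification of the start-time schedule controlled by β, and the adjacency-list representation is illustrative; only the overall shape of the parallel algorithm is intended.

#include <algorithm>
#include <cmath>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Sequential sketch of ball-growing LDD. Returns (Clusters, Parents), where Parents[v] is
// v's BFS parent inside its cluster (or -1 for cluster centers).
std::pair<std::vector<int>, std::vector<int>>
LDDSketch(const std::vector<std::vector<int>>& adj, double beta, std::mt19937& rng) {
  int n = adj.size();
  std::vector<int> order(n), Clusters(n, -1), Parents(n, -1);
  std::iota(order.begin(), order.end(), 0);
  std::shuffle(order.begin(), order.end(), rng);     // random permutation of the vertices

  std::vector<int> frontier;    // vertices clustered in the previous round
  size_t next = 0;              // next unconsumed position in the permutation
  double batch = 1.0;           // illustrative geometric batch schedule
  while (next < order.size() || !frontier.empty()) {
    // Unclustered vertices in the next batch become new cluster centers.
    size_t take = static_cast<size_t>(std::ceil(batch));
    for (size_t i = 0; i < take && next < order.size(); ++i, ++next) {
      int c = order[next];
      if (Clusters[c] == -1) { Clusters[c] = c; frontier.push_back(c); }
    }
    batch *= std::exp(beta);    // larger beta -> centers arrive faster -> smaller clusters

    // Each cluster acquires the unclustered neighbors of its current frontier.
    std::vector<int> next_frontier;
    for (int u : frontier)
      for (int v : adj[u])
        if (Clusters[v] == -1) {  // ties between competing clusters are broken arbitrarily
          Clusters[v] = Clusters[u];
          Parents[v] = u;
          next_frontier.push_back(v);
        }
    frontier.swap(next_frontier);
  }
  return {Clusters, Parents};
}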
We note that the pseudocode shown in Algorithm 6 returns both the LDD clustering, Clusters, as
well as a Parents array. The Parents array contains, for each vertex v that joins a different vertex's
cluster (Clusters[v] ≠ v), the parent of v in the BFS tree rooted at Clusters[v]. Specifically, for a vertex
d that is not in its own cluster, Parents[d] stores the vertex s that succeeds at the testAndSet in
Line 6. The Parents array is used by both the O(k)-spanner and spanning forest algorithms in this
paper to extract the tree edges used in the LDD.
Connectivity
The connectivity problem is to compute a connectivity labeling of an undirected graph, i.e., a map-
ping from each vertex to a label such that two vertices have the same label if and only if there is
a path between them in the graph. Connectivity can easily be solved sequentially in linear work
using breadth-first or depth-first search. Parallel algorithms for connectivity have a long history;
we refer readers to [140] for a review of the literature. Early work on parallel connectivity dis-
covered many natural algorithms which perform O (m log n) work and poly-logarithmic depth [13,
122, 127, 135]. A number of optimal parallel connectivity algorithms were discovered in subse-
quent years [45, 67, 75, 76, 121, 123, 140], but to the best of our knowledge the recent algorithm by
Shun et al. is the only linear-work polylogarithmic-depth parallel algorithm that is practical and
has been studied experimentally [140].
In this paper, we implement the connectivity algorithm from Shun et al. [140], which runs in
O(m) expected work and O(log^3 n) depth whp on the BF model. The implementation uses the
work-efficient algorithm for low-diameter decomposition (LDD) described above. One change we
made to the implementation from [140] was to separate the LDD and contraction steps from the
connectivity algorithm. Refactoring these sub-routines allowed us to express the main connectivity
algorithm in about 50 lines of code.
The connectivity algorithm from Shun et al. [140] (Algorithm 7) takes as input an undirected
graph G and a parameter 0 < β < 1. It first runs the LDD algorithm, Algorithm 6 (Line 2), which
decomposes the graph into clusters, each with diameter O(log n/β), and βm inter-cluster edges in
expectation. Next, it builds G' by contracting each cluster to a single vertex and adding inter-cluster
edges while removing duplicate edges, self-loops, and isolated vertices (Line 3). It then checks whether the
contracted graph is empty (Line 4); if so, the current clusters are the components, and it returns the
mapping from vertices to clusters (Line 5). Otherwise, it recurses on the contracted graph (Line 6)
and returns the connectivity labeling produced by assigning each vertex the label assigned to
its cluster in the recursive call (Lines 7 and 8).
ALGORITHM 7: Connectivity
1: procedure Connectivity(G(V, E), β)
2:    (L, P) ← LDD(G(V, E), β)    ▷ see Algorithm 6
3:    G'(V', E') ← contractGraph(G, L)
4:    if |E'| = 0 then
5:        return L
6:    L' ← Connectivity(G'(V', E'), β)
7:    L'' ← {v → L'[L[v]] | v ∈ V}    ▷ implemented as a vertexMap over V
8:    return L''
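A sequential C++ sketch of this recursive structure is shown below. The decomposition is passed in as a callable so the sketch stays self-contained; the contraction step here deduplicates inter-cluster edges with a std::set, whereas the GBBS implementation uses parallel hash tables and works on the compressed graph representation.

#include <algorithm>
#include <functional>
#include <set>
#include <utility>
#include <vector>

using Graph = std::vector<std::vector<int>>;                          // adjacency lists
using LDDFn = std::function<std::vector<int>(const Graph&, double)>;  // returns cluster labels in [0, n)

// Recursive connectivity by repeated LDD + contraction (sequential sketch of Algorithm 7).
std::vector<int> ConnectivitySketch(const Graph& G, double beta, const LDDFn& ldd) {
  int n = G.size();
  std::vector<int> L = ldd(G, beta);

  // Contract: rename clusters to a dense range and keep one copy of each inter-cluster edge.
  std::vector<int> rename(n, -1);
  int nprime = 0;
  for (int v = 0; v < n; ++v) if (rename[L[v]] == -1) rename[L[v]] = nprime++;
  std::set<std::pair<int, int>> edges;                 // deduplicated edges of the contracted graph
  for (int u = 0; u < n; ++u)
    for (int v : G[u]) {
      int cu = rename[L[u]], cv = rename[L[v]];
      if (cu != cv) edges.insert({std::min(cu, cv), std::max(cu, cv)});
    }

  if (edges.empty()) {                                 // base case: the clusters are the components
    for (int v = 0; v < n; ++v) L[v] = rename[L[v]];
    return L;
  }

  Graph Gp(nprime);                                    // the contracted graph G'
  for (auto [a, b] : edges) { Gp[a].push_back(b); Gp[b].push_back(a); }
  std::vector<int> Lp = ConnectivitySketch(Gp, beta, ldd);
  std::vector<int> out(n);
  for (int v = 0; v < n; ++v) out[v] = Lp[rename[L[v]]];  // map labels back to the original vertices
  return out;
}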
Spanning Forest
The spanning forest problem is to compute a subset of edges in the graph that represent a spanning
forest. Finding spanning forests in parallel has been studied largely in conjunction with connec-
tivity algorithms, since most parallel connectivity algorithms can naturally be modified to output
a spanning forest (see [140] for a review of the literature).
Our spanning forest algorithm (Algorithm 8) is based on the connectivity algorithm from Shun
et al. [140] described earlier. Our algorithm runs in O(m) expected work and
O(log^3 n) depth whp on the BF model. The main difference in the spanning forest algorithm com-
pared to the connectivity algorithm is that it includes all LDD edges at each level of the recursion
(Line 4). These LDD edges are extracted using the Parents array returned by the LDD algorithm
given in Algorithm 6. Recall that this array has size proportional to the number of vertices, with all
entries initialized to ∞. The LDD algorithm uses this array to store the BFS parent of each vertex
v that joins a different vertex's cluster (Clusters[v] ≠ v). The LDD edges are retrieved by checking
for each index i ∈ [0, n) whether Parents[i] ≠ ∞, and if so taking (i, Parents[i]) as an LDD edge.
Furthermore, observe that the LDD edges after the topmost level of recursion are taken from
a contracted graph, and need to be mapped back to some edge in the original graph realizing the
contracted edge. We decide which edges in G to add by maintaining a mapping from the edges in
the current graph at some level of recursion to the original edge set. Initially this mapping, M, is
an identity map (Line 12). To compute the mapping to pass to the recursive call, for each contracted
edge e' ∈ E' we select some edge e in the input graph G that resulted in e', and map e' to M(e) (Line 8).
In our implementation, we use a parallel hash table to select a single original edge per contracted edge.
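The sketch below illustrates how the mapping M can be carried through one level of contraction: each contracted edge remembers an arbitrary original edge that realizes it. The function name and the std::unordered_map are illustrative; GBBS uses a parallel hash table for this selection.

#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

// Given cluster labels L for the current graph and the mapping M from current edges to
// original edge IDs, build the contracted edge list E' and the mapping M' to pass to the
// recursive call: each contracted edge keeps one arbitrary original edge realizing it.
std::pair<std::vector<Edge>, std::vector<int>>
ContractWithEdgeMap(const std::vector<Edge>& E, const std::vector<int>& M,
                    const std::vector<int>& L) {
  std::unordered_map<uint64_t, int> chosen;            // (cluster pair) -> original edge ID
  for (size_t i = 0; i < E.size(); ++i) {
    int cu = L[E[i].first], cv = L[E[i].second];
    if (cu == cv) continue;                            // self-loop after contraction
    uint64_t key = (static_cast<uint64_t>(std::min(cu, cv)) << 32) |
                   static_cast<uint32_t>(std::max(cu, cv));
    chosen.emplace(key, M[i]);                         // the first original edge seen wins
  }
  std::vector<Edge> Ep;                                // contracted edges E'
  std::vector<int> Mp;                                 // M' maps E' back to original edge IDs
  for (const auto& [key, orig] : chosen) {
    Ep.push_back({static_cast<int>(key >> 32), static_cast<int>(key & 0xffffffffu)});
    Mp.push_back(orig);
  }
  return {Ep, Mp};
}

At the topmost level M is the identity map over edge IDs, so each contracted edge initially maps directly to itself.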
Biconnectivity
A biconnected component of an undirected graph is a maximal subgraph such that the subgraph
remains connected under the deletion of any single vertex. Two closely related definitions are
articulation points and bridge. An articulation point is a vertex whose deletion increases the
number of connected components, and a bridge is an edge whose deletion increases the number
of connected components. Note that by definition an articulation point must have degree greater
than one. The biconnectivity problem is to emit a mapping that maps each edge to the label of
its biconnected component.
Sequentially, biconnectivity can be solved using the Hopcroft-Tarjan algorithm [80]. The al-
gorithm uses depth-first search (DFS) to identify articulation points and requires O (m + n) work
to label all edges with their biconnectivity label. It is possible to parallelize the sequential algo-
rithm using a parallel DFS; however, the fastest parallel DFS algorithm is not work-efficient [3].
Tarjan and Vishkin present the first work-efficient algorithm for biconnectivity [148] (as stated
in the paper the algorithm is not work-efficient, but it can be made so by using a work-efficient
connectivity algorithm). The same paper also introduces the Euler-tour technique, which can be
used to compute subtree functions on rooted trees in parallel in O(n) work and O(log^2 n) depth
on the BF model. Another approach relies on the fact that biconnected graphs admit open ear
decompositions to solve biconnectivity efficiently [103, 126].
In this paper, we implement the Tarjan-Vishkin algorithm for biconnectivity in O(m) expected
work and O(max(diam(G) log n, log^3 n)) depth on the FA-BF model. Our implementation first com-
putes connectivity labels using our connectivity algorithm, which runs in O(m) expected work and
O(log^3 n) depth whp, and picks an arbitrary source vertex from each component. Next, we compute
a spanning forest rooted at these sources using breadth-first search, which runs in O(m) work and
O(diam(G) log n) depth. We compute the subtree functions Low, High, and Size for each vertex by
running leaffix and rootfix sums on the spanning forests produced by BFS with fetchAndAdd,
which requires O(n) work and O(diam(G) log n) depth. Finally, we compute an implicit represen-
tation of the biconnectivity labels for each edge, using an idea from [23]. This step computes per-
vertex labels by removing all critical edges and computing connectivity on the remaining graph.
The resulting vertex labels can be used to assign biconnectivity labels to edges by giving tree edges
the connectivity label of the vertex further from the root in the tree, and assigning non-tree edges
the label of either endpoint. Summing the cost of each step, the total work of this algorithm is
O(m) in expectation and the total depth is O(max(diam(G) log n, log^3 n)) whp.

ALGORITHM 9: Biconnectivity
1: Parents[0, . . . , n)    ▷ the parent of each vertex in a rooted spanning forest
2: Preorder[0, . . . , n)    ▷ the preorder number of each vertex in a rooted spanning forest
3: Low[0, . . . , n)    ▷ minimum preorder number for a non-tree edge in a vertex's subtree
4: High[0, . . . , n)    ▷ maximum preorder number for a non-tree edge in a vertex's subtree
5: Size[0, . . . , n)    ▷ the size of a vertex's subtree
6: procedure IsArticulationPoint(u)
7:    pu ← Parents[u]
8:    return Preorder[pu] ≤ Low[u] and High[u] < Preorder[pu] + Size[pu]
9: procedure IsNonCriticalEdge(u, v)
10:    condv ← v = Parents[u] and IsArticulationPoint(v)
11:    condu ← u = Parents[v] and IsArticulationPoint(u)
12:    critical ← condu or condv    ▷ true if this edge is a bridge
13:    return !critical
14: procedure Biconnectivity(G(V, E))
15:    F ← SpanningForest(G)
16:    Parents ← root each tree in F at an arbitrary root
17:    Preorder ← compute a preorder numbering on each rooted tree in F
18:    For each v ∈ V, compute Low(v), High(v), and Size(v)    ▷ subtree functions defined in the text
19:    packGraph(G, IsNonCriticalEdge)    ▷ removes all critical edges from the graph
20:    Labels ← Connectivity(G)
21:    return (Labels, Parents)    ▷ sufficient to answer biconnectivity queries
Algorithm 9 shows the Tarjan-Vishkin biconnectivity algorithm. It first computes a spanning
forest of G and roots the trees in this forest arbitrarily (Lines 15 and 16). Next, the algorithm
computes a preorder numbering, Preorder, with respect to the roots (Line 17). It then computes
the subtree functions Low(v) and High(v) for each v ∈ V , which are the minimum and maxi-
mum preorder numbers respectively of all non-tree edges (u, w ) where u is a vertex in v’s subtree
(Line 18). It also computes Size(v), the size of each vertex’s subtree. Observe that one can determine
whether the parent of a vertex u, p_u, is an articulation point by checking Preorder[p_u] ≤ Low(u) and
High(u) < Preorder[p_u] + Size[p_u]. Following [23], we refer to this set of tree edges (u, p_u), where
pu is an articulation point, as critical edges (Line 9). The last step of the algorithm is to compute
a connectivity labeling of the graph with all critical edges removed. Our algorithm removes the
critical edges using the packGraph primitive (see Section 4).
Given this final connectivity labeling, the biconnectivity label of an edge (u, v) is the connec-
tivity label of the vertex that is further from the root of the tree. The query data structure can
thus report biconnectivity labels of edges in O (1) time using 2n words of memory; each vertex just
stores its connectivity label, and the vertex ID of its parent in the rooted forest (for an edge (u, v)
either one vertex is the parent of the other, which determines the vertex further from the root, or
neither is the parent of the other, which implies that both are the same distance from the root).
The same query structure can also report whether an edge is a bridge in O (1) time. We refer the
reader to [23] for more details. The low space usage of this query structure is important for our
implementations as storing a biconnectivity label per-edge explicitly would require a prohibitive
amount of memory for large graphs.
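The query structure can be summarized by the following small C++ sketch, which stores only the two per-vertex arrays described above; the struct and method names are illustrative.

#include <vector>

// Per-vertex data produced by the biconnectivity algorithm: the connectivity label of each
// vertex in the graph with critical edges removed, and its parent in the rooted spanning
// forest (Parents[r] == r for roots).
struct BiconnectivityQuery {
  std::vector<int> Labels;
  std::vector<int> Parents;

  // Biconnectivity label of an edge (u, v): the label of the endpoint further from the root.
  // For a tree edge one endpoint is the other's parent; a non-tree edge is never critical, so
  // its endpoints remain connected and share the same label, and either may be returned.
  int EdgeLabel(int u, int v) const {
    if (Parents[v] == u) return Labels[v];   // u is v's parent, so v is further from the root
    if (Parents[u] == v) return Labels[u];   // v is u's parent
    return Labels[u];                        // non-tree edge: both labels coincide
  }
};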
Lastly, we discuss some details about our implementation of the Tarjan-Vishkin algorithm, and
give the work and depth of our implementation. Note that the Preorder, Low, High, and Size arrays
can be computed either using the Euler tour technique, or by using leaffix and rootfix computations
on the trees. We use the latter approach in our implementation. The most costly step in the
algorithm is to compute spanning forest and connectivity on the original graph, and so the theo-
retical algorithm (using the Euler tour technique) runs in O(m) work in expectation and O(log^3 n)
depth whp. Our implementation runs in the same work but O(max(diam(G) log n, log^3 n)) depth
whp, as it computes a spanning tree using BFS and performs leaffix and rootfix computations on
this tree.
The edgelist-based Borůvka implementation (Lines 2–21) takes as input the number of vertices
and a prefix of the lowest weight edges currently in the graph. The forest is initially empty (Line 3).
The algorithm runs over a series of rounds. Within a round, the algorithm first initializes an array
P of (weight, index) pairs for all vertices (Line 5). Next, it loops in parallel over all edges in E and
performs priorityWrites to P based on the weight on both endpoints of the edge (Lines 8 and 9).
This step writes the weight and index-id of a minimum-weight edge incident to a vertex v into
P[v]. Next, for each vertex u that found an MSF edge incident to it, i.e., P[u] ≠ (∞, ∞) (Line 10),
the algorithm determines v, the neighbor of u along this MSF edge (Lines 11–12). If v also selected
(u, v, w) as its MSF edge, the algorithm deterministically sets the vertex with lower id to be the root
of the tree (Line 14) and the vertex with higher id to point to the lower one (Line 16). Otherwise, u joins
v's component (Line 16). Lastly, the algorithm performs several clean-up steps. First, it updates
the forest with all newly identified MSF edges (Line 17). Next, it performs pointer-jumping (see
Section 3) to compress trees created in Parents (Line 18). Note that the pointer-jumping step can be
work-efficiently implemented in O (log n) depth whp on the BF model [31]. Finally, it relabels the
edges array E based on the new ids in Parents (Line 19) and then filters E to remove any self-loops,
i.e., edges within the same component after this round (Line 20).
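The following sequential C++ sketch captures one such round over an edge list: a priority-write on (weight, index) pairs selects a minimum-weight edge per vertex, 2-cycles are broken by making the lower ID the root, and the round finishes with pointer jumping, relabeling, and filtering. Names and the sequential control flow are illustrative; the real implementation performs these steps with parallel primitives and the indirection described below.

#include <algorithm>
#include <limits>
#include <utility>
#include <vector>

struct WEdge { int u, v; double w; int id; };   // id = index of the edge in the original graph

// One Borůvka round (sequential sketch). Parents must initially be the identity map; msf
// collects the original IDs of the MSF edges found so far. Call repeatedly until E is empty.
void BoruvkaRound(std::vector<WEdge>& E, std::vector<int>& Parents, std::vector<int>& msf) {
  int n = Parents.size();
  const auto INF = std::make_pair(std::numeric_limits<double>::infinity(), -1);
  std::vector<std::pair<double, int>> P(n, INF);  // (weight, index into E) of the best incident edge

  // Emulate priorityWrite: record the minimum-weight incident edge at both endpoints.
  for (int i = 0; i < (int)E.size(); ++i) {
    std::pair<double, int> cand{E[i].w, i};       // ties broken by index, giving a total order
    P[E[i].u] = std::min(P[E[i].u], cand);
    P[E[i].v] = std::min(P[E[i].v], cand);
  }

  // Hook each vertex that found an MSF edge; 2-cycles are broken deterministically by lower ID.
  for (int u = 0; u < n; ++u) {
    if (P[u].second == -1) continue;
    const WEdge& e = E[P[u].second];
    int v = (e.u == u) ? e.v : e.u;
    if (P[v].second == P[u].second) {             // both endpoints selected this edge
      if (u < v) { Parents[u] = u; msf.push_back(e.id); }   // lower ID becomes the root
      else Parents[u] = v;
    } else {
      Parents[u] = v;                             // u joins v's component
      msf.push_back(e.id);
    }
  }

  // Pointer jumping: fully compress the trees created by the hooking step.
  for (int u = 0; u < n; ++u) {
    int r = u;
    while (Parents[r] != r) r = Parents[r];
    Parents[u] = r;
  }

  // Relabel edges by their new component IDs and filter out self-loops.
  std::vector<WEdge> next;
  for (const auto& e : E) {
    int a = Parents[e.u], b = Parents[e.v];
    if (a != b) next.push_back({a, b, e.w, e.id});
  }
  E.swap(next);
}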
We note that our implementation uses indirection by maintaining a set of active vertices and
using a set of integer edge-ids to represent E in the Borůvka procedure. Applying indirection over
the vertices helps in practice as the algorithm can allocate P (Line 5) to have size proportional to the
number of active vertices in each round, which may be much smaller than n. Applying indirection
over the edges allows the algorithm to perform a filter over just the ids of the edges, instead of
triples containing the two endpoints and the weight of each edge.
We point out that the filtering idea used in our main algorithm is similar to the theoretically-
efficient algorithm of Cole et al. [45], except that instead of randomly sampling edges, our filtering
procedure selects a linear number of the lowest weight edges. Each filtering step costs O (m) work
and O (log m) depth, but as we only perform a constant number of steps before processing the
rest of the remaining graph, the filtering steps do not affect the work and depth asymptotically. In
practice, most of the edges are removed after 3–4 filtering steps, and so the remaining edges can
be copied into an edgelist and solved in a single Borůvka step. We also note that as the edges are
initially represented in both directions, we can pack out the edges so that each undirected edge
is only inspected once (we noticed that earlier edgelist-based implementations stored undirected
edges in both directions).
more detail below. Finally, the algorithm computes all (u, l ) pairs in the intersection of InL and
OutL in parallel (Line 10). For each pair, the algorithm first marks the vertex as done (Line 11). It
then performs a priorityWrite to atomically try and update the label of the vertex to l (Line 12).
After the parallel loop on Line 10 finishes, the label for a vertex u that had some vertex in its SCC
appear as a center in this batch will be set to l = d + j, where j = arg min_j {B_i[j] ∈ u's SCC}, i.e.,
it is the unique label for the vertex with minimum rank in the permutation B contained in u's SCC.
The last step of the algorithm refines the subproblems in the graph by partitioning it, i.e., deleting
all edges which the algorithm identifies as not being in the same SCC. In our implementation, this
step is implemented using the packGraph primitive (Line 13), which considers every directed edge
in the graph and only preserves edges (u, v) where the number of centers reaching u and v in InL
are equal (respectively the number of centers reaching them in OutL). We note that the algorithm
described in Blelloch et al. [33] suggests that, to partition the graph, each reachability search can
cut any edge (u, v) where one endpoint is reachable in the search and the other is not (possibly
cutting some edges multiple times). The benefit of our approach is that we can
perform a single parallel scan over the edges in the graph and pack out a removed edge exactly
once. Our implementation runs in O (m log n) expected work and O (diam(G) log n) depth whp on
the PW-BF model.
One of the challenges in implementing this SCC algorithm is how to compute reachability in-
formation from multiple vertices (the centers) simultaneously. Our implementation explicitly ma-
terializes the forward and backward reachability sets for the set of centers that are active in the
current phase. The sets are represented as hash tables that store tuples of vertices and labels, (u, l ),
representing a vertex u in the same subproblem as the vertex c with label l that is visited by a
directed path from c. We explain how to make the hash table technique practical in Section 7.3.
The reachability sets are computed by running simultaneous breadth-first searches from all active
centers. In each round of the BFS, we apply edgeMap to traverse all out-edges (or in-edges) of
the current frontier. When we visit an edge (u, v) we try to add u’s center IDs to v. If u succeeds
in adding any IDs, it testAndSet’s a visited flag for v, and returns it in the next frontier if the
testAndSet succeeded. Each BFS requires at most O (diam(G)) rounds as each search adds the
same labels in each round as it would have had it run in isolation.
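A sequential C++ sketch of this multi-source reachability step is given below: labels spread one hop per round, and a vertex re-enters the frontier only if it acquired at least one new label. The per-vertex std::unordered_set stands in for the resizable parallel hash table of (vertex, label) pairs used in the real implementation.

#include <unordered_set>
#include <utility>
#include <vector>

// Compute, for every vertex, the set of center labels whose centers reach it along directed
// paths (sequential sketch). centers holds (center vertex, label) pairs.
std::vector<std::unordered_set<int>>
MultiReach(const std::vector<std::vector<int>>& out_adj,
           const std::vector<std::pair<int, int>>& centers) {
  int n = out_adj.size();
  std::vector<std::unordered_set<int>> reach(n);
  std::vector<int> frontier;
  for (auto [c, label] : centers)
    if (reach[c].insert(label).second) frontier.push_back(c);

  // Each round pushes every label on the frontier one hop further; a vertex joins the next
  // frontier if it gained at least one new label, so at most O(diam(G)) rounds are needed.
  while (!frontier.empty()) {
    std::vector<int> next;
    std::unordered_set<int> in_next;
    for (int u : frontier)
      for (int v : out_adj[u]) {
        if (v == u) continue;                        // self-loops add nothing
        bool gained = false;
        for (int label : reach[u])
          if (reach[v].insert(label).second) gained = true;
        if (gained && in_next.insert(v).second) next.push_back(v);
      }
    frontier.swap(next);
  }
  return reach;
}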
We also implement an optimized search for the first phase, which just runs two regular BFSs
over the in-edges and out-edges from a single pivot and stores the reachability information in bit-
vectors instead of hash-tables. It is well known that many directed real-world graphs have a single
massive strongly connected component, and so with reasonable probability the first vertex in the
permutation will find this giant component [43]. Our implementation also supports a trimming
optimization that is used by some papers in the literature [106, 144], which eliminates trivial SCCs
by removing any vertices that have zero in- or out-degree. We implement a procedure that recur-
sively trims until no zero in- or out-degree vertices remain, or until a maximum number of rounds
are reached, although in practice we found that a single trimming step is sufficient to remove the
majority of trivial vertices on our graph inputs.
Set to Flags succeeds). The algorithm also sets the Priority values of these vertices to 0 (Line 23),
which prevents them from being considered as potential roots in the remainder of the algorithm.
Next, the algorithm updates the number of finished vertices (Line 24). Finally, the algorithm com-
putes the next set of roots using a second edgeMap. The map function (Lines 9–12) decrements
the priority of all neighbors v visited over an edge (u, v) where u ∈ Covered and P[u] < P[v] using
a fetchAndAdd that returns true for a neighbor v if this edge decrements its priority to 0.
Maximal Matching
The maximal matching problem is to compute a subset of edges E' ⊆ E such that no two edges
in E' share an endpoint, and all edges in E \ E' share an endpoint with some edge in E'. Our
maximal matching implementation is based on the prefix-based algorithm from [32] that takes
O(m) expected work and O(log^2 m) depth whp on the PW-BF model (using the improved depth
shown in [64]). We had to make several modifications to run the algorithm on the large graphs
in our experiments. The original code from [32] uses an edgelist representation, but we cannot
directly use this implementation as uncompressing all edges would require a prohibitive amount
of memory for large graphs. Instead, as in our MSF implementation, we simulate the prefix-based
approach by performing a constant number of filtering steps. Each filter step packs out 3n/2 of
the highest priority edges, randomly permutes them, and then runs the edgelist based algorithm
on the prefix. After computing the new set of edges that are added to the matching, we filter the
remaining graph and remove all edges that are incident to matched vertices. In practice, just 3–4
filtering steps are sufficient to remove essentially all edges in the graph. The last step uncompresses
any remaining edges into an edgelist and runs the prefix-based algorithm. The filtering steps can
be done within the work and depth bounds of the original algorithm.
Our implementation of the prefix-based maximal matching algorithm from Blelloch et al. [32]
is shown in Algorithm 13. The algorithm first creates the array matched, sets all vertices to be
unmatched, and initializes the matching to empty (Line 11). The algorithm runs a constant number
of filtering rounds, as described above, where each round fetches some number of highest priority
edges that are still active (i.e., neither endpoint is incident to a matched edge). First, it calculates
the number of edges to extract (Line 15). It then extracts the highest priority edges using the
packGraph primitive. The function supplied to packGraph checks whether an edge e is one of
the highest priority edges, and if so, emits it in the output edgelist P and removes this edge from the
graph. Our implementation calculates edge priorities by hashing the edge pair. It determines whether
an edge is in the prefix by comparing the edge's priority with approximately the toExtract-th
smallest priority, computed using an approximate median.
Next, the algorithm applies the parallel greedy maximal matching algorithm (Lines 2–9) to the extracted prefix.
The parallel greedy algorithm first randomly permutes the edges in the prefix (Line 4). It then
repeatedly finds the set of edges that have the lowest rank in the prefix amongst all other edges
incident to either endpoint (Line 6), adds them to the matching (Line 7), and filters the edges based
on the newly matched edges (Line 8). The edges matched by the greedy algorithm are returned
to the MaximalMatching procedure (Line 9). We refer to [32, 64] for a detailed description of the
prefix-based algorithm that we implement, and a proof of the work and depth of the Parallel-
GreedyMM algorithm.
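The round structure of the greedy algorithm can be summarized by the sequential C++ sketch below, in which each edge carries a rank drawn from a random permutation and an edge is matched exactly when its rank is minimal among all remaining edges sharing an endpoint. This is only a sketch of the rounds on an (uncompressed) edge prefix; the function and field names are illustrative.

#include <algorithm>
#include <limits>
#include <numeric>
#include <random>
#include <vector>

struct MEdge { int u, v; };

// Rank-based greedy maximal matching on an edge prefix (sequential sketch of the rounds).
std::vector<MEdge> GreedyMM(std::vector<MEdge> E, int n, std::mt19937& rng) {
  std::vector<int> rank(E.size());
  std::iota(rank.begin(), rank.end(), 0);
  std::shuffle(rank.begin(), rank.end(), rng);        // a random priority per edge

  std::vector<MEdge> matching;
  std::vector<bool> matched(n, false);
  while (!E.empty()) {
    // An edge is added if its rank is the minimum over all edges sharing an endpoint.
    std::vector<int> best(n, std::numeric_limits<int>::max());
    for (size_t i = 0; i < E.size(); ++i) {
      best[E[i].u] = std::min(best[E[i].u], rank[i]);
      best[E[i].v] = std::min(best[E[i].v], rank[i]);
    }
    for (size_t i = 0; i < E.size(); ++i)
      if (best[E[i].u] == rank[i] && best[E[i].v] == rank[i]) {
        matching.push_back(E[i]);
        matched[E[i].u] = matched[E[i].v] = true;
      }
    // Filter out edges incident to newly matched vertices before the next round.
    std::vector<MEdge> nextE;
    std::vector<int> nextRank;
    for (size_t i = 0; i < E.size(); ++i)
      if (!matched[E[i].u] && !matched[E[i].v]) {
        nextE.push_back(E[i]);
        nextRank.push_back(rank[i]);
      }
    E.swap(nextE);
    rank.swap(nextRank);
  }
  return matching;
}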
The last steps within a round are to filter the remaining edges in the graph based on the newly
matched edges using the packGraph primitive (Line 20). The supplied predicate does not return
any edges in the output edgelist, and packs out any edge incident to the partial matching, W .
Lastly, the algorithm adds the newly matched edges to the matching (Line 21). We note that applying
a constant number of filtering rounds before executing ParallelGreedyMM does not affect the
work and depth bounds.
Graph Coloring
The graph coloring problem is to compute a mapping from each v ∈ V to a color such that for
each edge (u, v) ∈ E, C(u) ≠ C(v), using at most Δ + 1 colors. As graph coloring is NP-hard to
solve optimally, algorithms like greedy coloring, which guarantees a (Δ + 1)-coloring, are used
instead in practice, and often use far fewer than Δ + 1 colors on real-world graphs [77,
151]. Jones and Plassmann (JP) parallelize the greedy algorithm using linear work, but unfor-
tunately adversarial inputs exist for the heuristics they consider that may force the algorithm
to run in O (n) depth. Hasenplaugh et al. introduce several heuristics that produce high-quality
colorings in practice and also achieve provably low-depth regardless of the input graph. These
include LLF (largest-log-degree-first), which processes vertices ordered by the log of their de-
gree and SLL (smallest-log-degree-last), which processes vertices by removing all lowest log-
degree vertices from the graph, coloring the remaining graph, and finally coloring the removed
vertices. For LLF, they show that it runs in O(m + n) work and O(L log Δ + log n) depth, where
L = min{√m, Δ} + log^2 Δ log n/log log n in expectation.
In this paper, we implement a synchronous version of Jones-Plassmann using the LLF heuristic,
which runs in O (m + n) work and O (L log Δ + log n) depth on the FA-BF model. The algorithm is
implemented similarly to our rootset-based algorithm for MIS. In each round, after coloring the
roots we use a fetchAndAdd to decrement a count on our neighbors, and add the neighbor as a
root on the next round if the count is decremented to 0.
Algorithm 14 shows our synchronous implementation of the parallel LLF-Coloring algorithm
from [77]. The algorithm first computes priorities for each vertex in parallel using the countNghs
primitive (Line 14). This step computes the number of neighbors of a vertex that must run before
it by applying the countFn predicate (Line 13). This predicate returns true for a (u, v) edge
to a neighbor v if the log-degree of v is greater than that of u, or, if the log-degrees are equal, if v
has a lower rank than u in a permutation on the vertices (Line 1). Next, the algorithm computes
the vertexSubset Roots (Line 15) which consists of all vertices that have no neighbors that are still
uncolored that must be run before them based on countFn. Note that Roots is an independent set.
The algorithm then loops while some vertex remains uncolored. Within the loop, it first assigns
colors to the roots in parallel (Line 18) by setting each root to the first unused color in its neigh-
borhood (Lines 5–6). Finally, it updates the number of finished vertices by the number of roots
(Line 19) and computes the next rootset by applying edgeMap on the rootset with a map func-
tion that decrements the priority over all (u, v) edges incident to Roots where Priority[v] > 0. The
map function returns true only if the priority decrement decreases the priority of the neighboring
vertex to 0 (Line 8).
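The sketch below simulates this synchronous structure sequentially in C++: each vertex counts the neighbors that must be colored before it under the LLF order, root sets are colored with the first free color, and counters are decremented to discover the next root set. The priority definition follows the description above, but details such as the exact log-degree computation are illustrative.

#include <algorithm>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

// Synchronous Jones-Plassmann coloring with the LLF heuristic (sequential simulation).
std::vector<int> LLFColor(const std::vector<std::vector<int>>& adj, std::mt19937& rng) {
  int n = adj.size();
  std::vector<int> perm(n);
  std::iota(perm.begin(), perm.end(), 0);
  std::shuffle(perm.begin(), perm.end(), rng);         // tie-breaking rank per vertex

  auto logdeg = [&](int v) { return (int)std::floor(std::log2((double)adj[v].size() + 1)); };
  // runsBefore(u, v): must v be colored before u under the LLF order?
  auto runsBefore = [&](int u, int v) {
    return logdeg(v) > logdeg(u) || (logdeg(v) == logdeg(u) && perm[v] < perm[u]);
  };

  std::vector<int> priority(n, 0), color(n, -1), roots;
  for (int u = 0; u < n; ++u) {
    for (int v : adj[u]) if (runsBefore(u, v)) ++priority[u];
    if (priority[u] == 0) roots.push_back(u);          // no neighbor precedes u
  }

  int finished = 0;
  while (finished < n) {
    // Color every root with the first color unused among its (already colored) neighbors.
    for (int u : roots) {
      std::vector<bool> used(adj[u].size() + 1, false);
      for (int v : adj[u])
        if (color[v] >= 0 && color[v] < (int)used.size()) used[color[v]] = true;
      int c = 0;
      while (used[c]) ++c;
      color[u] = c;
    }
    finished += roots.size();
    // Decrement the counters of neighbors waiting on a root; a neighbor whose counter
    // reaches zero becomes a root in the next round.
    std::vector<int> next;
    for (int u : roots)
      for (int v : adj[u])
        if (color[v] < 0 && runsBefore(v, u) && --priority[v] == 0) next.push_back(v);
    roots.swap(next);
  }
  return color;
}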
the sum of the sizes of the sets (or the number of edges in the graph). There has been significant
work on finding work-efficient parallel algorithms that achieve an H_n-approximation [24, 36, 37,
91, 124].
Algorithm 15 shows pseudocode for the Blelloch et al. algorithm [36], which runs in O(m) work
and O(log^3 n) depth on the PW-BF model. Our presentation here is based on the bucketing-based
implementation from Julienne [52], with one significant change regarding how sets acquire ele-
ments, which we discuss below. The algorithm first buckets the sets based on their degree, placing
a set covering D elements into the log_{1+ϵ} D-th bucket (Line 24). It then processes the buckets in
decreasing order (Lines 26–38). In each round, the algorithm extracts the highest bucket (Sets)
(Line 26) and packs out the adjacency lists of vertices in this bucket to remove edges to neigh-
bors that were covered in prior rounds (Line 27). The output is an augmented vertexSubset, SetsD,
containing each set along with its new degree after packing out all dead edges. It then maps over
SetsD, updating the degree in D for each set with the new degree (Line 28). The algorithm then
filters SetsD to build a vertexSubset Active, which contains sets that have sufficiently high degree
to continue in this round (Line 29).
The next few steps of the algorithm implement one step of MaNIS (Maximal Nearly-Independent
Set) [36], to compute a set of sets from Active that have little overlap. First, the algorithm assigns
a random priority to each currently active set using a random permutation, storing the priorities
in the array π (Lines 30–31). Next, it applies edgeMap (Line 32) where the map function (Line 12)
uses a priority-write on each (s, e) edge to try and acquire an element e using the priority of the
visiting set, π [s]. It then computes the number of elements each set successfully acquired using
the srcCount primitive (Line 33) with the predicate WonElm (Line 10) that checks whether the
minimum value stored at an element is the unique priority for the set. The final MaNIS step maps
over the vertices and the number of elements they successfully acquired (Line 34) with the map
function WonEnough (Lines 13–16) which adds sets that covered enough elements to the cover.
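One MaNIS step can be summarized by the sequential C++ sketch below: active sets receive random priorities, every uncovered element records the best (smallest) priority among the sets visiting it, and a set joins the cover if it won a large enough fraction of its remaining elements. The threshold_fraction parameter is a stand-in for the bucket-dependent threshold used by the actual algorithm, and the data layout is illustrative.

#include <algorithm>
#include <limits>
#include <numeric>
#include <random>
#include <vector>

// One MaNIS round (sequential sketch). sets[s] lists the elements of set s; active holds
// the set IDs processed in this round; covered[] is updated for acquired elements.
std::vector<int> ManisRound(const std::vector<std::vector<int>>& sets,
                            const std::vector<int>& active,
                            std::vector<bool>& covered, int num_elements,
                            double threshold_fraction, std::mt19937& rng) {
  // Random priorities: a permutation of [0, |active|) indexed by position in active.
  std::vector<int> pi(active.size());
  std::iota(pi.begin(), pi.end(), 0);
  std::shuffle(pi.begin(), pi.end(), rng);

  // Each uncovered element remembers the highest-priority (smallest pi) visiting set.
  const int NONE = std::numeric_limits<int>::max();
  std::vector<int> winner(num_elements, NONE);
  for (size_t i = 0; i < active.size(); ++i)
    for (int e : sets[active[i]])
      if (!covered[e]) winner[e] = std::min(winner[e], pi[i]);

  // A set joins the cover if it won enough of its remaining elements; its winnings are covered.
  std::vector<int> added;
  for (size_t i = 0; i < active.size(); ++i) {
    int won = 0, remaining = 0;
    for (int e : sets[active[i]])
      if (!covered[e]) { ++remaining; if (winner[e] == pi[i]) ++won; }
    if (remaining > 0 && won >= threshold_fraction * remaining) {
      added.push_back(active[i]);
      for (int e : sets[active[i]])
        if (winner[e] == pi[i]) covered[e] = true;
    }
  }
  return added;
}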
The final step in a round is to rebucket all sets which were not added to the cover to be processed
in a subsequent round (Lines 36–37). The rebucketed sets are those in Sets that were not added to
the cover, and the new bucket they are assigned to is calculated by using the getBucket primitive
with the current bucket, b, and a new bucket calculated based on their updated degree (Line 6).
Our implementation of approximate set cover in this paper is based on the implementation from
Julienne [52], and we refer to this paper for more details about the bucketing-based implementa-
tion. The main change we made in this paper is to ensure that we correctly set random priorities
for active sets in each round of the algorithm. Both the implementation in Julienne as well as an
earlier implementation of the algorithm [37] use the original IDs of sets instead of picking random
priorities for all sets that are active on a given round. This approach can cause very few vertices
to be added in each round on meshes and other graphs with a large amount of symmetry. Instead,
in our implementation, for AS , the active sets on a round, we generate a random permutation of
[0, . . . , |AS | − 1] and write these values into a pre-allocated dense array with size proportional to
the number of sets (Lines 30—31). We give experimental details regarding this change in Section 8.
the graph, and taking the maximum density subgraph over all suffixes of the degeneracy order.5
The problem has also received attention in parallel models of computation [17, 18]. Bahmani et al.
give a (2 + ϵ)-approximation running in O(log_{1+ϵ} n) rounds of MapReduce [18]. Subsequently,
Bahmani et al. [17] showed that a (1 + ϵ)-approximation can be found in O(log n/ϵ^2) rounds of
MapReduce by using the multiplicative-weights approach on the dual of the natural LP for dens-
est subgraph. To the best of our knowledge, it is open whether the densest subgraph problem can
be exactly solved in NC.
5 We note that the 2-approximation can be work-efficiently solved in the same depth as our k-core algorithm by augmenting
the k-core algorithm to return the order in which vertices are peeled. Computing the maximum density subgraph over
suffixes of the degeneracy order can be done using a scan.
In this paper, we implement the elegant (2 + ϵ )-approximation algorithm of Bahmani et al. (Al-
gorithm 17). Our implementation of the algorithm runs in O(m + n) work and O(log_{1+ϵ} n log n)
depth. The algorithm starts with a candidate subgraph, S, consisting of all vertices, and an empty
approximate densest subgraph S_max (Lines 4–5). It also maintains an array D with the induced degree
of each vertex, which is initially just its degree in G (Line 3). The main loop iteratively
peels vertices with degree below the density threshold in the current candidate subgraph (Lines 6–
16). Specifically, it first finds all vertices with induced degree less than 2(1 + ϵ)D(S) (Line 7). Next,
it calls nghCount (see Section 4), which computes for each neighbor of R the number of incident
edges removed by deleting vertices in R from the graph, and updates the neighbor's degree in D
(Line 17). Finally, it removes the vertices in R from S (Line 14). If the density of the updated subgraph
S is greater than the density of S_max, the algorithm updates S_max to be S.
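A sequential C++ sketch of this peeling loop is shown below. It assumes a simple undirected graph and ϵ > 0, maintains the induced degrees and the edge count explicitly, and returns only the best density seen; the parallel implementation instead updates D with the histogram-based nghCount primitive.

#include <algorithm>
#include <vector>

// (2+eps)-approximate densest subgraph by iterative peeling (sequential sketch).
double ApproxDensestSubgraph(const std::vector<std::vector<int>>& adj, double eps) {
  int n = adj.size();
  long long m = 0, size = n;
  std::vector<bool> alive(n, true);
  std::vector<long long> D(n);                   // induced degree of each vertex still in S
  for (int v = 0; v < n; ++v) { D[v] = adj[v].size(); m += D[v]; }
  m /= 2;

  double best = 0.0;                             // density of the best subgraph seen so far
  while (size > 0 && m > 0) {                    // eps > 0 guarantees each round peels something
    double density = (double)m / (double)size;
    best = std::max(best, density);
    std::vector<int> R;                          // vertices below the 2(1+eps)*density threshold
    for (int v = 0; v < n; ++v)
      if (alive[v] && D[v] < 2.0 * (1.0 + eps) * density) R.push_back(v);
    for (int v : R) {                            // remove R; each vertex's edges are scanned once
      for (int u : adj[v]) if (alive[u]) --D[u];
      m -= D[v];
      alive[v] = false;
      --size;
    }
  }
  return best;
}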
Bahmani et al. show that this algorithm removes a constant factor of the vertices in each round,
but do not consider the work or total number of operations performed by their algorithm. We
briefly sketch how the algorithm can be implemented in O(m + n) work and O(log_{1+ϵ} n log n)
depth. Instead of computing the density of the current subgraph by scanning all edges, we maintain
it explicitly using an array, D (Line 3) which tracks the degrees of vertices still in S, and update D
as vertices are removed from S. Each round of the algorithm does work proportional to vertices in
S to compute R (Line 7) but since S decreases by a constant factor in each round the work of these
steps to obtain R is O (n) over all rounds. Updating D can be done by computing the number of
edges going between R and S which are removed, which only requires scanning edges incident to
vertices in R using nghCount (Line 13). Therefore, the edges incident to each vertex are scanned
exactly once (in the round when it is included in R), and so the algorithm performs O(m + n) work.
The depth is O(log_{1+ϵ} n log n) since there are O(log_{1+ϵ} n) rounds, each of which performs a filter
and an nghCount, which both run in O(log n) depth.
We note that an earlier implementation of our algorithm used the edgeMap primitive combined
with fetchAndAdd to decrement degrees of neighbors of R. We found that since a large number
of vertices are removed in each round, using fetchAndAdd can cause significant contention,
especially on graphs containing vertices with high degrees. Our implementation uses a work-
efficient histogram procedure to implement nghCount (see Section 7) which updates the degrees
while incurring very little contention.
Triangle Counting
The triangle counting problem is to compute the global count of the number of triangles in the
graph. Triangle counting has received significant recent attention due to its numerous applications
in Web and social network analysis. There have been dozens of papers on sequential triangle count-
ing (see e.g., [6, 83, 93, 117, 118, 129, 130], among many others). The fastest algorithms rely on ma-
trix multiplication and run in either O(n^ω) or O(m^{2ω/(1+ω)}) work, where ω is the best matrix mul-
tiplication exponent [6, 83]. The fastest algorithm that does not rely on matrix multiplication requires
O(m^{3/2}) work [93, 129, 130], which also turns out to be much more practical. Parallel algorithms
with O(m^{3/2}) work have been designed [1, 97, 142], with Shun and Tangwongsan [142] showing
an algorithm that requires O(log n) depth on the BF model.6 The implementation from [142] par-
allelizes Latapy’s compact-forward algorithm, which creates a directed graph DG where an edge
(u, v) ∈ E is kept in DG iff d (u) < d (v). Although triangle counting can be done directly on the
undirected graph in the same work and depth asymptotically, directing the edges helps reduce
work, and ensures that every triangle is counted exactly once.
6 The algorithm in [142] was described in the Parallel Cache Oblivious model, with a depth of O(log^{3/2} n).
In this paper we implement the triangle counting algorithm described in [142] (Algorithm 18).
The algorithm first uses the filterGraph primitive (Line 4) to direct the edges in the graph from
lower-degree to higher-degree, breaking ties lexicographically (Line 2). It then maps over all ver-
tices in the graph in parallel (Line 6), and for each vertex performs a sum-reduction over its out-
neighbors, where the value for each neighbor is the intersection size between the directed neigh-
borhoods N + (u) and N + (v) (Line 7).
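The counting scheme is illustrated by the sequential C++ sketch below, which directs each edge from its lower-degree endpoint to its higher-degree endpoint (breaking ties by ID) and sums sorted-list intersection sizes, so every triangle is counted exactly once. The adjacency-list representation is illustrative; the actual implementation works on the compressed parallel-byte format.

#include <algorithm>
#include <vector>

// Triangle counting on a degree-ordered directed graph (sequential sketch).
long long CountTriangles(const std::vector<std::vector<int>>& adj) {
  int n = adj.size();
  auto rank_less = [&](int a, int b) {
    return adj[a].size() != adj[b].size() ? adj[a].size() < adj[b].size() : a < b;
  };
  std::vector<std::vector<int>> out(n);          // DG: edges directed from lower to higher rank
  for (int u = 0; u < n; ++u)
    for (int v : adj[u])
      if (rank_less(u, v)) out[u].push_back(v);
  for (int u = 0; u < n; ++u) std::sort(out[u].begin(), out[u].end());

  long long count = 0;
  for (int u = 0; u < n; ++u)
    for (int v : out[u]) {
      // |N+(u) ∩ N+(v)| via a linear merge of the two sorted out-neighborhoods.
      auto a = out[u].begin(), b = out[v].begin();
      while (a != out[u].end() && b != out[v].end()) {
        if (*a == *b) { ++count; ++a; ++b; }
        else if (*a < *b) ++a;
        else ++b;
      }
    }
  return count;
}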
We note that we had to make several significant changes to the implementation in order to run
efficiently on large compressed graphs. First, we parallelized the creation of the directed graph; this
step creates a directed graph encoded in the parallel-byte format in O (m) work and O (log n) depth
using the filterGraph primitive. We also parallelized the merge-based intersection algorithm to
make it work in the parallel-byte format. We give more details on these techniques in Section 7.
The condFn function (Line 6) specifies that values should be aggregated for each vertex with non-
zero in-degree. The mapFn function pulls a PageRank contribution of P_curr[u]/d(u) for each in-
neighbor u in the frontier (Line 7). Finally, after the contributions to each neighbor have been
summed up, the applyFn function is called on a pair of a neighboring vertex v, and its contribution
(Lines 8–11). The apply step updates the next PageRank value for the vertex using the PageRank
equation above (Line 9) and updates the difference in PageRank values for this vertex in the diffs
vector (Line 10). The last steps in the loop apply a parallel reduction over the differences vector
to update the current error (Line 15) and finally swap the current and next PageRank vectors
(Line 16).
The main modification we made to the implementation from Ligra was to implement the dense
iterations of the algorithm using the reduction primitive nghReduce, which can be carried out
over the incoming neighbors of a vertex in parallel, without using a fetchAndAdd instruction.
Each iteration of our implementation requires O (m + n) work and O (log n) depth (note that the
bounds hold deterministically since in each iteration we can apply a dense, or pull-based imple-
mentation which performs a parallel reduction over the in-neighbors of each vertex). As the num-
ber of iterations required for PageRank to finish for a given ϵ depends on the structure of the input
graph, our benchmark measures the time for a single iteration of PageRank.
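A sequential C++ sketch of one dense (pull-based) iteration is given below, using the standard damped PageRank update with damping factor gamma; the exact constants and the convergence check in the benchmark may differ, and in the real implementation the inner sum is a parallel reduction over in-neighbors (nghReduce) rather than a sequential loop.

#include <cmath>
#include <vector>

// One pull-based PageRank iteration (sequential sketch). P_curr and P_next have length n;
// returns the L1 difference between the old and new vectors for the convergence check.
double PageRankIteration(const std::vector<std::vector<int>>& in_adj,
                         const std::vector<int>& out_degree,
                         std::vector<double>& P_curr, std::vector<double>& P_next,
                         double gamma = 0.85) {
  int n = in_adj.size();
  double diff = 0.0;
  for (int v = 0; v < n; ++v) {
    double contribution = 0.0;
    for (int u : in_adj[v])                      // pull P_curr[u]/d(u) from each in-neighbor
      contribution += P_curr[u] / out_degree[u];
    P_next[v] = (1.0 - gamma) / n + gamma * contribution;
    diff += std::fabs(P_next[v] - P_curr[v]);
  }
  P_curr.swap(P_next);                           // swap the current and next PageRank vectors
  return diff;
}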
but relatively small degeneracy, or largest non-empty core (labeled k_max in Table 3). For these
graphs, we observed that many early rounds, which process vertices with low coreness perform
a large number of fetchAndAdds on memory locations corresponding to high-degree vertices,
resulting in high contention [138]. To reduce contention, we designed a work-efficient histogram
implementation that can perform this step while only incurring O (log n) contention whp. The His-
togram primitive takes a sequence of (K, T) pairs and an associative and commutative operator
R : T × T → T, and computes a sequence of (K, T) pairs in which each key k appears only once and
its associated value t is the combination, with respect to R, of all values associated with the key k
in the input.
A useful example of histogram is, for a vertexSubset F, to compute for each v ∈ N(F)
the number of edges (u, v) with u ∈ F (i.e., the number of incoming neighbors from the frontier).
This operation can be implemented by running histogram on a sequence where each v ∈ N (F )
appears once per (u, v) edge as a tuple (v, 1) using the operator +. One theoretically efficient
implementation of histogram is to simply semisort the pairs using the work-efficient semisort
algorithm from [74]. The semisort places pairs from the sequence into a set of heavy and light
buckets, where heavy buckets contain a single key that appears many times in the input sequence,
and light buckets contain at most O(log^2 n) distinct keys, each of which appears at most
O(log n) times whp (heavy and light keys are determined by sampling). We compute the reduced
value for heavy buckets using a standard parallel reduction. For each light bucket, we allocate a
hash table, and hash the keys in the bucket in parallel to the table, combining multiple values for
the same key using R. As each key appears at most O (log n) times whp we incur at most O (log n)
contention whp. The output sequence can be computed by compacting the light tables and heavy
arrays.
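The semantics of the primitive, together with the frontier-counting example above, are captured by the short C++ sketch below; the single std::unordered_map stands in for the semisort-based parallel implementation and is not meant to reflect its cost bounds.

#include <unordered_map>
#include <utility>
#include <vector>

// Histogram semantics (sequential sketch): combine all values that share a key using a
// commutative, associative operator R.
template <class K, class T, class R>
std::vector<std::pair<K, T>> Histogram(const std::vector<std::pair<K, T>>& pairs, R reduce) {
  std::unordered_map<K, T> table;
  for (const auto& [k, t] : pairs) {
    auto it = table.find(k);
    if (it == table.end()) table.emplace(k, t);
    else it->second = reduce(it->second, t);     // combine values for a repeated key
  }
  return std::vector<std::pair<K, T>>(table.begin(), table.end());
}

// The example from the text: for each neighbor v of a frontier F, count the number of
// incoming frontier edges by histogramming (v, 1) tuples with +.
std::vector<std::pair<int, int>> CountIncomingFromFrontier(
    const std::vector<std::vector<int>>& adj, const std::vector<int>& frontier) {
  std::vector<std::pair<int, int>> tuples;
  for (int u : frontier)
    for (int v : adj[u]) tuples.push_back({v, 1});
  return Histogram<int, int>(tuples, [](int a, int b) { return a + b; });
}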
While the semisort implementation is theoretically efficient, it requires a likely cache miss for
each key when inserting into the appropriate hash table. To improve cache performance in this
step, we implemented a work-efficient algorithm with O(n^ϵ) depth based on radix sort. Our imple-
mentation is based on the parallel radix sort from PBBS [139]. As in the semisort, we first sample
keys from the sequence and determine the set of heavy keys. Instead of directly moving the ele-
ments into light and heavy buckets, we break up the input sequence into O(n^{1−ϵ}) blocks, each of
size O(n^ϵ), and sequentially sort the keys within a block into light and heavy buckets. Within the
blocks, we reduce all heavy keys into a single value and compute an array of size O(n^ϵ) which
holds the starting offset of each bucket within the block. Next, we perform a segmented scan [26]
over the arrays of the O(n^{1−ϵ}) blocks to compute the sizes of the light buckets, and the reduced
values for the heavy buckets, which only contain a single key. Finally, we allocate tables for the
light buckets, hash the light keys in parallel over the blocks, and compact the light tables and heavy
keys into the output array. Each step runs in O(n) work and O(n^ϵ) depth. Compared to the orig-
inal semisort implementation, this version incurs fewer cache misses because the light keys per
block are already sorted and consecutive keys likely go to the same hash table, which fits in cache.
We compared the running times of the histogram-based version of k-core and the fetchAndAdd-
based version of k-core, and saw between a 1.1–3.1x improvement from using the histogram.
7.2 edgeMapBlocked
One of the core primitives used by our algorithms is edgeMap (described in Section 3). The push-
based version of edgeMap, edgeMapSparse, takes a frontier U and iterates over all (u, v) edges
incident to it. It applies an update function on each edge that returns a boolean indicating whether
or not the neighbor should be included in the next frontier. The standard implementation of
edgeMapSparse first computes prefix sums of d(u) for u ∈ U to compute offsets, allocates an array
of size Σ_{u∈U} d(u), and iterates over all u ∈ U in parallel, writing the ID of the neighbor to the array
if the update function F returns true, and ⊥ otherwise. It then filters out the ⊥ values in the array
to produce the output vertexSubset.
In real-world graphs, |N(U)|, the number of unique neighbors incident to the current frontier,
is often much smaller than Σ_{u∈U} d(u). However, edgeMapSparse will always perform Σ_{u∈U} d(u)
writes and incur a proportional number of cache misses, despite the size of the output being at
most |N(U)|. More precisely, the size of the output is at most LN(U) ≤ |N(U)|, where LN(U) is
the number of live neighbors of U, where a live neighbor is a neighbor of the current frontier for
which F returns true. To reduce the number of cache misses we incur in the push-based traversal,
we implemented a new version of edgeMapSparse that performs at most LN (U ) writes that we
call edgeMapBlocked. The idea behind edgeMapBlocked is to logically break the edges incident
to the current frontier up into a set of blocks, and iterate over the blocks sequentially, packing
live neighbors compactly for each block. The output is obtained by applying a prefix-sum over the
number of live neighbors per-block, and compacting the block outputs into the output array.
We now describe a theoretically efficient implementation of edgeMapBlocked (Algorithm 20).
As in edgeMapSparse, we first compute an array of offsets O (Line 2) by prefix summing the
degrees of u ∈ U . We process the edges incident to this frontier in blocks of size bsize. As we cannot
afford to explicitly write out the edges incident to the current frontier to block them, we instead
logically assign the edges to blocks. Each block searches for a range of vertices to process with bsize
edges; the i-th block binary searches the offsets array to find the vertex incident to the start of the
(i · bsize)-th edge, storing the result into B[i] (Lines 4–5). The vertices that block i must process
are therefore between B[i] and B[i + 1]. We note that multiple blocks can be assigned to process
the edges incident to a high-degree vertex. Next, we allocate an intermediate array I of size dU
(Line 6), but do not initialize the memory, and an array A that stores the number of live neighbors
found by each block (Line 7). Next, we process the blocks in parallel by sequentially applying F to
each edge in the block and compactly writing any live neighbors to I [i · bsize] (Line 9), and write
the number of live neighbors to A[i] (Line 10). Finally, we do a prefix sum on A, which gives offsets
into an array of size proportional to the number of live neighbors, and copy the live neighbors in
parallel to R, the output array (Line 11).
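A sequential C++ sketch of this blocking scheme is shown below: block boundaries are found by binary searching the degree prefix sums, each block packs its live neighbors compactly into a scratch array, and a prefix sum over the per-block counts places the results in the output. In the real Algorithm 20 the per-block loop is a parallel-for and the scratch array is left uninitialized; the names here are illustrative.

#include <algorithm>
#include <functional>
#include <vector>

// edgeMapBlocked (sequential sketch). F(u, v) returns true if v belongs in the output frontier.
std::vector<int> EdgeMapBlocked(const std::vector<std::vector<int>>& adj,
                                const std::vector<int>& U,
                                const std::function<bool(int, int)>& F,
                                size_t bsize = 256) {
  // Offsets: prefix sums of the degrees of the frontier vertices.
  std::vector<size_t> O(U.size() + 1, 0);
  for (size_t i = 0; i < U.size(); ++i) O[i + 1] = O[i] + adj[U[i]].size();
  size_t dU = O.back();
  size_t num_blocks = (dU + bsize - 1) / bsize;

  std::vector<int> I(dU);                        // scratch: live neighbors packed per block
  std::vector<size_t> A(num_blocks + 1, 0);      // number of live neighbors found per block
  for (size_t b = 0; b < num_blocks; ++b) {      // this loop is a parallel-for in GBBS
    size_t start = b * bsize, end = std::min(start + bsize, dU);
    // Binary search the offsets to find the frontier vertex owning the block's first edge.
    size_t vi = std::upper_bound(O.begin(), O.end(), start) - O.begin() - 1;
    size_t out = start;
    for (size_t e = start; e < end; ++e) {
      while (O[vi + 1] <= e) ++vi;               // advance to the vertex owning edge e
      int u = U[vi];
      int v = adj[u][e - O[vi]];
      if (F(u, v)) I[out++] = v;                 // pack live neighbors compactly
    }
    A[b + 1] = out - start;
  }
  for (size_t b = 0; b < num_blocks; ++b) A[b + 1] += A[b];   // prefix sum of live counts

  std::vector<int> R(A[num_blocks]);
  for (size_t b = 0; b < num_blocks; ++b)        // copy each block's packed output into place
    std::copy(I.begin() + b * bsize, I.begin() + b * bsize + (A[b + 1] - A[b]),
              R.begin() + A[b]);
  return R;
}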
We found that this optimization helps the most in algorithms where there is a significant imbal-
ance between the size of the output of each edgeMap and Σ_{u∈U} d(u). For example, in weighted
BFS, relatively few of the edges actually relax a neighboring vertex, and so the size of the out-
put, which contains vertices that should be moved to a new bucket, is usually much smaller than
the total number of edges incident to the frontier. In this case, we observed as much as a 1.8x
improvement in running time by switching from edgeMapSparse to edgeMapBlocked.
vertex into blocks, where each block contains a fixed number of neighbors. Each block is differ-
ence encoded with respect to the source. As each block can have a different compressed size, it
also stores offsets that point to the start of each block. The format stores the blocks in a neighbor
list L in sorted order.
We now describe efficient implementations of primitives used by our algorithms. All descrip-
tions are given for neighbor lists coded in the parallel-byte format, and we assume for simplicity
that the block size (the number of neighbors stored in each block) is O (log n). The Map primitive
takes as input neighbor list L, and a map function F , and applies F to each ID in L. This primitive can
be implemented with a parallel-for loop across the blocks, where each iteration decodes its block
sequentially. Our implementation of map runs in O (|L|) work and O (log n) depth. Map-Reduce
takes as input a neighbor list L, a map function F : vtx → T and a binary associative function R
and returns the sum of the mapped elements with respect to R. We perform map-reduce similarly
by first mapping over the blocks, then sequentially reducing over the mapped values in each block.
We store the accumulated value on the stack, or in a heap-allocated array if the number of blocks
is large enough. Finally, we reduce the accumulated values using R to compute the output. Our
implementation of map-reduce runs in O (|L|) work and O (log n) depth.
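The block structure can be illustrated with the simplified C++ sketch below, in which the variable-length byte codes are replaced by plain integer deltas; what matters for the primitives described here is only that each block decodes independently of the others, starting from the source vertex.

#include <functional>
#include <vector>

// Simplified stand-in for a parallel-byte-format neighbor list: neighbors are split into
// blocks, each difference-encoded with respect to the source, so blocks decode independently.
struct BlockedList {
  int source;
  std::vector<size_t> block_offsets;   // start of each block in deltas
  std::vector<int> deltas;             // first entry of a block: neighbor - source;
                                       // later entries: difference from the previous neighbor
};

// Map primitive: apply F to every neighbor ID. Decoding is sequential within a block, but the
// outer loop over blocks is what the real implementation runs as a parallel-for.
void MapNeighbors(const BlockedList& L, const std::function<void(int)>& F) {
  for (size_t b = 0; b < L.block_offsets.size(); ++b) {
    size_t start = L.block_offsets[b];
    size_t end = (b + 1 < L.block_offsets.size()) ? L.block_offsets[b + 1] : L.deltas.size();
    int prev = L.source;
    for (size_t i = start; i < end; ++i) {
      int ngh = prev + L.deltas[i];    // decode the difference code
      F(ngh);
      prev = ngh;
    }
  }
}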
Filter takes as input a neighbor list L, a predicate P, and an array T into which the vertices
satisfying P are written, in the same order as in L. Our implementation of filter also takes as input
an array S, which is an array of size d(v) for lists L larger than a constant threshold, and
null otherwise. In the case where L is large, we implement the filter by first decoding L into S in
parallel; each block in L has an offset into S as every block except possibly the last block contains
the same number of vertex IDs. We then filter S into the output array T . In the case where L is
small we just run the filter sequentially. Our implementation of filter runs in O (|L|) work and
O (log n) depth. Pack takes as input a neighbor list L and a predicate P function, and packs L,
keeping only the vertex IDs that satisfy P. Our implementation of pack takes as input an array S,
which is an array of size 2 · d(v) for lists larger than a constant threshold, and null otherwise. In the
case where L is large, we first decode L in parallel into the first d (v) cells of S. Next, we filter these
vertices into the second d (v) cells of S, and compute the new length of L. Finally, we recompress
the blocks in parallel by first computing the compressed size of each new block. We prefix-sum
the sizes to calculate offsets into the array and finally compress the new blocks by writing each
block starting at its offset. When L is small we just pack L sequentially. We make use of the pack
and filter primitives in our implementations of maximal matching, minimum spanning forest, and
triangle counting. Our implementation of pack runs in O (|L|) work and O (log n) depth.
The Intersection primitive takes as input two neighbor lists L_a and L_b and computes the size
of the intersection of L_a and L_b (|L_a| ≤ |L_b|). We implement an algorithm similar to the optimal
parallel intersection algorithm for sorted lists. As the blocks are compressed, our implementation
works on the first element of each block, which can be quickly decoded. We refer to these elements
as block starts. If the numbers of blocks in both lists sum to less than a constant, we intersect the lists
sequentially. Otherwise, we take the start v_s of the middle block in L_a, and binary search over the
starts of L_b to find the first block whose start is less than or equal to v_s. Note that as the closest
value less than or equal to v_s could be in the middle of a block, the subproblems we generate must
consider elements in the two adjoining blocks of each list, which adds an extra constant factor of
work in the base case. Our implementation of intersection runs in O(|L_a| log(1 + |L_b|/|L_a|)) work
and O(log n) depth.
8 EXPERIMENTS
In this section, we describe our experimental results on a set of real-world graphs and also discuss
related experimental work. Tables 5 and 6 show the running times for our implementations on
our graph inputs. For compressed graphs, we use the compression schemes from Ligra+ [141],
which we extended to ensure theoretical efficiency (see Section 7.4). We describe statistics about
our input graphs and algorithms (e.g., number of colors used, number of SCCs, etc.) in Section A.
WebDataCommons dataset where nodes represent web pages [107]. 3D-Torus is a 3-dimensional
torus with 1B vertices and 6B edges. We mark symmetric (undirected) versions of the directed
graphs with the suffix -Sym. We create weighted graphs for evaluating weighted BFS, Borůvka,
widest path, and Bellman-Ford by selecting edge weights between [1, log n) uniformly at ran-
dom. We process LiveJournal, com-Orkut, Twitter, and 3D-Torus in the uncompressed format, and
ClueWeb, Hyperlink2014, and Hyperlink2012 in the compressed format.
Table 4 lists the size in gigabytes of the compressed graph inputs used in this paper both with and
without compression, and reports the savings obtained by using compression. Note that the largest
graph studied in this paper, the directed Hyperlink2012 graph, barely fits in the main memory of
our machine in the uncompressed format, but would leave hardly any memory to be used for an
algorithm analyzing this graph. Using compression significantly reduces the memory required to
represent each graph (between 2.21–2.85x, and 2.53x on average). We converted the graphs listed
in Table 4 directly from the WebGraph format to the compressed format used in this paper by
modifying a sequential iterator method from the WebGraph framework [39].
Table 5. Running Times (in seconds) of Our Algorithms Over Symmetric Graph Inputs on a 72-core
Machine (with Hyper-threading) Where (1) is the Single-thread Time, (72h) is the 72 Core Time Using
Hyper-threading, and (SU) is the Parallel Speedup (Single-thread Time Divided by 72-core Time)
algorithm based on bucketing. We observe that our spanner algorithm is only slightly more costly
than computing connectivity on the same input.
In an earlier paper [52], we compared the running time of our weighted BFS implementation
to two existing parallel shortest path implementations from the GAP benchmark suite [22] and
Galois [100], as well as a fast sequential shortest path algorithm from the DIMACS shortest path
challenge, showing that our implementation is between 1.07–1.1x slower than the Δ-stepping im-
plementation from GAP, and 1.6–3.4x faster than the Galois implementation. Our old version of
Bellman-Ford was between 1.2–3.9x slower than weighted BFS; we note that after changing it to
use the edgeMapBlocked optimization, it is now competitive with weighted BFS and is between
1.2x faster and 1.7x slower on our graphs, with the exception of 3D-Torus, where it is 7.3x
slower than weighted BFS, as it performs O(n^{4/3}) work on this graph.
Table 6. Running Times (in seconds) of Our Algorithms Over Symmetric Graph Inputs on a 72-core
Machine (with Hyper-threading) Where (1) is the Single-thread Time, (72h) is the 72 Core Time Using
Hyper-threading, and (SU) is the Parallel Speedup (Single-thread Time Divided by 72-core Time)
Our connectivity implementation does not assume that vertex IDs in the graph are randomly permuted, and always generates a random permutation,
even on the first round, as adding vertices based on their original IDs can result in poor
performance (for example, on 3D-Torus). There are several existing implementations of fast parallel
connectivity algorithms [119, 139, 140, 144]; however, only the implementation from [140], which
presents the connectivity algorithm that we implement in this paper, is theoretically-efficient. The
implementation from Shun et al. was compared to both the Multistep [144] and Patwary et al. [119]
implementations, and shown to be competitive on a broad set of graphs. We compared our connectivity
implementation to the work-efficient connectivity implementation from Shun et al. on
our uncompressed graphs and observed that our code is between 1.2–2.1x faster in parallel. Our
spanning forest implementation is slightly slower than connectivity due to having to maintain a
mapping between the current edge set and the original edge set.
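As a small illustration of the random relabeling mentioned above, the sketch below (our own, sequential; the actual implementation uses a parallel random permutation primitive) produces a uniformly random permutation of the vertex IDs.

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Sketch: produce a uniformly random permutation of vertex IDs, which can be
// used to randomize the order in which vertices are processed. std::shuffle
// is used here only to keep the example short.
std::vector<uint32_t> random_vertex_permutation(uint32_t n, uint64_t seed) {
  std::vector<uint32_t> perm(n);
  std::iota(perm.begin(), perm.end(), 0u);   // identity permutation 0..n-1
  std::mt19937_64 gen(seed);
  std::shuffle(perm.begin(), perm.end(), gen);
  return perm;
}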
Although our biconnectivity implementation has O(diam(G) log n) depth, it
achieves between a 20–59x speedup across all inputs, as the diameter of most of our graphs is extremely
low. Our biconnectivity implementation is about 3–5x slower than running connectivity
on the graph, which seems reasonable, as our current implementation performs two calls to connectivity
and one breadth-first search. There are several existing implementations of biconnectivity.
Cong and Bader [46] parallelize the Tarjan-Vishkin algorithm and demonstrate speedup over the
Hopcroft-Tarjan (HT) algorithm. Edwards and Vishkin [61] also implement the Tarjan-Vishkin algorithm
using the XMT platform, and show that their algorithm achieves good speedups. Slota and
Madduri [143] present a BFS-based biconnectivity implementation, which requires O(mn) work in
the worst case but behaves like a linear-work algorithm in practice. We ran the Slota and Madduri
implementation on 36 hyper-threads allocated from the same socket, the configuration on which
we observed the best performance for their code, and found that our implementation is between
1.4–2.1x faster than theirs. We used a DFS-ordered subgraph corresponding to the largest connected
component to test their code, which produced the fastest times. Using the original order of
the graph affects the running time of their implementation, causing it to run between 2–3x slower,
as the amount of work performed by their algorithm depends on the order in which vertices are
visited.
Our strongly connected components implementation achieves between a 13–43x speedup across
all inputs. Our implementation takes a parameter β, which is the base of the exponential rate at
which we grow the number of centers added. We set β between 1.1–2.0 for our experiments and
note that using a larger value of β can improve the running time on smaller graphs by up to a
factor of 2x. Our SCC implementation is between 1.6x faster and 4.8x slower than running connectivity
on the undirected version of the graph. There are several existing SCC implementations that
have been evaluated on real-world directed graphs [79, 106, 144]. The Hong et al. algorithm [79]
is a modified version of the FWBW-Trim algorithm from McLendon et al. [106], but neither algorithm
has any theoretical bounds on work or depth. Unfortunately, Hong et al. [79] do not report running
times, so we are unable to compare our performance with them. The Multistep algorithm [144]
has a worst-case running time of O(n^2), but the authors point out that the algorithm behaves
like a linear-time algorithm on real-world graphs. We ran our implementation on 16 cores configured
similarly to their experiments and found that we are about 1.7x slower on LiveJournal,
which easily fits in cache, and 1.2x faster on Twitter (scaled to account for a small difference in
graph sizes). While the Multistep algorithm is slightly faster on some graphs, our SCC implementation
has the advantage of being theoretically-efficient and performs a predictable amount of
work.
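The sketch below (our own illustration; the exact schedule used by the implementation may differ) shows how a geometrically growing center schedule with base β behaves: each round adds roughly β^round new centers, so a larger β reaches a given number of centers in fewer rounds.

#include <cmath>
#include <cstdint>
#include <iostream>

// Sketch: print how many centers a geometric schedule with base beta adds in
// each round until a target total is reached. Only illustrates the effect of
// beta; the implementation's exact schedule may differ.
int main() {
  double beta = 1.5;                        // example value in the range [1.1, 2.0]
  uint64_t total = 0;
  const uint64_t target = 1u << 20;         // e.g., stop after ~1M centers
  for (uint32_t round = 0; total < target; round++) {
    uint64_t add = static_cast<uint64_t>(std::ceil(std::pow(beta, round)));
    total += add;
    std::cout << "round " << round << ": add " << add
              << " centers (total " << total << ")\n";
  }
  return 0;
}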
Our minimum spanning forest implementation achieves between a 17–54x speedup over the implementation
running on a single thread across all of our inputs. Obtaining practical parallel algorithms
for MSF has been a longstanding goal in the field, and several implementations
exist [14, 47, 116, 139, 155]. We compared our implementation with the union-find-based MSF implementation
from PBBS [139] and the implementation of Borůvka from [155], which is one of the
fastest implementations we are aware of. Our MSF implementation is between 2.6–5.9x faster than
the MSF implementation from PBBS. Compared to the edgelist-based implementation of Borůvka
from [155], our implementation is between 1.2–2.9x faster.
Fig. 7. Log-linear plot of normalized throughput vs. vertices for MIS, BFS, BC, and coloring on the 3D-Torus
graph family.
our implementation is about 1.8x slower than the implementation in GraphIt for LiveJournal and
Twitter when run on the same number of threads as in their experiments, which is likely due to
a partitioning optimization used by GraphIt that eliminates a large amount of cross-socket traffic
and thus improves performance on multi-socket systems.
7 The graph size when the system achieves half of its peak-performance.
Table 7. Cycles Stalled While the Memory Subsystem has an Outstanding Load (Trillions), LLC
Hit Rate and Misses (Billions), Bandwidth in GB/s (Bytes Read and Written from Memory,
Divided by Running Time), and Running Time in Seconds
8.8 Locality
While our algorithms are efficient in the specific variants of the binary-forking model that we
consider, we do not analyze their cache complexity, and in general they may not be efficient in a
model that takes caches into account. Despite this fact, we observed that our algorithms have good
cache performance on the graphs that we tested. In this section, we give some explanation for this
behavior by showing that our primitives make good use of the caches. Our algorithms are also aided
by the fact that these graph datasets often come in highly local orders (e.g., see the Natural order
in [56]).
We ran a set of experiments to study the locality of a subset of our algorithms on the ClueWeb
graph. Table 7 shows locality metrics for our experiments, which we measured using the Open Performance
Counter Monitor (PCM). We found that using a work-efficient histogram is 3.5x faster
than using fetchAndAdd in our k-core implementation, as fetchAndAdd suffers from high contention on
this graph. Using a histogram reduces the number of cycles stalled due to memory by more than
7x. We also ran our wBFS implementation with and without the edgeMapBlocked optimization,
which reduces the number of cache lines read from and written to when performing a sparse
edgeMap. The blocked implementation reads and writes 2.1x fewer bytes than the unoptimized
version, which translates to a 1.7x faster running time. We note that we disabled the dense optimization
for this experiment in order to directly compare the two implementations of a sparse edgeMap.
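The following sketch (ours; the actual work-efficient histogram uses a parallel semisort rather than a sequential sort) contrasts the two update strategies discussed above: per-update atomic fetch-and-add, which can suffer from contention when many updates target the same key, versus aggregating updates first and writing each count once.

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <vector>

// Two ways to count how many updates target each key. The first issues one
// atomic fetch_add per update, which can be slow under contention. The second
// groups equal keys together (here by sorting; the work-efficient histogram
// uses a parallel semisort) and then performs one write per distinct key.
void count_with_fetch_add(const std::vector<uint32_t>& keys,
                          std::vector<std::atomic<uint64_t>>& counts) {
  for (uint32_t k : keys) counts[k].fetch_add(1, std::memory_order_relaxed);
}

void count_with_histogram(std::vector<uint32_t> keys,
                          std::vector<uint64_t>& counts) {
  std::sort(keys.begin(), keys.end());           // group equal keys together
  for (size_t i = 0; i < keys.size();) {
    size_t j = i;
    while (j < keys.size() && keys[j] == keys[i]) j++;
    counts[keys[i]] += j - i;                    // one write per distinct key
    i = j;
  }
}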
Table 8. System Configurations (Memory in Terabytes, Hyper-threads, and Nodes) and Running Times
(in seconds) of Existing Results on the Hyperlink Graphs
connectivity implementations are 1.1x and 62x faster, respectively, and our SSSP implementation
is 1.05x slower. Both FlashGraph and Mosaic compute weakly connected components, which is
equivalent to connectivity. The authors of GraFBoost [85] report disk-based running times for BFS and BC on the
Hyperlink2012 graph on a 32-core machine. They solve BFS in 900s and BC in 800s. Our BFS and
BC implementations are 53x and 22x faster than their implementations, respectively.
Slota et al. [146] report running times for the Hyperlink2012 graph on 256 nodes of the Blue
Waters supercomputer. Each node contains two 16-core processors with one thread each, for a total
of 8192 hyper-threads. They report that they can find the largest connected component and the largest SCC of
the graph in 63s and 108s, respectively. Our implementations find all connected components 2.5x
faster than their largest connected component implementation, and find all strongly connected
components 1.6x slower than their largest-SCC implementation. Their largest-SCC implementation
computes two BFSs from a randomly chosen vertex, one over the in-edges and the other over the
out-edges, and intersects the reachable sets. We perform the same operation as one of the first
steps of our SCC algorithm and note that it requires about 30 seconds on our machine. They solve
approximate k-cores in 363s, where the approximate k-core of a vertex is the coreness of the vertex
rounded up to the nearest power of 2. Our implementation computes the exact coreness of each
vertex in 184s, which is 1.9x faster than the approximate implementation, while using 113x fewer
cores.
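For concreteness, the small helper below (our own, hypothetical; not taken from either implementation) rounds a coreness value up to the nearest power of 2, which is the granularity reported by the approximate k-core computation of Slota et al.

#include <cstdint>

// Sketch: round a coreness value up to the nearest power of two, the
// granularity used by the approximate k-core numbers discussed above.
uint64_t round_up_to_power_of_two(uint64_t x) {
  uint64_t p = 1;
  while (p < x) p <<= 1;   // smallest power of two that is >= x
  return p;
}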
Recently, Dathathri et al. [51] have reported running times for the Hyperlink2012 graph using
Gluon, a distributed graph processing system based on Galois. They process this graph on a 256-node
system, where each node is equipped with 68 4-way hyper-threaded cores, and the hosts are
connected by an Intel Omni-Path network with 100Gbps peak bandwidth. They report times for
BFS, connectivity, PageRank, and SSSP. Other than their connectivity implementation, which uses
pointer-jumping, their implementations are based on data-driven asynchronous label-propagation.
We are not aware of any theoretical bounds on the work and depth of these implementations. Compared
to their reported times, our implementation of BFS is 22.7x faster, our implementation of
connectivity is 3x faster, and our implementation of SSSP is 9.8x faster. Our PageRank implementation
is 2.9x slower (we ran it with ϵ, the variable that controls the convergence rate of PageRank,
set to 1e-6). However, we note that the PageRank numbers they report are not for true PageRank,
but for PageRank-Delta, and are thus incomparable.
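The sketch below (a generic, sequential PageRank iteration written by us; the GBBS implementation differs in its data structures and parallelization) illustrates the role of the ϵ parameter as a stopping criterion on the L1 difference between successive iterates; the damping factor and dangling-vertex handling are our own simplifying choices.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Generic PageRank sketch with an epsilon-based stopping rule: iterate until
// the L1 difference between successive vectors drops below eps.
std::vector<double> pagerank(const std::vector<std::vector<uint32_t>>& out_nbrs,
                             double eps = 1e-6, double damping = 0.85) {
  const size_t n = out_nbrs.size();
  std::vector<double> p(n, 1.0 / n), next(n);
  double diff = 1.0;
  while (diff > eps) {
    std::fill(next.begin(), next.end(), (1.0 - damping) / n);
    for (size_t u = 0; u < n; u++) {
      if (out_nbrs[u].empty()) continue;          // ignore dangling mass here
      double share = damping * p[u] / out_nbrs[u].size();
      for (uint32_t v : out_nbrs[u]) next[v] += share;
    }
    diff = 0.0;
    for (size_t u = 0; u < n; u++) diff += std::fabs(next[u] - p[u]);
    p.swap(next);
  }
  return p;
}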
Stergiou et al. [147] describe a connectivity algorithm that runs in O(log n) rounds in the BSP
model and report running times for the Hyperlink2012-Sym graph. They implement their algorithm
using a proprietary in-memory/secondary-storage graph processing system used at Yahoo!,
and run experiments on a 1000-node cluster. Each node contains two 6-core processors that are
2-way hyper-threaded and has 128GB of RAM, for a total of 24000 hyper-threads and 128TB of RAM.
Their fastest running time on the Hyperlink2012 graph is 341s on their 1000-node system. Our
implementation solves connectivity on this graph in 25s, which is 13.6x faster on a system with 128x less
memory and 166x fewer cores. They also report running times for solving connectivity on a private
Yahoo! webgraph with 272 billion vertices and 5.9 trillion edges, over 26 times the size of
our largest graph. While such a graph currently seems to be out of reach of our machine, we are
hopeful that techniques from theoretically-efficient parallel algorithms can help solve problems
on graphs at this scale and beyond.
practicality of these algorithms using an efficient parallel batch-dynamic data structure for dy-
namic graphs, such as Aspen [54], and to include these problems as part of GBBS.
Another direction is to extend GBBS to important application domains of graph algorithms,
such as graph clustering. Although clustering is quite different from the problems studied in this
paper since there is usually no single “correct” way to cluster a graph or point set, we believe that
our approach will be useful for building theoretically-efficient and scalable single-machine clus-
tering algorithms, including density-based clustering [62], affinity clustering [20], and hierarchical
agglomerative clustering (HAC) on graphs [102]. The recent work of Tseng et al. [150] presents a
work-efficient parallel structural graph clustering algorithm which is incorporated into GBBS.
Lastly, it would be interesting to understand the portability of our implementations across different
architectures and computational settings. Recent work in this direction has found that the implementations
developed in this paper can be run efficiently in a setting where the graph is stored
in NVRAM and algorithms have access to a limited amount of DRAM [29, 58]. The experimental
results for their NVRAM system, called Sage, show that applying the implementations from
this paper in conjunction with an optimized edgeMap primitive designed for NVRAMs achieves
superior performance on an NVRAM-based machine compared to the state-of-the-art NVRAM implementations
of Gill et al. [69], providing promising evidence for the portability of our approach.
APPENDIX
A GRAPH STATISTICS
In this section, we list graph statistics computed for the graphs from Section 8. These statistics
include the number of connected components, strongly connected components, colors used by the
LLF and LF heuristics, the number of triangles, and several others. These numbers will be useful for
verifying the correctness or quality of our algorithms in relation to future algorithms that also run
on these graphs. Although some of these numbers were present in Table 3, we include them in the tables
below for completeness. We provide details about the statistics that are not self-explanatory.
• Effective Directed Diameter: the maximum number of levels traversed during a graph traversal
algorithm (BFS or SCC) on the unweighted directed graph.
• Effective Undirected Diameter: the maximum number of levels traversed during a graph traversal
algorithm (BFS) on the unweighted undirected graph.
• Size of Largest (Connected/Biconnected/Strongly-Connected) Component: The number of ver-
tices in the largest (connected/biconnected/strongly-connected) component. Note that in
the case of biconnectivity, we assign labels to edges, so a vertex participates in a compo-
nent for each distinct edge label incident to it.
• Num. Triangles: The number of closed triangles in G, where each triangle (u, v, w ) is counted
exactly once.
• Num. Colors Used by (LF/LLF): The number of colors used is just the maximum color ID
assigned to any vertex.
• (Maximal Independent Set/Maximal Matching/Approximate Set Cover) Size: We report the
sizes of these objects computed by our implementations. For MIS and maximum match-
ing we report this metric to lower-bound the size of the maximum independent set and
maximum matching supported by the graph. For approximate set cover, we run our code
on instances similar to those used in prior work (e.g., Blelloch et al. [37] and Dhulipala
et al. [52]) where the elements are vertices and the sets are the neighbors of each vertex
8 Similar statistics can be found on the SNAP website (https://snap.stanford.edu/data/) and the Laboratory for Web Algorithmics website.
Graph statistics for the LiveJournal graph:
Statistic Value
Num. Vertices 4,847,571
Num. Directed Edges 68,993,773
Num. Undirected Edges 85,702,474
Effective Directed Diameter 16
Effective Undirected Diameter 20
Num. Connected Components 1,876
Num. Biconnected Components 1,133,883
Num. Strongly Connected Components 971,232
Size of Largest Connected Component 4,843,953
Size of Largest Biconnected Component 3,665,291
Size of Largest Strongly Connected Component 3,828,682
Num. Triangles 285,730,264
Num. Colors Used by LF 323
Num. Colors Used by LLF 327
Maximal Independent Set Size 2,316,617
Maximal Matching Size 1,546,833
Set Cover Size 964,492
k max (Degeneracy) 372
ρ (Num. Peeling Rounds in k-core) 3,480
Graph statistics for the com-Orkut graph:
Statistic Value
Num. Vertices 3,072,627
Num. Directed Edges —
Num. Undirected Edges 234,370,166
Effective Directed Diameter —
Effective Undirected Diameter 9
Num. Connected Components 187
Num. Biconnected Components 68,117
Num. Strongly Connected Components —
Size of Largest Connected Component 3,072,441
Size of Largest Biconnected Component 3,003,914
Size of Largest Strongly Connected Component —
Num. Triangles 627,584,181
Num. Colors Used by LF 86
Num. Colors Used by LLF 98
Maximal Independent Set Size 651,901
Maximal Matching Size 1,325,427
Set Cover Size 105,572
k max (Degeneracy) 253
ρ (Num. Peeling Rounds in k-core) 5,667
As com-Orkut is an undirected graph, some of the statistics are not applicable
and we mark the corresponding values with –.
Graph statistics for the Twitter graph:
Statistic Value
Num. Vertices 41,652,231
Num. Directed Edges 1,468,365,182
Num. Undirected Edges 2,405,026,092
Effective Directed Diameter 65
Effective Undirected Diameter 23
Num. Connected Components 2
Num. Biconnected Components 1,936,001
Num. Strongly Connected Components 8,044,729
Size of Largest Connected Component 41,652,230
Size of Largest Biconnected Component 39,708,003
Size of Largest Strongly Connected Component 33,479,734
Num. Triangles 34,824,916,864
Num. Colors Used by LF 1,081
Num. Colors Used by LLF 1,074
Maximal Independent Set Size 26,564,540
Maximal Matching Size 9,612,260
Set Cover Size 1,736,761
k max (Degeneracy) 2,488
ρ (Num. Peeling Rounds in k-core) 14,963
Graph statistics for the ClueWeb graph:
Statistic Value
Num. Vertices 978,408,098
Num. Directed Edges 42,574,107,469
Num. Undirected Edges 74,774,358,622
Effective Directed Diameter 821
Effective Undirected Diameter 132
Num. Connected Components 23,794,336
Num. Biconnected Components 81,809,602
Num. Strongly Connected Components 135,223,661
Size of Largest Connected Component 950,577,812
Size of Largest Biconnected Component 846,117,956
Size of Largest Strongly Connected Component 774,373,029
Num. Triangles 1,995,295,290,765
Num. Colors Used by LF 4,245
Num. Colors Used by LLF 4,245
Maximal Independent Set Size 459,052,906
Maximal Matching Size 311,153,771
Set Cover Size 64,322,081
k max (Degeneracy) 4,244
ρ (Num. Peeling Rounds in k-core) 106,819
Graph statistics for the Hyperlink2014 graph:
Statistic Value
Num. Vertices 1,724,573,718
Num. Directed Edges 64,422,807,961
Num. Undirected Edges 124,141,874,032
Effective Directed Diameter 793
Effective Undirected Diameter 207
Num. Connected Components 129,441,050
Num. Biconnected Components 132,198,693
Num. Strongly Connected Components 1,290,550,195
Size of Largest Connected Component 1,574,786,584
Size of Largest Biconnected Component 1,435,626,698
Size of Largest Strongly Connected Component 320,754,363
Num. Triangles 4,587,563,913,535
Num. Colors Used by LF 4154
Num. Colors Used by LLF 4158
Maximal Independent Set Size 1,333,026,057
Maximal Matching Size 242,469,131
Set Cover Size 23,869,788
k max (Degeneracy) 4,160
ρ (Num. Peeling Rounds in k-core) 58,711
Graph statistics for the Hyperlink2012 graph:
Statistic Value
Num. Vertices 3,563,602,789
Num. Directed Edges 128,736,914,167
Num. Undirected Edges 225,840,663,232
Effective Directed Diameter 5275
Effective Undirected Diameter 331
Num. Connected Components 144,628,744
Num. Biconnected Components 298,663,966
Num. Strongly Connected Components 1,279,696,892
Size of Largest Connected Component 3,355,386,234
Size of Largest Biconnected Component 3,023,064,231
Size of Largest Strongly Connected Component 1,827,543,757
Num. Triangles 9,648,842,110,027
Num. Colors Used by LF 10,566
Num. Colors Used by LLF 10,566
Maximal Independent Set Size 1,799,823,993
Maximal Matching Size 2,434,644,438
Set Cover Size 372,668,619
k max (Degeneracy) 10,565
ρ (Num. Peeling Rounds in k-core) 130,728
in the undirected graph. In the case of the social network and hyperlink graphs, this opti-
mization problem naturally captures the minimum number of users or Web pages whose
neighborhoods must be retrieved to cover the entire graph.
• k max (Degeneracy): The value of k of the largest non-empty k-core (a small sequential sketch of computing this value follows this list).
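As referenced in the k max bullet above, the following sequential C++ sketch (ours; the parallel k-core implementation instead peels all vertices of the current minimum core value in each round, which is what ρ counts) computes the degeneracy by repeatedly removing a vertex of minimum remaining degree.

#include <algorithm>
#include <cstdint>
#include <vector>

// Sequential sketch of computing k_max (the degeneracy): repeatedly remove a
// vertex of minimum remaining degree; k_max is the largest degree observed at
// removal time. Quadratic time, for illustration only.
uint32_t degeneracy(const std::vector<std::vector<uint32_t>>& adj) {
  const uint32_t n = adj.size();
  std::vector<uint32_t> deg(n);
  for (uint32_t v = 0; v < n; v++) deg[v] = adj[v].size();
  std::vector<bool> removed(n, false);
  uint32_t k_max = 0;
  for (uint32_t step = 0; step < n; step++) {
    uint32_t best = n;                       // find an unremoved min-degree vertex
    for (uint32_t v = 0; v < n; v++)
      if (!removed[v] && (best == n || deg[v] < deg[best])) best = v;
    k_max = std::max(k_max, deg[best]);
    removed[best] = true;
    for (uint32_t u : adj[best])
      if (!removed[u]) deg[u]--;             // peel: decrement neighbors' degrees
  }
  return k_max;
}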
ACKNOWLEDGMENTS
Thanks to Jessica Shi and Tom Tseng for their work on GBBS and parts of this paper, and thanks to
the reviewers and Lin Ma for helpful comments. This research was supported in part by NSF grants
#CCF-1408940, #CCF-1533858, #CCF-1629444, and #CCF-1845763, DOE grant #DE-SC0018947, and
a Google Faculty Research Award.
REFERENCES
[1] Christopher R. Aberger, Andrew Lamb, Susan Tu, Andres Nötzli, Kunle Olukotun, and Christopher Ré. 2017. Emp-
tyHeaded: A relational engine for graph processing. ACM Trans. Database Syst. 42, 4 (2017), 20:1–20:44.
[2] Umut A. Acar, Daniel Anderson, Guy E. Blelloch, and Laxman Dhulipala. 2019. Parallel batch-dynamic graph con-
nectivity. In The 31st ACM on Symposium on Parallelism in Algorithms and Architectures, SPAA 2019, Phoenix, AZ,
USA, June 22–24, 2019. 381–392.
[3] Alok Aggarwal, Richard J. Anderson, and M.-Y. Kao. 1989. Parallel depth-first search in general directed graphs. In
ACM Symposium on Theory of Computing (STOC). 297–308.
[4] Masab Ahmad, Farrukh Hijaz, Qingchuan Shi, and Omer Khan. 2015. Crono: A benchmark suite for multithreaded
graph algorithms executing on futuristic multicores. In IEEE International Symposium on Workload Characterization,
IISWC. 44–55.
[5] Noga Alon, László Babai, and Alon Itai. 1986. A fast and simple randomized parallel algorithm for the maximal
independent set problem. J. Algorithms 7, 4 (1986), 567–583.
[6] N. Alon, R. Yuster, and U. Zwick. 1997. Finding and counting given length cycles. Algorithmica 17, 3 (1997), 209–223.
[7] Richard Anderson and Ernst W. Mayr. 1984. A P-complete Problem and Approximations to It. Technical Report.
[8] Alexandr Andoni, Clifford Stein, and Peilin Zhong. 2020. Parallel approximate undirected shortest paths via low hop
emulators. In ACM Symposium on Theory of Computing (STOC). ACM, 322–335.
[9] N. S. Arora, R. D. Blumofe, and C. G. Plaxton. 2001. Thread scheduling for multiprogrammed multiprocessors. Theory
of Computing Systems (TOCS) 34, 2 (01 Apr 2001).
[10] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. 2001. Thread scheduling for multiprogrammed multipro-
cessors. Theory of Computing Systems (TOCS) 34, 2 (2001), 115–144.
[11] Baruch Awerbuch. 1985. Complexity of network synchronization. J. ACM 32, 4 (1985), 804–823.
[12] Baruch Awerbuch, Bonnie Berger, Lenore Cowen, and David Peleg. 1992. Low-diameter graph decomposition is in
NC. In Scandinavian Workshop on Algorithm Theory. 83–93.
[13] Baruch Awerbuch and Y. Shiloach. 1983. New connectivity and MSF algorithms for ultracomputer and PRAM. In
International Conference on Parallel Processing (ICPP). 175–179.
[14] David A. Bader and Guojing Cong. 2006. Fast shared-memory algorithms for computing the minimum spanning
forest of sparse graphs. J. Parallel Distrib. Comput. 66, 11 (2006), 1366–1378.
[15] David A. Bader and Kamesh Madduri. 2005. Design and implementation of the HPCS graph analysis benchmark on
symmetric multiprocessors. In IEEE International Conference on High-Performance Computing (HiPC). 465–476.
[16] David A. Bader and Kamesh Madduri. 2006. Designing multithreaded algorithms for breadth-first search and st-
connectivity on the Cray MTA-2. In International Conference on Parallel Processing (ICPP). 523–530.
[17] Bahman Bahmani, Ashish Goel, and Kamesh Munagala. 2014. Efficient primal-dual graph algorithms for MapReduce.
In International Workshop on Algorithms and Models for the Web-Graph. 59–78.
[18] Bahman Bahmani, Ravi Kumar, and Sergei Vassilvitskii. 2012. Densest subgraph in streaming and MapReduce. Proc.
VLDB Endow. 5, 5 (2012), 454–465.
[19] Georg Baier, Ekkehard Köhler, and Martin Skutella. 2005. The k-splittable flow problem. Algorithmica 42, 3–4 (2005),
231–248.
[20] MohammadHossein Bateni, Soheil Behnezhad, Mahsa Derakhshan, MohammadTaghi Hajiaghayi, Raimondas
Kiveris, Silvio Lattanzi, and Vahab Mirrokni. 2017. Affinity clustering: Hierarchical clustering at scale. In Advances
in Neural Information Processing Systems. 6864–6874.
[21] Scott Beamer, Krste Asanović, and David Patterson. 2013. Direction-optimizing breadth-first search. Scientific Pro-
gramming 21, 3–4 (2013), 137–148.
[22] Scott Beamer, Krste Asanovic, and David A. Patterson. 2015. The GAP benchmark suite. CoRR abs/1508.03619 (2015).
[23] Naama Ben-David, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Yan Gu, Charles McGuffey, and Julian
Shun. 2018. Implicit decomposition for write-efficient connectivity algorithms. In IEEE International Parallel and
Distributed Processing Symposium (IPDPS). 711–722.
[24] Bonnie Berger, John Rompel, and Peter W. Shor. 1994. Efficient NC algorithms for set cover with applications to
learning and geometry. J. Computer and System Sciences 49, 3 (Dec. 1994), 454–477.
[25] Marcel Birn, Vitaly Osipov, Peter Sanders, Christian Schulz, and Nodari Sitchinava. 2013. Efficient parallel and
external matching. In European Conference on Parallel Processing (Euro-Par). 659–670.
[26] Guy E. Blelloch. 1993. Prefix sums and their applications. In Synthesis of Parallel Algorithms, John Reif (Ed.). Morgan
Kaufmann.
[27] Guy E. Blelloch, Daniel Anderson, and Laxman Dhulipala. 2020. ParlayLib - A toolkit for parallel algorithms on
shared-memory multicore machines. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA).
507–509.
[28] Guy E. Blelloch and Laxman Dhulipala. 2018. Introduction to Parallel Algorithms. http://www.cs.cmu.edu/realworld/
slidesS18/parallelChap.pdf. Carnegie Mellon University.
[29] Guy E. Blelloch, Laxman Dhulipala, Phillip B. Gibbons, Yan Gu, Charlie McGuffey, and Julian Shun. 2021. The read-
only semi-external model. In SIAM/ACM Symposium on Algorithmic Principles of Computer Systems (APOCS). 70–84.
[30] Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Julian Shun. 2012. Internally deterministic algorithms
can be fast. In ACM Symposium on Principles and Practice of Parallel Programming (PPoPP). 181–192.
[31] Guy E. Blelloch, Jeremy T. Fineman, Yan Gu, and Yihan Sun. 2020. Optimal parallel algorithms in the binary-forking
model. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 89–102.
[32] Guy E. Blelloch, Jeremy T. Fineman, and Julian Shun. 2012. Greedy sequential maximal independent set and matching
are parallel on average. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 308–317.
[33] Guy E. Blelloch, Yan Gu, Julian Shun, and Yihan Sun. 2016. Parallelism in randomized incremental algorithms. In
ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 467–478.
[34] Guy E. Blelloch, Yan Gu, and Yihan Sun. 2017. Efficient construction on probabilistic tree embeddings. In Intl. Colloq.
on Automata, Languages and Programming (ICALP). 26:1–26:14.
[35] Guy E. Blelloch, Anupam Gupta, Ioannis Koutis, Gary L Miller, Richard Peng, and Kanat Tangwongsan. 2014. Nearly-
linear work parallel SDD solvers, low-diameter decomposition, and low-stretch subgraphs. Theory of Computing
Systems 55, 3 (2014), 521–554.
[36] Guy E. Blelloch, Richard Peng, and Kanat Tangwongsan. 2011. Linear-work greedy parallel approximate set cover
and variants. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA).
[37] Guy E. Blelloch, Harsha Vardhan Simhadri, and Kanat Tangwongsan. 2012. Parallel and I/O efficient set covering
algorithms. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 82–90.
[38] Robert D. Blumofe and Charles E. Leiserson. 1999. Scheduling multithreaded computations by work stealing. J. ACM
46, 5 (1999), 720–748.
[39] Paolo Boldi and Sebastiano Vigna. 2004. The WebGraph framework I: Compression techniques. In International
World Wide Web Conference (WWW). 595–601.
[40] Otakar Borůvka. 1926. O jistém problému minimálním. Práce Mor. Přírodověd. Spol. v Brně III 3 (1926), 37–58.
[41] Ulrik Brandes. 2001. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology 25, 2 (2001),
163–177.
[42] Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. In International
World Wide Web Conference (WWW). 107–117.
[43] Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew
Tomkins, and Janet Wiener. 2000. Graph structure in the web. Computer Networks 33, 1–6 (2000), 309–320.
[44] Moses Charikar. 2000. Greedy approximation algorithms for finding dense components in a graph. In International
Workshop on Approximation Algorithms for Combinatorial Optimization. 84–95.
[45] Richard Cole, Philip N. Klein, and Robert E. Tarjan. 1996. Finding minimum spanning forests in logarithmic time
and linear work using random sampling. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA).
243–250.
[46] Guojing Cong and David A. Bader. 2005. An experimental study of parallel biconnected components algorithms
on symmetric multiprocessors (SMPs). In IEEE International Parallel and Distributed Processing Symposium (IPDPS).
9–18.
[47] Guojing Cong and Ilie Gabriel Tanase. 2016. Composable locality optimizations for accelerating parallel forest com-
putations. In IEEE International Conference on High Performance Computing and Communications (HPCC). 190–197.
[48] Don Coppersmith, Lisa Fleischer, Bruce Hendrickson, and Ali Pinar. 2003. A Divide-and-conquer Algorithm for Iden-
tifying Strongly Connected Components. Technical Report RC23744. IBM Research.
[49] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms (3
ed.). MIT Press.
[50] Naga Shailaja Dasari, Ranjan Desh, and Mohammad Zubair. 2014. ParK: An efficient algorithm for k -core decom-
position on multicore processors. In IEEE International Conference on Big Data (BigData). 9–16.
[51] Roshan Dathathri, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Alex Brooks, Nikoli Dryden, Marc Snir, and Keshav
Pingali. 2018. Gluon: A communication-optimizing substrate for distributed heterogeneous graph analytics. In ACM
SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 752–768.
[52] Laxman Dhulipala, Guy E. Blelloch, and Julian Shun. 2017. Julienne: A framework for parallel graph algorithms
using work-efficient bucketing. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 293–304.
[53] Laxman Dhulipala, Guy E. Blelloch, and Julian Shun. 2018. Theoretically efficient parallel graph algorithms can be
fast and scalable. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 293–304.
[54] Laxman Dhulipala, Guy E. Blelloch, and Julian Shun. 2019. Low-latency graph streaming using compressed purely-
functional trees. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 918–934.
[55] Laxman Dhulipala, David Durfee, Janardhan Kulkarni, Richard Peng, Saurabh Sawlani, and Xiaorui Sun. 2020. Paral-
lel batch-dynamic graphs: Algorithms and lower bounds. In ACM-SIAM Symposium on Discrete Algorithms (SODA).
1300–1319.
[56] Laxman Dhulipala, Igor Kabiljo, Brian Karrer, Giuseppe Ottaviano, Sergey Pupyrev, and Alon Shalita. 2016. Com-
pressing graphs and indexes with recursive graph bisection. In ACM International Conference on Knowledge Discovery
and Data Mining (KDD). 1535–1544.
[57] Laxman Dhulipala, Quanquan C. Liu, Julian Shun, and Shangdi Yu. 2021. Parallel batch-dynamic k -clique counting.
In SIAM/ACM Symposium on Algorithmic Principles of Computer Systems (APOCS). 129–143.
[58] Laxman Dhulipala, Charlie McGuffey, Hongbo Kang, Yan Gu, Guy E. Blelloch, Phillip B. Gibbons, and Julian Shun.
2020. Sage: Parallel semi-asymmetric graph algorithms for NVRAMs. Proc. VLDB Endow. 13, 9 (2020), 1598–1613.
[59] Laxman Dhulipala, Jessica Shi, Tom Tseng, Guy E. Blelloch, and Julian Shun. 2020. The graph based benchmark suite
(GBBS). In International Workshop on Graph Data Management Experiences and Systems (GRADES) and Network Data
Analytics (NDA). 11:1–11:8.
[60] Ran Duan, Kaifeng Lyu, and Yuanhang Xie. 2018. Single-source bottleneck path algorithm faster than sorting for
sparse graphs. In Intl. Colloq. on Automata, Languages and Programming (ICALP). 43:1–43:14.
[61] James A. Edwards and Uzi Vishkin. 2012. Better speedups using simpler parallel programming for graph connectivity
and biconnectivity. In International Workshop on Programming Models and Applications for Multicores and Manycores
(PMAM). 103–114.
[62] Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering
clusters in large spatial databases with noise. In ACM International Conference on Knowledge Discovery and Data
Mining (KDD). 226–231.
[63] Jeremy T. Fineman. 2018. Nearly work-efficient parallel algorithm for digraph reachability. In ACM Symposium on
Theory of Computing (STOC). 457–470.
[64] Manuela Fischer and Andreas Noever. 2018. Tight analysis of parallel randomized greedy MIS. In ACM-SIAM Sym-
posium on Discrete Algorithms (SODA). 2152–2160.
[65] Lisa K. Fleischer, Bruce Hendrickson, and Ali Pinar. 2000. On identifying strongly connected components in parallel.
In IEEE International Parallel and Distributed Processing Symposium (IPDPS). 505–511.
[66] Lester Randolph Ford and Delbert R. Fulkerson. 2009. Maximal flow through a network. In Classic Papers in Combi-
natorics. Springer, 243–248.
[67] Hillel Gazit. 1991. An optimal randomized parallel algorithm for finding connected components in a graph. SIAM J.
on Computing 20, 6 (Dec. 1991), 1046–1067.
[68] Hillel Gazit and Gary L. Miller. 1988. An improved parallel algorithm that computes the BFS numbering of a directed
graph. Inform. Process. Lett. 28, 2 (1988), 61–65.
[69] Gurbinder Gill, Roshan Dathathri, Loc Hoang, Ramesh Peri, and Keshav Pingali. 2020. Single machine graph ana-
lytics on massive datasets using intel optane DC persistent memory. Proc. VLDB Endow. 13, 8 (2020), 1304–13.
[70] A. V. Goldberg. 1984. Finding a Maximum Density Subgraph. Technical Report UCB/CSD-84-171. Berkeley, CA, USA.
[71] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed
graph-parallel computation on natural graphs. In USENIX Symposium on Operating Systems Design and Implemen-
tation (OSDI). 17–30.
[72] Oded Green, Luis M. Munguia, and David A. Bader. 2014. Load balanced clustering coefficients. In Workshop on
Parallel programming for Analytics Applications (PPAA). 3–10.
[73] Raymond Greenlaw, H. James Hoover, and Walter L. Ruzzo. 1995. Limits to Parallel Computation: P-completeness
Theory. Oxford University Press, Inc.
[74] Yan Gu, Julian Shun, Yihan Sun, and Guy E. Blelloch. 2015. A top-down parallel semisort. In ACM Symposium on
Parallelism in Algorithms and Architectures (SPAA). 24–34.
[75] Shay Halperin and Uri Zwick. 1996. An optimal randomized logarithmic time connectivity algorithm for the EREW
PRAM. J. Comput. Syst. Sci. 53, 3 (1996), 395–416.
[76] Shay Halperin and Uri Zwick. 2001. Optimal randomized EREW PRAM algorithms for finding spanning forests. J. Algorithms 39, 1 (2001), 1–46.
[77] William Hasenplaugh, Tim Kaler, Tao B. Schardl, and Charles E. Leiserson. 2014. Ordering heuristics for parallel
graph coloring. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 166–177.
[78] Loc Hoang, Matteo Pontecorvi, Roshan Dathathri, Gurbinder Gill, Bozhi You, Keshav Pingali, and Vijaya
Ramachandran. 2019. A round-efficient distributed betweenness centrality algorithm. In ACM Symposium on Prin-
ciples and Practice of Parallel Programming (PPoPP). 272–286.
[79] Sungpack Hong, Nicole C. Rodia, and Kunle Olukotun. 2013. On fast parallel detection of strongly connected compo-
nents (SCC) in small-world graphs. In International Conference for High Performance Computing, Networking, Storage
and Analysis (SC). 92:1–92:11.
[80] John Hopcroft and Robert Tarjan. 1973. Algorithm 447: Efficient algorithms for graph manipulation. Commun. ACM
16, 6 (1973), 372–378.
[81] Alexandru Iosup, Tim Hegeman, Wing Lung Ngai, Stijn Heldens, Arnau Prat-Pérez, Thomas Manhardto, Hassan
Chafio, Mihai Capotă, Narayanan Sundaram, Michael Anderson, Ilie Gabriel Tănase, Yinglong Xia, Lifeng Nai, and
Peter Boncz. 2016. LDBC graphalytics: A benchmark for large-scale graph analysis on parallel and distributed plat-
forms. Proc. VLDB Endow. 9, 13 (Sept. 2016), 1317–1328.
[82] Amos Israeli and Y. Shiloach. 1986. An improved parallel algorithm for maximal matching. Inform. Process. Lett. 22,
2 (1986), 57–60.
[83] Alon Itai and Michael Rodeh. 1977. Finding a minimum circuit in a graph. In ACM Symposium on Theory of Computing
(STOC). 1–10.
[84] J. Jaja. 1992. Introduction to Parallel Algorithms. Addison-Wesley Professional.
[85] Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu, et al. 2018. GraFBoost: Using accelerated flash storage for
external graph analytics. In ACM International Symposium on Computer Architecture (ISCA). 411–424.
[86] H. Kabir and K. Madduri. 2017. Parallel k -core decomposition on multicore platforms. In IEEE International Parallel
and Distributed Processing Symposium (IPDPS). 1482–1491.
[87] David R. Karger, Philip N. Klein, and Robert E. Tarjan. 1995. A randomized linear-time algorithm to find minimum
spanning trees. J. ACM 42, 2 (March 1995), 321–328.
[88] Richard M. Karp and Vijaya Ramachandran. 1990. Parallel algorithms for shared-memory machines. In Handbook of
Theoretical Computer Science (Vol. A), Jan van Leeuwen (Ed.). MIT Press, Cambridge, MA, USA, 869–941.
[89] Richard M. Karp and Avi Wigderson. 1984. A fast parallel algorithm for the maximal independent set problem. In
ACM Symposium on Theory of Computing (STOC). 266–272.
[90] Jinha Kim, Wook-Shin Han, Sangyeon Lee, Kyungyeol Park, and Hwanjo Yu. 2014. OPT: A new framework for
overlapped and parallel triangulation in large-scale graphs. In ACM International Conference on Management of
Data (SIGMOD). 637–648.
[91] Ravi Kumar, Benjamin Moseley, Sergei Vassilvitskii, and Andrea Vattani. 2015. Fast greedy algorithms in MapReduce
and streaming. ACM Trans. Parallel Comput. 2, 3 (2015), 14:1–14:22.
[92] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news
media? In International World Wide Web Conference (WWW). 591–600.
[93] Matthieu Latapy. 2008. Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor. Com-
put. Sci. 407, 1–3 (2008), 458–473.
[94] Charles E. Leiserson and Tao B. Schardl. 2010. A work-efficient parallel breadth-first search algorithm (or how
to cope with the nondeterminism of reducers). In ACM Symposium on Parallelism in Algorithms and Architectures
(SPAA). 303–314.
[95] Jason Li. 2020. Faster parallel algorithm for approximate shortest path. In ACM Symposium on Theory of Computing
(STOC). 308–321.
[96] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012.
Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endow. 5, 8
(April 2012).
[97] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. 2010.
GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence
(UAI). 340–349.
[98] Michael Luby. 1986. A simple parallel algorithm for the maximal independent set problem. SIAM J. Comput. 15, 4 (1986), 1036–1053.
[99] Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, and Taesoo Kim. 2017. Mosaic:
Processing a trillion-edge graph on a single machine. In European Conference on Computer Systems (EuroSys). 527–
543.
[100] Saeed Maleki, Donald Nguyen, Andrew Lenharth, María Garzarán, David Padua, and Keshav Pingali. 2016. DSMR:
A parallel algorithm for single-source shortest path problem. In Proceedings of the 2016 International Conference on
Supercomputing (ICS). 32:1–32:14.
[101] Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz
Czajkowski. 2010. Pregel: A system for large-scale graph processing. In ACM International Conference on Manage-
ment of Data (SIGMOD). 135–146.
[102] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval.
Cambridge university press.
[103] Yael Maon, Baruch Schieber, and Uzi Vishkin. 1986. Parallel ear decomposition search (EDS) and st-numbering in
graphs. Theoretical Computer Science 47 (1986), 277–298.
[104] David W. Matula and Leland L. Beck. 1983. Smallest-last ordering and clustering and graph coloring algorithms.
J. ACM 30, 3 (July 1983), 417–427.
[105] Robert Ryan McCune, Tim Weninger, and Greg Madey. 2015. Thinking like a vertex: A survey of vertex-centric
frameworks for large-scale distributed graph processing. ACM Comput. Surv. 48, 2, Article 25 (Oct. 2015), 39 pages.
[106] William McLendon III, Bruce Hendrickson, Steven J. Plimpton, and Lawrence Rauchwerger. 2005. Finding strongly connected components in distributed graphs. J. Parallel Distrib. Comput. 65, 8 (2005), 901–910.
[107] Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, and Christian Bizer. 2015. The graph structure in the web–
analyzed on different aggregation levels. The Journal of Web Science 1, 1 (2015), 33–47.
[108] Ulrich Meyer and Peter Sanders. 2000. Parallel shortest path for arbitrary graphs. In European Conference on Parallel
Processing (Euro-Par). 461–470.
[109] Ulrich Meyer and Peter Sanders. 2003. Δ-stepping: A parallelizable shortest path algorithm. J. Algorithms 49, 1 (2003),
114–152.
[110] Gary L. Miller, Richard Peng, Adrian Vladu, and Shen Chen Xu. 2015. Improved parallel algorithms for spanners
and hopsets. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 192–201.
[111] Gary L. Miller, Richard Peng, and Shen Chen Xu. 2013. Parallel graph decompositions using random shifts. In ACM
Symposium on Parallelism in Algorithms and Architectures (SPAA). 196–203.
[112] Gary L. Miller and Vijaya Ramachandran. 1992. A new graph triconnectivity algorithm and its parallelization. Com-
binatorica 12, 1 (1992), 53–76.
[113] Lifeng Nai, Yinglong Xia, Ilie G. Tanase, Hyesoon Kim, and Ching-Yung Lin. 2015. GraphBIG: Understanding graph
computing in the context of industrial solutions. In International Conference for High Performance Computing, Net-
working, Storage and Analysis (SC). 69:1–69:12.
[114] Mark E. J. Newman. 2003. The structure and function of complex networks. SIAM Rev. 45, 2 (2003), 167–256.
[115] Donald Nguyen, Andrew Lenharth, and Keshav Pingali. 2013. A lightweight infrastructure for graph analytics. In
ACM Symposium on Operating Systems Principles (SOSP). 456–471.
[116] Sadegh Nobari, Thanh-Tung Cao, Panagiotis Karras, and Stéphane Bressan. 2012. Scalable parallel minimum span-
ning forest computation. In ACM Symposium on Principles and Practice of Parallel Programming (PPoPP). 205–214.
[117] Mark Ortmann and Ulrik Brandes. 2014. Triangle listing algorithms: Back from the diversion. In Algorithm Engi-
neering and Experiments (ALENEX). 1–8.
[118] Rasmus Pagh and Francesco Silvestri. 2014. The input/output complexity of triangle enumeration. In ACM Sympo-
sium on Principles of Database Systems (PODS). 224–233.
[119] M. M. A. Patwary, P. Refsnes, and F. Manne. 2012. Multi-core spanning forest algorithms using the disjoint-set data
structure. In IEEE International Parallel and Distributed Processing Symposium (IPDPS). 827–835.
[120] David Peleg and Alejandro A Schäffer. 1989. Graph spanners. Journal of Graph Theory 13, 1 (1989), 99–116.
[121] Seth Pettie and Vijaya Ramachandran. 2002. A randomized time-work optimal parallel algorithm for finding a min-
imum spanning forest. SIAM J. on Computing 31, 6 (2002), 1879–1895.
[122] C. A. Phillips. 1989. Parallel graph contraction. In ACM Symposium on Parallelism in Algorithms and Architectures
(SPAA). 148–157.
[123] Chung Keung Poon and Vijaya Ramachandran. 1997. A randomized linear work EREW PRAM algorithm to find a
minimum spanning forest. In International Symposium on Algorithms and Computation (ISAAC). 212–222.
[124] Sridhar Rajagopalan and Vijay V. Vazirani. 1999. Primal-dual RNC approximation algorithms for set cover and cov-
ering integer programs. SIAM J. on Computing 28, 2 (Feb. 1999), 525–540.
[125] Vijaya Ramachandran. 1989. A framework for parallel graph algorithm design. In International Symposium on Opti-
mal Algorithms. 33–40.
[126] Vijaya Ramachandran. 1993. Parallel open ear decomposition with applications to graph biconnectivity and tricon-
nectivity. In Synthesis of Parallel Algorithms, John H Reif (Ed.). Morgan Kaufmann Publishers Inc.
[127] J. Reif. 1985. Optimal Parallel Algorithms for Integer Sorting and Graph Connectivity. Technical Report TR-08-85.
Harvard University.
[128] Ahmet Erdem Sariyuce, C. Seshadhri, and Ali Pinar. 2018. Parallel local algorithms for core, truss, and nucleus
decompositions. Proc. VLDB Endow. 12, 1 (2018), 43–56.
[129] T. Schank. 2007. Algorithmic Aspects of Triangle-Based Network Analysis. Ph.D. Dissertation. Universitat Karlsruhe.
[130] Thomas Schank and Dorothea Wagner. 2005. Finding, counting and listing all triangles in large graphs, an experi-
mental study. In Workshop on Experimental and Efficient Algorithms (WEA). 606–609.
[131] Warren Schudy. 2008. Finding strongly connected components in parallel using O(log^2 N) reachability queries. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 146–151.
[132] Stephen B. Seidman. 1983. Network structure and minimum degree. Soc. Networks 5, 3 (1983), 269–287.
[133] Martin Sevenich, Sungpack Hong, Adam Welc, and Hassan Chafi. 2014. Fast in-memory triangle listing for large
real-world graphs. In Workshop on Social Network Mining and Analysis. Article 2, 2:1–2:9 pages.
[134] Jessica Shi, Laxman Dhulipala, and Julian Shun. 2020. Parallel clique counting and peeling algorithms. arXiv preprint
arXiv:2002.10047 (2020).
[135] Yossi Shiloach and Uzi Vishkin. 1982. An O(log n) parallel connectivity algorithm. J. Algorithms 3, 1 (1982), 57–67.
[136] Julian Shun and Guy E. Blelloch. 2013. Ligra: A lightweight graph processing framework for shared memory. In
ACM Symposium on Principles and Practice of Parallel Programming (PPoPP). 135–146.
[137] Julian Shun and Guy E. Blelloch. 2014. Phase-concurrent hash tables for determinism. In ACM Symposium on Par-
allelism in Algorithms and Architectures (SPAA). 96–107.
[138] Julian Shun, Guy E. Blelloch, Jeremy T. Fineman, and Phillip B. Gibbons. 2013. Reducing contention through priority
updates. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 299–300.
[139] Julian Shun, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Aapo Kyrola, Harsha Vardhan Simhadri, and
Kanat Tangwongsan. 2012. Brief announcement: The problem based benchmark suite. In ACM Symposium on Par-
allelism in Algorithms and Architectures (SPAA).
[140] Julian Shun, Laxman Dhulipala, and Guy E. Blelloch. 2014. A simple and practical linear-work parallel algorithm for
connectivity. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 143–153.
[141] Julian Shun, Laxman Dhulipala, and Guy E. Blelloch. 2015. Smaller and faster: Parallel processing of compressed
graphs with Ligra+. In Data Compression Conference (DCC). 403–412.
[142] Julian Shun and Kanat Tangwongsan. 2015. Multicore triangle computations without tuning. In IEEE International
Conference on Data Engineering (ICDE). 149–160.
[143] George M. Slota and Kamesh Madduri. 2014. Simple parallel biconnectivity algorithms for multicore platforms. In
IEEE International Conference on High-Performance Computing (HiPC). 1–10.
[144] George M. Slota, Sivasankaran Rajamanickam, and Kamesh Madduri. 2014. BFS and coloring-based parallel algo-
rithms for strongly connected components and related problems. In IEEE International Parallel and Distributed Pro-
cessing Symposium (IPDPS). 550–559.
[145] George M. Slota, Sivasankaran Rajamanickam, and Kamesh Madduri. 2015. Supercomputing for Web Graph Analytics.
Technical Report SAND2015-3087C. Sandia National Lab.
[146] G. M. Slota, S. Rajamanickam, and K. Madduri. 2016. A case study of complex graph analysis in distributed memory:
Implementation and optimization. In IEEE International Parallel and Distributed Processing Symposium (IPDPS). 293–
302.
[147] Stergios Stergiou, Dipen Rughwani, and Kostas Tsioutsiouliklis. 2018. Shortcutting label propagation for distributed
connected components. In International Conference on Web Search and Data Mining (WSDM). 540–546.
[148] Robert E. Tarjan and Uzi Vishkin. 1985. An efficient parallel biconnectivity algorithm. SIAM J. on Computing 14, 4
(1985), 862–874.
[149] Mikkel Thorup and Uri Zwick. 2005. Approximate distance oracles. J. ACM 52, 1 (2005), 1–24.
[150] Tom Tseng, Laxman Dhulipala, and Julian Shun. 2021. Parallel index-based structural graph clustering and its ap-
proximation. To appear in ACM International Conference on Management of Data (SIGMOD) (2021).
[151] Dominic J. A. Welsh and Martin B. Powell. 1967. An upper bound for the chromatic number of a graph and its
application to timetabling problems. Comput. J. 10, 1 (1967), 85–86.
[152] Da Yan, Yingyi Bu, Yuanyuan Tian, and Amol Deshpande. 2017. Big graph analytics platforms. Foundations and
Trends in Databases 7, 1–2 (2017), 1–195.
[153] Yunming Zhang, Mengjiao Yang, Riyadh Baghdadi, Shoaib Kamil, Julian Shun, and Saman Amarasinghe. 2018. GraphIt: A high-performance graph DSL. Object-Oriented Programming Systems, Languages, and Applications (OOPSLA) (2018), 121:1–121:30.
[154] Da Zheng, Disa Mhembere, Randal Burns, Joshua Vogelstein, Carey E. Priebe, and Alexander S. Szalay. 2015. Flash-
Graph: Processing billion-node graphs on an array of commodity SSDs. In USENIX Conference on File and Storage
Technologies (FAST). 45–58.
[155] Wei Zhou. 2017. A Practical Scalable Shared-Memory Parallel Algorithm for Computing Minimum Spanning Trees.
Master’s thesis. KIT.