A Survey of Distributed Graph Algorithms On Massive Graphs
LINGKAI MENG, Antai College of Economics and Management, Shanghai Jiao Tong University, China
YU SHAO, East China Normal University, China
LONG YUAN∗ , Nanjing University of Science and Technology, China
Authors’ addresses: Lingkai Meng, Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai,
China, [email protected]; Yu Shao, East China Normal University, Shanghai, China, [email protected]; Long
Yuan, Nanjing University of Science and Technology, Nanjing, China, [email protected]; Longbin Lai, Alibaba
Group, Hangzhou, China, [email protected]; Peng Cheng, East China Normal University, Shanghai, China,
[email protected]; Xue Li, Alibaba Group, Hangzhou, China, [email protected]; Wenyuan Yu, Alibaba Group,
Hangzhou, China, [email protected]; Wenjie Zhang, University of New South Wales, Sydney, Australia,
[email protected]; Xuemin Lin, Antai College of Economics and Management, Shanghai Jiao Tong University,
Shanghai, China, [email protected]; Jingren Zhou, Alibaba Group, Hangzhou, China, [email protected].
1 INTRODUCTION
A graph is a high-dimensional structure that models point-to-point relationships among entities.
Due to its strong representation ability, the graph is widely applied in social network analysis [27],
road network routing [75], and biological structure forecasting [23]. With the development of
information science and big data applications [1, 54] in recent years, the scale of graph datasets
has grown too large for single machines, whose storage and computing power are limited. To
support queries and analyses on massive graphs, researchers have proposed many distributed graph
algorithms and systems that store a large-scale graph across multiple machines and compute
collaboratively, such as Pregel [118], Giraph [11], GraphX [77], and GraphScope [62].
Recent years have seen a surge in research on distributed graph algorithms, with a focus on
developing specific algorithms like PageRank, label propagation, and triangle counting, or address-
ing challenges such as workload scheduling and machine-to-machine communication. However,
surveys providing a comprehensive view of the field remain limited. This paper
aims to bridge that gap by consolidating research on distributed graph algorithms for massive
graphs over the past decade, as featured in prestigious conferences and journals like SIGMOD,
VLDB, PPoPP, SC, TPDS, and TC. We distill four primary and recurrently addressed challenges
among these papers:
• Parallelism is a major objective that requires processing multiple operations at the same
time and reducing iteration rounds. Achieving efficient parallelism is challenging due to the
inherent sequential dependencies in graph analysis tasks. These dependencies require certain
computations to be completed before others can begin, limiting the ability to execute multiple
operations simultaneously.
• Load balance aims to distribute work evenly across machines and improve the utilization of
computing resources. This helps prevent some machines from becoming overloaded while others
remain idle. Graphs often have skewed degree distributions, where a small number of vertices have
many connections while the majority have few. This imbalance can lead to uneven distribution
of workload across machines.
• Communication refers to the exchange of messages between vertices, which is an expensive
operation compared to random memory access. Distributed graph processing involves frequent
exchanges of data between machines. These communications can become a significant overhead,
especially in large-scale graphs, due to the complex and dense interconnections between vertices.
Optimizing communication overhead can improve execution efficiency in practice.
• Bandwidth limits the size of messages transmitted between vertices. Some algorithms require
a large amount of bandwidth, which may not be feasible in certain frameworks. The primary
reasons for the bandwidth limitations challenge in distributed graph algorithms are twofold: first,
high-degree vertices necessitate substantial bandwidth for communication; second, the frequent
transmission of small messages can result in inefficient bandwidth utilization.
To address the challenges, a lot of open-source distributed graph processing frameworks (e.g.,
Pregel [118] and GPS [137]) have been proposed. Generic solutions (e.g., parallel looping, message
receiving and sending, and broadcasting) are abstracted in these frameworks. Users have the
flexibility to develop graph algorithms following high-level functionalities, effectively abstracting
away the complexities of underlying implementation details. However, these solutions are highly
diverse and tailored for specific algorithms due to the irregularity of graph algorithms, and there is
no one unified pattern that can fit all graph algorithms.
In addition, various graph tasks are addressed by distributed graph algorithms in the existing
studies. To clearly introduce them, we classify the widely studied graph tasks, including PageRank,
Louvain, SimRank, max flow, subgraph matching, and vertex cover, into seven topics: Centrality,
Community Detection, Similarity, Cohesive Subgraph, Traversal, Pattern Matching, and Covering.
In this paper, we first introduce the generic solutions for the four challenges and then dissect
the proportion of research papers that address the challenges across different algorithmic topics.
Moreover, we delve into the reasons behind the varying degrees of attention that certain challenges
receive within specific topics. For example, 70% of the papers related to the Similarity topic (Figure 7c)
have focused on reducing communication overhead. Through this analysis, we offer deeper
views of the research on distributed graph algorithms and give insights into potentially
promising directions for future studies. A unique contribution of this work is the construction of a
comprehensive graph that encapsulates the surveyed material, as shown in Figure 1. This graph
maps out the intricate connections among papers, topics, algorithms, solutions, and challenges,
etc., providing a visual narrative of the landscape. Readers can play with the graph using an online
interactive tool (http://gsp.blue/querying?name=graph_algo), which includes several examples to
guide readers on how to use the tool.
Contribution. Existing surveys focus on either specific distributed challenges (e.g., load bal-
ance [92]) or particular distributed algorithms (e.g., pattern matching [24]). In contrast, our survey
targets the challenges faced by different distributed graph algorithms during processing, taking
irregular computation into consideration. Specifically, we make the following main contributions:
• We provide an overview of the main challenges in distributed graph algorithms and the different
solutions proposed to tackle them. This provides a comprehensive understanding of distributed
graph processing.
• We survey various distributed graph algorithms and categorize them into seven topics based on
the graph tasks they address.
• For each topic of distributed graph algorithms, we conduct a thorough analysis of existing
efforts. We also summarize the main challenges they address and provide unique insights into
the underlying reasons.
The outline of our survey is shown in Figure 2 and the rest of this paper is organized as fol-
lows. Section 2 provides a review of existing distributed computing and graph processing frameworks.
Section 3 summarizes the challenges and solutions that are not common for single-machine al-
gorithms. Section 4 describes popular distributed graph algorithms in detail, highlighting differences
from their single-machine versions. Section 5 presents various application scenarios for different
graph tasks. Section 6 discusses prevalent research trends and potential research opportunities.
Section 7 concludes this survey.
2 BACKGROUND
2.1 Graph and Topics of Graph Analytics
We first define the graph and other relevant concepts that are used throughout this paper.
Definition 2.1 (Graph). A graph 𝐺 comprises a set 𝑉 (𝐺) of 𝑛 vertices and a set 𝐸 (𝐺) of 𝑚 edges.
Vertices represent entities, and edges signify the relationships between entities. This paper primarily
focuses on simple graphs, i.e., graphs with at most one edge between any pair of vertices. Graphs
are categorized into directed and undirected graphs. In an undirected graph, an edge (𝑢, 𝑣) denotes
an unordered pair of vertices, implying that (𝑢, 𝑣) = (𝑣, 𝑢). Conversely, in a directed graph, an edge
(𝑢, 𝑣) represents an ordered pair of vertices, where (𝑢, 𝑣) ≠ (𝑣, 𝑢). Additionally, when each edge is
assigned a real-valued weight based on edge attributes, the graph is termed a weighted graph.
Definition 2.2 (Density). The density of a graph is defined as the ratio of the edge number $m$ to
the maximum possible edge number ($n^2$ in directed graphs and $n(n-1)/2$ in undirected graphs).
Generally, a graph is sparse if $m \ll O(n^2)$ and dense otherwise.
Definition 2.3 (Neighbors and Degree). In a directed graph 𝐺, for any given vertex 𝑢 ∈ 𝑉 (𝐺),
we define the in-neighbor and out-neighbor set of 𝑢 as 𝑁𝑖𝑛 (𝑢) = {𝑣 |(𝑣, 𝑢) ∈ 𝐸 (𝐺)} and 𝑁𝑜𝑢𝑡 (𝑢) =
{𝑣 |(𝑢, 𝑣) ∈ 𝐸 (𝐺)}. The in-degree and out-degree of vertex 𝑢 are the sizes of 𝑁𝑖𝑛 (𝑢) and 𝑁𝑜𝑢𝑡 (𝑢),
respectively. In the context of an undirected graph 𝐺, the neighbor set for a vertex 𝑢 ∈ 𝑉 (𝐺) is
defined as 𝑁 (𝑢) = {𝑣 |(𝑢, 𝑣) ∈ 𝐸 (𝐺)}, and its degree is the size of the neighbor set.
Definition 2.4 (Path). A path in a graph is a sequence of vertices where each pair of consecutive
vertices is connected by an edge. Specifically, a path 𝑃 can be defined as 𝑃 = (𝑣 1, 𝑣 2, . . . , 𝑣𝑘 ), where
each pair (𝑣𝑖 , 𝑣𝑖+1 ) ∈ 𝐸 (𝐺) for all 1 ≤ 𝑖 < 𝑘. The length of the path is determined by the number
of vertices it contains, which in this case is 𝑘. In a weighted graph, where edges carry weights,
the distance is defined as the minimum sum of the weights of all edges (𝑣𝑖 , 𝑣𝑖+1 ) along all paths.
Consequently, the shortest path length and distance from vertex 𝑢 to vertex 𝑣 are the minimum
length and the minimum sum of weights, respectively, among all possible paths connecting 𝑢 and 𝑣.
Definition 2.5 (Diameter). The diameter of a graph is defined as the longest shortest path length
between any pair of vertices. In other words, it represents the largest length that must be traversed
to connect any two vertices in the graph using the shortest path between them.
Definition 2.6 (Cycle). A cycle is a special case of a path that starts and ends at the same vertex,
i.e., 𝐶 = (𝑣 1, 𝑣 2, . . . , 𝑣𝑘 ) with 𝑣 1 = 𝑣𝑘 . The length of a cycle is the number of vertices it contains.
Definition 2.7 (Tree). A tree, commonly defined within the scope of an undirected graph, is a
connected graph that is free of cycles. A forest is a disjoint set of trees.
Definition 2.8 (Subgraph). A subgraph 𝐻 of a graph 𝐺 is another graph where 𝑉 (𝐻 ) ⊂ 𝑉 (𝐺),
𝐸 (𝐻 ) ⊂ 𝐸 (𝐺), and for each (𝑢, 𝑣) ∈ 𝐸 (𝐻 ), 𝑢, 𝑣 ∈ 𝑉 (𝐻 ).
Due to the diverse application scenarios of graph analytics, a broad range of topics within this
field are explored in the literature. In this paper, we concentrate on specific categories of graph
analytics that are frequently examined in the context of distributed graph algorithms. A similar
taxonomy can be found in other works [58, 83, 103]. These categories are: (1) Centrality: Focus
on determining the relative significance of each vertex within a complex graph. (2) Community
Detection: Aim at segregating the vertices of a graph into distinct communities, characterized
by denser connections within the same community compared to those with vertices outside the
community. (3) Similarity: Concern with assessing the likeness between two vertices in a graph,
based on defined criteria. (4) Cohesive Subgraph: Involve identifying subgraphs that meet specific
cohesion criteria. (5) Traversal: Entail visiting vertices in a graph following a certain procedure, to
review or modify the vertices’ attributes. (6) Pattern Matching: Search for specific input patterns
within a graph. (7) Covering: Address combinatorial optimization problems in graphs, often linked
to minimum vertex-covering challenges. These categories represent key areas in graph analytics,
each with distinct methodologies and applications in distributed graph algorithms, as will be
discussed in detail in Section 4.
2.2.1 Distributed Computing Libraries and Languages. To support efficient distributed algorithm
and system development, distributed computing libraries based on traditional sequential languages
are designed, such as MPI [65], Java Remote Method Invocation [126], PyCOMPSs [15], and Open-
SHMEM [40]. Among them, MPI is one of the earliest libraries designed for distributed computing.
Strictly, MPI is a library standard specification for message passing. It defines a model of distributed
computing where each process has its own local memory, and data is explicitly shared by passing
messages between processes. MPI includes point-to-point and collective communication routines,
as well as support for process groups, communication contexts, and application topologies [155].
MPI has now emerged as the de facto standard for message passing on computer clusters [78].
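As a minimal illustration of this message-passing model, the sketch below uses the mpi4py Python binding of MPI; the two-process setup and the payload are hypothetical, and real distributed graph codes typically exchange partitions of vertex state rather than a toy dictionary.

# Minimal mpi4py sketch of explicit point-to-point message passing.
# Launch with something like: mpiexec -n 2 python example.py
from mpi4py import MPI

comm = MPI.COMM_WORLD              # communicator containing all processes
rank = comm.Get_rank()             # id of this process

if rank == 0:
    partial_result = {"vertices_processed": 1000}    # hypothetical local payload
    comm.send(partial_result, dest=1, tag=11)        # explicit send to process 1
elif rank == 1:
    data = comm.recv(source=0, tag=11)               # blocking receive from process 0
    print("received:", data)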
Besides the distributed computing libraries, programming languages specifically for implement-
ing distributed applications are also created [16]. Unlike sequential programming languages, dis-
tributed programming languages aim to address problems such as concurrency, communication, and
synchronization in distributed computing from the language level to simplify distributed program-
ming. They generally provide built-in primitives for task distribution, communication, and fault-
tolerance. Representative languages include X10 [41], Haskell [159], JoCaml [66], and Erlang [8].
2.2.2 General-Purpose Distributed Processing Frameworks. Although distributed computing li-
braries and languages enable efficient distributed computing, application developers still put considerable effort
into the details of running a distributed program, such as the issues of parallelizing the computation,
distributing the data, and handling failures. To address this complexity, general-purpose distributed
processing frameworks are designed.
MapReduce [56] is a simple and powerful framework introduced by Google for scalable distributed
computing, simplifying the development of applications that process vast amounts of data. It
takes away the complexity of distributed programming by exposing two processing steps that
programmers need to implement: Map and Reduce. In the Map step, data is broken down into
key-value pairs and then distributed to different computers for processing, and in the Reduce step,
all computed results are combined and summarized based on key. MapReduce and its variants [166]
reduce the complexity of developing distributed applications. However, MapReduce reads and
writes intermediate data to disks, which incurs a significant performance penalty. To address the
deficiency of MapReduce, Spark [176] shares a similar programming model to MapReduce but
extends it with a data-sharing abstraction called resilient distributed dataset (RDD) [175], which
can be explicitly cached in memory and reused in multiple MapReduce-like parallel operations.
Due to the introduction of RDD, Spark achieves significant performance improvement compared
to MapReduce [177]. In addition to MapReduce and Spark, other general-purpose distributed
processing frameworks are also designed, such as Storm [158], Flink [32], and Ray [124]. [53, 136,
146] provide comprehensive surveys on general-purpose distributed processing frameworks.
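To make the Map and Reduce steps concrete, the following hedged sketch uses the PySpark RDD API to compute per-vertex out-degrees from an edge list; the input path and the whitespace-separated "src dst" edge format are assumptions for illustration.

# PySpark sketch: per-vertex out-degree with map and reduceByKey.
from pyspark import SparkContext

sc = SparkContext(appName="degree-count")                 # entry point to the Spark cluster
edges = sc.textFile("hdfs:///graphs/edges.txt")           # hypothetical edge list, one "src dst" per line
pairs = edges.map(lambda line: (line.split()[0], 1))      # Map: emit (source vertex, 1) for each edge
pairs = pairs.cache()                                     # RDD kept in memory for reuse, as Spark allows
degrees = pairs.reduceByKey(lambda a, b: a + b)           # Reduce: sum the counts per source vertex
print(degrees.take(5))                                    # inspect a few (vertex, out-degree) pairs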
2.2.3 Distributed Graph Processing Frameworks. General-purpose distributed processing frame-
works facilitate efficient distributed computing by isolating the applications from the lower-level
details. However, the general-purpose distributed processing frameworks are inefficient in process-
ing big graph data due to the skewed degree distributions of graph data and poor locality during
the computation [83]. As a consequence, distributed graph processing systems started to emerge
with Google’s Pregel [118] for better performance and scalability.
Pregel [118] is based on the Vertex-Centric programming model in which the graph computation
is divided into a series of synchronized iterations, each of which is called a superstep. During a single
superstep, each vertex executes a user-defined function in parallel. This function typically processes
data based on the vertex’s current state and the messages it received from other vertices [94].
Besides Pregel, representative Vertex-Centric graph processing frameworks include GraphLab [109],
PowerGraph [76], GPS [137], Giraph [81], and Pregel+ [171].
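The Vertex-Centric interface can be summarized as a user-defined compute function invoked for every active vertex in each superstep. The following pseudocode-style sketch mirrors Pregel's interface for a shortest-path-style update; names such as vertex.value, out_edges, send_message, and vote_to_halt are illustrative and do not correspond to the API of any specific system.

# Illustrative vertex program in the Pregel style (not a real framework API).
INF = float("inf")

def compute(vertex, messages, superstep):
    # Take the best distance proposed by incoming messages.
    new_dist = min(messages, default=INF)
    if superstep == 0 and vertex.is_source:
        new_dist = 0
    if new_dist < vertex.value:                  # local state improved in this superstep
        vertex.value = new_dist
        for edge in vertex.out_edges:            # push the update to neighbors
            vertex.send_message(edge.target, new_dist + edge.weight)
    vertex.vote_to_halt()                        # become inactive until a new message arrives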
Beyond the Vertex-Centric based framework, the Edge-Centric and Subgraph-Centric based
distributed graph processing frameworks have also been developed. The Edge-Centric model,
pioneered by X-Stream [134] and WolfGraph [181], focuses on edges rather than vertices, offer-
ing better performance for non-uniform graphs by streamlining edge interactions and avoiding
random memory access [83]. Despite its programming complexity, the Edge-Centric model boosts
performance for algorithms that require pre-sorted edge lists [134, 181]. The Subgraph-Centric
model concentrates on entire subgraphs. This model minimizes communication overhead, making
it ideal for tasks where frequent vertex interaction is crucial, such as in social network analysis.
Frameworks like Giraph++ [156], GoFFish [147], and Blogel [170] adopt this model. Despite these
frameworks’ advancements, distributed graph algorithms still face challenges like parallelism, load
balance, communication, and bandwidth due to the complexity and irregularity of graph data, as
discussed in Section 1.
Fig. 3. The framework of the distributed graph processing. The graph is divided into several partitions that
are assigned to different machines. Vertices in different partitions require a global communication to send
messages while vertices in the same partition can exchange messages by local communication. The volume of
one communication has a limitation called bandwidth, indicating the largest size of a single message.
Computation Resource Efficiency: Distributed systems are characterized by their vast and scal-
able computation resources, which enable the system to process massive volumes of graph data
and execute complex graph computations. Therefore, it is of great importance to fully exploit the
computation resource in the system when designing distributed graph algorithms. Different from
the centralized graph algorithms in which all the instructions are executed in a single machine,
distributed graph algorithms need the collaboration and cooperation of multiple machines to finish
a task, which brings the challenges of parallelism and load balance.
• Parallelism: Parallelism in distributed graph processing involves executing multiple computations
concurrently across various machines. This approach entails segmenting a larger graph analysis
task into smaller, manageable subtasks. These subtasks are then distributed among different
machines, enabling them to be executed simultaneously. Such a strategy not only facilitates
efficient resource utilization but also significantly reduces the overall computation time, thereby
enhancing the performance of graph processing tasks. However, graph analysis tasks often exhibit
inherent sequential dependencies [3, 88, 179], making the realization of parallelism in distributed
graph algorithms a complex endeavor. A profound understanding of the fundamental nature
of these tasks is essential to discern independent subtasks that can be parallelized effectively.
This necessitates a careful analysis to strike a balance between preserving the integrity of the
sequential dependencies and optimizing for parallel execution.
• Load Balance: Load balance in distributed graph processing ensures that the computational work-
load is evenly distributed across all machines. An imbalance in load can lead to inefficiencies: some
machines may complete their tasks quickly and remain underutilized, while others (often known
as stragglers) struggle with ongoing computations, ultimately delaying the overall process. This
imbalance is particularly problematic in distributed graph processing, where the irregular nature
of computations arises from non-uniform [50] degree distribution and topological asymmetry.
Albeit crucial, effectively resolving load imbalance is complex. It demands not only precise initial
workload quantification but also ongoing adjustments during runtime to address any imbalances.
Network Resource Efficiency: In distributed systems, where machines communicate via net-
works, efficiently using network resources becomes vital, particularly in graph processing. The
inherent complexity of graph data, marked by intricate structures and irregular vertex connections,
often requires operations on a single vertex to interact with multiple others. This scenario leads
to frequent and extensive network data exchanges especially when interconnected vertices are
spread in different machines. Consequently, two main challenges arise in terms of network resource
efficiency.
• Communication Overhead: Communication overhead in distributed systems, defined by the
network resource usage from message exchanges, is largely dependent on data transmission
volumes. In distributed graph processing, the necessity to communicate across machines for
accessing vertices or edges on different machines increases network communication. Inefficient
management of these data exchanges can lead to significant network congestion, turning network
communication into a critical bottleneck for overall computing performance. Thus, managing
communication overhead is crucial for optimizing the efficiency and effectiveness of distributed
graph processing.
• Bandwidth: Bandwidth in distributed systems represents the maximum data transfer capacity
between machines in each round of message passing. Limited by hardware and network infras-
tructure, bandwidth is not infinitely scalable. In distributed graph processing, vertices with high
degrees demand substantial bandwidth due to intensive communication with neighbors [34] and
frequent visits from multiple vertices, common in random walk algorithms [111]. Additionally,
low bandwidth utilization is also a challenge. For tasks, like triangle counting, BFS, and connected
components, numerous small messages are exchanged between low-degree vertices, containing
only neighbor information. On the other hand, each message exchange using the message-passing
interface, like MPI, introduces additional overhead in the form of header information and hand-
shake protocol messages resulting in a diminished ratio of actual payload data and thereby
leading to an inefficient utilization of bandwidth resources [149]. For these reasons, effectively
and efficiently optimizing bandwidth utilization is challenging in distributed graph processing.
Remark. Besides the above challenges, there are still other challenges like fault tolerance [80, 180]
that need to be considered in distributed graph processing. However, distributed graph processing
often addresses them following the established solutions for general distributed algorithms in the
literature. Therefore, we focus on the above four challenges in this survey and omit other challenges
(Interested readers can refer to [47, 120, 154, 161] for the solutions to these challenges).
3.2.1 Computation Resource Efficiency Optimization. This section focuses on the solutions to
optimize the computation resource efficiency including parallelism and load balance.
Parallelism: Breaking down the graph analysis task into smaller subtasks that can be processed
independently reduces the overall execution time by allowing multiple computations to proceed
concurrently in the distributed system. To achieve effective parallelism, various techniques,
including dependency resolution, asynchronous execution, and shortcut augmentation, have been
proposed.
• Dependency Resolution: Graph analysis tasks generally have an intrinsic dependency order on
the computation. For example, the order in which vertices are visited in graph traversal tasks is
fixed, and any graph traversal algorithm has to follow this order to ensure correctness.
The implied dependency order significantly limits the potential parallelism of the distributed
graph algorithms. Dependency resolution focuses on identifying and exploiting the inherent
dependencies among subtasks in the original graph analysis task to ensure correct and efficient
parallel execution. Given the dependency order is unique to each specific graph analysis task, the
dependency resolution methods also vary accordingly. The specific details on these dependency
resolution methods are presented in Section 4 [37, 48, 52, 86, 88, 105, 107, 120, 123].
• Asynchronous Execution: Synchronization (Figure 4 (a)) is a coordination mechanism used to
ensure that multiple concurrent processes in distributed graph processing do not cause data
inconsistency and are able to successfully complete the task without conflict. In distributed graph
processing, synchronous execution is the easiest synchronization mechanism in which all compu-
tations are divided into discrete global iterations or super-steps. During each super-step, all ma-
chines perform their computations simultaneously locally, and at the end of each super-step, there
is a barrier synchronization point where all machines must wait until every other machine has fin-
ished its computations and communications [9, 122]. This ensures consistency but often leads
to poor computation performance due to waiting times caused by slower machines or stragglers.
Asynchronous execution (Figure 4 (b)), instead, allows machines to operate independently without
waiting for each other at barrier synchronization points. Once a machine finishes its computation,
it immediately sends its updated information to its neighbors and proceeds with the next round
of computation. This approach can significantly speed up the computation process as there's no
idle waiting time, and it can also converge faster in certain scenarios. Therefore, many distributed
graph algorithms [82, 121, 133, 179] exploit the asynchronous execution to enhance parallelism.

Fig. 4. An example of synchronous and asynchronous execution: In (a), vertical lines represent synchronous
signals, where each machine must wait until all other machines have completed their workloads. In (b),
asynchronous execution involves no waiting time, allowing each machine to compute independently, thereby
increasing overall parallelism.
Fig. 5. An example of partitioning methods: (a) vertex partitioning with 1 edge-cut, (b) edge partitioning with
1 vertex-cut (requiring mirror vertices for synchronization). Minimum-cut can cause unbalanced partitions.
The hybrid method (c) with 3 edge-cuts considers both cut numbers and vertex distribution for better balance.
generally encompasses a task queue, responsible for holding pending tasks available for execution,
and a task scheduler that selects the subsequent task for execution using diverse policies [44, 172].
3.2.2 Optimizing Network Resource Efficiency. This section focuses on the solutions to address
communication overhead and bandwidth challenges regarding network resource efficiency.
Communication Overhead: In distributed graph processing, vertices in different machines fre-
quently exchange messages, resulting in substantial communication overhead. To address this
challenge, researchers have devoted considerable efforts to developing effective solutions to optimize
communication overhead, which are categorized as follows:
Fig. 6. An example of the Subgraph-Centric model is shown in (b). Compared to the Vertex-Centric model in
(a), where communication occurs between vertices, in (b), local computations are carried out within subgraphs
until convergence, and then communication only occurs between subgraphs.
• Local Computation: By exploiting the inherent locality of the graph analysis task, local com-
putation aims to perform the computation primarily based on locally available information at
each machine to avoid message exchange to reduce communication overhead [51, 141, 142].
To achieve this goal, a widely adopted solution is to aggregate information from neighbors in
advance, and thus accessing the information about these neighbors only requires local com-
munication or local shared memory [30]. Moreover, Subgraph-Centric computation (Figure 6)
[14, 100, 105, 117, 123, 144] is another popular local computation variant to reduce communi-
cation overhead. Subgraph-centric computation decomposes the graph into smaller subgraphs.
Each machine in a distributed system is tasked with processing a specific subgraph indepen-
dently, leveraging local information and neighboring subgraphs’ data as necessary. This localized
approach minimizes the need for frequent expensive inter-machine communication.
• Selective Activation: Selective activation solution selects a subset of vertices to be activated in each
iteration, thereby mitigating the need for extensive communication across all edges during the
distributed graph processing. Two prevalent methods for selective activation have been proposed
in the literature. The randomized method activates vertices according to certain probabilistic
criteria following the intuition that deleting one edge from the graph rarely changes the required
properties such as shortest path distances [139] and cycle-freeness [67], which effectively pre-
vents unnecessary communication on inactive vertices [55]. The deterministic method often
adopts an activating threshold, whereby a vertex only sends messages when it satisfies a specific
condition based on the threshold [37, 110, 123, 165]. By leveraging these selective activation
solutions, distributed graph algorithms can effectively reduce the communication overhead.
• Hybrid Push-Pull Model: The push and pull are two fundamental communication paradigms in
distributed graph processing. In the push model, each vertex sends its updating messages or
partial computation results to its neighboring vertices, while in the pull model, vertices retrieve
information from their neighbors as needed to complete their computations. Generally, these two
paradigms each have their own merits regarding communication efficiency in distributed graph
processing. For instance, the push model might be a better choice for the initial stage with fewer
active vertices, while the pull model might perform better when few vertices remain inactive [30, 133]. To
combine the strengths of both models to reduce the communication overhead, hybrid solutions
that automatically switch between the push model and the pull model are proposed [36, 108].
• Aggregation: Aggregation aims to reduce the amount of data transmitted between machines by
merging multiple data elements and eliminating duplicates or redundant elements. A prominent
method is to aggregate messages locally before sending them to other machines, wherein only
essential information is retained. This method not only reduces the overall communication
volume but also effectively diminishes the rounds of communications required [63, 138]. In
addition, some algorithms [35, 101] focus on simplifying the graph structure by aggregating
the information of vertices/edges within a certain area where communication primarily occurs
between these vertices/edges. This simplification facilitates the communication between these
aggregated vertices/edges, thereby significantly alleviating the overall communication overhead.
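The first aggregation method behaves like a local combiner: messages destined for the same remote vertex are merged on the sending machine before transmission. The sketch below assumes a sum-style merge (as in PageRank-like updates); the owner_of and send_to_machine callbacks are hypothetical placeholders for the partitioning and transport layers.

# Combiner sketch: merge outgoing messages per destination vertex, then batch per machine.
from collections import defaultdict

def combine_and_send(outgoing, owner_of, send_to_machine):
    # outgoing: list of (dst_vertex, value) pairs produced by local computation
    merged = defaultdict(float)
    for dst, value in outgoing:
        merged[dst] += value                      # sum-style merge; the operator is task-specific
    by_machine = defaultdict(list)
    for dst, value in merged.items():
        by_machine[owner_of(dst)].append((dst, value))
    for machine, batch in by_machine.items():
        send_to_machine(machine, batch)           # one consolidated transfer per target machine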
Bandwidth: Frequent data exchange between machines also poses significant strain on the available
network bandwidth. To mitigate the requirement of bandwidth in distributed graph processing, vari-
ous solutions are designed in the literature, including message coordination and dense representation.
• Message Coordinating: In distributed graph processing, a vertex may receive a substantial number
of messages within a single iteration, and the available network bandwidth cannot accommodate
the simultaneous transmission of these messages to its neighbors. Message coordinating methods
prioritize the messages based on certain criteria, such that a message with a potentially better result
or a farther destination is assigned higher priority while less promising messages may be pruned
by later updates, and then transmit the messages according to their priorities [86, 88, 111, 113].
Message coordinating allows for more efficient utilization of available network resources and
helps mitigate the impact of bandwidth limitations on distributed graph processing systems.
• Message Buffering: Batch processing is commonly used to tackle the issue of low bandwidth
utilization in distributed graph processing. It utilizes a message buffer queue within each machine,
wherein messages intended for transmission are temporarily stored. Once the buffer reaches its
capacity, all accumulated messages are combined into a single batch and subsequently transmitted
to other machines. Although this approach may introduce a delay in communication, it yields
significant benefits in terms of optimizing bandwidth resource utilization and reducing the
processing time associated with sending and receiving messages [72, 73, 138, 149, 161].
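The buffering idea can be sketched as a small per-destination queue that flushes only when full; the capacity and the flush_fn transport callback below are illustrative assumptions.

# Sketch of a per-destination message buffer that transmits in batches.
class MessageBuffer:
    def __init__(self, flush_fn, capacity=1024):
        self.flush_fn = flush_fn          # callback that actually transmits one batch
        self.capacity = capacity
        self.buffers = {}                 # destination machine -> pending messages

    def send(self, machine, message):
        buf = self.buffers.setdefault(machine, [])
        buf.append(message)
        if len(buf) >= self.capacity:     # amortize per-message header/handshake overhead
            self.flush_fn(machine, buf)
            self.buffers[machine] = []

    def flush_all(self):                  # typically called at the end of an iteration
        for machine, buf in self.buffers.items():
            if buf:
                self.flush_fn(machine, buf)
        self.buffers.clear()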
[Figure 7: per-topic distribution of the challenges addressed by the surveyed papers; panels include (a) Centrality, (b) Community Detection, (c) Similarity, and (d) Cohesive Subgraph, with legend: Parallelism, Load Balance, Comm. Overhead, Bandwidth.]
4.1 Centrality
Centrality measures in graph theory are instrumental for quantifying the relative importance of
vertices within a complex network. PageRank and personalized PageRank are relatively easy to
parallelize as they only require neighbor information for computation. In contrast, Betweenness
centrality and closeness centrality assess vertex influence based on shortest paths, requiring global
graph information. This complexity poses challenges for designing distributed algorithms. Due to
these characteristics, the latter are more suitable than the former for identifying critical nodes that
control the flow of communication or resources in networks.
As Figure 7(a) illustrates, parallelism in centrality tasks is a popular research focus (37.93%). This is
primarily due to the inherent independence of centrality computations for each vertex, which lends
itself well to parallel processing approaches. Surprisingly, load balancing—a fundamental challenge
in distributed environments—has garnered limited attention in this domain, accounting for only
13.79% of the research. This could be attributed to two potential factors. Firstly, the computational
requirements for centrality calculations on each vertex are relatively straightforward, such as the
summation process in PageRank algorithms. Secondly, in cases where the centrality task is complex,
such as in calculating betweenness centrality, the research community may prioritize developing
efficient parallelization techniques over addressing load balancing issues. A significant portion of
the centrality research, precisely 31.03%, is devoted to minimizing communication overhead, while
17.24% of the studies aim to address bandwidth constraints.
PageRank, initially conceptualized in [26], is designed to assess the importance of webpages,
represented as vertices in a web graph, interconnected by hyperlinks symbolizing the edges.
The PageRank score, symbolized by 𝜋, is derived from the stationary probabilities of an 𝛼-decay
random walk. This process starts from a vertex selected uniformly from all 𝑛 vertices at random.
Subsequently, at each step, there exists a probability 𝛼 of initiating a new random walk from a
random vertex, and a complementary probability 1 − 𝛼 of transitioning to an adjacent vertex. The
PageRank score of a vertex $u$, denoted as $\pi(u)$, quantifies the relative importance of vertex $u$ within
the web graph, which can be formulated as $\pi(u) = (1-\alpha) \cdot \frac{1}{n} + \alpha \cdot \sum_{v \in N_{in}(u)} \frac{\pi(v)}{|N_{out}(v)|}$. Typically,
algorithms for computing PageRank can be categorized into three distinct types:
(1) Iterative algorithms: Examples include PCPM [100] and SubgraphRank [14]. These algorithms
iteratively update PageRank values until convergence.
(2) Random-walk-based algorithms: This category comprises algorithms including IPRA [140],
RP [113], MRP [111, 113], and FP [112], which generate and simulate multiple random walks and
estimate PageRank values through the destination vertices of these walks.
(3) Eigenvector-based algorithms: The vector 𝜋 of PageRank values over all vertices is a fixed point
of the update rule and represents an eigenvector. Rather than directly using the original equation, the Stochastic Approximation (SA)
algorithm [82] iteratively approximates the stable state 𝜋 by constructing a differential equation,
enabling gradual convergence towards the solution.
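For reference, one synchronous iteration of the update rule above can be written as follows. This is a single-machine sketch over adjacency lists; in a distributed setting the sum over in-neighbors becomes message exchange between the machines owning those vertices.

# One synchronous PageRank iteration following the formula above.
def pagerank_iteration(in_neighbors, out_degree, pi, alpha):
    # in_neighbors[u]: list of v with (v, u) in E; out_degree[v]: |N_out(v)|; pi: current scores
    n = len(pi)
    new_pi = {}
    for u in pi:
        incoming = sum(pi[v] / out_degree[v] for v in in_neighbors[u])
        new_pi[u] = (1 - alpha) * (1.0 / n) + alpha * incoming
    return new_pi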
Optimizing parallelism is crucial, especially in random-walk-based algorithms. These algorithms
generate multiple independent random walks to approximate PageRank values, benefiting sig-
nificantly from parallel processing. Specifically, IPRA introduces a method using short walks as
efficient indices in the graph, reducing the number of processing rounds needed. Similarly, MRP
starts with short walks that are then combined to form longer walks, further enhancing efficiency.
In contrast, the eigenvector-based SA algorithm employs asynchronous computation. It updates
PageRank values by selecting individual vertices in each iteration, allowing subsequent iterations
to proceed asynchronously as long as they do not update the same vertices simultaneously.
As a prevalent challenge in distributed computing, load balancing can impact the efficiency of
PageRank algorithms. The computation of a vertex’s PageRank is influenced by its connectivity,
resulting in vertices with higher degrees performing more intensive updates of PageRank values.
To mitigate this imbalance, iterative algorithms like PCPM and SubgraphRank employ the Subgraph-
Centric computation model, which divides the graph into multiple partitions, each managed by a
dedicated thread. However, balancing the workload, which includes both internal computation and
global communication, remains complex. They use tools like OpenMP for dynamic scheduling of
vertex computation, rather than adhering to an ideal static partitioning.
PageRank algorithms also face significant challenges in minimizing communication overhead. A major
advantage of the aforementioned Subgraph-Centric model is its ability to replace inter-partition vertex communication with local
memory access. Additionally, it allows multiple communications from one vertex to all destination
vertices within the same partition to be consolidated into a single data transfer, further reducing
communication overhead. The SA algorithm optimizes communication costs by using the pull
mode. In each iteration, a selected vertex updates its PageRank value by “pulling” information from
its neighbors. This eliminates the need for the vertex to “push” its updated PageRank value to others,
thereby halving communication overhead.
Bandwidth constraints can pose challenges, particularly in random walk-based algorithms, where
a proliferation of random walks may converge on a single vertex, creating a bottleneck. To mitigate
this, both RP and MRP algorithms employ a coordinator mechanism to evenly distribute walks
among all neighboring vertices, thus preventing concentration on a single vertex. In contrast, the FP
algorithm adopts a more bandwidth-efficient approach by transmitting only the expected number
of paths, rather than sending individual path messages.
Personalized PageRank modifies the standard PageRank algorithm by enabling users to specify
a start vertex, denoted as 𝑠. Unlike traditional PageRank, where the random walk commences from
and restarts at a randomly selected vertex with probability 𝛼, personalized PageRank consistently
initializes and restarts the walk at the specified source vertex 𝑠.
Given the similarities between personalized PageRank and the standard PageRank algorithm,
most techniques for addressing distributed computing challenges are comparable. For instance, the
Subgraph-Centric computation model [79] and shared-memory approaches [148] effectively reduce
internal communication within partitions. Additionally, employing a coordinator between two
partitions can significantly minimize redundant global communications, as demonstrated in [79].
We highlight techniques employed in personalized PageRank to achieve optimal load distribution.
The study [79] proposes a theoretically perfect partition utilizing hierarchical methods. Moreover,
load imbalance can also arise during sampling in weighted graphs, where the next hop in a random
walk is chosen with non-uniform probability, necessitating weighted sampling. The Alias method, capable of constant-time
sampling for weighted elements, and its extension, the Alias Tree, are effective solutions for handling
extremely large numbers of neighbors in memory-limited scenarios [106].
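For completeness, the alias method mentioned above can be sketched as follows: an O(n) preprocessing pass builds two tables, after which each weighted neighbor sample costs O(1). This is a generic single-machine sketch of the classic technique, not the memory-limited Alias Tree variant of [106].

# Alias method sketch: O(n) table construction, O(1) weighted sampling of an index.
import random

def build_alias_table(weights):
    n = len(weights)
    total = float(sum(weights))
    prob = [w * n / total for w in weights]            # scale weights so the average is 1
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                                   # column s borrows the missing mass from l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def alias_sample(prob, alias):
    i = random.randrange(len(prob))                    # pick a column uniformly
    return i if random.random() < prob[i] else alias[i]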
Betweenness Centrality measures the extent to which a specific vertex lies on the shortest paths
between other vertices within a graph. The underlying premise is that important vertices are more
likely to be on many of these paths, influencing communication flow across the graph. Given $C_b(u)$
as the normalized betweenness centrality value for a vertex $u$, most algorithms [52, 86, 88, 90] follow
the Brandes' algorithm [25] to compute $C_b(u)$ as: $C_b(u) = \frac{1}{(n-1)(n-2)} \sum_{s \in V \setminus \{u\}} \delta_{s\bullet}(u)$, and
$\delta_{s\bullet}(u) = \sum_{t \in V \setminus \{s,u\}} \frac{\sigma_{su} \cdot \sigma_{ut}}{\sigma_{st}}$, where $\sigma_{st}$ is the number of shortest paths from a vertex $s$ to another vertex $t$.
The value of $\delta_{s\bullet}(u)$ can be recursively computed as: $\delta_{s\bullet}(u) = \sum_{w: u \in P_s(w)} \frac{\sigma_{su}}{\sigma_{sw}} \cdot (1 + \delta_{s\bullet}(w))$, where
$P_s(w)$ is the set of predecessors of $w$ in the SSSP Directed Acyclic Graph (SSSP DAG).
To compute betweenness centrality values of all vertices, Brandes’ algorithm comprises two
primary steps: (1) SSSP Step involves constructing a Directed Acyclic Graph (DAG) from each vertex
in the graph; (2) Brandes’ Step recursively computes the dependency score 𝛿𝑠• (𝑢) and systematically
accumulates these scores to determine the centrality value of each vertex 𝑢.
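The recursion in the Brandes' step can be sketched as a backward pass over the SSSP DAG of one source s; the sketch below assumes the path counts sigma and the predecessor sets from the SSSP step are already available, and processes vertices in non-increasing distance from s.

# Brandes' dependency accumulation for a single source s (sketch).
def accumulate_dependencies(reverse_order, predecessors, sigma):
    # reverse_order: vertices sorted by non-increasing distance from s
    # predecessors[w]: P_s(w); sigma[v]: number of shortest s-v paths
    delta = {v: 0.0 for v in sigma}
    for w in reverse_order:
        for v in predecessors[w]:
            delta[v] += (sigma[v] / sigma[w]) * (1.0 + delta[w])
    return delta      # delta[u] is the per-source contribution to C_b(u) before normalization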
In adapting Brandes’ algorithm for distributed computing, a key consideration is parallelizing
the computation across 𝑛 SSSP DAGs to enhance parallelism. This involves executing the SSSP and
Brandes’ steps concurrently from all vertices. In unweighted graphs, BFS is used for this parallel
execution, computing 𝜎𝑠𝑢 and 𝛿𝑠• (𝑢) in the BFS order [88, 90]. For weighted graphs, alternatives
like the Bellman-Ford [29] and Lenzen-Peleg [102] algorithms are employed [52, 86].
The communication demands of the SSSP step, especially on weighted graphs, are high due
to frequent message passing. To mitigate this, the algorithm developed by Crescenzi et al. [52],
which is based on Bellman-Ford, updates 𝜎𝑠𝑢 and 𝛿𝑠• (𝑢) with differential values instead of full
recomputation. Alternatively, the Lenzen-Peleg-based algorithm in [86] adopts an early termination
strategy called Finalizer to stop the SSSP step as soon as all vertices have received the correct 𝜎𝑠𝑢 , which
is shown to be effective when graph diameters are small.
Addressing bandwidth limitations involves coordinating the order of message sending. In the
SSSP step, messages concerning shorter distances are prioritized [86], as longer distances are more
likely to change. In the Brandes’ step, priority is reverted to longer distances due to their greater
number of hops to the source, as discussed in [88]. Given the inherent difference between the two
steps, the algorithm in [86] attempts to track the round numbers at which updates occur in the
SSSP step and then apply these in reverse order during the Brandes’ step. Additionally, Pebble
BFS [87], particularly effective in parallel SSSP for unweighted graphs, adopts a staggered BFS
approach. The algorithm [88] utilizes the idea of visiting one vertex (instead of multiple vertices
simultaneously) per round, thereby avoiding message congestion to some large-degree vertices.
Closeness Centrality quantifies the average shortest path length from a given vertex to all other
vertices in the network, suggesting that a vertex of higher importance typically exhibits shorter
average distances to other vertices. The normalized closeness centrality value $C_c(u)$ for a vertex $u$
is calculated as $C_c(u) = \frac{n-1}{\sum_{v \in V} d(u,v)}$, where $d(u,v)$ is the shortest distance from $u$ to $v$.
The calculation of closeness centrality values, much like the betweenness centrality, confronts
considerable communication overhead. This is particularly severe in incremental closeness central-
ity, where the introduction of new edges necessitates the reevaluation of closeness centrality values.
The STREAMER algorithm [139] effectively addresses this by utilizing a level structure to filter
out edges that do not impact shortest path distances, which is a common scenario in real-world
graphs. To further refine the process, the DACCER algorithm [164] employs an approximation of
the “relative rank”. This is based on the observation that vertices with higher closeness centrality
tend to have larger neighborhoods with higher-degree vertices. By aggregating degree information
from neighbors, each vertex can estimate its closeness centrality rank with reduced communication.
Load balancing plays a critical role in the efficiency of Louvain algorithms in distributed environ-
ments: since the workload of a single machine is typically proportional to the degrees of the vertices it
processes, a few high-degree vertices can lead to significant skew in the workload distribution.
Jianping Zeng, et al. [178] extend the graph partitioning method based on vertex delegates, initially
proposed by Pearce et al. [128]. The method described involves duplicating high-degree vertices
across all machines in a distributed system. The edges connected to these high-degree vertices are
then reassigned to ensure a balanced distribution of edges, and consequently, computational work,
across the different machines in the system. Another distributed algorithm, called DPLAL [142],
employs the well-known graph partitioning tool METIS [95], which is designed to minimize the
number of cut edges or the volume of communication needed between different parts of the graph.
Reducing communication overhead is also a prevalent challenge in Louvain algorithms. Some
recent studies [74, 178] reduce this overhead by utilizing “ghost” vertices, which are a set of replicas
of the vertices that are connected to local vertices but are processed by other machines. These
ghost vertices help to maintain the local integrity of the graph, thereby reducing inter-machine
communication. The paper [74] also presents two heuristics for optimizing the distributed Lou-
vain algorithm, one is setting varying thresholds during different stages of iterations to decide
termination, effectively reducing the total number of iterations during the algorithm; another is
setting vertices with minor changes in modularity during iterations as “inactive”, to save computa-
tional and communication resources. Moreover, some papers [141, 142] adopt the Subgraph-Centric
model. Each machine collects information about adjacent vertices, then updates local community
states independently, without inter-machine communication.
Label Propagation algorithm (LPA) [131] is an effective approach to detecting communities by
propagating labels throughout the network. The algorithm starts by assigning a unique label to each
vertex, which represents the community to which the vertex currently belongs. In each subsequent
iteration, every vertex considers the labels of its immediate neighbors and adopts the label that is
most frequent among them. The algorithm continues to iterate through the vertices and update
their labels until convergence, i.e., when each vertex has the majority label of its neighbors or the
maximum number of iterations defined by the user is reached.
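A single synchronous round of this procedure can be sketched as below; the distributed algorithms discussed next mainly refine the update order, the tie-breaking, and the communication pattern of this basic step.

# One synchronous round of label propagation: adopt the most frequent neighbor label.
from collections import Counter

def lpa_round(neighbors, labels):
    new_labels = {}
    for v, nbrs in neighbors.items():
        if nbrs:
            counts = Counter(labels[u] for u in nbrs)
            new_labels[v] = counts.most_common(1)[0][0]   # majority label (ties broken arbitrarily)
        else:
            new_labels[v] = labels[v]                     # isolated vertex keeps its own label
    return new_labels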
Implementing distributed label propagation algorithms involves balancing parallelism with the
quality of identified communities. Misordered computations can result in overly large, impractical
communities. The EILPA algorithm [13] identifies influential vertices, termed “leaders”, using a
personalized PageRank model. Label propagation starts from these leader vertices, ensuring a
structured, influence-based community detection. The PSPLAP algorithm [116] also considers
vertex influence by assigning weights via 𝑘-shell decomposition and calculating propagation
probabilities and vertex similarities based on these weights. Vertices update their labels during
each iteration using these probabilities and similarities, aiming for more coherent community
detection by reducing label updating randomness typical in label propagation algorithms.
Communication overhead is another prevalent challenge when implementing the label propaga-
tion algorithm in distributed environments. To tackle this challenge, Xu Liu, et al. [108] propose a
dynamic mechanism that alternates between push and pull strategies to reduce communication
costs. Initially, push is used for rapid community formation, and then the algorithm switches to
pull in later iterations to consider neighbors’ label quality.
Connected Components refer to subgraphs where every vertex can be reached from any other
vertex. Computing connected components typically involves traversing all vertices and edges in a
graph based on Breadth-First-Search (BFS) or Depth-First-Search (DFS), resulting in a linear time
complexity. However, in distributed environments, the irregularity of the graph concentrates workload
and communication on high-degree vertices, which are the primary focus when optimizing
connected component computation.
For improving parallelism, the paper [132] leverages the max-consensus-like protocol where
each vertex only considers the maximum value among its in-neighbors to update its own value.
The algorithm can terminate in 𝐷 + 2 iterations where 𝐷 is the diameter of the graph. Benefiting
from this protocol, each vertex only needs to receive information from its in-neighbors, which also
helps to improve computational and communication efficiency.
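The max-consensus idea can be sketched as the repeated update below, where every vertex replaces its label with the maximum label seen among itself and its in-neighbors; this is a simplified, single-machine illustration rather than the exact algorithm of [132], which also handles termination detection.

# Simplified max-consensus sketch over in-neighbors; with bidirectional edges,
# labels converge to one value per connected component.
def max_consensus_labels(in_neighbors, num_iterations):
    labels = {v: v for v in in_neighbors}            # initialize every vertex with its own id
    for _ in range(num_iterations):                  # roughly diameter + 2 rounds suffice per [132]
        labels = {
            v: max([labels[v]] + [labels[u] for u in in_neighbors[v]])
            for v in in_neighbors
        }
    return labels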
Balancing workload is also important for improving the efficiency of connected component
computation. In the paper of Sebastian Lamm, et al. [101], high-degree vertices are replicated to
different machines to alleviate the load imbalance issues caused by processing such vertices on a
single machine. Chirag Jain, et al. [89] adopt the strategy of dynamically redistributing vertices.
Due to the vertex pruning strategy used, vertices may be removed from machines after each
iteration, leading to load imbalance issues. In their work, after each iteration, some vertices will be
redistributed to achieve load balancing.
To minimize communication overhead, selective activation strategies are adopted by a number
of papers [89, 110]. CRACKER [110] identifies seed vertices to construct trees for each connected
component. Vertices are iteratively removed from the graph and added to the tree, reducing
communication volume by trimming vertices during this process. Another work [89] removes
completed connected components in each iteration to decrease the number of active vertices in
subsequent iterations. The paper [101] proposes a graph contraction pre-processing strategy. It
conducts local BFS on each machine and merges vertices within the same connected component
into a single vertex, avoiding multiple messages to common neighbors. In addition, Xing Feng, et
al. [63] adopts the local computation strategy, where they first run BFS locally to label vertices
in the same connected component with the same color. Then, inter-machine communication is
carried out to merge subgraphs by merging different colors received by the same vertex. To further
reduce the amount of communication, messages are locally aggregated before being sent to other
machines.
4.3 Similarity
This topic refers to evaluating the similarity of the two vertices in a graph. Vertices with a high
similarity score indicate that they have something in common in structure or attribute. Jaccard
similarity and cosine similarity are basic measurement methods using set operations, requiring only
neighboring information for computation, thus enabling high parallelization. In contrast, SimRank
calculates similarity by considering the entire graph’s information. Despite its lower efficiency, it
can uncover latent relationships and similarities hidden within the graph structure.
According to Figure 7c, efforts are divided, with 20% focusing on enhancing parallelism and 10%
on improving load balancing. This distribution underscores the challenge in similarity computation
across all vertex pairs, necessitating extensive inter-vertex communication. Consequently, the
predominant emphasis of research, at 70%, is on reducing communication overhead. It is noteworthy,
however, that bandwidth constraints have not received significant attention for this topic.
Jaccard Similarity and Cosine Similarity are commonly used to measure the similarity of ver-
tices. Jaccard similarity measures the similarity between two sets based on their shared elements.
Specifically, it is defined as the size of the intersection of two sets divided by the size of their union,
which is 𝐽𝑎𝑐𝑐𝑎𝑟𝑑 (𝐴, 𝐵) = |𝐴 ∩ 𝐵|/|𝐴 ∪ 𝐵|. The value of Jaccard similarity ranges from 0 to 1, and if either 𝐴
or 𝐵 is an empty set, the value is 0. In the context of graph data, sets 𝐴 and 𝐵 are typically defined
as the sets containing all the neighbors of a given vertex. Cosine similarity is another measure of
similarity between two sets, which is defined as 𝐶𝑜𝑠𝑖𝑛𝑒 (𝐴, 𝐵) = |𝐴 ∩ 𝐵|/√(|𝐴| × |𝐵|).
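As an illustration, the following is a minimal sketch of both measures over neighbor sets, directly following the definitions above.

```python
import math

def jaccard(A, B):
    if not A or not B:                         # defined as 0 when either set is empty
        return 0.0
    return len(A & B) / len(A | B)

def cosine(A, B):
    if not A or not B:
        return 0.0
    return len(A & B) / math.sqrt(len(A) * len(B))

# Neighbor sets of two vertices in a toy graph.
print(jaccard({1, 2, 3}, {2, 3, 4}))           # 2 / 4 = 0.5
print(cosine({1, 2, 3}, {2, 3, 4}))            # 2 / 3 ≈ 0.667
```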
4.5 Traversal
Traversal, a fundamental method in graph analysis, involves visiting each vertex or edge in a graph
to identify paths, trees, or cycles. This process is crucial for uncovering the structural information
of a graph. Each traversal algorithm, such as single-source shortest path (SSSP) for finding the
shortest path and maximum flow for capacity planning in flow networks, serves specific applications.
However, these algorithms all require global updates and synchronization during computation,
posing significant challenges in distributed environments.
As indicated in Figure 7e, the studies on parallelism focus on identifying sub-structures that have
strong dependencies among vertices during traversal, which account for 34.29% of the research
efforts in traversal. Additionally, about 22.86% of works consider load balancing, especially when
the Subgraph-Centric model is adopted. A significant number of papers (37.14%) have contributed
to the reduction of communication overhead in traversal: the inherent redundancy in visiting
vertices and the need for transferring messages make it a communication-intensive task. Finally,
there is little (5.71%) research specifically targeting bandwidth optimization.
Breadth-First-Search (BFS) is utilized to systematically visit all vertices starting from a specified
source vertex; despite its simplicity, it faces challenges due to irregular vertex access patterns.
The nature of the workload in BFS processes, where a vertex becomes active only upon visitation,
poses a challenge in maintaining load balance. To address this, BFS-4K [28] employs a dynamic
thread scheduling strategy for child kernels to effectively balance the workload. Additionally,
a delegate-based approach [30] involves assigning delegates to high-degree vertices in different
machines. This method ensures that access to these vertices primarily involves local communication,
thereby reducing the reliance on more costly global communication.
To address the challenge of communication overhead in BFS, particularly due to redundant
communications with already visited vertices, a direct solution is to “mask” these vertices, omitting
them from future iterations as detailed in [18]. Additionally, optimizing BFS based on its stages
further reduces this overhead. Initially, when few vertices have been visited, a top-down (push) model
is more effective. As the number of visited vertices grows, a bottom-up (pull) approach
becomes preferable to limit redundant communications. Identifying the ideal point to switch
between these models is key, as explored in [30, 133].
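The following is a minimal single-machine sketch of this direction-optimizing idea; the frontier-size threshold alpha is an illustrative assumption, whereas the cited works determine the switch point more carefully.

```python
def hybrid_bfs(adj, source, alpha=0.05):
    """BFS that pushes from a small frontier and pulls into unvisited vertices otherwise."""
    dist = {v: None for v in adj}
    dist[source] = 0
    frontier, level = {source}, 0
    while frontier:
        next_frontier = set()
        if len(frontier) < alpha * len(adj):             # top-down (push) from the frontier
            for v in frontier:
                for u in adj[v]:
                    if dist[u] is None:
                        dist[u] = level + 1
                        next_frontier.add(u)
        else:                                            # bottom-up (pull) from unvisited vertices
            for u in adj:
                if dist[u] is None and any(w in frontier for w in adj[u]):
                    dist[u] = level + 1
                    next_frontier.add(u)
        frontier, level = next_frontier, level + 1
    return dist

# Levels from vertex 0 in a small undirected graph.
print(hybrid_bfs({0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}, 0))
```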
Single Source Shortest Path (SSSP) aims to find the shortest path from a source vertex to all
others. The discussions of SSSP are predominantly on weighted graphs. While the sequential
Bellman-Ford and Dijkstra algorithms are hard to parallelize, there are four major methodologies
for implementing SSSP in distributed environments:
(1) Subgraph SSSP: Applying Dijkstra’s algorithm on subgraphs within a distributed framework
and exchanging distance messages between partitions [117].
(2) Δ-stepping: The Δ-stepping method processes vertices bucket by bucket over distance
ranges [𝑘Δ, (𝑘+1)Δ), repeatedly relaxing the light edges (those whose weight falls
below the threshold Δ) of the current bucket until it stabilizes before moving on [36, 161]; a sketch is given after this list.
(3) Skeleton: These algorithms [31, 43, 60, 64, 84], which offer faster processing with approxi-
mation, construct a “skeleton” graph (a subgraph) and compute “hopsets” to expedite finding the
shortest paths. Skeleton-based algorithms compute distances from skeleton vertices to other graph
vertices, adding these distances as shortcut edges to the original graph. These shortcuts can significantly decrease the
number of hops (vertices) on a shortest path.
(4) Transshipment: A broader approach that generalizes the SSSP task, often modeled as a linear
programming problem [17] or linear oblivious routing [135, 182] for approximating distances.
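Below is a minimal single-machine sketch of the Δ-stepping bucket structure referenced in (2); distributed variants partition the vertices across machines and exchange relaxation requests bucket by bucket. The graph format adj[v] = [(u, w), ...] and the bucket bookkeeping are illustrative assumptions.

```python
import math

def delta_stepping(adj, source, delta):
    """Bucketed SSSP: light edges (weight <= delta) are relaxed repeatedly per bucket."""
    dist = {v: math.inf for v in adj}
    buckets = {}                                         # bucket index -> set of vertices

    def relax(v, d):
        if d < dist[v]:
            if dist[v] < math.inf:                       # move v out of its old bucket
                buckets.get(int(dist[v] // delta), set()).discard(v)
            dist[v] = d
            buckets.setdefault(int(d // delta), set()).add(v)

    relax(source, 0.0)
    while buckets:
        i = min(buckets)                                 # smallest non-empty bucket index
        settled = set()
        while buckets.get(i):                            # reinsertions keep this loop going
            frontier = buckets.pop(i)
            settled |= frontier
            for v in frontier:
                for u, w in adj[v]:
                    if w <= delta:
                        relax(u, dist[v] + w)
        for v in settled:                                # heavy edges are relaxed once
            for u, w in adj[v]:
                if w > delta:
                    relax(u, dist[v] + w)
        buckets = {k: s for k, s in buckets.items() if s}
    return dist

# Shortest paths from vertex 0: {0: 0.0, 1: 1.0, 2: 2.0}.
print(delta_stepping({0: [(1, 1.0), (2, 4.0)], 1: [(2, 1.0)], 2: []}, 0, delta=2.0))
```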
Skeleton-based algorithms focus on achieving higher parallelism. In these algorithms, parallelism
largely depends on the quality of the “hopset”. Utilizing an approximate “hopset” with a shorter
hop distance (the number of hops required between vertices) has proven more practical [84]. CFR
algorithms introduce a tradeoff parameter to balance hop distance, “hopset” size, and computation
time [31]. Moreover, a two-level skeleton approach, which constructs an additional skeleton atop
the original, achieves even greater efficiency, closely approaching theoretical limits [43].
In the context of optimizing load balance, Δ-stepping algorithms skillfully integrate aspects
of Bellman-Ford and Dijkstra for distributed environments. Nevertheless, they are susceptible to
load imbalance, particularly when high-degree vertices are processed simultaneously, leading
to excessive communications. A practical solution [36] involves distributing these vertices across
different machines. For updating long edges, instead of transmitting sparse pairs of delayed distance
messages, a more efficient approach [161] is to aggregate these as a dense distance vector. This
uniform communication strategy not only streamlines the process but also aids in balancing the
communication load across the system.
High communication overhead, a prevalent challenge in SSSP algorithms, results from extensive
distance relaxing and updating processes. One approach to address this is dynamically switching
between push and pull models at different stages, as proposed in [36]. Another optimization
involves using a hierarchical sliding window instead of a fixed Δ-bucket to enhance communication
efficiency [161]. The Subgraph SSSP algorithm [117] reduces communication costs by computing
partial shortest path distances within individual partitions. In transshipment algorithms, gradient
descent techniques are employed to minimize communication by optimizing update directions [17].
Minimum Spanning Tree (MST) is a connected subgraph that encompasses all 𝑛 vertices and
𝑛 − 1 edges, forming a tree. It is characterized by the minimum aggregated weight of its edges,
representing the least costly way to connect all vertices in a graph. In distributed environments,
many works are based on the GHS algorithm [69], which involves three steps: (1) each vertex
starts as a single-vertex tree (fragment); (2) each fragment selects the lightest edge connecting it
to another fragment, and the two fragments are merged; (3) the process repeats, reducing the number of fragments by at least half each
iteration, so that a single fragment representing the MST remains after at most 𝑂 (log 𝑛) rounds.
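The fragment-merging idea behind GHS can be illustrated with a single-process Borůvka-style sketch, assuming distinct edge weights as GHS does; the actual distributed algorithm coordinates each merge with messages exchanged inside every fragment, which this sketch omits.

```python
def boruvka_mst(num_vertices, edges):
    """Each round, every fragment picks its lightest outgoing edge and merges along it."""
    parent = list(range(num_vertices))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    mst, fragments = [], num_vertices
    while fragments > 1:
        cheapest = {}                               # fragment root -> lightest outgoing edge
        for w, u, v in edges:
            a, b = find(u), find(v)
            if a == b:
                continue
            if a not in cheapest or w < cheapest[a][0]:
                cheapest[a] = (w, u, v)
            if b not in cheapest or w < cheapest[b][0]:
                cheapest[b] = (w, u, v)
        if not cheapest:                            # remaining fragments are disconnected
            break
        for w, u, v in cheapest.values():
            a, b = find(u), find(v)
            if a != b:
                parent[a] = b
                mst.append((w, u, v))
                fragments -= 1
    return mst

# A square with one diagonal: the MST keeps the three lightest edges (weights 1, 2, 3).
print(boruvka_mst(4, [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 3, 0), (5, 0, 2)]))
```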
Distributed MST algorithms, if not optimized, may incur significant communication overhead,
particularly when a fragment’s diameter greatly exceeds the graph’s diameter. For example, in a linear graph with
𝑛 vertices where the first vertex also connects to all others (so the graph’s diameter is constant), choosing the line
as a fragment increases the fragment’s diameter to 𝑛 − 1.
This scenario requires the GHS algorithm to perform 𝑂 (𝑛) rounds for merging. The Controlled-
GHS algorithm [70] addresses this by controlling fragment diameter growth. It merges fragments
using specific methods when a fragment is small or few fragments remain, or by merging close
fragments to minimize communication. Besides, some approaches focus on creating a low-diameter
spanning tree first, then applying Controlled-GHS to maintain low-diameter growth [121].
Cycle Detection entails identifying cycles of a specified length in a graph. For smaller cycles,
there are numerous effective algorithms. The task of finding a 3-cycle, commonly known as a
triangle, typically involves checking for common neighbors between pairs of vertices on each edge.
Similarly, detecting a 4-cycle, or square, usually requires examining common neighbors between
pairs of vertices on all wedges (a wedge being a path of two edges). However, efficiently detecting
larger cycles, namely 𝑘-cycle for arbitrary 𝑘, aiming to reduce communication costs and iteration
rounds, continues to be a challenging area of research.
Cycle detection, often based on a BFS approach where two traversals run from a vertex until
they meet, hinges on minimizing communication overhead. A notable strategy, as described in [33],
involves sparsification—running BFS on a subgraph with edges randomly removed to reduce
communication and computation costs. However, this can cause false negatives if edges in a
cycle are removed. To counter this, the algorithm is executed multiple times on varied sparsified
subgraphs, with the results aggregated to identify cycles. An enhanced method in [67] improves
accuracy by selectively removing edges less likely to form cycles, balancing efficiency with the
likelihood of detecting cycles.
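A toy sketch of the sparsify-and-repeat idea described above, under simplifying assumptions: the test merely checks whether any cycle survives in a random edge sample (using a union-find), whereas the cited algorithms target 𝑘-cycles and choose the sampling probability and the number of trials to obtain provable guarantees.

```python
import random

def has_cycle(edges, num_vertices):
    """Union-find test: an edge whose endpoints are already connected closes a cycle."""
    parent = list(range(num_vertices))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v in edges:
        a, b = find(u), find(v)
        if a == b:
            return True
        parent[a] = b
    return False

def sparsified_cycle_detection(edges, num_vertices, keep_prob=0.7, trials=10):
    """Aggregate (OR) the outcomes of several trials on randomly sparsified subgraphs."""
    return any(
        has_cycle([e for e in edges if random.random() < keep_prob], num_vertices)
        for _ in range(trials)
    )

# A 3-cycle plus a pendant edge; most trials keep the cycle intact.
print(sparsified_cycle_detection([(0, 1), (1, 2), (2, 0), (2, 3)], 4))
```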
Maximum Flow in graph theory focuses on computing the maximum flow achievable from a
source vertex 𝑠 to a sink vertex 𝑡 in a graph. Each edge in this graph is assigned a certain capacity that
limits the flow through it. The core methodology for determining maximum flow involves repeatedly
identifying paths through the graph—known as augmenting paths—that can still carry more flow.
This augmentation process is conducted iteratively: after each traversal, the flow is augmented
along the identified path, and the residual capacities are updated. This process continues until no
further augmenting paths can be identified, indicating that the maximum flow has been achieved.
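A single-machine sketch of this augment-and-update loop follows (in the Edmonds-Karp style, finding each augmenting path with a BFS over the residual graph); the distributed algorithms discussed next parallelize precisely this path-finding step. The nested-dictionary capacity format is an assumption made for illustration.

```python
from collections import deque

def max_flow(capacity, s, t):
    """Repeated BFS augmentation on the residual graph (capacity: dict of dicts)."""
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u in list(residual):
        for v in list(residual[u]):
            residual.setdefault(v, {}).setdefault(u, 0)   # reverse edges start with 0 capacity
    total = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:                  # BFS for an augmenting path
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:                               # no augmenting path left
            return total
        bottleneck, v = float("inf"), t
        while parent[v] is not None:                      # find the path's residual capacity
            u = parent[v]
            bottleneck = min(bottleneck, residual[u][v])
            v = u
        v = t
        while parent[v] is not None:                      # push flow and update residuals
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck

# Two disjoint s-t paths with capacities (3, 2) and (2, 3): the maximum flow is 4.
print(max_flow({"s": {"a": 3, "b": 2}, "a": {"t": 2}, "b": {"t": 3}, "t": {}}, "s", "t"))
```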
In tackling parallelism challenges within maximum flow algorithms, optimizing the efficiency
of graph traversals is crucial, particularly since each augmentation process involves a traversal
that can be computationally intensive. The primary optimization strategy in this context aims to
maximize the number of paths augmented in a single iteration, thereby enhancing overall efficiency.
In line with this strategy, it has proven more practical and efficient to identify augmenting
paths on a sparse subgraph. For example, the algorithm in [130] employs a method that extracts a
spanning tree from the graph. It then concurrently finds multiple augmenting paths within this
sub-structure. This approach not only boosts parallelism but also significantly reduces the number
of iterative rounds required. Furthermore, as discussed in the context of SSSP tasks, flow problems
can be reformulated as linear optimization problems and tackled using methods like gradient
descent. This perspective allows for alternative solutions, as proposed in [71], which computes an
approximate solution with a capacity penalty. Such an approach, focusing on approximation rather
than exact solutions, enables the algorithm to achieve satisfactory results in fewer rounds.
algorithms, 1D partitioning [72, 73, 150] is commonly employed, focusing on partitioning either ver-
tices or edges. Conversely, 2D partitioning [7, 127, 173] is more prevalent in map-based algorithms,
wherein the data is divided across the graph’s two-dimensional adjacency matrix. Optimization
strategies vary based on the partitioning method chosen. The study in [7] begins with generating
an adjacency matrix from sorted vertices to distribute tasks evenly across machines. It utilizes cyclic
distribution to balance light and heavy tasks. Similarly, [173] adapts this approach to heterogeneous
distributed environments. TRUST [127] adopts a distinct 2D distribution strategy, first partitioning
vertices based on their one-hop neighbors and then using these partitions to form two-hop neighbor partitions, to
ensure balanced load distribution. In [150], a cyclic distribution method akin to [7] is implemented
for distributing sorted vertices in a 1D partitioning scheme. Meanwhile, other 1D partitioning algo-
rithms like [72, 73] strive for a more equitable distribution of edges to improve overall load balance.
Optimizing communication overhead is critical in triangle counting, given its communication-
intensive nature [149]. The efficiency in distributed environments is significantly impacted by the fre-
quency and volume of communications. One effective strategy is message aggregation, where small
messages are consolidated into larger ones for collective transmission, reducing communication fre-
quency and potentially optimizing bandwidth [72, 73, 138, 149]. Expander decomposition is another
approach used in several studies [38, 39, 72]. It segments graphs into partitions of varying densities,
assigning computational tasks based on these densities. High-density partitions are processed
within single machines to minimize communication overhead. In [150], the caching of RMA (remote memory access) results is
studied to reduce remote data requests and communication demands. DistTC [85] introduces a fully
asynchronous triangle counting algorithm, using boundary vertices and their connecting edges as
proxies replicated across machines to enable independent triangle counting without inter-machine
communication. Finally, the study in [7] implements a pruning technique: specifically,
during the intersection of the adjacency lists of an edge’s two endpoints 𝑢𝑖 and 𝑢𝑗 , only the triangle-closing vertices
𝑢𝑘 satisfying 𝑘 > 𝑗 are considered, reducing unnecessary computations and communications.
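A compact sketch of per-edge adjacency-list intersection with this kind of ordering-based pruning (here simply by vertex ID) is given below; it guarantees that every triangle is counted exactly once, while the exact ordering used in [7] is an assumption for illustration.

```python
def count_triangles(adj):
    """Count each triangle once by intersecting neighbor sets restricted to larger IDs."""
    higher = {v: {u for u in nbrs if u > v} for v, nbrs in adj.items()}
    count = 0
    for v, nbrs in higher.items():
        for u in nbrs:
            count += len(nbrs & higher[u])           # common higher-ID neighbors close triangles
    return count

# A triangle {0, 1, 2} plus the extra edge (1, 3): exactly one triangle.
print(count_triangles({0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}))
```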
𝑘-Clique is defined as a subgraph that contains 𝑘 vertices, with each pair of vertices being intercon-
nected by an edge. As a local task, the computation of 𝑘-clique is well-suited for parallel processing.
However, the high computational complexity of 𝑘-clique computations, typically requiring 𝑂 (𝑘²𝑛ᵏ)
time in single-machine environments as noted by Shahrivari and Jalili [143], poses a significant chal-
lenge. During the computation, each vertex is involved in sending/receiving substantial amounts of
information to/from its neighbors. Consequently, key focuses in optimizing 𝑘-clique computations
include achieving load balance, reducing communication, and managing bandwidth constraints.
For workload balancing, the study by Censor-Hillel et al. [35] introduces the generalized partition
trees, which are constructed by a partial-pass streaming algorithm: it adds vertices one by one to
the current partition, and either continues if the number of edges is within a threshold, or creates
a new partition otherwise. This method is designed to ensure equitable workload distribution,
with each machine processing a similar number of edges.
Many research efforts in the field of 𝑘-clique computation are directed toward reducing commu-
nication overhead. A notable example presented in [35] introduces a two-tiered communication
granularity, comprising auxiliary tokens and main tokens. The core strategy of this approach is
to prioritize the transmission of main tokens between vertices, resorting to the transmission of
auxiliary tokens only when absolutely necessary. By adopting this method, the algorithm effectively
reduces the overall volume of communication. Additionally, the algorithm in [34] contributes to
reducing communication overhead by eliminating vertices with lower degrees and their connected
edges in each iteration. A similar pruning strategy is employed in KCminer [143]. In this approach,
vertices with fewer than 𝑘 neighbors are discarded before computation begins.
workload queue, into which intermediate results from completed tasks are subdivided into smaller
tasks. These smaller tasks are then dynamically distributed to available working threads as needed.
G-Miner, also proposed in [44], supports dynamic load balancing as well. It utilizes a strategy that
allows workers to “steal” unprocessed tasks from others. This method ensures that all workers
are actively engaged and effectively redistributes tasks to balance the overall workload across the
system.
Subgraph mining necessitates strategies to minimize communication overhead. A widely adopted
approach, similar to that of subgraph matching, involves caching frequently accessed vertex
data [44, 47, 49]. This caching mechanism ensures that critical data is readily available, thereby
reducing repetitive data requests across the network. In addition to caching, the research in [44]
introduces a task priority queue to organize tasks with common remote candidates. This grouping
allows related data to be transferred in a single operation, reducing the frequency and volume of
remote data fetches. Furthermore, the study in [153] explores the use of a personalized broadcast
operation. This method involves selectively assigning data to relevant machines based on their
specific needs or the tasks they are performing, aiming at minimizing unnecessary communication.
4.7 Covering
Covering tasks comprise a category of optimization problems centered around identifying sets of
vertices or edges that adhere to specific constraints. Minimum vertex covering seeks the smallest set
of vertices covering all edges, maximum matching aims to find the largest set of non-overlapping
edges, and graph coloring assigns colors to vertices such that no adjacent vertices share the same
color, with each problem focusing on different aspects of graph structure and optimization. Minimum
vertex covering and graph coloring are NP-hard, and even maximum matching, though solvable in polynomial
time, is computationally intensive and difficult to parallelize in distributed settings.
Consequently, over half of the reviewed papers focus on enhancing parallelism (64.29%). Addi-
tionally, the complexities of computation pose issues with load balancing, and 7.14% of the surveyed
papers target it. Moreover, these algorithms often necessitate decisions based on neighboring
vertices’ data, making the reduction of communication overhead (28.57%) another critical concern.
Minimum Vertex Covering is challenging due to its NP-hard nature, which is further com-
pounded in distributed environments. Recent work [45] tries to find a strict Nash Equilibrium state
as an approximate, yet near-optimal, solution. In this context, vertices act as strategic agents in a
snowdrift game with adjacent vertices, representing a local optimization process. The game’s strict
Nash Equilibrium corresponds to a vertex cover, aligning the strategy with the overall objective.
This game-based algorithm is well-suited to the message-passing model, allowing each vertex to
select a beneficial neighbor based on both past and present decisions through rational adjustments.
The level of parallelism achieved can significantly influence the rate of convergence. With limited
local storage, where a vertex can only retain information on one prior selection, the process may
fail to reach a strict Nash Equilibrium and get stuck in a deadlock. To overcome these limitations
and expedite convergence, some algorithms employ randomization strategies that permit vertices
to select neighbors randomly instead of rationally, thereby introducing a chance to escape
the deadlock [45, 46]. Further enhancements in parallelism can be realized by expanding the local
memory capacity, allowing for the storage of additional previous choices [151].
Maximum Matching seeks the largest collection of edges in a graph where no two edges share a
common vertex. A traditional algorithm to find a maximum matching is the Hungarian algorithm [96],
which iteratively searches for an augmenting path from each unmatched vertex. An augmenting
path, consisting of alternating matched and unmatched edges, is leveraged to increment the size of
the matching. The Hopcroft-Karp algorithm [68], optimized for parallelism, can generate multiple
augmenting paths within a single iteration. While these algorithms are fundamentally designed for
bipartite graphs, they can be adapted for use with general graphs.
Fig. 8. Application scenarios of graph tasks. Each arrow indicates that academic papers mention the application of graph tasks within the corresponding topic to the target scenarios. (Figure labels include Centrality, Similarity, Community Detection, Cohesive Subgraph, Similar Users, Social Network Analysis, and Recommendation Systems.)
The augmenting process is the bottleneck for parallelism. An efficient augmenting-path algorithm
can reduce the number of iteration rounds [2]. Another approach is to simulate multiple augmenting
paths locally, since these paths are typically short [61]. Recently, Sparse Matrix-Vector Multiplication (SpMV)
has been well studied, and the augmenting process can be rewritten in terms of matrix operations [12];
fast matrix computation can significantly increase parallelism given abundant computing resources.
Another challenge is the high communication cost, stemming from the need for each augmenting
path search to traverse the entire graph. However, since augmenting paths are typically short,
operating on a strategically chosen subgraph with fewer vertices can be more efficient without
compromising the quality of the result [55].
Graph Coloring algorithms iteratively select an independent set (IS) for coloring. An IS is a set
of vertices in which no two are adjacent, allowing for simultaneous conflict-free coloring. Each
vertex in the IS greedily selects the lowest available color not used by its neighbors. The IS can be
obtained by some sampling methods (e.g., Maximal Independent Set) or priority methods.
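A minimal sketch of one round-based realization of this scheme is given below, using random priorities to form the independent set (a Jones-Plassmann-style rule, chosen here purely for illustration); the cited works use more elaborate sampling or priority methods.

```python
import random

def is_based_coloring(adj, seed=0):
    """Round-based coloring: local priority maxima form the IS and take the lowest free color."""
    rng = random.Random(seed)
    priority = {v: rng.random() for v in adj}
    color = {}
    while len(color) < len(adj):
        independent_set = [
            v for v in adj
            if v not in color
            and all(u in color or priority[v] > priority[u] for u in adj[v])
        ]
        for v in independent_set:                    # IS members are never adjacent to each other
            used = {color[u] for u in adj[v] if u in color}
            c = 0
            while c in used:
                c += 1
            color[v] = c
    return color

# A triangle (vertices 0, 1, 2) plus a pendant vertex 3: the triangle needs three colors.
print(is_based_coloring({0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}))
```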
To enhance parallelism, one strategy is to identify a larger IS, coloring more vertices concurrently.
A vertex-cut-based algorithm partitions the graph into components by removing vertex-cuts,
allowing IS algorithms to be efficiently applied to these smaller, more manageable components [129].
Another approach optimizes the coloring priority scheme so that lower-priority vertices get colored
earlier, reducing wait times for higher-priority decisions [3].
Graph coloring algorithms often incur excessive communication overhead towards the end of the
process due to increased dependency on a dwindling number of uncolored vertices. A straightfor-
ward solution is to precompute frequently occurring structures, thereby coloring them in a single
instance [129], or to defer their coloring to a later stage [42]. Building on this concept, a subgraph-
based algorithm decomposes the main graph into several subgraphs, prioritizes the coloring of inter-
nal vertices, and subsequently resolves color conflicts that arise among the boundary vertices [21].
We provide descriptions of the performance differences between various graph tasks on certain topics
and explain how these differences result in different application scenarios.
Centrality. In social network analysis, PageRank and personalized PageRank identify influential
individuals [82, 112]. Closeness and betweenness centrality, calculated based on the shortest path,
have lower efficiency but are more suitable for identifying key vertices in information transmission.
For example, in biological network analysis, betweenness centrality identifies crucial genes or
proteins that act as key intermediaries [59]. In supply chain management, betweenness centrality
finds important intermediaries in logistics pathways, improving efficiency and reducing costs [160].
Closeness centrality helps optimize logistics routes by identifying key vertices.
Community Detection. In social network analysis, these algorithms are used to identify densely
connected communities [13, 116, 178]. Users within the same community often share common
interests, making community detection algorithms highly useful for recommendation systems.
Louvain is an iterative method that optimizes modularity by progressively merging vertices and
communities, ultimately forming a stable community structure. Due to its efficiency and scalability,
Louvain is well-suited for handling large networks [142, 178]. LPA is fast, though it may produce
unstable results in large networks, making it well suited for real-time data stream analysis [10, 108, 116]. The connected
components algorithm is low-cost and suitable for basic network analysis [89, 110].
Similarity. In social network analysis, similarity algorithms identify users with similar interests,
which is helpful for friend recommendation. In biological networks, they compare genes or proteins to find those
with similar functions, aiding research and drug development. In recommendation systems, they can be
used to recommend products similar to those a user has browsed or purchased, enhancing
the user experience and increasing sales [114, 115].
Cohesive Subgraph. Cohesive subgraphs have significant applications in various networks. In
social networks, they help identify tightly connected user groups, like interest-based communi-
ties [37, 105, 120, 144]. In biological networks, they find gene groups with similar expressions,
indicating potential functions [123, 152]. In e-commerce networks, they identify user groups with
similar behavior, improving recommendation accuracy. Besides, different algorithms are suited
to specific scenarios based on their trade-off between efficiency and subgraph density. For
example, the maximum clique algorithm suits scenarios needing very tight connections but less time
sensitivity, while 𝑘-Core works well for large networks due to its efficiency and scalability [105, 168].
Traversal. Traversal algorithms have diverse applications due to their different definitions. In
road networks, BFS is used for simple tasks like finding the shortest path in an unweighted
map [57], while SSSP handles complex networks, considering direction and traffic [84]. In supply
chain management, the minimum spanning tree reduces total transportation costs, and maximum
flow optimizes goods flow, increasing logistics throughput [93]. Cycle detection identifies money
laundering in financial networks and is used in operating systems to detect deadlocks [67].
Pattern Matching. In social network analysis, triangle counting evaluates community cohe-
sion and aids clustering [38, 85, 173], while 𝑘-Clique detection identifies tightly connected user
groups [143]. In biological networks, 𝑘-Clique identifies protein interaction clusters, and subgraph
matching locates specific gene clusters [44, 153, 154]. In financial networks, subgraph
matching and subgraph mining detect abnormal transaction patterns and potential fraudulent
activities [44].
Covering. The algorithms are commonly used for optimization of resource allocation and supply
chain management [2]. Minimum vertex cover identifies key vertices to monitor, optimizing resource
allocation [46]. For example, in sensor networks, minimum vertex cover helps in placing the least
number of sensors to monitor all critical points. Maximum matching pairs vertices to maximize the
number of edges, useful in optimizing resource distribution [12]. Graph coloring assigns colors to
vertices to minimize conflicts, essential in scheduling and resource allocation tasks [21].
When vertices are distributed across separate machines, they require network communication to
exchange data, making communication overhead a widespread issue in distributed computing.
Opportunity. Minimizing communication overhead remains a critical challenge for many tasks
within distributed settings. Existing work typically employs one or two strategies to optimize
communication overhead, such as the Subgraph-Centric model or pruning strategies. These meth-
ods are broadly applicable and generally suited for almost all tasks. Combining multiple optimization
techniques could further reduce communication overhead. Moreover, the integration of deep
learning techniques to predict and dynamically adjust communication patterns based on the
current network state and workload distribution is also gaining increased attention.
• Bandwidth: Relatively few studies concentrate on bandwidth issues; however, for some computation-
intensive tasks such as pattern matching, as well as for algorithms utilizing random walks, methods have been
proposed to mitigate this challenge. Bandwidth is commonly not a limiting factor for
the performance of tasks that are not communication-intensive or when processing sparse graph
data. Furthermore, strategies aimed at minimizing communication overhead and achieving load
balance often help to alleviate bandwidth constraints.
Opportunity. Nevertheless, as graph data scales continue to grow, the importance of bandwidth
management is expected to escalate. Addressing this issue is inherently more complex than other
challenges because it involves preventing network congestion while also maximizing bandwidth
efficiency to avoid resource wastage. Future research will likely place increased emphasis on
effectively tackling bandwidth challenges, potentially through the development of advanced
network management techniques, like adaptive bandwidth allocation based on live traffic analysis.
7 CONCLUSIONS
Graphs are well suited to representing relationships among entities. Analyzing and processing large-scale graph
data has been applied in many applications, such as social network analysis, recommendation
systems, and road network routing. Distributed graph processing provides a solution for efficiently
handling large-scale graph data in the real world. To understand the state-of-the-art research of
graph tasks in distributed environments and facilitate its development, in this paper, we conduct
an extensive survey on distributed graph tasks.
We first overview the existing distributed graph processing infrastructure. These tools facilitate
the design of distributed algorithms, but it is still difficult to overcome the challenges arising from
the inherent characteristics of distributed systems and graphs. We then analyze and summarize
the main challenges and solutions for graph tasks in distributed environments. Next, we provide
a taxonomy of primary graph tasks and conduct a detailed analysis of the existing efforts in
distributed environments, including the challenges they focus on and the unique insights used to solve those
challenges. Finally, we discuss the research focus and the existing research gaps in the field of
distributed graph processing and identify potential future research opportunities.
ACKNOWLEDGMENTS
Long Yuan is supported by NSFC 62472225. Xuemin Lin is supported by NSFC U2241211 and
U20B2046.
REFERENCES
[1] James Abello, Frank van Ham, and Neeraj Krishnan. 2006. ASK-GraphView: A Large Scale Graph Visualization
System. TVCG 12, 5 (2006), 669–676.
[2] Mohamad Ahmadi, Fabian Kuhn, and Rotem Oshman. 2018. Distributed Approximate Maximum Matching in the
CONGEST Model. In DISC 2018 (LIPIcs, Vol. 121). 6:1–6:17.
[3] Ghadeer Alabandi, Evan Powers, and Martin Burtscher. 2020. Increasing the parallelism of graph coloring via
shortcutting. In PPoPP 2020. 262–275.
[4] Mehdi Alemi and Hassan Haghighi. 2019. KTMiner: distributed k-truss detection in big graphs. Inf. Syst. 83 (2019),
195–216.
[5] Khaled Ammar, Frank McSherry, Semih Salihoglu, and Manas Joglekar. 2018. Distributed Evaluation of Subgraph
Queries Using Worst-case Optimal and Low-Memory Dataflows. Proc. VLDB Endow. 11, 6 (2018), 691–704.
[6] Renzo Angles, Marcelo Arenas, Pablo Barceló, Aidan Hogan, Juan L. Reutter, and Domagoj Vrgoc. 2017. Foundations
of Modern Query Languages for Graph Databases. ACM Comput. Surv. 50, 5 (2017), 68:1–68:40.
[7] Shaikh Arifuzzaman, Maleq Khan, and Madhav V. Marathe. 2017. Distributed-Memory Parallel Algorithms for
Counting and Listing Triangles in Big Graphs. CoRR abs/1706.05151 (2017). arXiv:1706.05151
[8] Joe Armstrong. 2010. erlang. Commun. ACM 53, 9 (2010), 68–75.
[9] Arthur U. Asuncion, Padhraic Smyth, and Max Welling. 2008. Asynchronous Distributed Learning of Topic Models.
In NIPS 2008. Curran Associates, Inc., 81–88.
[10] Jean-Philippe Attal, Maria Malek, and Marc Zolghadri. 2019. Parallel and distributed core label propagation with
graph coloring. Concurr. Comput. Pract. Exp. 31, 2 (2019), e4355.
[11] Ching Avery. 2011. Giraph: Large-scale graph processing infrastructure on hadoop. Proceedings of the Hadoop Summit.
Santa Clara 11, 3 (2011), 5–9.
[12] Ariful Azad and Aydin Buluç. 2015. Distributed-Memory Algorithms for Maximal Cardinality Matching Using Matrix
Algebra. In CLUSTER 2015. 398–407.
[13] Mehdi Azaouzi and Lotfi Ben Romdhane. 2017. An evidential influence-based label propagation algorithm for
distributed community detection in social networks. Procedia computer science 112 (2017), 407–416.
[14] Nitin Chandra Badam and Yogesh Simmhan. 2014. Subgraph Rank: PageRank for Subgraph-Centric Distributed
Graph Processing. In COMAD 2014. 38–49.
[15] Rosa M. Badia, Javier Conejero, Jorge Ejarque, Daniele Lezzi, and Francesc Lordan. 2022. PyCOMPSs as an Instrument
for Translational Computer Science. Comput. Sci. Eng. 24, 2 (2022), 79–84.
[16] Henri E. Bal, Jennifer G. Steiner, and Andrew S. Tanenbaum. 1989. Programming Languages for Distributed Computing
Systems. ACM Comput. Surv. 21, 3 (1989), 261–322.
[17] Ruben Becker, Sebastian Forster, Andreas Karrenbauer, and Christoph Lenzen. 2021. Near-Optimal Approximate
Shortest Paths and Transshipment in Distributed and Streaming Models. SIAM J. Comput. 50, 3 (2021), 815–856.
[18] Massimo Bernaschi, Giancarlo Carbone, Enrico Mastrostefano, and Flavio Vella. 2015. Solutions to the st-connectivity
problem using a GPU-based distributed BFS. J. Parallel and Distrib. Comput. 76 (2015), 145–153.
[19] Maciej Besta, Raghavendra Kanakagiri, Harun Mustafa, Mikhail Karasikov, Gunnar Rätsch, Torsten Hoefler, and
Edgar Solomonik. 2020. Communication-Efficient Jaccard similarity for High-Performance Distributed Genome
Comparisons. In IPDPS 2020. 1122–1132.
[20] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communi-
ties in large networks. Journal of statistical mechanics: theory and experiment 2008, 10 (2008), P10008.
[21] Ian Bogle, Erik G. Boman, Karen D. Devine, Sivasankaran Rajamanickam, and George M. Slota. 2020. Distributed
Memory Graph Coloring Algorithms for Multiple GPUs. In IA3 2020. 54–62.
[22] Matthias Bonne and Keren Censor-Hillel. 2019. Distributed Detection of Cliques in Dynamic Networks. In ICALP
2019 (LIPIcs, Vol. 132). 132:1–132:15.
[23] Karsten M. Borgwardt, Cheng Soon Ong, Stefan Schönauer, S. V. N. Vishwanathan, Alexander J. Smola, and Hans-Peter
Kriegel. 2005. Protein function prediction via graph kernels. In ISMB 2005. 47–56.
[24] Sarra Bouhenni, Saïd Yahiaoui, Nadia Nouali-Taboudjemat, and Hamamache Kheddouci. 2022. A Survey on Distributed
Graph Pattern Matching in Massive Graphs. ACM Comput. Surv. 54, 2 (2022), 36:1–36:35.
[25] Ulrik Brandes. 2001. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology 25, 2 (2001), 163–177.
[26] Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer
networks and ISDN systems 30, 1-7 (1998), 107–117.
[27] Andrei Z. Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew
Tomkins, and Janet L. Wiener. 2000. Graph structure in the Web. Comput. Networks 33, 1-6 (2000), 309–320.
[28] Federico Busato and Nicola Bombieri. 2015. BFS-4K: An Efficient Implementation of BFS for Kepler GPU Architectures.
IEEE Trans. Parallel Distributed Syst. 26, 7 (2015), 1826–1838.
[29] Federico Busato and Nicola Bombieri. 2016. An Efficient Implementation of the Bellman-Ford Algorithm for Kepler
GPU Architectures. IEEE Trans. Parallel Distributed Syst. 27, 8 (2016), 2222–2233.
[30] Huanqi Cao, Yuanwei Wang, Haojie Wang, Heng Lin, Zixuan Ma, Wanwang Yin, and Wenguang Chen. 2022. Scaling
graph traversal to 281 trillion edges with 40 million cores. In PPoPP 2022. 234–245.
[31] Nairen Cao, Jeremy T. Fineman, and Katina Russell. 2021. Brief Announcement: An Improved Distributed Approximate
Single Source Shortest Paths Algorithm. In PODC 2021. 493–496.
[32] Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. 2015. Apache
Flink™: Stream and Batch Processing in a Single Engine. IEEE Data Eng. Bull. 38, 4 (2015), 28–38.
[33] Keren Censor-Hillel, Eldar Fischer, Gregory Schwartzman, and Yadu Vasudev. 2019. Fast distributed algorithms for
testing graph properties. Distributed Comput. 32, 1 (2019), 41–57.
[34] Keren Censor-Hillel, François Le Gall, and Dean Leitersdorf. 2020. On Distributed Listing of Cliques. In PODC 2020.
474–482.
[35] Keren Censor-Hillel, Dean Leitersdorf, and David Vulakh. 2022. Deterministic Near-Optimal Distributed Listing of
Cliques. In PODC 2022. 271–280.
[36] Venkatesan T. Chakaravarthy, Fabio Checconi, Prakash Murali, Fabrizio Petrini, and Yogish Sabharwal. 2017. Scalable
Single Source Shortest Path Algorithms for Massively Parallel Systems. IEEE Trans. Parallel Distributed Syst. 28, 7
(2017), 2031–2045.
[37] T-H Hubert Chan, Mauro Sozio, and Bintao Sun. 2021. Distributed approximate k-core decomposition and min–max
edge orientation: Breaking the diameter barrier. J. Parallel and Distrib. Comput. 147 (2021), 87–99.
[38] Yi-Jun Chang, Seth Pettie, Thatchaphol Saranurak, and Hengjie Zhang. 2021. Near-optimal Distributed Triangle
Enumeration via Expander Decompositions. J. ACM 68, 3 (2021), 21:1–21:36.
[39] Yi-Jun Chang and Thatchaphol Saranurak. 2019. Improved Distributed Expander Decomposition and Nearly Optimal
Triangle Enumeration. In PODC 2019. 66–73.
[40] Barbara M. Chapman, Tony Curtis, Swaroop Pophale, Stephen W. Poole, Jeffery A. Kuehn, Chuck Koelbel, and Lauren
Smith. 2010. Introducing OpenSHMEM: SHMEM for the PGAS community. In PGAS 2010. ACM, 2.
[41] Philippe Charles, Christian Grothoff, Vijay A. Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu,
Christoph von Praun, and Vivek Sarkar. 2005. X10: an object-oriented approach to non-uniform cluster computing.
In OOPSLA 2005. ACM, 519–538.
[42] Shiri Chechik and Doron Mukhtar. 2019. Optimal Distributed Coloring Algorithms for Planar Graphs in the LOCAL
model. In SODA 2019. 787–804.
[43] Shiri Chechik and Doron Mukhtar. 2022. Single-source shortest paths in the CONGEST model with improved bounds.
Distributed Comput. 35, 4 (2022), 357–374.
[44] Hongzhi Chen, Miao Liu, Yunjian Zhao, Xiao Yan, Da Yan, and James Cheng. 2018. G-Miner: an efficient task-oriented
graph mining system. In EuroSys 2018. 32:1–32:12.
[45] Jie Chen and Xiang Li. 2021. A Minimal Memory Game-Based Distributed Algorithm to Vertex Cover of Networks. In
ISCAS 2021. 1–5.
[46] Jie Chen and Xiang Li. 2023. Toward the minimum vertex cover of complex networks using distributed potential
games. Sci. China Inf. Sci. 66, 1 (2023), 112205.
[47] Jingji Chen and Xuehai Qian. 2023. Khuzdul: Efficient and Scalable Distributed Graph Pattern Mining Engine. In
ASPLOS 2023. 413–426.
[48] Pei-Ling Chen, Chung-Kuang Chou, and Ming-Syan Chen. 2014. Distributed algorithms for k-truss decomposition.
In IEEE BigData 2014. 471–480.
[49] Xuhao Chen, Tianhao Huang, Shuotao Xu, Thomas Bourgeat, Chanwoo Chung, and Arvind. 2021. FlexMiner: A
Pattern-Aware Accelerator for Graph Pattern Mining. In ISCA 2021. 581–594.
[50] Aaron Clauset, Cosma Rohilla Shalizi, and Mark E. J. Newman. 2009. Power-Law Distributions in Empirical Data.
SIAM Rev. 51, 4 (2009), 661–703.
[51] Mirel Cosulschi, Mihai Gabroveanu, Florin Slabu, and Adriana Sbircea. 2015. Scaling Up a Distributed Computing Of
Similarity Coefficient with Mapreduce. Int. J. Comput. Sci. Appl. 12, 2 (2015), 81–98.
[52] Pierluigi Crescenzi, Pierre Fraigniaud, and Ami Paz. 2020. Simple and Fast Distributed Computation of Betweenness
Centrality. In INFOCOM 2020. 337–346.
[53] Pawel Czarnul, Jerzy Proficz, and Krzysztof Drypczewski. 2020. Survey of Methodologies, Approaches, and Chal-
lenges in Parallel Programming Using High-Performance Computing Systems. Sci. Program. 2020 (2020), 4176794:1–
4176794:19.
[54] Guohao Dai, Tianhao Huang, Yuze Chi, Jishen Zhao, Guangyu Sun, Yongpan Liu, Yu Wang, Yuan Xie, and Huazhong
Yang. 2019. GraphH: A Processing-in-Memory Architecture for Large-Scale Graph Processing. IEEE Trans. Comput.
Aided Des. Integr. Circuits Syst. 38, 4 (2019), 640–653.
[55] Varsha Dani, Aayush Gupta, Thomas P. Hayes, and Seth Pettie. 2023. Wake up and join me! An energy-efficient
algorithm for maximal matching in radio networks. Distributed Comput. 36, 3 (2023), 373–384.
[56] Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In OSDI 2004.
137–150.
[57] Stéphane Devismes and Colette Johnen. 2016. Silent self-stabilizing BFS tree algorithms revisited. J. Parallel and
Distrib. Comput. 97 (2016), 11–23.
[58] David Dominguez-Sal, Norbert Martínez-Bazan, Victor Muntés-Mulero, Pere Baleta, and Josep Lluís Larriba-Pey.
2010. A Discussion on the Design of Graph Database Benchmarks. In TPCTC 2010 (Lecture Notes in Computer Science,
Vol. 6417). Springer, 25–40.
[59] Kevin Durant and Stephan Wagner. 2017. On the distribution of betweenness centrality in random trees. Theoretical
Computer Science 699 (2017), 33–52.
[60] Michael Elkin. 2020. Distributed Exact Shortest Paths in Sublinear Time. J. ACM 67, 3 (2020), 15:1–15:36.
[61] Guy Even, Moti Medina, and Dana Ron. 2015. Distributed Maximum Matching in Bounded Degree Graphs. In
Proceedings of the 2015 International Conference on Distributed Computing and Networking, ICDCN 2015. 18:1–18:10.
[62] Wenfei Fan, Tao He, Longbin Lai, Xue Li, Yong Li, Zhao Li, Zhengping Qian, Chao Tian, Lei Wang, Jingbo Xu, Youyang
Yao, Qiang Yin, Wenyuan Yu, Kai Zeng, Kun Zhao, Jingren Zhou, Diwen Zhu, and Rong Zhu. 2021. GraphScope: A
Unified Engine For Big Graph Processing. Proc. VLDB Endow. 14, 12 (2021), 2879–2892.
[63] Xing Feng, Lijun Chang, Xuemin Lin, Lu Qin, Wenjie Zhang, and Long Yuan. 2018. Distributed computing connected
components with linear communication cost. Distributed Parallel Databases 36, 3 (2018), 555–592.
[64] Sebastian Forster and Danupon Nanongkai. 2018. A Faster Distributed Single-Source Shortest Paths Algorithm. In
FOCS 2018. 686–697.
[65] Message Passing Interface Forum. 1993. MPI: A Message Passing Interface. 878–883.
[66] Cédric Fournet, Fabrice Le Fessant, Luc Maranget, and Alan Schmitt. 2002. JoCaml: A language for concurrent
distributed and mobile programming. In International School on Advanced Functional Programming. 129–158.
[67] Pierre Fraigniaud and Dennis Olivetti. 2017. Distributed Detection of Cycles. In Proceedings of the 29th ACM Symposium
on Parallelism in Algorithms and Architectures, SPAA 2017. 153–162.
[68] Harold N. Gabow. 2017. The Weighted Matching Approach to Maximum Cardinality Matching. Fundam. Informaticae
154, 1-4 (2017), 109–130.
[69] Robert G. Gallager, Pierre A. Humblet, and Philip M. Spira. 1983. A Distributed Algorithm for Minimum-Weight
Spanning Trees. ACM Trans. Program. Lang. Syst. 5, 1 (1983), 66–77.
[70] Juan A. Garay, Shay Kutten, and David Peleg. 1998. A Sublinear Time Distributed Algorithm for Minimum-Weight
Spanning Trees. SIAM J. Comput. 27, 1 (1998), 302–316.
[71] Mohsen Ghaffari, Andreas Karrenbauer, Fabian Kuhn, Christoph Lenzen, and Boaz Patt-Shamir. 2018. Near-Optimal
Distributed Maximum Flow. SIAM J. Comput. 47, 6 (2018), 2078–2117.
[72] Sayan Ghosh. 2022. Improved Distributed-memory Triangle Counting by Exploiting the Graph Structure. In HPEC
2022. 1–6.
[73] Sayan Ghosh and Mahantesh Halappanavar. 2020. TriC: Distributed-memory Triangle Counting by Exploiting the
Graph Structure. In HPEC 2020. 1–6.
[74] Sayan Ghosh, Mahantesh Halappanavar, Antonino Tumeo, Ananth Kalyanaraman, Hao Lu, Daniel G. Chavarría-
Miranda, Arif Khan, and Assefaw Hadish Gebremedhin. 2018. Distributed Louvain Algorithm for Graph Community
Detection. In 2018 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2018. 885–895.
[75] Andrew V. Goldberg and Chris Harrelson. 2005. Computing the shortest path: A search meets graph theory. In SODA
2005. 156–165.
[76] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed
Graph-Parallel Computation on Natural Graphs. In OSDI 2012. 17–30.
[77] Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. 2014. GraphX:
Graph Processing in a Distributed Dataflow Framework. In OSDI 2014. 599–613.
[78] William Gropp. 2001. Learning from the Success of MPI. In HiPC 2001 (Lecture Notes in Computer Science, Vol. 2228).
Springer, 81–94.
[79] Tao Guo, Xin Cao, Gao Cong, Jiaheng Lu, and Xuemin Lin. 2017. Distributed Algorithms on Exact Personalized
PageRank. In SIGMOD 2017. 479–494.
[80] Shubhankar Gupta and Suresh Sundaram. 2023. Moving-Landmark Assisted Distributed Learning Based Decentralized
Cooperative Localization (DL-DCL) with Fault Tolerance. In AAAI 2023. 6175–6182.
[81] Minyang Han and Khuzaima Daudjee. 2015. Giraph Unchained: Barrierless Asynchronous Parallel Execution in
Pregel-like Graph Processing Systems. Proc. VLDB Endow. 8, 9 (2015), 950–961.
[82] Yiran He and Hoi-To Wai. 2021. Provably Fast Asynchronous And Distributed Algorithms For Pagerank Centrality
Computation. In ICASSP 2021. 5050–5054.
[83] Safiollah Heidari, Yogesh Simmhan, Rodrigo N. Calheiros, and Rajkumar Buyya. 2018. Scalable Graph Processing
Frameworks: A Taxonomy and Open Challenges. ACM Comput. Surv. 51, 3 (2018), 60:1–60:53.
[84] Monika Henzinger, Sebastian Krinninger, and Danupon Nanongkai. 2016. A deterministic almost-tight distributed
algorithm for approximating single-source shortest paths. In STOC 2016. 489–498.
[85] Loc Hoang, Vishwesh Jatala, Xuhao Chen, Udit Agarwal, Roshan Dathathri, Gurbinder Gill, and Keshav Pingali. 2019.
DistTC: High Performance Distributed Triangle Counting. In HPEC 2019. 1–7.
[86] Loc Hoang, Matteo Pontecorvi, Roshan Dathathri, Gurbinder Gill, Bozhi You, Keshav Pingali, and Vijaya Ramachan-
dran. 2019. A round-efficient distributed betweenness centrality algorithm. In PPoPP 2019. 272–286.
[87] Stephan Holzer and Roger Wattenhofer. 2012. Optimal distributed all pairs shortest paths and applications. In PODC
2012. 355–364.
[88] Qiang-Sheng Hua, Haoqiang Fan, Ming Ai, Lixiang Qian, Yangyang Li, Xuanhua Shi, and Hai Jin. 2016. Nearly
Optimal Distributed Algorithm for Computing Betweenness Centrality. In 36th IEEE International Conference on
Distributed Computing Systems, ICDCS 2016. 271–280.
[89] Chirag Jain, Patrick Flick, Tony Pan, Oded Green, and Srinivas Aluru. 2017. An Adaptive Parallel Algorithm for
Computing Connected Components. IEEE Trans. Parallel Distributed Syst. 28, 9 (2017), 2428–2439.
[90] Fuad T. Jamour, Spiros Skiadopoulos, and Panos Kalnis. 2018. Parallel Algorithm for Incremental Betweenness
Centrality on Large Graphs. IEEE Trans. Parallel Distributed Syst. 29, 3 (2018), 659–672.
[91] Glen Jeh and Jennifer Widom. 2002. SimRank: a measure of structural-context similarity. In KDD 2002. 538–543.
[92] Yichuan Jiang. 2016. A Survey of Task Allocation and Load Balancing in Distributed Systems. IEEE Trans. Parallel
Distributed Syst. 27, 2 (2016), 585–599.
[93] Changhee Joo, Xiaojun Lin, Jiho Ryu, and Ness B. Shroff. 2016. Distributed Greedy Approximation to Maximum
Weighted Independent Set for Scheduling With Fading Channels. IEEE/ACM Trans. Netw. 24, 3 (2016), 1476–1488.
[94] Vasiliki Kalavri, Vladimir Vlassov, and Seif Haridi. 2018. High-Level Programming Abstractions for Distributed Graph
Processing. IEEE Trans. Knowl. Data Eng. 30, 2 (2018), 305–324.
[95] George Karypis and Vipin Kumar. 1998. A software package for partitioning unstructured graphs, partitioning
meshes, and computing fill-reducing orderings of sparse matrices. University of Minnesota, Department of Computer
Science and Engineering, Army HPC Research Center, Minneapolis, MN 38 (1998), 7–1.
[96] Harold W Kuhn. 1955. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 2, 1-2 (1955), 83–97.
[97] Longbin Lai, Lu Qin, Xuemin Lin, and Lijun Chang. 2015. Scalable Subgraph Enumeration in MapReduce. Proc. VLDB
Endow. 8, 10 (2015), 974–985.
[98] Longbin Lai, Zhu Qing, Zhengyi Yang, Xin Jin, Zhengmin Lai, Ran Wang, Kongzhang Hao, Xuemin Lin, Lu Qin,
Wenjie Zhang, Ying Zhang, Zhengping Qian, and Jingren Zhou. 2019. Distributed Subgraph Matching on Timely
Dataflow. Proc. VLDB Endow. 12, 10 (2019), 1099–1112.
[99] Longbin Lai, Yufan Yang, Zhibin Wang, Yuxuan Liu, Haotian Ma, Sijie Shen, Bingqing Lyu, Xiaoli Zhou, Wenyuan Yu,
Zhengping Qian, Chen Tian, Sheng Zhong, Yeh-Ching Chung, and Jingren Zhou. 2023. GLogS: Interactive Graph
Pattern Matching Query At Large Scale. In USENIX ATC 2023. USENIX Association, 53–69.
[100] Kartik Lakhotia, Rajgopal Kannan, and Viktor K. Prasanna. 2018. Accelerating PageRank using Partition-Centric
Processing. In USENIX ATC 2018. 427–440.
[101] Sebastian Lamm and Peter Sanders. 2022. Communication-efficient Massively Distributed Connected Components.
In IPDPS 2022. 302–312.
[102] Christoph Lenzen, Boaz Patt-Shamir, and David Peleg. 2019. Distributed distance computation and routing with small
messages. Distributed Comput. 32, 2 (2019), 133–157.
[103] Xue Li, Ke Meng, Lu Qin, Longbin Lai, Wenyuan Yu, Zhengping Qian, Xuemin Lin, and Jingren Zhou. 2023. Flash: A
Framework for Programming Distributed Graph Processing Algorithms. In ICDE 2023. IEEE, 232–244.
[104] Zhenguo Li, Yixiang Fang, Qin Liu, Jiefeng Cheng, Reynold Cheng, and John C. S. Lui. 2015. Walking in the Cloud:
Parallel SimRank at Scale. Proc. VLDB Endow. 9, 1 (2015), 24–35.
[105] Xuankun Liao, Qing Liu, Jiaxin Jiang, Xin Huang, Jianliang Xu, and Byron Choi. 2022. Distributed D-core Decomposi-
tion over Large Directed Graphs. Proc. VLDB Endow. 15, 8 (2022), 1546–1558.
[106] Wenqing Lin. 2019. Distributed Algorithms for Fully Personalized PageRank on Large Graphs. In WWW 2019.
1084–1094.
[107] Qing Liu, Xuankun Liao, Xin Huang, Jianliang Xu, and Yunjun Gao. 2023. Distributed (𝛼, 𝛽)-Core Decomposition
over Bipartite Graphs. In ICDE 2023. 909–921.
[108] Xu Liu, Jesun Sahariar Firoz, Marcin Zalewski, Mahantesh Halappanavar, Kevin J. Barker, Andrew Lumsdaine, and
Assefaw H. Gebremedhin. 2019. Distributed Direction-Optimizing Label Propagation for Community Detection. In
HPEC 2019. 1–6.
[109] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. 2012.
Distributed GraphLab: A Framework for Machine Learning in the Cloud. Proc. VLDB Endow. 5, 8 (2012), 716–727.
[110] Alessandro Lulli, Emanuele Carlini, Patrizio Dazzi, Claudio Lucchese, and Laura Ricci. 2017. Fast Connected Compo-
nents Computation in Large Graphs by Vertex Pruning. IEEE Trans. Parallel Distributed Syst. 28, 3 (2017), 760–773.
[111] Siqiang Luo. 2019. Distributed PageRank Computation: An Improved Theoretical Study. In AAAI 2019. 4496–4503.
[112] Siqiang Luo. 2020. Improved Communication Cost in Distributed PageRank Computation - A Theoretical Study. In
ICML 2020 (Proceedings of Machine Learning Research, Vol. 119). 6459–6467.
[113] Siqiang Luo, Xiaowei Wu, and Ben Kao. 2022. Distributed PageRank computation with improved round complexities.
Inf. Sci. 607 (2022), 109–125.
[114] Siqiang Luo and Zulun Zhu. 2023. Massively Parallel Single-Source SimRanks in o(log n) Rounds. CoRR abs/2304.04015
(2023). arXiv:2304.04015
[115] Xiongcai Luo, Jun Gao, Chang Zhou, and Jeffrey Xu Yu. 2017. UniWalk: Unidirectional Random Walk Based Scalable
SimRank Computation over Large Graph. In ICDE 2017. 325–336.
[116] Tinghuai Ma, Mingliang Yue, Jingjing Qu, Yuan Tian, Abdullah Al-Dhelaan, and Mznah Al-Rodhaan. 2018. PSPLPA:
Probability and similarity based parallel label propagation algorithm on spark. Physica A: Statistical Mechanics and its
Applications 503 (2018), 366–378.
[117] Saeed Maleki, Donald Nguyen, Andrew Lenharth, María Jesús Garzarán, David A. Padua, and Keshav Pingali. 2016.
DSMR: a shared and distributed memory algorithm for single-source shortest path problem. In PPoPP 2016. 39:1–39:2.
[118] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz
Czajkowski. 2010. Pregel: a system for large-scale graph processing. In SIGMOD 2010. 135–146.
[119] Fragkiskos D. Malliaros, Christos Giatsidis, Apostolos N. Papadopoulos, and Michalis Vazirgiannis. 2020. The core
decomposition of networks: theory, algorithms and applications. VLDB J. 29, 1 (2020), 61–92.
[120] Aritra Mandal and Mohammad Al Hasan. 2017. A distributed k-core decomposition algorithm on spark. In IEEE
BigData 2017. 976–981.
[121] Ali Mashreghi and Valerie King. 2021. Broadcast and minimum spanning tree with o(m) messages in the asynchronous
CONGEST model. Distributed Comput. 34, 4 (2021), 283–299.
[122] Pragnesh Jay Modi, Wei-Min Shen, Milind Tambe, and Makoto Yokoo. 2005. Adopt: asynchronous distributed
constraint optimization with quality guarantees. Artif. Intell. 161, 1-2 (2005), 149–180.
[123] Alberto Montresor, Francesco De Pellegrini, and Daniele Miorandi. 2013. Distributed k-Core Decomposition. IEEE
Trans. Parallel Distributed Syst. 24, 2 (2013), 288–300.
[124] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng
Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: A Distributed Framework for Emerging AI
Applications. In OSDI 2018. USENIX Association, 561–577.
[125] Hung Q. Ngo, Ely Porat, Christopher Ré, and Atri Rudra. 2018. Worst-case Optimal Join Algorithms. J. ACM 65, 3
(2018), 16:1–16:40.
[126] Oracle. 2014. Remote Method Invocation Home. https://www.oracle.com Accessed: January 15, 2024.
[127] Santosh Pandey, Zhibin Wang, Sheng Zhong, Chen Tian, Bolong Zheng, Xiaoye S. Li, Lingda Li, Adolfy Hoisie, Caiwen
Ding, Dong Li, and Hang Liu. 2021. Trust: Triangle Counting Reloaded on GPUs. IEEE Trans. Parallel Distributed Syst.
32, 11 (2021), 2646–2660.
[128] Roger A. Pearce, Maya B. Gokhale, and Nancy M. Amato. 2014. Faster Parallel Traversal of Scale Free Graphs at
Extreme Scale with Vertex Delegates. In SC 2014. IEEE Computer Society, 549–559.
[129] Yun Peng, Byron Choi, Bingsheng He, Shuigeng Zhou, Ruzhi Xu, and Xiaohui Yu. 2016. VColor: A practical vertex-cut
based approach for coloring large graphs. In ICDE 2016. 97–108.
[130] Yossi Peretz and Yigal Fischler. 2022. A fast parallel max-flow algorithm. J. Parallel Distributed Comput. 169 (2022), 226–241.
[131] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. 2007. Near linear time algorithm to detect community
structures in large-scale networks. Physical Review E 76, 3 (2007), 036106.
[132] Emily A. Reed, Guilherme Ramos, Paul Bogdan, and Sérgio Pequito. 2023. A Scalable Distributed Dynamical Systems
Approach to Learn the Strongly Connected Components and Diameter of Networks. IEEE Trans. Autom. Control. 68, 5
(2023), 3099–3106.
[133] Luis Remis, María Jesús Garzarán, Rafael Asenjo, and Angeles G. Navarro. 2018. Exploiting social network graph
characteristics for efficient BFS on heterogeneous chips. J. Parallel Distributed Comput. 120 (2018), 282–294.
[134] Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. 2013. X-Stream: edge-centric graph processing using streaming
partitions. In SOSP 2013. 472–488.
[135] Václav Rozhon, Christoph Grunau, Bernhard Haeupler, Goran Zuzic, and Jason Li. 2022. Undirected (1+𝜖)-shortest
paths via minor-aggregates: near-optimal deterministic parallel and distributed algorithms. In STOC 2022. 478–487.
[136] Sherif Sakr, Anna Liu, and Ayman G. Fayoumi. 2013. The family of MapReduce and large-scale data processing
systems. ACM Comput. Surv. 46, 1 (2013), 11:1–11:44.
[137] Semih Salihoglu and Jennifer Widom. 2013. GPS: a graph processing system. In SSDBM 2013. 22:1–22:12.
[138] Peter Sanders and Tim Niklas Uhl. 2023. Engineering a Distributed-Memory Triangle Counting Algorithm. In IPDPS
2023. 702–712.
[139] Ahmet Erdem Sariyüce, Erik Saule, Kamer Kaya, and Ümit V. Çatalyürek. 2013. STREAMER: A distributed framework
for incremental closeness centrality computation. In CLUSTER 2013. 1–8.
[140] Atish Das Sarma, Anisur Rahaman Molla, Gopal Pandurangan, and Eli Upfal. 2015. Fast distributed PageRank
computation. Theor. Comput. Sci. 561 (2015), 113–121.
[141] Naw Safrin Sattar and Shaikh Arifuzzaman. 2018. Parallelizing Louvain Algorithm: Distributed Memory Challenges.
In DASC/PiCom/DataCom/CyberSciTech 2018. 695–701.
[142] Naw Safrin Sattar and Shaikh Arifuzzaman. 2022. Scalable distributed Louvain algorithm for community detection in
large graphs. J. Supercomput. 78, 7 (2022), 10275–10309.
[143] Saeed Shahrivari and Saeed Jalili. 2021. Efficient Distributed k-Clique Mining for Large Networks Using MapReduce.
IEEE Trans. Knowl. Data Eng. 33, 3 (2021), 964–974.
[144] Yingxia Shao, Lei Chen, and Bin Cui. 2014. Efficient cohesive subgraphs detection in parallel. In SIGMOD 2014.
613–624.
[145] Aneesh Sharma, C. Seshadhri, and Ashish Goel. 2017. When Hashes Met Wedges: A Distributed Algorithm for Finding
High Similarity Vectors. In WWW 2017. ACM, 431–440.
[146] Muhammad Shiraz, Abdullah Gani, Rashid Hafeez Khokhar, and Rajkumar Buyya. 2013. A Review on Distributed
Application Processing Frameworks in Smart Mobile Devices for Mobile Cloud Computing. IEEE Commun. Surv.
Tutorials 15, 3 (2013), 1294–1313.
[147] Yogesh Simmhan, Alok Gautam Kumbhare, Charith Wickramaarachchi, Soonil Nagarkar, Santosh Ravi, Cauligi S.
Raghavendra, and Viktor K. Prasanna. 2014. GoFFish: A Sub-graph Centric Framework for Large-Scale Graph
Analytics. In Euro-Par 2014 (Lecture Notes in Computer Science, Vol. 8632). 451–462.
[148] Nasrin Mazaheri Soudani, Afsaneh Fatemi, and Mohammadali Nematbakhsh. 2019. PPR-partitioning: a distributed
graph partitioning algorithm based on the personalized PageRank vectors in vertex-centric systems. Knowl. Inf. Syst.
61, 2 (2019), 847–871.
[149] Trevor Steil, Tahsin Reza, Keita Iwabuchi, Benjamin W. Priest, Geoffrey Sanders, and Roger Pearce. 2021. TriPoll:
computing surveys of triangles in massive-scale temporal graphs with metadata. In SC 2021. ACM, 67.
[150] András Strausz, Flavio Vella, Salvatore Di Girolamo, Maciej Besta, and Torsten Hoefler. 2022. Asynchronous
Distributed-Memory Triangle Counting and LCC with RMA Caching. In IPDPS 2022. 291–301.
[151] Changhao Sun, Huaxin Qiu, Wei Sun, Qian Chen, Li Su, Xiaochu Wang, and Qingrui Zhou. 2022. Better Approximation
for Distributed Weighted Vertex Cover via Game-Theoretic Learning. IEEE Trans. Syst. Man Cybern. Syst. 52, 8 (2022),
5308–5319.
[152] Michael Svendsen, Arko Provo Mukherjee, and Srikanta Tirthapura. 2015. Mining maximal cliques from a large graph
using MapReduce: Tackling highly uneven subproblem sizes. J. Parallel Distributed Comput. 79-80 (2015), 104–114.
[153] Nilothpal Talukder and Mohammed J. Zaki. 2016. A distributed approach for graph mining in massive networks.
Data Min. Knowl. Discov. 30, 5 (2016), 1024–1052.
[154] Carlos H. C. Teixeira, Alexandre J. Fonseca, Marco Serafini, Georgos Siganos, Mohammed J. Zaki, and Ashraf
Aboulnaga. 2015. Arabesque: a system for distributed graph mining. In SOSP 2015. 425–440.
[155] The MPI Forum. 1993. MPI: a message passing interface. In Proceedings of the 1993 ACM/IEEE Conference
on Supercomputing. 878–883.
[156] Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, and John McPherson. 2013. From
"Think Like a Vertex" to "Think Like a Graph". Proc. VLDB Endow. 7, 3 (2013), 193–204.
[157] Etsuji Tomita, Akira Tanaka, and Haruhisa Takahashi. 2004. The Worst-Case Time Complexity for Generating All
Maximal Cliques. In COCOON 2004 (Lecture Notes in Computer Science, Vol. 3106). 161–170.
[158] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson,
Krishna Gade, Maosong Fu, Jake Donham, et al. 2014. Storm @Twitter. In SIGMOD 2014. 147–156.
[159] Philip W. Trinder, Hans-Wolfgang Loidl, and Robert F. Pointon. 2002. Parallel and Distributed Haskells. J. Funct.
Program. 12, 4&5 (2002), 469–510.
[160] Christian Wallmann and Markus Gerschberger. 2020. The association between network centrality measures and
supply chain performance: The case of distribution networks. In ISM 2020, Vol. 180. Elsevier, 172–179.
[161] Yuanwei Wang, Huanqi Cao, Zixuan Ma, Wanwang Yin, and Wenguang Chen. 2022. Scaling Graph 500 SSSP to 140
Trillion Edges with over 40 Million Cores. In SC 2022. 19:1–19:15.
[162] Yue Wang, Ruiqi Xu, Zonghao Feng, Yulin Che, Lei Chen, Qiong Luo, and Rui Mao. 2020. DISK: A Distributed
Framework for Single-Source SimRank with Accuracy Guarantee. Proc. VLDB Endow. 14, 3 (2020), 351–363.
[163] Zhaokang Wang, Rong Gu, Weiwei Hu, Chunfeng Yuan, and Yihua Huang. 2019. BENU: Distributed Subgraph
Enumeration with Backtracking-Based Framework. In ICDE 2019. 136–147.
[164] Klaus Wehmuth and Artur Ziviani. 2013. DACCER: Distributed Assessment of the Closeness Centrality Ranking in
complex networks. Comput. Networks 57, 13 (2013), 2536–2548.
[165] Tongfeng Weng, Xu Zhou, Kenli Li, Peng Peng, and Keqin Li. 2022. Efficient Distributed Approaches to Core
Maintenance on Large Dynamic Graphs. IEEE Trans. Parallel Distributed Syst. 33, 1 (2022), 129–143.
[166] Tom White. 2012. Hadoop: The definitive guide. O'Reilly Media, Inc.
[167] Lizhi Xiang, Arif Khan, Edoardo Serra, Mahantesh Halappanavar, and Aravind Sukumaran-Rajam. 2021. cuTS: scaling
subgraph isomorphism on distributed multi-GPU systems using trie based data structure. In SC 2021. 69.
[168] Yanyan Xu, James Cheng, and Ada Wai-Chee Fu. 2016. Distributed Maximal Clique Computation and Management.
IEEE Trans. Serv. Comput. 9, 1 (2016), 110–122.
[169] Yanyan Xu, James Cheng, Ada Wai-Chee Fu, and Yingyi Bu. 2014. Distributed Maximal Clique Computation. In IEEE
BigData 2014. 160–167.
[170] Da Yan, James Cheng, Yi Lu, and Wilfred Ng. 2014. Blogel: A Block-Centric Framework for Distributed Computation
on Real-World Graphs. Proc. VLDB Endow. 7, 14 (2014), 1981–1992.
[171] Da Yan, James Cheng, Yi Lu, and Wilfred Ng. 2015. Effective Techniques for Message Reduction and Load Balancing
in Distributed Graph Computation. In WWW 2015. 1307–1317.
[172] Zhengyi Yang, Longbin Lai, Xuemin Lin, Kongzhang Hao, and Wenjie Zhang. 2021. HUGE: An Efficient and Scalable
Subgraph Enumeration System. In SIGMOD 2021. 2049–2062.
[173] Abdurrahman Yasar, Sivasankaran Rajamanickam, Jonathan W. Berry, and Ümit V. Çatalyürek. 2022. A Block-Based
Triangle Counting Algorithm on Heterogeneous Environments. IEEE Trans. Parallel Distributed Syst. 33, 2 (2022),
444–458.
[174] Ziqiang Yu, Xiaohui Yu, Nick Koudas, Yang Liu, Yifan Li, Yueting Chen, and Dingyu Yang. 2020. Distributed Processing
of k Shortest Path Queries over Dynamic Road Networks. In SIGMOD 2020. 665–679.
[175] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin,
Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory
Cluster Computing. In NSDI 2012. 15–28.
[176] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster
Computing with Working Sets. In HotCloud 2010.
[177] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh
Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. 2016.
Apache Spark: a unified engine for big data processing. Commun. ACM 59, 11 (2016), 56–65.
[178] Jianping Zeng and Hongfeng Yu. 2018. A Scalable Distributed Louvain Algorithm for Large-Scale Graph Community
Detection. In CLUSTER 2018. 268–278.
[179] Dongxiang Zhang, Dingyu Yang, Yuan Wang, Kian-Lee Tan, Jian Cao, and Heng Tao Shen. 2017. Distributed shortest
path query processing on dynamic road networks. VLDB J. 26, 3 (2017), 399–419.
[180] Tian Zhou, Lixin Gao, and Xiaohong Guan. 2021. A Fault-Tolerant Distributed Framework for Asynchronous Iterative
Computations. IEEE Trans. Parallel Distributed Syst. 32, 8 (2021), 2062–2073.
[181] Huanzhou Zhu, Ligang He, Matthew Leeke, and Rui Mao. 2020. WolfGraph: The edge-centric graph processing on
GPU. Future Gener. Comput. Syst. 111 (2020), 552–569.
[182] Goran Zuzic, Gramoz Goranci, Mingquan Ye, Bernhard Haeupler, and Xiaorui Sun. 2022. Universally-Optimal
Distributed Shortest Paths and Transshipment via Graph-Based ℓ1-Oblivious Routing. In SODA 2022. 2549–2579.