A Comprehensive Survey and Experimental Study of Subgraph Matching: Trends, Unbiasedness, and Interaction
1 INTRODUCTION
The search for a particular structure within a large graph is a fundamental operation in graph
mining, formally defined as subgraph matching. As shown in Figure 1, given a large data graph 𝐺
and a query graph 𝑄, subgraph matching enumerates all embeddings 𝑓 , which map every vertex in
𝑄 to the corresponding vertex of an isomorphic subgraph in 𝐺.
∗ Corresponding Author: [email protected]
Authors’ addresses: Zhijie Zhang, School of Data Science, Fudan University, Shanghai, China, [email protected];
Yujie Lu, School of Data Science, Fudan University, Shanghai, China, [email protected]; Weiguo Zheng, School of
Data Science, Fudan University, Shanghai, China, [email protected]; Xuemin Lin, Antai College of Economics
and Management, Shanghai JiaoTong University, Shanghai, China, [email protected].
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM 2836-6573/2024/2-ART60
https://doi.org/10.1145/3639315
Proc. ACM Manag. Data, Vol. 2, No. 1 (SIGMOD), Article 60. Publication date: February 2024.
1.1 Background
Subgraph matching finds broad applications in various fields such as fraud detection [62], social
network analysis [19], bioinformatics [9], and knowledge graphs [40, 60]. Subgraph matching is a
typical NP-hard problem with the worst-case time complexity $O(|V_G|^{|V_Q|})$, where $|V_G|$ and $|V_Q|$
denote the number of vertices in graphs 𝐺 and 𝑄, respectively [22]. Although the size of the query
graph is often bounded in practical application scenarios, subgraph queries with tens of nodes are
widely studied in bioinformatics [9], social networks [49], and knowledge graphs [48]. The
computational overhead for queries of this size remains prohibitively expensive and may become
a bottleneck in applications. Numerous efficient algorithms [6, 7, 11, 26, 28, 34, 38, 65, 66, 75, 81]
have been proposed for subgraph matching. In addition, there are studies exploring alternative
settings, including parallel [32, 42, 84] and distributed subgraph matching [93, 98], continuous
subgraph matching [41, 55, 76, 95], as well as GPU-based [24, 77, 88, 97] and FPGA-based
approaches [33, 89].
In this paper, we focus on subgraph matching within a single large graph under the basic setting,
i.e., in-memory, CPU-based, and single-threaded approaches. This forms the basis for studying
subgraph matching, enabling its extension to diverse scenarios. Given the crucial role of subgraph
matching, a comprehensive survey is warranted that goes beyond merely enumerating existing
algorithms, to provide an unbiased evaluation of techniques and insights into future research.
The objective of our survey is to address three central questions in the domain of subgraph
matching.
(1) What are the prevailing trends in algorithm development?
(2) How can the performance of an algorithm be assessed rigorously and without bias?
(3) How does the interaction between techniques affect the performance evaluation?
1.2 Contributions
Trends. Most methods for subgraph matching follow the filtering-ordering-enumerating framework.
In the last decade, existing algorithms have mainly focused on optimizing filtering and ordering
[6, 7, 9, 26, 28, 35, 65]. Even when equipped with carefully designed candidate filtering
and ordering, backtracking algorithms still suffer from numerous futile recursions.
Recently, a growing effort has been made to enhance the backtracking framework itself in order to
reduce the number of backtrackings. This has led to several noteworthy works, such as
DPiso [25], RM [75], VEQ [38], GuP [4], and CaLiG [95]. All these methods
modify the backtracking framework to varying degrees, which has become a current research
trend. Thus, we believe that a comprehensive review is needed to summarize and investigate
these works in greater depth.
Reducing backtrackings could be primarily achieved via two means: pruning during enumeration
to avoid unnecessary and redundant backtrackings (DPiso, VEQ, and GuP); and enumerating
multiple embeddings simultaneously (RM and CaLiG). We offer a comprehensive comparative analysis
of these algorithms and investigate their individual strengths and weaknesses through empirical
evaluation.
Unbiasedness. We notice that previous works [4, 25, 38, 74] follow a “conventional setting” in
evaluations: an algorithm is terminated either after producing $10^5$ matchings or upon reaching a
timeout of 300 seconds. In the early stage of subgraph matching studies, $10^5$ matchings
may have been considered representative. However, with the increasing scale of query and data
graphs, continuing to follow this setting might be inappropriate. We evaluate 9 algorithms on the
dblp dataset and report their rankings under different output limits (i.e., the maximum number
of embeddings) in Figure 2. Simply adjusting the output limit results in a
dramatic shift in the rankings. As the number of reported embeddings increases, the previously
high-ranked methods RI [9] and CFL [7] start to decline, while initially low-ranked methods such
as RM [75] and VEQ [38] rise to the forefront. Additionally, as shown in Figure 12 (Section 4.5), the
results for all extended combinations exhibit even more significant changes.
We believe this phenomenon should sound an alarm for researchers in the field of subgraph matching,
raising a critical question: How can we evaluate the performance of an algorithm without bias? The
ideal solution is to conduct experiments without any limits; however, this is not realistic due to the
NP-hard nature of subgraph matching [22]. In our paper, we introduce the metric embeddings per
second (EPS), which is defined as the average number of embeddings returned per second. This
metric provides a comprehensive measurement by considering both the time cost and the number of
reported embeddings. EPS allows for comparisons across different time and output limits, ensuring
relative consistency in performance, as illustrated in Figure 13. We also re-evaluate representative
subgraph matching algorithms using the metric EPS.
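As a minimal illustration of the EPS definition, the following Python sketch computes EPS for a single run as the number of reported embeddings divided by the elapsed time. The function name and the two runs are hypothetical and chosen only to show that runs stopped under different output limits become comparable on one scale; how EPS is aggregated over many queries is not assumed here.

```python
def embeddings_per_second(num_embeddings: int, elapsed_seconds: float) -> float:
    """EPS of one run: embeddings reported divided by the time spent."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_embeddings / elapsed_seconds


# Hypothetical runs of one algorithm under two different output limits.
run_small_limit = embeddings_per_second(100_000, 2.0)       # stopped at 10^5 embeddings
run_large_limit = embeddings_per_second(10_000_000, 250.0)  # stopped at 10^7 embeddings

print(run_small_limit, run_large_limit)
```

Because both values are rates, they can be compared directly even though the two runs were terminated by different limits.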
Interaction. The abundant algorithms for subgraph matching can be decomposed into individual
techniques (filtering, ordering, and enumeration) that can be combined with each other, opening a large
space for investigating the effect of these techniques. However, previous surveys [47, 74] either
used the original algorithms or replaced individual techniques based on the default algorithm,
overlooking the potential interactions among diverse algorithm combinations. In fact, each
individual technique may require different complementary techniques to achieve its optimal
performance. Assuming that there are 𝑎 filtering, 𝑏 ordering, and 𝑐 enumerating techniques, the
combination space covers 𝑎 × 𝑏 × 𝑐 possible combinations, but in the previous study [74], only
𝑎 + 𝑏 + 𝑐 combinations have been investigated, leaving a significant number of combinations
unexplored.
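The gap between the two design spaces can be made concrete with a small Python sketch. It assumes 10 placeholder techniques per stage (the names F0, O0, E0, ... are hypothetical) and compares the full product space with a one-technique-at-a-time study around a default algorithm. Counting the shared default only once, the latter contains 𝑎 + 𝑏 + 𝑐 − 2 distinct combinations, i.e., on the order of 𝑎 + 𝑏 + 𝑐. Not every combination in the product is necessarily feasible in practice.

```python
from itertools import product

# Placeholder technique names; 10 per stage is an assumption for illustration.
filters = [f"F{i}" for i in range(10)]
orders = [f"O{i}" for i in range(10)]
enumerators = [f"E{i}" for i in range(10)]

# Full design space: every (filter, order, enumerator) triple.
full_space = set(product(filters, orders, enumerators))

# One-at-a-time study: vary a single stage while the other two keep the default.
default = (filters[0], orders[0], enumerators[0])
one_at_a_time = (
    {(f, default[1], default[2]) for f in filters}
    | {(default[0], o, default[2]) for o in orders}
    | {(default[0], default[1], e) for e in enumerators}
)

print(len(full_space), len(one_at_a_time))
```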
To thoroughly investigate the entire design space, this paper selects 10 representative techniques
for each stage and explores the feasible combinations. Through a comprehensive analysis, we
evaluate the performance of both the original combinations and the feasible combinations derived
from various algorithms. The interaction effects revealed by our experimental findings are of
particular significance. Specifically, our experiments demonstrate that several enumeration
methods are significantly influenced by the effectiveness of filters and matching orders, highlighting
the importance of considering these factors in performance evaluation.
In summary, we make the following contributions in this paper.
• We identify a notable shift in the field of subgraph matching, where the focus is transition-
ing from filtering and ordering optimizations to the development of robust backtracking
techniques.
• We have discovered the potential bias caused by using the conventional setting that limits
the reported embeddings. Thus, we introduce an effective metric, namely embeddings per
second (EPS), which reflects the performance of algorithms consistently.
• Beyond the original methods, we use 10 typical techniques for each stage and evaluate all
the feasible combinations to comprehensively investigate the interactions between these
techniques.
• We conduct extensive experiments on various real-world datasets, as well as diverse types
of synthetic datasets and label assignments, which provides a thorough analysis of the
performance of all technique combinations.
2 PRELIMINARY
2.1 Problem Definition
In this paper, we focus on undirected and connected graphs with labeled vertices. Note that
graphs with labeled edges will also be discussed in Section 4.3. Generally, we consider query graph
𝑄 = (𝑉𝑄 , 𝐸𝑄 , Σ, 𝐿𝑄 ) and data graph 𝐺 = (𝑉𝐺 , 𝐸𝐺 , Σ, 𝐿𝐺 ), where 𝑉𝑄 and 𝑉𝐺 are sets of vertices, 𝐸𝑄
and 𝐸𝐺 are sets of edges, Σ is a set of labels, and 𝐿𝑄 : 𝑉𝑄 → Σ and 𝐿𝐺 : 𝑉𝐺 → Σ are the mappings
of a query (resp. data) vertex to its label. If there is no ambiguity, we use 𝑢 as a query vertex and 𝑣
as a data vertex. Let 𝑑 (𝑢) denote the degree of vertex 𝑢 and 𝑁𝑄 (𝑢) (resp. 𝑁𝐺 (𝑣)) denote the sets of
neighbors of 𝑢 (resp. 𝑣).
Definition 2.1 (Subgraph Isomorphism). Given a query graph 𝑄 = (𝑉𝑄 , 𝐸𝑄 , Σ, 𝐿𝑄 ) and a data
graph 𝐺 = (𝑉𝐺 , 𝐸𝐺 , Σ, 𝐿𝐺 ), 𝑄 is subgraph isomorphic to 𝐺 if there exists a mapping function 𝑓 :
𝑉𝑄 → 𝑉𝐺 , such that
(1) ∀ 𝑢 ∈ 𝑉𝑄, we have 𝐿𝑄(𝑢) = 𝐿𝐺(𝑓(𝑢)), where 𝑓(𝑢) ∈ 𝑉𝐺,
(2) ∀ 𝑒(𝑢1, 𝑢2) ∈ 𝐸𝑄, we have 𝑒(𝑓(𝑢1), 𝑓(𝑢2)) ∈ 𝐸𝐺, and
(3) ∀ 𝑢𝑖, 𝑢𝑗 ∈ 𝑉𝑄 with 𝑢𝑖 ≠ 𝑢𝑗, we have 𝑓(𝑢𝑖) ≠ 𝑓(𝑢𝑗).
The mapping function 𝑓 is also called an embedding of 𝑄 in 𝐺. Each embedding could be expressed
as a set of one-to-one vertex pairs {(𝑢, 𝑓 (𝑢))}, and each vertex pair is called an assignment.
A partial embedding 𝑀 : 𝐼 → 𝑉𝐺 , where 𝐼 ⫋ 𝑉𝑄 , is an embedding of a subgraph of 𝑄 (induced
by 𝐼 ). An extension is to add an assignment (𝑢, 𝑣) to a partial embedding 𝑀, denoted by 𝑀 ∪ (𝑢, 𝑣).
Definition 2.2 (Subgraph Matching). Given a query graph 𝑄 and data graph 𝐺, subgraph
matching returns all the embeddings of 𝑄 in 𝐺.
Example 2.1.1. Considering the query 𝑄 and data graph 𝐺 in Figure 1, according to the definition
of subgraph isomorphism (Definition 2.1), 𝑀 = {(𝑢 0, 𝑣 1 ), (𝑢 1, 𝑣 4 ), (𝑢 2, 𝑣 5 ), (𝑢 3, 𝑣 8 ), (𝑢 4, 𝑣 2 ), (𝑢 5, 𝑣 3 )}
is an embedding of 𝑄 in 𝐺.
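The three conditions of Definition 2.1 can be checked mechanically for a given mapping. The Python sketch below is one possible way to do so, assuming graphs are stored as a label dictionary plus an adjacency dictionary of sets; the function name and the toy graphs are hypothetical and are not the graphs of Figure 1.

```python
def is_embedding(q_labels, q_adj, g_labels, g_adj, f):
    """Check conditions (1)-(3) of subgraph isomorphism for a mapping f: V_Q -> V_G."""
    # f must assign every query vertex to some data vertex.
    if set(f) != set(q_labels) or any(v not in g_labels for v in f.values()):
        return False
    # (1) Labels are preserved.
    if any(q_labels[u] != g_labels[f[u]] for u in q_labels):
        return False
    # (2) Every query edge is mapped to a data edge.
    for u1, nbrs in q_adj.items():
        for u2 in nbrs:
            if f[u2] not in g_adj.get(f[u1], set()):
                return False
    # (3) The mapping is injective.
    return len(set(f.values())) == len(f)


# Toy query: path u0(A) - u1(B) - u2(A). Toy data graph: path v0(A) - v1(B) - v2(A).
q_labels = {"u0": "A", "u1": "B", "u2": "A"}
q_adj = {"u0": {"u1"}, "u1": {"u0", "u2"}, "u2": {"u1"}}
g_labels = {"v0": "A", "v1": "B", "v2": "A"}
g_adj = {"v0": {"v1"}, "v1": {"v0", "v2"}, "v2": {"v1"}}

good = {"u0": "v0", "u1": "v1", "u2": "v2"}
not_injective = {"u0": "v0", "u1": "v1", "u2": "v0"}  # satisfies (1) and (2) only

print(is_embedding(q_labels, q_adj, g_labels, g_adj, good))
print(is_embedding(q_labels, q_adj, g_labels, g_adj, not_injective))
```

The second mapping shows why condition (3) is needed: it preserves labels and edges but reuses a data vertex.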
Definition 2.2 returns all embeddings, which implies that automorphisms can yield multiple solutions
whose computation may be unnecessary in some applications. A stream of research [29, 64, 68]
only delivers induced subgraphs, i.e., automorphisms produce at most one embedding. For
empirical studies, please refer to Section 4.4.
The general framework of filtering-ordering-enumerating is outlined in Algorithm 1.
Filtering Vertices. The algorithm first generates candidate sets by filtering vertices (line 1).
Compared to directly traversing the entire graph, the filtering method selects candidates for each
query vertex and eliminates data vertices that cannot be potential matches. The candidate set of
query vertex 𝑢 is denoted by 𝐶(𝑢).
Matching Order. The algorithm then generates a matching order (line 2). A matching order 𝜑 of 𝑄
is a permutation of 𝑉𝑄 that determines the order in which vertices are explored during the search
process. The index of vertex 𝑢 in 𝜑 is denoted by 𝜑(𝑢). The set of forward neighbors is defined as
$N^+(u_i) = \{u_j \in N_Q(u_i) \mid \varphi(u_j) > \varphi(u_i)\}$ and the set of backward neighbors is
$N^-(u_i) = \{u_j \in N_Q(u_i) \mid \varphi(u_j) < \varphi(u_i)\}$.
Enumerating Process. The algorithm finally conducts enumeration (line 3). The basic enumerating
process first computes local candidates 𝐶𝑀(𝑢) (line 8), then extends partial matchings
(line 13), and recursively carries out the computation (line 14) to enumerate the results.
The naive termination condition (line 5) is 𝑖 = |𝑉𝑄|, but with techniques enumerating multiple
embeddings at a time [7, 25, 95], the search depth can be reduced. Indices for pruning during
enumeration [4, 25, 38] can be used as well (lines 10-11, 16). Using these additional indices, such methods
prune some of the unnecessary backtrackings.
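To make the three stages concrete, the following Python sketch strings together a simple filter, a simple ordering, and plain backtracking. It is an illustration under stated assumptions and not a reproduction of Algorithm 1: the filter is a label-and-degree check, the order greedily picks the connected query vertex with the fewest candidates, and no pruning index or multi-embedding technique is used. All function names and the toy graphs are hypothetical, and the query is assumed to be connected.

```python
def filter_candidates(q_labels, q_adj, g_labels, g_adj):
    """Stage 1: keep data vertices with the same label and a sufficient degree."""
    return {
        u: {v for v in g_labels
            if g_labels[v] == q_labels[u] and len(g_adj[v]) >= len(q_adj[u])}
        for u in q_labels
    }


def matching_order(q_adj, cand):
    """Stage 2: greedy connected order, fewest candidates first (an assumed heuristic)."""
    order = [min(q_adj, key=lambda u: (len(cand[u]), u))]
    while len(order) < len(q_adj):
        frontier = {w for u in order for w in q_adj[u]} - set(order)
        order.append(min(frontier, key=lambda u: (len(cand[u]), u)))
    return order


def enumerate_embeddings(q_adj, g_adj, cand, order, limit=None):
    """Stage 3: basic backtracking over the matching order."""
    results, mapping, used = [], {}, set()

    def backtrack(i):
        if limit is not None and len(results) >= limit:
            return
        if i == len(order):  # naive termination: all query vertices are matched
            results.append(dict(mapping))
            return
        u = order[i]
        # Local candidates: C(u) intersected with neighbors of matched backward neighbors.
        local = set(cand[u])
        for w in q_adj[u]:
            if w in mapping:
                local &= g_adj[mapping[w]]
        for v in sorted(local):
            if v in used:  # injectivity
                continue
            mapping[u] = v
            used.add(v)
            backtrack(i + 1)
            del mapping[u]
            used.discard(v)

    backtrack(0)
    return results


# Toy query: path u0(A) - u1(B) - u2(A). Toy data graph: star with center v1(B).
q_labels = {"u0": "A", "u1": "B", "u2": "A"}
q_adj = {"u0": {"u1"}, "u1": {"u0", "u2"}, "u2": {"u1"}}
g_labels = {"v0": "A", "v1": "B", "v2": "A", "v3": "A"}
g_adj = {"v0": {"v1"}, "v1": {"v0", "v2", "v3"}, "v2": {"v1"}, "v3": {"v1"}}

cand = filter_candidates(q_labels, q_adj, g_labels, g_adj)
order = matching_order(q_adj, cand)
all_embeddings = enumerate_embeddings(q_adj, g_adj, cand, order)
print(order, len(all_embeddings))
```

The `limit` parameter plays the role of the output limit discussed in Section 1.2; passing `limit=None` enumerates all embeddings.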
subgraph matching methods in terms of vertex size and label size. In this paper, we uniformly use
the term “subgraph matching” to describe the problem.
The solutions of subgraph matching can be divided into three categories, namely join-based,
exploration-based, and hybrid methods. To enable parallel computing, join-based methods have
been widely adopted in a parallel [39, 72] or distributed manner [1, 36, 44, 45, 61, 87, 90, 96, 99].
For join-based methods, Lai et al. [46] classify join strategies into binary join [44, 45, 61, 78]
and worst-case optimal join (WCOJ) [1, 3, 5, 36, 57–59, 82, 96]. Most in-memory single-threaded
approaches [4, 6, 7, 11, 25, 28, 34, 38] adopt exploration-based methods. We will introduce
and compare the representative algorithms in detail in Section 3. Wang et al. [86, 87] develop
an exploration-based approach for distributed systems, while other researchers [54, 99] explore hybrid
methodologies.
Comparison with existing surveys. Different from the existing surveys [47, 74] of subgraph
matching, we review new enumeration techniques [4, 25, 31, 38, 95] and evaluate all the feasible
combinations of different techniques, whereas they only evaluate the original algorithms [47] or
replace one technique of the original algorithms [74]. Moreover, we use more and larger real-world
datasets as well as more types of synthetic datasets and label assignments. We also extend the
algorithms and evaluate them on graphs with both vertex and edge labels. In addition, we adopt a
novel metric, EPS, that evaluates performance with less bias.
2.2.2 Subgraph Search. Subgraph Search, also known as subgraph containment, is to search in
a database of graphs instead of a single large graph. Given a set of graphs 𝐷 = {𝐺 1, 𝐺 2, · · · , 𝐺𝑛 },
subgraph search aims to find all the graphs that contain the query graph 𝑄.
The filtering-indexing-verification framework is widely used in early subgraph searching efforts.
The focus of such research is to design effective indices to filter out unsatisfiable instances without
conducting computationally expensive subgraph isomorphism tests. One group of works [94, 101]
extracts frequent features appearing in the data graph, while another group [8, 17, 23, 43] considers
all features up to a bounded size. The space complexity would be |𝐷 ||F |, where |F | is the size of
feature set, and |F | could be quite large.
Recently, techniques leveraging subgraph matching have emerged to tackle subgraph search
efficiently without requiring indexing [52, 73]. In particular, VEQ [38] applies the same algorithmic
approach to subgraph search and outperforms prior methods that rely on indexing.
2.2.3 Graph Analytic System. To facilitate large-scale graph computations, numerous graph analytic
systems have been proposed, such as vertex-centric frameworks [50, 53, 69, 91], and subgraph-
centric frameworks [18, 63, 79, 93]. A series of works focusing on graph mining systems have
been proposed [15, 16, 20, 51, 70, 85, 100], e.g., Fractal [18], G-Thinker [93], Peregrine [30], and
Sandslash [14].
These systems often provide specific subgraph matching algorithms based on the system designs.
G-Thinker adopts a subgraph matching approach similar to VF2, which involves local filtering
followed by enumerating all the matches. Fractal could accept a user-defined filtering function to
reduce the data graph and extend vertices iteratively. Peregrine matches the core subgraph first
and enumerates all the embeddings with an aggregator. Sandslash generates a matching order
starting with denser sub-patterns and provides APIs toExtend and toAdd for backtracking search.
SketchTree [99] is implemented in Pregel+ [92].
3 COMPARISON OF METHODS
3.1 Filtering Techniques
3.1.1 Overview. The filtering method is designed to identify the candidate vertex set 𝐶(𝑢) for
each query vertex, with the aim of removing extraneous vertices in advance. Notably, while the
worst-case time complexity remains $O(\prod_{u_i \in V_Q} |C(u_i)|) = O(|V_G|^{|V_Q|})$, the number of vertices can
be considerably reduced in practice via candidate filtering. Ten representative filtering methods
reviewed in the paper are presented in Table 1.
To categorize filtering methods, three aspects could be employed:
(1) Filter Type. Early filtering methods, like LDF [81] or NLF [7], employ a one-round filtering
approach, utilizing only the local features of individual nodes. To reduce the occurrence of false
positives, modern algorithms often employ propagation filtering techniques to achieve a more
accurate candidate set. After one vertex is filtered out, the result can be used to refine the candidates
of its neighbors. In particular, RM [75] employs joins to perform filtering. In practice, the propagation
method also specifies the number of rounds to traverse and refine candidates, as well as the traversal
order.
(2) Filter Rule. The filtering rules are usually designed based on a necessary condition for subgraph
matching. For local filtering, LDF [81] collects each data vertex 𝑣 that has the same label as 𝑢 and
a degree greater than or equal to that of 𝑢. For every label 𝑙, NLF [7] checks whether a
candidate vertex 𝑣 has at least as many neighbors with label 𝑙 as vertex 𝑢. For propagation filtering,
the filtering rules are similar in nature but are applied to the candidate set instead of the original
graph, disregarding pruned vertices. For example, the neighbor-safety filter proposed by VEQ [38]
can be viewed as an extension of NLF for propagation scenarios.
(3) Neighborhood Usage. The neighborhood usage (i.e., neighbors taken into account) for filtering
is diverse as shown in Table 1. Different algorithms have designed various auxiliary data structures
A. For example, CFL and TSO only maintain tree edges (using 𝑁𝑇 ), while CECI and DPiso maintain
all edges (using 𝑁𝐶𝑆 ). Furthermore, for the sake of efficiency, some algorithms only consider the
forward (resp. children) neighbors 𝑁 + (𝑢) or backward (resp. parent) neighbors 𝑁 − (𝑢) in a single
round of propagation.
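The two local rules described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the toy graphs are hypothetical, and graphs are assumed to be stored as label and adjacency dictionaries. LDF checks the label and the degree, while NLF additionally compares per-label neighbor counts.

```python
from collections import Counter


def neighbor_label_counts(labels, adj, x):
    """Number of neighbors of x per label."""
    return Counter(labels[y] for y in adj[x])


def passes_ldf(u, v, q_labels, q_adj, g_labels, g_adj):
    """LDF: same label and degree of v at least the degree of u."""
    return g_labels[v] == q_labels[u] and len(g_adj[v]) >= len(q_adj[u])


def passes_nlf(u, v, q_labels, q_adj, g_labels, g_adj):
    """NLF: for every label l, v has at least as many label-l neighbors as u."""
    need = neighbor_label_counts(q_labels, q_adj, u)
    have = neighbor_label_counts(g_labels, g_adj, v)
    return all(have[label] >= count for label, count in need.items())


def local_candidates(q_labels, q_adj, g_labels, g_adj, use_nlf=True):
    """One-round local filtering: LDF, optionally followed by NLF."""
    return {
        u: {v for v in g_labels
            if passes_ldf(u, v, q_labels, q_adj, g_labels, g_adj)
            and (not use_nlf or passes_nlf(u, v, q_labels, q_adj, g_labels, g_adj))}
        for u in q_labels
    }


# Toy query: u0(A) adjacent to u1(B) and u2(C).
q_labels = {"u0": "A", "u1": "B", "u2": "C"}
q_adj = {"u0": {"u1", "u2"}, "u1": {"u0"}, "u2": {"u0"}}
# Toy data graph: v0(A) has two B neighbors; v3(A) has one B and one C neighbor.
g_labels = {"v0": "A", "v1": "B", "v2": "B", "v3": "A", "v4": "B", "v5": "C"}
g_adj = {"v0": {"v1", "v2"}, "v1": {"v0"}, "v2": {"v0"},
         "v3": {"v4", "v5"}, "v4": {"v3"}, "v5": {"v3"}}

ldf_only = local_candidates(q_labels, q_adj, g_labels, g_adj, use_nlf=False)
with_nlf = local_candidates(q_labels, q_adj, g_labels, g_adj, use_nlf=True)
print(ldf_only["u0"], with_nlf["u0"])
```

In this toy instance LDF keeps both v0 and v3 for u0, whereas NLF removes v0 because it has no neighbor labeled C.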
3.1.2 Latest Work. The latest work mainly focuses on designing strict filter rules to achieve a
candidate set with fewer false positives.
[Figure 3: (a) A tree (star) structure. (b) Refined candidates by the full reducer. (c) Initial candidate set generated by NLF. (d) Refined by Filter Rules 3.1 and 3.2. (e) Refined by Filter Rule 3.3. (f) Data graph 𝐺′.]
Filter Rule 3.1. (Candidate Existence) For each 𝑣 ∈ 𝐶(𝑢), if there exists 𝑢′ ∈ 𝑁(𝑢) such that
𝐶(𝑢′) ∩ 𝑁(𝑣) = ∅, remove 𝑣 from 𝐶(𝑢).
RM [75]. RM adopts a relation filter that builds a relation for each query edge based on labels. It
decomposes 𝑄 into a set of tree-structured sub-queries, as in Figure 3(a), and prunes unnecessary
tuples with the full reducer. The full reducer (Figure 3(b)) iteratively selects an edge 𝑒 whose end vertex is a
leaf and an edge 𝑒′ sharing an end vertex with 𝑒. The relations are then filtered by semi-joins such as
𝑅𝑒′ ← 𝑅𝑒′ ⋉ 𝑅𝑒. The pruning power of RM is competitive with that of Filter Rule 3.1 [75]. The filtering
techniques following Filter Rule 3.1 have the complexity 𝑂(|𝐸𝑄||𝐸𝐺|). Under such a rule, one vertex 𝑣 could be
counted in multiple candidate sets, violating the injective constraint and causing false positives.
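Filter Rule 3.1 can be illustrated with the following Python sketch, which removes 𝑣 from 𝐶(𝑢) whenever some query neighbor 𝑢′ has no candidate adjacent to 𝑣. Iterating until no candidate set changes is an assumption made here for simplicity; the surveyed algorithms typically fix the number of rounds and the traversal order. The function name and the toy graphs are hypothetical.

```python
def candidate_existence_filter(q_adj, g_adj, cand):
    """Apply Filter Rule 3.1 repeatedly until no candidate set changes."""
    cand = {u: set(vs) for u, vs in cand.items()}  # work on a copy
    changed = True
    while changed:
        changed = False
        for u in q_adj:
            for v in list(cand[u]):
                # Remove v if some query neighbor u' has C(u') ∩ N(v) = ∅.
                if any(not (cand[w] & g_adj[v]) for w in q_adj[u]):
                    cand[u].discard(v)
                    changed = True
    return cand


# Toy query: u0 adjacent to u1 and u2.
q_adj = {"u0": {"u1", "u2"}, "u1": {"u0"}, "u2": {"u0"}}
# Toy data graph: two stars centered at v0 and v3.
g_adj = {"v0": {"v1", "v2"}, "v1": {"v0"}, "v2": {"v0"},
         "v3": {"v4", "v5"}, "v4": {"v3"}, "v5": {"v3"}}
# Assumed initial candidates, e.g. obtained from a label-only filter.
initial = {"u0": {"v0", "v3"}, "u1": {"v1", "v2", "v4"}, "u2": {"v5"}}

refined = candidate_existence_filter(q_adj, g_adj, initial)
print(refined)
```

Here v0 is removed from 𝐶(𝑢0) because none of its neighbors is a candidate of 𝑢2, and this removal then propagates to v1 and v2 in 𝐶(𝑢1).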
VEQ [38]. VEQ introduces a neighbor-safety filter, extending NLF to the candidate set. If, for every
label 𝑙, 𝑣 ∈ 𝐶(𝑢) has at least as many label-𝑙 neighbors in CS as 𝑢, then 𝑣 is neighbor-safe regarding 𝑢.
Filter Rule 3.2. (Neighbor Safety) For each 𝑣 ∈ 𝐶(𝑢), if there exists 𝑙 ∈ Σ such that |𝑁𝑄(𝑢, 𝑙)| >
|𝑁𝐶𝑆(𝑢, 𝑣, 𝑙)|, remove 𝑣 from 𝐶(𝑢), where 𝑁𝐶𝑆(𝑢, 𝑣, 𝑙) denotes the label-𝑙 neighbors of (𝑢, 𝑣) in CS.
VEQ implements NLF as a bit array with 4|Σ||𝑉𝐺| bits to represent |𝑁𝐶𝑆(𝑢, 𝑣, 𝑙)| up to 4. Adopting
Filter Rules 3.1 and 3.2, the complexity of the neighbor-safety filter is 𝑂(|𝐸𝑄||𝐸𝐺|).
CaLiG [95]. The intuition of Filter Rule 3.3 is that vertex 𝑣 matches 𝑢 only if 𝑣’s neighbors match
𝑢’s neighbors. CaLiG constructs a bipartite graph for each 𝑣 ∈ 𝐶(𝑢) and checks the existence of
an injective matching, similar to GraphQL [28]. GraphQL traverses 𝐶(𝑢) along an order of vertices,
while CaLiG propagates the state locally as it is designed for streaming graphs.
Filter Rule 3.3. (Injective Matching) For each 𝑣 ∈ 𝐶(𝑢), a bipartite graph is constructed with two sets of
vertices 𝑁𝑄(𝑢) and 𝑁𝐺(𝑣), where there is an edge between 𝑣𝑗 ∈ 𝑁𝐺(𝑣) and 𝑢𝑖 ∈ 𝑁𝑄(𝑢) if 𝑣𝑗 ∈ 𝐶(𝑢𝑖).
If no injective matching is found in the bipartite graph, we remove 𝑣 from 𝐶(𝑢).
Although Filter Rule 3.3 has the strongest pruning power, its complexity is $O(d_Q^{2.5} |E_Q||E_G|)$ as
bipartite graph matching is required, where 𝑑𝑄 is the maximum vertex degree of 𝑄.
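The check behind Filter Rule 3.3 amounts to asking whether every query neighbor of 𝑢 can be assigned a distinct data neighbor of 𝑣. The Python sketch below performs one round of this check using a simple augmenting-path bipartite matching. This is a stand-in chosen for brevity, not the matching routine used by CaLiG or GraphQL, and it does not attain the $d_Q^{2.5}$ factor stated above. All names and the toy graphs are hypothetical; the vertex names only loosely echo the running example.

```python
def has_injective_matching(q_neighbors, allowed):
    """True if every query neighbor can get a distinct data vertex from allowed[.]."""
    match_of = {}  # data vertex -> query neighbor currently matched to it

    def try_assign(ui, seen):
        for vj in allowed[ui]:
            if vj in seen:
                continue
            seen.add(vj)
            # Take vj if it is free, or if its current owner can be re-assigned.
            if vj not in match_of or try_assign(match_of[vj], seen):
                match_of[vj] = ui
                return True
        return False

    return all(try_assign(ui, set()) for ui in q_neighbors)


def injective_matching_filter_once(q_adj, g_adj, cand):
    """One round of Filter Rule 3.3 over all candidate sets."""
    refined = {}
    for u, vs in cand.items():
        refined[u] = set()
        for v in vs:
            # Bipartite graph: u_i in N_Q(u) is linked to v_j in N_G(v) if v_j in C(u_i).
            allowed = {ui: cand[ui] & g_adj[v] for ui in q_adj[u]}
            if has_injective_matching(list(q_adj[u]), allowed):
                refined[u].add(v)
    return refined


# Toy query: u0 adjacent to u1 and u2.
q_adj = {"u0": {"u1", "u2"}, "u1": {"u0"}, "u2": {"u0"}}
# Toy data graph: v1 has two suitable neighbors, v7 has only one.
g_adj = {"v1": {"v3", "v4"}, "v7": {"v3"}, "v3": {"v1", "v7"}, "v4": {"v1"}}
cand = {"u0": {"v1", "v7"}, "u1": {"v3", "v4"}, "u2": {"v3", "v4"}}

refined = injective_matching_filter_once(q_adj, g_adj, cand)
print(refined["u0"])
```

In the toy instance, v7 passes a candidate-existence check for both query neighbors, but its single neighbor v3 would have to serve 𝑢1 and 𝑢2 at once, so the injective-matching check removes it.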
Example 3.1.1. Consider the running example in Figure 1. The candidate set initialized by NLF
is shown in Figure 3(c). No further candidates can be pruned by Filter Rule 3.1, as for
any 𝑣 ∈ 𝐶(𝑢) and any 𝑢′ ∈ 𝑁(𝑢), there exists a corresponding neighbor 𝑣′ ∈ 𝑁(𝑣) with
𝑣′ ∈ 𝐶(𝑢′). However, for 𝑣2 ∈ 𝐶(𝑢1), 𝑣2 has only one neighbor labeled 𝐴, but two neighbors labeled
𝐴 are needed for 𝑢1. Therefore, 𝑣2 is not neighbor-safe and can be removed from 𝐶(𝑢1). The
candidate set refined by Filter Rules 3.1 and 3.2 is shown in Figure 3(d). For 𝑣7 ∈ 𝐶(𝑢0), although 𝑣7
is neighbor-safe, its neighbor 𝑣3 is used twice, in 𝐶(𝑢1) and 𝐶(𝑢2), violating the injective constraint.
By Filter Rule 3.3, 𝑣7 can be pruned from 𝐶(𝑢0), and the final candidate set (Figure 3(e)) is much
tighter, with fewer false positives.
3.2.2 Latest Work. VEQ [38]. CFL [7] proposes leaf decomposition (as discussed in Section 3.3.3),
which has been widely adopted by subsequent algorithms like DPiso. VEQ follows the candidate-size
order of DPiso but generates a new adaptive order. In CFL and DPiso, leaf decomposition
postpones the matching of leaf vertices (i.e., degree-one vertices) to the end of the matching order.
However, when a failure happens in the matching of a leaf vertex, many backtracking steps
have already been wasted. Therefore, the dynamic order of VEQ checks whether the number of remaining
candidates for a degree-one extendable vertex 𝑢 is less than or equal to the needed size, to detect
the failure in advance.
Example 3.2.1. Let us consider query graph 𝑄 in Figure 1(a) and data graph 𝐺′ in Figure 3(f).
There is no embedding of 𝑄 in 𝐺′ as there are insufficient vertices labeled 𝐵 to assign to the leaf
vertices 𝑢4 and 𝑢5. However, if we postpone the matching of the leaf vertices with the matching
order (𝑢0, 𝑢1, 𝑢2, 𝑢3, 𝑢4, 𝑢5), we need to traverse the 100 candidates from 𝑣5 to 𝑣104 to identify the
failure. With the adaptive order of VEQ, after assigning 𝑣3 to 𝑢1 and 𝑣4 to 𝑢2, there is only
one candidate for leaf vertex 𝑢4, so 𝑢4 is selected instead of 𝑢3. No candidate then exists for 𝑢5, so
the failure is detected and the invalid traversal is avoided.
[Figure 4: (a) DAG generated for 𝑄. (b) Search tree annotated with failing sets 𝐹𝑀, nogood guards 𝑁𝑉, and equivalence classes 𝜋𝑀.]
RM [75]. RM considers the cardinality estimation of sub-queries challenging and thus turns back to
the graph structure of 𝑄. CFL proposes core-forest-leaf decomposition and starts the enumeration
from the dense part (i.e., core vertices). RM extends the idea and takes advantage of the dense
sub-structures in core vertices. It constructs a density tree with nucleus decomposition [67], which
could find dense subgraphs with a multi-level hierarchy. RM then starts searching in the densest
part and ends at the sparsest part.
Failing Set. The failing set is proposed in DPiso [25]. DPiso builds a rooted DAG 𝑞𝐷 by BFS, capturing
the dependence among assignments. Failures are classified into two classes: (i) conflict-class, where one
data vertex is used twice in a single embedding, i.e., by two query vertices 𝑢 and 𝑢′; (ii) emptyset-class,
where the set of local candidates for 𝑢 to extend is empty. For (i), the failing set is
𝐹𝑀 = anc(𝑢) ∪ anc(𝑢′), while for (ii), 𝐹𝑀 = anc(𝑢), where
anc(𝑢) denotes the set of all ancestors of 𝑢 in 𝑞𝐷 including 𝑢 itself. The failing set 𝐹𝑀 of an inner
node in the search tree is updated from the failing sets 𝐹𝑀𝑖 of its children. Given a failing set 𝐹𝑀 for a
search node 𝑀 (i.e., a partial embedding) whose last assignment is (𝑢, 𝑣), if 𝑢 ∉ 𝐹𝑀, then all siblings of node 𝑀 are redundant.
In other words, the nogood discovery adopted in the failing set yields 𝐷 = {(𝑢, 𝑀(𝑢)) | 𝑢 ∈ 𝐹𝑀}, and DPiso
employs backjumping until an assignment in the nogood changes.
Example 3.3.1. Consider the query graph 𝑄 in Figure 1(a) and the data graph 𝐺′ in Figure 3(f).
The DAG generated for 𝑄 is shown in Figure 4(a). The partial mapping ending in (𝑢5, 𝑣105) fails and belongs
to the conflict-class, so 𝐹𝑀 = anc(𝑢4) ∪ anc(𝑢5) = {𝑢0, 𝑢1, 𝑢4, 𝑢5}. The failing set is updated in a
bottom-up fashion. Because 𝑢3 ∉ 𝐹𝑀, the siblings of (𝑢3, 𝑣5) are redundant. However, since the failing
set is discarded once backjumping has been done, the same process is still needed for the partial
embeddings involving (𝑢0, 𝑣1) and (𝑢0, 𝑣2). The space pruned is illustrated in yellow in Figure 4(b).
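The leaf-level failing sets and the sibling-pruning test can be sketched as follows in Python. The DAG used here is an assumption: Figure 4(a) is not reproduced in this text, so the parent lists are hypothetical and were only chosen so that the conflict between 𝑢4 and 𝑢5 yields the same set as in Example 3.3.1. The bottom-up update of inner nodes is omitted, and the function names are illustrative.

```python
def ancestors(parents, u):
    """anc(u): u itself plus every vertex from which u is reachable in the query DAG."""
    result, stack = {u}, [u]
    while stack:
        x = stack.pop()
        for p in parents.get(x, ()):
            if p not in result:
                result.add(p)
                stack.append(p)
    return result


def failing_set_conflict(parents, u, u_prime):
    """Conflict-class: u and u' were mapped to the same data vertex."""
    return ancestors(parents, u) | ancestors(parents, u_prime)


def failing_set_emptyset(parents, u):
    """Emptyset-class: u has no local candidate left."""
    return ancestors(parents, u)


def siblings_redundant(u, failing_set):
    """Siblings of a search node that assigns u are redundant if u is not in the failing set."""
    return u not in failing_set


# Hypothetical query DAG given as child -> list of parents (not taken from Figure 4(a)).
parents = {"u1": ["u0"], "u2": ["u0"], "u3": ["u2"], "u4": ["u1"], "u5": ["u1"]}

failing = failing_set_conflict(parents, "u4", "u5")
print(sorted(failing), siblings_redundant("u3", failing))
```

Under this assumed DAG, 𝑢3 does not appear in the failing set, so trying other data vertices for 𝑢3 cannot resolve the conflict and its siblings can be skipped.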
Guard-based Pruning. The failing set of DPiso would be discarded after it has been used for
backjumping. GuP [4] proposes guard-based pruning to reuse a discovered nogood for pruning
multiple times, at the cost of extra memory consumption. The nogood guard 𝑁𝑉 (𝑢𝑖 , 𝑣) in GuP is
constructed based on deadend mask. The deadend mask, denoted by 𝐾, captures the adjacency
constraint related to the failure, based on the concept of local candidate-vertex set. The local candidate-
vertex set of 𝑢𝑖 under 𝑀, is a set of 𝑣 ∈ 𝐶 (𝑢𝑖 ) that holds 𝑣 ∈ 𝑁𝐺 (𝑀 (𝑢 𝑗 )) for all 𝑢 𝑗 ∈ 𝑁𝑄 (𝑢𝑖 ). Also,
GuP proposes the reservation guard to prevent conflict-class failure.
Example 3.3.2. The partial matching on (𝑢5, 𝑣105) has an injectivity conflict, and the deadend
mask is $K_{(u_5, v_{105})} = \{u_4, u_5\}$. With the bottom-up updating, we have $K_{(u_4, v_{105})} = \{u_4\}$, $K_{(u_3, v_5)} = \emptyset$,
and finally $K_{(u_1, v_3)} = \emptyset$. Since 𝑁𝑉(𝑢1, 𝑣3) = ∅ is a subset of any 𝑀, the subtree rooted at (𝑢1, 𝑣3)
can be pruned. The nogood guard on (𝑢1, 𝑣3) is reusable for 𝑀 beginning with (𝑢0, 𝑣1) and (𝑢0, 𝑣2)
as well. In Figure 4(b), the space pruned is marked in green.
Dynamic Equivalence Class. While the failing set and guard-based pruning only record failures,
VEQ [38] proposes dynamic equivalence to remove equivalent subtrees, both the failed ones
and the successful ones. The idea of the dynamic equivalence class may be inspired by the concept of
the equivalence class. Because only the nodes within the candidate set matter, the dynamic
equivalence class is defined over the candidate set. For 𝑣𝑖, 𝑣𝑗 ∈ 𝐶𝑀(𝑢), 𝑣𝑖 and 𝑣𝑗 share neighbors if 𝑣𝑖
and 𝑣𝑗 have common neighbors in 𝐶𝑀(𝑢𝑗) for every 𝑢𝑗 ∈ 𝑁𝑄(𝑢). After excluding some exceptional
cases, 𝑣𝑖 and 𝑣𝑗 are symmetric to each other, denoted by 𝜋𝑀(𝑢, 𝑣𝑖) = 𝜋𝑀(𝑢, 𝑣𝑗) = {𝑣𝑖, 𝑣𝑗}. Thus, under
the search subtree of 𝑀, for any embedding containing 𝑣𝑖, a symmetric embedding containing 𝑣𝑗 can
be generated directly. For failures, the dynamic equivalence class can be considered as
symmetricity-based nogood discovery. The equivalence can only be used for siblings of 𝑀 and is not reusable for
other search subtrees.
Example 3.3.3. Vertices 𝑣5 to 𝑣104 share common neighbors, so 𝜋𝑀(𝑢3, 𝑣5) =
{𝑣5, 𝑣6, · · · , 𝑣104}. The subtree rooted at 𝑀 ∪ (𝑢3, 𝑣5) is visited and no embedding is found. This
implies that the subtrees rooted at all data vertices in 𝜋𝑀(𝑢3, 𝑣5) are identical, and there is no need
for further exploration. In addition, although 𝑣0, 𝑣1, 𝑣2 have different neighbors in 𝐺′, they share
common neighbors in the candidate set after filtering, so 𝜋𝑀(𝑢0, 𝑣0) = {𝑣0, 𝑣1, 𝑣2}. Therefore, the
subtrees rooted at (𝑢0, 𝑣1) and (𝑢0, 𝑣2) can be pruned. In Figure 4(b), the space pruned is marked in
purple.
3.3.3 Enumerating Multiple Embeddings at a Time. Instead of backtracking until all query vertices
are assigned, some methods terminate the search early and produce multiple embeddings at once.
One-step ahead. One-step ahead is a fundamental improvement to the basic backtracking framework.
During the final step of backtracking, each unvisited data vertex in the local candidates
yields a complete embedding. Accordingly, the termination condition in Algorithm 1 is 𝑖 = |𝑉𝑄 | − 1. For
example, as shown in Figure 5(a), all the query vertices except 𝑢 2 have been matched. We directly
intersect 𝑢 3 ’s and 𝑢 6 ’s neighbors in candidate sets, and one embedding is generated for each local
candidate.
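A minimal Python sketch of the one-step-ahead idea follows. It assumes that every query neighbor of the last vertex has already been matched and that graphs are stored as dictionaries of neighbor sets; the function name and the toy data are hypothetical.

```python
def one_step_ahead(partial, last_u, query_adj, data_adj, cand):
    """Emit one embedding per local candidate of the last unmatched query vertex."""
    used = set(partial.values())
    local = set(cand[last_u])
    for u_n in query_adj[last_u]:
        # Keep only candidates adjacent to the image of every matched neighbor.
        local &= data_adj[partial[u_n]]
    # Every unused local candidate completes the partial embedding.
    return [{**partial, last_u: v} for v in sorted(local - used)]


# Hypothetical toy instance: u2 is the only unmatched query vertex.
query_adj = {"u0": {"u2"}, "u1": {"u2"}, "u2": {"u0", "u1"}}
data_adj = {
    "v0": {"v1", "v2", "v3"},
    "v1": {"v0", "v2", "v3"},
    "v2": {"v0", "v1"},
    "v3": {"v0", "v1"},
}
cand = {"u2": {"v2", "v3", "v4"}}
partial = {"u0": "v0", "u1": "v1"}
embeddings = one_step_ahead(partial, "u2", query_adj, data_adj, cand)
print(len(embeddings))
```

The intersection replaces a final level of recursion, so no per-candidate recursive call is made at the last level.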
Leaf enumeration. For a leaf vertex, i.e., a degree-one vertex, the matching depends only on its single
neighbor. TSO [26] permutes the vertices in the same neighbor equivalence class (NEC) to generate
multiple embeddings at a time. CFL [7] proposes leaf decomposition and the related leaf-matching algorithm
to effectively enumerate the Cartesian product. Compared to TSO, CFL not only considers NEC
vertices but also puts all leaf query vertices with the same label in the same label class. The two leaf
vertices 𝑢 4 and 𝑢 5 in Figure 1(a) are not in the same NEC but are in the same label class. DPiso [25]
also adopts leaf decomposition. VEQ [38] finds NECs among all degree-one vertices in 𝑄 and
merges the vertices in 𝑞𝐷 . The query vertices are decomposed into the set of degree-one vertices
and the remaining set 𝑉𝑄 ′ . The worst-case complexity for subgraph matching is then |𝑉𝐺 |^{|𝑉𝑄 ′ |} instead of
|𝑉𝐺 |^{|𝑉𝑄 |} . For example, when the leaf vertices 𝑢 3 and 𝑢 4 are unmapped in Figure 5(b), embeddings can be
obtained by simply permuting 𝑢 6 ’s neighbors in candidate sets.
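The permutation step for leaves in one NEC can be sketched in Python as below. The sketch assumes that all leaves in the class share the same parent and the same candidate set, so that only the first leaf's candidates need to be consulted. The function name and the toy data are hypothetical, and the sketch is not the leaf-matching procedure of any specific system.

```python
from itertools import permutations


def expand_nec_leaves(partial, nec_leaves, parent, data_adj, cand):
    """Complete a partial embedding by permuting candidates over NEC leaves."""
    used = set(partial.values())
    # Assumption: leaves in one NEC share the parent and the candidate set.
    local = sorted((cand[nec_leaves[0]] & data_adj[partial[parent]]) - used)
    return [
        {**partial, **dict(zip(nec_leaves, combo))}
        for combo in permutations(local, len(nec_leaves))
    ]


# Hypothetical toy instance: leaves u3 and u4 both hang off u6.
data_adj = {"v6": {"v7", "v8", "v9"}}
cand = {"u3": {"v7", "v8", "v9"}, "u4": {"v7", "v8", "v9"}}
partial = {"u6": "v6"}
embeddings = expand_nec_leaves(partial, ["u3", "u4"], "u6", data_adj, cand)
print(len(embeddings))
```

With three local candidates and two leaves, the sketch produces the 3 × 2 ordered assignments without any further backtracking.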
vertices, and enumerates all embeddings for shell vertices at a time. The worst-case complexity of
KSS is reduced to 𝑂 (|𝑉𝐺 |^{|𝑉𝑘𝑒𝑟𝑛𝑒𝑙 |}).
Since CaLiG was originally designed for streaming graphs, we adjust the kernel-and-shell
decomposition to make it applicable to all input orders. Given a matching order, each query
vertex 𝑢 with 𝑁𝑄+ (𝑢) = ∅ is added to the shell vertex set.
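The order-dependent decomposition can be sketched in a few lines of Python. The sketch assumes that 𝑁𝑄+ (𝑢) denotes the neighbors of 𝑢 that appear after 𝑢 in the matching order; that reading, the function name, and the toy query are assumptions made for illustration.

```python
def kernel_shell(order, query_adj):
    """Split query vertices into kernel and shell for a given matching order."""
    position = {u: i for i, u in enumerate(order)}
    kernel, shell = [], []
    for u in order:
        # Assumed meaning of N_Q^+(u): neighbors placed later in the order.
        forward = [w for w in query_adj[u] if position[w] > position[u]]
        (kernel if forward else shell).append(u)
    return kernel, shell


# Hypothetical toy query: u1 is adjacent to u0, u2, and u3.
query_adj = {
    "u0": {"u1"},
    "u1": {"u0", "u2", "u3"},
    "u2": {"u1"},
    "u3": {"u1"},
}
kernel, shell = kernel_shell(["u0", "u1", "u2", "u3"], query_adj)
print(kernel, shell)
```

Under this reading, no two shell vertices are adjacent, because the earlier of two adjacent vertices always has a forward neighbor.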
4 EXPERIMENTAL EVALUATION
4.1 Experimental Setting
Techniques to Evaluate. We select ten representative techniques for each stage (i.e., filter, order,
and enumeration), as shown in Table 2, and the total number of feasible algorithm combinations
amounts to 534. Note that several techniques cannot be combined with others because
they depend heavily on specific data structures. We carefully re-implement all of these techniques
except those covered by [74]. For ease of presentation, prefixes are added to indicate
specific components of a method: ‘f’ for filter, ‘o’ for order, and ‘e’ for enumeration. For example,
fVEQ represents the filtering technique of VEQ.
Data Graphs. We select 14 typical real-world graphs across different domains. To minimize the
influence of human bias on the selection of label size, we conduct experiments on the 10 datasets in
Table 3 with label sizes of 15, 30, 45, and 60. The vertex labels are randomly assigned following
previous work [4, 74]. As shown in Table 4, we also collect 4 datasets with real vertex and edge
labels. For further details, please refer to Section 4.3.
We use ER model [13] and RMAT model [12, 37] to generate random graphs and power-law
graphs, respectively. For each model, the default settings are |𝑉 | = 1𝑀, |𝐸| = 5𝑀, and |Σ| = 30. In
order to evaluate scalability, we vary the number of vertices |𝑉 | at 0.05M, 0.1M, 0.5M, and 1M, the
number of edges |𝐸| at 5M, 10M, 15M, and 20M, and the label cardinality |Σ| at 15, 30, 45, and 60.
Query Graphs. To generate query graphs, we follow the approach described in previous
studies [4, 25, 74]. The sampling method performs a Metropolis-Hastings random walk [27] on data
graphs and extracts the induced subgraphs as queries. The query graph size is set to 10,
20, 30, 40, and 50, and each query set contains 1000 query graphs. For graphs
with real edge labels, we generate query sets of 100 queries each. For real queries on DBpedia, we adopt the
LC-QuAD workload [80] in the experiment.
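The sampling step can be sketched in Python as follows. The sketch uses the standard Metropolis-Hastings acceptance rule for a uniform target distribution, min(1, deg(current)/deg(proposal)), and assumes a connected data graph without isolated vertices. The function name, the step cap, and the toy ring graph are hypothetical and do not reproduce the generator used in the cited studies.

```python
import random


def sample_query(adj, size, rng, max_steps=100_000):
    """Sample `size` vertices with a Metropolis-Hastings random walk and
    return the induced subgraph as an adjacency dictionary."""
    current = rng.choice(sorted(adj))
    visited = {current}
    for _ in range(max_steps):
        if len(visited) == size:
            break
        proposal = rng.choice(sorted(adj[current]))
        # Accept with probability min(1, deg(current) / deg(proposal)).
        if rng.random() < len(adj[current]) / len(adj[proposal]):
            current = proposal
            visited.add(current)
    # Induced subgraph: keep every data edge between visited vertices.
    return {v: adj[v] & visited for v in visited}


# Hypothetical toy data graph: a ring of six vertices.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
query = sample_query(ring, 4, random.Random(0))
print(sorted(query))
```

Because the walk only moves along edges, the visited vertices always form a connected query; if the step cap is reached first, the sketch returns a smaller query than requested.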
Evaluation Metrics. We measure the effectiveness and efficiency of each technique with the
following metrics.
• Average candidate size. The average candidate size after filtering, i.e., (Σ_{𝑢 ∈ 𝑉𝑄} |𝐶 (𝑢)|) / |𝑉𝑄 |,
denoting the effectiveness of the filter.
• Embeddings per second (EPS). Considering both the elapsed time and the number of reported
embeddings, EPS is defined as the average number of embeddings returned per second
and provides a consistent evaluation of algorithm performance across time and output size
limitations. Since EPS is an efficiency metric (like speed), it is important to note that a larger
EPS score does not necessarily imply a larger number of results.
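Both metrics reduce to simple arithmetic, as the following Python sketch shows. It computes EPS for a single query run; how per-query values are aggregated over a query set is not specified here and is left out. The function names and the numbers are hypothetical.

```python
def average_candidate_size(cand):
    """Sum of |C(u)| over all query vertices, divided by |V_Q|."""
    return sum(len(c) for c in cand.values()) / len(cand)


def embeddings_per_second(num_embeddings, elapsed_seconds):
    """Embeddings reported per second of elapsed time for one query."""
    return num_embeddings / elapsed_seconds


# Hypothetical candidate sets for a three-vertex query.
cand = {"u0": {1, 2, 3}, "u1": {4}, "u2": {5, 6}}
avg_size = average_candidate_size(cand)

# Hypothetical run: 5000 embeddings reported in half a second.
eps = embeddings_per_second(5000, 0.5)
print(avg_size, eps)
```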
Instead of stopping the search upon finding 10^5 embeddings as in previous studies [4, 25, 38, 74],
we do not impose any maximum limit on the number of embeddings. In our opinion, subgraph
matching could yield a large number of embeddings, easily exceeding 10^5 or even 10^8 . Limiting the
search to 10^5 embeddings may cover only a small fraction of the entire search space. Section 4.5
provides detailed insights into how output limits can affect the performance of the algorithms.
Due to the large number of queries that need to be executed, we set a timeout of one second and
use EPS as the evaluation metric. It is worth noting that EPS reflects the average case rather
than the worst case. Reporting the response time under a limited output size is likely to skew the
metric significantly if a single worst-case query takes up most of the total time. In contrast, EPS
remains relatively stable even in the face of some worst-case queries.
Experiment Environment. All the algorithms are implemented in C++ and evaluated on a Linux
server equipped with an Intel(R) E5-2596v4 CPU @2.2 GHz and 128 GB of RAM. The source code will be
available at: https://github.com/JackChuengQAQ/SubgraphMatchingSurvey.
The performance of both the original algorithms and the best-performing combination on
edge-labeled graphs is shown in Figure 8. Compared with graphs without edge labels, the
differences between the original methods are relatively narrow. However, the best combination
far surpasses all original methods, demonstrating the necessity of studying the interplay between
techniques. VEQ does not perform well on datasets 𝑤𝑚 and 𝑓 𝑏, due to the additional requirement
to check edge labels in the neighbor-safety inspection within the fVEQ technique. Nevertheless,
across datasets 𝑤𝑚, 𝑓 𝑏, and 𝑑𝑝, the enumeration technique eVEQ is used in the best-performing
combinations. CaLiG exhibits good performance across all datasets, because fewer bipartite graph
checks are required, thanks to the introduction of edge labels and the resulting higher selectivity. The
inferior performance of all methods on dataset tc arises because the degree distribution of tc is nearly
uniform and does not conform to the power-law property.
Evaluation of Real Queries. We extract real query workloads from LC-QuAD [80]. Given that
LC-QuAD pertains to natural language questions on knowledge graphs, the queries typically consist
of around 3 to 5 vertices, and the number of embeddings tends to be relatively small. Owing to
the small size of the query graphs, most complex filtering and enumerating techniques yield little
benefit. The enumeration stage could swiftly end due to the limited search depth. Consequently,
the RM method, which directly employs a join-based approach for filtering, exhibits the best
performance.
Evaluation of Edge Label Size. We investigate the effect of edge label size on algorithm performance
by varying the number of edge labels from 10 to 1000. The results on 𝑤𝑚 are presented in Figure 10.
As the edge label size increases, most algorithms experience a decrease in EPS because the
number of embeddings shrinks while the filtering time stays relatively constant, which increases the
time overhead per embedding. The exception is the CaLiG algorithm. The
substantial cost of fCaLiG degrades its performance on graphs with a small edge label
size, but as the edge label size increases, the performance of CaLiG improves.
performance, it appears that the effectiveness of the failing set method has decreased since the
introduction of symmetry-breaking constraints.
[Figure panels: (a2) best ranking for each filtering; (b2) best ranking for each ordering; (c2) best ranking for each enumeration.]
By fixing the ordering method as RI, we exhaustively combine the filters fLDF and fVEQ with the enumerating methods
eLFTJ and eVEQ. As shown in Figure 14(a), the impact of filter replacement on eLFTJ
is not significant, whereas the impact on eVEQ is substantial. When eVEQ is combined with fVEQ,
the performance is significantly improved. However, on datasets like cs, db, and tw, the performance
of eVEQ combined with fLDF is even worse than that of fLDF-eLFTJ. In fact, interaction effects exist
widely between algorithm combinations. As demonstrated in Figures 15, 16, and 17 for the evaluations
in Sections 4.7, 4.8, and 4.9 respectively, the optimal combinations of particular techniques vary
across datasets, indicating that how well techniques work together depends on the dataset.
The factors underlying the emergence of interactions are discussed further in the following
subsections.
(a) An example of the interaction effect. (b) The average candidate size.
Fig. 15. Performance of filtering with the best combination (on average over all label and query sizes).
Fig. 16. Performance of ordering with the best combination (on average over all label and query sizes).
Fig. 17. Performance of enumerating with the best combination (on average over all label and query sizes).
less significant compared to the selectivity affected by the data graph. This leads to data-dependent
methods outperforming data-independent ones. On the other hand, for datasets like 𝑡𝑤 that have
larger values for average degree and core number, oRM consistently delivers superior performance.
In terms of the optimal choices, all ordering methods use eKSS as the optimal enumeration
method. The only exception is oRM, which may occasionally select the original combination (oRM-
oRM-oRM). It is observed that eKSS displays a strong reliance on the ordering method utilized,
as the input order directly influences the kernel-and-shell decomposition. It demonstrates that
data-independent methods tend to focus more on the graph structure, often leading to larger shell
sets. In conclusion, both the graph structure and candidate size are crucial for developing a good
order for subgraph matching.
Fig. 18. Varying the output size and the query size.
2-core query, the core vertices encompass the entire query, and yet kernel-and-shell decomposition
remains applicable.
Technique eVEQ introduces dynamic equivalence class detection during the backtracking search
to avoid searching through duplicate subtrees. It yields remarkable performance gains, with
an improvement of 3∼8 orders of magnitude compared to eLFTJ, except on graphs cs, tw, and ys.
The trade-off of using eVEQ is that its performance may be inferior to that of eLFTJ on
certain graphs. These enumerating methods have shown promising results and suggest that modifying
the backtracking framework may lead to further improvements in subgraph matching.
When considering the optimal combinations for enumerating methods, it is worth noting that
the optimal selection of filtering and ordering methods still varies across different graphs. Although
we noted earlier that the eVEQ enumeration depends heavily on the filtering method used, it does
not always choose a tighter filter such as fVEQ. At times, it is not necessary to employ
complex filtering rules to sufficiently reduce the candidate set.
[Figure: panels for orderings oCFL, oGQL, and oRI (y-axis ticks 1, 10, 20); x-axis: filters fNLF, fGQL, fDP, and fVEQ, repeated per group.]
As the treewidth increases, there is a significant downward trend in EPS, indicating rising
query difficulty. When the treewidth is intermediate, eVEQ tends to outperform other enumeration
methods. However, at small treewidths, the efficiency of eVEQ noticeably deteriorates. This decline may
be attributed to the simple structure of tree-like queries, for which applying sophisticated
pruning techniques might lead to suboptimal performance. Interestingly, the order
influences the algorithm’s performance, but not consistently. At exceptionally
small treewidths, where cardinality estimation is assumed to be straightforward, data-dependent
orders (e.g., oGQL and oCFL) do not consistently outperform the data-independent order (oRI). The
resulting performance differences are minimal, which deserves further investigation. In addition,
the kernel-and-shell decomposition in eKSS depends on the given order, yet its performance is not
heavily influenced.
4.12.1 Vary |𝑉 |. As shown in Figures 20(a) and 20(b), performance declines across all
algorithm combinations as |𝑉 | increases in ER graphs. Algorithmic
performance is determined predominantly by filtering techniques rather than enumerating techniques. Since
ER graphs do not exhibit power-law properties, it is less likely to discover a large number of
embeddings under the search subtree rooted at a single high-degree node. This leaves limited
opportunities to enumerate multiple embeddings at a time or to prune during enumeration. Ranked
by performance, the filtering techniques are fCFL, fDPiso, fCECI, fLDF, and fNLF, affirming the
effectiveness of Filter Rule 3.1 for ER graphs.
For RMAT graphs, the importance of enumeration techniques is more evident. According to
Figure 20(c), the optimal method is RM, followed by eEXPL and eKSS. The performance of these
algorithm combinations remains relatively stable as the RMAT graph size increases. Meanwhile,
the performance of eVEQ declines rapidly as the number of vertices increases, in contrast to
the stability of eRM and eKSS.
4.12.2 Vary |𝐸|. The performance of all algorithms in terms of EPS is shown in Figure 20(d) and
Figure 20(e). For the native backtracking search, the success rate of backtracking rises as
|𝐸| increases, leading to an increase in performance measured by
EPS. For eVEQ and eGQL, the increase in |𝐸| causes additional overhead, resulting in a performance
curve that initially rises but then declines.
(a) Vary |𝑉 | in ER, by filter. (b) Vary |𝑉 | in ER, by enumeration. (c) Vary |𝑉 | in RMAT, by enumeration. (d) Vary |𝐸 | in ER, by enumeration. (e) Vary |𝐸 | in RMAT, by enumeration. (f) Vary |Σ| in ER, by enumeration. (g) Vary |Σ| in RMAT, by enumeration.
[Figure: recommended techniques. Filtering: fDPiso (default), fLDF / fNLF (high selectivity), fVEQ (low selectivity). Ordering: oRI / oRM (default), oGQL.]
For RMAT graphs, Figure 20(e) (by varying |𝐸|) and Figure 20(c) (by varying |𝑉 |) exhibit a
comparable trend. The performance of VEQ and GQL deteriorates substantially faster with the
growth in the number of edges, compared to the other methods.
4.12.3 Vary |Σ|. On ER graphs, illustrated by Figure 20(f), all methods show a consistent decline
in performance as |Σ| increases. Generally, it is assumed that as the number of labels increases
(leading to higher selectivity), the speed of the algorithm also increases correspondingly. However,
the results are still randomly scattered on ER graphs, which leads to an actual increase in the
time required to return a single embedding.
On RMAT graphs, Figure 20(g) is similar to Figure 20(c). Regarding enumerating methods,
the RM method remains the best, followed by eEXPL and eKSS, and then eLFTJ and eQSI. With the
growth in label size, the performance of eGQL and eVEQ grows rapidly, especially that of eVEQ, which becomes the
best method at |Σ| = 60. The growth of eGQL is attributed to the fewer edge checks needed when
computing local candidates, while the higher selectivity of vertices provides eVEQ with more
dynamic equivalence classes.
1) For filtering techniques, the fundamental principle is to make selections based on selectivity.
The fDPiso method could serve as the default. For graphs with a rich number of labels and higher
selectivity, fLDF or fNLF can be employed. Conversely, for graphs with fewer labels, or in cases
where a large number of failing candidates are detected during enumeration, the fVEQ method
might be worth considering.
2) For ordering techniques, we can take oRI or oRM as the default choice but turn to oGQL when
the dataset is sparse and the candidate sizes are small.
3) For the enumeration process, techniques like eEXPL or eLFTJ can be employed when the output
limit is relatively small. For query graphs with small treewidth, eKSS serves as an appropriate
choice. In most other situations, eVEQ is a good choice. However, if an excessive number of
candidates persist, or if the computational cost of eVEQ’s equivalence class becomes substantial
during execution, it may be advantageous to use eKSS or eRM.
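The three guidelines above can be condensed into a small Python sketch. The boolean flags, the function name, and the precedence among the enumeration rules are assumptions made for illustration; the guidelines give no numeric thresholds, so deciding when a flag holds is left to the user.

```python
def recommend(high_selectivity=False, few_labels_or_many_failures=False,
              sparse_small_candidates=False, small_output_limit=False,
              small_treewidth=False, veq_too_costly=False):
    """Return a (filter, order, enumeration) suggestion from coarse flags."""
    # 1) Filtering: fDPiso by default, chosen by selectivity otherwise.
    if high_selectivity:
        filt = "fLDF/fNLF"
    elif few_labels_or_many_failures:
        filt = "fVEQ"
    else:
        filt = "fDPiso"
    # 2) Ordering: oGQL for sparse data with small candidate sets.
    order = "oGQL" if sparse_small_candidates else "oRI/oRM"
    # 3) Enumeration: the precedence of these checks is an assumption.
    if small_output_limit:
        enum = "eEXPL/eLFTJ"
    elif small_treewidth:
        enum = "eKSS"
    elif veq_too_costly:
        enum = "eKSS/eRM"
    else:
        enum = "eVEQ"
    return filt, order, enum


default_choice = recommend()
tree_like_choice = recommend(high_selectivity=True, small_treewidth=True)
print(default_choice, tree_like_choice)
```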
5.2 Discussion
We conduct a comprehensive evaluation and analysis of the methods and find that current trends
focus on improving the backtracking framework. These methods fall into two types.
One type, such as RM and KSS, enumerates multiple embeddings at a time, while the other type, such as
VEQ and GuP, performs pruning during enumeration. We believe that enhancing the backtracking
framework remains a promising direction for future progress. Going forward, the challenges are how to
integrate the two types of methods and how to mitigate the suboptimal worst-case performance
of methods that prune during enumeration.
We find that the choice of limiting reported embeddings has a considerable effect on the rankings
of these methods, suggesting that the results of experimental evaluations in some past studies may
have been biased. Hence, we propose a novel metric, embeddings per second, and re-evaluate all the
algorithms. It has been commonly assumed that the use of pruning in enumeration techniques is
beneficial mainly for large and complex queries [74]. Our empirical results show that such pruning
techniques may also be efficient for simple queries, particularly when there are a large number of
embeddings.
By exploring the entire algorithm design space, our research reveals that the interactions between
various techniques play a crucial role. For example, KSS relies on the ordering method employed.
We compare individual techniques within their optimal combinations to minimize
potential interaction effects. In addition, we observe that combining different techniques could yield
better performance and that the ideal combination varies across datasets. For future algorithm
design, it is better to focus on the interplay between techniques, beyond optimizing individual ones.
6 CONCLUSIONS
In this paper, we conduct a comprehensive survey and experimental study of subgraph matching,
focusing on three key issues: identifying the current trend, ensuring unbiasedness, and investigating
the potential interactions. We identify that current trends focus on enhancing backtracking searches,
whose promising advantages have also been confirmed in our experiments. Beyond fixing the
output limit, we unbiasedly evaluate the performance of the algorithms by using an effective metric
EPS. To study the interaction effect between individual techniques, we evaluate and analyze the
original algorithms, as well as feasible combinations, through both real-world and synthetic graphs.
ACKNOWLEDGMENTS
This work was supported by National Natural Science Foundation of China (No. U23A20496),
Shanghai Science and Technology Innovation Action Plan (No. 21511100401), and GuangDong
Basic and Applied Basic Research Foundation (No. 2019B1515120048).
REFERENCES
[1] Christopher R. Aberger, Andrew Lamb, Susan Tu, Andres Nötzli, Kunle Olukotun, and Christopher Ré. 2017. Empty-
Headed: A Relational Engine for Graph Processing. ACM Trans. Database Syst. 42, 4, Article 20 (oct 2017), 44 pages.
https://doi.org/10.1145/3129246
[2] Noga Alon, Raphael Yuster, and Uri Zwick. 1995. Color-Coding. J. ACM 42, 4 (jul 1995), 844–856. https://doi.org/10.
1145/210332.210337
[3] Khaled Ammar, Frank McSherry, Semih Salihoglu, and Manas Joglekar. 2018. Distributed evaluation of subgraph
queries using worstcase optimal lowmemory dataflows. Proceedings of the VLDB Endowment (2018).
[4] Junya Arai, Yasuhiro Fujiwara, and Makoto Onizuka. 2023. GuP: Fast Subgraph Matching by Guard-Based Pruning.
Proc. ACM Manag. Data 1, 2, Article 167 (jun 2023), 26 pages. https://doi.org/10.1145/3589312
[5] Molham Aref, Balder ten Cate, Todd J. Green, Benny Kimelfeld, Dan Olteanu, Emir Pasalic, Todd L. Veldhuizen, and
Geoffrey Washburn. 2015. Design and Implementation of the LogicBlox System. In Proceedings of the 2015 ACM
SIGMOD International Conference on Management of Data (Melbourne, Victoria, Australia) (SIGMOD ’15). Association
for Computing Machinery, New York, NY, USA, 1371–1382. https://doi.org/10.1145/2723372.2742796
[6] Bibek Bhattarai, Hang Liu, and H. Howie Huang. 2019. CECI: Compact Embedding Cluster Index for Scalable Subgraph
Matching. In Proceedings of the 2019 International Conference on Management of Data (Amsterdam, Netherlands)
(SIGMOD ’19). Association for Computing Machinery, New York, NY, USA, 1447–1462.
[7] Fei Bi, Lijun Chang, Xuemin Lin, Lu Qin, and Wenjie Zhang. 2016. Efficient Subgraph Matching by Postponing
Cartesian Products. In Proceedings of the 2016 International Conference on Management of Data (San Francisco,
California, USA) (SIGMOD ’16). Association for Computing Machinery, New York, NY, USA, 1199–1214. https:
//doi.org/10.1145/2882903.2915236
[8] Vincenzo Bonnici, Alfredo Ferro, Rosalba Giugno, Alfredo Pulvirenti, and Dennis Shasha. 2010. Enhancing Graph
Database Indexing by Suffix Tree Structure. In Proceedings of the 5th IAPR International Conference on Pattern
Recognition in Bioinformatics (Nijmegen, The Netherlands) (PRIB’10). Springer-Verlag, Berlin, Heidelberg, 195–203.
[9] Vincenzo Bonnici, Rosalba Giugno, Alfredo Pulvirenti, Dennis Shasha, and Alfredo Ferro. 2013. A subgraph
isomorphism algorithm and its application to biochemical data. BMC Bioinformatics 14, 7 (22 Apr 2013), S13.
https://doi.org/10.1186/1471-2105-14-S7-S13
[10] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating
Embeddings for Modeling Multi-Relational Data. In Proceedings of the 26th International Conference on Neural
Information Processing Systems - Volume 2 (Lake Tahoe, Nevada) (NIPS’13). Curran Associates Inc., Red Hook, NY,
USA, 2787–2795.
[11] Vincenzo Carletti, Pasquale Foggia, Alessia Saggese, and Mario Vento. 2018. Challenging the Time Complexity of
Exact Subgraph Isomorphism for Huge and Dense Graphs with VF3. IEEE Transactions on Pattern Analysis and
Machine Intelligence 40, 4 (2018), 804–818. https://doi.org/10.1109/TPAMI.2017.2696940
[12] Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. 2004. R-MAT: A Recursive Model for Graph Mining.
Society for Industrial and Applied Mathematics, 442–446. https://doi.org/10.1137/1.9781611972740.43
[13] Peter Pin-Shan Chen. 1976. The Entity-Relationship Model—toward a Unified View of Data. ACM Trans. Database
Syst. 1, 1 (mar 1976), 9–36. https://doi.org/10.1145/320434.320440
[14] Xuhao Chen, Roshan Dathathri, Gurbinder Gill, Loc Hoang, and Keshav Pingali. 2021. Sandslash: A Two-Level
Framework for Efficient Graph Pattern Mining. In Proceedings of the ACM International Conference on Supercomputing
(Virtual Event, USA) (ICS ’21). Association for Computing Machinery, New York, NY, USA, 378–391. https://doi.org/
10.1145/3447818.3460359
[15] Xuhao Chen, Roshan Dathathri, Gurbinder Gill, and Keshav Pingali. 2020. Pangolin: An Efficient and Flexible Graph
Mining System on CPU and GPU. Proc. VLDB Endow. 13, 8 (apr 2020), 1190–1205. https://doi.org/10.14778/3389133.
3389137
[16] Xuhao Chen, Tianhao Huang, Shuotao Xu, Thomas Bourgeat, Chanwoo Chung, and Arvind Arvind. 2021. FlexMiner:
A Pattern-Aware Accelerator for Graph Pattern Mining. In 2021 ACM/IEEE 48th Annual International Symposium on
Computer Architecture (ISCA). 581–594. https://doi.org/10.1109/ISCA52012.2021.00052
[17] Raffaele Di Natale, Alfredo Ferro, Rosalba Giugno, Misael Mongiovì, Alfredo Pulvirenti, and Dennis Shasha. 2010.
SING: Subgraph search In Non-homogeneous Graphs. BMC Bioinformatics 11, 1 (19 Feb 2010), 96. https://doi.org/10.
1186/1471-2105-11-96
[18] Vinicius Dias, Carlos H. C. Teixeira, Dorgival Guedes, Wagner Meira, and Srinivasan Parthasarathy. 2019. Fractal: A
General-Purpose Graph Pattern Mining System. In Proceedings of the 2019 International Conference on Management of
Data (Amsterdam, Netherlands) (SIGMOD ’19). Association for Computing Machinery, New York, NY, USA, 1357–1374.
https://doi.org/10.1145/3299869.3319875
[19] Wenfei Fan. 2012. Graph Pattern Matching Revised for Social Network Analysis. In Proceedings of the 15th International
Conference on Database Theory (Berlin, Germany) (ICDT ’12). Association for Computing Machinery, New York, NY,
[61] Miao Qiao, Hao Zhang, and Hong Cheng. 2017. Subgraph matching: on compression and computation. Proceedings of
the VLDB Endowment 11, 2 (2017), 176–188.
[62] Xiafei Qiu, Wubin Cen, Zhengping Qian, You Peng, Ying Zhang, Xuemin Lin, and Jingren Zhou. 2018. Real-
Time Constrained Cycle Detection in Large Dynamic Graphs. Proc. VLDB Endow. 11, 12 (aug 2018), 1876–1888.
https://doi.org/10.14778/3229863.3229874
[63] Abdul Quamar, Amol Deshpande, and Jimmy Lin. 2016. NScale: Neighborhood-Centric Large-Scale Graph Analytics
in the Cloud. The VLDB Journal 25, 2 (apr 2016), 125–150. https://doi.org/10.1007/s00778-015-0405-2
[64] Pedro Ribeiro, Pedro Paredes, Miguel E. P. Silva, David Aparicio, and Fernando Silva. 2021. A Survey on Subgraph
Counting: Concepts, Algorithms, and Applications to Network Motifs and Graphlets. ACM Comput. Surv. 54, 2, Article
28 (mar 2021), 36 pages. https://doi.org/10.1145/3433652
[65] Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin, and M. Tamer Özsu. 2017. The Ubiquity of Large
Graphs and Surprising Challenges of Graph Processing. Proc. VLDB Endow. 11, 4 (dec 2017), 420–431.
[66] Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin, and M. Tamer Özsu. 2018. The Ubiquity of Large
Graphs and Surprising Challenges of Graph Processing. Proc. VLDB Endow. 11, 4 (oct 2018), 420–431. https:
//doi.org/10.1145/3164135.3164139
[67] Ahmet Erdem Sariyuce, C. Seshadhri, Ali Pinar, and Umit V. Catalyurek. 2015. Finding the Hierarchy of Dense
Subgraphs Using Nucleus Decompositions. In Proceedings of the 24th International Conference on World Wide Web
(Florence, Italy) (WWW ’15). International World Wide Web Conferences Steering Committee, Republic and Canton
of Geneva, CHE, 927–937. https://doi.org/10.1145/2736277.2741640
[68] C. Seshadhri. 2023. Some Vignettes on Subgraph Counting Using Graph Orientations. In 26th International Conference
on Database Theory (ICDT 2023) (Leibniz International Proceedings in Informatics (LIPIcs), Vol. 255), Floris Geerts
and Brecht Vandevoort (Eds.). Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 3:1–3:10.
https://doi.org/10.4230/LIPIcs.ICDT.2023.3
[69] Yingxia Shao, Bin Cui, Lei Chen, Lin Ma, Junjie Yao, and Ning Xu. 2014. Parallel subgraph listing in a large-scale
graph. In Proceedings of the 2014 ACM SIGMOD international conference on Management of Data. 625–636.
[70] Tianhui Shi, Mingshu Zhai, Yi Xu, and Jidong Zhai. 2020. GraphPi: High Performance Graph Pattern Matching through
Effective Redundancy Elimination. In Proceedings of the International Conference for High Performance Computing,
Networking, Storage and Analysis (Atlanta, Georgia) (SC ’20). IEEE Press, Article 100, 14 pages.
[71] Richard M. Stallman and Gerald J. Sussman. 1976. Forward Reasoning and Dependency-Directed Backtracking in a
System for Computer-Aided Circuit Analysis. Artif. Intell. 9 (1976), 135–196.
[72] Shixuan Sun, Yulin Che, Lipeng Wang, and Qiong Luo. 2019. Efficient Parallel Subgraph Enumeration on a Single
Machine. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). 232–243. https://doi.org/10.1109/
ICDE.2019.00029
[73] Shixuan Sun and Qiong Luo. 2019. Scaling up subgraph query processing with efficient subgraph matching. In 2019
IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 220–231.
[74] Shixuan Sun and Qiong Luo. 2020. In-Memory Subgraph Matching: An In-Depth Study. In Proceedings of the 2020
ACM SIGMOD International Conference on Management of Data (Portland, OR, USA) (SIGMOD ’20). Association for
Computing Machinery, New York, NY, USA, 1083–1098. https://doi.org/10.1145/3318464.3380581
[75] Shixuan Sun, Xibo Sun, Yulin Che, Qiong Luo, and Bingsheng He. 2020. RapidMatch: A Holistic Approach to Subgraph
Query Processing. Proc. VLDB Endow. 14, 2 (oct 2020), 176–188. https://doi.org/10.14778/3425879.3425888
[76] Shixuan Sun, Xibo Sun, Bingsheng He, and Qiong Luo. 2022. RapidFlow: An Efficient Approach to Continuous
Subgraph Matching. Proc. VLDB Endow. 15, 11 (jul 2022), 2415–2427. https://doi.org/10.14778/3551793.3551803
[77] Xibo Sun and Qiong Luo. 2023. Efficient GPU-Accelerated Subgraph Matching. Proc. ACM Manag. Data 1, 2, Article
181 (jun 2023), 26 pages. https://doi.org/10.1145/3589326
[78] Zhao Sun, Hongzhi Wang, Haixun Wang, Bin Shao, and Jianzhong Li. 2012. Efficient subgraph matching on billion
node graphs. Proceedings of the VLDB Endowment (2012).
[79] Carlos H. C. Teixeira, Alexandre J. Fonseca, Marco Serafini, Georgos Siganos, Mohammed J. Zaki, and Ashraf
Aboulnaga. 2015. Arabesque: A System for Distributed Graph Mining. In Proceedings of the 25th Symposium on
Operating Systems Principles (Monterey, California) (SOSP ’15). Association for Computing Machinery, New York, NY,
USA, 425–440. https://doi.org/10.1145/2815400.2815410
[80] Priyansh Trivedi, Gaurav Maheshwari, Mohnish Dubey, and Jens Lehmann. 2017. Lc-quad: A corpus for complex
question answering over knowledge graphs. In International Semantic Web Conference. Springer, 210–218.
[81] J. R. Ullmann. 1976. An Algorithm for Subgraph Isomorphism. J. ACM 23, 1 (jan 1976), 31–42.
https://doi.org/10.1145/321921.321925
[82] T. Veldhuizen. 2014. Triejoin: A Simple, Worst-Case Optimal Join Algorithm. In ICDT.
[83] Todd L Veldhuizen. 2012. Leapfrog triejoin: a worst-case optimal join algorithm. arXiv preprint arXiv:1210.0481 (2012).
[84] Vincenzo Carletti, Pasquale Foggia, Pierluigi Ritrovato, Mario Vento, and Vincenzo Vigilante. 2019. A Parallel
Algorithm for Subgraph Isomorphism. 141–151. https://doi.org/10.1007/978-3-030-20081-7_14
[85] Kai Wang, Zhiqiang Zuo, John Thorpe, Tien Quang Nguyen, and Guoqing Harry Xu. 2018. RStream: Marrying
Relational Algebra with Streaming for Efficient Graph Mining on a Single Machine. In Proceedings of the 13th USENIX
Conference on Operating Systems Design and Implementation (Carlsbad, CA, USA) (OSDI’18). USENIX Association,
USA, 763–782.
[86] Zhaokang Wang, Rong Gu, Weiwei Hu, Chunfeng Yuan, and Yihua Huang. 2019. BENU: Distributed subgraph
enumeration with backtracking-based framework. In 2019 IEEE 35th International Conference on Data Engineering
(ICDE). IEEE, 136–147.
[87] Zhaokang Wang, Weiwei Hu, Guowang Chen, Chunfeng Yuan, Rong Gu, and Yihua Huang. 2021. Towards Efficient
Distributed Subgraph Enumeration Via Backtracking-Based Framework. IEEE Transactions on Parallel and Distributed
Systems 32, 12 (2021), 2953–2969. https://doi.org/10.1109/TPDS.2021.3076246
[88] Lizhi Xiang, Arif Khan, Edoardo Serra, Mahantesh Halappanavar, and Aravind Sukumaran-Rajam. 2021. CuTS:
Scaling Subgraph Isomorphism on Distributed Multi-GPU Systems Using Trie Based Data Structure. In Proceedings of
the International Conference for High Performance Computing, Networking, Storage and Analysis (St. Louis, Missouri)
(SC ’21). Association for Computing Machinery, New York, NY, USA, Article 69, 14 pages.
https://doi.org/10.1145/3458817.3476214
[89] Xunbin Su, Yinnian Lin, and Lei Zou. 2023. FASI: FPGA-friendly Subgraph Isomorphism on Massive Graphs. In 2023
IEEE 39th International Conference on Data Engineering (ICDE).
[90] Da Yan, Yingyi Bu, Yuanyuan Tian, and Amol Deshpande. 2017. Big graph analytics platforms. Foundations and
Trends in Databases 7, 1-2 (2017), 1–195.
[91] Da Yan, Yingyi Bu, Yuanyuan Tian, Amol Deshpande, and James Cheng. 2016. Big Graph Analytics Systems. In
Proceedings of the 2016 International Conference on Management of Data (San Francisco, California, USA) (SIGMOD
’16). Association for Computing Machinery, New York, NY, USA, 2241–2243. https://doi.org/10.1145/2882903.2912566
[92] Da Yan, James Cheng, Kai Xing, Yi Lu, Wilfred Ng, and Yingyi Bu. 2014. Pregel+: A Distributed Graph Computing
Framework with Effective Message Reduction. The Chinese University of Hong Kong, Hong Kong, China.
http://www.cse.cuhk.edu.hk/pregelplus/
[93] Da Yan, Guimu Guo, Md Mashiur Rahman Chowdhury, M. Tamer Özsu, Wei-Shinn Ku, and John C. S. Lui. 2020.
G-thinker: A Distributed Framework for Mining Subgraphs in a Big Graph. In 2020 IEEE 36th International Conference
on Data Engineering (ICDE). 1369–1380. https://doi.org/10.1109/ICDE48307.2020.00122
[94] Xifeng Yan, Philip S. Yu, and Jiawei Han. 2004. Graph Indexing: A Frequent Structure-Based Approach. In Proceedings
of the 2004 ACM SIGMOD International Conference on Management of Data (Paris, France) (SIGMOD ’04). Association
for Computing Machinery, New York, NY, USA, 335–346. https://doi.org/10.1145/1007568.1007607
[95] Rongjian Yang, Zhijie Zhang, Weiguo Zheng, and Jeffrey Xu Yu. 2023. Fast Continuous Subgraph Matching over
Streaming Graphs via Backtracking Reduction. Proc. ACM Manag. Data 1, 1, Article 15 (may 2023), 26 pages.
https://doi.org/10.1145/3588695
[96] Zhengyi Yang, Longbin Lai, Xuemin Lin, Kongzhang Hao, and Wenjie Zhang. 2021. HUGE: An Efficient and
Scalable Subgraph Enumeration System. In Proceedings of the 2021 International Conference on Management of
Data (Virtual Event, China) (SIGMOD ’21). Association for Computing Machinery, New York, NY, USA, 2049–2062.
https://doi.org/10.1145/3448016.3457237
[97] Li Zeng, Lei Zou, M. Tamer Özsu, Lin Hu, and Fan Zhang. 2020. GSI: GPU-friendly Subgraph Isomorphism. In 2020 IEEE
36th International Conference on Data Engineering (ICDE). 1249–1260. https://doi.org/10.1109/ICDE48307.2020.00112
[98] Yuejia Zhang, Weiguo Zheng, Zhijie Zhang, Peng Peng, and Xuecang Zhang. 2022. Hybrid Subgraph Matching
Framework Powered by Sketch Tree for Distributed Systems. In 2022 IEEE 38th International Conference on Data
Engineering (ICDE). 1031–1043. https://doi.org/10.1109/ICDE53745.2022.00082
[99] Yuejia Zhang, Weiguo Zheng, Zhijie Zhang, Peng Peng, and Xuecang Zhang. 2022. Hybrid Subgraph Matching
Framework Powered by Sketch Tree for Distributed Systems. In 2022 IEEE 38th International Conference on Data
Engineering (ICDE). 1031–1043. https://doi.org/10.1109/ICDE53745.2022.00082
[100] Cheng Zhao, Zhibin Zhang, Peng Xu, Tianqi Zheng, and Jiafeng Guo. 2020. Kaleido: An Efficient Out-of-core Graph
Mining System on A Single Machine. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). 673–684.
https://doi.org/10.1109/ICDE48307.2020.00064
[101] Peixiang Zhao, Jeffrey Xu Yu, and Philip S. Yu. 2007. Graph Indexing: Tree + Delta >= Graph. In Proceedings of the
33rd International Conference on Very Large Data Bases (Vienna, Austria) (VLDB ’07). VLDB Endowment, 938–949.