0% found this document useful (0 votes)

179 views21 pages

In-Depth Analysis of Graph-Based RAG in A Unified Framework

This paper presents a comprehensive analysis of graph-based Retrieval-Augmented Generation (RAG) methods within a unified framework, highlighting their effectiveness in enhancing large language models (LLMs) by integrating external knowledge. The authors systematically compare various graph-based RAG approaches across multiple question-answering datasets, identifying new variants that outperform existing state-of-the-art methods. Additionally, the study outlines promising research opportunities based on the findings from their extensive experiments and analyses.

Uploaded by

yjin276

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

179 views21 pages

In-Depth Analysis of Graph-Based RAG in A Unified Framework

Uploaded by

yjin276

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 21

In-depth Analysis of Graph-based RAG in a Unified Framework

[Experiment, Analysis & Benchmark]

Yingli Zhou Yaodong Su Youran Sun Shu Wang
CUHK-Shenzhen CUHK-Shenzhen CUHK-Shenzhen CUHK-Shenzhen

Taotao Wang Runyuan He Yongwei Zhang Sicong Liang

CUHK-Shenzhen CUHK-Shenzhen Huawei Cloud Huawei Cloud

Xilin Liu Yuchi Ma Yixiang Fang

Huawei Cloud Huawei Cloud CUHK-Shenzhen
ABSTRACT Prompt

Graph-based Retrieval-Augmented Generation (RAG) has proven Query

effective in integrating external knowledge into large language 🔍
models (LLMs), improving their factual accuracy, adaptability, in- LLM Wrong!
…
terpretability, and trustworthiness. A number of graph-based RAG
methods have been proposed in the literature. However, these meth- Corpus Retrieved Chunks
ods have not been systematically and comprehensively compared Vanilla RAG
under the same experimental settings. In this paper, we first sum- Prompt
marize a unified framework to incorporate all graph-based RAG Query
methods from a high-level perspective. We then extensively com- 🔍
pare representative graph-based RAG methods over a range of Build LLM Correct!
questing-answering (QA) datasets – from specific questions to ab- Graph
stract questions – and examine the effectiveness of all methods, Corpus Graph Relevant Information
providing a thorough analysis of graph-based RAG approaches. Graph-based RAG
As a byproduct of our experimental analysis, we are also able to Figure 1: Overview of vanilla RAG and graph-based RAG.
identify new variants of the graph-based RAG methods over spe-
cific QA and abstract QA tasks respectively, by combining existing knowledge, real-time updated information, and proprietary knowl-
techniques, which outperform the state-of-the-art methods. Finally, edge, which are outside LLMs’ pre-training corpus, known as “hal-
based on these findings, we offer promising research opportunities. lucination” [64].
We believe that a deeper understanding of the behavior of existing To bridge this gap, the Retrieval Augmented Generation (RAG)
methods can provide new valuable insights for future research. technique [15, 18, 27, 29, 83, 87, 89] has been proposed, which
supplements LLM with external knowledge to enhance its fac-
PVLDB Reference Format: tual accuracy and trustworthiness. Consequently, RAG techniques
Yingli Zhou, Yaodong Su, Youran Sun, Shu Wang, Taotao Wang, Runyuan have been widely applied in various fields, especially in domains
He, Yongwei Zhang, Sicong Liang, Xilin Liu, Yuchi Ma, and Yixiang Fang. where LLMs need to generate reliable outputs, such as health-
In-depth Analysis of Graph-based RAG in a Unified Framework care [49, 78, 91], finance [41, 60], and education [19, 80]. Moreover,
[Experiment, Analysis & Benchmark]. PVLDB, 18(1): XXX-XXX, 2025. RAG has proven highly useful in many data management tasks,
doi:XX.XX/XXX.XX including NL2SQL [13, 38], data cleaning [14, 40, 56, 66], knob tun-
ing [20, 37], DBMS diagnosis [70, 92, 93], and SQL rewrite [42, 73].
PVLDB Artifact Availability: Due to the important role of the RAG technique in LLM-based
The source code, data, and/or other artifacts have been made available at applications, numerous RAG methods have been proposed in the
https://github.com/JayLZhou/GraphRAG. past year [27]. Among these methods, the state-of-the-art RAG
approaches typically use the graph data as the external data (also
1 INTRODUCTION called graph-based RAG), since they capture the rich semantic in-
The development of Large Language Models (LLMs) like GPT-4 [1], formation and link relationships between entities. Given a user
Qwen2.5 [84], and Llama 3.1 [11] has sparked a revolution in the query 𝑄, the key idea of graph-based RAG methods is to retrieve
field of artificial intelligence [19, 28, 41, 49, 60, 78, 80, 91]. Despite relevant information (e.g., nodes, edges, subgraphs, or textual data)
their remarkable comprehension and generation capabilities, LLMs from the graph and then feed them along with 𝑄 as a prompt into
may still generate incorrect outputs due to a lack of domain-specific LLM to generate answers. The overview of naive-based RAG (i.e.,
vanilla RAG) and graph-based RAG are shown in Figure 1.
This work is licensed under the Creative Commons BY-NC-ND 4.0 International Following the success of graph-based RAG, researchers from
License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of
this license. For any use beyond those covered by this license, obtain permission by
fields such as database, data mining, machine learning, and natural
emailing [email protected]. Copyright is held by the owner/author(s). Publication rights language processing have designed efficient and effective graph-
licensed to the VLDB Endowment. based RAG methods [12, 21, 22, 30, 39, 64, 68, 81, 82]. In Table 1, we
Proceedings of the VLDB Endowment, Vol. 18, No. 1 ISSN 2150-8097.
doi:XX.XX/XXX.XX summarize the key characteristics of 12 representative graph-based
RAG methods based on the graph types they rely on, their index
Table 1: Classification of existing representative graph-based RAG methods.

Method Graph Type Index Component Retrieval Primitive Retrieval Granularity Specific QA Abstract QA
RAPTOR [68] Tree Tree node Question vector Tree node
KGP [81] Passage Graph Entity Question Chunk
HippoRAG [22] Knowledge Graph Entity Entities in question Chunk
G-retriever [26] Knowledge Graph Entity, Relationship Question vector Subgraph
ToG [72] Knowledge Graph Entity, Relationship Question Subgraph
DALK [39] Knowledge Graph Entity Entities in question Subgraph
LGraphRAG [12] Textual Knowledge Graph Entity, Community Question vector Entity, Relationship, Chunk, Community
GGraphRAG [12] Textual Knowledge Graph Community Question vector Community
FastGraphRAG [16] Textual Knowledge Graph Entity Entities in question Entity, Relationship, Chunk
LLightRAG [21] Rich Knowledge Graph Entity, Relationship Low-level keywords in question Entity, Relationship, Chunk
GLightRAG [21] Rich Knowledge Graph Entity, Relationship High-level keywords in question Entity, Relationship, Chunk
HLightRAG [21] Rich Knowledge Graph Entity, Relationship Both high- and low-level keywords Entity, Relationship, Chunk

components, retrieval primitives and granularity, and the types 2 PRELIMINARIES

of tasks they support. After a careful literature review, we make In this section, we review some key concepts of LLM and the general
the following observations. First, no prior work has proposed a workflow of graph-based RAG methods.
unified framework to abstract the graph-based RAG solutions and
identify key performance factors. Second, existing works focus on 2.1 Large Language Models (LLMs)
evaluating the overall performance, but not individual components.
We introduce some fundamental concepts of LLMs, including LLM
Third, there is no existing comprehensive comparison between all
prompting and retrieval augmented generation (RAG).
these methods in terms of accuracy and efficiency.
LLM Prompting. After instruction tuning on large corpus of
Our work. To address the above issues, in this paper, we conduct
human interaction scenarios, LLM is capable of following human
an in-depth study on graph-based RAG methods. We first propose a
instructions to complete different tasks [10, 61]. Specifically, given
novel unified framework with four stages, namely ❶ Graph build- the task input, we construct a prompt that encapsulates a com-
ing, ❷ Index construction, ❸ Operator configuration, and ❹ Retrieval prehensive task description. The LLM processes this prompt to
& generation, which captures the core ideas of all existing methods. fulfill the task and generate the corresponding output. Note that
Under this framework, we systematically compare 12 existing rep- pre-training on trillions of bytes of data enables LLM to generalize
resentative graph-based RAG methods. We conduct comprehensive to diverse tasks by simply adjusting the prompt [61].
experiments on the widely used question-answering (QA) datasets, Retrieval Augmented Generation. During completing tasks
including the specific and abstract questions, which evaluate the with prompting, LLMs often generate erroneous or meaningless
effectiveness of these methods in handling diverse query types and responses, i.e., the hallucination problem [28]. To mitigate the prob-
provide an in-depth analysis. lem, retrieval augmented generation (RAG) is utilized as an ad-
In summary, our principal contributions are as follows. vanced LLM prompting technique by using the knowledge within
the external corpus, typically including two major steps [18]: (1) re-
trieval: given a user question 𝑄, using the index to retrieve the most
• Summarize a novel unified framework with four stages for
relevant (i.e., top-𝑘) chunks to 𝑄, where the large corpus is first split
graph-based RAG solutions from a high-level perspective
into smaller chunks, and (2) generation: guiding LLM to generate
(Sections 3 ∼ 6).
answers with the retrieved chunks along with 𝑄 as a prompt.
• Conduct extensive experiments from different angles us-
ing various benchmarks, providing a thorough analysis
2.2 Graph-based RAG
of graph-based RAG methods. Based on our analysis, we
identify new variants of graph-based RAG methods, by Unlike vanilla RAG, graph-based RAG methods employ graph
combining existing techniques, which outperform the state- structures built from external corpus to enhance contextual un-
of-the-art methods (Section 7). derstanding in LLMs and generate more informed and accurate
• Summarize lessons learned and propose practical research responses [64]. Typically, graph-based RAG methods are composed
opportunities that can facilitate future studies (Section 8). of three major stages: (1) graph building: given a large corpus D
with 𝑑 chunks, for each chunk, an LLM extracts nodes and edges,
which are then combined to construct a graph G; (2) retrieval: given
The rest of the paper is organized as follows. In Section 2, we a user question 𝑄, using the index to retrieve the most relevant
present the preliminaries and introduce a novel unified framework information (e.g., nodes or subgraphs) from G, and (3) generation:
for graph-based RAG solutions in Section 3. In Sections 4 through guiding LLM to generate answers by incorporating the retrieved in-
6, we compare the graph-based RAG methods under our unified formation into the prompt along with 𝑄. Compared to vanilla RAG,
framework. The comprehensive experimental results and analysis the key difference in graph-based RAG methods can be summarized
are reported in Section 7. We present the learned lessons and a list from two perspectives:
of research opportunities in Section 8, and Section 9 reviews related • Retrieval source: Vanilla RAG retrieves knowledge directly
work while Section 10 summarizes the paper. from external chunks, whereas graph-based RAG methods
2
Offline Online

Nodes
Question Retrieval &
conversion generation
Edges
Corpus Chunks Graph Question Answer
❶ Graph building ❹ Retrieval & generation

or
…… LLM Entities Keywords Pri.
Graph elements Embeddings
Question retrieval
Graph Index Encoder Question vector
primitive

❷ Index construction
Operator Pool Parameters Pri.
……
Node Edge
…… Answer
Chunk Subgraph Selector Operators Graph Index Chunks
Prompt LLM
❸ Operator configuration
Figure 2: Workflow of graph-based RAG methods under our unified framework.

retrieve information from a graph constructed using these Table 2: Comparison of different types of graphs.
chunks.
• Retrieval element: Given a user question 𝑄, vanilla RAG Attributes Tree PG KG TKG RKG
aims to retrieve the most relevant chunks, while graph-
based RAG methods focus on finding useful information Original Chunk
from the graph, such as nodes, relationships, or subgraphs. Entity Name
Entity Type
3 A UNIFIED FRAMEWORK Entity Description
Relationship Name
In this section, we develop a novel unified framework, consisting
Relationship Keyword
of four stages: ❶ Graph building, ❷ Index construction, ❸ Operator Relationship Description
configuration, and ❹ Retrieval & generation, which can cover all Edge Weight
existing graph-based RAG methods, as shown in Algorithm 1.
Algorithm 1: A unified framework for graph-based RAG queries once completed, we present the workflow of graph-based
RAG methods under our framework in Figure 2.
input : Corpus D, and user question 𝑄
output : The answers for user question 𝑄
1 C ← split D into multiple chunks;
4 GRAPH BUILDING
// (1) Graph building. The graph building stage aims to transfer the input corpus into a
2 G ←GraphBuilding(C); graph, serving as a fundamental component in graph-based RAG
// (2) Index construction. methods. Before building a graph, the first step is splitting the
3 I ← IndexConstruction(G, C); corpus into smaller chunks, followed by using an LLM or other
// (3) Operator configuration. tools to create nodes and edges based on these chunks. There are five
4 O ← OperatorConfiguration( ); types of graphs, each with a corresponding construction method; we
// (4) Retrieve relevant information and generate present a brief description of each graph type and its construction
response. method below:
5 R ← Retrieval&generation(G, I, O, 𝑄); ❶ Passage Graph. In the passage graph (PG), each chunk repre-
6 return R; sents a node, and edges are built by the entity linking tools [81].
If two chunks contain a number of the same entities larger than a
Specifically, given the large corpus D, we first split it into multi- threshold, we link an edge for these two nodes.
ple chunks C (line 1). We then sequentially execute operations in ❷ Tree. The tree is constructed in a progressive manner, where
the following four stages (lines 2-5): (1) Build the graph G for input each chunk represents the leaf node in the tree. Then, it uses an
chunks C (Section 4); (2) Construct the index based on the graph LLM to generate higher-level nodes. Specifically, at the 𝑖-th layer,
G from the previous stage (Section 5); (3) Configure the retriever the nodes of (𝑖 + 1)-th layer are created by clustering nodes from
operators for subsequent retrieving stages (Section 6), and (4) For the 𝑖-th layer that does not yet have parent nodes. For each cluster
the input user question 𝑄, retrieve relevant information from G with more than two nodes, the LLM generates a virtual parent node
using the selected operators and feed them along with the question with a high-level summary of its child node descriptions.
𝑄 into the LLM to generate the answer. Note that the first three ❸ Knowledge Graph. The knowledge graph (KG) is constructed
stages are executed offline, enabling support for efficient online by extracting entities and relationships from each chunk, where
3
each entity represents an object and the relationship denotes the there are seven different operators to retrieve nodes. ❶ VDB lever-
semantic relation between two entities. ages the vector database to retrieve nodes by computing the vector
❹ Textual Knowledge Graph. A textual knowledge graph (TKG) similarity with the query vector. ❷ RelNode extracts nodes from
is a specialized KG (following the same construction step as KG), the provided relationships. ❸ PPR uses the Personalized PageR-
with the key difference being that in a TKG, each entity and rela- ank (PPR) algorithm [25] to identify the top-𝑘 similar nodes to the
tionship is assigned a brief textual description. question, where the restart probability of each node is based on its
❺ Rich Knowledge Graph. The rich knowledge graph (RKG) is similarity to the entities in the given question. ❹ Agent utilizes
an extended version of TKG, containing more information, includ- the capabilities of LLMs to select nodes from a list of candidate
ing textual descriptions for entities and relationships, as well as nodes. ❺ Onehop selects the one-hop neighbor entities of the given
keywords for relationships. entities. ❻ Link selects the top-1 most similar entity for each entity
We summarize the key characters of each graph type in Table 2. in the given set from the vector database. ❼ TF-IDF retrieves the
top-𝑘 relevant entities by ranking them based on term frequency
5 INDEX CONSTRUCTION and inverse document frequency from the TF-IDF matrix.
To support efficient online querying, existing graph-based RAG • Relationship type. These operators are designed to retrieve
methods typically include an index-construction stage, which in- relationships from the graph that are most relevant to the user ques-
volves storing entities or relationships in the vector database, and tion. There are four operators:❶ VDB, ❷ Onehop, ❸ Aggregator,
computing community reports for efficient online retrieval. Gen- and ❹ Agent. Specifically, the VDB operator also uses the vector
erally, there are three types of indices, ❶ Node Index, ❷ Rela- database to retrieve relevant relationships. The Onehop operator se-
tionship Index, and ❸ Community Index, where for the first lects relationships linked by one-hop neighbors of the given selected
two types, we use the well-known text-encoder models, such as entities. The Aggregator operator builds upon the PPR operator in
BERT [9], BGE-M3 [55], or ColBert [35] to generate embeddings the node operator. Given the PPR scores of entities, the most rele-
for nodes or relationships in the graph. vant relationships are determined by leveraging entity-relationship
❶ Node Index stores the graph nodes in the vector database. For interactions. Specifically, the score of each relationship is obtained
RAPTOR, G-retriever, DALK, FastGraphRAG, LGraphRAG, LLightRAG, by summing the scores of the two entities it connects. Thus, the
and HLightRAG, all nodes in the graph are directly stored in the vec- top-𝑘 relevant relationships can be selected. The key difference for
tor database. For each node in KG, its embedding vector is generated the Agent operator is that, instead of using a candidate entity list,
by encoding its entity name, while for nodes in Tree, TKG, and RKG, it uses a candidate relationship list, allowing the LLM to select the
the embedding vectors are generated by encoding their associated most relevant relationships based on the question.
textual descriptions. In KGP, it stores the TF-IDF matrix [24], which • Chunk type. The operators in this type aim to retrieve the
represents the term-weight distribution across different nodes (i.e., most relevant chunks to the given question. There are three op-
chunks) in the index. erators: ❶ Aggregator, ❷ FromRel, and ❸ Occurrence, where
❷ Relationship Index stores the relationships of the graph in a the first one is based on the Link operator in the relationship type,
vector database, where for each relationship, its embedding vector specifically, we use the relationship scores and the relationship-
is generated by encoding a description that combines its associated chunk interactions to select the top-𝑘 chunks, where the score of
context (e.g., description) and the names of its linked entities. each chunk is obtained by the summing the scores of all relation-
❸ Community Index stores the community reports for each ships extracted from it. The FromRel operator retrieves chunks that
community, where communities are generated by the clustering al- “contain” the given relationships. The Occurrence operator selects
gorithm and the LLM produces the reports. Specifically, Leiden [75] the top-𝑘 chunks based on the given relationships, assigning each
algorithm is utilized by LGraphRAG and GGraphRAG. chunk a score by counting the number of times it contains both
entities in a relationship.
6 RETRIEVAL AND GENERATION • Subgraph type. There are three operators to retrieve the
In this section, we explore the key steps in graph-based RAG meth- relevant subgraphs from the graph G: The ❶ KhopPath operator
ods, i.e., selecting operators, and using them to retrieve relevant aims to identify 𝑘-hop paths in G by iteratively finding such paths
information to question 𝑄. where the start and end points belong to the given entity set. After
identifying a path, the entities within it are removed from the entity
6.1 Retrieval operators set, and this process repeats until the entity set is empty. Note that
if two paths can be merged, they are combined into one path. For
In this subsection, we exhibit that the retrieval stage in various example, if we have two paths 𝐴 → 𝐵 → 𝐶 and 𝐴 → 𝐵 → 𝐶 →
graph-based RAG methods can be decoupled into a series of op- 𝐷, we can merge them into a single path 𝐴 → 𝐵 → 𝐶 → 𝐷.
erators, with different methods selecting specific operators and The ❷ Steiner operator first identifies the relevant entities and
combining them in various ways. By selecting and arranging these relationships, then uses these entities as seed nodes to construct a
operators in different sequences, all existing (and potentially fu- Steiner tree [24]. The ❸ AgentPath operator aims to identify the
ture) graph-based RAG methods can be implemented. Through an most relevant 𝑘-hop paths to a given question, by using LLM to
in-depth analysis of all implementations, we distill the retrieval filter out the irrelevant paths.
process into a set of 19 operators, forming an operator pool. Based • Community type. Only the LGraphRAG and GGraphRAG using
on the granularity of retrieval, we classify the operators into five the community operators, which includes two detailed operators,
categories: ❶ Entity, and ❷ Layer. The Entity operator aims to obtain the
• Node type. This type of operator focuses on retrieving “impor- communities containing the specified entities. Here, all identified
tant” nodes for a given question, and based on the selection policy,
4
Table 3: Operators utilized in graph-based RAG methods; “N/A” means that this type of operator is not used.

Method Node Relationship Chunk Subgraph Community

RAPTOR VDB N/A N/A N/A N/A
KGP TF-IDF N/A N/A N/A N/A
HippoRAG Link + PPR Aggregator Aggregator N/A N/A
G-retriever VDB VDB N/A Steiner N/A
ToG Link + Onehop + Agent Onehop + Agent N/A N/A N/A
DALK Link + Onehop + Agent N/A N/A KhopPath + AgentPath N/A
FastGraphRAG Link + VDB + PPR Aggregator Aggregator N/A N/A
LGraphRAG VDB Onehop Occurrence N/A Entity
RGraphRAG N/A N/A N/A N/A Layer
LLightRAG VDB Onehop Occurrence N/A N/A
GLightRAG FromRel VDB FromRel N/A N/A
HLightRAG VDB + FromRel Onehop + VDB Occurrence + FromRel N/A N/A

communities are sorted based on their rating (generated by the Table 4: Datasets used in our experiments; The underlined
LLM), and then the top-𝑘 communities are returned. The Leiden number of chunks denotes that the dataset is pre-split into
algorithm generates hierarchical communities, where higher layers chunks by the expert annotator.
represent more abstract, high-level information. The Layer opera-
tor is used to retrieve all communities below the required layers. Dataset # of Tokens # of Questions # of Chunks QA Type
MultihopQA 1,434,889 2,556 609 Specific QA
6.2 Operator configuration Quality 1,522,566 4,609 265 Specific QA
PopQA 2,630,554 1,172 33,595 Specific QA
Under our unified framework, any existing graph-based RAG method MusiqueQA 3,280,174 3,000 29,898 Specific QA
can be implemented by leveraging the operator pool along with HotpotQA 8,495,056 3,702 66,581 Specific QA
specific method parameters. Those parameters define two key as- ALCE 13,490,670 948 89,562 Specific QA

pects: (1) which operators to use, and (2) how to combine or apply Mix 611,602 125 61 Abstract QA
MultihopSum 1,434,889 125 609 Abstract QA
the selected operators. Agriculture 1,949,584 125 12 Abstract QA
In Table 3, we present how the existing graph-based RAG meth- CS 2,047,923 125 10 Abstract QA
ods utilize our provided operators to assemble their retrieval stages. Legal 4,774,255 125 94 Abstract QA
Due to this independent and modular decomposition of all graph-
based RAG methods, we not only gain a deeper understanding of are two types of answer generation paradigms: ❶ Directly and ❷
how these approaches work but also gain the flexibility to combine Map-Reduce. The former directly utilizes the LLM to generate the
these operators to create new methods. Besides, new operators can answer, while the latter, used in GGraphRAG, analyzes the retrieved
be easily created, for example, we can create a new operator VDB communities one by one, first, each community is used to answer
within the community type, which allows us to retrieve the most the question independently in parallel, and then all relevant partial
relevant communities by using vector search to compare the seman- answers are summarized into a final answer.
tic similarity between the question and the communities. In our
later experimental results (see Exp.5 in Section 7.3), thanks to our 7 EXPERIMENTS
modular design, we can design a new state-of-the-art graph-based We now present the experimental results. Section 7.1 discusses the
RAG method by first creating two new operators and combining setup. We discuss the results for specific QA and abstract QA tasks
them with the existing operators. in Sections 7.2 and 7.3, respectively.

6.3 Retrieval & generation 7.1 Setup

In the Retrieval & generation stage, the graph-based RAG methods Workflow of our evaluation. We present the first open-source
first go through a Question conversion stage (see the second subfig- testbed for graph-based RAG methods, which (1) collects and re-
ure on the right side of Figure 2), which aims to transfer the user implements 12 representative methods within a unified framework
input question 𝑄 into the retrieval primitive 𝑃, where 𝑃 denotes (as depicted in Section 3). (2) supports a fine-grained comparison
the atomic retrieval unit, such as entities or keywords in 𝑄, and the over the building blocks of the retrieval stage with up to 100+ vari-
embedding vector of 𝑄. ants, and (3) provides a comprehensive evaluation over 11 datasets
In the Question conversion stage, DALK, HippoRAG, and ToG extract with various metrics in different scenarios, we summarize the work-
entities from the question; KGP directly uses the original question flow of our empirical study in Figure 3, and make our unified system
as the retrieval primitive. The three versions of LightRAG extract available in: https://github.com/JayLZhou/GraphRAG/tree/master.
keywords from the question as the retrieval primitive, and the Benchmark Dataset. We employ 11 real-world datasets for eval-
remaining methods use the embedding vector of 𝑄. uating the performance of each graph-based RAG method, including
Based on the retrieval primitive 𝑃 and the selected operators, the both specific and abstract questions.
most relevant information to 𝑄 is retrieved and combined with 𝑄 to • Specific. The question in this group is detail-oriented and
form the final prompt for LLM response generation. Generally, there typically references specific entities within the graph (e.g.,
5
Table 5: Comparison of methods on different datasets, where Purple denotes the best result, and Orange denotes the best result
excluding the best one; For the three largest datasets, we replace the clustering method in RAPTOR from Gaussian Mixture to
K-means, as the former fails to finish within two days; The results of this version (i.e., K-means) are marked with † .

MultihopQA Quality PopQA MusiqueQA HotpotQA ALCE

Method
Accuracy Recall Accuracy Accuracy Recall Accuracy Recall Accuracy Recall STRREC STREM STRHIT

ZeroShot 49.022 34.256 37.058 28.592 8.263 1.833 5.072 35.467 42.407 15.454 3.692 30.696
VanillaRAG 50.626 36.918 39.141 60.829 27.058 17.233 27.874 50.783 57.745 34.283 11.181 63.608

G-retriever 42.019 43.116 31.807 17.084 6.075 2.733 11.662 — — 9.754 2.215 19.726
ToG 41.941 38.435 34.888 47.677 23.727 9.367 20.536 — — 13.975 3.059 29.114
KGP 48.161 36.272 33.955 57.255 24.635 17.333 27.572 — — 27.692 8.755 51.899
DALK 53.952 47.232 34.251 45.604 19.159 11.367 22.484 33.252 47.232 21.408 4.114 44.937
LLightRAG 44.053 35.528 34.780 38.885 16.764 9.667 19.810 34.144 41.811 21.937 5.591 43.776
GLightRAG 48.474 38.365 33.413 20.944 8.146 7.267 17.204 25.581 33.297 17.859 3.587 37.131
HLightRAG 50.313 41.613 34.368 41.244 18.071 11.000 21.143 35.647 43.334 25.578 6.540 50.422
FastGraphRAG 52.895 44.278 37.275 53.324 22.433 13.633 24.470 43.193 51.007 30.190 8.544 56.962
HippoRAG 53.760 47.671 48.297 59.900 24.946 17.000 28.117 50.324 58.860 23.357 6.962 43.671
LGraphRAG 55.360 50.429 37.036 45.461 18.657 12.467 23.996 33.063 42.691 28.448 8.544 54.747
RAPTOR 56.064 44.832 56.997 62.545 27.304 24.133† 35.595† 55.321† 62.424† 35.255† 11.076† 65.401†

“Who won the 2024 U.S. presidential election?”). We cate- 1. Method Implementation 2. Benchmark Collection
gorize the questions into two groups based on complexity:
Simple and Complex. The former has answers directly avail-
able in one or two text chunks, requiring no reasoning Reimplementation Systems Question
Methods Dataset Metric
across chunks, which includes three datasets: Quality [62],
PopQA [53], and HotpotQA [85]. The latter involves rea-
soning across multiple chunks, understanding implicit rela- 3. Evaluation
tionships, and synthesizing knowledge, including datasets: Insights
Accuracy Applicability
MultihopQA [74], MusiqueQA [76], and ALCE [17].
Graph Operator Parameter
• Abstract. Unlike the previous groups, the questions in this Type Combination Analysis
category are not centered on specific factual queries. In- Token cost Efficiency
New SOTA
stead, they involve abstract, conceptual inquiries that en-
Figure 3: Workflow of our empirical study.
compass broader topics, summaries, or overarching themes.
An example of an abstract question is: “How does arti- specific options or words. Following existing works [17, 68], we
ficial intelligence influence modern education?”. The ab- use string recall (STRREC), string exact matching (STREM), and
stract question requires a high-level understanding of the string hit (STRHIT) as evaluation metrics. For abstract QA tasks,
dataset contents, including five datasets: Mix [65], Multi- we follow prior work [12] and use a head-to-head comparison
hopSum [74], Agriculture [65], CS [65], and Legal [65]. approach using an LLM evaluator (i.e., GPT-4o). This is mainly
Their statistics, including the numbers of tokens, and questions, because LLMs have demonstrated strong capabilities as evaluators
and the question-answering (QA) types are reported in Table 4. of natural language generation, often achieving state-of-the-art or
For specific (both complex and simple) QA datasets, we use the competitive results when compared to human judgments [77, 90].
questions provided by each dataset. While for abstract QA datasets, Here, we utilize four evaluation dimensions: Comprehensiveness,
we follow existing works [12, 21] and generate questions using one Diversity, Empowerment, and Overall for abstract QA tasks.
of the most advanced LLM, GPT-4o. Specifically, for each dataset, Implementation. We implement all the algorithms in Python
we generate 125 questions by prompting GPT-4o, following the with our proposed unified framework and try our best to ensure a
approach in [21]. The prompt template used for question generation native and effective implementation. All experiments are run on 350
is provided in our technical report [67]. Note that MultihopQA and Ascend 910B-3 NPUs [31]. Besides, Zeroshot, and VanillaRAG are
MultihopSum originate from the same source, but differ in the types also included in our study, which typically represent the model’s
of questions they include—the former focuses on complex QA tasks, inherent capability and the performance improvement brought by
while the latter on abstract QA tasks. basic RAG, respectively. If a method cannot finish in two days, we
Evaluation Metric. For the specific QA tasks, we use Accuracy mark its result as N/A in the figures and “—” in the tables.
and Recall to evaluate performance on the first five datasets based Hyperparameter Settings. In our experiment, we use Llama-
on whether gold answers are included in the generations instead 3-8B [11] as the default LLM, which is widely used in existing
of strictly requiring exact matching, following [53, 69]. For the RAG methods [88]. For LLM, we set the maximum token length
ALCE dataset, answers are typically full sentences rather than to 8,000, and use greedy decoding to generate one sample for the
6
Table 6: Comparison RAPTOR and RAPTOR-K. 3). Specifically, RAPTOR leverages similarity-based vector search,
comparing the embedding similarity between the given question
MultihopQA Quality PopQA and the high-level text summary. In contrast, LGraphRAG selects
Method
Accuracy Recall Accuracy Accuracy Recall communities based on whether they contain the given entities and
RAPTOR 56.064 44.832 56.997 62.545 27.304 the rating score generated by the LLM. However, this approach
RAPTOR-K 56.768 44.208 54.567 64.761 28.469 may retrieve communities that do not align with the question’s
semantic meaning, potentially degrading LLM performance. To
further investigate this issue, we conduct an extra ablation study
deterministic output For each method requiring top-𝑘 selection
later (See Exp.4 in this Section).
(e.g., chunks or entities), we set 𝑘 = 4 to accommodate the token
(5) For the three largest datasets, the K-means [24]-based RAPTOR
length limitation. We use one of the most advanced text-encoding
(denoted as RAPTOR-K) also demonstrates remarkable performance.
models, BGE-M3 [55], as the embedding model across all methods to
This suggests that the clustering method used in RAPTOR merely im-
generate embeddings for vector search. If an expert annotator pre-
pacts overall performance. This may be because different clustering
splits the dataset into chunks, we use those as they preserve human
methods share the same key idea: grouping similar items into the
insight. Otherwise, following existing works [12, 21], we divide
same cluster. Therefore, they may generate similar chunk clusters.
the corpus into 1,200-token chunks. For other hyper-parameters of
To verify this, we compare RAPTOR-K with RAPTOR on the first three
each method, we follow the original settings in their available code.
datasets, and present results in Table 6. We observe that RAPTOR-K
achieves comparable or even better performance than RAPTOR. In
7.2 Evaluation for specific QA the remaining part of our experiments, if RAPTOR does not finish
In this section, we evaluate the performance of different methods constructing the graph within two days, we use RAPTOR-K instead.
on specific QA tasks. Exp.2. Token costs of graph and index building. In this
Exp.1. Overall performance. We report the metric values experiment, we first report the token costs of building four types
of all algorithms on specific QA tasks in Table 5. We can make the of graphs across all datasets. Notably, building PG incurs no token
following observations and analyses: (1) Generally, the RAG tech- cost, as it does not rely on the LLM for graph construction. As
nique significantly enhances LLM performance across all datasets, shown in Figure 4(a) to (f), we observe the following: (1) Building
and the graph-based RAG methods (e.g., HippoRAG and RAPTOR) trees consistently require the least token cost, while TKG and RKG
typically exhibit higher accuracy than VanillaRAG. However, if the incur the highest token costs, with RKG slightly exceeding TKG.
retrieved elements are not relevant to the given question, RAG may In some cases, RKG requires up to 40× more tokens than trees. (2)
degrade the LLM’s accuracy. For example, on the Quality dataset, KG falls between these extremes, requiring more tokens than trees
compared to Zeroshot, RAPTOR improves accuracy by 53.80%, while but fewer than TKG and RKG. This trend aligns with the results in
G-retriever decreases it by 14.17%. This is mainly because, for Table 2, where graphs with more attributes require higher token
simple QA tasks, providing only entities and relationships from a costs for construction. (3) Recall that the token cost for an LLM call
subgraph is insufficient to answer such questions effectively. consists of two parts: the prompt token, which accounts for the
(2) For specific QA tasks, retaining the original text chunks is cru- tokens used in providing the input, and the completion part, which
cial for accurate question answering, as the questions and answers includes the tokens generated by the model as a response. Here,
in these datasets are derived from the text corpus. This may explain we report the token costs for prompt and completion on HotpotQA
why G-retriever, ToG, and DALK, which rely solely on graph struc- and ALCE datasets in Figure 4(g) to (h). The other datasets exhibit
ture information, perform poorly on most datasets. However, on similar trends, we include their results in our technical report [67].
MultihopQA, which requires multi-hop reasoning, DALK effectively We conclude that, regardless of the graph type, the prompt part
retrieves relevant reasoning paths, achieving accuracy and recall always incurs higher token costs than the completion part.
improvements of 6.57% and 27.94% over VanillaRAG, respectively. We then examine the token costs of index building across all
(3) If the dataset is pre-split into chunks by the expert annotator, datasets. Since only LGraphRAG and GGraphRAG require an LLM for
VanillaRAG often performs better compared to datasets where index construction, we report only the token costs for generating
chunks are split based on the token size, and we further investigate community reports in Figure 5. We can see that the token cost for
this phenomenon later in our technical report [67]. index construction is nearly the same as that for building TKG. This
(4) RAPTOR often achieves the best performance among most is mainly because it requires generating a report for each commu-
datasets, especially for simple questions. For complex questions, nity, and the number of communities is typically large, especially in
RAPTOR also performs exceptionally well. This is mainly because, large datasets. For example, the HotpotQA dataset contains 57,384
for such questions, high-level summarized information is crucial for communities, significantly increasing the overall token consump-
understanding the underlying relationships across multiple chunks. tion. That is to say, on large datasets, the two versions of GraphRAG
Hence, as we shall see, LGraphRAG is expected to achieve similar often take more tokens than other methods in the offline stage.
results, as it also incorporates high-level information (i.e., a summa- Exp.3. Evaluation of the generation costs. In this exper-
rized report of the most relevant community for a given question). iment, we evaluate the time and token costs for each method in
However, we only observe this effect on the MultihopQA dataset. specific QA tasks. Specifically, we report the average time and
For the other two complex QA datasets, LGraphRAG even underper- token costs for each query across all datasets in Table 7 (These
forms compared to VanillaRAG. Meanwhile, RAPTOR still achieves results may vary upon rerunning due to the inherent uncertainty
the best performance on these two datasets. We hypothesize that of the LLM.). It is not surprising that ZeroShot and VanillaRAG
this discrepancy arises from differences in how high-level infor-
mation is retrieved (See operators used for each method in Table
7
107 107 108 108

107 107

106 106 106 106

Tree KG TKG RKG Tree KG TKG RKG Tree KG TKG RKG Tree KG TKG RKG
(a) MultihopQA (b) Quality (c) PopQA (d) MusiqueQA

RKG RKG

TKG TKG
108 108
KG KG

107 107 Tree Tree

0 20 40 60 80 100 0 20 40 60 80 100
6 6
10 10 Token cost proportion (%) Token cost proportion (%)
Tree KG TKG RKG Tree KG TKG RKG Prompt token Completion token Prompt token Completion token
(e) HotpotQA (f) ALCE (g) The token costs on HotpotQA (h) The token costs on ALCE
Figure 4: Token cost of graph building on specific QA datasets.
Table 7: The average time and token costs of all methods on specific QA datasets.

MultihopQA Quality PopQA MusiqueQA HotpotQA ALCE

Method
time token time token time token time token time token time token
ZeroShot 3.23 s 270.3 1.47 s 169.1 1.17 s 82.2 1.73 s 137.8 1.51 s 125.0 2.41 s 177.2
VanillaRAG 2.35 s 3,623.4 2.12 s 4,502.0 1.41 s 644.1 1.31 s 745.4 1.10 s 652.0 1.04 s 849.1
G-retriever 6.87 s 1,250.0 5.18 s 985.5 37.51 s 3,684.5 31.21 s 3,260.5 — — 101.16 s 5,096.1
ToG 69.74 s 16,859.6 37.03 s 10,496.4 42.02 s 11,224.2 53.55 s 12,480.8 — — 34.94 s 11,383.2
KGP 38.86 s 13,872.2 35.76 s 14,092.7 37.49 s 6,738.9 39.82 s 7,555.8 — — 105.09 s 9,326.6
DALK 28.03 s 4,691.5 13.23 s 2,863.6 16.33 s 2,496.5 17.48 s 3,510.9 21.33 s 3,989.7 17.04 s 4,071.9
LLightRAG 19.28 s 5,774.1 15.76 s 5,054.5 10.71 s 2,447.5 13.95 s 3,267.6 13.94 s 3,074.2 10.34 s 4,427.9
GLightRAG 18.37 s 5,951.5 15.97 s 5,747.3 12.10 s 3,255.6 15.20 s 3,260.8 13.95 s 3,028.7 13.02 s 4,028.1
HLightRAG 19.31 s 7,163.2 21.49 s 6,492.5 17.71 s 5,075.8 20.93 s 5,695.3 19.58 s 4,921.7 16.55 s 6,232.3
FastGraphRAG 7.17 s 5,874.8 3.48 s 6,138.9 13.25 s 6,157.0 15.19 s 6,043.5 28.71 s 6,029.8 25.82 s 6,010.9
HippoRAG 3.46 s 3,261.1 3.03 s 3,877.6 2.32 s 721.3 2.69 s 828.4 3.12 s 726.4 2.94 s 858.2
LGraphRAG 2.98 s 6,154.9 3.77 s 6,113.7 1.72 s 4,325.2 2.66 s 4,675.7 2.05 s 4,806.2 2.11 s 5,441.1
RAPTOR 3.18 s 3,210.0 2.46 s 4,140.7 1.36 s 1,188.3 1.85 s 1,742.9 1.48 s 757.6 1.54 s 793.6

Prompt token Completion token relationships for answering the question. On the other hand, the
108 costs of LLightRAG, GLightRAG, and HLightRAG gradually increase,
107 aligning with the fact that more information is incorporated into
the prompt construction. All three methods are more expensive
106
than LGraphRAG in specific QA tasks, as they use LLM to extract
105 keywords in advance. Moreover, the time cost of all methods is
MultihopQA Quality PopQA MusiqueQA HotpotQA ALCE
dataset proportional to the completion token cost. We present the results
Figure 5: Token cost of index construction in specific QA. in our technical report [67], which explains why in some datasets,
VanillaRAG is even faster than ZeroShot.
are the most cost-efficient methods in terms of both time and to-
Exp.4. Detailed analysis for RAPTOR and LGraphRAG. Our
ken consumption. Among all graph-based RAG methods, RAPTOR
first analysis about RAPTOR aims to explain why RAPTOR outper-
and HippoRAG are typically the most cost-efficient, as they share a
forms VanillaRAG. Recall that in RAPTOR, for each question 𝑄, it
similar retrieval stage with VanillaRAG. The main difference lies
retrieves the top-𝑘 items across the entire tree, meaning the re-
in the chunk retrieval operators they use. Besides, KGP and ToG
trieved items may originate from different layers. That is, we report
are the most expensive methods, as they rely on the agents (i.e.,
the proportion of retrieved items across different tree layers in Ta-
different roles of the LLM) for information retrieval during prompt
ble 8. As we shall see, for the MultihopQA and MusiqueQA datasets,
construction. The former utilizes the LLM to reason the next re-
the proportion of retrieved high-level information (i.e., items not
quired information based on the original question and retrieved
from leaf nodes) is significantly higher than in other datasets. For
chunks, while the latter employs LLM to select relevant entities and
8
Table 8: Proportion of retrieved nodes across tree layers. improves Accuracy by 6.42% on the MultihopQA dataset and 11.6%
on the MusiqueQA dataset, respectively.
Layer MultihopQA Quality PopQA MusiqueQA HotpotQA ALCE
0 59.3% 76.8% 76.1% 69.3% 89.7% 90.6%
1 27.5% 18.7% 16.5% 28.1% 9.5% 8.8%
7.3 Evaluation for abstract QA
>1 13.2% 4.5% 7.4% 2.6% 0.8% 0.6% In this section, we evaluate the performance of different methods
on abstract QA tasks.
Exp.1. Overall Performance. We evaluate the performance
Table 9: Descriptions of the different variants of LGraphRAG.
of methods that support abstract QA (see Table 1) by presenting
head-to-head win rate percentages, comparing the performance of
New retrieval each row method against each column method. Here, we denote VR,
Name Retrieval elements
strategy
RA, GS, LR, and FG as VanillaRAG, RAPTOR, GGraphRAG with high-
LGraphRAG Entity, Relationship, Community, Chunk layer communities (i.e., two-layer for this original implementation),
GraphRAG-ER Entity, Relationship HLightRAG and FastGraphRAG, respectively. The results are shown
GraphRAG-CC Community, Chunk
VGraphRAG-CC Community, Chunk
in Figure 6 to Figure 10, and we can see that: (1) Graph-based RAG
VGraphRAG Entity, Relationship, Community, Chunk methods often outperform VanillaRAG, primarily because they
effectively capture inter-connections among chunks. (2) Among
datasets requiring multi-hop reasoning to answer questions, high- graph-based RAG methods, GGraphRAG and RAPTOR generally out-
level information plays an essential role. This may explain why perform HLightRAG and FastGraphRAG as they integrate high-level
RAPTOR outperforms VanillaRAG on these two datasets. summarized text into the prompt, which is essential for abstract
We then conduct a detailed analysis of LGraphRAG on complex QA tasks. In contrast, the latter two rely solely on low-level graph
questions in specific QA datasets by modifying its retrieval meth- structures (e.g., entities and relationships) and original text chunks,
ods or element types. By doing this, we create three variants of limiting their effectiveness in handling abstract questions. (3) On
LGraphRAG, and we present the detailed descriptions for each vari- almost all datasets, GGraphRAG consistently achieves the best per-
ant in Table 9. Here, VGraphRAG-CC introduces a new retrieval strat- formance, aligning with findings from existing work [12]. This
egy. Unlike LGraphRAG, it uses vector search to retrieve the top-𝑘 suggests that community reports are highly effective in captur-
elements (i.e., chunks or communities) from the vector database. ing high-level structured knowledge and relational dependencies
Eventually, we evaluate their performance on the three complex QA among chunks, meanwhile, the Map-Reduce strategy further aids
datasets and present the results in Table 10. We make the following in filtering irrelevant retrieved content. (4) Sometimes, RAPTOR out-
analysis: (1) Community reports serve as effective high-level infor- performs GGraphRAG, likely because the textual information in the
mation for complex QA tasks. For instance, VGraphRAG-CC achieves original chunks is crucial for answering certain questions.
comparable or even better performance than RAPTOR, highlighting Exp.2. Token costs of graph and index building. The token
the value of community reports. (2) The retrieval strategy in the orig- costs of the graph and index building across all abstract QA datasets
inal LGraphRAG, which selects communities and chunks based on are shown in Figures 11 and 12 respectively. The conclusions are
the frequency of relevant entities in a given question, may not be op- highly similar to the Exp.2 in Section 7.2.
timal in some cases. This is verified by VGraphRAG-CC consistently Exp.3. Evaluation of the generation costs. In this exper-
outperforming GraphRAG-CC, which suggests that a vector-based iment, we present the time and token costs for each method in
retrieval approach is a better way than the heuristic rule-based abstract QA tasks. As shown in Table 11, GGraphRAG is the most
way. (3) For multi-hop reasoning tasks (e.g., MultihopQA), entity expensive method, as expected, while other graph-based methods
and relationship information can serve as an auxiliary signal, it exhibit comparable costs, although they are more expensive than
helps LLM link relevant information (i.e., entity and relationship VanillaRAG. For example, on the MutihopSum dataset, GGraphRAG
descriptions), and guide the reasoning process. This is supported by requires 57 × more time and 210 × more tokens per query com-
LGraphRAG outperforming both GraphRAG-CC and VGraphRAG-CC pared to VanillaRAG. Specifically, each query in GGraphRAG takes
on the MultihopQA dataset, indicating the importance of structured around 9 minutes and consumes 300K tokens, making it impractical
graph information for multi-hop reasoning. for real-world scenarios. This is because, to answer an abstract
Exp.5. New SOTA algorithm. Based on the above analysis, question, GGraphRAG needs to analyze all retrieved communities,
we aim to develop a new state-of-the-art method for complex QA which is highly time- and token-consuming, especially when the
datasets, denoted as VGraphRAG. Specifically, our algorithm first re- number of communities is large (e.g., in the thousands).
trieves the top-𝑘 entities and their corresponding relationships, this Exp.4. New SOTA algorithm. While the GGraphRAG shows
step is the same as LGraphRAG. Next, we adopt the vector search- remarkable performance in abstract QA, its time and token costs
based retrieval strategy to select the most relevant communities are not acceptable in practice. Based on the above analysis, we aim
and chunks, this step is the same as VGraphRAG-CC. Then, by com- to design a cost-efficient version of GGraphRAG, named CheapRAG.
bining the four elements above, we construct the final prompt of Our design is motivated by the fact that: only a few communities
our method to effectively guide the LLM in generating accurate (typically less than 10) are useful for generating responses. However,
answers. The results are also shown in Table 10, we can see that in GGraphRAG, all communities within a specified number of layers
VGraphRAG performs best on all complex QA datasets. For example, must be retrieved and analyzed, which is highly token-consuming.
on the ALCE dataset, it improves STRREC, STREM, and STRHIT by Besides, as discussed earlier, original chunks are also valuable for
8.47%, 13.18%, and 4.93%, respectively, compared to VGraphRAG-CC. certain questions. Therefore, our new algorithm, CheapRAG, incor-
Meanwhile, compared to RAPTOR, our new algorithm VGraphRAG porates these useful chunks. Specifically, given a new question 𝑄,
9
Table 10: Comparison of our newly designed methods on specific datasets with complex questions.

Dataset Metric ZeroShot VanillaRAG LGraphRAG RAPTOR GraphRAG-ER GraphRAG-CC VGraphRAG-CC VGraphRAG
Accuracy 49.022 50.626 55.360 56.064 52.739 52.113 55.203 59.664
MultihopQA
Recall 34.526 36.918 50.429 44.832 45.113 43.770 46.750 50.893
Accuracy 1.833 17.233 12.467 24.133 11.200 13.767 22.400 26.933
MusiqueQA
Recall 5.072 27.874 23.996 35.595 22.374 25.707 35.444 40.026
STRREC 15.454 34.283 28.448 35.255 26.774 35.366 37.820 41.023
ALCE STREM 3.692 11.181 8.544 11.076 7.5949 11.920 13.608 15.401
STRHIT 30.696 63.608 54.747 65.401 52.743 64.662 68.460 71.835

VR RA GS LR FG VR RA GS LR FG VR RA GS LR FG VR RA GS LR FG
VR 50 58 30 36 93 VR 50 66 58 35 90 VR 50 60 29 28 92 VR 50 60 19 32 93
RA 42 50 39 26 82 RA 34 50 54 20 76 RA 40 50 45 22 82 RA 40 50 44 24 82
GS 70 61 50 15 89 GS 42 46 50 26 86 GS 71 54 50 12 88 GS 81 56 50 14 88
LR 64 74 85 50 98 LR 65 80 74 50 96 LR 72 78 88 50 98 LR 68 76 86 50 98
FG 7 18 11 2 50 FG 10 24 14 4 50 FG 8 18 12 2 50 FG 7 18 12 2 50
(a) Comprehensiveness (b) Diversity (c) Empowerment (d) Overall
Figure 6: The abstract QA results on Mix dataset.

VR RA GS LR FG VR RA GS LR FG VR RA GS LR FG VR RA GS LR FG
VR 50 50 2 46 95 VR 50 64 58 64 93 VR 50 52 36 39 95 VR 50 52 44 46 95
RA 50 50 47 48 94 RA 36 50 42 49 85 RA 48 50 45 45 93 RA 48 50 45 47 94
GS 78 53 50 79 96 GS 42 55 50 52 92 GS 64 54 50 41 97 GS 56 55 50 52 97
LR 54 52 21 50 92 LR 36 51 48 50 88 LR 61 55 59 50 95 LR 54 53 48 50 93
FG 5 6 4 8 50 FG 7 15 8 12 50 FG 5 7 3 5 50 FG 5 6 3 7 50
(a) Comprehensiveness (b) Diversity (c) Empowerment (d) Overall
Figure 7: The abstract QA results on MultihopSum dataset.

VR RA GS LR FG VR RA GS LR FG VR RA GS LR FG VR RA GS LR FG
VR 50 32 39 54 85 VR 50 32 45 59 77 VR 50 24 41 52 85 VR 50 30 38 53 85
RA 68 50 19 73 94 RA 68 50 16 76 90 RA 76 50 22 76 96 RA 70 50 16 76 95
GS 61 81 50 62 89 GS 55 84 50 63 82 GS 59 78 50 58 91 GS 62 84 50 62 90
LR 46 27 38 50 78 LR 41 24 37 50 71 LR 48 24 42 50 81 LR 47 24 38 50 79
FG 15 6 11 22 50 FG 23 10 18 29 50 FG 15 4 9 19 50 FG 15 5 10 21 50
(a) Comprehensiveness (b) Diversity (c) Empowerment (d) Overall
Figure 8: The abstract QA results on Agriculture dataset.
our algorithm adopts a vector search-based retrieval strategy to (3) The constructed graphs are typically very sparse, with their size
select the most relevant communities and chunks. Next, we apply proportional to the number of chunks.
a Map-Reduce strategy to generate the final answer. As shown in
Figure 13 and Table 11, CheapRAG not only achieves better perfor-
mance than GGraphRAG but also significantly reduces token costs (in 8 LESSONS AND OPPORTUNITIES
most cases). For example, on the MultihopSum dataset, CheapRAG We summarize the lessons (L) for practitioners and propose practical
reduces token costs by 100× compared to GGraphRAG, while achiev- research opportunities (O) based on our observations.
ing better answer quality. While the diversity of answers generated Lessons:
by CheapRAG is not yet optimal, we leave this as a future work. L1. In Figure 14, we depict a roadmap of the recommended RAG
More Analysis. In addition, we present a detailed analysis of the methods, highlighting which methods are best suited for different
graph-based RAG methods in our technical report [67], including scenarios.
Effect of chunk size, Effect of base model, and The size of graph. L2. Chunk quality is very important for the overall performance of
Due to the limited space, we only summarize the key conclusions all RAG methods, and human experts are better at splitting chunks
here: (1) Chunk quality is crucial for all methods. (2) Stronger LLM than relying solely on token size.
backbone models can further enhance performance for all methods. L3. For complex questions in specific QA, high-level information
is typically needed, as they capture the complex relationship among
10
VR RA GS LR FG VR RA GS LR FG VR RA GS LR FG VR RA GS LR FG
VR 50 22 25 36 80 VR 50 18 25 37 75 VR 50 15 24 29 80 VR 50 15 22 30 79
RA 78 50 55 69 99 RA 82 50 51 79 99 RA 85 50 59 72 100 RA 85 50 54 73 67
GS 75 45 50 64 97 GS 75 49 50 63 91 GS 76 41 50 60 96 GS 78 46 50 62 97
LR 64 31 36 50 95 LR 63 21 37 50 93 LR 71 28 40 50 98 LR 70 27 38 50 97
FG 20 1 3 5 50 FG 25 1 9 7 50 FG 20 0 4 2 50 FG 21 33 3 3 50
(a) Comprehensiveness (b) Diversity (c) Empowerment (d) Overall
Figure 9: The abstract QA results on CS dataset.

VR RA GS LR FG VR RA GS LR FG VR RA GS LR FG VR RA GS LR FG
VR 50 26 31 41 93 VR 50 36 32 45 90 VR 50 24 29 34 95 VR 50 26 30 37 94
RA 74 50 27 67 95 RA 64 50 68 68 93 RA 76 50 31 67 96 RA 74 50 31 67 96
GS 69 73 50 62 97 GS 68 33 50 66 94 GS 71 69 50 60 96 GS 70 69 50 62 96
LR 59 33 38 50 97 LR 55 32 34 50 93 LR 66 33 40 50 97 LR 63 33 38 50 97
FG 7 5 3 3 50 FG 10 7 6 7 50 FG 5 4 4 3 50 FG 6 4 4 3 50
(a) Comprehensiveness (b) Diversity (c) Empowerment (d) Overall
Figure 10: The abstract QA results on Legal dataset.

107 107 107

106 107

105 106 106 106 106

Tree TKG RKG Tree TKG RKG Tree TKG RKG Tree TKG RKG Tree KG TKG
(a) Mix (b) MultihopSum (c) Agriculture (d) CS (e) Legal
Figure 11: Token cost of the graph building on abstract QA datasets.
Table 11: The average time and token costs on abstract QA datasets.

VanillaRAG RAPTOR GGraphRAG HLightRAG FastGraphRAG CheapRAG

Dataset
time token time token time token time token time token time token
Mix 18.7 s 4,114 35.5 s 4,921 72.2 s 10,922 22.6 s 5,687 20.9 s 4,779 27.3 s 11,720
MultihopSum 9.1 s 1,680 32.7 s 4,921 521.0 s 353,889 33.7 s 5,329 34.4 s 5,839 54.1 s 3,784
Agriculture 17.4 s 5,091 20.7 s 3,753 712.3 s 448,762 25.3 s 4,364 28.8 s 5,640 47.1 s 10,544
CS 17.8 s 4,884 32.7 s 4,921 442.0 s 322,327 51.4 s 4,908 28.2 s 5,692 48.8 s 17,699
Legal 26.2 s 2,943 59.8 s 3,573 231.2 s 129,969 31.1 s 4,441 34.0 s 5,411 34.8 s 14,586

Prompt token Completion token Opportunities:

O1. All existing graph-based RAG methods (both specific QA
106
and abstract QA) assume the setting of the external corpus is static.
105 What if the external knowledge source evolves over time? For
104 example, Wikipedia articles are constantly evolving, with frequent
Mix MultihopSum Agriculture CS Legal
dataset updates to reflect new information. Can we design graph-based
Figure 12: Token cost of index construction in abstract QA. RAG methods that efficiently and effectively adapt to such dynamic
chunks, and the vector search-based retrieval strategy is better than changes in external knowledge sources?
the rule-based (e.g., Entity operator) one. O2. The quality of a graph plays a key role in determining the
L4. Community reports provide a more effective high-level struc- effectiveness of graph-based RAG methods. However, evaluating
ture than summarized chunk clusters for abstract QA tasks, as they graph quality before actually handling a question remains a critical
better capture diversified topics and overarching themes within challenge that needs to be addressed. Existing graph construction
local modules of the corpus. methods consume a substantial number of tokens and often produce
L5. Original chunks are useful for all QA tasks, as they provide graphs with redundant entities or miss potential relationships, so
essential textual descriptions for augmenting or completing infor- designing a cost-efficient yet effective construction method is a
mation needed to answer questions. When attempting to design meaningful research direction.
new graph-based RAG methods, incorporating the relevant original
chunks is not a bad idea.
11
Com. Div. Emp. Overall Com. Div. Emp. Overall Com. Div. Emp. Overall Com. Div. Emp. Overall Com. Div. Emp. Overall

VR 68 56 55 62 VR 74 47 64 68 VR 84 54 66 75 VR 86 47 71 78 VR 82 45 62 68
RA 77 56 64 70 RA 68 48 64 67 RA 68 42 43 54 RA 66 19 39 49 RA 63 27 40 46
GS 73 65 66 69 GS 72 43 65 68 GS 81 38 62 71 GS 66 21 48 51 GS 64 26 38 46
LR 68 22 34 44 LR 66 45 56 64 LR 84 54 62 74 LR 80 34 54 66 LR 76 42 44 59
FG 97 81 92 93 FG 100 86 98 99 FG 99 79 94 94 FG 98 79 95 96 FG 97 90 94 97
(a) Mix (b) MultihopSum (c) Agriculture (d) CS (e) Legal
Figure 13: Comparison of our newly designed method on abstract QA datasets.

Specific Abstract • LLMs for graph data management. Recent advances in

LLMs have demonstrated superiority in natural language under-
Simple Complex
Token standing and reasoning, which offers opportunities to leverage
Cost
Chunk Token Reasoning LLMs to graph data management. These include using LLMs for
Split Cost Step
knowledge graph (KG) creation [81] and completion [86], as well as
Token Low Medium High for the extraction of causal graphs [4] from source texts. They also
Human High Low Less More
size
include advanced methods by employing KG to enhance the LLM
Vanilla
RAG
RAPTOR LGraphRAG RAPTOR LGraphRAG RAPTOR HLightRAG GGraphRAG reasoning. For instance, KD-CoT [79] retrieves facts from an exter-
(a) Specific QA (b) Abstract QA nal knowledge graph to guide the CoT performed by LLMs. RoG [51]
Figure 14: The taxonomy tree of RAG methods. proposes a planning-retrieval-reasoning framework that retrieves
O3. In many domains, the corpus is private (e.g., finance, legal, reasoning paths from KGs to guide LLMs conducting trust-worthy
and medical), and retrieving the relevant information from such reasoning. To capture graph structure, GNN-RAG [54] adopts a
corpus can reveal information about the knowledge source. Design- lightweight graph neural network to retrieve elements from KGs
ing a graph-based RAG method that incorporates local differential effectively. StructGPT [33] and ToG [72] treat LLMs as agents that
privacy is an interesting research problem. interact with KGs to find reasoning paths.
O4. In real applications, external knowledge sources are not • LLMs for database. Due to the wealth of developer experience
limited to text corpora; they may also include PDFs, HTML pages, captured in a vast array of database forum discussions, recent stud-
tables, and other structured or semi-structured data. A promising ies [5, 13, 20, 37, 43, 73, 93, 94] have begun leveraging LLMs to en-
future research direction is to explore graph-based RAG methods hance database performance. For instance, GPTuner [37] proposes
for heterogeneous knowledge sources. to enhance database knob tuning using LLM by leveraging domain
O5. The well-known graph database systems, such as Neo4j [59] knowledge to identify important knobs and coarsely initialize their
and Nebula [58], support transferring the corpus into a knowledge values for subsequent refinement. Besides, D-Bot [93] proposes an
graph via LLM. However, enabling these popular systems to sup- LLM-based database diagnosis system, which can retrieve relevant
port the various graph-based RAG methods presents an exciting knowledge chunks and tools, and use them to identify typical root
opportunity. More details are in our technical report [67]. causes accurately. The LLM-based data analysis systems and tools
are also been studied [2, 8, 44–48, 63].
9 RELATED WORKS To the best of our knowledge, our work is the first study that
provides a unified framework for all existing graph-based RAG
In this section, we review the related works, including Retrieval-
methods and compares existing solutions comprehensively via in-
Augmentation-Generation (RAG) frameworks, and LLMs for graph
depth experimental results.
data management. Besides, works on employing LLM for enhancing
database performance are also discussed.
10 CONCLUSIONS
• RAG frameworks. RAG has been proven to excel in many
tasks, including open-ended question answering [32, 71], program- In this paper, we provide an in-depth experimental evaluation and
ming context [5–7], SQL rewrite [43, 73], automatic DBMS configu- comparison of existing graph-based Retrieval-Augmented Genera-
ration debugging [70, 93], and data cleaning [56, 57, 66]. The naive tion (RAG) methods. We first provide a novel unified framework,
RAG technique relies on retrieving query-relevant information which can cover all the existing graph-based RAG methods, us-
from external knowledge bases to mitigate the “hallucination” of ing an abstraction of a few key operations. We then thoroughly
LLMs. Recently, most RAG approaches [12, 21, 22, 39, 64, 68, 81, 82] analyze and compare different graph-based RAG methods under
have adopted graph as the external knowledge to organize the in- our framework. We further systematically evaluate these methods
formation and relationships within documents, achieving improved from different angles using various datasets for both specific and
overall retrieval performance, which is extensively reviewed in abstract question-answering (QA) tasks, and also develop variations
this paper. In terms of open-source software, a variety of graph by combining existing techniques, which often outperform state-of-
databases are supported by both the LangChain [36] and LlamaIn- the-art methods. From extensive experimental results and analysis,
dex [50] libraries, while a more general class of graph-based RAG we have identified several important findings and analyzed the crit-
applications is also emerging, including systems that can create ical components that affect the performance. In addition, we have
and reason over knowledge graphs in both Neo4j [59] and Nebula- summarized the lessons learned and proposed practical research
Graph [58]. For more details, please refer to the recent surveys of opportunities that can facilitate future studies.
graph-based RAG methods [23, 64].
12
REFERENCES [25] Taher H Haveliwala. 2002. Topic-sensitive pagerank. In Proceedings of the 11th
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- international conference on World Wide Web. 517–526.
cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal [26] Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V Chawla, Thomas Laurent, Yann
Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 LeCun, Xavier Bresson, and Bryan Hooi. 2024. G-retriever: Retrieval-augmented
(2023). generation for textual graph understanding and question answering. arXiv
[2] Eric Anderson, Jonathan Fritz, Austin Lee, Bohou Li, Mark Lindblad, Henry preprint arXiv:2402.07630 (2024).
Lindeman, Alex Meyer, Parth Parmar, Tanvi Ranade, Mehul A Shah, et al. 2024. [27] Yucheng Hu and Yuxing Lu. 2024. Rag and rau: A survey on retrieval-augmented
The Design of an LLM-powered Unstructured Analytics System. arXiv preprint language model in natural language processing. arXiv preprint arXiv:2404.19543
arXiv:2409.00847 (2024). (2024).
[3] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. [28] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Hao-
2023. Self-rag: Learning to retrieve, generate, and critique through self-reflection. tian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al.
arXiv preprint arXiv:2310.11511 (2023). 2023. A survey on hallucination in large language models: Principles, taxonomy,
[4] Taiyu Ban, Lyvzhou Chen, Xiangyu Wang, and Huanhuan Chen. 2023. From challenges, and open questions. arXiv preprint arXiv:2311.05232 (2023).
query tools to causal architects: Harnessing large language models for advanced [29] Yizheng Huang and Jimmy Huang. 2024. A Survey on Retrieval-Augmented Text
causal discovery from data. arXiv preprint arXiv:2306.16902 (2023). Generation for Large Language Models. arXiv preprint arXiv:2404.10981 (2024).
[5] Sibei Chen, Ju Fan, Bin Wu, Nan Tang, Chao Deng, Pengyi Wang, Ye Li, Jian Tan, [30] Yiqian Huang, Shiqi Zhang, and Xiaokui Xiao. 2025. KET-RAG: A Cost-
Feifei Li, Jingren Zhou, et al. 2024. Automatic Database Configuration Debugging Efficient Multi-Granular Indexing Framework for Graph-RAG. arXiv preprint
using Retrieval-Augmented Language Models. arXiv preprint arXiv:2412.07548 arXiv:2502.09304 (2025).
(2024). [31] huawei. 2019. Ascend GPU. https://e.huawei.com/ph/products/computing/
[6] Sibei Chen, Yeye He, Weiwei Cui, Ju Fan, Song Ge, Haidong Zhang, Dongmei ascend.
Zhang, and Surajit Chaudhuri. 2024. Auto-Formula: Recommend Formulas in [32] Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C Park.
Spreadsheets using Contrastive Learning for Table Representations. Proceedings 2024. Adaptive-rag: Learning to adapt retrieval-augmented large language mod-
of the ACM on Management of Data 2, 3 (2024), 1–27. els through question complexity. arXiv preprint arXiv:2403.14403 (2024).
[7] Sibei Chen, Nan Tang, Ju Fan, Xuemi Yan, Chengliang Chai, Guoliang Li, and Xi- [33] Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong
aoyong Du. 2023. Haipipe: Combining human-generated and machine-generated Wen. 2023. StructGPT: A General Framework for Large Language Model to
pipelines for data preparation. Proceedings of the ACM on Management of Data 1, Reason over Structured Data. In Proceedings of the 2023 Conference on Empirical
1 (2023), 1–26. Methods in Natural Language Processing. 9237–9251.
[8] Zui Chen, Lei Cao, Sam Madden, Tim Kraska, Zeyuan Shang, Ju Fan, Nan Tang, [34] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav
Zihui Gu, Chunwei Liu, and Michael Cafarella. 2023. SEED: Domain-Specific Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi,
Data Curation With Large Language Models. arXiv e-prints (2023), arXiv–2310. Hanna Moazam, et al. 2023. Dspy: Compiling declarative language model calls
[9] Jacob Devlin. 2018. Bert: Pre-training of deep bidirectional transformers for into self-improving pipelines. arXiv preprint arXiv:2310.03714 (2023).
language understanding. arXiv preprint arXiv:1810.04805 (2018). [35] Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage
[10] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, search via contextualized late interaction over bert. In Proceedings of the 43rd
Jingjing Xu, Zhiyong Wu, Tianyu Liu, et al. 2022. A survey on in-context learning. International ACM SIGIR conference on research and development in Information
arXiv preprint arXiv:2301.00234 (2022). Retrieval. 39–48.
[11] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad [36] Langchian. 2023. Langchian. https://python.langchain.com/docs/additional_
Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, resources/arxiv_references/.
et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024). [37] Jiale Lao, Yibo Wang, Yufei Li, Jianping Wang, Yunjia Zhang, Zhiyuan Cheng,
[12] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Wanghu Chen, Mingjie Tang, and Jianguo Wang. 2024. Gptuner: A manual-
Mody, Steven Truitt, and Jonathan Larson. 2024. From local to global: A graph reading database tuning system via gpt-guided bayesian optimization. Proceed-
rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130 ings of the VLDB Endowment 17, 8 (2024), 1939–1952.
(2024). [38] Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, and Nan Tang. 2024. The
[13] Ju Fan, Zihui Gu, Songyue Zhang, Yuxin Zhang, Zui Chen, Lei Cao, Guoliang Li, Dawn of Natural Language to SQL: Are We Fully Ready? arXiv preprint
Samuel Madden, Xiaoyong Du, and Nan Tang. 2024. Combining small language arXiv:2406.01265 (2024).
models and large language models for zero-shot nl2sql. Proceedings of the VLDB [39] Dawei Li, Shu Yang, Zhen Tan, Jae Young Baik, Sukwon Yun, Joseph Lee, Aaron
Endowment 17, 11 (2024), 2750–2763. Chacko, Bojian Hou, Duy Duong-Tran, Ying Ding, et al. 2024. DALK: Dynamic
[14] Meihao Fan, Xiaoyue Han, Ju Fan, Chengliang Chai, Nan Tang, Guoliang Li, and Co-Augmentation of LLMs and KG to answer Alzheimer’s Disease Questions
Xiaoyong Du. 2024. Cost-effective in-context learning for entity resolution: A with Scientific Literature. arXiv preprint arXiv:2405.04819 (2024).
design space exploration. In 2024 IEEE 40th International Conference on Data [40] Lan Li, Liri Fang, and Vetle I Torvik. 2024. AutoDCWorkflow: LLM-based
Engineering (ICDE). IEEE, 3696–3709. Data Cleaning Workflow Auto-Generation and Benchmark. arXiv preprint
[15] Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, arXiv:2412.06724 (2024).
Tat-Seng Chua, and Qing Li. 2024. A survey on rag meeting llms: Towards [41] Yinheng Li, Shaofei Wang, Han Ding, and Hang Chen. 2023. Large language
retrieval-augmented large language models. In Proceedings of the 30th ACM models in finance: A survey. In Proceedings of the fourth ACM international
SIGKDD Conference on Knowledge Discovery and Data Mining. 6491–6501. conference on AI in finance. 374–382.
[16] FastGraphRAG. 2024. FastGraphRAG. https://github.com/circlemind-ai/fast- [42] Zhaodonghui Li, Haitao Yuan, Huiming Wang, Gao Cong, and Lidong Bing. 2024.
graphrag. LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for
[17] Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. 2023. Enabling large Boosting Query Efficiency. arXiv preprint arXiv:2404.12872 (2024).
language models to generate text with citations. arXiv preprint arXiv:2305.14627 [43] Zhaodonghui Li, Haitao Yuan, Huiming Wang, Gao Cong, and Lidong Bing. 2025.
(2023). LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for
[18] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Boosting Query Efficiency. Proceedings of the VLDB Endowment 1, 18 (2025),
Jiawei Sun, and Haofen Wang. 2023. Retrieval-augmented generation for large 53–65.
language models: A survey. arXiv preprint arXiv:2312.10997 (2023). [44] Chen Liang, Donghua Yang, Zheng Liang, Zhiyu Liang, Tianle Zhang, Boyu Xiao,
[19] Aashish Ghimire, James Prather, and John Edwards. 2024. Generative AI in Yuqing Yang, Wenqi Wang, and Hongzhi Wang. 2025. Revisiting Data Analysis
Education: A Study of Educators’ Awareness, Sentiments, and Influencing Factors. with Pre-trained Foundation Models. arXiv preprint arXiv:2501.01631 (2025).
arXiv preprint arXiv:2403.15586 (2024). [45] Yiming Lin, Mawil Hasan, Rohan Kosalge, Alvin Cheung, and Aditya G
[20] Victor Giannankouris and Immanuel Trummer. 2024. {\lambda } -Tune: Har- Parameswaran. 2025. TWIX: Automatically Reconstructing Structured Data
nessing Large Language Models for Automated Database System Tuning. arXiv from Templatized Documents. arXiv preprint arXiv:2501.06659 (2025).
preprint arXiv:2411.03500 (2024). [46] Yiming Lin, Madelon Hulsebos, Ruiying Ma, Shreya Shankar, Sepanta Zeigham,
[21] Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. 2024. LightRAG: Aditya G Parameswaran, and Eugene Wu. 2024. Towards Accurate and Ef-
Simple and Fast Retrieval-Augmented Generation. arXiv e-prints (2024), arXiv– ficient Document Analytics with Large Language Models. arXiv preprint
2410. arXiv:2405.04674 (2024).
[22] Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. [47] Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baille Chen, Zui
2024. HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Chen, Michael Franklin, Tim Kraska, Samuel Madden, and Gerardo Vitagliano.
Language Models. arXiv preprint arXiv:2405.14831 (2024). 2024. A Declarative System for Optimizing AI Workloads. arXiv preprint
[23] Haoyu Han, Yu Wang, Harry Shomer, Kai Guo, Jiayuan Ding, Yongjia Lei, Ma- arXiv:2405.14696 (2024).
hantesh Halappanavar, Ryan A Rossi, Subhabrata Mukherjee, Xianfeng Tang, [48] Chunwei Liu, Gerardo Vitagliano, Brandon Rose, Matt Prinz, David Andrew
et al. 2024. Retrieval-augmented generation with graphs (graphrag). arXiv Samson, and Michael Cafarella. 2025. PalimpChat: Declarative and Interactive
preprint arXiv:2501.00309 (2024). AI analytics. arXiv preprint arXiv:2502.03368 (2025).
[24] Jiawei Han, Jian Pei, and Hanghang Tong. 2022. Data mining: concepts and [49] Lei Liu, Xiaoyan Yang, Junchi Lei, Xiaoyang Liu, Yue Shen, Zhiqiang Zhang,
techniques. Morgan kaufmann. Peng Wei, Jinjie Gu, Zhixuan Chu, Zhan Qin, et al. 2024. A Survey on Medical
Large Language Models: Technology, Application, Trustworthiness, and Future
13
Directions. arXiv preprint arXiv:2406.03712 (2024). [73] Zhaoyan Sun, Xuanhe Zhou, and Guoliang Li. 2024. R-Bot: An LLM-based Query
[50] llamaindex. 2023. llamaindex. https://www.llamaindex.ai/. Rewrite System. arXiv preprint arXiv:2412.01661 (2024).
[51] Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. 2023. Reasoning [74] Yixuan Tang and Yi Yang. 2024. Multihop-rag: Benchmarking retrieval-
on graphs: Faithful and interpretable large language model reasoning. arXiv augmented generation for multi-hop queries. arXiv preprint arXiv:2401.15391
preprint arXiv:2310.01061 (2023). (2024).
[52] Yu A Malkov and Dmitry A Yashunin. 2018. Efficient and robust approximate [75] Vincent A Traag, Ludo Waltman, and Nees Jan Van Eck. 2019. From Louvain to
nearest neighbor search using hierarchical navigable small world graphs. IEEE Leiden: guaranteeing well-connected communities. Scientific reports 9, 1 (2019),
transactions on pattern analysis and machine intelligence 42, 4 (2018), 824–836. 1–12.
[53] Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and [76] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal.
Hannaneh Hajishirzi. 2022. When not to trust language models: Investigat- 2022. MuSiQue: Multihop Questions via Single-hop Question Composition.
ing effectiveness of parametric and non-parametric memories. arXiv preprint Transactions of the Association for Computational Linguistics 10 (2022), 539–554.
arXiv:2212.10511 (2022). [77] Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu
[54] Costas Mavromatis and George Karypis. 2024. GNN-RAG: Graph Neural Retrieval Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. 2023. Is chatgpt a good nlg evaluator? a
for Large Language Model Reasoning. arXiv preprint arXiv:2405.20139 (2024). preliminary study. arXiv preprint arXiv:2303.04048 (2023).
[55] Multi-Linguality Multi-Functionality Multi-Granularity. 2024. M3-Embedding: [78] Jinqiang Wang, Huansheng Ning, Yi Peng, Qikai Wei, Daniel Tesfai, Wenwei
Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Mao, Tao Zhu, and Runhe Huang. 2024. A Survey on Large Language Models
Through Self-Knowledge Distillation. (2024). from General Purpose to Medical Applications: Datasets, Methodologies, and
[56] Zan Ahmad Naeem, Mohammad Shahmeer Ahmad, Mohamed Eltabakh, Mourad Evaluations. arXiv preprint arXiv:2406.10303 (2024).
Ouzzani, and Nan Tang. 2024. RetClean: Retrieval-Based Data Cleaning Using [79] Keheng Wang, Feiyu Duan, Sirui Wang, Peiguang Li, Yunsen Xian, Chuantao
LLMs and Data Lakes. Proceedings of the VLDB Endowment 17, 12 (2024), 4421– Yin, Wenge Rong, and Zhang Xiong. 2023. Knowledge-driven cot: Exploring
4424. faithful reasoning in llms for knowledge-intensive question answering. arXiv
[57] Avanika Narayan, Ines Chami, Laurel Orr, and Christopher Ré. 2022. Can Foun- preprint arXiv:2308.13259 (2023).
dation Models Wrangle Your Data? Proceedings of the VLDB Endowment 16, 4 [80] Shen Wang, Tianlong Xu, Hang Li, Chaoli Zhang, Joleen Liang, Jiliang Tang,
(2022), 738–746. Philip S Yu, and Qingsong Wen. 2024. Large language models for education: A
[58] nebula. 2010. nebula. https://www.nebula-graph.io/. survey and outlook. arXiv preprint arXiv:2403.18105 (2024).
[59] neo4j. 2006. neo4j. https://neo4j.com/. [81] Yu Wang, Nedim Lipka, Ryan A Rossi, Alexa Siu, Ruiyi Zhang, and Tyler Derr.
[60] Yuqi Nie, Yaxuan Kong, Xiaowen Dong, John M Mulvey, H Vincent Poor, Qing- 2024. Knowledge graph prompting for multi-document question answering. In
song Wen, and Stefan Zohren. 2024. A Survey of Large Language Models Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 19206–19214.
for Financial Applications: Progress, Prospects and Challenges. arXiv preprint [82] Junde Wu, Jiayuan Zhu, Yunli Qi, Jingkun Chen, Min Xu, Filippo Menolascina,
arXiv:2406.11903 (2024). and Vicente Grau. 2024. Medical graph rag: Towards safe medical large language
[61] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela model via graph retrieval-augmented generation. arXiv preprint arXiv:2408.04187
Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. (2024).
Training language models to follow instructions with human feedback. Advances [83] Shangyu Wu, Ying Xiong, Yufei Cui, Haolun Wu, Can Chen, Ye Yuan, Lianming
in neural information processing systems 35 (2022), 27730–27744. Huang, Xue Liu, Tei-Wei Kuo, Nan Guan, et al. 2024. Retrieval-augmented gener-
[62] Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, ation for natural language processing: A survey. arXiv preprint arXiv:2407.13193
Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, et al. (2024).
2021. QuALITY: Question answering with long input texts, yes! arXiv preprint [84] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu,
arXiv:2112.08608 (2021). Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. 2024. Qwen2. 5
[63] Liana Patel, Siddharth Jha, Carlos Guestrin, and Matei Zaharia. 2024. Lotus: Technical Report. arXiv preprint arXiv:2412.15115 (2024).
Enabling semantic queries with llms over tables of unstructured and structured [85] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan
data. arXiv preprint arXiv:2407.11418 (2024). Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for di-
[64] Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan verse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600
Zhang, and Siliang Tang. 2024. Graph retrieval-augmented generation: A survey. (2018).
arXiv preprint arXiv:2408.08921 (2024). [86] Liang Yao, Jiazhen Peng, Chengsheng Mao, and Yuan Luo. 2023. Explor-
[65] Hongjin Qian, Peitian Zhang, Zheng Liu, Kelong Mao, and Zhicheng Dou. 2024. ing large language models for knowledge graph completion. arXiv preprint
Memorag: Moving towards next-gen rag via memory-inspired knowledge dis- arXiv:2308.13916 (2023).
covery. arXiv preprint arXiv:2409.05591 (2024). [87] Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, and Zhaofeng Liu. 2024.
[66] Yichen Qian, Yongyi He, Rong Zhu, Jintao Huang, Zhijian Ma, Haibin Wang, Evaluation of Retrieval-Augmented Generation: A Survey. arXiv preprint
Yaohua Wang, Xiuyu Sun, Defu Lian, Bolin Ding, et al. 2024. UniDM: A Unified arXiv:2405.07437 (2024).
Framework for Data Manipulation with Large Language Models. Proceedings of [88] Xuanwang Zhang, Yunze Song, Yidong Wang, Shuyun Tang, Xinfeng Li, Zhen-
Machine Learning and Systems 6 (2024), 465–482. gran Zeng, Zhen Wu, Wei Ye, Wenyuan Xu, Yue Zhang, et al. 2024. Raglab:
[67] The Technique Report. 2025. In-depth Analysis of Graph-based RAG in a Unified A modular and research-oriented unified framework for retrieval-augmented
Framework (technical report). https://github.com/JayLZhou/GraphRAG/blob/ generation. arXiv preprint arXiv:2408.11381 (2024).
master/VLDB2025_GraphRAG.pdf. [89] Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng,
[68] Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. 2024. Retrieval-augmented
Christopher D Manning. 2024. Raptor: Recursive abstractive processing for generation for ai-generated content: A survey. arXiv preprint arXiv:2402.19473
tree-organized retrieval. arXiv preprint arXiv:2401.18059 (2024). (2024).
[69] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, [90] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu,
Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2024. Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging
Toolformer: Language models can teach themselves to use tools. Advances in llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information
Neural Information Processing Systems 36 (2024). Processing Systems 36 (2023), 46595–46623.
[70] Vikramank Singh, Kapil Eknath Vaidya, Vinayshekhar Bannihatti Kumar, Sopan [91] Yanxin Zheng, Wensheng Gan, Zefeng Chen, Zhenlian Qi, Qian Liang, and
Khosla, Murali Narayanaswamy, Rashmi Gangadharaiah, and Tim Kraska. 2024. Philip S Yu. 2024. Large language models for medicine: a survey. International
Panda: Performance debugging for databases using LLM agents. (2024). Journal of Machine Learning and Cybernetics (2024), 1–26.
[71] Shamane Siriwardhana, Rivindu Weerasekera, Elliott Wen, Tharindu Kalu- [92] Yihang Zheng, Bo Li, Zhenghao Lin, Yi Luo, Xuanhe Zhou, Chen Lin, Jinsong
arachchi, Rajib Rana, and Suranga Nanayakkara. 2023. Improving the domain Su, Guoliang Li, and Shifu Li. 2024. Revolutionizing Database Q&A with Large
adaptation of retrieval augmented generation (RAG) models for open domain Language Models: Comprehensive Benchmark and Evaluation. arXiv preprint
question answering. Transactions of the Association for Computational Linguistics arXiv:2409.04475 (2024).
11 (2023), 1–17. [93] Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, Zhiyuan Liu, Weize Chen, Jianming
[72] Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Wu, Jiesi Liu, Ruohang Feng, and Guoyang Zeng. 2024. D-bot: Database diagnosis
Gong, Lionel Ni, Heung-Yeung Shum, and Jian Guo. 2024. Think-on-Graph: Deep system using large language models. Proceedings of the VLDB Endowment 17, 10
and Responsible Reasoning of Large Language Model on Knowledge Graph. In (2024), 2514–2527.
The Twelfth International Conference on Learning Representations. [94] Xuanhe Zhou, Zhaoyan Sun, and Guoliang Li. 2024. Db-gpt: Large language
model meets database. Data Science and Engineering 9, 1 (2024), 102–111.

14
A ADDITIONAL EXPERIMENTS graph has not shown a strong correlation with the final performance
of graph-based RAG methods. This observation motivates us to
A.1 More analysis
explore a method for evaluating the quality of the constructed graph
In this Section, we present a more detailed analysis of graph-based before using it for LLM-based question answering.
RAG methods from the following angles.
Exp.1. Effect of the chunk size. Recall that our study in- A.2 Additional results on token costs.
cludes some datasets that are pre-split by the export annotator. To
As shown in Figure 16, we present the proportions of token costs
investigate this impact, we re-split the corpus into multiple chunks
for the additional eight datasets, which exhibit trends similar to
based on token size for these datasets instead of using their original
those observed in HotpotQA and ALCE.
chunks. Here, we create three new datasets from HotpotQA, PopQA,
and ALCE, named HotpotAll, PopAll, and ALCEAll, respectively.
A.3 Additional results of generation token
For each dataset, we use Original to denote its original ver-
sion and New chunk to denote the version after re-splitting. We costs.
report the results of graph-based RAG methods on both the original As shown in Figure 17, we present the average token costs for
and new version datasets in Figure 15, we can see that: (1) The prompt tokens and completion tokens across all questions in all
performance of all methods declines, mainly because rule-based specific QA datasets. We can observe that the running time of
chunk splitting (i.e., by token size) fails to provide concise infor- each method is highly proportional to the completion token costs,
mation as effectively as expert-annotated chunks. (2) Graph-based which aligns with the computational paradigm of the Transformer
methods, especially those relying on TKG and RKG, are more sen- architecture.
sitive to chunk quality. This is because the graphs they construct
encapsulate richer information, and coarse-grained chunk splitting A.4 Evaluation metrics
introduces potential noise within each chunk. Such noise can lead This section outlines the metrics used for evaluation.
to inaccurate extraction of entities or relationships and their cor- • Metrics for specific QA Tasks. We use accuracy as the eval-
responding descriptions, significantly degrading the performance uation metric, based on whether the gold answers appear in the
of these methods. (3) As for token costs, all methods that retrieve model’s generated outputs, rather than requiring an exact match,
chunks incur a significant increase due to the larger chunk size in following the approach in [3, 53, 69]. This choice is motivated by
New chunk compared to Original, while other methods remain the uncontrollable nature of LLM outputs, which often makes it
stable. These findings highlight that chunk segmentation quality is difficult to achieve exact matches with standard answers. Similarly,
crucial for the overall performance of all RAG methods. we prefer recall over precision as it better reflects the accuracy of
Exp.2. Effect of the base model. In this experiment, we the generated responses.
evaluate the effect of the LLM backbone by replacing Llama-3-8B • Metrics for abstract QA Tasks. Building on existing work,
with Llama-3-70B [11] on the MultihopQA and ALCEAll datasets. we use an LLM to generate abstract questions, as shown in Figure
We make the following observations: (1) All methods achieve per- 18. Defining ground truth for abstract questions, especially those
formance improvements when equipped with a more powerful involving complex high-level semantics, presents significant chal-
backbone LLM (i.e., Llama-3-70B) compared to Llama-3-8B. For lenges. To address this, we adopt an LLM-based multi-dimensional
example, on the ALCEALL dataset, replacing the LLM from 8B to comparison method, inspired by [12, 21], which evaluates compre-
70B, ZeroShot improves STRREC, STREM, and STRHIT by 102.1%, hensiveness, diversity, empowerment, and overall quality. We use a
94.2%, and 101.7%, respectively. (2) Our proposed method, VGraphRAG, robust LLM, specifically GPT-4o, to rank each baseline in compari-
consistently achieves the best performance regardless of the LLM son to our method. The evaluation prompt used is shown in Figure
backbone used. For example, on the MultihopQA dataset, VGraphRAG 19.
with Llama-3-70B achieves 7.20% and 12.13% improvements over
RAPTOR with Llama-3-70B in terms of Accuracy and Recall, respec- A.5 Implementation details
tively. (3) Under both Llama-3-70B and Llama-3-8B, while all meth- In this subsection, we present more details about our system imple-
ods show improved performance, they exhibit similar trends to mentation. Specifically, we use HNSW [52] from Llama-index [50]
those observed with Llama-3-8B. For instance, RAPTOR remains (a well-known open-source project) as the default vector database
the best-performing method among all existing graph-based RAG for efficient vector search. In addition, for each method, we optimize
approaches, regardless of the LLM used. efficiency by batching or parallelizing operations such as encoding
Exp.3. The size of graph. For each dataset, we report the size nodes or chunks, and computing personalized page rank, among
of five types of graphs in Table 13. We observe that PG is typically others, during the retrieval stage.
denser than other types of graphs, as they connect nodes based on
shared entity relationships, where each node represents a chunk B MORE DISCUSSIONS.
in PG. In fact, the probability of two chunks sharing at least a few
entities is quite high, leading to a high graph density (i.e., the ratio B.1 New operators
of edges to nodes), sometimes approaching a clique (fully connected Here, we introduce the operators that are not used in existing
graph). In contrast, KG, TKG, and RKG are much sparser since they graph-based RAG methods but are employed in our newly designed
rely entirely on LLMs to extract nodes and edges. This sparsity state-of-the-art methods.
is primarily due to the relatively short and incomplete outputs Chunk type. We include a new operator VDB of chunk type,
typically generated by LLMs, which miss considerable potential which is used in our VGraphRAG method. This operator is the same
node-edge pairs. Interestingly, the size or density of the constructed as the chunk retrieval strategy of VanillaRAG.
15
Original New chunk Original New chunk Original New chunk
60 80 80
50 60
60
40 40
40
30 20
20 20 0
G G R G G G G G G G R G G G G G AG AG OR AG AG AG AG AG
RA RA TO RA RA RA RA RA RA RA TO RA RA RA RA RA aR oR PT hR hR tR tR tR
la po RAP aph aph ght ght ght la po RAP aph aph ght ght ght ll ipp RA rap rap igh igh igh
n il Hip G r Gr Li Li Li nil Hip G r Gr Li Li Li i
Va st L L G H Va st L L G H V an H
st
G LG LL GL HL
Fa Fa Fa
(a) HotpotQA (Accuracy) (b) HotpotQA (Recall) (c) PopQA (Accuracy)

Original New chunk Original New chunk Original New chunk

30 40

20 30 10

10 20
5
0 10
AG AG OR AG AG AG AG AG AG AG OR AG AG AG AG AG AG AG OR AG AG AG AG AG
aR oR PT hR hR tR tR tR aR oR PT hR hR tR tR tR aR oR PT hR hR tR tR tR
i ll ipp RA rap rap igh igh igh ill ipp RA rap rap igh igh igh i ll ipp RA rap rap igh igh igh
n H G LG LL GL HL n H G LG LL GL HL n H G LG LL GL HL
Va st Va st Va st
Fa Fa Fa
(d) PopQA (Recall) (e) ALCE (STRREC) (f) ALCE (STREM)

Original New chunk Original New chunk Original New chunk

70 104 104
60
50
40 103 103
30
G G R G G G G G G G R G G G G G G G R G G G G G
RA RA TO RA RA RA RA RA RA RA TO RA RA RA RA RA RA RA TO RA RA RA RA RA
l la ppo RAP aph aph ght ght ght l la ppo RAP aph aph ght ght ght l la ppo RAP aph aph ght ght ght
n i Hi G r Gr Li Li Li ni Hi G r Gr Li Li Li n i Hi G r Gr Li Li Li
Va st L L G H Va st L L G H Va st L L G H
Fa Fa Fa
(g) ALCE (STRHIT) (h) HotpotQA (Token) (i) PopQA (Token)
Figure 15: Effect of chunk quality on the performance of specific QA tasks.

Community type. We also include a new operator VDB of com- L8. Methods designed for knowledge reasoning tasks, such as
munity type, retrieving the top-𝑘 communities by vector searching, DALK, ToG, and G-retriever, do not perform well on document-
where the embedding of each community is generated by encoding based QA tasks. This is because these methods are better suited
its community report. for extracting reasoning rules or paths from well-constructed KGs.
However, when KGs are built from raw text corpora, they may not
B.2 More Lessons and Opportunities accurately capture the correct reasoning rules, leading to subopti-
In this section, we show the more lessons and opportunities learned mal performance in document-based QA tasks.
from our study. L9. The effectiveness of RAG methods is highly impacted by the
Lessons relevance of the retrieved elements to the given question. That is,
L6. For large datasets, both versions of the GraphRAG methods if the retrieved information is irrelevant or noisy, it may degrade
incur unacceptable token costs, as they contain a large number the LLM’s performance. When designing new graph-based RAG
of communities, leading to high costs for generating community methods, it is crucial to evaluate whether the retrieval strategy
reports. effectively retrieves relevant information for the given question.
L7. Regardless of whether the questions are specific or abstract, Opportunities
they all rely on an external corpus (i.e., documents). For such ques- O6. An interesting future research direction is to explore more
tions, merely using graph-structure information (nodes, edges, or graph-based RAG applications. For example, applying graph-based
subgraphs) is insufficient to achieve good performance. RAG to scientific literature retrieval can help researchers efficiently
extract relevant studies and discover hidden relationships between

16
Table 12: The specific QA performance comparison of graph-based RAG methods with different LLM backbone models.

MultihopQA ALCEAll
Method LLM backbone
Accuracy Recall STRREC STREM STRHIT
Llama-3-8B 49.022 34.256 15.454 3.692 30.696
Zeroshot
Llama-3-70B 55.908 52.987 31.234 7.170 61.920
Llama-3-8B 50.626 36.918 29.334 8.228 56.329
VanillaRAG
Llama-3-70B 56.768 49.127 34.961 9.810 68.038
Llama-3-8B 53.760 47.671 21.633 5.696 41.561
HippoRAG
Llama-3-70B 57.277 57.736 32.904 9.916 32.534
Llama-3-8B 56.064 44.832 34.044 10.971 62.342
RAPTOR
Llama-3-70B 63.028 61.042 37.286 12.236 68.671
Llama-3-8B 52.895 44.278 27.258 7.490 53.376
FastGraphRAG
Llama-3-70B 54.069 55.787 35.658 12.236 65.612
Llama-3-8B 55.360 50.429 27.785 8.017 52.954
LGraphRAG
Llama-3-70B 58.060 55.390 34.256 10.232 66.561
Llama-3-8B 59.664 50.893 35.213 11.603 64.030
VGraphRAG
Llama-3-70B 67.567 68.445 37.576 12.447 69.198

Table 13: The size of each graph type across all datasets.

Tree PG KG TKG RKG

Dataset # of vertices # of edges # of vertices # of edges # of vertices # of edges # of vertices # of edges # of vertices # of edges
MultihopQA 2,053 2,052 1,658 564,446 35,953 37,173 12,737 10,063 18,227 12,441
Quality 1,862 1,861 1,518 717,468 28,882 30,580 10,884 8,992 13,836 9,044
PopQA 38,325 38,324 32,157 3,085,232 260,202 336,676 179,680 205,199 188,946 215,623
MusiqueQA 33,216 33,215 29,898 3,704,201 228,914 295,629 153,392 183,703 149,125 188,149
HotpotQA 73,891 73,890 66,559 13,886,807 511,705 725,521 291,873 401,693 324,284 436,362
ALCE 99,303 99,302 89,376 22,109,289 610,925 918,499 306,821 475,018 353,989 526,486
Mix 719 718 1,778 1,225,815 28,793 34,693 7,464 2,819 7,701 3,336
MultihopSum 2,053 2,052 N/A N/A N/A N/A 12,737 10,063 18,227 12,441
Agriculture 2,156 2,155 N/A N/A N/A N/A 15,772 7,333 17,793 12,600
CS 2,244 2,243 N/A N/A N/A N/A 10,175 6,560 12,340 8,692
Legal 5,380 5,379 N/A N/A N/A N/A 15,034 10,920 16,565 17,633

concepts. Another potential application is legal document analy- B.3 Benefit of our framework
sis, where graph structures can capture case precedents and legal Our framework offers exceptional flexibility by enabling the combi-
interpretations to assist in legal reasoning. nation of different methods at various stages. This modular design
O7. The users may request multiple questions simultaneously, allows different algorithms to be seamlessly integrated, ensuring
but existing graph-based RAG methods process them sequentially. that each stage—such as graph building, and retrieval&generation—can
Hence, a promising future direction is to explore efficient scheduling be independently optimized and recombined. For example, methods
strategies that optimize multi-query handling. This could involve like HippoRAG, which typically rely on KG, can easily be adapted
batching similar questions or parallelizing retrieval. to use RKG instead, based on specific domain needs.
O8. Different types of questions require different levels of infor- In addition, our operator design allows for simple modifica-
mation, yet all existing graph-based RAG methods rely on fixed, tions—often just a few lines of code—to create entirely new graph-
predefined rules. How to design an adaptive mechanism that can based RAG methods. By adjusting the retrieval stage or swapping
address these varying needs remains an open question. components, researchers can quickly test and implement new strate-
O9. Existing methods do not fully leverage the graph structure; gies, significantly accelerating the development cycle of retrieval-
they typically rely on simple graph patterns (e.g., nodes, edges, or enhanced models.
𝑘-hop paths). Although GraphRAG adopts a hierarchical community The modular nature of our framework is further reinforced by the
structure (detecting by the Leiden algorithm), this approach does use of retrieval elements (such as node, relationship, or subgraph)
not consider node attributes, potentially compromising the quality coupled with retrieval operators. This combination enables us to
of the communities. That is, determining which graph structures easily design new operators tailored to specific tasks. For example,
are superior remains an open question. by modifying the strategy for retrieving given elements, we can
create customized operators that suit different application scenarios.

17
RKG RKG RKG RKG

TKG TKG TKG TKG

KG KG KG KG

Tree Tree Tree Tree

0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100
Token usage proportion (%) Token usage proportion (%) Token usage proportion (%) Token usage proportion (%)
Prompt token Completion token Prompt token Completion token Prompt token Completion token Prompt token Completion token
(a) MultihopQA (b) Quality (c) PopQA (d) MusiqueQA

RKG RKG RKG RKG

TKG TKG TKG TKG

Tree Tree Tree Tree

By systematically evaluating the effectiveness of various retrieval Datasets: Our study did not include domain-specific knowledge
components under our unified framework, we can identify the datasets, which are crucial for certain applications. Incorporating
most efficient combinations of graph construction, indexing, and such datasets could provide more nuanced insights and allow for
retrieval strategies. This approach enables us to optimize retrieval a better evaluation of how these methods perform in specialized
performance across a range of use cases, allowing for both the settings. (3) Resource Constraints: Due to resource limitations, the
enhancement of existing methods and the creation of novel, state- largest model we utilized is Llama-3-70B, and the entire paper con-
of-the-art techniques. sumes nearly 10 billion tokens. Running larger models, such as
Finally, our framework contributes to the broader research com- GPT-4o (175B parameters or beyond), would incur significantly
munity by providing a standardized methodology to assess graph- higher costs, potentially reaching several hundred thousand dollars
based RAG approaches. The introduction of a unified evaluation depending on usage. While we admit that introducing more power-
testbed ensures reproducibility, promotes fair a benchmark, and ful models could further enhance performance, the 70B model is
facilitates future innovations in RAG-based LLM applications. already a strong choice, balancing performance and resource feasi-
bility. That is to say, exploring the potential of even larger models
B.4 Limitations in future work could offer valuable insights and further refine the
In our empirical study, we put considerable effort into evaluating the findings. (4) Prompt Sensitivity: The performance of each method is
performance of existing graph-based RAG methods from various highly affected by its prompt design. Due to resource limitations,
angles. However, our study still has some limitations, primarily due we did not conduct prompt ablation studies and instead used the
to resource constraints. (1) Token Length Limitation: The primary available prompts from the respective papers. Actually, a fairer
experiments are conducted using Llama-3-8B with a token window comparison would mitigate this impact by using prompt tuning
size of 8k. This limitation on token length restricted the model’s tools, such as DSPy [34], to customize the prompts and optimize
ability to process longer input sequences, which could potentially the performance of each method.
impact the overall performance of the methods, particularly in tasks These limitations highlight areas for future exploration, and
that require extensive context. Larger models with larger token overcoming these constraints would enable a more thorough and
windows could better capture long-range dependencies and deliver reliable evaluation of graph-based RAG methods, strengthening the
more robust results. This constraint is a significant factor that may findings and advancing the research.
affect the generalizability of our findings. (2) Limited Knowledge
C PROMPT

18
100
101
102
103
104
102
103
104
102
103
104
Ze Ze Ze
ro ro ro
Va sh Va sh Va sh
ni ot ni ot ni ot
ll ll ll
G- aR G- aR G- a RA
re AG re AG re G
tr tr tr
ie N/A ie ie
ve ve ve
r N/A r r
N/A
To To To
G N/A G G
N/A
KG KG KG
P N/A P P
DA DA DA
LL LK LL LK LL LK

Prompt token
Prompt token
Prompt token

ig ig ig
ht ht ht
RA R R
GL G GL AG GL AG

(e) HotpotQA
R R
(a) MultihopQA
RA
HL
ig G HL
ig
AG HL
ig
AG
Fa ht Fa ht Fa ht
st RA st RA st RA
Gr G Gr G Gr G
ap ap ap
hR hR hR
Hi AG Hi AG Hi AG
pp pp pp

Completion token
Completion token
Completion token

oR oR oR
LG AG LG AG LG AG
ra ra ra
ph ph ph
RA RA RA
G G G
RA RA RA
PT PT PT
OR OR OR

19
102
103
104
102
103
104
102
103
104

Ze Ze Ze
ro r os ro
Va sh Va ho Va sh
ni ot ni t ni ot
ll ll ll
G- aR G- aR G- aR
re AG re AG re AG
tr tr tr
ie ie ie
ve ve ve
r r r
To To To
G G G
KG KG KG
P P P
DA DA DA
LL LK LL LK LL LK
Prompt token
Prompt token
Prompt token

ig ig ig
ht ht ht
RA R R
GL G GL AG GL AG

(f) ALCE
ig ig ig
(b) Quality

ht ht ht
R R
(d) MusiqueQA

RA
HL
ig G HL
ig
AG HL
ig
AG
Fa ht Fa ht Fa ht
st RA st R st R
Gr G Gr AG Gr AG
ap ap ap
hR hR hR
Hi AG Hi AG Hi AG
pp pp pp
Completion token
Completion token
Completion token

oR oR oR
LG AG LG AG LG AG
ra ra ra
ph ph ph

Figure 17: Token costs for prompt and completion tokens in the generation stage across all datasets.
RA RA RA
G G G
RA RA RA
PT PT PT
OR OR OR
Prompt for generating abstract questions

Prompt:
Given the following description of a dataset:
{description}
Please identify 5 potential users who would engage with this dataset. For each user, list 5 tasks they would perform with
this dataset. Then, for each (user, task) combination, generate 5 questions that require a high-level understanding of the
entire dataset.
Output the results in the following structure:
- User 1: [user description]
- Task 1: [task description]
- Question 1:
- Question 2:
- Question 3:
- Question 4:
- Question 5:
- Task 2: [task description]
...
- Task 5: [task description]
- User 2: [user description]
...
- User 5: [user description]
...
Note that there are 5 users and 5 tasks for each user, resulting in 25 tasks in total. Each task should have 5 questions,
resulting in 125 questions in total. The Output should present the whole tasks and questions for each user.
Output:

Figure 18: The prompt for generating abstract questions.

20
Prompt for LLM-based multi-dimensional comparison

Prompt:
You will evaluate two answers to the same question based on three criteria: Comprehensiveness, Diversity, Empower-
ment, and Directness.
• Comprehensiveness: How much detail does the answer provide to cover all aspects and details of the question?
• Diversity: How varied and rich is the answer in providing different perspectives and insights on the question?
• Empowerment: How well does the answer help the reader understand and make informed judgments about the
topic?
• Directness: How specifically and clearly does the answer address the question?
For each criterion, choose the better answer (either Answer 1 or Answer 2) and explain why. Then, select an overall winner
based on these four categories.
Here is the question:
Question: {query}
Here are the two answers:
Answer 1: {answer1}
Answer 2: {answer2}
Evaluate both answers using the four criteria listed above and provide detailed explanations for each criterion. Output
your evaluation in the following JSON format:
{
"Comprehensiveness": {
"Winner": "[Answer 1 or Answer 2]",
"Explanation": "[Provide one sentence explanation here]"
},
"Diversity": {
"Winner": "[Answer 1 or Answer 2]",
"Explanation": "[Provide one sentence explanation here]"
},
"Empowerment": {
"Winner": "[Answer 1 or Answer 2]",
"Explanation": "[Provide one sentence explanation here]"
},
"Overall Winner": {
"Winner": "[Answer 1 or Answer 2]",
"Explanation": "[Briefly summarize why this answer is the overall winner]"
}
}
Output: