In-Depth Analysis of Graph-Based RAG in A Unified Framework
Table 1: Graph-based RAG methods analyzed under our unified framework.

Method | Graph Type | Index Component | Retrieval Primitive | Retrieval Granularity
RAPTOR [68] | Tree | Tree node | Question vector | Tree node
KGP [81] | Passage Graph | Entity | Question | Chunk
HippoRAG [22] | Knowledge Graph | Entity | Entities in question | Chunk
G-retriever [26] | Knowledge Graph | Entity, Relationship | Question vector | Subgraph
ToG [72] | Knowledge Graph | Entity, Relationship | Question | Subgraph
DALK [39] | Knowledge Graph | Entity | Entities in question | Subgraph
LGraphRAG [12] | Textual Knowledge Graph | Entity, Community | Question vector | Entity, Relationship, Chunk, Community
GGraphRAG [12] | Textual Knowledge Graph | Community | Question vector | Community
FastGraphRAG [16] | Textual Knowledge Graph | Entity | Entities in question | Entity, Relationship, Chunk
LLightRAG [21] | Rich Knowledge Graph | Entity, Relationship | Low-level keywords in question | Entity, Relationship, Chunk
GLightRAG [21] | Rich Knowledge Graph | Entity, Relationship | High-level keywords in question | Entity, Relationship, Chunk
HLightRAG [21] | Rich Knowledge Graph | Entity, Relationship | Both high- and low-level keywords | Entity, Relationship, Chunk
(Figure 2 depicts the four-stage pipeline: ❶ Graph building converts the corpus into chunks and then into a graph of nodes and edges; ❷ Index construction encodes graph elements into embeddings and builds the graph index; ❸ Operator configuration selects operators and parameters from the operator pool; ❹ Retrieval & generation converts the question into retrieval primitives (keywords or a question vector), retrieves relevant elements with the configured operators, and prompts the LLM to produce the answer.)
Figure 2: Workflow of graph-based RAG methods under our unified framework.
retrieve information from a graph constructed using these chunks.
• Retrieval element: Given a user question 𝑄, vanilla RAG aims to retrieve the most relevant chunks, while graph-based RAG methods focus on finding useful information from the graph, such as nodes, relationships, or subgraphs.

Table 2: Comparison of different types of graphs (attributes compared: Original Chunk, Entity Name, Entity Type, Entity Description, Relationship Name, Relationship Keyword, Relationship Description, and Edge Weight; graph types: Tree, PG, KG, TKG, RKG).

3 A UNIFIED FRAMEWORK
In this section, we develop a novel unified framework, consisting of four stages: ❶ Graph building, ❷ Index construction, ❸ Operator configuration, and ❹ Retrieval & generation, which can cover all existing graph-based RAG methods, as shown in Algorithm 1.
Algorithm 1: A unified framework for graph-based RAG
input: Corpus D, and user question 𝑄
output: The answer for user question 𝑄
1 C ← split D into multiple chunks;
  // (1) Graph building.
2 G ← GraphBuilding(C);
  // (2) Index construction.
3 I ← IndexConstruction(G, C);
  // (3) Operator configuration.
4 O ← OperatorConfiguration();
  // (4) Retrieve relevant information and generate response.
5 R ← Retrieval&generation(G, I, O, 𝑄);
6 return R;

Specifically, given the large corpus D, we first split it into multiple chunks C (line 1). We then sequentially execute operations in the following four stages (lines 2-5): (1) Build the graph G for the input chunks C (Section 4); (2) Construct the index based on the graph G from the previous stage (Section 5); (3) Configure the retrieval operators for the subsequent retrieval stage (Section 6); and (4) For the input user question 𝑄, retrieve relevant information from G using the selected operators and feed it along with the question 𝑄 into the LLM to generate the answer. Note that the first three stages are executed offline, enabling efficient online querying once they are completed. We present the workflow of graph-based RAG methods under our framework in Figure 2.
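To make the four stages concrete, the following minimal Python sketch mirrors Algorithm 1 end to end. Every component here (the toy passage-graph builder, the bag-of-words "index", and the single VDB-style chunk operator) is a simplified stand-in for illustration only, not the implementation evaluated in this paper.

```python
"""Minimal sketch of Algorithm 1 (unified framework); all components are toy
placeholders for illustration, not the paper's released implementation."""

def split_into_chunks(corpus, size=30):                       # line 1: D -> C
    words = corpus.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def graph_building(chunks):                                   # stage (1), offline
    # Toy passage graph: link two chunks if they share a capitalized token ("entity").
    ents = [set(w for w in c.split() if w[:1].isupper()) for c in chunks]
    edges = {(i, j) for i in range(len(chunks))
             for j in range(i + 1, len(chunks)) if ents[i] & ents[j]}
    return {"nodes": list(range(len(chunks))), "edges": edges, "entities": ents}

def index_construction(graph, chunks):                        # stage (2), offline
    # Toy node index: bag-of-words per node instead of a learned embedding.
    return {i: set(chunks[i].lower().split()) for i in graph["nodes"]}

def operator_configuration():                                 # stage (3), offline
    def vdb_chunk(index, question, k=2):                      # a VDB-style chunk operator
        q = set(question.lower().split())
        ranked = sorted(index, key=lambda i: len(index[i] & q), reverse=True)
        return ranked[:k]
    return [vdb_chunk]

def retrieval_and_generation(graph, index, operators, chunks, question):  # stage (4), online
    retrieved = {i for op in operators for i in op(index, question)}
    context = "\n".join(chunks[i] for i in sorted(retrieved))
    return f"[LLM call goes here]\nQuestion: {question}\nContext:\n{context}"

corpus = "Alice founded Acme in Paris . Acme later acquired Bolt . Bolt builds batteries ."
chunks = split_into_chunks(corpus, size=6)
G = graph_building(chunks)
I = index_construction(G, chunks)
O = operator_configuration()
print(retrieval_and_generation(G, I, O, chunks, "Who founded Acme ?"))
```

In the actual framework, stages (1)-(3) are run once offline per corpus, and only stage (4) is executed per question.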
4 GRAPH BUILDING
The graph building stage aims to transform the input corpus into a graph, serving as a fundamental component in graph-based RAG methods. Before building a graph, the first step is splitting the corpus into smaller chunks, followed by using an LLM or other tools to create nodes and edges based on these chunks. There are five types of graphs, each with a corresponding construction method; we present a brief description of each graph type and its construction method below:
❶ Passage Graph. In the passage graph (PG), each chunk represents a node, and edges are built by entity linking tools [81]. If two chunks share more of the same entities than a threshold, we link an edge between these two nodes.
❷ Tree. The tree is constructed in a progressive manner, where each chunk represents a leaf node in the tree. An LLM is then used to generate higher-level nodes. Specifically, at the 𝑖-th layer, the nodes of the (𝑖+1)-th layer are created by clustering nodes from the 𝑖-th layer that do not yet have parent nodes. For each cluster with more than two nodes, the LLM generates a virtual parent node with a high-level summary of its child node descriptions.
❸ Knowledge Graph. The knowledge graph (KG) is constructed by extracting entities and relationships from each chunk, where each entity represents an object and a relationship denotes the semantic relation between two entities.
❹ Textual Knowledge Graph. A textual knowledge graph (TKG) is a specialized KG (following the same construction steps as KG), with the key difference being that in a TKG, each entity and relationship is assigned a brief textual description.
❺ Rich Knowledge Graph. The rich knowledge graph (RKG) is an extended version of TKG, containing more information, including textual descriptions for entities and relationships, as well as keywords for relationships.
We summarize the key characteristics of each graph type in Table 2.

5 INDEX CONSTRUCTION
To support efficient online querying, existing graph-based RAG methods typically include an index-construction stage, which involves storing entities or relationships in the vector database and computing community reports for efficient online retrieval. Generally, there are three types of indices, ❶ Node Index, ❷ Relationship Index, and ❸ Community Index, where for the first two types we use well-known text-encoder models, such as BERT [9], BGE-M3 [55], or ColBert [35], to generate embeddings for nodes or relationships in the graph.
❶ Node Index stores the graph nodes in the vector database. For RAPTOR, G-retriever, DALK, FastGraphRAG, LGraphRAG, LLightRAG, and HLightRAG, all nodes in the graph are directly stored in the vector database. For each node in a KG, its embedding vector is generated by encoding its entity name, while for nodes in Tree, TKG, and RKG, the embedding vectors are generated by encoding their associated textual descriptions. KGP instead stores the TF-IDF matrix [24], which represents the term-weight distribution across different nodes (i.e., chunks) in the index.
❷ Relationship Index stores the relationships of the graph in a vector database, where for each relationship, its embedding vector is generated by encoding a description that combines its associated context (e.g., description) and the names of its linked entities.
❸ Community Index stores the community reports for each community, where communities are generated by a clustering algorithm and the LLM produces the reports. Specifically, the Leiden algorithm [75] is utilized by LGraphRAG and GGraphRAG.

6 RETRIEVAL AND GENERATION
In this section, we explore the key steps in graph-based RAG methods, i.e., selecting operators and using them to retrieve information relevant to question 𝑄.

6.1 Retrieval operators
In this subsection, we show that the retrieval stage in various graph-based RAG methods can be decoupled into a series of operators, with different methods selecting specific operators and combining them in various ways. By selecting and arranging these operators in different sequences, all existing (and potentially future) graph-based RAG methods can be implemented. Through an in-depth analysis of all implementations, we distill the retrieval process into a set of 19 operators, forming an operator pool. Based on the granularity of retrieval, we classify the operators into five categories:
• Node type. This type of operator focuses on retrieving "important" nodes for a given question; based on the selection policy, there are seven different operators to retrieve nodes. ❶ VDB leverages the vector database to retrieve nodes by computing the vector similarity with the query vector. ❷ RelNode extracts nodes from the provided relationships. ❸ PPR uses the Personalized PageRank (PPR) algorithm [25] to identify the top-𝑘 nodes most similar to the question, where the restart probability of each node is based on its similarity to the entities in the given question. ❹ Agent utilizes the capabilities of LLMs to select nodes from a list of candidate nodes. ❺ Onehop selects the one-hop neighbor entities of the given entities. ❻ Link selects the top-1 most similar entity for each entity in the given set from the vector database. ❼ TF-IDF retrieves the top-𝑘 relevant entities by ranking them based on term frequency and inverse document frequency from the TF-IDF matrix.
• Relationship type. These operators are designed to retrieve relationships from the graph that are most relevant to the user question. There are four operators: ❶ VDB, ❷ Onehop, ❸ Aggregator, and ❹ Agent. Specifically, the VDB operator also uses the vector database to retrieve relevant relationships. The Onehop operator selects relationships linked by one-hop neighbors of the given selected entities. The Aggregator operator builds upon the PPR operator of the node type: given the PPR scores of entities, the most relevant relationships are determined by leveraging entity-relationship interactions. Specifically, the score of each relationship is obtained by summing the scores of the two entities it connects; thus, the top-𝑘 relevant relationships can be selected. The key difference for the Agent operator is that, instead of using a candidate entity list, it uses a candidate relationship list, allowing the LLM to select the most relevant relationships based on the question.
• Chunk type. The operators in this type aim to retrieve the chunks most relevant to the given question. There are three operators: ❶ Aggregator, ❷ FromRel, and ❸ Occurrence. The first one builds upon the relationship scores produced by the Aggregator operator of the relationship type: we use these scores and the relationship-chunk interactions to select the top-𝑘 chunks, where the score of each chunk is obtained by summing the scores of all relationships extracted from it. The FromRel operator retrieves chunks that "contain" the given relationships. The Occurrence operator selects the top-𝑘 chunks based on the given relationships, assigning each chunk a score by counting the number of times it contains both entities of a relationship.
• Subgraph type. There are three operators to retrieve relevant subgraphs from the graph G. The ❶ KhopPath operator aims to identify 𝑘-hop paths in G by iteratively finding paths whose start and end points belong to the given entity set. After identifying a path, the entities within it are removed from the entity set, and this process repeats until the entity set is empty. Note that if two paths can be merged, they are combined into one path; for example, the two paths 𝐴 → 𝐵 → 𝐶 and 𝐴 → 𝐵 → 𝐶 → 𝐷 are merged into the single path 𝐴 → 𝐵 → 𝐶 → 𝐷. The ❷ Steiner operator first identifies the relevant entities and relationships, then uses these entities as seed nodes to construct a Steiner tree [24]. The ❸ AgentPath operator aims to identify the 𝑘-hop paths most relevant to a given question, using an LLM to filter out the irrelevant paths.
• Community type. Only LGraphRAG and GGraphRAG use the community operators, which include two operators, ❶ Entity and ❷ Layer. The Entity operator aims to obtain the communities containing the specified entities; all identified communities are sorted based on their rating (generated by the LLM), and then the top-𝑘 communities are returned. The Leiden algorithm generates hierarchical communities, where higher layers represent more abstract, high-level information; the Layer operator is used to retrieve all communities below the required layer.
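To illustrate how the operators above typically reduce to a few lines of code each, the sketch below implements two of the scoring rules just described: the relationship-level Aggregator (sum the PPR scores of a relationship's two endpoint entities) and the chunk-level Occurrence (count how many given relationships have both endpoints mentioned in a chunk). The data structures are simplified stand-ins, not the paper's implementation.

```python
"""Toy sketch of two operators from the pool (simplified stand-ins, not the
paper's code): the relationship-level Aggregator and the chunk-level Occurrence."""

def aggregator_relationships(ppr_scores, relationships, k=4):
    """Score each (head, relation, tail) by summing the PPR scores of its two
    endpoint entities, then keep the top-k relationships."""
    scored = [(ppr_scores.get(h, 0.0) + ppr_scores.get(t, 0.0), (h, r, t))
              for (h, r, t) in relationships]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [rel for _, rel in scored[:k]]

def occurrence_chunks(chunks, relationships, k=4):
    """Score each chunk by counting how many of the given relationships have BOTH
    endpoint entities mentioned in it, then keep the top-k chunks."""
    def score(chunk):
        return sum(1 for (h, _, t) in relationships if h in chunk and t in chunk)
    return sorted(chunks, key=score, reverse=True)[:k]

# Example usage with toy data.
ppr = {"Alice": 0.42, "Acme": 0.31, "Bolt": 0.10, "Paris": 0.05}
rels = [("Alice", "founded", "Acme"), ("Acme", "acquired", "Bolt"),
        ("Acme", "located_in", "Paris")]
chunks = ["Alice founded Acme in Paris.", "Acme later acquired Bolt.",
          "Bolt builds batteries."]
top_rels = aggregator_relationships(ppr, rels, k=2)
print(top_rels)  # [('Alice', 'founded', 'Acme'), ('Acme', 'acquired', 'Bolt')]
print(occurrence_chunks(chunks, top_rels, k=1))
# ['Alice founded Acme in Paris.'] (the first two chunks tie; original order breaks the tie)
```

The remaining operators follow the same pattern: each consumes the graph, the index, or previously retrieved elements, and returns a ranked list of elements of one granularity.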
Table 3: Operators utilized in graph-based RAG methods; “N/A” means that this type of operator is not used.
6.2 Operator configuration
Under our unified framework, any existing graph-based RAG method can be implemented by leveraging the operator pool along with specific method parameters. These parameters define two key aspects: (1) which operators to use, and (2) how to combine or apply the selected operators.
In Table 3, we present how the existing graph-based RAG methods utilize our provided operators to assemble their retrieval stages. Due to this independent and modular decomposition of all graph-based RAG methods, we not only gain a deeper understanding of how these approaches work but also gain the flexibility to combine these operators to create new methods. Besides, new operators can be easily created; for example, we can create a new operator VDB within the community type, which allows us to retrieve the most relevant communities by using vector search to compare the semantic similarity between the question and the communities. In our later experimental results (see Exp.5 in Section 7.3), thanks to our modular design, we can design a new state-of-the-art graph-based RAG method by first creating two new operators and combining them with the existing operators.
There are two types of answer generation paradigms: ❶ Directly and ❷ Map-Reduce. The former directly utilizes the LLM to generate the answer, while the latter, used in GGraphRAG, analyzes the retrieved communities one by one: first, each community is used to answer the question independently in parallel, and then all relevant partial answers are summarized into a final answer.
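The sketch below illustrates this configuration view: the operator pool is a registry of named operators, and a method is nothing more than a list of (operator, parameters) pairs. The operator bodies are stubs and the two example configurations are illustrative, not the exact parameterization of the original systems.

```python
"""Sketch of operator configuration: a method = which operators to use + how they
are parameterized. Operator names mirror Section 6.1; bodies are stubs."""

OPERATOR_POOL = {}

def operator(name):
    def register(fn):
        OPERATOR_POOL[name] = fn
        return fn
    return register

@operator("entity.VDB")
def entity_vdb(question, graph, index, k=4):
    return f"top-{k} entities by vector similarity to the question"

@operator("community.VDB")      # the new community-level VDB operator described above
def community_vdb(question, graph, index, k=4):
    return f"top-{k} communities by vector similarity between question and report"

@operator("chunk.Occurrence")
def chunk_occurrence(question, graph, index, k=4):
    return f"top-{k} chunks scored by relationship co-occurrence"

# Key aspect (1): which operators; key aspect (2): how they are applied/parameterized.
METHODS = {
    "LightRAG-like": [("entity.VDB", {"k": 4}), ("chunk.Occurrence", {"k": 4})],
    "new-method":    [("community.VDB", {"k": 4}), ("chunk.Occurrence", {"k": 4})],
}

def run_retrieval(method, question, graph=None, index=None):
    return [OPERATOR_POOL[name](question, graph, index, **params)
            for name, params in METHODS[method]]

print(run_retrieval("new-method", "How does AI influence modern education?"))
```

Creating a new method under this design amounts to editing the configuration list, which is what Exp.5 in Section 7.3 exploits.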
Table 4: Datasets used in our experiments; the underlined number of chunks denotes that the dataset is pre-split into chunks by the expert annotator.

Dataset | # of Tokens | # of Questions | # of Chunks | QA Type
MultihopQA | 1,434,889 | 2,556 | 609 | Specific QA
Quality | 1,522,566 | 4,609 | 265 | Specific QA
PopQA | 2,630,554 | 1,172 | 33,595 | Specific QA
MusiqueQA | 3,280,174 | 3,000 | 29,898 | Specific QA
HotpotQA | 8,495,056 | 3,702 | 66,581 | Specific QA
ALCE | 13,490,670 | 948 | 89,562 | Specific QA
Mix | 611,602 | 125 | 61 | Abstract QA
MultihopSum | 1,434,889 | 125 | 609 | Abstract QA
Agriculture | 1,949,584 | 125 | 12 | Abstract QA
CS | 2,047,923 | 125 | 10 | Abstract QA
Legal | 4,774,255 | 125 | 94 | Abstract QA

7 EXPERIMENTS
We now present the experimental results. Section 7.1 discusses the setup. We discuss the results for specific QA and abstract QA tasks in Sections 7.2 and 7.3, respectively.
Table 5: Overall performance comparison on specific QA tasks.

Method | MultihopQA Acc / Rec | Quality Acc | PopQA Acc / Rec | MusiqueQA Acc / Rec | HotpotQA Acc / Rec | ALCE STRREC / STREM / STRHIT
ZeroShot | 49.022 / 34.256 | 37.058 | 28.592 / 8.263 | 1.833 / 5.072 | 35.467 / 42.407 | 15.454 / 3.692 / 30.696
VanillaRAG | 50.626 / 36.918 | 39.141 | 60.829 / 27.058 | 17.233 / 27.874 | 50.783 / 57.745 | 34.283 / 11.181 / 63.608
G-retriever | 42.019 / 43.116 | 31.807 | 17.084 / 6.075 | 2.733 / 11.662 | — / — | 9.754 / 2.215 / 19.726
ToG | 41.941 / 38.435 | 34.888 | 47.677 / 23.727 | 9.367 / 20.536 | — / — | 13.975 / 3.059 / 29.114
KGP | 48.161 / 36.272 | 33.955 | 57.255 / 24.635 | 17.333 / 27.572 | — / — | 27.692 / 8.755 / 51.899
DALK | 53.952 / 47.232 | 34.251 | 45.604 / 19.159 | 11.367 / 22.484 | 33.252 / 47.232 | 21.408 / 4.114 / 44.937
LLightRAG | 44.053 / 35.528 | 34.780 | 38.885 / 16.764 | 9.667 / 19.810 | 34.144 / 41.811 | 21.937 / 5.591 / 43.776
GLightRAG | 48.474 / 38.365 | 33.413 | 20.944 / 8.146 | 7.267 / 17.204 | 25.581 / 33.297 | 17.859 / 3.587 / 37.131
HLightRAG | 50.313 / 41.613 | 34.368 | 41.244 / 18.071 | 11.000 / 21.143 | 35.647 / 43.334 | 25.578 / 6.540 / 50.422
FastGraphRAG | 52.895 / 44.278 | 37.275 | 53.324 / 22.433 | 13.633 / 24.470 | 43.193 / 51.007 | 30.190 / 8.544 / 56.962
HippoRAG | 53.760 / 47.671 | 48.297 | 59.900 / 24.946 | 17.000 / 28.117 | 50.324 / 58.860 | 23.357 / 6.962 / 43.671
LGraphRAG | 55.360 / 50.429 | 37.036 | 45.461 / 18.657 | 12.467 / 23.996 | 33.063 / 42.691 | 28.448 / 8.544 / 54.747
RAPTOR | 56.064 / 44.832 | 56.997 | 62.545 / 27.304 | 24.133† / 35.595† | 55.321† / 62.424† | 35.255† / 11.076† / 65.401†
“Who won the 2024 U.S. presidential election?”). We categorize the questions into two groups based on complexity: Simple and Complex. The former has answers directly available in one or two text chunks, requiring no reasoning across chunks, and includes three datasets: Quality [62], PopQA [53], and HotpotQA [85]. The latter involves reasoning across multiple chunks, understanding implicit relationships, and synthesizing knowledge, and includes the datasets MultihopQA [74], MusiqueQA [76], and ALCE [17].
• Abstract. Unlike the previous groups, the questions in this category are not centered on specific factual queries. Instead, they involve abstract, conceptual inquiries that encompass broader topics, summaries, or overarching themes. An example of an abstract question is: “How does artificial intelligence influence modern education?”. Abstract questions require a high-level understanding of the dataset contents; this group includes five datasets: Mix [65], MultihopSum [74], Agriculture [65], CS [65], and Legal [65].
Their statistics, including the numbers of tokens and questions, and the question-answering (QA) types, are reported in Table 4. For specific (both complex and simple) QA datasets, we use the questions provided by each dataset, while for abstract QA datasets, we follow existing works [12, 21] and generate questions using one of the most advanced LLMs, GPT-4o. Specifically, for each dataset, we generate 125 questions by prompting GPT-4o, following the approach in [21]. The prompt template used for question generation is provided in our technical report [67]. Note that MultihopQA and MultihopSum originate from the same source, but differ in the types of questions they include—the former focuses on complex QA tasks, while the latter focuses on abstract QA tasks.
Evaluation Metric. For the specific QA tasks, we use Accuracy and Recall to evaluate performance on the first five datasets, based on whether gold answers are included in the generations instead of strictly requiring exact matching, following [53, 69]. For the ALCE dataset, answers are typically full sentences rather than specific options or words; following existing works [17, 68], we use string recall (STRREC), string exact matching (STREM), and string hit (STRHIT) as evaluation metrics. For abstract QA tasks, we follow prior work [12] and use a head-to-head comparison approach with an LLM evaluator (i.e., GPT-4o). This is mainly because LLMs have demonstrated strong capabilities as evaluators of natural language generation, often achieving state-of-the-art or competitive results when compared to human judgments [77, 90]. Here, we utilize four evaluation dimensions for abstract QA tasks: Comprehensiveness, Diversity, Empowerment, and Overall.

(Figure 3 illustrates the three parts of the study: 1. Method Implementation (reimplemented methods and systems), 2. Benchmark Collection (datasets, questions, metrics), and 3. Evaluation (accuracy, applicability, graph type / operator combination / parameter analysis, token cost and efficiency, new SOTA, insights).)
Figure 3: Workflow of our empirical study.

Implementation. We implement all the algorithms in Python with our proposed unified framework and try our best to ensure a native and effective implementation. All experiments are run on 350 Ascend 910B-3 NPUs [31]. Besides, Zeroshot and VanillaRAG are also included in our study; they represent, respectively, the model's inherent capability and the performance improvement brought by basic RAG. If a method cannot finish in two days, we mark its result as N/A in the figures and “—” in the tables.
Hyperparameter Settings. In our experiment, we use Llama-3-8B [11] as the default LLM, which is widely used in existing RAG methods [88]. For the LLM, we set the maximum token length to 8,000, and use greedy decoding to generate one sample for the deterministic output. For each method requiring top-𝑘 selection (e.g., chunks or entities), we set 𝑘 = 4 to accommodate the token length limitation. We use one of the most advanced text-encoding models, BGE-M3 [55], as the embedding model across all methods to generate embeddings for vector search. If an expert annotator pre-splits the dataset into chunks, we use those as they preserve human insight; otherwise, following existing works [12, 21], we divide the corpus into 1,200-token chunks. For other hyper-parameters of each method, we follow the original settings in their available code.
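For reference, the default settings above can be collected into a single configuration sketch; the key names below are illustrative rather than taken from any released code.

```python
# Default experimental settings from Section 7.1, expressed as one config dict.
# The structure and key names are illustrative, not the paper's actual config file.
DEFAULT_CONFIG = {
    "llm": {
        "model": "Llama-3-8B",       # default backbone (Llama-3-70B only in the ablation)
        "max_tokens": 8000,           # maximum token length
        "decoding": "greedy",         # one deterministic sample per query
        "num_samples": 1,
    },
    "retrieval": {
        "top_k": 4,                   # k for every top-k selection (chunks, entities, ...)
    },
    "embedding": {
        "model": "BGE-M3",            # embedding model for all vector searches
    },
    "chunking": {
        "use_expert_chunks_if_available": True,
        "chunk_size_tokens": 1200,    # otherwise split the corpus into 1,200-token chunks
    },
    "timeout_days": 2,                # methods not finishing in two days are marked N/A / "—"
}
```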
Table 6: Comparison of RAPTOR and RAPTOR-K.

Method | MultihopQA Acc / Rec | Quality Acc | PopQA Acc / Rec
RAPTOR | 56.064 / 44.832 | 56.997 | 62.545 / 27.304
RAPTOR-K | 56.768 / 44.208 | 54.567 | 64.761 / 28.469

7.2 Evaluation for specific QA
In this section, we evaluate the performance of different methods on specific QA tasks.
Exp.1. Overall performance. We report the metric values of all algorithms on specific QA tasks in Table 5. We can make the following observations and analyses:
(1) Generally, the RAG technique significantly enhances LLM performance across all datasets, and the graph-based RAG methods (e.g., HippoRAG and RAPTOR) typically exhibit higher accuracy than VanillaRAG. However, if the retrieved elements are not relevant to the given question, RAG may degrade the LLM's accuracy. For example, on the Quality dataset, compared to Zeroshot, RAPTOR improves accuracy by 53.80%, while G-retriever decreases it by 14.17%. This is mainly because, for simple QA tasks, providing only entities and relationships from a subgraph is insufficient to answer such questions effectively.
(2) For specific QA tasks, retaining the original text chunks is crucial for accurate question answering, as the questions and answers in these datasets are derived from the text corpus. This may explain why G-retriever, ToG, and DALK, which rely solely on graph structure information, perform poorly on most datasets. However, on MultihopQA, which requires multi-hop reasoning, DALK effectively retrieves relevant reasoning paths, achieving accuracy and recall improvements of 6.57% and 27.94% over VanillaRAG, respectively.
(3) If the dataset is pre-split into chunks by the expert annotator, VanillaRAG often performs better compared to datasets where chunks are split based on the token size; we investigate this phenomenon further in our technical report [67].
(4) RAPTOR often achieves the best performance on most datasets, especially for simple questions. For complex questions, RAPTOR also performs exceptionally well. This is mainly because, for such questions, high-level summarized information is crucial for understanding the underlying relationships across multiple chunks. Hence, LGraphRAG is expected to achieve similar results, as it also incorporates high-level information (i.e., a summarized report of the most relevant community for a given question). However, we only observe this effect on the MultihopQA dataset. For the other two complex QA datasets, LGraphRAG even underperforms compared to VanillaRAG, while RAPTOR still achieves the best performance on these two datasets. We hypothesize that this discrepancy arises from differences in how high-level information is retrieved (see the operators used for each method in Table 3). Specifically, RAPTOR leverages similarity-based vector search, comparing the embedding similarity between the given question and the high-level text summaries. In contrast, LGraphRAG selects communities based on whether they contain the given entities and on the rating score generated by the LLM. However, this approach may retrieve communities that do not align with the question's semantic meaning, potentially degrading LLM performance. To further investigate this issue, we conduct an extra ablation study later (see Exp.4 in this section).
(5) For the three largest datasets, the K-means [24]-based RAPTOR (denoted as RAPTOR-K) also demonstrates remarkable performance. This suggests that the specific clustering method used in RAPTOR has only a minor impact on overall performance, perhaps because different clustering methods share the same key idea: grouping similar items into the same cluster; therefore, they may generate similar chunk clusters. To verify this, we compare RAPTOR-K with RAPTOR on the first three datasets and present the results in Table 6. We observe that RAPTOR-K achieves comparable or even better performance than RAPTOR. In the remaining part of our experiments, if RAPTOR does not finish constructing the graph within two days, we use RAPTOR-K instead.
Exp.2. Token costs of graph and index building. In this experiment, we first report the token costs of building the four types of graphs across all datasets. Notably, building PG incurs no token cost, as it does not rely on the LLM for graph construction. As shown in Figure 4(a) to (f), we observe the following: (1) Building trees consistently requires the least token cost, while TKG and RKG incur the highest token costs, with RKG slightly exceeding TKG. In some cases, RKG requires up to 40× more tokens than trees. (2) KG falls between these extremes, requiring more tokens than trees but fewer than TKG and RKG. This trend aligns with the results in Table 2, where graphs with more attributes require higher token costs for construction. (3) Recall that the token cost for an LLM call consists of two parts: the prompt part, which accounts for the tokens used in providing the input, and the completion part, which includes the tokens generated by the model as a response. Here, we report the token costs for prompt and completion on the HotpotQA and ALCE datasets in Figure 4(g) to (h). The other datasets exhibit similar trends; we include their results in our technical report [67]. We conclude that, regardless of the graph type, the prompt part always incurs higher token costs than the completion part.
We then examine the token costs of index building across all datasets. Since only LGraphRAG and GGraphRAG require an LLM for index construction, we report only the token costs for generating community reports in Figure 5. We can see that the token cost for index construction is nearly the same as that for building TKG. This is mainly because a report must be generated for each community, and the number of communities is typically large, especially in large datasets. For example, the HotpotQA dataset contains 57,384 communities, significantly increasing the overall token consumption. That is to say, on large datasets, the two versions of GraphRAG often take more tokens than other methods in the offline stage.
Exp.3. Evaluation of the generation costs. In this experiment, we evaluate the time and token costs for each method in specific QA tasks. Specifically, we report the average time and token costs for each query across all datasets in Table 7 (these results may vary upon rerunning due to the inherent uncertainty of the LLM). It is not surprising that ZeroShot and VanillaRAG are the most cost-efficient methods in terms of both time and token consumption. Among all graph-based RAG methods, RAPTOR and HippoRAG are typically the most cost-efficient, as they share a similar retrieval stage with VanillaRAG; the main difference lies in the chunk retrieval operators they use. Besides, KGP and ToG are the most expensive methods, as they rely on agents (i.e., different roles of the LLM) for information retrieval during prompt construction: the former utilizes the LLM to reason about the next required information based on the original question and the retrieved chunks, while the latter employs the LLM to select relevant entities and relationships for answering the question. On the other hand, the costs of LLightRAG, GLightRAG, and HLightRAG gradually increase, aligning with the fact that more information is incorporated into the prompt construction. All three methods are more expensive than LGraphRAG in specific QA tasks, as they use the LLM to extract keywords in advance. Moreover, the time cost of all methods is proportional to the completion token cost; we present these results in our technical report [67], which also explains why on some datasets VanillaRAG is even faster than ZeroShot.
(Bar charts omitted: token costs of graph building with KG/TKG/RKG legends and prompt/completion breakdowns across MultihopQA, Quality, PopQA, MusiqueQA, HotpotQA, and ALCE.)
Figure 5: Token cost of index construction in specific QA.

Exp.4. Detailed analysis for RAPTOR and LGraphRAG. Our first analysis of RAPTOR aims to explain why RAPTOR outperforms VanillaRAG. Recall that in RAPTOR, for each question 𝑄, the top-𝑘 items are retrieved across the entire tree, meaning the retrieved items may originate from different layers. We therefore report the proportion of retrieved items across different tree layers in Table 8. As we shall see, for the MultihopQA and MusiqueQA datasets, the proportion of retrieved high-level information (i.e., items not from leaf nodes) is significantly higher than in other datasets. For datasets requiring multi-hop reasoning to answer questions, high-level information plays an essential role. This may explain why RAPTOR outperforms VanillaRAG on these two datasets.
Table 8: Proportion of retrieved nodes across tree layers.

Layer | MultihopQA | Quality | PopQA | MusiqueQA | HotpotQA | ALCE
0 | 59.3% | 76.8% | 76.1% | 69.3% | 89.7% | 90.6%
1 | 27.5% | 18.7% | 16.5% | 28.1% | 9.5% | 8.8%
>1 | 13.2% | 4.5% | 7.4% | 2.6% | 0.8% | 0.6%

Table 9: Descriptions of the different variants of LGraphRAG.

Name | Retrieval elements
LGraphRAG | Entity, Relationship, Community, Chunk
GraphRAG-ER | Entity, Relationship
GraphRAG-CC | Community, Chunk
VGraphRAG-CC | Community, Chunk (with the new retrieval strategy)
VGraphRAG | Entity, Relationship, Community, Chunk (with the new retrieval strategy)

We then conduct a detailed analysis of LGraphRAG on complex questions in specific QA datasets by modifying its retrieval methods or element types. By doing this, we create three variants of LGraphRAG; we present the detailed description of each variant in Table 9. Here, VGraphRAG-CC introduces a new retrieval strategy: unlike LGraphRAG, it uses vector search to retrieve the top-𝑘 elements (i.e., chunks or communities) from the vector database. Eventually, we evaluate their performance on the three complex QA datasets and present the results in Table 10. We make the following analysis: (1) Community reports serve as effective high-level information for complex QA tasks. For instance, VGraphRAG-CC achieves comparable or even better performance than RAPTOR, highlighting the value of community reports. (2) The retrieval strategy in the original LGraphRAG, which selects communities and chunks based on the frequency of relevant entities in a given question, may not be optimal in some cases. This is verified by VGraphRAG-CC consistently outperforming GraphRAG-CC, which suggests that a vector-based retrieval approach is better than the heuristic rule-based way. (3) For multi-hop reasoning tasks (e.g., MultihopQA), entity and relationship information can serve as an auxiliary signal: it helps the LLM link relevant information (i.e., entity and relationship descriptions) and guides the reasoning process. This is supported by LGraphRAG outperforming both GraphRAG-CC and VGraphRAG-CC on the MultihopQA dataset, indicating the importance of structured graph information for multi-hop reasoning.
Exp.5. New SOTA algorithm. Based on the above analysis, we aim to develop a new state-of-the-art method for complex QA datasets, denoted as VGraphRAG. Specifically, our algorithm first retrieves the top-𝑘 entities and their corresponding relationships; this step is the same as in LGraphRAG. Next, we adopt the vector search-based retrieval strategy to select the most relevant communities and chunks; this step is the same as in VGraphRAG-CC. Then, by combining the four elements above, we construct the final prompt of our method to effectively guide the LLM in generating accurate answers. The results are also shown in Table 10; we can see that VGraphRAG performs best on all complex QA datasets. For example, on the ALCE dataset, it improves STRREC, STREM, and STRHIT by 8.47%, 13.18%, and 4.93%, respectively, compared to VGraphRAG-CC. Meanwhile, compared to RAPTOR, our new algorithm VGraphRAG improves Accuracy by 6.42% on the MultihopQA dataset and 11.6% on the MusiqueQA dataset, respectively.

7.3 Evaluation for abstract QA
In this section, we evaluate the performance of different methods on abstract QA tasks.
Exp.1. Overall Performance. We evaluate the performance of methods that support abstract QA (see Table 1) by presenting head-to-head win rate percentages, comparing the performance of each row method against each column method. Here, we denote VR, RA, GS, LR, and FG as VanillaRAG, RAPTOR, GGraphRAG with high-layer communities (i.e., two layers for the original implementation), HLightRAG, and FastGraphRAG, respectively. The results are shown in Figures 6 to 10, and we can see that: (1) Graph-based RAG methods often outperform VanillaRAG, primarily because they effectively capture inter-connections among chunks. (2) Among graph-based RAG methods, GGraphRAG and RAPTOR generally outperform HLightRAG and FastGraphRAG, as they integrate high-level summarized text into the prompt, which is essential for abstract QA tasks. In contrast, the latter two rely solely on low-level graph structures (e.g., entities and relationships) and original text chunks, limiting their effectiveness in handling abstract questions. (3) On almost all datasets, GGraphRAG consistently achieves the best performance, aligning with findings from existing work [12]. This suggests that community reports are highly effective in capturing high-level structured knowledge and relational dependencies among chunks; meanwhile, the Map-Reduce strategy further aids in filtering irrelevant retrieved content. (4) Sometimes, RAPTOR outperforms GGraphRAG, likely because the textual information in the original chunks is crucial for answering certain questions.
Exp.2. Token costs of graph and index building. The token costs of graph and index building across all abstract QA datasets are shown in Figures 11 and 12, respectively. The conclusions are highly similar to those of Exp.2 in Section 7.2.
Exp.3. Evaluation of the generation costs. In this experiment, we present the time and token costs for each method in abstract QA tasks. As shown in Table 11, GGraphRAG is the most expensive method, as expected, while other graph-based methods exhibit comparable costs, although they are more expensive than VanillaRAG. For example, on the MultihopSum dataset, GGraphRAG requires 57× more time and 210× more tokens per query compared to VanillaRAG. Specifically, each query in GGraphRAG takes around 9 minutes and consumes 300K tokens, making it impractical for real-world scenarios. This is because, to answer an abstract question, GGraphRAG needs to analyze all retrieved communities, which is highly time- and token-consuming, especially when the number of communities is large (e.g., in the thousands).
Exp.4. New SOTA algorithm. While GGraphRAG shows remarkable performance in abstract QA, its time and token costs are not acceptable in practice. Based on the above analysis, we aim to design a cost-efficient version of GGraphRAG, named CheapRAG. Our design is motivated by the fact that only a few communities (typically fewer than 10) are useful for generating responses; however, in GGraphRAG, all communities within a specified number of layers must be retrieved and analyzed, which is highly token-consuming. Besides, as discussed earlier, original chunks are also valuable for certain questions. Therefore, our new algorithm, CheapRAG, incorporates these useful chunks. Specifically, given a new question 𝑄, our algorithm adopts a vector search-based retrieval strategy to select the most relevant communities and chunks. Next, we apply a Map-Reduce strategy to generate the final answer.
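The following sketch shows the Map-Reduce generation step in isolation, applied to the (few) communities and chunks selected by vector search; the prompts, the relevance filter, and the `llm` callable are illustrative placeholders, not the paper's implementation.

```python
"""Sketch of the Map-Reduce generation paradigm used by GGraphRAG/CheapRAG:
each retrieved community report produces a partial answer in the map step, and
the reduce step summarizes the relevant partial answers into the final answer."""
from concurrent.futures import ThreadPoolExecutor

def map_reduce_answer(question, reports, llm, max_workers=8):
    def map_one(report):
        return llm(f"Answer the question using only this community report.\n"
                   f"Question: {question}\nReport: {report}\n"
                   f"Reply 'IRRELEVANT' if the report does not help.")

    with ThreadPoolExecutor(max_workers=max_workers) as pool:   # map step, in parallel
        partials = list(pool.map(map_one, reports))

    relevant = [p for p in partials if "IRRELEVANT" not in p]   # keep useful partials
    return llm("Summarize these partial answers into one final answer.\n"
               f"Question: {question}\nPartial answers:\n" + "\n".join(relevant))

# Toy usage with a fake LLM client that just reports its prompt length.
fake_llm = lambda prompt: f"(answer derived from {len(prompt)} prompt characters)"
print(map_reduce_answer("How does AI influence education?", ["report A", "report B"], fake_llm))
```

The key cost difference between GGraphRAG and CheapRAG is the length of `reports`: the former passes every community in the selected layers, whereas the latter passes only the few communities (and chunks) returned by vector search.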
Table 10: Comparison of our newly designed methods on specific datasets with complex questions.

Dataset | Metric | ZeroShot | VanillaRAG | LGraphRAG | RAPTOR | GraphRAG-ER | GraphRAG-CC | VGraphRAG-CC | VGraphRAG
MultihopQA | Accuracy | 49.022 | 50.626 | 55.360 | 56.064 | 52.739 | 52.113 | 55.203 | 59.664
MultihopQA | Recall | 34.526 | 36.918 | 50.429 | 44.832 | 45.113 | 43.770 | 46.750 | 50.893
MusiqueQA | Accuracy | 1.833 | 17.233 | 12.467 | 24.133 | 11.200 | 13.767 | 22.400 | 26.933
MusiqueQA | Recall | 5.072 | 27.874 | 23.996 | 35.595 | 22.374 | 25.707 | 35.444 | 40.026
ALCE | STRREC | 15.454 | 34.283 | 28.448 | 35.255 | 26.774 | 35.366 | 37.820 | 41.023
ALCE | STREM | 3.692 | 11.181 | 8.544 | 11.076 | 7.5949 | 11.920 | 13.608 | 15.401
ALCE | STRHIT | 30.696 | 63.608 | 54.747 | 65.401 | 52.743 | 64.662 | 68.460 | 71.835
(Figures 6–8 show head-to-head win-rate matrices among VR, RA, GS, LR, and FG on four dimensions: (a) Comprehensiveness, (b) Diversity, (c) Empowerment, (d) Overall.)
Figure 6: The abstract QA results on Mix dataset.
Figure 7: The abstract QA results on MultihopSum dataset.
Figure 8: The abstract QA results on Agriculture dataset.
As shown in Figure 13 and Table 11, CheapRAG not only achieves better performance than GGraphRAG but also significantly reduces token costs (in most cases). For example, on the MultihopSum dataset, CheapRAG reduces token costs by 100× compared to GGraphRAG, while achieving better answer quality. While the diversity of answers generated by CheapRAG is not yet optimal, we leave this as future work.
More Analysis. In addition, we present a detailed analysis of the graph-based RAG methods in our technical report [67], including the effect of chunk size, the effect of the base model, and the size of the graph. Due to the limited space, we only summarize the key conclusions here: (1) Chunk quality is crucial for all methods. (2) Stronger LLM backbone models can further enhance performance for all methods. (3) The constructed graphs are typically very sparse, with their size proportional to the number of chunks.

8 LESSONS AND OPPORTUNITIES
We summarize the lessons (L) for practitioners and propose practical research opportunities (O) based on our observations.
Lessons:
L1. In Figure 14, we depict a roadmap of the recommended RAG methods, highlighting which methods are best suited for different scenarios.
L2. Chunk quality is very important for the overall performance of all RAG methods, and human experts are better at splitting chunks than relying solely on token size.
L3. For complex questions in specific QA, high-level information is typically needed, as it captures the complex relationships among multiple chunks.
(Figures 9 and 10 show head-to-head win-rate matrices among VR, RA, GS, LR, and FG on (a) Comprehensiveness, (b) Diversity, (c) Empowerment, and (d) Overall.)
Figure 9: The abstract QA results on CS dataset.
Figure 10: The abstract QA results on Legal dataset.
(Figure 13 reports the win rates of our newly designed method against VR, RA, GS, LR, and FG on the five abstract QA datasets: (a) Mix, (b) MultihopSum, (c) Agriculture, (d) CS, (e) Legal.)
Figure 13: Comparison of our newly designed method on abstract QA datasets.
A ADDITIONAL EXPERIMENTS
A.1 More analysis
In this section, we present a more detailed analysis of graph-based RAG methods from the following angles.
Exp.1. Effect of the chunk size. Recall that our study includes some datasets that are pre-split by the expert annotator. To investigate this impact, we re-split the corpus into multiple chunks based on token size for these datasets instead of using their original chunks. Here, we create three new datasets from HotpotQA, PopQA, and ALCE, named HotpotAll, PopAll, and ALCEAll, respectively.
For each dataset, we use Original to denote its original version and New chunk to denote the version after re-splitting. We report the results of graph-based RAG methods on both the original and new versions of the datasets in Figure 15; we can see that: (1) The performance of all methods declines, mainly because rule-based chunk splitting (i.e., by token size) fails to provide concise information as effectively as expert-annotated chunks. (2) Graph-based methods, especially those relying on TKG and RKG, are more sensitive to chunk quality. This is because the graphs they construct encapsulate richer information, and coarse-grained chunk splitting introduces potential noise within each chunk. Such noise can lead to inaccurate extraction of entities or relationships and their corresponding descriptions, significantly degrading the performance of these methods. (3) As for token costs, all methods that retrieve chunks incur a significant increase due to the larger chunk size in New chunk compared to Original, while other methods remain stable. These findings highlight that chunk segmentation quality is crucial for the overall performance of all RAG methods.
Exp.2. Effect of the base model. In this experiment, we evaluate the effect of the LLM backbone by replacing Llama-3-8B with Llama-3-70B [11] on the MultihopQA and ALCEAll datasets. We make the following observations: (1) All methods achieve performance improvements when equipped with a more powerful backbone LLM (i.e., Llama-3-70B) compared to Llama-3-8B. For example, on the ALCEAll dataset, replacing the 8B LLM with the 70B one improves ZeroShot's STRREC, STREM, and STRHIT by 102.1%, 94.2%, and 101.7%, respectively. (2) Our proposed method, VGraphRAG, consistently achieves the best performance regardless of the LLM backbone used. For example, on the MultihopQA dataset, VGraphRAG with Llama-3-70B achieves 7.20% and 12.13% improvements over RAPTOR with Llama-3-70B in terms of Accuracy and Recall, respectively. (3) Under both Llama-3-70B and Llama-3-8B, while all methods show improved performance, they exhibit similar trends to those observed with Llama-3-8B. For instance, RAPTOR remains the best-performing method among all existing graph-based RAG approaches, regardless of the LLM used.
Exp.3. The size of graph. For each dataset, we report the size of the five types of graphs in Table 13. We observe that PG is typically denser than the other types of graphs, as it connects nodes based on shared entities, where each node represents a chunk. In fact, the probability of two chunks sharing at least a few entities is quite high, leading to a high graph density (i.e., the ratio of edges to nodes), sometimes approaching a clique (fully connected graph). In contrast, KG, TKG, and RKG are much sparser since they rely entirely on LLMs to extract nodes and edges. This sparsity is primarily due to the relatively short and incomplete outputs typically generated by LLMs, which miss considerable potential node-edge pairs. Interestingly, the size or density of the constructed graph has not shown a strong correlation with the final performance of graph-based RAG methods. This observation motivates us to explore a method for evaluating the quality of the constructed graph before using it for LLM-based question answering.

A.2 Additional results on token costs
As shown in Figure 16, we present the proportions of token costs for the additional eight datasets, which exhibit trends similar to those observed in HotpotQA and ALCE.

A.3 Additional results of generation token costs
As shown in Figure 17, we present the average token costs for prompt tokens and completion tokens across all questions in all specific QA datasets. We can observe that the running time of each method is highly proportional to the completion token costs, which aligns with the computational paradigm of the Transformer architecture.

A.4 Evaluation metrics
This section outlines the metrics used for evaluation.
• Metrics for specific QA Tasks. We use accuracy as the evaluation metric, based on whether the gold answers appear in the model's generated outputs, rather than requiring an exact match, following the approach in [3, 53, 69]. This choice is motivated by the uncontrollable nature of LLM outputs, which often makes it difficult to achieve exact matches with standard answers. Similarly, we prefer recall over precision as it better reflects the accuracy of the generated responses.
• Metrics for abstract QA Tasks. Building on existing work, we use an LLM to generate abstract questions, as shown in Figure 18. Defining ground truth for abstract questions, especially those involving complex high-level semantics, presents significant challenges. To address this, we adopt an LLM-based multi-dimensional comparison method, inspired by [12, 21], which evaluates comprehensiveness, diversity, empowerment, and overall quality. We use a robust LLM, specifically GPT-4o, to rank each baseline in comparison to our method. The evaluation prompt used is shown in Figure 19.

A.5 Implementation details
In this subsection, we present more details about our system implementation. Specifically, we use HNSW [52] from Llama-index [50] (a well-known open-source project) as the default vector database for efficient vector search. In addition, for each method, we optimize efficiency by batching or parallelizing operations such as encoding nodes or chunks and computing personalized PageRank, among others, during the retrieval stage.

B MORE DISCUSSIONS
B.1 New operators
Here, we introduce the operators that are not used in existing graph-based RAG methods but are employed in our newly designed state-of-the-art methods.
Chunk type. We include a new operator VDB of chunk type, which is used in our VGraphRAG method. This operator is the same as the chunk retrieval strategy of VanillaRAG.
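A minimal sketch of this VDB chunk operator is given below. The paper's system uses HNSW from Llama-index with BGE-M3 embeddings (Section A.5); to keep the example self-contained, this sketch uses the hnswlib library directly and a toy hashed bag-of-words encoder in place of the real embedding model.

```python
"""Sketch of the VDB chunk operator (the VanillaRAG-style chunk retrieval reused
by VGraphRAG). hnswlib and a toy encoder stand in for the actual vector database
(HNSW via Llama-index) and BGE-M3 embeddings."""
import hnswlib
import numpy as np

def toy_encode(texts, dim=64):
    # Hashed bag-of-words stand-in for the real BGE-M3 encoder.
    vecs = np.zeros((len(texts), dim), dtype=np.float32)
    for i, t in enumerate(texts):
        for w in t.lower().split():
            vecs[i, hash(w) % dim] += 1.0
    return vecs / np.maximum(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-8)

def build_chunk_vdb(chunks, dim=64):
    emb = toy_encode(chunks, dim)
    index = hnswlib.Index(space="cosine", dim=dim)
    index.init_index(max_elements=len(chunks), ef_construction=100, M=16)
    index.add_items(emb, np.arange(len(chunks)))
    index.set_ef(32)
    return index

def vdb_chunk_retrieve(index, chunks, question, k=4, dim=64):
    q = toy_encode([question], dim)
    labels, _ = index.knn_query(q, k=min(k, len(chunks)))
    return [chunks[i] for i in labels[0]]

chunks = ["Alice founded Acme in Paris.", "Acme later acquired Bolt.", "Bolt builds batteries."]
vdb = build_chunk_vdb(chunks)
print(vdb_chunk_retrieve(vdb, chunks, "Who founded Acme?", k=2))
```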
(Figure 15 compares graph-based RAG methods on the Original and New chunk versions of the re-split datasets: (a) HotpotQA (Accuracy), (b) HotpotQA (Recall), (c) PopQA (Accuracy), (d) PopQA (Recall), (e) ALCE (STRREC), (f) ALCE (STREM).)
Community type. We also include a new operator VDB of community type, retrieving the top-𝑘 communities by vector search, where the embedding of each community is generated by encoding its community report.

B.2 More Lessons and Opportunities
In this section, we show more lessons and opportunities learned from our study.
Lessons
L6. For large datasets, both versions of the GraphRAG methods incur unacceptable token costs, as they contain a large number of communities, leading to high costs for generating community reports.
L7. Regardless of whether the questions are specific or abstract, they all rely on an external corpus (i.e., documents). For such questions, merely using graph-structure information (nodes, edges, or subgraphs) is insufficient to achieve good performance.
L8. Methods designed for knowledge reasoning tasks, such as DALK, ToG, and G-retriever, do not perform well on document-based QA tasks. This is because these methods are better suited for extracting reasoning rules or paths from well-constructed KGs. However, when KGs are built from raw text corpora, they may not accurately capture the correct reasoning rules, leading to suboptimal performance in document-based QA tasks.
L9. The effectiveness of RAG methods is highly impacted by the relevance of the retrieved elements to the given question. That is, if the retrieved information is irrelevant or noisy, it may degrade the LLM's performance. When designing new graph-based RAG methods, it is crucial to evaluate whether the retrieval strategy effectively retrieves relevant information for the given question.
Opportunities
O6. An interesting future research direction is to explore more graph-based RAG applications. For example, applying graph-based RAG to scientific literature retrieval can help researchers efficiently extract relevant studies and discover hidden relationships between concepts.
Table 12: The specific QA performance comparison of graph-based RAG methods with different LLM backbone models.

Method | LLM backbone | MultihopQA Accuracy | MultihopQA Recall | ALCEAll STRREC | ALCEAll STREM | ALCEAll STRHIT
Zeroshot | Llama-3-8B | 49.022 | 34.256 | 15.454 | 3.692 | 30.696
Zeroshot | Llama-3-70B | 55.908 | 52.987 | 31.234 | 7.170 | 61.920
VanillaRAG | Llama-3-8B | 50.626 | 36.918 | 29.334 | 8.228 | 56.329
VanillaRAG | Llama-3-70B | 56.768 | 49.127 | 34.961 | 9.810 | 68.038
HippoRAG | Llama-3-8B | 53.760 | 47.671 | 21.633 | 5.696 | 41.561
HippoRAG | Llama-3-70B | 57.277 | 57.736 | 32.904 | 9.916 | 32.534
RAPTOR | Llama-3-8B | 56.064 | 44.832 | 34.044 | 10.971 | 62.342
RAPTOR | Llama-3-70B | 63.028 | 61.042 | 37.286 | 12.236 | 68.671
FastGraphRAG | Llama-3-8B | 52.895 | 44.278 | 27.258 | 7.490 | 53.376
FastGraphRAG | Llama-3-70B | 54.069 | 55.787 | 35.658 | 12.236 | 65.612
LGraphRAG | Llama-3-8B | 55.360 | 50.429 | 27.785 | 8.017 | 52.954
LGraphRAG | Llama-3-70B | 58.060 | 55.390 | 34.256 | 10.232 | 66.561
VGraphRAG | Llama-3-8B | 59.664 | 50.893 | 35.213 | 11.603 | 64.030
VGraphRAG | Llama-3-70B | 67.567 | 68.445 | 37.576 | 12.447 | 69.198
Table 13: The size of each graph type across all datasets.
Another potential application is legal document analysis, where graph structures can capture case precedents and legal interpretations to assist in legal reasoning.
O7. Users may request multiple questions simultaneously, but existing graph-based RAG methods process them sequentially. Hence, a promising future direction is to explore efficient scheduling strategies that optimize multi-query handling. This could involve batching similar questions or parallelizing retrieval.
O8. Different types of questions require different levels of information, yet all existing graph-based RAG methods rely on fixed, predefined rules. How to design an adaptive mechanism that can address these varying needs remains an open question.
O9. Existing methods do not fully leverage the graph structure; they typically rely on simple graph patterns (e.g., nodes, edges, or 𝑘-hop paths). Although GraphRAG adopts a hierarchical community structure (detected by the Leiden algorithm), this approach does not consider node attributes, potentially compromising the quality of the communities. That is, determining which graph structures are superior remains an open question.

B.3 Benefit of our framework
Our framework offers exceptional flexibility by enabling the combination of different methods at various stages. This modular design allows different algorithms to be seamlessly integrated, ensuring that each stage—such as graph building and retrieval & generation—can be independently optimized and recombined. For example, methods like HippoRAG, which typically rely on a KG, can easily be adapted to use an RKG instead, based on specific domain needs.
In addition, our operator design allows for simple modifications—often just a few lines of code—to create entirely new graph-based RAG methods. By adjusting the retrieval stage or swapping components, researchers can quickly test and implement new strategies, significantly accelerating the development cycle of retrieval-enhanced models.
The modular nature of our framework is further reinforced by the use of retrieval elements (such as node, relationship, or subgraph) coupled with retrieval operators. This combination enables us to easily design new operators tailored to specific tasks. For example, by modifying the strategy for retrieving given elements, we can create customized operators that suit different application scenarios.
By systematically evaluating the effectiveness of various retrieval components under our unified framework, we can identify the most efficient combinations of graph construction, indexing, and retrieval strategies. This approach enables us to optimize retrieval performance across a range of use cases, allowing for both the enhancement of existing methods and the creation of novel, state-of-the-art techniques.
Finally, our framework contributes to the broader research community by providing a standardized methodology to assess graph-based RAG approaches. The introduction of a unified evaluation testbed ensures reproducibility, promotes fair benchmarking, and facilitates future innovations in RAG-based LLM applications.

B.4 Limitations
In our empirical study, we put considerable effort into evaluating the performance of existing graph-based RAG methods from various angles. However, our study still has some limitations, primarily due to resource constraints. (1) Token Length Limitation: The primary experiments are conducted using Llama-3-8B with a token window size of 8k. This limitation on token length restricts the model's ability to process longer input sequences, which could potentially impact the overall performance of the methods, particularly in tasks that require extensive context. Larger models with larger token windows could better capture long-range dependencies and deliver more robust results. This constraint is a significant factor that may affect the generalizability of our findings. (2) Limited Knowledge Datasets: Our study did not include domain-specific knowledge datasets, which are crucial for certain applications. Incorporating such datasets could provide more nuanced insights and allow for a better evaluation of how these methods perform in specialized settings. (3) Resource Constraints: Due to resource limitations, the largest model we utilized is Llama-3-70B, and the entire paper consumes nearly 10 billion tokens. Running larger models, such as GPT-4o (175B parameters or beyond), would incur significantly higher costs, potentially reaching several hundred thousand dollars depending on usage. While we admit that introducing more powerful models could further enhance performance, the 70B model is already a strong choice, balancing performance and resource feasibility. That is to say, exploring the potential of even larger models in future work could offer valuable insights and further refine the findings. (4) Prompt Sensitivity: The performance of each method is highly affected by its prompt design. Due to resource limitations, we did not conduct prompt ablation studies and instead used the available prompts from the respective papers. A fairer comparison would mitigate this impact by using prompt tuning tools, such as DSPy [34], to customize the prompts and optimize the performance of each method.
These limitations highlight areas for future exploration, and overcoming these constraints would enable a more thorough and reliable evaluation of graph-based RAG methods, strengthening the findings and advancing the research.

C PROMPT
(Figure 17 shows, for each method (ZeroShot, VanillaRAG, G-retriever, ToG, KGP, DALK, LLightRAG, GLightRAG, HLightRAG, FastGraphRAG, HippoRAG, LGraphRAG, RAPTOR), the prompt and completion token costs in the generation stage on the six specific QA datasets: (a) MultihopQA, (b) Quality, (c) PopQA, (d) MusiqueQA, (e) HotpotQA, (f) ALCE; methods that did not finish are marked N/A.)
Figure 17: Token costs for prompt and completion tokens in the generation stage across all datasets.
Prompt for generating abstract questions
Prompt:
Given the following description of a dataset:
{description}
Please identify 5 potential users who would engage with this dataset. For each user, list 5 tasks they would perform with
this dataset. Then, for each (user, task) combination, generate 5 questions that require a high-level understanding of the
entire dataset.
Output the results in the following structure:
- User 1: [user description]
- Task 1: [task description]
- Question 1:
- Question 2:
- Question 3:
- Question 4:
- Question 5:
- Task 2: [task description]
...
- Task 5: [task description]
- User 2: [user description]
...
- User 5: [user description]
...
Note that there are 5 users and 5 tasks for each user, resulting in 25 tasks in total. Each task should have 5 questions,
resulting in 125 questions in total. The Output should present the whole tasks and questions for each user.
Output:
Prompt for LLM-based multi-dimensional comparison
Prompt:
You will evaluate two answers to the same question based on four criteria: Comprehensiveness, Diversity, Empowerment, and Directness.
• Comprehensiveness: How much detail does the answer provide to cover all aspects and details of the question?
• Diversity: How varied and rich is the answer in providing different perspectives and insights on the question?
• Empowerment: How well does the answer help the reader understand and make informed judgments about the
topic?
• Directness: How specifically and clearly does the answer address the question?
For each criterion, choose the better answer (either Answer 1 or Answer 2) and explain why. Then, select an overall winner
based on these four categories.
Here is the question:
Question: {query}
Here are the two answers:
Answer 1: {answer1}
Answer 2: {answer2}
Evaluate both answers using the four criteria listed above and provide detailed explanations for each criterion. Output
your evaluation in the following JSON format:
{
"Comprehensiveness": {
"Winner": "[Answer 1 or Answer 2]",
"Explanation": "[Provide one sentence explanation here]"
},
"Diversity": {
"Winner": "[Answer 1 or Answer 2]",
"Explanation": "[Provide one sentence explanation here]"
},
"Empowerment": {
"Winner": "[Answer 1 or Answer 2]",
"Explanation": "[Provide one sentence explanation here]"
},
"Overall Winner": {
"Winner": "[Answer 1 or Answer 2]",
"Explanation": "[Briefly summarize why this answer is the overall winner]"
}
}
Output: