CORAG: A Cost-Constrained Retrieval Optimization System for Retrieval-Augmented Generation

Ziting Wang, Haitao Yuan, Wei Dong
Nanyang Technological University, Singapore
[email protected], [email protected], [email protected]

Gao Cong (Singapore)    Feifei Li (China)
[email protected]    [email protected]
ABSTRACT
Large Language Models (LLMs) have demonstrated remarkable generation capabilities but often struggle to access up-to-date information, which can lead to hallucinations. Retrieval-Augmented Generation (RAG) addresses this issue by incorporating knowledge from external databases, enabling more accurate and relevant responses. Due to the context window constraints of LLMs, it is impractical to input the entire external database context directly into the model. Instead, only the most relevant information, referred to as "chunks", is selectively retrieved. However, current RAG research faces three key challenges. First, existing solutions often select each chunk independently, overlooking potential correlations among them. Second, in practice, the utility of chunks is non-monotonic, meaning that adding more chunks can decrease overall utility. Traditional methods emphasize maximizing the number of included chunks, which can inadvertently compromise performance. Third, each type of user query possesses unique characteristics that require tailored handling, an aspect that current approaches do not fully consider.
To overcome these challenges, we propose CORAG, a cost-constrained retrieval optimization system for retrieval-augmented generation. We employ a Monte Carlo Tree Search (MCTS)-based policy framework to find optimal chunk combinations sequentially, allowing for a comprehensive consideration of correlations among chunks. Additionally, rather than viewing budget exhaustion as a termination condition, we integrate budget constraints into the optimization of chunk combinations, effectively addressing the non-monotonicity of chunk utility. Furthermore, by designing a configuration agent, our system predicts optimal configurations for each query type, enhancing adaptability and efficiency. Experimental results indicate an improvement of up to 30% over baseline models, underscoring the framework's effectiveness, scalability, and suitability for long-context applications.

PVLDB Reference Format:
Ziting Wang, Haitao Yuan, Wei Dong, Gao Cong, and Feifei Li. CORAG: A Cost-Constrained Retrieval Optimization System for Retrieval-Augmented Generation. PVLDB, 14(1): XXX-XXX, 2020. doi:XX.XX/XXX.XX

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 14, No. 1, ISSN 2150-8097. doi:XX.XX/XXX.XX

[Figure 1: Example of chunk combination order. The query "Who designed the Eiffel Tower and when was it constructed? Provide information on its height as well." is posed over four potential chunks: χ1 (world's tallest structure for over 40 years until 1930), χ2 (tallest man-made structure until the Chrysler Building in 1930), χ3 (constructed between 1887 and 1889, 324 meters tall), and χ4 (named after the engineer Gustave Eiffel, whose company designed and built the tower). Individual chunk scores: χ3 = 0.8, χ1 = 0.6, χ4 = 0.4, χ2 = 0.3. Combination orders score differently, e.g., χ1 = 0.3 (256 tokens), χ1 + χ3 = 0.4 (512), χ3 + χ4 = 0.8 (512), χ4 + χ3 = 0.9 (512), χ1 + χ2 + χ4 = 0.6 (768), χ1 + χ4 + χ3 = 0.8 (768), χ1 + χ2 + χ3 + χ4 = 0.9 (913). The LLM's response combines the designer, construction period, and height.]

1 INTRODUCTION
Although LLMs have demonstrated exceptional capabilities in generation tasks, they often struggle with accessing up-to-date information, which can lead to hallucinations [10, 38]. To address these challenges, RAG has emerged as a crucial solution. By integrating external data sources into the LLM, RAG can provide more accurate, relevant, and up-to-date information. Nowadays, RAG has been widely studied in the context of LLMs, especially for tasks requiring up-to-date external knowledge such as question answering [2, 22, 29], medical information retrieval [1, 32], and time series analysis [12, 26, 40]. External data sources are often extremely large, making it impractical to input them directly into the LLM. To address this issue, the data is typically split into disjoint chunks and stored in a vector database, and users then query the most useful chunks to construct prompts for LLMs. Therefore, designing efficient and accurate structures and algorithms to search for the most relevant chunks has become a prominent research topic and has been widely studied in both the database [39, 48] and machine learning communities [2, 35, 43].
However, there are three key challenges in the existing approaches.
Challenge 1: Correlations between chunks. Currently, two primary methods are used to identify the most relevant chunks. The first approach formulates the problem as an approximate k-nearest neighbor (AKNN) task [41, 45], where each chunk is assigned a score, and the approximate top-k chunks ranked by score are selected. The second approach clusters the chunks, returning all chunks within the most relevant clusters in response to a query [22, 29]. However, both methods overlook potential correlations between chunks: the first approach disregards correlations entirely, while the second approach accounts for them only superficially by treating all chunks within each cluster as equally relevant. As a result, when multiple chunks convey similar or overlapping information, these methods introduce substantial redundancy in the selected chunks.
For example, as illustrated in Figure 1, when querying the height and history of the Eiffel Tower, if each chunk is treated independently, a greedy method would select chunks χ3 and χ1 since they have the top two scores. However, both chunks only provide historical information, which is insufficient to fully address the query. To better address the query, it is necessary to include a chunk with the constructor's name, such as χ4. On the other hand, the clustering approach would return all of χ1, χ2, χ3, and χ4, resulting in redundancy. An optimal solution would instead select χ3 and χ4, as they provide the required information without redundancy. Additionally, research [11, 19, 42] has shown that the order of chunks influences LLM performance, a factor that existing methods also overlook. Following the example of the Eiffel Tower, when chunks χ3 and χ4 are selected, placing χ4 first yields a higher score than the reverse order. However, determining the optimal chunk combination and its order is a challenging task, since both require a search space growing exponentially with the number of available chunks. In this paper, we further demonstrate that this problem is NP-hard (see Section 2.2).
Challenge 2: Non-monotonicity of utility. Current solutions operate on the assumption that including more chunks will always yield better final results. Specifically, in the AKNN-based approach, exactly k chunks are selected deterministically each time. In the clustering-based approach, a distance threshold between clusters and the query is set, and all clusters within this threshold are returned. Both of them return as many chunks as possible. However, in practice, the utility of chunks is not monotonic. More specifically, excessive chunks can dilute key information by adding marginally relevant content, creating noise that reduces clarity. Additionally, conflicting or nuanced differences across chunks may confuse the model, lowering response quality. For example, as illustrated in Figure 1, when χ3 and χ4 are selected, adding the chunk χ1 decreases utility, highlighting that utility scores are often non-monotonic in practice.
Challenge 3: Diversity of queries. User queries come in different types, each requiring its own ranking strategy due to their unique characteristics [47]. In current RAG systems, the utility scores of chunks are often determined by the assigned reranker model. Various reranker models exist, but we observe that their performance varies significantly across different query types, and no single fixed reranker model consistently outperforms the others across all query variations (see our experiments in Section 6.3.4 for more details). Current methods [20, 46] typically rely on static reranker models for ranking chunks, lacking the flexibility to adapt to varying query contexts.
Problem Statement: Is there a RAG system that fully considers correlations between chunks and the non-monotonicity of utility while being adaptable to all types of queries?

1.1 Our Contributions
In this paper, we answer this question in the affirmative by proposing a novel MCTS-based policy tree framework to optimize chunk retrieval in RAG systems. Our contributions can be summarized as follows:
• We propose the first RAG framework that considers the chunk combination order for the RAG task. Instead of considering each chunk independently or at the cluster level, we use MCTS to search for the optimal chunk combination order sequentially. The high-level idea is as follows: First, we initialize the root node. Then, in an iterative process, we expand the tree by selecting the highest-utility node and computing its expanded nodes' utilities. After each expansion, we update the utilities throughout the entire policy tree. During this process, the decision at each iteration depends on the chunks already selected, allowing us to fully consider the correlations between chunks. Moreover, MCTS reduces the exponential search space to a linear one, and we apply parallel expansion techniques to further enhance computational efficiency. With these designs, we address Challenge 1.
• In contrast to prior RAG frameworks that treat exhaustion of the budget as one of the termination conditions, we propose a novel formulation wherein budget constraints are integrated into the process of optimizing chunk combinations, fully considering the non-monotonicity of chunk utility and thereby addressing Challenge 2. Moreover, by prioritizing high-relevance, low-cost chunks and factoring in token length, we further reduce computational costs.
• We propose a contrastive learning-based agent that dynamically adjusts MCTS configurations per query, adapting reranker models and configurations to the specific query domain. This approach tailors retrieval for dynamic, domain-specific queries with flexibility and robustness, addressing Challenge 3.
• Additionally, we conducted comprehensive experiments comparing our framework with several state-of-the-art methods. The results validate the effectiveness, efficiency, and scalability of our approach, showing a performance improvement of up to 30% over the baselines.
2 PRELIMINARIES
In this section, we first introduce the definitions of some key concepts, such as chunks and the chunk combination order, in Section 2.1. Next, we give the NP-hardness proof for the chunk combination order optimization problem in Section 2.2. Finally, we discuss related work in Section 2.3.

2.1 Key Concepts
RAG & Chunks. RAG is an effective method for improving the performance of generation models by retrieving relevant context from an external corpus. In this approach, the corpus is first divided into smaller, manageable units called chunks, which are stored in a vector database. Therefore, we can give a formal definition of a chunk as follows:

Definition 2.1 (Chunk). Let C represent a corpus of documents. A chunk χ is defined as a contiguous block of text extracted from C. Formally, a chunk χ consists of a sequence of tokens (t_1, t_2, ..., t_n), where each t_i is a token from C and the size n is set by users.

In the RAG system, each chunk is embedded into a vector representation using an embedding model, which captures the chunk's semantic meaning and enables the retrieval of contextually similar content. When a new query is received, the vector database performs a similarity search to identify the chunks that are most semantically relevant to the query. These retrieved chunks are then passed to a generator (e.g., a large language model) to produce a final response based on the retrieved content. The more tokens a chunk contains, the higher the cost incurred by the generator. Thus, we define the cost of a chunk as cost(χ) = |χ|, which equals the number of tokens in the chunk.

Chunk Combination Order. In the RAG system, the retrieval result from the vector database may include multiple chunks. However, due to input limitations of the generation model, using all of these chunks is impractical. Therefore, it is necessary to select an optimal subset of chunks, known as a chunk combination, that fits within a given cost budget. Additionally, the order of the chunks within the combination significantly impacts the performance of the generation model. The goal is to identify the chunk combination with the optimal order, formally defined as follows:

Definition 2.2 (Optimal Chunk Combination Order Selection). Let {χ1, χ2, ..., χk} be a set of potential chunks, B be the cost budget, and Φ = ⟨χ_{φ1}, ..., χ_{φm}⟩ represent a potential chunk combination order, where each χ_{φi} is a chunk and the index φi indicates its position in Φ. Let U(Φ) be the utility score assigned by the reranker model, which may be arbitrary or composite. Our objective is to find the chunk combination order that maximizes the utility score while adhering to the cost constraint of feeding the chunks into the LLM to generate the final response, i.e.,

$\hat{\Phi} = \arg\max_{\Phi} U(\Phi) \quad \text{s.t.} \quad \sum_{\chi_i \in \Phi} \mathrm{cost}(\chi_i) \leq \mathcal{B}$    (1)

2.2 Proof of NP-Hardness
To demonstrate that chunk combination order selection is NP-hard, we reduce the Maximum Weighted Hyperclique Problem (MWHP) to it. Since MWHP is NP-hard, we show that any MWHP instance can be transformed into a Chunk Combination Optimization instance in polynomial time.

2.2.1 Problem definition of MWHP. Given a hypergraph H = (V, E, w_1, w_2), where V is the set of vertices and E is the set of hyperedges, each hyperedge containing a subset of V, w_1 : V → R and w_2 : E → R are weight functions assigning a weight to each vertex and hyperedge, respectively. Given a subset of vertices V′ ⊆ V, we say a hyperedge e belongs to V′, i.e., e ∈ V′, if V′ covers all vertices of e. The objective is to find k vertices maximizing the sum of the weights of these vertices and their covered hyperedges:

$\arg\max_{\mathcal{V}' \subseteq \mathcal{V},\, |\mathcal{V}'| = k} \; \sum_{v \in \mathcal{V}'} w_1(v) + \sum_{e \in \mathcal{V}'} w_2(e)$    (2)

2.2.2 Reduction process. We now construct a corresponding Chunk Combination Optimization Problem instance from the given MWHP instance. For each node v ∈ V, we create a corresponding chunk χ_v and define its cost as cost(χ_v) ≡ 1. Then, a chunk combination order Φ corresponds to a subset of vertices of V, which is denoted as V(Φ) ⊆ V. We define its utility as

$U(\Phi) = \sum_{v \in \mathcal{V}(\Phi)} w_1(v) + \sum_{e \in \mathcal{V}(\Phi)} w_2(e)$    (3)

Finally, we set B = k, and our objective is

$\arg\max_{\Phi} U(\Phi) \quad \text{s.t.} \quad \sum_{\chi_i \in \Phi} \mathrm{cost}(\chi_i) = |\Phi| \leq k$    (4)

Denote Φ* as the solution of (4); then V(Φ*) is the solution of (2), and the reduction can be done in O(|V| · |E|) time. Please note that a precondition of this reduction is that, in our Chunk Combination Optimization Problem, we allow the reranker model to be arbitrary, meaning the utility scores can also be assigned arbitrarily. The complexity of finding the optimal chunk combination order can be significantly reduced if certain assumptions are made about the reranker. For instance, if the reranker does not consider correlations and simply sums the utility scores of individual chunks linearly, each chunk could then be evaluated independently. However, in this paper, we address the most general case, making no assumptions about the reranker model.
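To make the search space of Definition 2.2 concrete, the following minimal sketch (our illustration, not the paper's implementation) enumerates every ordered chunk combination that fits the budget and scores each one with a black-box utility; the `utility` callable and the whitespace-based token count are stand-ins for the reranker model and cost(χ) = |χ| defined above.

```python
from itertools import permutations
from typing import Callable, Sequence

def brute_force_best_order(
    chunks: Sequence[str],
    budget: int,
    utility: Callable[[tuple[str, ...]], float],      # black-box reranker score U(Phi)
    cost: Callable[[str], int] = lambda c: len(c.split()),  # stand-in for token count |chi|
) -> tuple[tuple[str, ...], float]:
    """Exhaustively solve Eq. (1): maximize U(Phi) subject to the total chunk cost <= budget."""
    best_order, best_score = (), float("-inf")
    # Enumerate every subset size and every ordering of that subset.
    for size in range(1, len(chunks) + 1):
        for order in permutations(chunks, size):
            if sum(cost(c) for c in order) > budget:
                continue  # violates the cost constraint B
            score = utility(order)
            if score > best_score:
                best_order, best_score = order, score
    return best_order, best_score
```

With n chunks this explores on the order of n! orderings, so it is only feasible for tiny instances such as the four-chunk example of Figure 1; CORAG replaces this enumeration with the MCTS-based search of Section 4.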
[Figure 2: Overview of the CORAG pipeline. Step 1: Potential Chunks Retrieval (the corpus is chunked, indexed in a vector database, and searched by query-embedding similarity). Step 2: Online Configuration Inference (a configuration agent, trained offline with contrastive, classification, and regression losses over an encoding network, predicts the reranker and MCTS configuration). Step 3: Optimal Chunk Search (an MCTS-based policy tree search returns the optimal chunk order, which is fed to the LLM).]
2.3 Related Work
2.3.1 Retrieval-Augmented Generation. RAG [14, 20] is widely used for handling knowledge-intensive NLP tasks. In a typical RAG pipeline, a dense-embedding-based retriever searches for relevant information from an external database, which is then used by the LLM during the generation process. To improve this pipeline, some studies [5, 18, 22, 35] have focused on adjusting retrievers to better suit the generation needs of LLMs, developing multi-step retrieval methods, and filtering out irrelevant information. Although there are many advanced retrievers [8, 9, 15, 16, 27, 34], it is more promising to optimize the retriever and the LLM together in an end-to-end process [25, 31]. For example, research [30] has focused on training retrievers and LLMs together, either simultaneously or in stages. However, this requires a surrogate loss for optimization and complicates the training pipeline, especially when the embedding database needs to be re-indexed frequently, which incurs high compute costs. Therefore, methods such as [5] decompose complex, multi-step queries into smaller sub-intents to improve response comprehensiveness without frequent re-indexing. However, these approaches often overlook the critical role of chunk combination order, which can significantly impact the overall response quality of LLMs. To the best of our knowledge, this paper is the first to consider chunk combination order within the RAG task.
2.3.2 Reranking for RAG. Reranking methods are crucial for enhancing retrieval performance within the RAG pipeline [43, 44, 51]. Traditional reranking approaches [33, 50] typically rely on mid-sized language models, such as BERT or T5, to rank retrieved contexts. However, these models often struggle to capture semantic relationships between queries and contexts, especially in zero-shot or few-shot settings. Therefore, recent research [43] highlights the potential of instruction-tuned LLMs to improve context reranking by more accurately identifying relevant contexts, even in the presence of noise or irrelevant information. Despite these advancements, the full capacity of LLMs for reranking in RAG systems remains underutilized. In particular, studies have shown that chunk arrangement can impact LLM performance [19], emphasizing the need to consider chunk combination order in RAG tasks. However, existing models are not well suited for cases where optimal retrieval requires specific sequences or combinations of chunks rather than isolated chunks. Hence, future research is needed to better leverage LLMs for arranging chunks more effectively in response to queries within the RAG framework.
2.3.3 Reinforcement Learning for Large Language Models. Recently, reinforcement learning (RL) has been increasingly utilized in various data management and RAG tasks. RL techniques enable large language models to improve their generation ability by leveraging external knowledge sources, such as search engines [13, 23]. In particular, human feedback [4, 36, 37] can be integrated to help models produce more accurate and contextually relevant responses through the RL framework. In addition, some query optimization approaches [17, 21, 49] further refine retrieval processes, allowing model performance to inform query adjustments and ultimately enhance downstream task outcomes. In this work, we apply a lightweight RL technique, MCTS, to optimize the chunk combination order search process in the RAG system. We also introduce a configuration agent to guide the MCTS search process. To the best of our knowledge, this is the first approach to address this specific problem.

3 SYSTEM OVERVIEW
As previously mentioned, existing RAG frameworks face three key challenges: how to fully consider correlations between chunks and the non-monotonicity of the utility of chunk combination orders, and how to adapt to diverse query domains. These challenges result in reduced relevancy of the outputs. To address these issues, we introduce CORAG, a system designed to retrieve the optimal chunk combination while taking into account the query domain and user budget. As the most important component of our system, we introduce the Optimal Chunk Combination Search module. This module employs an MCTS-based policy tree to perform sequential searches of chunk combination orders under a cost constraint, allowing us to fully consider the correlations between chunks (Challenge 1) as well as the non-monotonic nature of the utility of chunk combination orders (Challenge 2). Additionally, we propose a Configuration Inference module that recommends the optimal MCTS configuration and reranker tailored to various query domains, thereby addressing Challenge 3. Below, we give a brief description of these two modules.
Optimal Chunk Combination Search: A straightforward approach to considering chunk correlations involves retrieving potential chunks from a vector database (as shown in Step 1 in Figure 2) and exhaustively exploring all possible chunk combinations. However, this method incurs significant latency and computational costs. To mitigate this, we construct a policy tree (as shown in Step 2), reframing the optimal chunk combination search as a node search problem within the tree. Specifically, the root node of the policy tree represents an initial empty state, and each child node corresponds to a specific combination of chunks. For example, if the root node has a child node representing chunk χ1, one of its child nodes might represent the combination χ1 + χ2, while another could represent χ1 + χ3.
We design a search algorithm based on MCTS to address this problem. Unlike traditional MCTS, our approach expands the node with the highest utility in each iteration, simultaneously evaluating all possible child nodes. Additionally, we account for both cost and budget constraints during the policy tree search process. Node utility is calculated by balancing exploration with cost control, optimizing for both efficiency and accuracy.
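The policy tree described above can be represented with a small node structure; the sketch below is our illustration with hypothetical names, storing the ordered chunk combination at each node together with the value and visit statistics used later for utility computation, and expanding a node by appending each remaining chunk.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PolicyNode:
    """A policy tree node holding an ordered chunk combination (hypothetical sketch)."""
    chunks: tuple[str, ...] = ()          # ordered combination; () for the root
    parent: Optional["PolicyNode"] = None
    children: list["PolicyNode"] = field(default_factory=list)
    value: float = 0.0                    # accumulated benefit V(v_i)
    visits: int = 0                       # visit count N(v_i)

    def expand(self, candidate_chunks: list[str]) -> list["PolicyNode"]:
        """Create one child per remaining chunk, each extending this node's order by one chunk."""
        remaining = [c for c in candidate_chunks if c not in self.chunks]
        self.children = [PolicyNode(chunks=self.chunks + (c,), parent=self) for c in remaining]
        return self.children

# Example: the root's child for chi_1 expands into chi_1+chi_2, chi_1+chi_3, chi_1+chi_4.
root = PolicyNode()
chi_1 = PolicyNode(chunks=("chi_1",), parent=root)
children = chi_1.expand(["chi_1", "chi_2", "chi_3", "chi_4"])
```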
Configuration Inference: A simple solution for configuration tuning is to enumerate every possible configuration or reranker, compute the results in parallel, and then select the optimal configuration. However, this would result in impractical costs for the RAG system. To optimize the configuration (i.e., the number of iterations, cost coefficient, and exploration coefficient) for the policy tree search process, we introduce a configuration agent that dynamically generates configurations based on the query domain. To ensure the model's effectiveness, we employ a contrastive learning approach that uses positive and negative label pairs: positive labels correspond to query embeddings from the same optimal reranker, while negative labels come from different optimal rerankers. A joint loss function is used to simultaneously optimize both the regression (for parameter tuning) and contrastive learning (to enhance label differentiation).
Summary. The pipeline of our framework is shown in Figure 2. We first generate an embedding for the input query, which is then used to retrieve potential chunks from the vector database. The query embedding is also fed into the configuration agent, which dynamically generates the optimal MCTS configuration based on the query domain. Using this optimal configuration, we search the policy tree to determine the optimal chunk combination and order from the retrieved potential chunks. Finally, this optimal chunk combination is used to construct the final prompt for the LLM.

4 CHUNK COMBINATION RETRIEVAL
As previously discussed, the order in which chunks are combined significantly impacts the effectiveness of prompt construction for LLMs. Enumerating all possible orders of chunk combinations is not feasible due to the vast number of potential combinations, particularly when the scenario involves a large number of chunks. In this section, we present a novel method that achieves a good trade-off between efficiency and accuracy in searching for the optimal chunk combination order. We first model the problem as searching for the optimal node within a policy tree (Section 4.1). Then, we propose an MCTS-based algorithm to address this node search problem (Section 4.2).

4.1 Policy Tree Search Modeling
To approach the optimal combination order, the first step is to find a data structure that enables efficient enumeration of all possible combination orders. A natural choice is a tree, allowing us to explore all potential answers by traversing from the root to the leaf nodes.
Policy Tree. As illustrated in Figure 3, we construct a policy tree to represent all potential orders of chunk combinations sourced from the vector database. Specifically, the root node symbolizes the initial state without any chunk, with each subsequent node depicting a selected chunk from the potential ones. Thus, a child node emerges from its parent by selecting the next available chunk from the queue of potential chunks and appending it to the sequence established by the ancestor node. For instance, if a node represents the chunk combination order {χ1}, then a child node might embody a subsequent combination order such as {χ1, χ2}, {χ1, χ3}, or {χ1, χ4}. Accordingly, we define the policy tree formally as follows:

Definition 4.1 (Policy Tree). Given a query q and a set of potential chunks {χ1, χ2, ..., χn}, we construct a policy tree T. The root node of T represents the initial state, devoid of any chunks. Each subsequent non-root node embodies a chunk set, obtained by incorporating a newly selected chunk from the remaining potential chunks into the sequence at its parent node. This process sequentially constructs an ordered chunk combination in each non-root node, and our objective is to find the node with the highest utility score.

Within the policy tree, our goal is to select a node that encompasses ordered chunks offering the highest benefit at the lowest cost. To accomplish this, we need to devise a utility function to evaluate the trade-off between benefit and cost. This function is quantified through what we define as the "node utility", described as follows.
Node Utility. The utility metric comprises two components: the benefit derived from selecting the chunk combination and the cost associated with using the chunks as a prompt for LLMs. Specifically, the benefit is quantified with LLMs, which can measure the similarity between the selected chunks and the query; we denote it as the node value V. Next, we use the Upper Confidence Bound (UCB) [3] algorithm to balance exploitation (node value V(v_i)) and exploration (search count N(v_i)) for a given node v_i. Regarding cost, we consider the token cost as defined in Section 2 and measure it by the proportion of the current chunk combination's cost relative to the total allocated budget B. Therefore, the node utility is defined as follows:

Definition 4.2 (Node Utility). Given a policy tree and a cost budget B, the utility of a non-root node v_i is defined as

$\mathcal{U}(v_i) = \frac{V(v_i)}{N(v_i)} + c\sqrt{\frac{\ln N}{N(v_i)}} - \lambda\,\frac{\mathrm{cost}(v_i)}{\mathcal{B}}$    (5)

where V(v_i) is the estimated benefit value of the chunk combination at node v_i, determined by a trained model; N(v_i) is the count of visits to node v_i, promoting exploration of less frequented nodes; and N is the total number of visits across all nodes in the policy tree, ensuring a balance between exploration and exploitation. In addition, cost(v_i) denotes the token cost of node v_i, B is the total token budget, c moderates the exploration-exploitation trade-off, and λ serves as a penalty factor on the cost to enhance cost-efficiency.

Optimal Node Selection Modeling. Building on the defined node utility, the task of selecting an optimal chunk combination order, as outlined in Section 2, is reformulated as optimal node selection within the policy tree T. Given a budget constraint B, the objective is to identify the node v_i ∈ T that maximizes the utility U(v_i), while ensuring that the total cost associated with v_i does not exceed B. Formally, this is represented as

$\hat{v}_i = \arg\max_{v_i \in T} \left( \frac{V(v_i)}{N(v_i)} + c\sqrt{\frac{\ln N}{N(v_i)}} - \lambda\,\frac{\mathrm{cost}(v_i)}{\mathcal{B}} \right)$    (6)

where V(v_i) is the estimated benefit of the chunk combination at node v_i, and cost(v_i) represents its associated cost. This formulation enables selecting chunks that maximize utility within the given budget.
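The node utility of Definition 4.2 and the selection rule of Eq. (6) translate directly into a scoring function; the sketch below is our illustration, reusing the hypothetical PolicyNode from the earlier sketch and approximating the token cost with a whitespace token count. The defaults c = 2.4 and λ = 0.1 follow the hyper-parameter settings reported in Section 6.1.

```python
import math

def node_utility(node, total_visits: int, budget: int,
                 c: float = 2.4, lam: float = 0.1) -> float:
    """Eq. (5): exploitation + UCB exploration - cost penalty; unvisited nodes get +inf to force exploration."""
    if node.visits == 0:
        return float("inf")
    exploitation = node.value / node.visits
    exploration = c * math.sqrt(math.log(total_visits) / node.visits)
    cost_penalty = lam * (sum(len(ch.split()) for ch in node.chunks) / budget)
    return exploitation + exploration - cost_penalty

def select_best_node(nodes, total_visits: int, budget: int):
    """Eq. (6): pick the node with maximal utility among those within the token budget."""
    feasible = [n for n in nodes
                if sum(len(ch.split()) for ch in n.chunks) <= budget]
    return max(feasible, key=lambda n: node_utility(n, total_visits, budget))
```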
[Figure 3: Policy tree search. Starting from the potential chunks, the policy tree is built from an empty root, and each iteration performs selection of the highest-utility node, parallel expansion with utility computation for all child combinations, and a utility update over the tree, yielding the optimal chunk order.]
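Putting the earlier sketches together, the following code (ours, not the authors' released implementation) shows one plausible way to run the budget-constrained MCTS loop of Figure 3: in each iteration the highest-utility expandable node is selected, all of its children are expanded and scored in one batch, and the scores are propagated back to the root; `reranker_score` is a hypothetical stand-in for the trained benefit model V.

```python
def mcts_search(root, candidate_chunks, reranker_score, budget: int,
                iterations: int = 10):
    """Budget-constrained MCTS over the policy tree: selection, parallel expansion, utility update."""
    total_visits = 1
    frontier = [root]
    for _ in range(iterations):
        # Selection: expandable node with the highest utility (Eq. 5).
        node = max(frontier, key=lambda n: node_utility(n, total_visits, budget))
        frontier.remove(node)
        # Parallel expansion: score all budget-feasible children of the selected node in one batch.
        children = [c for c in node.expand(candidate_chunks)
                    if sum(len(ch.split()) for ch in c.chunks) <= budget]
        scores = [reranker_score(c.chunks) for c in children]  # could be batched or run in parallel
        # Update: back-propagate each child's value along the path to the root.
        for child, score in zip(children, scores):
            child.value, child.visits = score, 1
            total_visits += 1
            ancestor = node
            while ancestor is not None:
                ancestor.value += score
                ancestor.visits += 1
                ancestor = ancestor.parent
        frontier.extend(children)
        if not frontier:
            break
    # Answer: best non-root node found anywhere in the tree, within budget (Eq. 6).
    all_nodes, stack = [], [root]
    while stack:
        n = stack.pop()
        all_nodes.append(n)
        stack.extend(n.children)
    return select_best_node([n for n in all_nodes if n.chunks], total_visits, budget)
```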
the precise prediction of MCTS configurations, including the iteration count and λ.

5.2.2 Contrastive Learning. To efficiently distinguish between different query domains and recommend the optimal configuration for each query, we utilize contrastive learning to bring queries of the same domain closer together while pushing apart embeddings from different reranker classes.
Contrastive pairs preparation. To prepare the training dataset, we must identify the optimal reranker and configuration for each query. In this study, the most suitable reranker and corresponding configurations for each query are determined through extensive experimentation with various setups. Subsequently, query pairs are generated based on these optimal reranker annotations. Positive pairs are formed from queries that share the same optimal reranker, promoting minimal distance between their embeddings in the feature space. Conversely, negative pairs are composed of queries with different optimal rerankers, where the goal is to maximize the distance between their embeddings. Since some rerankers perform similarly on certain queries, we select only cases with a ROUGE-L difference exceeding 10% to form our training dataset.
Contrastive loss. As illustrated in Figure 4, for a given positive pair (x_i, x_i^+) and a negative pair (x_j, x_j^-), we first generate their corresponding feature maps with the encoding model. These feature maps are then utilized to compute the contrastive loss L_con. In particular, this process can be formulated as follows:

$L_{\mathrm{con}}(\theta) = F_{\mathrm{con}}\big(f_\theta(x_i), f_\theta(x_i^{+})\big) + F_{\mathrm{con}}\big(f_\theta(x_i), f_\theta(x_j^{-})\big)$    (10)

where f_θ(x) represents the embedding function, and F_con is the similarity function applied to both types of pairs: positive pairs (with the same optimal reranker) and negative pairs (with different rerankers). This loss function is designed to ensure that queries with the same reranker are brought closer together in the embedding space, while those with different rerankers are pushed apart.

5.2.3 Whole training process. Finally, the total loss function L_total is the combination of the contrastive, classification, and regression losses. In particular, the contrastive loss L_con(θ) encourages the embeddings of queries with the same optimal reranker to be close together, while pushing apart the embeddings of queries with different rerankers. The classification loss L_cla(θ) aids the model in correctly identifying the reranker using cross-entropy, and the regression loss L_reg(θ) minimizes the error in predicting the optimal MCTS configuration.
Remark. Once the total loss L_total is calculated, the network parameters θ are updated using gradient descent with a learning rate η. This optimization process is repeated across multiple epochs E and batches, ensuring that both reranker selection and parameter prediction are improved over time.
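As a concrete illustration of the joint objective in Sections 5.2.2 and 5.2.3, the sketch below shows one plausible instantiation (our assumption, not the authors' exact code): a margin-based contrastive term, a cross-entropy reranker-classification term, and an MSE regression term over the predicted MCTS configuration. The margin of 1.0 follows the learning-parameter settings of Section 6.1; the equal loss weights are an assumption.

```python
import torch.nn.functional as F

def joint_loss(anchor_emb, positive_emb, negative_emb,
               reranker_logits, reranker_label,
               config_pred, config_target,
               margin: float = 1.0,
               w_con: float = 1.0, w_cla: float = 1.0, w_reg: float = 1.0):
    """L_total = w_con * L_con + w_cla * L_cla + w_reg * L_reg (weights are our assumption)."""
    # Contrastive term: pull same-reranker queries together, push different-reranker queries apart.
    pos_dist = F.pairwise_distance(anchor_emb, positive_emb)
    neg_dist = F.pairwise_distance(anchor_emb, negative_emb)
    l_con = (pos_dist.pow(2) + F.relu(margin - neg_dist).pow(2)).mean()
    # Classification term: identify the optimal reranker with cross-entropy.
    l_cla = F.cross_entropy(reranker_logits, reranker_label)
    # Regression term: predict the MCTS configuration (iterations, c, lambda).
    l_reg = F.mse_loss(config_pred, config_target)
    return w_con * l_con + w_cla * l_cla + w_reg * l_reg
```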
6 EXPERIMENTS
The experimental study intends to answer the following questions:
• RQ1: How effective is CORAG for the cost-constrained RAG pipeline compared to other methods?
• RQ2: How efficient is CORAG with varying chunk sizes?
• RQ3: What are the bottlenecks of current RAG systems?
• RQ4: How scalable is CORAG with varying dataset sizes?
• RQ5: What is the effectiveness of each design in CORAG?

6.1 Experiment Setting
Environment. We integrate our system with the popular RAG framework LlamaIndex. The experiments are run on a Linux server with an Intel Core i7-13700K CPU (12 cores, 24 threads, 5.3 GHz), 64 GB RAM, and a 1 TiB NVMe SSD. The configuration agent module is implemented in PyTorch 2.0 and trained on an NVIDIA RTX 4090 GPU with 24 GB VRAM.

Table 1: Statistics of datasets used in the experiments.
Dataset | #train | #dev | #test | #p
MSMARCO | 502,939 | 6,980 | 6,837 | 8,841,823
Wiki | 3,332 | 417 | 416 | 244,136

[Figure 5: Efficiency Comparison. Latency and ROUGE across chunk sizes 256, 512, and 1024 on WikiPassageQA and MARCO, comparing Raptor, NaiveRAG, CORAG w/o Agent, CORAG (+Agent), and CORAG Upper.]

Datasets. To evaluate the performance of CORAG across diverse scenarios, we conduct experiments on two distinct datasets with differing task focuses: (1) WikiPassageQA [7] is a question-answering benchmark containing 4,165 questions and over 100,000 text chunks, aimed at evaluating passage-level retrieval. (2) MARCO [24] is a comprehensive dataset tailored for natural language processing tasks, primarily emphasizing question answering and information retrieval. As shown in Table 1, both WikiPassageQA and MARCO provide high-quality question and passage annotations, making them suitable benchmarks for assessing retrieval effectiveness. In our experiments, we prompt LLMs to generate ground-truth answers for each dataset. For instance, if we use Llama3 to evaluate CORAG's performance, we also prompt Llama3 to generate the ground truth in the same experimental setting, for fairness and alignment with the characteristics of the LLM.
Baselines. We compare the performance of CORAG with two typical RAG baselines:
• RAPTOR [29]: RAPTOR constructs a hierarchical document summary tree by recursively embedding, clustering, and summarizing text chunks, enabling multi-level abstraction. This approach aligns with the clustering-based methods discussed in Section 1. We finish the tree construction within the budget limit.
• NaiveRAG: This is a basic method for retrieving relevant chunks. First, candidate chunks are retrieved from the vector database based on vector similarity search, and they are then ranked with a reranker model. This approach is the type of AKNN method mentioned in Section 1. To meet the cost constraint, we employ a greedy budget allocation strategy, retrieving chunks until the budget is fully exhausted.
In addition, we remove the configuration agent from our method as a baseline to evaluate its impact on the performance of CORAG, referring to this version as CORAG w/o Agent. Finally, we implement a method called CORAG Upper to establish an upper bound by exploring all possible chunk combinations and selecting the optimal order. Due to the large number of potential combinations, we limit the exploration to combinations with fewer than six chunks in the CORAG Upper case.
Remark. Other methods, such as GraphRAG [22], depend significantly on frequent invocations of LLMs to summarize chunks and construct indexes, incurring substantial costs (e.g., billions of tokens) that exceed our strict cost constraints. Consequently, these methods are not feasible for addressing our problem. For a fair comparison, we exclude these types of RAG methods from the experiments.
Hyper-parameter Settings: The hyper-parameters for CORAG are automatically determined by the configuration agent, while NaiveRAG does not require any hyper-parameters. For other baseline methods, we ensure consistency by using identical hyper-parameters for fair comparisons. Specifically, we set the exploration coefficient to 2.4, the number of iterations to 10, and the cost coefficient λ to 0.1. Preliminary experiments indicate that this configuration optimizes baseline performance. Ablation studies further validate these settings.
Learning Parameter Setting: In our method, the configuration agent is trained using contrastive learning. The hyper-parameters used during this process include a margin for the contrastive loss (margin = 1.0), learning rate (lr = 0.001), batch size (32), number of epochs (num_epochs = 60), and the embedding model (i.e., BAAI/bge-m3 [6]).
Evaluation Metrics. We assess effectiveness by comparing the ROUGE scores between the ground-truth answers and the generated responses, using ROUGE-1, ROUGE-2, and ROUGE-L as evaluation metrics. To evaluate efficiency, we measure the latency required to answer a query with each method.

6.2 Performance Comparison
6.2.1 RQ1: ROUGE Comparison. As shown in Table 2, we compare CORAG with several baselines across different datasets, primarily using WikiPassageQA and MARCO. The evaluations are conducted on three different chunk sizes, utilizing ROUGE-1, ROUGE-2, and ROUGE-L metrics to assess the improvements in responses generated by the LLM due to our retrieval method. CORAG demonstrates a substantial improvement of approximately 25% compared to mainstream RAG approaches such as NaiveRAG and RAPTOR. As expected, CORAG does not exceed the upper bound, which represents an extreme scenario where all possible combination orders are exhaustively enumerated, an approach that is clearly inefficient and impractical. In summary, CORAG outperforms the baselines, enhancing retrieval relevancy while pruning the search space effectively.
6.2.2 RQ2: Efficiency Evaluation. Since CORAG is based on a tree search algorithm in which the agent assists in predicting the optimal reranker and parameters for a given query, it is crucial to evaluate the impact of different chunk sizes and datasets on the efficiency of the retrieval optimization task. As shown in Figure 5, we tested efficiency using various datasets and chunk sizes, observing that NaiveRAG, which uses a traditional retrieval approach, achieved shorter retrieval times but lower ROUGE scores. CORAG Upper performs well in terms of ROUGE, but its efficiency is significantly reduced due to exploring the entire search space. Similarly, RAPTOR, which leverages an external LLM for summarization, exhibited poor efficiency. In contrast, our CORAG approach strikes a balance between efficiency and retrieval relevance, achieving an effective trade-off.
6.2.3 RQ3: Performance Breakdown. We present a performance breakdown of our baseline NaiveRAG to highlight the bottlenecks in the current RAG system. To address the challenge of searching for the optimal chunk combination order, implementing it with NaiveRAG requires the following steps: (a) obtaining the query embedding, (b) retrieving potential chunk combinations, and (c) reranking
Table 2: ROUGE Comparison on WikiPassageQA and MARCO Datasets
Method | LLM Type | WikiPassageQA: 256 (R1 R2 RL), 512 (R1 R2 RL), 1024 (R1 R2 RL) | MARCO: 256 (R1 R2 RL), 512 (R1 R2 RL), 1024 (R1 R2 RL)
Raptor Llama3-8B 0.338 0.154 0.316 0.322 0.147 0.301 0.335 0.159 0.305 0.386 0.208 0.356 0.393 0.213 0.366 0.338 0.154 0.316
NaiveRAG Llama3-8B 0.337 0.149 0.312 0.321 0.142 0.297 0.334 0.158 0.309 0.398 0.203 0.369 0.395 0.213 0.368 0.337 0.149 0.312
CORAG upper Llama3-8B 0.447 0.275 0.426 0.426 0.262 0.406 0.444 0.273 0.423 0.435 0.235 0.414 0.425 0.229 0.397 0.447 0.275 0.426
CORAG w/o Agent Llama3-8B 0.390 0.212 0.364 0.372 0.202 0.347 0.388 0.221 0.362 0.401 0.212 0.374 0.393 0.216 0.372 0.390 0.212 0.364
CORAG Llama3-8B 0.423 0.223 0.392 0.403 0.212 0.373 0.409 0.219 0.378 0.413 0.224 0.382 0.405 0.219 0.376 0.411 0.219 0.380
CORAG Mixtral8*7B 0.357 0.158 0.325 0.382 0.167 0.351 0.401 0.198 0.367 0.408 0.199 0.378 0.399 0.194 0.369 0.403 0.193 0.373
CORAG Phi-2 2.7B 0.351 0.137 0.317 0.318 0.117 0.298 0.308 0.108 0.288 0.335 0.109 0.305 0.325 0.103 0.301 0.333 0.108 0.303
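The ROUGE-1/2/L figures reported above (and throughout Section 6) can be reproduced with a standard scorer; the sketch below assumes the open-source rouge-score package rather than the authors' exact evaluation script, and the two strings are only illustrative.

```python
# pip install rouge-score  (assumed evaluation utility, not necessarily the authors' script)
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

ground_truth = "The Eiffel Tower was designed by Gustave Eiffel's company and built between 1887 and 1889."
generated = "Gustave Eiffel's company designed the tower, constructed from 1887 to 1889."

scores = scorer.score(ground_truth, generated)
print({name: round(score.fmeasure, 3) for name, score in scores.items()})
```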
[Figure: Performance breakdown of NaiveRAG, showing per-stage latency in seconds (embedding, retrieve, rerank, prompt) on WikiPassageQA and MARCO.]

ROUGE scores. This efficient balance between performance and computational overhead highlights the system's capacity to prune the search space effectively, ensuring fast retrieval even in expansive datasets. As a result, our approach is well-suited for scenarios where both large-scale data processing and high retrieval accuracy are crucial.

Table 3: Performance comparison with varying budgets
[Figure 7: ROUGE Comparison between different C (exploration coefficient) values {0, 1, 2, 3} on WikiPassageQA and MARCO.]

[Figure 8: ROUGE Comparison between different lambda values {0, 0.1, 0.2, 0.3} on WikiPassageQA and MARCO (R1 and RL).]

performance, striking an optimal balance between exploration and exploitation within the search process. This balance enables the system to effectively uncover relevant information while maintaining a focus on high-potential chunks, ultimately leading to improved RAG responses. In contrast, both lower and higher exploration coefficients led to suboptimal results, either due to insufficient exploration or excessive diffusion of focus. These findings emphasize the critical role of the exploration coefficient in the performance of the CORAG search process and highlight the importance of careful parameter tuning.

testing values of 0, 0.1, 0.2, and 0.3. The results show that introducing the cost coefficient in the utility led to a slight decrease in ROUGE scores. This decrease occurs because, without cost constraints, CORAG tends to produce longer outputs, albeit at the expense of cost efficiency. However, despite the slight reduction in ROUGE scores, the decline remains within 5%, which is acceptable. These results highlight the importance of tuning the cost coefficient effectively to balance output richness and cost constraints, further emphasizing the role of our configuration agent in enabling efficient configuration tuning for optimal CORAG performance.

6.3.4 Ablation Study on Different Rerankers. To evaluate the impact of different rerankers on retrieval performance, we conduct an ablation study using six widely recognized reranker models: jina-reranker-v1-turbo-en, jina-reranker-v2-base-multilingual, bge-reranker-v2-m3, bge-reranker-large, bge-reranker-base, and gte-multilingual-reranker-base. These rerankers are evaluated on the MARCO dataset with the Llama3-8B model, configured with a fixed cost coefficient of 0.1, an exploration coefficient of 2.4, and a budget limit of 1024.

Table 4: Performance comparison with varying rerankers
ChunkSize | Reranker | R1 | R2 | RL
256 | v1-turbo | 0.412 | 0.216 | 0.379
256 | v2-base-multi | 0.413 | 0.221 | 0.380
256 | bge-m3 | 0.425 | 0.230 | 0.395
256 | bge-large | 0.431 | 0.238 | 0.401
256 | bge-base | 0.421 | 0.232 | 0.390
256 | gte-base | 0.424 | 0.232 | 0.395
512 | v1-turbo | 0.366 | 0.173 | 0.333
512 | v2-base-multi | 0.367 | 0.177 | 0.334
512 | bge-m3 | 0.368 | 0.177 | 0.336
512 | bge-large | 0.362 | 0.177 | 0.332
512 | bge-base | 0.375 | 0.185 | 0.344
512 | gte-base | 0.364 | 0.181 | 0.335
1024 | v1-turbo | 0.269 | 0.093 | 0.240
1024 | v2-base-multi | 0.270 | 0.094 | 0.243
1024 | bge-m3 | 0.270 | 0.094 | 0.243
1024 | bge-large | 0.265 | 0.092 | 0.236
1024 | bge-base | 0.265 | 0.092 | 0.237
1024 | gte-base | 0.270 | 0.094 | 0.242

The results in Table 4 reveal variations in performance across different rerankers, highlighting the importance of careful reranker selection to optimize RAG system performance under specific operational constraints. Among the rerankers, gte-multilingual-reranker-base and bge-reranker-large demonstrate consistently strong performance on QA tasks, suggesting that these reranker models are highly effective at capturing relevant information across different QA queries. We observe that, as the chunk size increases in the ablation study, each individual reranker yields lower performance than the agent-recommended reranker across different queries. This indicates that the configuration agent effectively leverages reranker diversity, dynamically adjusting configurations to improve retrieval results. The configuration agent's ability to adaptively recommend a better reranker selection and parameter configuration underscores its importance in maximizing RAG system performance, particularly under constraints such as a limited budget.
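For reference, the rerankers in Table 4 can be applied to query-chunk pairs through a standard cross-encoder interface; the sketch below assumes the sentence-transformers CrossEncoder API and one of the checkpoints named above, as one plausible way such scores could feed the benefit model, though the paper does not prescribe this exact code.

```python
# pip install sentence-transformers  (assumed interface; any reranker from Table 4 can be swapped in)
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base", max_length=512)

query = "Is bougainvillea a shrub?"
chunks = [
    "Bougainvillea is a tropical vining shrub that comes in a wide array of bright colors.",
    "The flowers are actually modified leaves, called bracts, that are long-lasting and bright.",
]
scores = reranker.predict([(query, chunk) for chunk in chunks])
ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
```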
6.4 Case Study
Figure 9 presents three examples to illustrate the retrieval quality comparison between CORAG and the traditional NaiveRAG method, focusing on why our approach outperforms baseline methods. Due to its straightforward top-k retrieval and reranking, NaiveRAG often misses essential information relevant to the query's intent, as it frequently retrieves chunks based on keyword matching rather than relevance to the query. For instance, with the query "Is bougainvillea a shrub?", NaiveRAG retrieves content containing matching keywords but fails to provide the actual classification of bougainvillea. In contrast, CORAG's chunk combination strategy retrieves context that includes bougainvillea's category, enabling the LLM to give a more accurate response. In another case, NaiveRAG retrieves terms and legal clauses containing "oxyfluorfen" but lacks understanding of the query's intent, while CORAG provides context linking oxyfluorfen to its use case in cotton, which requires logical relationships between chunks that NaiveRAG's vector similarity search

[Figure 9: Case study on three queries ("Is bougainvillea a shrub?", "Is oxyfluorfen safe on cotton?", "Where does bacteria come from?"), contrasting the chunks retrieved by CORAG with those retrieved by NaiveRAG.]
cannot capture. Finally, for the query "Where does bacteria come from?", NaiveRAG retrieves chunks with the keyword "bacteria" but does not address its origin, whereas CORAG supplies a more complete response, including sources of bacteria and conditions for their proliferation. These cases illustrate that CORAG excels at retrieving logically connected information, making it more effective than NaiveRAG for queries requiring more than simple keyword matching.

7 INSIGHTS AND FUTURE DESIGN CHOICES ON RAG
7.1 Shortcomings of Current RAG
We provide an analysis of current RAG systems revealing performance challenges across the Retrieve (S1), Augment (S2), and Generation (S3) phases.
S1: Retrieval Overhead. Current RAG systems often utilize LLMs for summarization and indexing structures, overlooking the high computational costs associated with external LLMs, thus escalating compute expenses. Model-based rerankers, while improving relevance during retrieval, introduce notable latency, which can impede efficiency in latency-sensitive contexts. Cost-effective index construction and reranking optimization are essential to balance efficiency and performance.
S2: Augmentation Overhead. Post-retrieval techniques, such as optimized chunk combination ordering, enhance context relevancy but demand additional computation. Pruning strategies that minimize the search space and refine the combination order are critical for balancing computational cost and augmented context relevancy. Efficient chunk combination optimization, emphasizing order and coherence, is vital for reducing costs and enhancing retrieval performance.
S3: Generation Overhead. Effective prompt engineering for optimal chunk combinations requires significant computational resources. Query-specific prompt refinement and compression are crucial to reduce overhead while maintaining input relevance and conciseness. Adaptive strategies that handle diverse query types and domain-specific requirements ensure prompt efficiency without compromising output quality.

7.2 Design Choices
To address the challenges identified, the following design choices aim to optimize the performance of RAG systems.
P1: Co-Design of Retrieval and Reranking Processes. In CORAG, parallel expansion in the tree search accelerates query processing by enabling concurrent retrieval and reranking, significantly reducing latency. Future optimizations could address bottlenecks by eliminating stage-specific delays, further enhancing ranking efficiency. This co-design approach efficiently manages chunk combination order, improving the ranking process and relevance scoring.
P2: Optimization of Tree Structure and Search Iterations. Results indicate that a shorter policy tree height enhances search efficiency by reducing computational overhead, which is especially advantageous for large datasets. Minimizing tree height in tree-based searches improves the search speed for contextually relevant chunks, significantly lowering latency and computational costs. This optimization approach enhances RAG system performance across extensive datasets.
P3: Dynamic Prompt Engineering. Selecting rerankers based on query type and using adaptable prompt templates improves retrieval relevance for LLMs. Dynamic prompt structures that align with query intent and domain-specific contexts maintain output quality within resource constraints. This adaptive approach to prompt engineering achieves an effective balance between efficiency and retrieval quality, addressing the dynamic nature of RAG system queries.

8 CONCLUSION
Considering the non-monotonicity of chunk utility, the correlations between chunks, and the diversity of query domains, we propose CORAG, a learning-based retrieval optimization system. We model chunk combination orders as a policy tree and employ MCTS to explore this policy tree, aiming to identify the optimal chunk combination order. We introduce a configuration agent that accurately predicts the optimal configuration and reranker for a given query. Additionally, we design a parallel expansion strategy to expand multiple nodes in each iteration. Experimental results demonstrate that our method significantly outperforms state-of-the-art methods within constrained cost limits while also showing notable efficiency.
REFERENCES
[1] Mohammad Alkhalaf, Ping Yu, Mengyang Yin, and Chao Deng. 2024. Applying generative AI with retrieval augmented generation to summarize and extract key clinical information from electronic health records. (2024), 104662.
[2] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. (2023).
[3] Peter Auer. 2002. Using confidence bounds for exploitation-exploration trade-offs. 3, Nov (2002), 397–422.
[4] Tom B Brown. 2020. Language models are few-shot learners. (2020).
[5] Chi-Min Chan, Chunpu Xu, Ruibin Yuan, Hongyin Luo, Wei Xue, Yike Guo, and Jie Fu. 2024. Rq-rag: Learning to refine queries for retrieval augmented generation. (2024).
[6] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2402.03216 [cs.CL]
[7] Daniel Cohen, Liu Yang, and W. Bruce Croft. 2018. WikiPassageQA: A Benchmark Collection for Research on Non-factoid Answer Passage Retrieval. abs/1805.03797 (2018). arXiv:1805.03797
[8] Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri. 2024. The Power of Noise: Redefining Retrieval for RAG Systems. 719–729.
[9] Goetz Graefe and William J McKenna. 1993. The volcano optimizer generator: Extensibility and efficient search. 209–218.
[10] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2023. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv:2311.05232 [cs.CL]
[11] Ziyan Jiang, Xueguang Ma, and Wenhu Chen. 2024. LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs. arXiv:2406.15319 [cs.CL]
[12] Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. 2023. Time-llm: Time series forecasting by reprogramming large language models. (2023).
[13] Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. 2022. Internet-augmented language models through few-shot prompting for open-domain question answering. (2022).
[14] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. 33 (2020), 9459–9474.
[15] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. (2015).
[16] Xinze Li, Zhenghao Liu, Chenyan Xiong, Shi Yu, Yu Gu, Zhiyuan Liu, and Ge Yu. 2023. Structure-Aware Language Model Pretraining Improves Dense Retrieval on Structured Data. (2023).
[17] Shu Liu, Asim Biswal, Audrey Cheng, Xiangxi Mo, Shiyi Cao, Joseph E Gonzalez, Ion Stoica, and Matei Zaharia. 2024. Optimizing llm queries in relational workloads. (2024).
[18] Yanming Liu, Xinyue Peng, Xuhong Zhang, Weihao Liu, Jianwei Yin, Jiannan Cao, and Tianyu Du. 2024. RA-ISF: Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback. (2024).
[19] Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2021. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. (2021).
[20] Yuanjie Lyu, Zhiyu Li, Simin Niu, Feiyu Xiong, Bo Tang, Wenjin Wang, Hao Wu, Huanyong Liu, Tong Xu, and Enhong Chen. 2024. Crud-rag: A comprehensive chinese benchmark for retrieval-augmented generation of large language models. (2024).
[21] Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. 2023. Query rewriting for retrieval-augmented large language models. (2023).
[22] Microsoft. 2024. GraphRAG. https://microsoft.github.io/graphrag/
[23] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. Webgpt: Browser-assisted question-answering with human feedback. (2021).
[24] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. abs/1611.09268 (2016). arXiv:1611.09268
[25] Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Le Yan, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, et al. 2023. Large language models are effective text rankers with pairwise ranking prompting. (2023).
[26] Chidaksh Ravuru, Sagar Srinivas Sakhinana, and Venkataramana Runkana. 2024. Agentic Retrieval-Augmented Generation for Time Series Analysis. arXiv:2408.14484 [cs.AI]
[27] Stephen Robertson, Hugo Zaragoza, and Michael Taylor. 2004. Simple BM25 extension to multiple weighted fields. 42–49.
[28] Devendra Singh Sachan, Mike Lewis, Mandar Joshi, Armen Aghajanyan, Wen-tau Yih, Joelle Pineau, and Luke Zettlemoyer. 2023. Improving Passage Retrieval with Zero-Shot Question Generation. arXiv:2204.07496 [cs.CL]
[29] Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. 2024. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval.
[30] Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. Replug: Retrieval-augmented black-box language models. (2023).
[31] Devendra Singh, Siva Reddy, Will Hamilton, Chris Dyer, and Dani Yogatama. 2021. End-to-end training of multi-document reader and retriever for open-domain question answering. 34 (2021), 25968–25981.
[32] Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Nathaneal Scharli, Aakanksha Chowdhery, Philip Mansfield, Blaise Aguera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam, and Vivek Natarajan. 2022. Large Language Models Encode Clinical Knowledge. arXiv:2212.13138 [cs.CL]
[33] Manveer Singh Tamber, Ronak Pradeep, and Jimmy Lin. 2023. Scaling Down, LiTting Up: Efficient Zero-Shot Listwise Reranking with Seq2seq Encoder-Decoder Models. (2023).
[34] Andrew Trotman, Antti Puurula, and Blake Burgess. 2014. Improvements to BM25 and language models examined. 58–65.
[35] Shuting Wang, Xin Xu, Mang Wang, Weipeng Chen, Yutao Zhu, and Zhicheng Dou. 2024. RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation. (2024).
[36] Zheng Wang, Bingzheng Gan, and Wei Shi. 2024. Multimodal Query Suggestion with Multi-Agent Reinforcement Learning from Human Feedback. arXiv:2402.04867 [cs.IR]
[37] Jeff Wu, Long Ouyang, Daniel M Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. 2021. Recursively summarizing books with human feedback. (2021).
[38] Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. 2024. Hallucination is Inevitable: An Innate Limitation of Large Language Models. arXiv:2401.11817 [cs.CL]
[39] Siqiao Xue, Danrui Qi, Caigao Jiang, Wenhui Shi, Fangyin Cheng, Keting Chen, Hongjun Yang, Zhiping Zhang, Jianshan He, Hongyang Zhang, Ganglin Wei, Wang Zhao, Fan Zhou, Hong Yi, Shaodong Liu, Hongjun Yang, and Faqiang Chen. 2024. Demonstration of DB-GPT: Next Generation Data Interaction System Empowered by Large Language Models. arXiv:2404.10209 [cs.AI]
[40] Jiexia Ye, Weiqi Zhang, Ke Yi, Yongzi Yu, Ziyue Li, Jia Li, and Fugee Tsung. 2024. A Survey of Time Series Foundation Models: Generalizing Time Series Representation with Large Language Model. arXiv:2405.02358 [cs.LG]
[41] Ziqi Yin, Shanshan Feng, Shang Liu, Gao Cong, Yew Soon Ong, and Bin Cui. 2024. LIST: Learning to Index Spatio-Textual Data for Embedding based Spatial Keyword Queries. (2024).
[42] Tan Yu, Anbang Xu, and Rama Akkiraju. 2024. In Defense of RAG in the Era of Long-Context Language Models. arXiv:2409.01666 [cs.CL]
[43] Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, and Bryan Catanzaro. 2024. RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs. (2024).
[44] Hamed Zamani and Michael Bendersky. 2024. Stochastic RAG: End-to-End Retrieval-Augmented Generation through Expected Utility Maximization. arXiv:2405.02816 [cs.CL]
[45] Hailin Zhang, Yujing Wang, Qi Chen, Ruiheng Chang, Ting Zhang, Ziming Miao, Yingyan Hou, Yang Ding, Xupeng Miao, Haonan Wang, et al. 2024. Model-enhanced vector index. 36 (2024).
[46] Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. 2024. Retrieval-augmented generation for ai-generated content: A survey. (2024).
[47] Siyun Zhao, Yuqing Yang, Zilong Wang, Zhiyuan He, Luna K. Qiu, and Lili Qiu. 2024. Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely. arXiv:2409.14924 [cs.CL]
[48] Xinyang Zhao, Xuanhe Zhou, and Guoliang Li. 2024. Chat2Data: An Interactive Data Analysis System with RAG, Vector Databases and LLMs. (2024).
[49] Xuanhe Zhou, Guoliang Li, Chengliang Chai, and Jianhua Feng. 2021. A learned query rewrite system using monte carlo tree search. 15, 1 (2021), 46–58.
[50] Honglei Zhuang, Zhen Qin, Rolf Jagerman, Kai Hui, Ji Ma, Jing Lu, Jianmo Ni, Xuanhui Wang, and Michael Bendersky. 2022. RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses. arXiv:2210.10634 [cs.IR]
[51] Shengyao Zhuang, Bing Liu, Bevan Koopman, and Guido Zuccon. 2023. Open-source large language models are strong zero-shot query likelihood models for document ranking. (2023).