CORAG: A Cost-Constrained Retrieval Optimization System for Retrieval-Augmented Generation

Ziting Wang, Haitao Yuan, Wei Dong
Nanyang Technological University, Singapore
[email protected], [email protected], [email protected]

Gao Cong (Singapore)    Feifei Li (China)
[email protected]    [email protected]
ABSTRACT
Large Language Models (LLMs) have demonstrated remarkable generation capabilities but often struggle to access up-to-date information, which can lead to hallucinations. Retrieval-Augmented Generation (RAG) addresses this issue by incorporating knowledge from external databases, enabling more accurate and relevant responses. Due to the context window constraints of LLMs, it is impractical to input the entire external database context directly into the model. Instead, only the most relevant information, referred to as "chunks", is selectively retrieved. However, current RAG research faces three key challenges. First, existing solutions often select each chunk independently, overlooking potential correlations among them. Second, in practice, the utility of chunks is non-monotonic, meaning that adding more chunks can decrease overall utility. Traditional methods emphasize maximizing the number of included chunks, which can inadvertently compromise performance. Third, each type of user query possesses unique characteristics that require tailored handling, an aspect that current approaches do not fully consider.
To overcome these challenges, we propose CORAG, a cost-constrained retrieval optimization system for retrieval-augmented generation. We employ a Monte Carlo Tree Search (MCTS)-based policy framework to find optimal chunk combinations sequentially, allowing for a comprehensive consideration of correlations among chunks. Additionally, rather than viewing budget exhaustion as a termination condition, we integrate budget constraints into the optimization of chunk combinations, effectively addressing the non-monotonicity of chunk utility. Furthermore, by designing a configuration agent, our system predicts optimal configurations for each query type, enhancing adaptability and efficiency. Experimental results indicate an improvement of up to 30% over baseline models, underscoring the framework's effectiveness, scalability, and suitability for long-context applications.

PVLDB Reference Format:
Ziting Wang, Haitao Yuan, Wei Dong, Gao Cong, and Feifei Li. CORAG: A Cost-Constrained Retrieval Optimization System for Retrieval-Augmented Generation. PVLDB, 14(1): XXX-XXX, 2020. doi:XX.XX/XXX.XX

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 14, No. 1, ISSN 2150-8097. doi:XX.XX/XXX.XX

[Figure 1: Example of chunk combination order. The query "Who designed the Eiffel Tower and when was it constructed? Provide information on its height as well." is posed over four potential chunks: χ1 (world's tallest structure for over 40 years until 1930), χ2 (tallest man-made structure until the Chrysler Building in 1930), χ3 (constructed between 1887 and 1889, 324 meters tall), and χ4 (named after the engineer Gustave Eiffel, whose company designed and built the tower). Individual chunk scores: χ3 = 0.8, χ1 = 0.6, χ4 = 0.4, χ2 = 0.3. Combination orders score differently, e.g., χ1 = 0.3 (256 tokens), χ1 + χ3 = 0.4 (512), χ3 + χ4 = 0.8 (512), χ4 + χ3 = 0.9 (512), χ1 + χ2 + χ4 = 0.6 (768), χ1 + χ4 + χ3 = 0.8 (768), χ1 + χ2 + χ3 + χ4 = 0.9 (913). The LLM's response combines the designer, construction period, and height.]

1 INTRODUCTION
Although LLMs have demonstrated exceptional capabilities in generation tasks, they often struggle with accessing up-to-date information, which can lead to hallucinations [10, 38]. To address these challenges, RAG has emerged as a crucial solution. By integrating external data sources into the LLM, RAG can provide more accurate, relevant, and up-to-date information. Nowadays, RAG has been widely studied in the context of LLMs, especially for tasks requiring up-to-date external knowledge such as question answering [2, 22, 29], medical information retrieval [1, 32], and time series analysis [12, 26, 40]. External data sources are often extremely large, making it impractical to input them directly into the LLM. To address this issue, the data is typically split into disjoint chunks and stored in a vector database, and users then query the most useful chunks to construct prompts for LLMs. Therefore, designing efficient and accurate structures and algorithms to search for the most relevant chunks has become a prominent research topic and has been widely studied in both the database [39, 48] and machine learning communities [2, 35, 43].
However, there are three key challenges in the existing approaches.
Challenge 1: Correlations between chunks. Currently, two primary methods are used to identify the most relevant chunks. The first approach formulates the problem as an approximate k-nearest neighbor (AKNN) task [41, 45], where each chunk is assigned a score, and the approximate top-k chunks ranked by score are selected. The second approach clusters the chunks, returning all chunks within the most relevant clusters in response to a query [22, 29]. However, both methods overlook potential correlations between chunks: the first approach disregards correlations entirely, while the second approach accounts for them only superficially by treating all chunks within each cluster as equally relevant. As a result, when multiple chunks convey similar or overlapping information, these methods introduce substantial redundancy in the selected chunks.
For example, as illustrated in Figure 1, when querying the height and history of the Eiffel Tower, if each chunk is treated independently, a greedy method would select chunks χ3 and χ1 since they have the top two scores. However, both chunks only provide historical information, which is insufficient to fully address the query. To better address the query, it is necessary to include a chunk with the constructor's name, such as χ4. On the other hand, the clustering approach would return all of χ1, χ2, χ3, and χ4, resulting in redundancy. An optimal solution would instead select χ3 and χ4, as they provide the required information without redundancy. Additionally, research [11, 19, 42] has shown that the order of chunks influences LLM performance, a factor that existing methods also overlook. Following the example of the Eiffel Tower, when chunks χ3 and χ4 are selected, placing χ4 first yields a higher score than the reverse order. However, determining the optimal chunk combination and its order is a challenging task, since both require a search space growing exponentially with the number of available chunks. In this paper, we further demonstrate that this problem is NP-hard (see Section 2.2).
Challenge 2: Non-monotonicity of utility. Current solutions operate on the assumption that including more chunks will always yield better final results. Specifically, in the AKNN-based approach, exactly k chunks are selected deterministically each time. In the clustering-based approach, a distance threshold between clusters and the query is set, and all clusters within this threshold are returned. Both of them return as many chunks as possible. However, in practice, the utility of chunks is not monotonic. More specifically, excessive chunks can dilute key information by adding marginally relevant content, creating noise that reduces clarity. Additionally, conflicting or nuanced differences across chunks may confuse the model, lowering response quality. For example, as illustrated in Figure 1, when χ3 and χ4 are selected, adding the chunk χ1 decreases utility, highlighting that utility scores are often non-monotonic in practice.
Challenge 3: Diversity of queries. User queries come in different types, each requiring its own ranking strategy due to their unique characteristics [47]. In current RAG systems, the utility scores of chunks are often determined by the assigned reranker model. Various reranker models exist, but we observe that their performance varies significantly across different query types, and no single fixed reranker model consistently outperforms the others across all query variations (see our experiments in Section 6.3.4 for more details). Current methods [20, 46] typically rely on static reranker models for ranking chunks, lacking the flexibility to adapt to varying query contexts.
Problem Statement: Is there a RAG system that fully considers correlations between chunks and the non-monotonicity of utility while being adaptable to all types of queries?

1.1 Our Contributions
In this paper, we answer this question in the affirmative by proposing a novel MCTS-based policy tree framework to optimize chunk retrieval in RAG systems. Our contributions can be summarized as follows:
• We propose the first RAG framework that considers the chunk combination order for the RAG task. Instead of considering each chunk independently or at the cluster level, we use MCTS to search for the optimal chunk combination order sequentially. The high-level idea is as follows: First, we initialize the root node. Then, in an iterative process, we expand the tree by selecting the highest-utility node and computing its expanded nodes' utilities. After each expansion, we update the utilities throughout the entire policy tree. During this process, the decision at each iteration depends on the chunks already selected, allowing us to fully consider the correlations between chunks. Moreover, MCTS reduces the exponential search space to a linear one, and we apply parallel expansion techniques to further enhance computational efficiency. With these designs, we address Challenge 1.
• In contrast to prior RAG frameworks that treat exhaustion of the budget as one of the termination conditions, we propose a novel formulation wherein budget constraints are integrated into the process of optimizing chunk combinations, fully considering the non-monotonicity of chunk utility and thereby addressing Challenge 2. Moreover, by prioritizing high-relevance, low-cost chunks and factoring in token length, we further reduce computational costs.
• We propose a contrastive learning-based agent that dynamically adjusts MCTS configurations per query, adapting reranker models and configurations to the specific query domain. This approach tailors retrieval for dynamic, domain-specific queries with flexibility and robustness, addressing Challenge 3.
• Additionally, we conducted comprehensive experiments comparing our framework with several state-of-the-art methods. The results validate the effectiveness, efficiency, and scalability of our approach, showing a performance improvement of up to 30% over the baselines.
2 PRELIMINARIES
In this section, we first introduce the definitions of some key concepts, such as chunks and the chunk combination order, in Section 2.1. Next, we give the NP-hardness proof for the chunk combination order optimization problem in Section 2.2. Finally, we discuss related work in Section 2.3.

2.1 Key Concepts
RAG & Chunks. RAG is an effective method for improving the performance of generation models by retrieving relevant context from an external corpus. In this approach, the corpus is first divided into smaller, manageable units called chunks, which are stored in a vector database. Therefore, we can give a formal definition of a chunk as follows:

Definition 2.1 (Chunk). Let C represent a corpus of documents. A chunk χ is defined as a contiguous block of text extracted from C. Formally, a chunk χ consists of a sequence of tokens (t_1, t_2, ..., t_n), where each t_i is a token from C and the size n is set by users.

In the RAG system, each chunk is embedded into a vector representation using an embedding model, which captures the chunk's semantic meaning and enables the retrieval of contextually similar content. When a new query is received, the vector database performs a similarity search to identify the chunks that are most semantically relevant to the query. These retrieved chunks are then passed to a generator (e.g., a large language model) to produce a final response based on the retrieved content. The more tokens a chunk contains, the higher the cost incurred by the generator. Thus, we define the cost of a chunk as cost(χ) = |χ|, which equals the number of tokens in the chunk.

Chunk Combination Order. In the RAG system, the retrieval result from the vector database may include multiple chunks. However, due to input limitations of the generation model, using all of these chunks is impractical. Therefore, it is necessary to select an optimal subset of chunks, known as a chunk combination, that fits within a given cost budget. Additionally, the order of the chunks within the combination significantly impacts the performance of the generation model. The goal is to identify the chunk combination with the optimal order, formally defined as follows:

Definition 2.2 (Optimal Chunk Combination Order Selection). Let {χ1, χ2, ..., χk} be a set of potential chunks, B be the cost budget, and Φ = ⟨χ_{φ1}, ..., χ_{φm}⟩ represent a potential chunk combination order, where each χ_{φi} is a chunk and the index φi indicates its position in Φ. Let U(Φ) be the utility score assigned by the reranker model, which may be arbitrary or composite. Our objective is to find the chunk combination order that maximizes the utility score while adhering to the cost constraint of feeding the chunks into the LLM to generate the final response, i.e.,

$\hat{\Phi} = \arg\max_{\Phi} U(\Phi) \quad \text{s.t.} \quad \sum_{\chi_i \in \Phi} \mathrm{cost}(\chi_i) \leq \mathcal{B}$    (1)

2.2 Proof of NP-Hardness
To demonstrate that chunk combination order selection is NP-hard, we reduce the Maximum Weighted Hyperclique Problem (MWHP) to it. Since MWHP is NP-hard, we show that any MWHP instance can be transformed into a Chunk Combination Optimization instance in polynomial time.

2.2.1 Problem definition of MWHP. Given a hypergraph H = (V, E, w_1, w_2), where V is the set of vertices and E is the set of hyperedges, each hyperedge containing a subset of V, w_1 : V → R and w_2 : E → R are weight functions assigning a weight to each vertex and hyperedge, respectively. Given a subset of vertices V′ ⊆ V, we say a hyperedge e belongs to V′, i.e., e ∈ V′, if V′ covers all vertices of e. The objective is to find k vertices maximizing the sum of the weights of these vertices and their covered hyperedges:

$\arg\max_{\mathcal{V}' \subseteq \mathcal{V},\, |\mathcal{V}'| = k} \; \sum_{v \in \mathcal{V}'} w_1(v) + \sum_{e \in \mathcal{V}'} w_2(e)$    (2)

2.2.2 Reduction process. We now construct a corresponding Chunk Combination Optimization Problem instance from the given MWHP instance. For each node v ∈ V, we create a corresponding chunk χ_v and define its cost as cost(χ_v) ≡ 1. Then, a chunk combination order Φ corresponds to a subset of vertices of V, which is denoted as V(Φ) ⊆ V. We define its utility as

$U(\Phi) = \sum_{v \in \mathcal{V}(\Phi)} w_1(v) + \sum_{e \in \mathcal{V}(\Phi)} w_2(e)$    (3)

Finally, we set B = k, and our objective is

$\arg\max_{\Phi} U(\Phi) \quad \text{s.t.} \quad \sum_{\chi_i \in \Phi} \mathrm{cost}(\chi_i) = |\Phi| \leq k$    (4)

Denote Φ* as the solution of (4); then V(Φ*) is the solution of (2), and the reduction can be done in O(|V| · |E|) time. Please note that a precondition of this reduction is that, in our Chunk Combination Optimization Problem, we allow the reranker model to be arbitrary, meaning the utility scores can also be assigned arbitrarily. The complexity of finding the optimal chunk combination order can be significantly reduced if certain assumptions are made about the reranker. For instance, if the reranker does not consider correlations and simply sums the utility scores of individual chunks linearly, each chunk could then be evaluated independently. However, in this paper, we address the most general case, making no assumptions about the reranker model.
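To make the search space of Definition 2.2 concrete, the following minimal sketch (our illustration, not the paper's implementation) enumerates every ordered chunk combination that fits the budget and scores each one with a black-box utility; the `utility` callable and the whitespace-based token count are stand-ins for the reranker model and cost(χ) = |χ| defined above.

```python
from itertools import permutations
from typing import Callable, Sequence

def brute_force_best_order(
    chunks: Sequence[str],
    budget: int,
    utility: Callable[[tuple[str, ...]], float],      # black-box reranker score U(Phi)
    cost: Callable[[str], int] = lambda c: len(c.split()),  # stand-in for token count |chi|
) -> tuple[tuple[str, ...], float]:
    """Exhaustively solve Eq. (1): maximize U(Phi) subject to the total chunk cost <= budget."""
    best_order, best_score = (), float("-inf")
    # Enumerate every subset size and every ordering of that subset.
    for size in range(1, len(chunks) + 1):
        for order in permutations(chunks, size):
            if sum(cost(c) for c in order) > budget:
                continue  # violates the cost constraint B
            score = utility(order)
            if score > best_score:
                best_order, best_score = order, score
    return best_order, best_score
```

With n chunks this explores on the order of n! orderings, so it is only feasible for tiny instances such as the four-chunk example of Figure 1; CORAG replaces this enumeration with the MCTS-based search of Section 4.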
[Figure 2: Overview of the CORAG pipeline. Step 1: Potential Chunks Retrieval (the corpus is chunked, indexed in a vector database, and searched by query-embedding similarity). Step 2: Online Configuration Inference (a configuration agent, trained offline with contrastive, classification, and regression losses over an encoding network, predicts the reranker and MCTS configuration). Step 3: Optimal Chunk Search (an MCTS-based policy tree search returns the optimal chunk order, which is fed to the LLM).]
2.3 Related Work
2.3.1 Retrieval-Augmented Generation. RAG [14, 20] is widely used for handling knowledge-intensive NLP tasks. In a typical RAG pipeline, a dense-embedding-based retriever searches for relevant information from an external database, which is then used by the LLM during the generation process. To improve this pipeline, some studies [5, 18, 22, 35] have focused on adjusting retrievers to better suit the generation needs of LLMs, developing multi-step retrieval methods, and filtering out irrelevant information. Although there are many advanced retrievers [8, 9, 15, 16, 27, 34], it is more promising to optimize the retriever and the LLM together in an end-to-end process [25, 31]. For example, research [30] has focused on training retrievers and LLMs together, either simultaneously or in stages. However, this requires a surrogate loss for optimization and complicates the training pipeline, especially when the embedding database needs to be re-indexed frequently, which incurs high compute costs. Therefore, methods such as [5] decompose complex, multi-step queries into smaller sub-intents to improve response comprehensiveness without frequent re-indexing. However, these approaches often overlook the critical role of chunk combination order, which can significantly impact the overall response quality of LLMs. To the best of our knowledge, this paper is the first to consider chunk combination order within the RAG task.
2.3.2 Reranking for RAG. Reranking methods are crucial for enhancing retrieval performance within the RAG pipeline [43, 44, 51]. Traditional reranking approaches [33, 50] typically rely on mid-sized language models, such as BERT or T5, to rank retrieved contexts. However, these models often struggle to capture semantic relationships between queries and contexts, especially in zero-shot or few-shot settings. Therefore, recent research [43] highlights the potential of instruction-tuned LLMs to improve context reranking by more accurately identifying relevant contexts, even in the presence of noise or irrelevant information. Despite these advancements, the full capacity of LLMs for reranking in RAG systems remains underutilized. In particular, studies have shown that chunk arrangement can impact LLM performance [19], emphasizing the need to consider chunk combination order in RAG tasks. However, existing models are not well suited for cases where optimal retrieval requires specific sequences or combinations of chunks rather than isolated chunks. Hence, future research is needed to better leverage LLMs for arranging chunks more effectively in response to queries within the RAG framework.
2.3.3 Reinforcement Learning for Large Language Models. Recently, reinforcement learning (RL) has been increasingly utilized in various data management and RAG tasks. RL techniques enable large language models to improve their generation ability by leveraging external knowledge sources, such as search engines [13, 23]. In particular, human feedback [4, 36, 37] can be integrated to help models produce more accurate and contextually relevant responses through the RL framework. In addition, some query optimization approaches [17, 21, 49] further refine retrieval processes, allowing model performance to inform query adjustments and ultimately enhance downstream task outcomes. In this work, we apply a lightweight RL technique, MCTS, to optimize the chunk combination order search process in the RAG system. We also introduce a configuration agent to guide the MCTS search process. To the best of our knowledge, this is the first approach to address this specific problem.

3 SYSTEM OVERVIEW
As previously mentioned, existing RAG frameworks face three key challenges: how to fully consider correlations between chunks and the non-monotonicity of the utility of chunk combination orders, and how to adapt to diverse query domains. These challenges result in reduced relevancy of the outputs. To address these issues, we introduce CORAG, a system designed to retrieve the optimal chunk combination while taking into account the query domain and user budget. As the most important component of our system, we introduce the Optimal Chunk Combination Search module. This module employs an MCTS-based policy tree to perform sequential searches of chunk combination orders under a cost constraint, allowing us to fully consider the correlations between chunks (Challenge 1) as well as the non-monotonic nature of the utility of chunk combination orders (Challenge 2). Additionally, we propose a Configuration Inference module that recommends the optimal MCTS configuration and reranker tailored to various query domains, thereby addressing Challenge 3. Below, we give a brief description of these two modules.
Optimal Chunk Combination Search: A straightforward approach to considering chunk correlations involves retrieving potential chunks from a vector database (as shown in Step 1 in Figure 2) and exhaustively exploring all possible chunk combinations. However, this method incurs significant latency and computational costs. To mitigate this, we construct a policy tree (as shown in Step 2), reframing the optimal chunk combination search as a node search problem within the tree. Specifically, the root node of the policy tree represents an initial empty state, and each child node corresponds to a specific combination of chunks. For example, if the root node has a child node representing chunk χ1, one of its child nodes might represent the combination χ1 + χ2, while another could represent χ1 + χ3.
We design a search algorithm based on MCTS to address this problem. Unlike traditional MCTS, our approach expands the node with the highest utility in each iteration, simultaneously evaluating all possible child nodes. Additionally, we account for both cost and budget constraints during the policy tree search process. Node utility is calculated by balancing exploration with cost control, optimizing for both efficiency and accuracy.
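The policy tree described above can be represented with a small node structure; the sketch below is our illustration with hypothetical names, storing the ordered chunk combination at each node together with the value and visit statistics used later for utility computation, and expanding a node by appending each remaining chunk.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PolicyNode:
    """A policy tree node holding an ordered chunk combination (hypothetical sketch)."""
    chunks: tuple[str, ...] = ()          # ordered combination; () for the root
    parent: Optional["PolicyNode"] = None
    children: list["PolicyNode"] = field(default_factory=list)
    value: float = 0.0                    # accumulated benefit V(v_i)
    visits: int = 0                       # visit count N(v_i)

    def expand(self, candidate_chunks: list[str]) -> list["PolicyNode"]:
        """Create one child per remaining chunk, each extending this node's order by one chunk."""
        remaining = [c for c in candidate_chunks if c not in self.chunks]
        self.children = [PolicyNode(chunks=self.chunks + (c,), parent=self) for c in remaining]
        return self.children

# Example: the root's child for chi_1 expands into chi_1+chi_2, chi_1+chi_3, chi_1+chi_4.
root = PolicyNode()
chi_1 = PolicyNode(chunks=("chi_1",), parent=root)
children = chi_1.expand(["chi_1", "chi_2", "chi_3", "chi_4"])
```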
Configuration Inference: A simple solution for configuration tuning is to enumerate every possible configuration or reranker, compute the results in parallel, and then select the optimal configuration. However, this would result in impractical costs for the RAG system. To optimize the configuration (i.e., the number of iterations, cost coefficient, and exploration coefficient) for the policy tree search process, we introduce a configuration agent that dynamically generates configurations based on the query domain. To ensure the model's effectiveness, we employ a contrastive learning approach that uses positive and negative label pairs: positive labels correspond to query embeddings from the same optimal reranker, while negative labels come from different optimal rerankers. A joint loss function is used to simultaneously optimize both the regression (for parameter tuning) and contrastive learning (to enhance label differentiation).
Summary. The pipeline of our framework is shown in Figure 2. We first generate an embedding for the input query, which is then used to retrieve potential chunks from the vector database. The query embedding is also fed into the configuration agent, which dynamically generates the optimal MCTS configuration based on the query domain. Using this optimal configuration, we search the policy tree to determine the optimal chunk combination and order from the retrieved potential chunks. Finally, this optimal chunk combination is used to construct the final prompt for the LLM.

4 CHUNK COMBINATION RETRIEVAL
As previously discussed, the order in which chunks are combined significantly impacts the effectiveness of prompt construction for LLMs. Enumerating all possible orders of chunk combinations is not feasible due to the vast number of potential combinations, particularly when the scenario involves a large number of chunks. In this section, we present a novel method that achieves a good trade-off between efficiency and accuracy in searching for the optimal chunk combination order. We first model the problem as searching for the optimal node within a policy tree (Section 4.1). Then, we propose an MCTS-based algorithm to address this node search problem (Section 4.2).

4.1 Policy Tree Search Modeling
To approach the optimal combination order, the first step is to find a data structure that enables efficient enumeration of all possible combination orders. A natural choice is a tree, allowing us to explore all potential answers by traversing from the root to the leaf nodes.
Policy Tree. As illustrated in Figure 3, we construct a policy tree to represent all potential orders of chunk combinations sourced from the vector database. Specifically, the root node symbolizes the initial state without any chunk, with each subsequent node depicting a selected chunk from the potential ones. Thus, a child node emerges from its parent by selecting the next available chunk from the queue of potential chunks and appending it to the sequence established by the ancestor node. For instance, if a node represents the chunk combination order {χ1}, then a child node might embody a subsequent combination order such as {χ1, χ2}, {χ1, χ3}, or {χ1, χ4}. Accordingly, we define the policy tree formally as follows:

Definition 4.1 (Policy Tree). Given a query q and a set of potential chunks {χ1, χ2, ..., χn}, we construct a policy tree T. The root node of T represents the initial state, devoid of any chunks. Each subsequent non-root node embodies a chunk set, obtained by incorporating a newly selected chunk from the remaining potential chunks into the sequence at its parent node. This process sequentially constructs an ordered chunk combination in each non-root node, and our objective is to find the node with the highest utility score.

Within the policy tree, our goal is to select a node that encompasses ordered chunks offering the highest benefit at the lowest cost. To accomplish this, we need to devise a utility function to evaluate the trade-off between benefit and cost. This function is quantified through what we define as the "node utility", described as follows.
Node Utility. The utility metric comprises two components: the benefit derived from selecting the chunk combination and the cost associated with using the chunks as a prompt for LLMs. Specifically, the benefit is quantified with LLMs, which can measure the similarity between the selected chunks and the query; we denote it as the node value V. Next, we use the Upper Confidence Bound (UCB) [3] algorithm to balance exploitation (node value V(v_i)) and exploration (search count N(v_i)) for a given node v_i. Regarding cost, we consider the token cost as defined in Section 2 and measure it by the proportion of the current chunk combination's cost relative to the total allocated budget B. Therefore, the node utility is defined as follows:

Definition 4.2 (Node Utility). Given a policy tree and a cost budget B, the utility of a non-root node v_i is defined as

$\mathcal{U}(v_i) = \frac{V(v_i)}{N(v_i)} + c\sqrt{\frac{\ln N}{N(v_i)}} - \lambda\,\frac{\mathrm{cost}(v_i)}{\mathcal{B}}$    (5)

where V(v_i) is the estimated benefit value of the chunk combination at node v_i, determined by a trained model; N(v_i) is the count of visits to node v_i, promoting exploration of less frequented nodes; and N is the total number of visits across all nodes in the policy tree, ensuring a balance between exploration and exploitation. In addition, cost(v_i) denotes the token cost of node v_i, B is the total token budget, c moderates the exploration-exploitation trade-off, and λ serves as a penalty factor on the cost to enhance cost-efficiency.

Optimal Node Selection Modeling. Building on the defined node utility, the task of selecting an optimal chunk combination order, as outlined in Section 2, is reformulated as optimal node selection within the policy tree T. Given a budget constraint B, the objective is to identify the node v_i ∈ T that maximizes the utility U(v_i), while ensuring that the total cost associated with v_i does not exceed B. Formally, this is represented as

$\hat{v}_i = \arg\max_{v_i \in T} \left( \frac{V(v_i)}{N(v_i)} + c\sqrt{\frac{\ln N}{N(v_i)}} - \lambda\,\frac{\mathrm{cost}(v_i)}{\mathcal{B}} \right)$    (6)

where V(v_i) is the estimated benefit of the chunk combination at node v_i, and cost(v_i) represents its associated cost. This formulation enables selecting chunks that maximize utility within the given budget.
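The node utility of Definition 4.2 and the selection rule of Eq. (6) translate directly into a scoring function; the sketch below is our illustration, reusing the hypothetical PolicyNode from the earlier sketch and approximating the token cost with a whitespace token count. The defaults c = 2.4 and λ = 0.1 follow the hyper-parameter settings reported in Section 6.1.

```python
import math

def node_utility(node, total_visits: int, budget: int,
                 c: float = 2.4, lam: float = 0.1) -> float:
    """Eq. (5): exploitation + UCB exploration - cost penalty; unvisited nodes get +inf to force exploration."""
    if node.visits == 0:
        return float("inf")
    exploitation = node.value / node.visits
    exploration = c * math.sqrt(math.log(total_visits) / node.visits)
    cost_penalty = lam * (sum(len(ch.split()) for ch in node.chunks) / budget)
    return exploitation + exploration - cost_penalty

def select_best_node(nodes, total_visits: int, budget: int):
    """Eq. (6): pick the node with maximal utility among those within the token budget."""
    feasible = [n for n in nodes
                if sum(len(ch.split()) for ch in n.chunks) <= budget]
    return max(feasible, key=lambda n: node_utility(n, total_visits, budget))
```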
[Figure 3: Policy tree search. Starting from the potential chunks, the policy tree is built from an empty root, and each iteration performs selection of the highest-utility node, parallel expansion with utility computation for all child combinations, and a utility update over the tree, yielding the optimal chunk order.]
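Putting the earlier sketches together, the following code (ours, not the authors' released implementation) shows one plausible way to run the budget-constrained MCTS loop of Figure 3: in each iteration the highest-utility expandable node is selected, all of its children are expanded and scored in one batch, and the scores are propagated back to the root; `reranker_score` is a hypothetical stand-in for the trained benefit model V.

```python
def mcts_search(root, candidate_chunks, reranker_score, budget: int,
                iterations: int = 10):
    """Budget-constrained MCTS over the policy tree: selection, parallel expansion, utility update."""
    total_visits = 1
    frontier = [root]
    for _ in range(iterations):
        # Selection: expandable node with the highest utility (Eq. 5).
        node = max(frontier, key=lambda n: node_utility(n, total_visits, budget))
        frontier.remove(node)
        # Parallel expansion: score all budget-feasible children of the selected node in one batch.
        children = [c for c in node.expand(candidate_chunks)
                    if sum(len(ch.split()) for ch in c.chunks) <= budget]
        scores = [reranker_score(c.chunks) for c in children]  # could be batched or run in parallel
        # Update: back-propagate each child's value along the path to the root.
        for child, score in zip(children, scores):
            child.value, child.visits = score, 1
            total_visits += 1
            ancestor = node
            while ancestor is not None:
                ancestor.value += score
                ancestor.visits += 1
                ancestor = ancestor.parent
        frontier.extend(children)
        if not frontier:
            break
    # Answer: best non-root node found anywhere in the tree, within budget (Eq. 6).
    all_nodes, stack = [], [root]
    while stack:
        n = stack.pop()
        all_nodes.append(n)
        stack.extend(n.children)
    return select_best_node([n for n in all_nodes if n.chunks], total_visits, budget)
```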
the precise prediction of MCTS configurations, including the iteration count and λ.

5.2.2 Contrastive Learning. To efficiently distinguish between different query domains and recommend the optimal configuration for each query, we utilize contrastive learning to bring queries of the same domain closer together while pushing apart embeddings from different reranker classes.
Contrastive pairs preparation. To prepare the training dataset, we must identify the optimal reranker and configuration for each query. In this study, the most suitable reranker and corresponding configurations for each query are determined through extensive experimentation with various setups. Subsequently, query pairs are generated based on these optimal reranker annotations. Positive pairs are formed from queries that share the same optimal reranker, promoting minimal distance between their embeddings in the feature space. Conversely, negative pairs are composed of queries with different optimal rerankers, where the goal is to maximize the distance between their embeddings. Since some rerankers perform similarly on certain queries, we select only cases with a ROUGE-L difference exceeding 10% to form our training dataset.
Contrastive loss. As illustrated in Figure 4, for a given positive pair (x_i, x_i^+) and a negative pair (x_j, x_j^-), we first generate their corresponding feature maps with the encoding model. These feature maps are then utilized to compute the contrastive loss L_con. In particular, this process can be formulated as follows:

$L_{\mathrm{con}}(\theta) = F_{\mathrm{con}}\big(f_\theta(x_i), f_\theta(x_i^{+})\big) + F_{\mathrm{con}}\big(f_\theta(x_i), f_\theta(x_j^{-})\big)$    (10)

where f_θ(x) represents the embedding function, and F_con is the similarity function applied to both types of pairs: positive pairs (with the same optimal reranker) and negative pairs (with different rerankers). This loss function is designed to ensure that queries with the same reranker are brought closer together in the embedding space, while those with different rerankers are pushed apart.

5.2.3 Whole training process. Finally, the total loss function L_total is the combination of the contrastive, classification, and regression losses. In particular, the contrastive loss L_con(θ) encourages the embeddings of queries with the same optimal reranker to be close together, while pushing apart the embeddings of queries with different rerankers. The classification loss L_cla(θ) aids the model in correctly identifying the reranker using cross-entropy, and the regression loss L_reg(θ) minimizes the error in predicting the optimal MCTS configuration.
Remark. Once the total loss L_total is calculated, the network parameters θ are updated using gradient descent with a learning rate η. This optimization process is repeated across multiple epochs E and batches, ensuring that both reranker selection and parameter prediction are improved over time.
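As a concrete illustration of the joint objective in Sections 5.2.2 and 5.2.3, the sketch below shows one plausible instantiation (our assumption, not the authors' exact code): a margin-based contrastive term, a cross-entropy reranker-classification term, and an MSE regression term over the predicted MCTS configuration. The margin of 1.0 follows the learning-parameter settings of Section 6.1; the equal loss weights are an assumption.

```python
import torch.nn.functional as F

def joint_loss(anchor_emb, positive_emb, negative_emb,
               reranker_logits, reranker_label,
               config_pred, config_target,
               margin: float = 1.0,
               w_con: float = 1.0, w_cla: float = 1.0, w_reg: float = 1.0):
    """L_total = w_con * L_con + w_cla * L_cla + w_reg * L_reg (weights are our assumption)."""
    # Contrastive term: pull same-reranker queries together, push different-reranker queries apart.
    pos_dist = F.pairwise_distance(anchor_emb, positive_emb)
    neg_dist = F.pairwise_distance(anchor_emb, negative_emb)
    l_con = (pos_dist.pow(2) + F.relu(margin - neg_dist).pow(2)).mean()
    # Classification term: identify the optimal reranker with cross-entropy.
    l_cla = F.cross_entropy(reranker_logits, reranker_label)
    # Regression term: predict the MCTS configuration (iterations, c, lambda).
    l_reg = F.mse_loss(config_pred, config_target)
    return w_con * l_con + w_cla * l_cla + w_reg * l_reg
```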
6 EXPERIMENTS
The experimental study intends to answer the following questions:
• RQ1: How effective is CORAG for the cost-constrained RAG pipeline compared to other methods?
• RQ2: How efficient is CORAG with varying chunk sizes?
• RQ3: What are the bottlenecks of current RAG systems?
• RQ4: How scalable is CORAG with varying dataset sizes?
• RQ5: What is the effectiveness of each design in CORAG?

6.1 Experiment Setting
Environment. We integrate our system with the popular RAG framework LlamaIndex. The experiments are run on a Linux server with an Intel Core i7-13700K CPU (12 cores, 24 threads, 5.3 GHz), 64 GB RAM, and a 1 TiB NVMe SSD. The configuration agent module is implemented in PyTorch 2.0 and trained on an NVIDIA RTX 4090 GPU with 24 GB VRAM.

Table 1: Statistics of datasets used in the experiments.
Dataset | #train | #dev | #test | #p
MSMARCO | 502,939 | 6,980 | 6,837 | 8,841,823
Wiki | 3,332 | 417 | 416 | 244,136

[Figure 5: Efficiency Comparison. Latency and ROUGE across chunk sizes 256, 512, and 1024 on WikiPassageQA and MARCO, comparing Raptor, NaiveRAG, CORAG w/o Agent, CORAG (+Agent), and CORAG Upper.]

Datasets. To evaluate the performance of CORAG across diverse scenarios, we conduct experiments on two distinct datasets with differing task focuses: (1) WikiPassageQA [7] is a question-answering benchmark containing 4,165 questions and over 100,000 text chunks, aimed at evaluating passage-level retrieval. (2) MARCO [24] is a comprehensive dataset tailored for natural language processing tasks, primarily emphasizing question answering and information retrieval. As shown in Table 1, both WikiPassageQA and MARCO provide high-quality question and passage annotations, making them suitable benchmarks for assessing retrieval effectiveness. In our experiments, we prompt LLMs to generate ground-truth answers for each dataset. For instance, if we use Llama3 to evaluate CORAG's performance, we also prompt Llama3 to generate the ground truth in the same experimental setting, for fairness and alignment with the characteristics of the LLM.
Baselines. We compare the performance of CORAG with two typical RAG baselines:
• RAPTOR [29]: RAPTOR constructs a hierarchical document summary tree by recursively embedding, clustering, and summarizing text chunks, enabling multi-level abstraction. This approach aligns with the clustering-based methods discussed in Section 1. We finish the tree construction within the budget limit.
• NaiveRAG: This is a basic method for retrieving relevant chunks. First, candidate chunks are retrieved from the vector database based on vector similarity search, and they are then ranked with a reranker model. This approach is the type of AKNN method mentioned in Section 1. To meet the cost constraint, we employ a greedy budget allocation strategy, retrieving chunks until the budget is fully exhausted.
In addition, we remove the configuration agent from our method as a baseline to evaluate its impact on the performance of CORAG, referring to this version as CORAG w/o Agent. Finally, we implement a method called CORAG Upper to establish an upper bound by exploring all possible chunk combinations and selecting the optimal order. Due to the large number of potential combinations, we limit the exploration to combinations with fewer than six chunks in the CORAG Upper case.
Remark. Other methods, such as GraphRAG [22], depend significantly on frequent invocations of LLMs to summarize chunks and construct indexes, incurring substantial costs (e.g., billions of tokens) that exceed our strict cost constraints. Consequently, these methods are not feasible for addressing our problem. For a fair comparison, we exclude these types of RAG methods from the experiments.
Hyper-parameter Settings: The hyper-parameters for CORAG are automatically determined by the configuration agent, while NaiveRAG does not require any hyper-parameters. For other baseline methods, we ensure consistency by using identical hyper-parameters for fair comparisons. Specifically, we set the exploration coefficient to 2.4, the number of iterations to 10, and the cost coefficient λ to 0.1. Preliminary experiments indicate that this configuration optimizes baseline performance. Ablation studies further validate these settings.
Learning Parameter Setting: In our method, the configuration agent is trained using contrastive learning. The hyper-parameters used during this process include a margin for the contrastive loss (margin = 1.0), learning rate (lr = 0.001), batch size (32), number of epochs (num_epochs = 60), and the embedding model (i.e., BAAI/bge-m3 [6]).
Evaluation Metrics. We assess effectiveness by comparing the ROUGE scores between the ground-truth answers and the generated responses, using ROUGE-1, ROUGE-2, and ROUGE-L as evaluation metrics. To evaluate efficiency, we measure the latency required to answer a query with each method.

6.2 Performance Comparison
6.2.1 RQ1: ROUGE Comparison. As shown in Table 2, we compare CORAG with several baselines across different datasets, primarily using WikiPassageQA and MARCO. The evaluations are conducted on three different chunk sizes, utilizing ROUGE-1, ROUGE-2, and ROUGE-L metrics to assess the improvements in responses generated by the LLM due to our retrieval method. CORAG demonstrates a substantial improvement of approximately 25% compared to mainstream RAG approaches such as NaiveRAG and RAPTOR. As expected, CORAG does not exceed the upper bound, which represents an extreme scenario where all possible combination orders are exhaustively enumerated, an approach that is clearly inefficient and impractical. In summary, CORAG outperforms the baselines, enhancing retrieval relevancy while pruning the search space effectively.
6.2.2 RQ2: Efficiency Evaluation. Since CORAG is based on a tree search algorithm in which the agent assists in predicting the optimal reranker and parameters for a given query, it is crucial to evaluate the impact of different chunk sizes and datasets on the efficiency of the retrieval optimization task. As shown in Figure 5, we tested efficiency using various datasets and chunk sizes, observing that NaiveRAG, which uses a traditional retrieval approach, achieved shorter retrieval times but lower ROUGE scores. CORAG Upper performs well in terms of ROUGE, but its efficiency is significantly reduced due to exploring the entire search space. Similarly, RAPTOR, which leverages an external LLM for summarization, exhibited poor efficiency. In contrast, our CORAG approach strikes a balance between efficiency and retrieval relevance, achieving an effective trade-off.
6.2.3 RQ3: Performance Breakdown. We present a performance breakdown of our baseline NaiveRAG to highlight the bottlenecks in the current RAG system. To address the challenge of searching for the optimal chunk combination order, implementing it with NaiveRAG requires the following steps: (a) obtaining the query embedding, (b) retrieving potential chunk combinations, and (c) reranking
Table 2: ROUGE Comparison on WikiPassageQA and MARCO Datasets
Method | LLM Type | WikiPassageQA: 256 (R1 R2 RL), 512 (R1 R2 RL), 1024 (R1 R2 RL) | MARCO: 256 (R1 R2 RL), 512 (R1 R2 RL), 1024 (R1 R2 RL)
Raptor Llama3-8B 0.338 0.154 0.316 0.322 0.147 0.301 0.335 0.159 0.305 0.386 0.208 0.356 0.393 0.213 0.366 0.338 0.154 0.316
NaiveRAG Llama3-8B 0.337 0.149 0.312 0.321 0.142 0.297 0.334 0.158 0.309 0.398 0.203 0.369 0.395 0.213 0.368 0.337 0.149 0.312
CORAG upper Llama3-8B 0.447 0.275 0.426 0.426 0.262 0.406 0.444 0.273 0.423 0.435 0.235 0.414 0.425 0.229 0.397 0.447 0.275 0.426
CORAG w/o Agent Llama3-8B 0.390 0.212 0.364 0.372 0.202 0.347 0.388 0.221 0.362 0.401 0.212 0.374 0.393 0.216 0.372 0.390 0.212 0.364
CORAG Llama3-8B 0.423 0.223 0.392 0.403 0.212 0.373 0.409 0.219 0.378 0.413 0.224 0.382 0.405 0.219 0.376 0.411 0.219 0.380
CORAG Mixtral8*7B 0.357 0.158 0.325 0.382 0.167 0.351 0.401 0.198 0.367 0.408 0.199 0.378 0.399 0.194 0.369 0.403 0.193 0.373
CORAG Phi-2 2.7B 0.351 0.137 0.317 0.318 0.117 0.298 0.308 0.108 0.288 0.335 0.109 0.305 0.325 0.103 0.301 0.333 0.108 0.303
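The ROUGE-1/2/L figures reported above (and throughout Section 6) can be reproduced with a standard scorer; the sketch below assumes the open-source rouge-score package rather than the authors' exact evaluation script, and the two strings are only illustrative.

```python
# pip install rouge-score  (assumed evaluation utility, not necessarily the authors' script)
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

ground_truth = "The Eiffel Tower was designed by Gustave Eiffel's company and built between 1887 and 1889."
generated = "Gustave Eiffel's company designed the tower, constructed from 1887 to 1889."

scores = scorer.score(ground_truth, generated)
print({name: round(score.fmeasure, 3) for name, score in scores.items()})
```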
[Figure: Performance breakdown of NaiveRAG, showing per-stage latency in seconds (embedding, retrieve, rerank, prompt) on WikiPassageQA and MARCO.]

ROUGE scores. This efficient balance between performance and computational overhead highlights the system's capacity to prune the search space effectively, ensuring fast retrieval even in expansive datasets. As a result, our approach is well-suited for scenarios where both large-scale data processing and high retrieval accuracy are crucial.

Table 3: Performance comparison with varying budgets
[Figure 7: ROUGE Comparison between different C (exploration coefficient) values {0, 1, 2, 3} on WikiPassageQA and MARCO.]

[Figure 8: ROUGE Comparison between different lambda values {0, 0.1, 0.2, 0.3} on WikiPassageQA and MARCO (R1 and RL).]

performance, striking an optimal balance between exploration and exploitation within the search process. This balance enables the system to effectively uncover relevant information while maintaining a focus on high-potential chunks, ultimately leading to improved RAG responses. In contrast, both lower and higher exploration coefficients led to suboptimal results, either due to insufficient exploration or excessive diffusion of focus. These findings emphasize the critical role of the exploration coefficient in the performance of the CORAG search process and highlight the importance of careful parameter tuning.

testing values of 0, 0.1, 0.2, and 0.3. The results show that introducing the cost coefficient in the utility led to a slight decrease in ROUGE scores. This decrease occurs because, without cost constraints, CORAG tends to produce longer outputs, albeit at the expense of cost efficiency. However, despite the slight reduction in ROUGE scores, the decline remains within 5%, which is acceptable. These results highlight the importance of tuning the cost coefficient effectively to balance output richness and cost constraints, further emphasizing the role of our configuration agent in enabling efficient configuration tuning for optimal CORAG performance.

6.3.4 Ablation Study on Different Rerankers. To evaluate the impact of different rerankers on retrieval performance, we conduct an ablation study using six widely recognized reranker models: jina-reranker-v1-turbo-en, jina-reranker-v2-base-multilingual, bge-reranker-v2-m3, bge-reranker-large, bge-reranker-base, and gte-multilingual-reranker-base. These rerankers are evaluated on the MARCO dataset with the Llama3-8B model, configured with a fixed cost coefficient of 0.1, an exploration coefficient of 2.4, and a budget limit of 1024.

Table 4: Performance comparison with varying rerankers
ChunkSize | Reranker | R1 | R2 | RL
256 | v1-turbo | 0.412 | 0.216 | 0.379
256 | v2-base-multi | 0.413 | 0.221 | 0.380
256 | bge-m3 | 0.425 | 0.230 | 0.395
256 | bge-large | 0.431 | 0.238 | 0.401
256 | bge-base | 0.421 | 0.232 | 0.390
256 | gte-base | 0.424 | 0.232 | 0.395
512 | v1-turbo | 0.366 | 0.173 | 0.333
512 | v2-base-multi | 0.367 | 0.177 | 0.334
512 | bge-m3 | 0.368 | 0.177 | 0.336
512 | bge-large | 0.362 | 0.177 | 0.332
512 | bge-base | 0.375 | 0.185 | 0.344
512 | gte-base | 0.364 | 0.181 | 0.335
1024 | v1-turbo | 0.269 | 0.093 | 0.240
1024 | v2-base-multi | 0.270 | 0.094 | 0.243
1024 | bge-m3 | 0.270 | 0.094 | 0.243
1024 | bge-large | 0.265 | 0.092 | 0.236
1024 | bge-base | 0.265 | 0.092 | 0.237
1024 | gte-base | 0.270 | 0.094 | 0.242

The results in Table 4 reveal variations in performance across different rerankers, highlighting the importance of careful reranker selection to optimize RAG system performance under specific operational constraints. Among the rerankers, gte-multilingual-reranker-base and bge-reranker-large demonstrate consistently strong performance on QA tasks, suggesting that these reranker models are highly effective at capturing relevant information across different QA queries. We observe that, as the chunk size increases in the ablation study, each individual reranker yields lower performance than the agent-recommended reranker across different queries. This indicates that the configuration agent effectively leverages reranker diversity, dynamically adjusting configurations to improve retrieval results. The configuration agent's ability to adaptively recommend a better reranker selection and parameter configuration underscores its importance in maximizing RAG system performance, particularly under constraints such as a limited budget.
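For reference, the rerankers in Table 4 can be applied to query-chunk pairs through a standard cross-encoder interface; the sketch below assumes the sentence-transformers CrossEncoder API and one of the checkpoints named above, as one plausible way such scores could feed the benefit model, though the paper does not prescribe this exact code.

```python
# pip install sentence-transformers  (assumed interface; any reranker from Table 4 can be swapped in)
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base", max_length=512)

query = "Is bougainvillea a shrub?"
chunks = [
    "Bougainvillea is a tropical vining shrub that comes in a wide array of bright colors.",
    "The flowers are actually modified leaves, called bracts, that are long-lasting and bright.",
]
scores = reranker.predict([(query, chunk) for chunk in chunks])
ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
```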
6.4 Case Study
Figure 9 presents three examples to illustrate the retrieval quality comparison between CORAG and the traditional NaiveRAG method, focusing on why our approach outperforms baseline methods. Due to its straightforward top-k retrieval and reranking, NaiveRAG often misses essential information relevant to the query's intent, as it frequently retrieves chunks based on keyword matching rather than relevance to the query. For instance, with the query "Is bougainvillea a shrub?", NaiveRAG retrieves content containing matching keywords but fails to provide the actual classification of bougainvillea. In contrast, CORAG's chunk combination strategy retrieves context that includes bougainvillea's category, enabling the LLM to give a more accurate response. In another case, NaiveRAG retrieves terms and legal clauses containing "oxyfluorfen" but lacks understanding of the query's intent, while CORAG provides context linking oxyfluorfen to its use case in cotton, which requires logical relationships between chunks that NaiveRAG's vector similarity search

[Figure 9: Case study on three queries ("Is bougainvillea a shrub?", "Is oxyfluorfen safe on cotton?", "Where does bacteria come from?"), contrasting the chunks retrieved by CORAG with those retrieved by NaiveRAG.]
cannot capture. Finally, for the query "Where does bacteria come from?", NaiveRAG retrieves chunks with the keyword "bacteria" but does not address its origin, whereas CORAG supplies a more complete response, including sources of bacteria and conditions for their proliferation. These cases illustrate that CORAG excels at retrieving logically connected information, making it more effective than NaiveRAG for queries requiring more than simple keyword matching.

7 INSIGHTS AND FUTURE DESIGN CHOICES ON RAG
7.1 Shortcomings of Current RAG
We provide an analysis of current RAG systems revealing performance challenges across the Retrieve (S1), Augment (S2), and Generation (S3) phases.
S1: Retrieval Overhead. Current RAG systems often utilize LLMs for summarization and indexing structures, overlooking the high computational costs associated with external LLMs, thus escalating compute expenses. Model-based rerankers, while improving relevance during retrieval, introduce notable latency, which can impede efficiency in latency-sensitive contexts. Cost-effective index construction and reranking optimization are essential to balance efficiency and performance.
S2: Augmentation Overhead. Post-retrieval techniques, such as optimized chunk combination ordering, enhance context relevancy but demand additional computation. Pruning strategies that minimize the search space and refine the combination order are critical for balancing computational cost and augmented context relevancy. Efficient chunk combination optimization, emphasizing order and coherence, is vital for reducing costs and enhancing retrieval performance.
S3: Generation Overhead. Effective prompt engineering for optimal chunk combinations requires significant computational resources. Query-specific prompt refinement and compression are crucial to reduce overhead while maintaining input relevance and conciseness. Adaptive strategies that handle diverse query types and domain-specific requirements ensure prompt efficiency without compromising output quality.

7.2 Design Choices
To address the challenges identified, the following design choices aim to optimize the performance of RAG systems.
P1: Co-Design of Retrieval and Reranking Processes. In CORAG, parallel expansion in the tree search accelerates query processing by enabling concurrent retrieval and reranking, significantly reducing latency. Future optimizations could address bottlenecks by eliminating stage-specific delays, further enhancing ranking efficiency. This co-design approach efficiently manages chunk combination order, improving the ranking process and relevance scoring.
P2: Optimization of Tree Structure and Search Iterations. Results indicate that a shorter policy tree height enhances search efficiency by reducing computational overhead, which is especially advantageous for large datasets. Minimizing tree height in tree-based searches improves the search speed for contextually relevant chunks, significantly lowering latency and computational costs. This optimization approach enhances RAG system performance across extensive datasets.
P3: Dynamic Prompt Engineering. Selecting rerankers based on query type and using adaptable prompt templates improves retrieval relevance for LLMs. Dynamic prompt structures that align with query intent and domain-specific contexts maintain output quality within resource constraints. This adaptive approach to prompt engineering achieves an effective balance between efficiency and retrieval quality, addressing the dynamic nature of RAG system queries.

8 CONCLUSION
Considering the non-monotonicity of chunk utility, the correlations between chunks, and the diversity of query domains, we propose CORAG, a learning-based retrieval optimization system. We model chunk combination orders as a policy tree and employ MCTS to explore this policy tree, aiming to identify the optimal chunk combination order. We introduce a configuration agent that accurately predicts the optimal configuration and reranker for a given query. Additionally, we design a parallel expansion strategy to expand multiple nodes in each iteration. Experimental results demonstrate that our method significantly outperforms state-of-the-art methods within constrained cost limits while also showing notable efficiency.
REFERENCES
[1] Mohammad Alkhalaf, Ping Yu, Mengyang Yin, and Chao Deng. 2024. Applying generative AI with retrieval augmented generation to summarize and extract key clinical information from electronic health records. (2024), 104662.
[2] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. (2023).
[3] Peter Auer. 2002. Using confidence bounds for exploitation-exploration trade-offs. 3, Nov (2002), 397–422.
[4] Tom B Brown. 2020. Language models are few-shot learners. (2020).
[5] Chi-Min Chan, Chunpu Xu, Ruibin Yuan, Hongyin Luo, Wei Xue, Yike Guo, and Jie Fu. 2024. Rq-rag: Learning to refine queries for retrieval augmented generation. (2024).
[6] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2402.03216 [cs.CL]
[7] Daniel Cohen, Liu Yang, and W. Bruce Croft. 2018. WikiPassageQA: A Benchmark Collection for Research on Non-factoid Answer Passage Retrieval. abs/1805.03797 (2018). arXiv:1805.03797
[8] Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri. 2024. The Power of Noise: Redefining Retrieval for RAG Systems. 719–729.
[9] Goetz Graefe and William J McKenna. 1993. The volcano optimizer generator: Extensibility and efficient search. 209–218.
[10] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2023. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv:2311.05232 [cs.CL]
[11] Ziyan Jiang, Xueguang Ma, and Wenhu Chen. 2024. LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs. arXiv:2406.15319 [cs.CL]
[12] Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. 2023. Time-llm: Time series forecasting by reprogramming large language models. (2023).
[13] Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. 2022. Internet-augmented language models through few-shot prompting for open-domain question answering. (2022).
[14] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. 33 (2020), 9459–9474.
[15] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. (2015).
[16] Xinze Li, Zhenghao Liu, Chenyan Xiong, Shi Yu, Yu Gu, Zhiyuan Liu, and Ge Yu. 2023. Structure-Aware Language Model Pretraining Improves Dense Retrieval on Structured Data. (2023).
[17] Shu Liu, Asim Biswal, Audrey Cheng, Xiangxi Mo, Shiyi Cao, Joseph E Gonzalez, Ion Stoica, and Matei Zaharia. 2024. Optimizing llm queries in relational workloads. (2024).
[18] Yanming Liu, Xinyue Peng, Xuhong Zhang, Weihao Liu, Jianwei Yin, Jiannan Cao, and Tianyu Du. 2024. RA-ISF: Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback. (2024).
[19] Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2021. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. (2021).
[20] Yuanjie Lyu, Zhiyu Li, Simin Niu, Feiyu Xiong, Bo Tang, Wenjin Wang, Hao Wu, Huanyong Liu, Tong Xu, and Enhong Chen. 2024. Crud-rag: A comprehensive chinese benchmark for retrieval-augmented generation of large language models. (2024).
[21] Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. 2023. Query rewriting for retrieval-augmented large language models. (2023).
[22] Microsoft. 2024. GraphRAG. https://microsoft.github.io/graphrag/
[23] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. Webgpt: Browser-assisted question-answering with human feedback. (2021).
[24] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. abs/1611.09268 (2016). arXiv:1611.09268
[25] Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Le Yan, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, et al. 2023. Large language models are effective text rankers with pairwise ranking prompting. (2023).
[26] Chidaksh Ravuru, Sagar Srinivas Sakhinana, and Venkataramana Runkana. 2024. Agentic Retrieval-Augmented Generation for Time Series Analysis. arXiv:2408.14484 [cs.AI]
[27] Stephen Robertson, Hugo Zaragoza, and Michael Taylor. 2004. Simple BM25 extension to multiple weighted fields. 42–49.
[28] Devendra Singh Sachan, Mike Lewis, Mandar Joshi, Armen Aghajanyan, Wen-tau Yih, Joelle Pineau, and Luke Zettlemoyer. 2023. Improving Passage Retrieval with Zero-Shot Question Generation. arXiv:2204.07496 [cs.CL]
[29] Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. 2024. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval.
[30] Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. Replug: Retrieval-augmented black-box language models. (2023).
[31] Devendra Singh, Siva Reddy, Will Hamilton, Chris Dyer, and Dani Yogatama. 2021. End-to-end training of multi-document reader and retriever for open-domain question answering. 34 (2021), 25968–25981.
[32] Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Nathaneal Scharli, Aakanksha Chowdhery, Philip Mansfield, Blaise Aguera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam, and Vivek Natarajan. 2022. Large Language Models Encode Clinical Knowledge. arXiv:2212.13138 [cs.CL]
[33] Manveer Singh Tamber, Ronak Pradeep, and Jimmy Lin. 2023. Scaling Down, LiTting Up: Efficient Zero-Shot Listwise Reranking with Seq2seq Encoder-Decoder Models. (2023).
[34] Andrew Trotman, Antti Puurula, and Blake Burgess. 2014. Improvements to BM25 and language models examined. 58–65.
[35] Shuting Wang, Xin Xu, Mang Wang, Weipeng Chen, Yutao Zhu, and Zhicheng Dou. 2024. RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation. (2024).
[36] Zheng Wang, Bingzheng Gan, and Wei Shi. 2024. Multimodal Query Suggestion with Multi-Agent Reinforcement Learning from Human Feedback. arXiv:2402.04867 [cs.IR]
[37] Jeff Wu, Long Ouyang, Daniel M Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. 2021. Recursively summarizing books with human feedback. (2021).
[38] Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. 2024. Hallucination is Inevitable: An Innate Limitation of Large Language Models. arXiv:2401.11817 [cs.CL]
[39] Siqiao Xue, Danrui Qi, Caigao Jiang, Wenhui Shi, Fangyin Cheng, Keting Chen, Hongjun Yang, Zhiping Zhang, Jianshan He, Hongyang Zhang, Ganglin Wei, Wang Zhao, Fan Zhou, Hong Yi, Shaodong Liu, Hongjun Yang, and Faqiang Chen. 2024. Demonstration of DB-GPT: Next Generation Data Interaction System Empowered by Large Language Models. arXiv:2404.10209 [cs.AI]
[40] Jiexia Ye, Weiqi Zhang, Ke Yi, Yongzi Yu, Ziyue Li, Jia Li, and Fugee Tsung. 2024. A Survey of Time Series Foundation Models: Generalizing Time Series Representation with Large Language Model. arXiv:2405.02358 [cs.LG]
[41] Ziqi Yin, Shanshan Feng, Shang Liu, Gao Cong, Yew Soon Ong, and Bin Cui. 2024. LIST: Learning to Index Spatio-Textual Data for Embedding based Spatial Keyword Queries. (2024).
[42] Tan Yu, Anbang Xu, and Rama Akkiraju. 2024. In Defense of RAG in the Era of Long-Context Language Models. arXiv:2409.01666 [cs.CL]
[43] Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, and Bryan Catanzaro. 2024. RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs. (2024).
[44] Hamed Zamani and Michael Bendersky. 2024. Stochastic RAG: End-to-End Retrieval-Augmented Generation through Expected Utility Maximization. arXiv:2405.02816 [cs.CL]
[45] Hailin Zhang, Yujing Wang, Qi Chen, Ruiheng Chang, Ting Zhang, Ziming Miao, Yingyan Hou, Yang Ding, Xupeng Miao, Haonan Wang, et al. 2024. Model-enhanced vector index. 36 (2024).
[46] Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. 2024. Retrieval-augmented generation for ai-generated content: A survey. (2024).
[47] Siyun Zhao, Yuqing Yang, Zilong Wang, Zhiyuan He, Luna K. Qiu, and Lili Qiu. 2024. Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely. arXiv:2409.14924 [cs.CL]
[48] Xinyang Zhao, Xuanhe Zhou, and Guoliang Li. 2024. Chat2Data: An Interactive Data Analysis System with RAG, Vector Databases and LLMs. (2024).
[49] Xuanhe Zhou, Guoliang Li, Chengliang Chai, and Jianhua Feng. 2021. A learned query rewrite system using monte carlo tree search. 15, 1 (2021), 46–58.
[50] Honglei Zhuang, Zhen Qin, Rolf Jagerman, Kai Hui, Ji Ma, Jing Lu, Jianmo Ni, Xuanhui Wang, and Michael Bendersky. 2022. RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses. arXiv:2210.10634 [cs.IR]
[51] Shengyao Zhuang, Bing Liu, Bevan Koopman, and Guido Zuccon. 2023. Open-source large language models are strong zero-shot query likelihood models for document ranking. (2023).