AUTO-RAG: AUTONOMOUS RETRIEVAL-AUGMENTED GENERATION FOR LARGE LANGUAGE MODELS
ABSTRACT
Iterative retrieval refers to the process in which the model continuously queries
the retriever during generation to enhance the relevance of the retrieved knowl-
edge, thereby improving the performance of Retrieval-Augmented Generation
(RAG). Existing work typically employs few-shot prompting or manually con-
structed rules to implement iterative retrieval. This introduces additional inference
overhead and overlooks the remarkable reasoning capabilities of Large Language
Models (LLMs). In this paper, we introduce Auto-RAG, an autonomous itera-
tive retrieval model centered on the LLM’s powerful decision-making capabilities.
Auto-RAG engages in multi-turn dialogues with the retriever, systematically plan-
ning retrievals and refining queries to acquire valuable knowledge. This process
continues until sufficient external information is gathered, at which point the re-
sults are presented to the user. To this end, we develop a method for autonomously
synthesizing reasoning-based decision-making instructions for iterative retrieval
and fine-tune the latest open-source LLMs. The experimental results indicate
that Auto-RAG is capable of autonomous iterative interaction with the retriever,
effectively leveraging the remarkable reasoning and decision-making abilities of
LLMs, which leads to outstanding performance across six benchmarks. Further
analysis reveals that Auto-RAG can autonomously adjust the number of iterations
based on the difficulty of the questions and the utility of the retrieved knowl-
edge, without requiring any human intervention. Moreover, Auto-RAG expresses
the iterative retrieval process in natural language, enhancing interpretability while
providing users with a more intuitive experience.
1 INTRODUCTION
Retrieval-augmented generation (RAG) for Large Language Models (LLMs) is widely employed
to tackle knowledge-intensive tasks (Asai et al., 2023; Dubey et al., 2024; Jiang et al., 2023; Feng
et al., 2023; Gao et al., 2024), which substantially improves output quality and effectively mitigates
hallucinations (Gao et al., 2024; Lewis et al., 2020). However, certain limitations persist, such as
noise in retrieved content (Yu et al., 2023) and the challenge of retrieving sufficient knowledge for
complex queries in a single attempt (Feng et al., 2023; Chen et al., 2024). These issues ultimately
undermine the overall performance of RAG systems and impede their widespread adoption.
To address these limitations, iterative retrieval has been proposed, which consistently updates re-
trieval results to satisfy the dynamic information needs that arise during the generation process (Feng
et al., 2023; Chen et al., 2024; Asai et al., 2023). Existing work often relies on few-shot prompting
and manually crafted rules to implement iterative retrieval (Jiang et al., 2023; Feng et al., 2023;
Wang et al., 2024a), which involves substantial human effort and additional computational over-
head during inference. Moreover, these methods overlook LLMs’ reasoning and decision-making
capabilities (Wei et al., 2023), leaving their potential for deciding when and what to retrieve untapped.
∗ Corresponding Author: Yang Feng.
1 Code is available at https://github.com/ictnlp/Auto-RAG.
Preprint
Figure 1: A concrete example of how Auto-RAG addresses complex multi-hop questions. Auto-
RAG engages in iterative reasoning, strategically plans retrievals, extracts relevant knowledge,
precisely identifies information needs, and refines the query for the next retrieval, ultimately con-
verging on the final answer. In this example, Auto-RAG terminates after five interactions with the
retriever, successfully yielding the correct answer.
To this end, we introduce Auto-RAG, an autonomous iterative retrieval model centered on the
LLM’s powerful decision-making capabilities. As shown in Figure 1, Auto-RAG models the in-
teraction between the LLM and the retriever through multi-turn dialogue. During iterative retrieval,
Auto-RAG employs reasoning for retrieval planning, extracting valuable external knowledge, iden-
tifying information needs, rewriting queries, and continuously querying the retriever for new in-
formation until it can adequately answer the user’s question. To empower LLMs with the ability
to make autonomous decisions during iterative retrieval, we develop a framework that automatically
synthesizes reasoning-based instructions for decision-making in iterative retrieval and use it to
fine-tune the latest open-source LLMs, such as Llama-3-8B-Instruct2 (Dubey et al., 2024).
We conduct experiments on six representative benchmarks, covering both open-domain QA
(Kwiatkowski et al., 2019; Joshi et al., 2017; Berant et al., 2013; Mallen et al., 2023) and multi-hop
QA (Ho et al., 2020; Yang et al., 2018). Experimental results demonstrate that, even with limited
training data, Auto-RAG delivers outstanding performance. Further analysis reveals that Auto-RAG
dynamically adjusts the number of iterations based on the complexity of the questions and the rele-
vance of the retrieved knowledge. Moreover, Auto-RAG expresses the iterative retrieval process in
natural language, thereby improving interpretability and offering a more intuitive user experience.
2 https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
2 RELATED WORK
3 METHOD
We conceptualize the iterative retrieval process as a multi-turn interaction between the LLM and the
retriever. The user’s query initiates a sequence of such interactions,
continuing until sufficient knowledge is acquired to generate a final answer. In each iteration, Auto-
RAG engages in meticulous reasoning based on the current state to ascertain whether additional
retrieval is required and what specific information to seek. Once sufficient information is acquired,
Auto-RAG ceases to generate new queries and delivers a final answer to the user.
We begin by formally delineating the objectives for reasoning-based instruction synthesis. For each
input-output pair (X, Y ) in the original dataset D, our goal is to curate instruction data collection,
DInst , that empowers LLMs to engage in reasoning and query refinement during iterative retrieval,
ultimately converging on the correct answer, which can be formally expressed as follows:
$$(X, Y) \rightarrow [X, R_0, (Q_t, D_t, R_t)_{1 \le t \le T}, A], \tag{1}$$
where T is the maximum number of iterations3 and R0 denotes the reasoning performed when only the user’s input
X is present. At the t-th iteration (1 ≤ t ≤ T ), if the previous iteration’s reasoning Rt−1 includes
an information need4 , the query Qt will be sampled, and the retriever will provide the document Dt
for Qt . The model will then generate the reasoning Rt for that iteration. If the previous reasoning
Rt−1 does not include an information need, the model is prompted to generate the final answer A.
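The synthesis objective above can be sketched as a simple loop. In the sketch below, the `llm` and `retriever` objects and their method names are hypothetical stand-ins for the paper's components; only the need-detection keywords are taken from the paper (footnote 4):

```python
# Illustrative sketch of the trajectory synthesis in Eq. (1). The `llm` and
# `retriever` interfaces are hypothetical; the need-detection keywords follow
# the paper's footnote 4.

NEED_MARKERS = ("however", "no information", "find", "refine")

def has_information_need(reasoning: str) -> bool:
    """An information need is signaled by any predefined marker term."""
    text = reasoning.lower()
    return any(marker in text for marker in NEED_MARKERS)

def synthesize_instance(x, llm, retriever, max_iters=5):
    """Build one trajectory [X, R0, (Q_t, D_t, R_t), A]."""
    reasoning = llm.reason(x, documents=None)          # R0: plan from the question alone
    trajectory = {"input": x, "r0": reasoning, "turns": []}
    for _ in range(max_iters):
        if not has_information_need(reasoning):
            break                                      # knowledge judged sufficient
        query = llm.refine_query(x, reasoning)         # Q_t: refined sub-query
        docs = retriever.search(query)                 # D_t: retrieved documents
        reasoning = llm.reason(x, documents=docs)      # R_t: reason over new evidence
        trajectory["turns"].append((query, docs, reasoning))
    trajectory["answer"] = llm.answer(x, trajectory["turns"])  # A: final answer
    return trajectory
```

In the paper, trajectories that do not converge on the correct answer are filtered before training; that step is omitted here.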
Next, we detail how the LLM is guided to perform such reasoning and query refine-
ment. Additionally, we will elucidate the methods utilized for data filtering and formatting.
To optimize efficiency and ensure coherence during iterative processes, it is essential to develop
a well-designed reasoning paradigm. Specifically, mirroring the human cognitive process during
retrieval, we propose that iterative retrieval should incorporate three distinct types of reasoning: (1)
Retrieval Planning, (2) Information Extraction, and (3) Answer Inference.
• (1) Retrieval Planning Upon receiving the user’s question, the LLM should explicitly identify the
knowledge necessary to address the query. Furthermore, upon receiving retrieved documents, the
LLM must evaluate whether further retrievals are needed and, if so, specify the precise information
to be sought next. Maintaining strategic planning throughout the retrieval process is crucial for
improving efficiency and mitigating the risk of losing direction midway (Wang et al., 2024a).
• (2) Information Extraction Upon receiving retrieved documents, the LLM should adeptly extract
relevant information essential for addressing the problem at hand. This human-like summarization
process bolsters the LLM’s capacity to filter out irrelevant information, thereby enhancing both its
efficiency and accuracy in processing external knowledge (Wei et al., 2023; Xu et al., 2024).
• (3) Answer Inference Once the LLM has gathered all pertinent knowledge required to address the
question, it should employ reasoning to formulate the final answer. This process enhances the LLM’s
ability to generate accurate responses based on available information, thereby mitigating the risk
of generating hallucinations (Wei et al., 2023).
3 During data synthesis, T is set to 10 for 2WikiMultihopQA and 5 for Natural Questions.
4 We predefined terms like "however," "no information," "find," and "refine" to signal the model’s information needs. If any appear in the output, they indicate an information need.
These three types of reasoning collectively constitute the Chain-of-Thought utilized during iterative
retrieval. To elicit such a reasoning process, we utilize few-shot prompting following Jiang et al.
(2023); Brown et al. (2020); Wei et al. (2023). It is noteworthy that steps (2) and (3) are typically
omitted upon the initial reception of the user’s question. Furthermore, if the retrieved information
is found to be entirely irrelevant, step (2) is also excluded. Such adjustments enable LLMs to make
informed judgments based on the actual context, rather than merely imitating demonstrations and
generating hallucinations. The prompt used to elicit reasoning is presented in Appendix C.1.
With an appropriate reasoning process, the LLM can iteratively refine the query based on the user input
and previous retrieval plan, continually adapting to new information requirements. To generate a
sufficiently diverse set of queries without being constrained by the query styles present in few-shot
prompts, we utilize a more flexible prompting methodology, as shown in Appendix C.5.
3.2 TRAINING
To equip an arbitrary LLM with the capability for autonomous decision-making in iterative retrieval,
we adopted a standard supervised fine-tuning strategy following Yoran et al. (2023); Jiang et al.
(2024). For each instance containing (xt , yt )0≤t≤T , the cross-entropy loss L can be calculated as:
$$L = -\sum_{0 \le t \le T} \log \Pr(y_t \mid x_{\le t}, y_{<t}), \tag{3}$$
where yt denotes the output at iteration t, x≤t represents the input up to the current iteration, and
y<t signifies the outputs from all preceding steps.
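The loss in Eq. (3) reduces to a standard masked cross-entropy once the dialogue is flattened into one token sequence. The following sketch is illustrative: the ignore-index convention for masking input positions is a common SFT practice assumed here, not a detail stated in the paper.

```python
import numpy as np

# Sketch of the training loss in Eq. (3): mean negative log-likelihood over the
# supervised output tokens y_t, with input positions x_t masked out via an
# ignore index (a common SFT convention; the exact setup is illustrative).

def multi_turn_sft_loss(logits, labels, ignore_index=-100):
    """logits: (seq_len, vocab); labels: (seq_len,); masked positions = ignore_index."""
    shifted = logits - logits.max(axis=-1, keepdims=True)           # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    mask = labels != ignore_index                                   # keep only y_t positions
    nll = -log_probs[np.arange(len(labels))[mask], labels[mask]]
    return nll.mean()
```

Averaging only over unmasked positions means retrieved documents and user inputs contribute context but no gradient signal of their own.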
3.3 INFERENCE
After training, Auto-RAG has acquired the ability to make reasoning-based autonomous decisions
during iterative retrieval, effectively discerning both when and what to retrieve. During each it-
eration, it suffices to provide Auto-RAG with input—whether user inquiries or retrieved docu-
ments—and to extract the planned actions designated by Auto-RAG for subsequent steps. Specif-
ically, in the 0-th iteration, Auto-RAG receives the user’s question as input and subsequently gen-
erates the reasoning and planning output y0. In the t-th iteration, if the output from the previous
iteration yt−1 includes a query q, this query is utilized for retrieval, and the retrieved documents dt
are then provided to Auto-RAG as input, resulting in the output for that iteration yt . Conversely,
if the output from the previous iteration yt−1 does not contain a query but instead presents a final
answer, the iteration is concluded, and the final answer is returned to the user.
Utilization of parametric knowledge. Due to the limitations of the retriever and the retrieval corpus,
Auto-RAG may fail to acquire the necessary knowledge to answer a question, resulting in perpet-
ual iterations. Furthermore, the parametric knowledge of the LLM may not be effectively utilized
during this process. To address this issue, we attempted to provide Auto-RAG with self-generated
documents or answers. If Auto-RAG has not terminated after interacting with the retriever for T
iterations, the generated query is used to prompt itself to create a document, which is subsequently
utilized as input for the next iteration. If Auto-RAG continues without termination after an addi-
tional TPK iterations, we follow Wang et al. (2024a) and provide the answer produced by Auto-RAG
without retrieval to the user. The prompt used to elicit parametric knowledge is shown in Appendix
C.4, the pseudocode representing the inference process is presented in Algorithm 2, and examples
of the synthesized instructions can be found in Appendix C.6. The experiments investigating the
order of external and parametric knowledge can be found in Appendix A.3.
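The inference procedure (Algorithm 2 in the paper) can be sketched as the loop below. This is a hypothetical sketch: the `llm`/`retriever` interfaces and the "Query:"/"Final Answer:" output formats are assumptions for illustration, not the paper's actual prompt formats.

```python
import re

# Hypothetical sketch of Auto-RAG inference: iterate with the retriever, fall
# back to self-generated (parametric) documents after T rounds, and to a
# no-retrieval answer after a further T_pk rounds. Output markers are assumed.

QUERY_RE = re.compile(r"Query:\s*(.+)", re.IGNORECASE)
ANSWER_RE = re.compile(r"Final Answer:\s*(.+)", re.IGNORECASE)

def auto_rag_inference(question, llm, retriever, T=5, T_pk=3):
    output = llm.generate(question)                  # 0-th iteration: reason and plan
    for step in range(T + T_pk):
        answer = ANSWER_RE.search(output)
        if answer:
            return answer.group(1).strip()           # final answer reached
        query = QUERY_RE.search(output)
        if query is None:
            break                                    # neither query nor answer: stop
        q = query.group(1).strip()
        if step < T:
            docs = retriever.search(q)               # normal retrieval round
        else:
            # parametric fallback: let the model write the document itself
            docs = [llm.generate(f"Write a passage answering: {q}")]
        output = llm.generate(question, documents=docs)
    return llm.generate(question)                    # no-retrieval fallback answer
```

The two-stage fallback keeps the loop bounded while still exploiting the model's parametric knowledge when retrieval fails.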
4 EXPERIMENTS
Table 1: Main results on six benchmarks. Auto-RAG consistently outperforms all baselines.
We fine-tuned the model on the synthesized instructions for five epochs to enhance its capacity for autonomous decision-making during
iterative retrieval. The distribution of iteration counts in the training data is illustrated in Figure 2.
To evaluate the effectiveness and robustness of Auto-RAG, we conducted assessments across six
datasets: NQ, 2Wiki, TriviaQA (TQA) (Joshi et al., 2017), PopQA (PQA) (Mallen et al., 2023),
HotpotQA (HQA) (Yang et al., 2018), and WebQuestions (WQ) (Berant et al., 2013). We employed
E5-base-v2 (Wang et al., 2024b) as the retriever and utilized the widely used Wikipedia dump from
December 2018 as the retrieval corpus (Karpukhin et al., 2020) following Jin et al. (2024). Given
the variations in base models, retrievers, and retrieval corpora employed by different RAG methods,
performing a fair comparison becomes challenging. Therefore, consistent with Jin et al. (2024), we
report results and metrics based on their reproduction under an identical experimental setup. We
present Exact Match (EM) for NQ, TQA, and WQ, and F1 scores for 2Wiki, PQA, and HQA, in
accordance with Jin et al. (2024). Hyperparameters are detailed in Appendix B.
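The EM and F1 metrics above can be computed as follows. This sketch uses the common SQuAD-style answer normalization (lowercasing, punctuation and article removal); the exact evaluation scripts of Jin et al. (2024) may differ in detail.

```python
import re
import string
from collections import Counter

# SQuAD-style Exact Match and token-level F1 for QA evaluation. The
# normalization here is the common convention, assumed for illustration.

def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)      # drop articles
    return " ".join(text.split())                    # collapse whitespace

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only exact (normalized) string matches, while F1 gives partial credit for token overlap, which is why the two are reported on different datasets.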
4.2 BASELINES
For baselines without retrieval (Naive Gen), we evaluated the performance of Llama-3-8B-Instruct.
Following Jin et al. (2024), we adopted a zero-shot setting. We consider Standard RAG for retrieval-
based baselines, where models generate answers based on documents retrieved by the user’s input.
The prompts used for Naive Gen and Standard RAG are shown in Appendix C.2. For single-time re-
trieval, we compare with RECOMP-abstractive (Xu et al., 2023) and Selective-Context (Li et al.,
2023), which optimize context selection; REPLUG (Shi et al., 2024), which enhances the gen-
erator’s performance; and IRCoT (Trivedi et al., 2023), which adopts a Chain-of-Thought (CoT)
process when reading and interpreting the retrieved documents. For multi-time (iterative) retrieval,
we compare Auto-RAG with the three methods most relevant to our approach:
FLARE (Jiang et al., 2023), Iter-RetGen (Feng et al., 2023), and Self-RAG (Asai et al., 2023).
Table 1 shows the main results across six benchmarks, demonstrating that Auto-RAG achieves supe-
rior performance across all datasets. Notably, Auto-RAG surpasses other iterative retrieval methods,
yielding significantly improved outcomes. While Iter-RetGen (Feng et al., 2023) relies on manually
defined retrieval content and the number of iterations, and FLARE (Jiang et al., 2023) determines re-
trieval timing through predefined rules (e.g., output probabilities), Auto-RAG distinguishes itself by
autonomously determining both when and what to retrieve, leading to superior overall performance.
Self-RAG (Asai et al., 2023) directly predicts reflection tokens to decide when to retrieve and eval-
uate the quality of the retrieved results. In contrast, Auto-RAG incorporates a reasoning process
at each iteration, enabling it to make more sophisticated and informed decisions. This reasoning
mechanism enhances Auto-RAG’s capacity to optimize retrieval strategies and autonomously
navigate complex tasks, resulting in improved performance across six benchmarks. Since variations
in base LLMs and different versions of Wikipedia can impact performance (Izacard et al., 2022),
to facilitate comparisons in future research, the results from other base models (such as the Llama-