DeepRAG: Thinking to Retrieval Step by Step for Large Language Models
Xinyan Guan1,2, Jiali Zeng3, Fandong Meng3, Chunlei Xin1,2, Yaojie Lu1, Hongyu Lin1, Xianpei Han1, Le Sun1, Jie Zhou3

1 Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Pattern Recognition Center, WeChat AI, Tencent Inc, China

{guanxinyan2022,chunlei2021,hongyu,luyaojie,xianpei,sunle}@iscas.ac.cn
{lemonzeng,fandongmeng,withtomzhou}@tencent.com
Abstract

Large Language Models (LLMs) have shown remarkable potential in reasoning, yet they still suffer from severe factual hallucination due to the timeliness, accuracy, and coverage of parametric knowledge. Meanwhile, integrating reasoning with retrieval-augmented generation (RAG) remains challenging due to ineffective task decomposition and redundant retrieval, which can introduce noise and degrade response quality. In this paper, we propose DeepRAG, a framework that models retrieval-augmented reasoning as a Markov Decision Process (MDP), enabling strategic and adaptive retrieval. By iteratively decomposing queries, DeepRAG dynamically determines whether to retrieve external knowledge or rely on parametric reasoning at each step. Experiments show that DeepRAG improves retrieval efficiency while improving answer accuracy by 21.99%, demonstrating its effectiveness in optimizing retrieval-augmented reasoning.

Figure 1: Correspondence between human thinking processes and DeepRAG, illustrated on the question "What is the total runtime of all movies in The Lord of the Rings?". Specifically, retrieval narrative ensures a structured and adaptive retrieval flow, generating subqueries informed by previously retrieved information, and atomic decisions dynamically determine whether to retrieve external knowledge or rely solely on parametric knowledge for each subquery.
1 Introduction

Large Language Models (LLMs) have demonstrated significant potential in reasoning (Plaat et al., 2024). However, limited by their capacity and capabilities, LLMs still suffer from severe factual hallucination problems due to the timeliness, accuracy, and coverage of parametric knowledge (Zhang et al., 2023; Huang et al., 2023). Retrieval-Augmented Generation (RAG) has been proposed as a promising paradigm to address this issue by integrating relevant information from knowledge bases or search engines, thereby improving the factuality of model responses (Zhao et al., 2024).

However, incorporating reasoning into retrieval-augmented generation still presents several challenges. One major issue is that complex queries often require multi-step decomposition to establish a coherent reasoning process (Radhakrishnan et al., 2023). Iterative retrieval has been proposed as a solution that continuously updates retrieval results to address the dynamic information needs arising during the generation process (Yue et al., 2024). However, LLMs often struggle to generate atomic and precise subqueries, which are critical for more effective retrieval (Wu et al., 2024). From the perspective of RAG, iterative retrieval should ideally generate the next atomic query based on the current question and the available information in an adaptive manner. Moreover, retrieval is not always necessary: some queries require external knowledge, while others can be answered solely by reasoning within the LLM, since LLMs have demonstrated the capability to serve as knowledge bases themselves (Petroni et al., 2019). Unnecessary retrieval, in addition to being redundant, can introduce noise, degrade generation quality, and increase inference latency (Chen et al., 2023; Tan et al., 2024; Bian et al., 2024).

To address this, inspired by the way humans search the Internet on demand, we propose
DeepRAG, a new framework designed to enhance reasoning ability in retrieval-augmented generation by modeling the process as a Markov Decision Process (MDP). The framework introduces two key components, retrieval narrative and atomic decisions, which together form a strategic and adaptive retrieval framework. As illustrated in Figure 1, retrieval narrative ensures a structured and adaptive retrieval flow, generating subqueries informed by previously retrieved information. For each subquery, atomic decisions dynamically determine whether to retrieve external knowledge or rely solely on the parametric knowledge of the LLM. To achieve this, we design a binary tree search method that explores the impact of atomic decisions on reasoning outcomes. Based on it, we first synthesize data for the LLM to learn the retrieval narrative, capturing the pattern of "subquery generation – atomic decision – intermediate answer" through imitation learning. Subsequently, we employ a chain of calibration approach to refine the model's understanding of its own knowledge boundaries, enabling it to make more accurate atomic decisions regarding the necessity of retrieval. By explicitly enhancing the LLM's ability to recognize its own knowledge boundaries, we can train an arbitrary model in an end-to-end manner, enabling it to dynamically determine when retrieval is necessary.

We conduct experiments on five open-domain QA datasets to validate the effectiveness of DeepRAG, including HotpotQA, 2WikiMultihopQA, and PopQA for multi-hop factual QA, CAG for time-sensitive QA, and WebQuestions for heterogeneous knowledge base QA. Experimental results demonstrate that DeepRAG significantly outperforms existing methods, achieving 21.99% higher answer accuracy while improving retrieval efficiency. Further analysis reveals that DeepRAG exhibits a stronger correlation between its retrieval decisions and parametric knowledge, indicating more effective knowledge boundary calibration.

2 Related Work

Adaptive Retrieval-Augmented Generation  Existing adaptive RAG approaches can be broadly categorized into three types: classifier-based methods (Cheng et al., 2024; Jeong et al., 2024), which require training an additional linear head for retrieval decisions; confidence-based methods (Jiang et al., 2023; Su et al., 2024; Dhole, 2025), which rely heavily on threshold-dependent uncertainty metrics; and LLM-based methods (Asai et al., 2023; Zhang et al., 2024), which let the model generate retrieval decisions but often fail to accurately recognize their knowledge boundaries, making it unreliable to delegate retrieval timing decisions to the model. Our method leverages the inherent generative capabilities of LLMs to explore knowledge boundaries in RAG settings. This design maintains the model's native generation abilities while eliminating the need for additional parameters or unreliable uncertainty metrics.

Reasoning in Retrieval-Augmented Generation  Recent advances in RAG have increasingly focused on incorporating reasoning capabilities. Self-RAG (Asai et al., 2023) and Auto-RAG (Yu et al., 2024) leverage automatic data synthesis to enhance reasoning within retrieval-augmented frameworks. Search-o1 (Li et al., 2025) incorporates retrieval into inference to construct an agentic system, though its applicability is limited to o1-like large reasoning models. AirRAG (Feng et al., 2025) combines Monte Carlo Tree Search and self-consistency. In contrast to these approaches, which rely heavily on extensive retrieval operations or large reasoning models, DeepRAG provides an end-to-end method that enables an arbitrary model to think to retrieval step by step on demand.

Knowledge Boundary  LLMs struggle to accurately distinguish between what they know and what they don't know (Yin et al., 2023; Kapoor et al., 2024a; Yin et al., 2024). Additional fine-tuning (Kapoor et al., 2024b) or precise probing (Cheng et al., 2024) is typically required to calibrate the model's cognition. Our approach explores knowledge boundaries directly in RAG settings.

3 Thinking to Retrieval Step by Step

In this section, we introduce our proposed method, DeepRAG. At its core, DeepRAG treats the process of question decomposition, atomic decisions, and final answer generation as a Markov Decision Process (MDP).
Figure 2: An overview of DeepRAG. Our framework comprises three steps: (1) Binary Tree Search, (2) Imitation Learning, and (3) Chain of Calibration. Given a dataset, we first employ binary tree search to synthesize data for imitation learning, enabling the model to learn retrieval patterns. Subsequently, we use binary tree search to construct preference data for further calibrating the LLM's awareness of its knowledge boundaries.
As shown in Figure 2, our framework comprises three key steps: 1) Binary Tree Search, which constructs a binary tree for each subquery related to the given question, exploring paths based on either parametric knowledge or the external knowledge base; 2) Imitation Learning, which extracts the reasoning process that arrives at the correct final answer with minimum retrieval cost; and 3) Chain of Calibration, which calibrates the LLM's internal knowledge by calibrating each atomic decision. Specifically, given a set of supervised datasets, we first employ binary tree search to synthesize data for imitation learning, enabling the model to learn effective retrieval patterns. Subsequently, we use binary tree search to construct preference data for further calibrating the LLM's awareness of its knowledge boundaries. In the following subsections, we describe each component of DeepRAG in detail.

3.1 Overview of the MDP Modeling

We formalize the step-by-step reasoning process for retrieval-augmented generation as a Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R)$, which comprises a set of states $\mathcal{S}$, actions $\mathcal{A}$, transition dynamics $P$, and a reward function $R$.

States. At each step $t$, the state $s_t \in \mathcal{S}$ represents the partial solution to the original question. We denote $s_t = (x, (q_1, r_1), \ldots, (q_t, r_t))$, where $x$ is the input question, and $(q_i, r_i)$ captures the $i$-th subquery along with its intermediate answer (and any retrieved documents).

Actions. At state $s_t$, the model selects an action $a_{t+1} = (\sigma_{t+1}, \delta_{t+1}) \in \mathcal{A}$, which consists of two sub-decisions:
1. Termination decision: Given the partial solution $s_t$, the model makes a binary decision $\sigma_{t+1} \in \{\text{continue}, \text{terminate}\}$ to determine whether to proceed with generating the next subquery $q_{t+1}$ or to finalize the answer $o$.
2. Atomic decision: For each subquery $q_{t+1}$, the model decides whether to retrieve external knowledge or rely solely on its parametric knowledge. Formally, this decision is represented as $\delta_{t+1} \in \{\text{retrieve}, \text{parametric}\}$.

Transitions. After executing the action $a_{t+1} = (\sigma_{t+1}, \delta_{t+1})$ in state $s_t$, the environment updates the state to $s_{t+1}$. Specifically, if $\sigma_{t+1} = \text{terminate}$, the process concludes by generating the final answer $o$, resulting in the terminal state $s_{t+1} = (x, (q_1, r_1), \ldots, (q_t, r_t), o)$. Otherwise, the model generates the next subquery $q_{t+1}$. If $\delta_{t+1} = \text{retrieve}$, the model retrieves documents $d_{t+1}$ and generates an intermediate answer $ia_{t+1}$ for subquery $q_{t+1}$; otherwise, it relies on parametric knowledge to generate the intermediate answer. The response $r_{t+1}$ is set to $[d_{t+1}, ia_{t+1}]$ if retrieval is used, or $ia_{t+1}$ otherwise. The updated state is $s_{t+1} = (x, (q_1, r_1), \ldots, (q_{t+1}, r_{t+1}))$.

Rewards. The reward function evaluates the state based on answer correctness and retrieval cost, and is applied only after generating the final answer $o$. Formally, $R(s_{t+1} = s_t + [o]) = -C(o) \times T(s_t)$, where $C(o)$ indicates correctness ($1$ if correct, $\infty$ otherwise), and $T(s_t)$ represents the total retrieval cost in state $s_t$.
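For concreteness, the MDP components can be sketched in a few lines of code. The following is a minimal Python illustration (not the authors' implementation), under the assumption that the retrieval cost $T(s_t)$ simply counts how many subqueries were answered with retrieval; the names `Step`, `State`, and `reward` are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import math

@dataclass
class Step:
    subquery: str                     # q_i
    answer: str                       # intermediate answer ia_i
    documents: Optional[list] = None  # retrieved docs d_i, None if parametric

@dataclass
class State:
    question: str                                      # x
    steps: List[Step] = field(default_factory=list)    # (q_1, r_1), ..., (q_t, r_t)
    final_answer: Optional[str] = None                  # o, set on termination

def retrieval_cost(state: State) -> int:
    """T(s_t): here, the number of subqueries answered with retrieval."""
    return sum(1 for step in state.steps if step.documents is not None)

def reward(state: State, is_correct: bool) -> float:
    """R = -C(o) * T(s_t): negative retrieval cost if the final answer is correct,
    effectively -infinity otherwise."""
    return -retrieval_cost(state) if is_correct else -math.inf
```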
3.2 Binary Tree Search

In Section 3.1, we model the step-by-step reasoning process as a Markov decision process, where the LLM iteratively decomposes a given question into subqueries, each derived from previously acquired information. The detailed generation instruction is outlined in Appendix A.1, with the answer format presented below.

Building on this formulation, we implement a binary tree search to construct reasoning paths that integrate different retrieval strategies for each subquery. As illustrated in Figure 2, given a question, the model generates the i-th subquery and explores two answering strategies: directly leveraging parametric knowledge (blue node) or retrieving external documents (green node). This approach not only decomposes the question into a sequence of forward-dependent subqueries but also thoroughly examines the influence of retrieval choices on the final answer.

Answer format
Question: <Question>
Follow up: <Subquery1>
Let's search the question in Wikipedia.
Context: <Paragraph Text>
Intermediate answer: <Intermediate Answer1>
Follow up: <Subquery2>
Intermediate answer: <Intermediate Answer2>
......
So the final answer is: <Answer>

3.3 Imitation Learning

In this section, we present an algorithm that leverages binary trees to identify the optimal reasoning process that leads to the correct final answer while minimizing retrieval costs, which helps language models gain the capacity for adaptive inference-time compute generation.

Algorithm 1 Data Construction for Stage I
Require: Question x, answer y, language model M, retriever R, max history length T
Ensure: Optimal reasoning process s* or null
 1: Initialize priority queue PQ ← {([x], 0)}        ▷ (trajectory, retrieval count)
 2: while PQ is not empty do
 3:     (h, r) ← PQ.dequeue()                        ▷ Get trajectory with lowest retrieval count
 4:     q ← M(h)                                     ▷ Subquery generation
 5:     if ShouldAnswer(q) or length(h) > T then
 6:         o ← M(h, q)                              ▷ Final answer
 7:         if IsEqual(o, y) then return h
 8:     else
 9:         a ← M(h, q)                              ▷ Direct answer
10:         PQ.enqueue(([h, (q, a)], r))
11:         d ← R(q)                                 ▷ Retrieve document
12:         a ← M(h, q, d)                           ▷ Retrieved answer
13:         PQ.enqueue(([h, (q, (d, a))], r + 1))
14: return null

Training Objective  Specifically, we implement a masked loss function over the retrieved documents to prevent the model from learning irrelevant or noisy text that could negatively impact its performance. In this way, we expect the model to enhance its ability to decompose subqueries and retrieve on demand. For each instance, the loss function is formulated as follows:

$$\mathcal{L} = -\sum_{1 \le i \le n} \log\left[\Pr(q_i \mid s_{i-1}) + \Pr(a_i \mid s_{i-1}, q_i, d_i)\right]$$
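The key implementation detail is that retrieved-document tokens are excluded from the loss. The following is a minimal PyTorch-style sketch of masking those tokens out of a standard causal-LM cross-entropy loss; the `doc_mask` convention and the use of a plain next-token loss are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def masked_imitation_loss(logits: torch.Tensor,
                          labels: torch.Tensor,
                          doc_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over subquery and answer tokens only.

    logits:   (seq_len, vocab) next-token predictions
    labels:   (seq_len,) target token ids
    doc_mask: (seq_len,) bool, True for tokens belonging to retrieved documents,
              which are excluded so the model is not trained to reproduce
              potentially noisy retrieved text.
    """
    # Standard causal-LM shift: position t predicts token t+1.
    shift_logits = logits[:-1]
    shift_labels = labels[1:].clone()
    shift_doc_mask = doc_mask[1:]

    # Ignore retrieved-document tokens in the loss.
    shift_labels[shift_doc_mask] = -100
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```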
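For readers who prefer code to pseudocode, the priority-queue search in Algorithm 1 above (which is also reused in the next stage to pick optimal paths) can be sketched as follows. The helpers `generate_subquery`, `answer`, `retrieve`, and `is_equal` are hypothetical stand-ins for calls to the language model M and retriever R, and termination is modeled as the generator returning `None`.

```python
import heapq
import itertools

def search_min_retrieval_path(question, gold_answer, generate_subquery, answer,
                              retrieve, is_equal, max_steps=8):
    """Return the first trajectory that reaches the gold answer, exploring
    trajectories in order of increasing retrieval count (cf. Algorithm 1)."""
    counter = itertools.count()             # tie-breaker so heapq never compares trajectories
    pq = [(0, next(counter), [question])]   # (retrieval count, tie, trajectory)
    while pq:
        cost, _, history = heapq.heappop(pq)        # lowest retrieval count first
        subquery = generate_subquery(history)
        if subquery is None or len(history) > max_steps:   # model decides to terminate
            final = answer(history)
            if is_equal(final, gold_answer):
                return history + [final]
        else:
            # Branch 1: answer the subquery from parametric knowledge (no extra cost).
            direct = answer(history, subquery)
            heapq.heappush(pq, (cost, next(counter), history + [(subquery, direct)]))
            # Branch 2: answer the subquery with retrieved documents (cost + 1).
            docs = retrieve(subquery)
            grounded = answer(history, subquery, docs)
            heapq.heappush(pq, (cost + 1, next(counter), history + [(subquery, (docs, grounded))]))
    return None
```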
3.4 Chain of Calibration

Synthesizing Preference Data  First, we identify an optimal path with minimal retrieval based on Algorithm 1, using the model trained in Stage I. This provides the optimal atomic decision for each subquery, determining whether retrieval is necessary. From this path, we construct preference pairs for each subquery to indicate the preferred retrieval choice. For example, in Figure 2, the optimal path may suggest answering the first subquery using parametric knowledge while requiring document retrieval for the second. Accordingly, we generate preference pairs: one favoring parametric knowledge over retrieval for the first subquery, and another favoring retrieval over parametric knowledge for the second. This process enables the LLM to learn when to retrieve external information, thereby maximizing the use of parametric knowledge while minimizing unnecessary retrieval.

Chain of Calibration Objective  We fine-tune the LLM with a Chain of Calibration objective on the synthesized preference data. Given the $i$-th subquery and a state $s_i = [x, q_1, r_1, \cdots, q_{i-1}, r_{i-1}]$, we have two distinct intermediate answers $r_i^1 = a_i^1$ and $r_i^2 = (d_i, a_i^2)$. Based on the process above, we know which $r_i$ is preferred. As a result, the training objective can be formulated as follows:

$$\mathcal{L} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid s_i, q_i)}{\pi_{\mathrm{ref}}(y_w \mid s_i, q_i)} - \beta \log \frac{\pi_\theta(y_l \mid s_i, q_i)}{\pi_{\mathrm{ref}}(y_l \mid s_i, q_i)}\right)$$
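This objective has the form of a DPO-style preference loss over the preferred continuation $y_w$ and the rejected continuation $y_l$ for a subquery. A minimal sketch of the loss term, assuming per-sequence log-probabilities have already been computed under the policy and a frozen reference model (function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def calibration_loss(policy_logp_w: torch.Tensor,   # log pi_theta(y_w | s_i, q_i)
                     policy_logp_l: torch.Tensor,   # log pi_theta(y_l | s_i, q_i)
                     ref_logp_w: torch.Tensor,      # log pi_ref(y_w | s_i, q_i)
                     ref_logp_l: torch.Tensor,      # log pi_ref(y_l | s_i, q_i)
                     beta: float = 0.1) -> torch.Tensor:
    """-log sigma(beta * log-ratio(chosen) - beta * log-ratio(rejected))."""
    chosen_ratio = policy_logp_w - ref_logp_w
    rejected_ratio = policy_logp_l - ref_logp_l
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```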
4 Experiments

4.1 Datasets

The in-distribution datasets are HotpotQA (Yang et al., 2018) and 2WikiMultihopQA (Ho et al., 2020); the out-of-distribution datasets consist of CAG (Pan et al., 2024), PopQA (Mallen et al., 2022), and WebQuestions (Berant et al., 2013). Specifically, we employ the time-sensitive subset of CAG to evaluate temporal reasoning capabilities. Furthermore, WebQuestions is built upon Freebase to assess model robustness when information may be absent from the knowledge base.

4.2 Baselines

We use the following baselines to evaluate performance: CoT (Wei et al., 2022) and CoT*, which employ 8-shot examples extracted from the training dataset; the asterisk (*) indicates that the model was trained using the same data employed for training DeepRAG. CoT-Retrieve and CoT-Retrieve* augment the eight in-context examples with relevant documents retrieved based on the query. IterDRAG (Yue et al., 2024) decomposes the question and answers step by step based on in-context learning. UAR (Cheng et al., 2024) employs a trained classifier to determine when retrieval is necessary. FLARE (Jiang et al., 2023) and DRAGIN (Su et al., 2024) are confidence-based methods that decide the timing of retrieval based on token importance and uncertainty. TAARE (Zhang et al., 2024) allows the LLM itself to determine when retrieval is needed. Auto-RAG (Yu et al., 2024) uses trained models to iteratively decompose questions and retrieve relevant documents for answering.

Types Methods | HotpotQA (EM, F1) | 2WikiMultihopQA (EM, F1) | CAG (EM, F1) | PopQA (EM, F1) | WebQuestions (EM, F1) | Avg
(HotpotQA and 2WikiMultihopQA are in-distribution; CAG, PopQA, and WebQuestions are out-of-distribution.)
Llama-3-8B
CoT 27.20 37.75 28.20 34.85 7.17 10.41 21.20 25.33 25.20 40.56 25.79
CoT-Retrieve 34.90 46.85 35.80 43.41 55.45 64.08 32.80 45.87 22.90 39.22 42.13
CoT* 21.80 31.69 25.60 30.89 5.30 7.58 23.10 25.31 26.80 40.20 23.83
Reasoning CoT-Retrieve* 22.50 32.15 23.70 29.21 44.86 55.69 38.70 45.64 17.60 29.20 33.93
IterDRAG 23.20 30.95 19.60 24.80 38.32 46.18 22.70 34.53 15.90 26.79 28.30
Auto-RAG 25.80 36.09 23.00 30.09 49.22 59.61 27.80 42.02 17.40 32.94 34.40
FLARE 23.80 32.88 30.30 37.45 34.89 43.45 28.80 40.61 28.80 40.61 34.16
DRAGIN 27.60 38.05 29.10 35.68 4.05 7.18 22.60 28.53 21.20 38.72 25.27
Adaptive UAR 29.70 40.66 34.80 42.40 52.96 61.53 33.00 45.95 22.70 39.10 40.28
TAARE 30.60 41.43 35.20 42.85 52.96 61.59 33.20 46.01 23.40 39.56 40.68
DeepRAG-Imi 35.10 46.59 47.20 52.33 50.47 59.55 43.60 48.50 30.00 41.76 45.38
Ours DeepRAG 40.70 51.54 48.10 53.25 52.96 61.92 42.50 47.80 32.70 45.24 47.67
Qwen-2.5-7B
CoT 18.90 27.81 23.40 28.97 3.12 5.71 15.20 19.20 18.30 34.86 19.55
CoT-Retrieve 24.90 34.78 18.60 23.44 41.43 51.47 27.30 41.20 15.10 29.84 30.81
Reasoning CoT* 17.60 26.15 25.10 29.62 3.12 5.62 7.90 11.06 15.60 32.45 17.42
CoT-Retrieve* 23.40 32.29 22.40 27.51 43.30 54.51 26.60 35.46 13.80 25.60 30.49
IterDRAG 13.70 26.84 9.30 20.47 21.81 39.59 18.00 31.44 12.50 26.95 22.06
FLARE 23.40 32.06 21.80 26.51 34.89 42.62 19.00 28.24 16.10 31.89 27.65
DRAGIN 16.70 24.60 12.40 16.76 3.43 5.45 12.00 15.80 17.40 32.43 15.70
Adaptive UAR 24.50 34.22 23.90 28.20 34.89 43.92 27.00 40.47 16.60 32.28 30.60
TAARE 25.30 35.03 21.30 25.67 40.81 50.78 27.00 40.92 18.20 33.14 31.81
DeepRAG-Imi 30.40 39.44 32.00 38.32 47.98 56.99 37.50 40.72 23.90 38.62 38.59
Ours DeepRAG 32.10 41.14 40.40 44.87 51.09 59.76 40.60 43.19 24.20 38.83 41.62
Table 1: The overall experimental results of DeepRAG and other baselines on five benchmarks. The best/second best
scores in each dataset are bolded/underlined. DeepRAG-Imi (Stage I) and DeepRAG (Stage II) both demonstrate
superior performance compared to existing methods across all test scenarios.
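Table 1 reports exact match (EM) and token-level F1. For reference, these are the standard open-domain QA metrics, computed roughly as in the following sketch; SQuAD-style answer normalization is an assumption, since the paper does not specify its exact scoring script.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> int:
    return int(normalize(prediction) == normalize(gold))

def f1(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```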
…retrieval methods, highlighting the mismatch between its internal knowledge and its verbalized knowledge. Moreover, aggressive fine-tuning approaches like CoT* and CoT-Retrieve* can actually degrade model performance by forcing the model to learn knowledge beyond its natural capabilities. In contrast, our approach carefully preserves model capabilities during fine-tuning by leveraging self-synthesized data, effectively preventing additional hallucination while maintaining performance.

5 Analysis

5.1 Retrieval Efficiency

To demonstrate the efficiency of our method, we compare the average number of retrievals on 2WikiMultihopQA and WebQuestions. As shown in Table 2, we have the following observations: 1) DeepRAG achieves higher accuracy with relatively lower retrieval costs, attributed to its dynamic usage of internal knowledge. 2) Confidence-based approaches demonstrate limited robustness across datasets; for instance, both FLARE and DRAGIN fail to trigger retrieval under the default confidence threshold in WQ. 3) Iterative retrieval-based approaches typically require numerous retrieval operations. Therefore, efficient adaptive retrieval methods like DeepRAG become crucial for optimizing resource utilization while maintaining performance.

Dataset  Method        EM     Avg. Retrievals (All / Correct / Incorrect)
2WMQA    FLARE         30.30  0.99 / 1.00 / 0.99
2WMQA    DRAGIN        29.10  1.03 / 1.03 / 1.03
2WMQA    UAR           34.80  0.81 / 0.68 / 0.89
2WMQA    TAARE         35.20  0.93 / 0.93 / 0.97
2WMQA    IterDRAG      19.60  2.46 / 2.49 / 2.45
2WMQA    Auto-RAG      23.00  6.26 / 4.13 / 1.81
2WMQA    DeepRAG-Imi   47.20  1.13 / 0.95 / 1.28
2WMQA    DeepRAG       48.10  1.09 / 0.92 / 1.25
WQ       FLARE         28.80  0.00 / 0.00 / 0.00
WQ       DRAGIN        21.20  0.00 / 0.00 / 0.00
WQ       UAR           22.70  0.96 / 0.95 / 0.97
WQ       TAARE         23.40  0.66 / 0.65 / 0.66
WQ       IterDRAG      15.90  2.25 / 2.16 / 2.27
WQ       Auto-RAG      17.40  4.52 / 3.03 / 2.35
WQ       DeepRAG-Imi   30.00  0.43 / 0.13 / 0.56
WQ       DeepRAG       32.70  0.28 / 0.12 / 0.36

Table 2: Retrieval frequency analysis on 2WikiMultihopQA (2WMQA) and WebQuestions (WQ) across different adaptive retrieval methods. "Correct" indicates the average number of retrievals for instances where the model produced correct answers, while "Incorrect" represents the average retrievals for cases with incorrect answers.

Method        F1     Acc    Balanced Acc  MCC
FLARE         0.000  0.718  0.500         0.000
DRAGIN        0.007  0.709  0.495         -0.045
UAR           0.481  0.756  0.648         0.341
TAARE         0.127  0.712  0.518         0.078
Iter-DRAG     0.000  0.718  0.500         0.000
Auto-RAG      0.000  0.718  0.500         0.000
DeepRAG-Imi   0.580  0.732  0.709         0.393
DeepRAG       0.621  0.749  0.743         0.451
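The table above reports how well each method's retrieval decisions align with the model's knowledge boundary. Assuming a binary label per instance for "retrieval is actually needed" and a binary prediction "the method chose to retrieve" (the exact labeling convention is not specified here and is an assumption for illustration), the metrics can be computed as in the following sketch:

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, matthews_corrcoef)

def knowledge_boundary_metrics(needs_retrieval: list[int], did_retrieve: list[int]) -> dict:
    """Compare retrieve/skip decisions against whether retrieval was actually needed.

    needs_retrieval: 1 if the model's parametric answer alone is wrong, else 0
    did_retrieve:    1 if the method decided to retrieve for that instance, else 0
    """
    return {
        "F1": f1_score(needs_retrieval, did_retrieve),
        "Acc": accuracy_score(needs_retrieval, did_retrieve),
        "Balanced Acc": balanced_accuracy_score(needs_retrieval, did_retrieve),
        "MCC": matthews_corrcoef(needs_retrieval, did_retrieve),
    }
```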
…hinder model performance due to long context or irrelevant knowledge, making internal knowledge the more reliable choice.

5.4 Question Decomposition Effectiveness

We systematically analyze the effectiveness of question decomposition in retrieval narrative. As shown in Figure 3, we present the distribution of subquery counts and retrieval attempts for different questions. Most questions require 3-5 decomposition steps, while retrieval attempts are primarily concentrated within 0-2 rounds. This demonstrates that DeepRAG effectively decomposes questions while minimizing redundant retrieval.

Figure 3: (a) Subquery Statistics. (b) Retrieval Statistics.

Moreover, we analyze the average counts of WH-words, nouns, verbs, and conjunctions in subqueries, as shown in Figure 4. The results indicate that DeepRAG decomposes atomic queries with fewer pronouns and conjunctions.

Figure 4: Average counts of WH-words, nouns, verbs, and conjunctions (and/or) per subquery.

Method        ID (F1)  CAG (EM)  PopQA (EM)  WebQuestions (EM)  Avg
DeepRAG-Imi   49.46    50.47     43.60       30.00              44.60
most          47.31    51.09     31.30       28.00              41.12
random        44.76    51.40     34.80       27.10              40.56

Table 4: Experiment results of the ablation study on the Imitation Learning stage. ID refers to the average score over the two in-distribution datasets, HotpotQA and 2WikiMultihopQA.

Method          ID (F1)  CAG (EM)  PopQA (EM)  WebQuestions (EM)  Avg
DeepRAG         52.40    61.92     47.80       45.24              47.67
all-node        50.92    50.47     41.50       32.70              45.30
sentence-wise   30.16    12.46     20.00       12.90              21.14

Table 5: Experiment results of the ablation study on the Chain of Calibration stage.

Imitation Learning  As shown in Table 4, DeepRAG-Imi enables the model to learn knowledge boundaries during the imitation learning stage. Notably, CAG performs relatively poorly at this stage due to its time-sensitive nature, which necessitates constant retrieval of up-to-date information. Moreover, as illustrated in Figure 6(a), DeepRAG-Imi achieves lower retrieval costs and higher average performance compared to both the maximum-retrieval-cost and random selection methods.

Chain of Calibration  We compare our default approach of constructing preferences based on nodes from optimal paths against two alternatives: constructing pairs for all nodes, and constructing sentence-level partial-order pairs based on retrieval efficiency. As shown in Table 5, DeepRAG demonstrates significant advantages over both variants. Specifically, as illustrated in Figure 6(b), DeepRAG achieves lower retrieval costs while maintaining higher average performance. In contrast, the sentence-level partial-order pairs learned incorrect preferences, resulting in over-reliance on internal knowledge and consequently leading to both low retrieval costs and poor performance.
Figure 5: Comparative analysis of retrieval strategies: parametric only or retrieve only.

Figure 6: Average score and retrievals on the ablation study for Imitation Learning and Chain of Calibration.
Figure 7: Case Study: Auto-RAG vs. DeepRAG, on the question "What is the place of birth of the director of film Peter's Friends?". Auto-RAG misreads the retrieved document, attributes the film to Richard Curtis, and answers "New Zealand"; DeepRAG decomposes the question into atomic subqueries, identifies Kenneth Branagh from retrieval, answers the second subquery ("Belfast, Northern Ireland") from parametric knowledge, and returns the correct final answer "Belfast". DeepRAG achieves success through atomic query decomposition, faithful intermediate answers, and adaptive use of internal knowledge.
…performance over QwQ and gpt-4o, particularly in time-sensitive QA tasks. Notably, while DeepRAG does not surpass gpt-4o in some cases, it achieves comparable performance levels. These results demonstrate that DeepRAG not only effectively recognizes its knowledge boundaries but also adapts well to time-sensitive scenarios.

Models           ID (F1)  CAG (EM)  PopQA (EM)  WQ (EM)  Avg
QwQ-32B          31.43    3.43      10.60       15.10    18.40
gpt-4o-turbo     60.6     23.36     43.50       25.35    42.68
DeepRAG-qwen     43.00    51.09     40.60       24.20    40.38
DeepRAG-llama    52.40    52.96     42.50       32.70    46.59

Table 6: Performance against strong baseline models.

5.7 Case Study

As illustrated in Figure 7, we conduct a case study comparing DeepRAG with Auto-RAG (Yu et al., 2024), a closely related method that utilizes iterative retrieval for retrieval-augmented generation. For each subquery, Auto-RAG retrieves relevant documents and generates a corresponding subanswer. This approach is not only time-consuming but also fails when no relevant documents are retrieved. Although Auto-RAG attempts to address this issue using its own relevant documents, it falls into endless loops in most cases. In contrast, DeepRAG iteratively generates subqueries and determines whether to use internal knowledge at each iteration. The binary tree search data synthesis method used for optimization ensures reliable subquery generation, intermediate answers, and final answers. Even when no related information exists in the retrieved documents, the model is directed to provide a final answer based on internal knowledge.

6 Conclusion

In this paper, we present DeepRAG, a simple yet effective approach that enhances an LLM's awareness of its retrieval requirements through self-calibration. Our method decomposes queries into subqueries and uses binary tree search for data synthesis to help models better understand their knowledge boundaries. Experimental results across various QA tasks demonstrate that DeepRAG significantly improves the accuracy and efficiency of retrieval-augmented generation.

References
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544.

Ning Bian, Hongyu Lin, Peilin Liu, Yaojie Lu, Chunkang Zhang, Ben He, Xianpei Han, and Le Sun. 2024. Influence of external information on large language models mirrors social cognitive patterns. IEEE Transactions on Computational Social Systems.

Hung-Ting Chen, Fangyuan Xu, Shane Arora, and Eunsol Choi. 2023. Understanding retrieval augmentation for long-form question answering. arXiv preprint arXiv:2310.12150.

Qinyuan Cheng, Xiaonan Li, Shimin Li, Qin Zhu, Zhangyue Yin, Yunfan Shao, Linyang Li, Tianxiang Sun, Hang Yan, and Xipeng Qiu. 2024. Unified active retrieval for retrieval augmented generation. arXiv preprint arXiv:2406.12534.

Wikipedia contributors. 2025. Phi coefficient — Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Phi_coefficient. Accessed: 2025-01-22.

Kaustubh D. Dhole. 2025. To retrieve or not to retrieve? Uncertainty detection for dynamic retrieval augmented generation. Preprint, arXiv:2501.09292.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Wenfeng Feng, Chuzhan Hao, Yuewei Zhang, Jingyi Song, and Hao Wang. 2025. AirRAG: Activating intrinsic reasoning for retrieval augmented generation via tree-based search. Preprint, arXiv:2501.10053.

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. 2023. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems.

Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park. 2024. Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. arXiv preprint arXiv:2403.14403.

Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active retrieval augmented generation.

Sanyam Kapoor, Nate Gruver, Manley Roberts, Katherine Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, and Andrew Gordon Wilson. 2024a. Large language models must be taught to know what they don't know. arXiv preprint arXiv:2406.08391.

Sanyam Kapoor, Nate Gruver, Manley Roberts, Arka Pal, Samuel Dooley, Micah Goldblum, and Andrew Wilson. 2024b. Calibration-tuning: Teaching large language models to know what they don't know. In Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024), pages 1–14, St Julians, Malta. Association for Computational Linguistics.

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. 2025. Search-o1: Agentic search-enhanced large reasoning models. arXiv preprint arXiv:2501.05366.

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511.

OpenAI. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/. Online; accessed 22 January 2025.

Ruotong Pan, Boxi Cao, Hongyu Lin, Xianpei Han, Jia Zheng, Sirui Wang, Xunliang Cai, and Le Sun. 2024. Not all contexts are equal: Teaching LLMs credibility-aware generation. arXiv preprint arXiv:2404.06809.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.

Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, and Thomas Back. 2024. Reasoning with large language models, a survey. arXiv preprint arXiv:2407.11511.

Ansh Radhakrishnan, Karina Nguyen, Anna Chen, Carol Chen, Carson Denison, Danny Hernandez, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, et al. 2023. Question decomposition improves the faithfulness of model-generated reasoning. arXiv preprint arXiv:2307.11768.

Weihang Su, Yichen Tang, Qingyao Ai, Zhijing Wu, and Yiqun Liu. 2024. DRAGIN: Dynamic retrieval augmented generation based on the real-time information needs of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12991–13013, Bangkok, Thailand. Association for Computational Linguistics.
Hexiang Tan, Fei Sun, Wanli Yang, Yuanzhuo Wang, Qi Cao, and Xueqi Cheng. 2024. Blinded by generated contexts: How language models merge generated and retrieved contexts for open-domain QA? arXiv preprint arXiv:2401.11911.

Qwen Team. 2024. QwQ: Reflect deeply on the boundaries of the unknown.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Zhuofeng Wu, He Bai, Aonan Zhang, Jiatao Gu, VG Vydiswaran, Navdeep Jaitly, and Yizhe Zhang. 2024. Divide-or-conquer? Which part should you distill your LLM? arXiv preprint arXiv:2402.15000.

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zhihao Fan. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.

Xunjian Yin, Xu Zhang, Jie Ruan, and Xiaojun Wan. 2024. Benchmarking knowledge boundary for large language model: A different perspective on model evaluation. arXiv preprint arXiv:2402.11493.

Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. 2023. Do large language models know what they don't know? arXiv preprint arXiv:2305.18153.

Tian Yu, Shaolei Zhang, and Yang Feng. 2024. Auto-RAG: Autonomous retrieval-augmented generation for large language models.

Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, and Michael Bendersky. 2024. Inference scaling for long-context retrieval augmented generation. arXiv preprint arXiv:2410.04343.

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. 2023. Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.

Zihan Zhang, Meng Fang, and Ling Chen. 2024. RetrievalQA: Assessing adaptive retrieval-augmented generation for short-form open-domain question answering. arXiv preprint arXiv:2402.16457.

Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. 2024. Retrieval-augmented generation for AI-generated content: A survey. arXiv preprint arXiv:2402.19473.

A Templates

A.1 DeepRAG Construct Instruction

Instruction: You are a helpful Retrieve-Augmented Generation (RAG) model. Your task is to answer questions by logically decomposing them into clear sub-questions and iteratively addressing each one.

Use "Follow up:" to introduce each sub-question and "Intermediate answer:" to provide answers.

For each sub-question, decide whether you can provide a direct answer or if additional information is required. If additional information is needed, state, "Let's search the question in Wikipedia." and then use the retrieved information to respond comprehensively. If a direct answer is possible, provide it immediately without searching.

B Detailed Analysis

B.1 Retrieval Efficiency

To demonstrate the efficiency of our method, we compare the average number of retrievals on 2WikiMultihopQA and WebQuestions. As shown in Table 2, we have the following observations:

1) Compared to other adaptive retrieval methods, DeepRAG can achieve higher accuracy with relatively lower retrieval costs. This can be attributed to our dynamic usage of internal knowledge. Additionally, DeepRAG exhibits a positive trend in exploring relevant evidence when faced with insufficient retrieval results, as evidenced by the lower average retrieval numbers in both 2WMQA (0.92 compared to 1.25) and WQ (0.12 compared to 0.36).
2) Confidence-based approaches demonstrate limited robustness across datasets. For instance, while using identical thresholds, both FLARE and DRAGIN show inconsistent behavior: they trigger approximately one retrieval per query in 2WMQA but fail to reach the retrieval threshold entirely in WQ. This inconsistency highlights the challenge of maintaining reliable performance across different datasets using confidence-based methods.

3) Iterative retrieval-based approaches typically require numerous retrieval operations, resulting in substantial computational costs. Therefore, efficient adaptive retrieval methods like DeepRAG become crucial for optimizing resource utilization while maintaining performance.

B.2 Relevance to Parametric Knowledge

B.3 Ablation Study

Table 7 and Table 8 show the detailed results of the ablation study.

Method        HotpotQA (F1)  2WMQA (F1)  CAG (EM)  PopQA (EM)  WebQuestions (EM)  Avg
DeepRAG-Imi   46.59          52.33       50.47     43.60       30.00              44.60
most          47.73          46.88       51.09     31.30       28.00              41.12
random        46.78          42.75       51.40     34.80       27.10              40.56

Table 7: Detailed experiment results of the ablation study on the Imitation Learning stage.

Method          HotpotQA (F1)  2WMQA (F1)  CAG (EM)  PopQA (EM)  WebQuestions (EM)  Avg
DeepRAG         51.54          53.25       61.92     47.80       45.24              47.67
all-node        49.99          51.85       50.47     41.50       32.70              45.30
sentence-wise   29.03          31.28       12.46     20.00       12.90              21.14

Table 8: Detailed experiment results of the ablation study on the Chain of Calibration stage.