DeepRAG: Thinking to Retrieval Step by Step for Large Language Models
Xinyan Guan1,2, Jiali Zeng3, Fandong Meng3, Chunlei Xin1,2, Yaojie Lu1, Hongyu Lin1, Xianpei Han1, Le Sun1, Jie Zhou3

1 Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Pattern Recognition Center, WeChat AI, Tencent Inc, China

{guanxinyan2022,chunlei2021,hongyu,luyaojie,xianpei,sunle}@iscas.ac.cn
{lemonzeng,fandongmeng,withtomzhou}@tencent.com
Abstract

Large Language Models (LLMs) have shown remarkable potential in reasoning, yet they still suffer from severe factual hallucination due to the timeliness, accuracy, and coverage of parametric knowledge. Meanwhile, integrating reasoning with retrieval-augmented generation (RAG) remains challenging due to ineffective task decomposition and redundant retrieval, which can introduce noise and degrade response quality. In this paper, we propose DeepRAG, a framework that models retrieval-augmented reasoning as a Markov Decision Process (MDP), enabling strategic and adaptive retrieval. By iteratively decomposing queries, DeepRAG dynamically determines whether to retrieve external knowledge or rely on parametric reasoning at each step. Experiments show that DeepRAG improves retrieval efficiency while improving answer accuracy by 21.99%, demonstrating its effectiveness in optimizing retrieval-augmented reasoning.

Figure 1: Correspondence between human thinking processes and DeepRAG, illustrated on the question "What is the total runtime of all movies in The Lord of the Rings?". Specifically, retrieval narrative ensures a structured and adaptive retrieval flow, generating subqueries informed by previously retrieved information, and atomic decisions dynamically determine whether to retrieve external knowledge or rely solely on parametric knowledge for each subquery.
1 Introduction

Large Language Models (LLMs) have demonstrated significant potential in reasoning (Plaat et al., 2024). However, limited by their capacity and capabilities, LLMs still suffer from severe factual hallucination problems due to the timeliness, accuracy, and coverage of parametric knowledge (Zhang et al., 2023; Huang et al., 2023). Retrieval-Augmented Generation (RAG) has been proposed as a promising paradigm to address this issue by integrating relevant information from knowledge bases or search engines, thereby improving the factuality of model responses (Zhao et al., 2024).

However, incorporating reasoning into retrieval-augmented generation still presents several challenges. One major issue is that complex queries often require multi-step decomposition to establish a coherent reasoning process (Radhakrishnan et al., 2023). Iterative retrieval has been proposed as a solution that continuously updates retrieval results to address the dynamic information needs arising during the generation process (Yue et al., 2024). However, LLMs often struggle to generate atomic and precise subqueries, which are critical for more effective retrieval (Wu et al., 2024). From the perspective of RAG, iterative retrieval should ideally generate the next atomic query based on the current question and the available information in an adaptive manner. Moreover, retrieval is not always necessary: some queries require external knowledge, while others can be answered solely by reasoning within the LLM, since LLMs have demonstrated the capability to serve as knowledge bases themselves (Petroni et al., 2019). Unnecessary retrieval, in addition to being redundant, can introduce noise, degrade generation quality, and increase inference latency (Chen et al., 2023; Tan et al., 2024; Bian et al., 2024).

To address this, inspired by the way humans search the Internet on demand, we propose
DeepRAG, a new framework designed to enhance reasoning ability in retrieval-augmented generation by modeling the process as a Markov Decision Process (MDP). The framework introduces two key components, retrieval narrative and atomic decisions, which together form a strategic and adaptive retrieval framework. As illustrated in Figure 1, retrieval narrative ensures a structured and adaptive retrieval flow, generating subqueries informed by previously retrieved information. For each subquery, atomic decisions dynamically determine whether to retrieve external knowledge or rely solely on the parametric knowledge of the LLM. To achieve this, we design a binary tree search method that explores the impact of atomic decisions on reasoning outcomes. Based on it, we first synthesize data for the LLM to learn the retrieval narrative, capturing the pattern of "subquery generation – atomic decision – intermediate answer" through imitation learning. Subsequently, we employ a chain of calibration approach to refine the model's understanding of its own knowledge boundaries, enabling it to make more accurate atomic decisions regarding the necessity of retrieval. By explicitly enhancing the LLM's ability to recognize its own knowledge boundaries, we can train an arbitrary model in an end-to-end manner, enabling it to dynamically determine when retrieval is necessary.

We conduct experiments on five open-domain QA datasets to validate the effectiveness of DeepRAG, including HotpotQA, 2WikiMultihopQA, and PopQA for multi-hop factual QA, CAG for time-sensitive QA, and WebQuestions for heterogeneous knowledge base QA. Experimental results demonstrate that DeepRAG significantly outperforms existing methods, achieving 21.99% higher answer accuracy while improving retrieval efficiency. Further analysis reveals that DeepRAG exhibits a stronger correlation between its retrieval decisions and parametric knowledge, indicating more effective knowledge boundary calibration.

2 Related Work

Adaptive Retrieval-Augmented Generation  Existing adaptive RAG approaches can be broadly categorized into three types: classifier-based methods (Cheng et al., 2024; Jeong et al., 2024), which require training an additional linear head for retrieval decisions; confidence-based methods (Jiang et al., 2023; Su et al., 2024; Dhole, 2025), which rely heavily on threshold-dependent uncertainty metrics; and LLM-based methods (Asai et al., 2023; Zhang et al., 2024), which let the model generate retrieval decisions but often fail to accurately recognize their knowledge boundaries, making it unreliable to delegate retrieval timing decisions to the model. Our method leverages the inherent generative capabilities of LLMs to explore knowledge boundaries in RAG settings. This design maintains the model's native generation abilities while eliminating the need for additional parameters or unreliable uncertainty metrics.

Reasoning in Retrieval-Augmented Generation  Recent advances in RAG have increasingly focused on incorporating reasoning capabilities. Self-RAG (Asai et al., 2023) and Auto-RAG (Yu et al., 2024) leverage automatic data synthesis to enhance reasoning within retrieval-augmented frameworks. Search-o1 (Li et al., 2025) incorporates retrieval into inference to construct an agentic system, though its applicability is limited to o1-like large reasoning models. AirRAG (Feng et al., 2025) combines Monte Carlo Tree Search and self-consistency. In contrast to these approaches, which rely heavily on extensive retrieval operations or large reasoning models, DeepRAG provides an end-to-end method that enables an arbitrary model to think to retrieval step by step on demand.

Knowledge Boundary  LLMs struggle to accurately distinguish between what they know and what they don't know (Yin et al., 2023; Kapoor et al., 2024a; Yin et al., 2024). Additional fine-tuning (Kapoor et al., 2024b) or precise probing (Cheng et al., 2024) is typically required to calibrate the model's cognition. Our approach explores knowledge boundaries directly in RAG settings.

3 Thinking to Retrieval Step by Step

In this section, we introduce our proposed method, DeepRAG. At its core, DeepRAG treats the process of question decomposition, atomic decisions, and final answer generation as a Markov Decision Process (MDP).
Figure 2: An overview of DeepRAG. Our framework comprises three steps: (1) Binary Tree Search, (2) Imitation Learning, and (3) Chain of Calibration. Given a dataset, we first employ binary tree search to synthesize data for imitation learning, enabling the model to learn retrieval patterns. Subsequently, we use binary tree search to construct preference data for further calibrating the LLM's awareness of its knowledge boundaries.
As shown in Figure 2, our framework comprises three key steps: 1) Binary Tree Search, which constructs a binary tree for each subquery related to the given question, exploring paths based on either parametric knowledge or the external knowledge base; 2) Imitation Learning, which extracts the reasoning process that arrives at the correct final answer with minimum retrieval cost; and 3) Chain of Calibration, which calibrates the LLM's internal knowledge by calibrating each atomic decision. Specifically, given a set of supervised datasets, we first employ binary tree search to synthesize data for imitation learning, enabling the model to learn effective retrieval patterns. Subsequently, we use binary tree search to construct preference data for further calibrating the LLM's awareness of its knowledge boundaries. In the following subsections, we describe each component of DeepRAG in detail.

3.1 Overview of the MDP Modeling

We formalize the step-by-step reasoning process for retrieval-augmented generation as a Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R)$, which comprises a set of states $\mathcal{S}$, actions $\mathcal{A}$, transition dynamics $P$, and a reward function $R$.

States. At each step $t$, the state $s_t \in \mathcal{S}$ represents the partial solution to the original question. We denote $s_t = (x, (q_1, r_1), \ldots, (q_t, r_t))$, where $x$ is the input question, and $(q_i, r_i)$ captures the $i$-th subquery along with its intermediate answer (and any retrieved documents).

Actions. At state $s_t$, the model selects an action $a_{t+1} = (\sigma_{t+1}, \delta_{t+1}) \in \mathcal{A}$, which consists of two sub-decisions:
1. Termination decision: Given the partial solution $s_t$, the model makes a binary decision $\sigma_{t+1} \in \{\text{continue}, \text{terminate}\}$ to determine whether to proceed with generating the next subquery $q_{t+1}$ or to finalize the answer $o$.
2. Atomic decision: For each subquery $q_{t+1}$, the model decides whether to retrieve external knowledge or rely solely on its parametric knowledge. Formally, this decision is represented as $\delta_{t+1} \in \{\text{retrieve}, \text{parametric}\}$.

Transitions. After executing the action $a_{t+1} = (\sigma_{t+1}, \delta_{t+1})$ in state $s_t$, the environment updates the state to $s_{t+1}$. Specifically, if $\sigma_{t+1} = \text{terminate}$, the process concludes by generating the final answer $o$, resulting in the terminal state $s_{t+1} = (x, (q_1, r_1), \ldots, (q_t, r_t), o)$. Otherwise, the model generates the next subquery $q_{t+1}$. If $\delta_{t+1} = \text{retrieve}$, the model retrieves documents $d_{t+1}$ and generates an intermediate answer $ia_{t+1}$ for subquery $q_{t+1}$; otherwise, it relies on parametric knowledge to generate the intermediate answer. The response $r_{t+1}$ is set to $[d_{t+1}, ia_{t+1}]$ if retrieval is used, or $ia_{t+1}$ otherwise. The updated state is $s_{t+1} = (x, (q_1, r_1), \ldots, (q_{t+1}, r_{t+1}))$.

Rewards. The reward function evaluates the state based on answer correctness and retrieval cost, and is applied only after generating the final answer $o$. Formally, $R(s_{t+1} = s_t + [o]) = -C(o) \times T(s_t)$, where $C(o)$ indicates correctness ($1$ if correct, $\infty$ otherwise), and $T(s_t)$ represents the total retrieval cost in state $s_t$.
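For concreteness, the MDP components can be sketched in a few lines of code. The following is a minimal Python illustration (not the authors' implementation), under the assumption that the retrieval cost $T(s_t)$ simply counts how many subqueries were answered with retrieval; the names `Step`, `State`, and `reward` are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import math

@dataclass
class Step:
    subquery: str                     # q_i
    answer: str                       # intermediate answer ia_i
    documents: Optional[list] = None  # retrieved docs d_i, None if parametric

@dataclass
class State:
    question: str                                      # x
    steps: List[Step] = field(default_factory=list)    # (q_1, r_1), ..., (q_t, r_t)
    final_answer: Optional[str] = None                  # o, set on termination

def retrieval_cost(state: State) -> int:
    """T(s_t): here, the number of subqueries answered with retrieval."""
    return sum(1 for step in state.steps if step.documents is not None)

def reward(state: State, is_correct: bool) -> float:
    """R = -C(o) * T(s_t): negative retrieval cost if the final answer is correct,
    effectively -infinity otherwise."""
    return -retrieval_cost(state) if is_correct else -math.inf
```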
3.2 Binary Tree Search

In Section 3.1, we model the step-by-step reasoning process as a Markov decision process, where the LLM iteratively decomposes a given question into subqueries, each derived from previously acquired information. The detailed generation instruction is outlined in Appendix A.1, with the answer format presented below.

Building on this formulation, we implement a binary tree search to construct reasoning paths that integrate different retrieval strategies for each subquery. As illustrated in Figure 2, given a question, the model generates the i-th subquery and explores two answering strategies: directly leveraging parametric knowledge (blue node) or retrieving external documents (green node). This approach not only decomposes the question into a sequence of forward-dependent subqueries but also thoroughly examines the influence of retrieval choices on the final answer.

Answer format
Question: <Question>
Follow up: <Subquery1>
Let's search the question in Wikipedia.
Context: <Paragraph Text>
Intermediate answer: <Intermediate Answer1>
Follow up: <Subquery2>
Intermediate answer: <Intermediate Answer2>
......
So the final answer is: <Answer>

3.3 Imitation Learning

In this section, we present an algorithm that leverages binary trees to identify the optimal reasoning process that leads to the correct final answer while minimizing retrieval costs, which helps language models gain the capacity for adaptive inference-time compute generation.

Algorithm 1 Data Construction for Stage I
Require: Question x, answer y, language model M, retriever R, max history length T
Ensure: Optimal reasoning process s* or null
 1: Initialize priority queue PQ ← {([x], 0)}        ▷ (trajectory, retrieval count)
 2: while PQ is not empty do
 3:     (h, r) ← PQ.dequeue()                        ▷ Get trajectory with lowest retrieval count
 4:     q ← M(h)                                     ▷ Subquery generation
 5:     if ShouldAnswer(q) or length(h) > T then
 6:         o ← M(h, q)                              ▷ Final answer
 7:         if IsEqual(o, y) then return h
 8:     else
 9:         a ← M(h, q)                              ▷ Direct answer
10:         PQ.enqueue(([h, (q, a)], r))
11:         d ← R(q)                                 ▷ Retrieve document
12:         a ← M(h, q, d)                           ▷ Retrieved answer
13:         PQ.enqueue(([h, (q, (d, a))], r + 1))
14: return null

Training Objective  Specifically, we implement a masked loss function over the retrieved documents to prevent the model from learning irrelevant or noisy text that could negatively impact its performance. In this way, we expect the model to enhance its ability to decompose subqueries and retrieve on demand. For each instance, the loss function is formulated as follows:

$$\mathcal{L} = -\sum_{1 \le i \le n} \log\left[\Pr(q_i \mid s_{i-1}) + \Pr(a_i \mid s_{i-1}, q_i, d_i)\right]$$
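The key implementation detail is that retrieved-document tokens are excluded from the loss. The following is a minimal PyTorch-style sketch of masking those tokens out of a standard causal-LM cross-entropy loss; the `doc_mask` convention and the use of a plain next-token loss are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def masked_imitation_loss(logits: torch.Tensor,
                          labels: torch.Tensor,
                          doc_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over subquery and answer tokens only.

    logits:   (seq_len, vocab) next-token predictions
    labels:   (seq_len,) target token ids
    doc_mask: (seq_len,) bool, True for tokens belonging to retrieved documents,
              which are excluded so the model is not trained to reproduce
              potentially noisy retrieved text.
    """
    # Standard causal-LM shift: position t predicts token t+1.
    shift_logits = logits[:-1]
    shift_labels = labels[1:].clone()
    shift_doc_mask = doc_mask[1:]

    # Ignore retrieved-document tokens in the loss.
    shift_labels[shift_doc_mask] = -100
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```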
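For readers who prefer code to pseudocode, the priority-queue search in Algorithm 1 above (which is also reused in the next stage to pick optimal paths) can be sketched as follows. The helpers `generate_subquery`, `answer`, `retrieve`, and `is_equal` are hypothetical stand-ins for calls to the language model M and retriever R, and termination is modeled as the generator returning `None`.

```python
import heapq
import itertools

def search_min_retrieval_path(question, gold_answer, generate_subquery, answer,
                              retrieve, is_equal, max_steps=8):
    """Return the first trajectory that reaches the gold answer, exploring
    trajectories in order of increasing retrieval count (cf. Algorithm 1)."""
    counter = itertools.count()             # tie-breaker so heapq never compares trajectories
    pq = [(0, next(counter), [question])]   # (retrieval count, tie, trajectory)
    while pq:
        cost, _, history = heapq.heappop(pq)        # lowest retrieval count first
        subquery = generate_subquery(history)
        if subquery is None or len(history) > max_steps:   # model decides to terminate
            final = answer(history)
            if is_equal(final, gold_answer):
                return history + [final]
        else:
            # Branch 1: answer the subquery from parametric knowledge (no extra cost).
            direct = answer(history, subquery)
            heapq.heappush(pq, (cost, next(counter), history + [(subquery, direct)]))
            # Branch 2: answer the subquery with retrieved documents (cost + 1).
            docs = retrieve(subquery)
            grounded = answer(history, subquery, docs)
            heapq.heappush(pq, (cost + 1, next(counter), history + [(subquery, (docs, grounded))]))
    return None
```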
3.4 Chain of Calibration

Synthesizing Preference Data  First, we identify an optimal path with minimal retrieval based on Algorithm 1, using the model trained in Stage I. This provides the optimal atomic decision for each subquery, determining whether retrieval is necessary. From this path, we construct preference pairs for each subquery to indicate the preferred retrieval choice. For example, in Figure 2, the optimal path may suggest answering the first subquery using parametric knowledge while requiring document retrieval for the second. Accordingly, we generate preference pairs: one favoring parametric knowledge over retrieval for the first subquery, and another favoring retrieval over parametric knowledge for the second. This process enables the LLM to learn when to retrieve external information, thereby maximizing the use of parametric knowledge while minimizing unnecessary retrieval.

Chain of Calibration Objective  We fine-tune the LLM with a Chain of Calibration objective on the synthesized preference data. Given the $i$-th subquery and a state $s_i = [x, q_1, r_1, \cdots, q_{i-1}, r_{i-1}]$, we have two distinct intermediate answers $r_i^1 = a_i^1$ and $r_i^2 = (d_i, a_i^2)$. Based on the process above, we know which $r_i$ is preferred. As a result, the training objective can be formulated as follows:

$$\mathcal{L} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid s_i, q_i)}{\pi_{\mathrm{ref}}(y_w \mid s_i, q_i)} - \beta \log \frac{\pi_\theta(y_l \mid s_i, q_i)}{\pi_{\mathrm{ref}}(y_l \mid s_i, q_i)}\right)$$
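This objective has the form of a DPO-style preference loss over the preferred continuation $y_w$ and the rejected continuation $y_l$ for a subquery. A minimal sketch of the loss term, assuming per-sequence log-probabilities have already been computed under the policy and a frozen reference model (function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def calibration_loss(policy_logp_w: torch.Tensor,   # log pi_theta(y_w | s_i, q_i)
                     policy_logp_l: torch.Tensor,   # log pi_theta(y_l | s_i, q_i)
                     ref_logp_w: torch.Tensor,      # log pi_ref(y_w | s_i, q_i)
                     ref_logp_l: torch.Tensor,      # log pi_ref(y_l | s_i, q_i)
                     beta: float = 0.1) -> torch.Tensor:
    """-log sigma(beta * log-ratio(chosen) - beta * log-ratio(rejected))."""
    chosen_ratio = policy_logp_w - ref_logp_w
    rejected_ratio = policy_logp_l - ref_logp_l
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```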
4 Experiments

4.1 Datasets

The in-distribution datasets are HotpotQA (Yang et al., 2018) and 2WikiMultihopQA (Ho et al., 2020); the out-of-distribution datasets consist of CAG (Pan et al., 2024), PopQA (Mallen et al., 2022), and WebQuestions (Berant et al., 2013). Specifically, we employ the time-sensitive subset of CAG to evaluate temporal reasoning capabilities. Furthermore, WebQuestions is built upon Freebase to assess model robustness when information may be absent from the knowledge base.

4.2 Baselines

We use the following baselines to evaluate performance: CoT (Wei et al., 2022) and CoT*, which employ 8-shot examples extracted from the training dataset; the asterisk (*) indicates that the model was trained using the same data employed for training DeepRAG. CoT-Retrieve and CoT-Retrieve* augment the eight in-context examples with relevant documents retrieved based on the query. IterDRAG (Yue et al., 2024) decomposes the question and answers step by step based on in-context learning. UAR (Cheng et al., 2024) employs a trained classifier to determine when retrieval is necessary. FLARE (Jiang et al., 2023) and DRAGIN (Su et al., 2024) are confidence-based methods that decide the timing of retrieval based on token importance and uncertainty. TAARE (Zhang et al., 2024) allows the LLM itself to determine when retrieval is needed. Auto-RAG (Yu et al., 2024) uses trained models to iteratively decompose questions and retrieve relevant documents for answering.

Types Methods | HotpotQA (EM, F1) | 2WikiMultihopQA (EM, F1) | CAG (EM, F1) | PopQA (EM, F1) | WebQuestions (EM, F1) | Avg
(HotpotQA and 2WikiMultihopQA are in-distribution; CAG, PopQA, and WebQuestions are out-of-distribution.)
Llama-3-8B
CoT 27.20 37.75 28.20 34.85 7.17 10.41 21.20 25.33 25.20 40.56 25.79
CoT-Retrieve 34.90 46.85 35.80 43.41 55.45 64.08 32.80 45.87 22.90 39.22 42.13
CoT* 21.80 31.69 25.60 30.89 5.30 7.58 23.10 25.31 26.80 40.20 23.83
Reasoning CoT-Retrieve* 22.50 32.15 23.70 29.21 44.86 55.69 38.70 45.64 17.60 29.20 33.93
IterDRAG 23.20 30.95 19.60 24.80 38.32 46.18 22.70 34.53 15.90 26.79 28.30
Auto-RAG 25.80 36.09 23.00 30.09 49.22 59.61 27.80 42.02 17.40 32.94 34.40
FLARE 23.80 32.88 30.30 37.45 34.89 43.45 28.80 40.61 28.80 40.61 34.16
DRAGIN 27.60 38.05 29.10 35.68 4.05 7.18 22.60 28.53 21.20 38.72 25.27
Adaptive UAR 29.70 40.66 34.80 42.40 52.96 61.53 33.00 45.95 22.70 39.10 40.28
TAARE 30.60 41.43 35.20 42.85 52.96 61.59 33.20 46.01 23.40 39.56 40.68
DeepRAG-Imi 35.10 46.59 47.20 52.33 50.47 59.55 43.60 48.50 30.00 41.76 45.38
Ours DeepRAG 40.70 51.54 48.10 53.25 52.96 61.92 42.50 47.80 32.70 45.24 47.67
Qwen-2.5-7B
CoT 18.90 27.81 23.40 28.97 3.12 5.71 15.20 19.20 18.30 34.86 19.55
CoT-Retrieve 24.90 34.78 18.60 23.44 41.43 51.47 27.30 41.20 15.10 29.84 30.81
Reasoning CoT* 17.60 26.15 25.10 29.62 3.12 5.62 7.90 11.06 15.60 32.45 17.42
CoT-Retrieve* 23.40 32.29 22.40 27.51 43.30 54.51 26.60 35.46 13.80 25.60 30.49
IterDRAG 13.70 26.84 9.30 20.47 21.81 39.59 18.00 31.44 12.50 26.95 22.06
FLARE 23.40 32.06 21.80 26.51 34.89 42.62 19.00 28.24 16.10 31.89 27.65
DRAGIN 16.70 24.60 12.40 16.76 3.43 5.45 12.00 15.80 17.40 32.43 15.70
Adaptive UAR 24.50 34.22 23.90 28.20 34.89 43.92 27.00 40.47 16.60 32.28 30.60
TAARE 25.30 35.03 21.30 25.67 40.81 50.78 27.00 40.92 18.20 33.14 31.81
DeepRAG-Imi 30.40 39.44 32.00 38.32 47.98 56.99 37.50 40.72 23.90 38.62 38.59
Ours DeepRAG 32.10 41.14 40.40 44.87 51.09 59.76 40.60 43.19 24.20 38.83 41.62
Table 1: The overall experimental results of DeepRAG and other baselines on five benchmarks. The best/second best
scores in each dataset are bolded/underlined. DeepRAG-Imi (Stage I) and DeepRAG (Stage II) both demonstrate
superior performance compared to existing methods across all test scenarios.
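Table 1 reports exact match (EM) and token-level F1. For reference, these are the standard open-domain QA metrics, computed roughly as in the following sketch; SQuAD-style answer normalization is an assumption, since the paper does not specify its exact scoring script.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> int:
    return int(normalize(prediction) == normalize(gold))

def f1(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```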
…retrieval methods, highlighting the mismatch between its internal knowledge and its verbalized knowledge. Moreover, aggressive fine-tuning approaches like CoT* and CoT-Retrieve* can actually degrade model performance by forcing the model to learn knowledge beyond its natural capabilities. In contrast, our approach carefully preserves model capabilities during fine-tuning by leveraging self-synthesized data, effectively preventing additional hallucination while maintaining performance.

5 Analysis

5.1 Retrieval Efficiency

To demonstrate the efficiency of our method, we compare the average number of retrievals on 2WikiMultihopQA and WebQuestions. As shown in Table 2, we have the following observations: 1) DeepRAG achieves higher accuracy with relatively lower retrieval costs, attributed to its dynamic usage of internal knowledge. 2) Confidence-based approaches demonstrate limited robustness across datasets; for instance, both FLARE and DRAGIN fail to trigger retrieval under the default confidence threshold in WQ. 3) Iterative retrieval-based approaches typically require numerous retrieval operations. Therefore, efficient adaptive retrieval methods like DeepRAG become crucial for optimizing resource utilization while maintaining performance.

Dataset  Method        EM     Avg. Retrievals (All / Correct / Incorrect)
2WMQA    FLARE         30.30  0.99 / 1.00 / 0.99
2WMQA    DRAGIN        29.10  1.03 / 1.03 / 1.03
2WMQA    UAR           34.80  0.81 / 0.68 / 0.89
2WMQA    TAARE         35.20  0.93 / 0.93 / 0.97
2WMQA    IterDRAG      19.60  2.46 / 2.49 / 2.45
2WMQA    Auto-RAG      23.00  6.26 / 4.13 / 1.81
2WMQA    DeepRAG-Imi   47.20  1.13 / 0.95 / 1.28
2WMQA    DeepRAG       48.10  1.09 / 0.92 / 1.25
WQ       FLARE         28.80  0.00 / 0.00 / 0.00
WQ       DRAGIN        21.20  0.00 / 0.00 / 0.00
WQ       UAR           22.70  0.96 / 0.95 / 0.97
WQ       TAARE         23.40  0.66 / 0.65 / 0.66
WQ       IterDRAG      15.90  2.25 / 2.16 / 2.27
WQ       Auto-RAG      17.40  4.52 / 3.03 / 2.35
WQ       DeepRAG-Imi   30.00  0.43 / 0.13 / 0.56
WQ       DeepRAG       32.70  0.28 / 0.12 / 0.36

Table 2: Retrieval frequency analysis on 2WikiMultihopQA (2WMQA) and WebQuestions (WQ) across different adaptive retrieval methods. "Correct" indicates the average number of retrievals for instances where the model produced correct answers, while "Incorrect" represents the average retrievals for cases with incorrect answers.

Method        F1     Acc    Balanced Acc  MCC
FLARE         0.000  0.718  0.500         0.000
DRAGIN        0.007  0.709  0.495         -0.045
UAR           0.481  0.756  0.648         0.341
TAARE         0.127  0.712  0.518         0.078
Iter-DRAG     0.000  0.718  0.500         0.000
Auto-RAG      0.000  0.718  0.500         0.000
DeepRAG-Imi   0.580  0.732  0.709         0.393
DeepRAG       0.621  0.749  0.743         0.451
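The table above reports how well each method's retrieval decisions align with the model's knowledge boundary. Assuming a binary label per instance for "retrieval is actually needed" and a binary prediction "the method chose to retrieve" (the exact labeling convention is not specified here and is an assumption for illustration), the metrics can be computed as in the following sketch:

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, matthews_corrcoef)

def knowledge_boundary_metrics(needs_retrieval: list[int], did_retrieve: list[int]) -> dict:
    """Compare retrieve/skip decisions against whether retrieval was actually needed.

    needs_retrieval: 1 if the model's parametric answer alone is wrong, else 0
    did_retrieve:    1 if the method decided to retrieve for that instance, else 0
    """
    return {
        "F1": f1_score(needs_retrieval, did_retrieve),
        "Acc": accuracy_score(needs_retrieval, did_retrieve),
        "Balanced Acc": balanced_accuracy_score(needs_retrieval, did_retrieve),
        "MCC": matthews_corrcoef(needs_retrieval, did_retrieve),
    }
```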
…hinder model performance due to long context or irrelevant knowledge, making internal knowledge the more reliable choice.

5.4 Question Decomposition Effectiveness

We systematically analyze the effectiveness of question decomposition in retrieval narrative. As shown in Figure 3, we present the distribution of subquery counts and retrieval attempts for different questions. Most questions require 3-5 decomposition steps, while retrieval attempts are primarily concentrated within 0-2 rounds. This demonstrates that DeepRAG effectively decomposes questions while minimizing redundant retrieval.

Figure 3: (a) Subquery Statistics. (b) Retrieval Statistics.

Moreover, we analyze the average counts of WH-words, nouns, verbs, and conjunctions in subqueries, as shown in Figure 4. The results indicate that DeepRAG decomposes atomic queries with fewer pronouns and conjunctions.

Figure 4: Average counts of WH-words, nouns, verbs, and conjunctions (and/or) per subquery.

Method        ID (F1)  CAG (EM)  PopQA (EM)  WebQuestions (EM)  Avg
DeepRAG-Imi   49.46    50.47     43.60       30.00              44.60
most          47.31    51.09     31.30       28.00              41.12
random        44.76    51.40     34.80       27.10              40.56

Table 4: Experiment results of the ablation study on the Imitation Learning stage. ID refers to the average score over the two in-distribution datasets, HotpotQA and 2WikiMultihopQA.

Method          ID (F1)  CAG (EM)  PopQA (EM)  WebQuestions (EM)  Avg
DeepRAG         52.40    61.92     47.80       45.24              47.67
all-node        50.92    50.47     41.50       32.70              45.30
sentence-wise   30.16    12.46     20.00       12.90              21.14

Table 5: Experiment results of the ablation study on the Chain of Calibration stage.

Imitation Learning  As shown in Table 4, DeepRAG-Imi enables the model to learn knowledge boundaries during the imitation learning stage. Notably, CAG performs relatively poorly at this stage due to its time-sensitive nature, which necessitates constant retrieval of up-to-date information. Moreover, as illustrated in Figure 6(a), DeepRAG-Imi achieves lower retrieval costs and higher average performance compared to both the maximum-retrieval-cost and random selection methods.

Chain of Calibration  We compare our default approach of constructing preferences based on nodes from optimal paths against two alternatives: constructing pairs for all nodes, and constructing sentence-level partial-order pairs based on retrieval efficiency. As shown in Table 5, DeepRAG demonstrates significant advantages over both variants. Specifically, as illustrated in Figure 6(b), DeepRAG achieves lower retrieval costs while maintaining higher average performance. In contrast, the sentence-level partial-order pairs learned incorrect preferences, resulting in over-reliance on internal knowledge and consequently leading to both low retrieval costs and poor performance.
Figure 5: Comparative analysis of retrieval strategies: parametric only or retrieve only.

Figure 6: Average score and retrievals on the ablation study for Imitation Learning and Chain of Calibration.
Figure 7: Case Study: Auto-RAG vs. DeepRAG, on the question "What is the place of birth of the director of film Peter's Friends?". Auto-RAG misreads the retrieved document, attributes the film to Richard Curtis, and answers "New Zealand"; DeepRAG decomposes the question into atomic subqueries, identifies Kenneth Branagh from retrieval, answers the second subquery ("Belfast, Northern Ireland") from parametric knowledge, and returns the correct final answer "Belfast". DeepRAG achieves success through atomic query decomposition, faithful intermediate answers, and adaptive use of internal knowledge.
…performance over QwQ and gpt-4o, particularly in time-sensitive QA tasks. Notably, while DeepRAG does not surpass gpt-4o in some cases, it achieves comparable performance levels. These results demonstrate that DeepRAG not only effectively recognizes its knowledge boundaries but also adapts well to time-sensitive scenarios.

Models           ID (F1)  CAG (EM)  PopQA (EM)  WQ (EM)  Avg
QwQ-32B          31.43    3.43      10.60       15.10    18.40
gpt-4o-turbo     60.6     23.36     43.50       25.35    42.68
DeepRAG-qwen     43.00    51.09     40.60       24.20    40.38
DeepRAG-llama    52.40    52.96     42.50       32.70    46.59

Table 6: Performance against strong baseline models.

5.7 Case Study

As illustrated in Figure 7, we conduct a case study comparing DeepRAG with Auto-RAG (Yu et al., 2024), a closely related method that utilizes iterative retrieval for retrieval-augmented generation. For each subquery, Auto-RAG retrieves relevant documents and generates a corresponding subanswer. This approach is not only time-consuming but also fails when no relevant documents are retrieved. Although Auto-RAG attempts to address this issue using its own relevant documents, it falls into endless loops in most cases. In contrast, DeepRAG iteratively generates subqueries and determines whether to use internal knowledge at each iteration. The binary tree search data synthesis method used for optimization ensures reliable subquery generation, intermediate answers, and final answers. Even when no related information exists in the retrieved documents, the model is directed to provide a final answer based on internal knowledge.

6 Conclusion

In this paper, we present DeepRAG, a simple yet effective approach that enhances an LLM's awareness of its retrieval requirements through self-calibration. Our method decomposes queries into subqueries and uses binary tree search for data synthesis to help models better understand their knowledge boundaries. Experimental results across various QA tasks demonstrate that DeepRAG significantly improves the accuracy and efficiency of retrieval-augmented generation.

References
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544.

Ning Bian, Hongyu Lin, Peilin Liu, Yaojie Lu, Chunkang Zhang, Ben He, Xianpei Han, and Le Sun. 2024. Influence of external information on large language models mirrors social cognitive patterns. IEEE Transactions on Computational Social Systems.

Hung-Ting Chen, Fangyuan Xu, Shane Arora, and Eunsol Choi. 2023. Understanding retrieval augmentation for long-form question answering. arXiv preprint arXiv:2310.12150.

Qinyuan Cheng, Xiaonan Li, Shimin Li, Qin Zhu, Zhangyue Yin, Yunfan Shao, Linyang Li, Tianxiang Sun, Hang Yan, and Xipeng Qiu. 2024. Unified active retrieval for retrieval augmented generation. arXiv preprint arXiv:2406.12534.

Wikipedia contributors. 2025. Phi coefficient — Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Phi_coefficient. Accessed: 2025-01-22.

Kaustubh D. Dhole. 2025. To retrieve or not to retrieve? Uncertainty detection for dynamic retrieval augmented generation. Preprint, arXiv:2501.09292.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Wenfeng Feng, Chuzhan Hao, Yuewei Zhang, Jingyi Song, and Hao Wang. 2025. AirRAG: Activating intrinsic reasoning for retrieval augmented generation via tree-based search. Preprint, arXiv:2501.10053.

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. 2023. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems.

Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park. 2024. Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. arXiv preprint arXiv:2403.14403.

Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active retrieval augmented generation.

Sanyam Kapoor, Nate Gruver, Manley Roberts, Katherine Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, and Andrew Gordon Wilson. 2024a. Large language models must be taught to know what they don't know. arXiv preprint arXiv:2406.08391.

Sanyam Kapoor, Nate Gruver, Manley Roberts, Arka Pal, Samuel Dooley, Micah Goldblum, and Andrew Wilson. 2024b. Calibration-tuning: Teaching large language models to know what they don't know. In Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024), pages 1–14, St Julians, Malta. Association for Computational Linguistics.

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. 2025. Search-o1: Agentic search-enhanced large reasoning models. arXiv preprint arXiv:2501.05366.

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511.

OpenAI. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/. Online; accessed 22 January 2025.

Ruotong Pan, Boxi Cao, Hongyu Lin, Xianpei Han, Jia Zheng, Sirui Wang, Xunliang Cai, and Le Sun. 2024. Not all contexts are equal: Teaching LLMs credibility-aware generation. arXiv preprint arXiv:2404.06809.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.

Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, and Thomas Back. 2024. Reasoning with large language models, a survey. arXiv preprint arXiv:2407.11511.

Ansh Radhakrishnan, Karina Nguyen, Anna Chen, Carol Chen, Carson Denison, Danny Hernandez, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, et al. 2023. Question decomposition improves the faithfulness of model-generated reasoning. arXiv preprint arXiv:2307.11768.

Weihang Su, Yichen Tang, Qingyao Ai, Zhijing Wu, and Yiqun Liu. 2024. DRAGIN: Dynamic retrieval augmented generation based on the real-time information needs of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12991–13013, Bangkok, Thailand. Association for Computational Linguistics.
Hexiang Tan, Fei Sun, Wanli Yang, Yuanzhuo Wang, Qi Cao, and Xueqi Cheng. 2024. Blinded by generated contexts: How language models merge generated and retrieved contexts for open-domain QA? arXiv preprint arXiv:2401.11911.

Qwen Team. 2024. QwQ: Reflect deeply on the boundaries of the unknown.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Zhuofeng Wu, He Bai, Aonan Zhang, Jiatao Gu, VG Vydiswaran, Navdeep Jaitly, and Yizhe Zhang. 2024. Divide-or-conquer? Which part should you distill your LLM? arXiv preprint arXiv:2402.15000.

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zhihao Fan. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.

Xunjian Yin, Xu Zhang, Jie Ruan, and Xiaojun Wan. 2024. Benchmarking knowledge boundary for large language model: A different perspective on model evaluation. arXiv preprint arXiv:2402.11493.

Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. 2023. Do large language models know what they don't know? arXiv preprint arXiv:2305.18153.

Tian Yu, Shaolei Zhang, and Yang Feng. 2024. Auto-RAG: Autonomous retrieval-augmented generation for large language models.

Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, and Michael Bendersky. 2024. Inference scaling for long-context retrieval augmented generation. arXiv preprint arXiv:2410.04343.

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. 2023. Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.

Zihan Zhang, Meng Fang, and Ling Chen. 2024. RetrievalQA: Assessing adaptive retrieval-augmented generation for short-form open-domain question answering. arXiv preprint arXiv:2402.16457.

Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. 2024. Retrieval-augmented generation for AI-generated content: A survey. arXiv preprint arXiv:2402.19473.

A Templates

A.1 DeepRAG Construct Instruction

Instruction: You are a helpful Retrieve-Augmented Generation (RAG) model. Your task is to answer questions by logically decomposing them into clear sub-questions and iteratively addressing each one.

Use "Follow up:" to introduce each sub-question and "Intermediate answer:" to provide answers.

For each sub-question, decide whether you can provide a direct answer or if additional information is required. If additional information is needed, state, "Let's search the question in Wikipedia." and then use the retrieved information to respond comprehensively. If a direct answer is possible, provide it immediately without searching.

B Detailed Analysis

B.1 Retrieval Efficiency

To demonstrate the efficiency of our method, we compare the average number of retrievals on 2WikiMultihopQA and WebQuestions. As shown in Table 2, we have the following observations:

1) Compared to other adaptive retrieval methods, DeepRAG can achieve higher accuracy with relatively lower retrieval costs. This can be attributed to our dynamic usage of internal knowledge. Additionally, DeepRAG exhibits a positive trend in exploring relevant evidence when faced with insufficient retrieval results, as evidenced by the lower average retrieval numbers in both 2WMQA (0.92 compared to 1.25) and WQ (0.12 compared to 0.36).
2) Confidence-based approaches demonstrate limited robustness across datasets. For instance, while using identical thresholds, both FLARE and DRAGIN show inconsistent behavior: they trigger approximately one retrieval per query in 2WMQA but fail to reach the retrieval threshold entirely in WQ. This inconsistency highlights the challenge of maintaining reliable performance across different datasets using confidence-based methods.

3) Iterative retrieval-based approaches typically require numerous retrieval operations, resulting in substantial computational costs. Therefore, efficient adaptive retrieval methods like DeepRAG become crucial for optimizing resource utilization while maintaining performance.

B.2 Relevance to Parametric Knowledge

B.3 Ablation Study

Table 7 and Table 8 show the detailed results of the ablation study.

Method        HotpotQA (F1)  2WMQA (F1)  CAG (EM)  PopQA (EM)  WebQuestions (EM)  Avg
DeepRAG-Imi   46.59          52.33       50.47     43.60       30.00              44.60
most          47.73          46.88       51.09     31.30       28.00              41.12
random        46.78          42.75       51.40     34.80       27.10              40.56

Table 7: Detailed experiment results of the ablation study on the Imitation Learning stage.

Method          HotpotQA (F1)  2WMQA (F1)  CAG (EM)  PopQA (EM)  WebQuestions (EM)  Avg
DeepRAG         51.54          53.25       61.92     47.80       45.24              47.67
all-node        49.99          51.85       50.47     41.50       32.70              45.30
sentence-wise   29.03          31.28       12.46     20.00       12.90              21.14

Table 8: Detailed experiment results of the ablation study on the Chain of Calibration stage.