
Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions


Harsh Trivedi†  Niranjan Balasubramanian†  Tushar Khot‡  Ashish Sabharwal‡
†Stony Brook University, Stony Brook, U.S.A.
‡Allen Institute for AI, Seattle, U.S.A.
{hjtrivedi,niranjan}@cs.stonybrook.edu  {tushark,ashishs}@allenai.org

arXiv:2212.10509v2 [cs.CL] 23 Jun 2023

Abstract

Prompting-based large language models (LLMs) are surprisingly powerful at generating natural language reasoning steps or Chains-of-Thoughts (CoT) for multi-step question answering (QA). They struggle, however, when the necessary knowledge is either unavailable to the LLM or not up-to-date within its parameters. While using the question to retrieve relevant text from an external knowledge source helps LLMs, we observe that this one-step retrieve-and-read approach is insufficient for multi-step QA. Here, what to retrieve depends on what has already been derived, which in turn may depend on what was previously retrieved. To address this, we propose IRCoT, a new approach for multi-step QA that interleaves retrieval with steps (sentences) in a CoT, guiding the retrieval with CoT and in turn using retrieved results to improve CoT. Using IRCoT with GPT3 substantially improves retrieval (up to 21 points) as well as downstream QA (up to 15 points) on four datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. We observe similar substantial gains in out-of-distribution (OOD) settings as well as with much smaller models such as Flan-T5-large without additional training. IRCoT reduces model hallucination, resulting in factually more accurate CoT reasoning.[1]

[1] Code, data, and prompts are available at https://github.com/stonybrooknlp/ircot

[Figure 1 illustration: for the question "In what country was Lost Gravity manufactured?", retrieval and CoT steps alternate and accumulate documents: "The Lost Gravity was manufactured by Mack Rides." -> "Mack Rides is a company from Germany." -> "The answer is Germany."]

Figure 1: IRCoT interleaves chain-of-thought (CoT) generation and knowledge retrieval steps in order to guide the retrieval by CoT and vice-versa. This interleaving allows retrieving more relevant information for later reasoning steps, compared to standard retrieval using solely the question as the query.

1 Introduction

Large language models are capable of answering complex questions by generating step-by-step natural language reasoning steps, so-called chains of thoughts (CoT), when prompted appropriately (Wei et al., 2022). This approach has been successful when all information needed to answer the question is either provided as context (e.g., algebra questions) or assumed to be present in the model's parameters (e.g., commonsense reasoning).

However, for many open-domain questions, all required knowledge is not always available or up-to-date in models' parameters and it's beneficial to retrieve knowledge from external sources (Lazaridou et al., 2022; Kasai et al., 2022). How can we augment chain-of-thought prompting for open-domain, knowledge-intensive tasks that require complex, multi-step reasoning?

While a one-shot retrieval from a knowledge source based solely on the question can successfully augment LMs with relevant knowledge for many factoid-based tasks (Lewis et al., 2020; Guu et al., 2020; Borgeaud et al., 2022; Izacard et al., 2022), this strategy has clear limitations for more complex multi-step reasoning questions. For such questions, one often must retrieve partial knowledge, perform partial reasoning, retrieve additional information based on the outcome of the partial reasoning done so far, and iterate.
As an example, consider the question illustrated in Fig. 1, "In what country was Lost Gravity manufactured?". The Wikipedia document retrieved using the question (in particular, the roller coaster Lost Gravity) as the query does not mention where Lost Gravity was manufactured. Instead, one must first infer that it was manufactured by a company called Mack Rides, and then perform further retrieval, guided by the inferred company name, to obtain evidence pointing to the manufacturing country.

Thus, the retrieval and reasoning steps must inform each other. Without retrieval, a model is likely to generate an incorrect reasoning step due to hallucination. Additionally, without generating the first reasoning step, the text supporting the second step can't be identified easily given the lack of lexical or even semantic overlap with the question. In other words, we need retrieved facts in order to generate factually correct reasoning steps and the reasoning steps to retrieve relevant facts.

Based on this intuition, we propose an interleaving approach to this problem, where the idea is to use retrieval to guide the chain-of-thought (CoT) reasoning steps and use CoT reasoning to guide the retrieval. Fig. 1 shows an overview of our retrieval method, which we call IRCoT.[2] We begin by retrieving a base set of paragraphs using the question as a query. Subsequently, we alternate between the following two steps: (i) extend CoT: use the question, the paragraphs collected thus far, and the CoT sentences generated thus far to generate the next CoT sentence; (ii) expand retrieved information: use the last CoT sentence as a query to retrieve additional paragraphs to add to the collected set. We repeat these steps till the CoT reports an answer or we reach the maximum allowed number of reasoning steps. Upon termination, all collected paragraphs are returned as the retrieval outcome. Finally, we use these as the context for answering the question via direct QA prompting (Brown et al., 2020) or CoT prompting (Wei et al., 2022).

We evaluate the efficacy of our system on 4 multi-step reasoning datasets under an open-domain setting: HotpotQA (Yang et al., 2018), 2WikiMultihopQA (Ho et al., 2020), MuSiQue (Trivedi et al., 2022), and IIRC (Ferguson et al., 2020). Our experiments using OpenAI GPT3 (code-davinci-002) (Brown et al., 2020; Ouyang et al., 2022; Chen et al., 2021) demonstrate that retrieval using IRCoT is substantially more effective than the baseline, one-step, question-based retrieval by 11-21 recall points under a fixed-budget optimal recall setup.[3] When IRCoT is used in conjunction with a prompting-based reader, it also leads to substantial improvement (up to 15 F1 points) in downstream few-shot QA performance and reduces factual errors in generated CoT by up to 50%. Our approach also works on much smaller Flan-T5 models (11B, 3B, and 0.7B) showing similar trends. In particular, we find QA using Flan-T5-XL (3B) with IRCoT even outperforms the 58X larger GPT3 with a one-step question-based retrieval. Furthermore, these improvements also hold up in an out-of-distribution (OOD) setting where the demonstrations from one dataset are used when testing on another dataset. Lastly, we note that our QA scores exceed those reported by recent works on few-shot prompting for open-domain QA (ODQA) (Khot et al., 2023; Press et al., 2022; Yao et al., 2022), although a fair apples-to-apples comparison with them isn't possible (cf. Appendix C).

In summary, our main contribution is a novel retrieval method, IRCoT, that leverages LMs' chain-of-thought generation capabilities to guide retrieval and uses retrieval in turn to improve CoT reasoning. We demonstrate that IRCoT:
1. improves both retrieval and few-shot QA performance on several multi-step open-domain QA datasets, in both IID and OOD settings;
2. reduces factual errors in generated CoTs; and
3. improves performance with both large-scale (175B) models as well as smaller-scale models (Flan-T5-*, ≤11B) without any training.

[2] Interleaved Retrieval guided by Chain-of-Thought.
[3] We explain later (in the Metric section and Footnote 7) the appropriateness of this metric in our setting as opposed to more mainstream information recall metrics.

2 Related Work

Prompting for Open-Domain QA. LLMs can learn various tasks by simply using a few examples as prompts (Brown et al., 2020). They've also been shown to answer complex questions by producing step-by-step reasoning (chain-of-thoughts, or CoT) when prompted with a few or zero demonstrations (Wei et al., 2022; Kojima et al., 2022). Prompting has been applied to open-domain QA (Lazaridou et al., 2022; Sun et al., 2022; Yu et al., 2023) but its value in improving retrieval and QA for multi-step open-domain questions remains relatively underexplored.
Recently three approaches have been proposed for multi-step open-domain QA. SelfAsk (Press et al., 2022) prompts LLMs to decompose a question into subquestions and answers subquestions by a call to the Google Search API. DecomP (Khot et al., 2023) is a general framework that decomposes a task and delegates sub-tasks to appropriate sub-models. They also decompose questions but delegate retrieval to a BM25-based retriever. Both of these approaches are not developed for CoT reasoning, do not focus on the retrieval problem, and require a single-hop QA model to answer the decomposed questions. The recently proposed ReAct (Yao et al., 2022) system frames the problem as generating a sequence of reasoning and action steps. These steps are much more complex, rely on much larger models (PaLM-540B), and require fine-tuning to outperform CoT for multi-step ODQA. Furthermore, none of these works have been shown to be effective for smaller models without any training. While a direct comparison with these approaches is not straightforward (differences in knowledge corpus, LLMs, examples), we find that our ODQA performance is much higher than all their reported numbers where available (§5).

Supervised Multi-Step Open-Domain QA. Prior work has explored iterative retrieval for open-domain QA in a fully supervised setting. Das et al. (2019) propose an iterative retrieval model that retrieves using a neural query representation and then updates it based on a reading comprehension model's output. Feldman and El-Yaniv (2019) apply a similar neural query reformulation idea for multihop open-domain QA. Xiong et al. (2021) extend the widely-used Dense Passage Retrieval (DPR) (Karpukhin et al., 2020) to the multihop setting, which has since been improved by Khattab et al. (2021). Asai et al. (2020) leverage the graph structure induced by the entity links present in Wikipedia paragraphs to perform iterative multi-step retrieval. The GoldEn (Gold Entity) retriever (Qi et al., 2019) iteratively generates text queries based on paragraphs retrieved from an off-the-shelf retriever but requires training data for this next-query generator. Nakano et al. (2021) used GPT3 to answer long-form questions by interacting with the browser but relied on human annotations of these interactions. All of these methods rely on supervised training on a large-scale dataset and can not be easily extended to a few-shot setting.

3 Chain-of-Thought-Guided Retrieval and Open-Domain QA

Our goal is to answer a knowledge-intensive multi-step reasoning question Q in a few-shot setting by using a knowledge source containing a large number of documents. To do this we follow a retrieve-and-read paradigm (Zhu et al., 2021), where the retriever first retrieves documents from the knowledge source and the QA model reads the retrieved documents and the question to generate the final answer. Our contribution is mainly in the retrieve step (§3.1), and we use standard prompting strategies for the read step (§3.2).

As noted earlier, for multi-step reasoning, retrieval can help guide the next reasoning step, which in turn can inform what to retrieve next. This motivates our interleaving strategy, discussed next.

3.1 Interleaving Retrieval with Chain-of-Thought Reasoning

Our proposed retriever method, IRCoT, can be instantiated from the following three ingredients: (i) a base retriever that can take a query and return a given number of paragraphs from a corpus or knowledge source; (ii) a language model with zero/few-shot Chain-of-Thought (CoT) generation capabilities; and (iii) a small number of annotated questions with reasoning steps explaining how to arrive at the answer in natural language (chain of thoughts) and a set of paragraphs from the knowledge source that collectively support the reasoning chain and the answer.

The overview of IRCoT is given in Fig. 2. We first gather a base set of paragraphs by retrieving K paragraphs using the question Q as the query. Then, we interleave two steps (reason and retrieve) iteratively until the termination criterion is met.

[Figure 2 illustration: a worked example for the question "Who wrote the 1970 international hit song that Murray Head is most recognized for?". Retrieve(Q) gathers base paragraphs; the reason step produces T1 ("The 1970 international hit song that Murray Head is most recognized for is 'Super Star'"), Retrieve(T1) adds paragraphs such as the Murray Head page; the next reason step produces T2 ("'Super Star' was written by Andrew Lloyd Webber and Tim Rice."), Retrieve(T2) adds more paragraphs; the final reason step produces T3 ("So the answer is: Andrew Lloyd Webber and Tim Rice.") and the process stops.]

Figure 2: IRCoT interleaves chain-of-thought (CoT) generation and retrieval steps to guide the retrieval by CoT and vice-versa. We start by retrieving K documents using the question as the query and repeat two steps alternately until termination. (i) The reason-step generates the next CoT sentence based on the question, the so-far-retrieved paragraphs, and the CoT sentences. (ii) The retrieve-step retrieves K more paragraphs based on the last CoT sentence. The process terminates when the generated CoT has "answer is" or the number of steps exceeds a threshold. The collection of all paragraphs is returned as the retrieval result upon termination.

The retrieval-guided reasoning step ("Reason") generates the next CoT sentence using the question, the paragraphs collected thus far, and the CoT sentences generated thus far. The prompt template for the task looks as follows:

Wikipedia Title: <Page Title>
<Paragraph Text>
...
Wikipedia Title: <Page Title>
<Paragraph Text>

Q: <Question>
A: <CoT-Sent-1> ... <CoT-Sent-n>

For in-context demonstrations, we use the complete CoT in the above format. For a test instance, we show the model only the CoT sentences generated thus far and let it complete the rest. Even though the model may output multiple sentences, for each reason-step, we only take the first generated sentence and discard the rest.
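For concreteness, the following is a minimal sketch of how a reason-step prompt string in this format could be assembled; the Paragraph container and the function name are illustrative assumptions, not part of the authors' released code.

from dataclasses import dataclass
from typing import List

@dataclass
class Paragraph:
    title: str   # Wikipedia page title
    text: str    # paragraph text

def build_reason_prompt(paragraphs: List[Paragraph],
                        question: str,
                        cot_so_far: List[str]) -> str:
    """Assemble the reason-step input: retrieved paragraphs in the
    'Wikipedia Title: ...' format, then the question, then the partial
    CoT that the LM is asked to continue with one more sentence."""
    lines = []
    for p in paragraphs:
        lines.append(f"Wikipedia Title: {p.title}")
        lines.append(p.text)
        lines.append("")
    lines.append(f"Q: {question}")
    lines.append("A: " + " ".join(cot_so_far))
    return "\n".join(lines)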
For the paragraphs in the in-context demonstrations, we use ground-truth supporting paragraphs and M randomly sampled paragraphs shuffled and concatenated together in the above format. For a test instance, we show all the paragraphs collected thus far across all the previous retrieve-steps.

If the generated CoT sentence has the "answer is:" string or the maximum number of steps[4] has been reached, we terminate the process and return all collected paragraphs as the retrieval result.

The CoT-guided retrieval step ("Retrieve") uses the last generated CoT sentence as a query to retrieve more paragraphs and adds them to the collected paragraphs. We cap the total number of collected paragraphs[5] so as to fit in at least a few demonstrations in the model's context limit.

[4] Set to 8 in our experiments.
[5] Set to 15 in our experiments.
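Putting the reason and retrieve steps together, the overall procedure can be summarized by the sketch below. Here retrieve(query, k) stands in for the base retriever and generate_next_cot_sentence(...) for a call to the LM with the prompt format above; both are assumed interfaces rather than the authors' implementation, and the default caps mirror the values reported in the footnotes (8 steps, 15 paragraphs).

def ircot_retrieve(question, retrieve, generate_next_cot_sentence,
                   k=4, max_steps=8, max_paragraphs=15):
    """Interleave CoT generation ('reason') and retrieval ('retrieve')
    until the CoT reports an answer or the step budget is exhausted.
    Returns all collected paragraphs and the CoT sentences generated."""
    paragraphs = list(retrieve(question, k))   # base retrieval with the question
    cot_sentences = []
    for _ in range(max_steps):
        # Reason-step: generate the next CoT sentence from the question,
        # the collected paragraphs, and the CoT so far (first sentence only).
        sentence = generate_next_cot_sentence(question, paragraphs, cot_sentences)
        cot_sentences.append(sentence)
        if "answer is" in sentence.lower():    # termination criterion
            break
        # Retrieve-step: the last CoT sentence becomes the next query.
        for paragraph in retrieve(sentence, k):
            if paragraph not in paragraphs and len(paragraphs) < max_paragraphs:
                paragraphs.append(paragraph)
    return paragraphs, cot_sentences

The collected paragraphs are then passed as context to the QA reader described next.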

3.2 Question Answering Reader

The QA reader answers the question using retrieved paragraphs taken from the retriever. We consider two versions of the QA reader implemented via two prompting strategies: CoT Prompting as proposed by Wei et al. (2022), and Direct Prompting as proposed by Brown et al. (2020). For CoT prompting, we use the same template as shown in §3.1, but at test time we ask the model to generate the full CoT from scratch. The final sentence of the CoT is expected to be of the form "answer is: ...", so that the answer can be extracted programmatically. If it's not in that form, the full generation is returned as the answer. For Direct Prompting, we use the same template as CoT Prompting but the answer field ("A: ") contains only the final answer instead of the CoT. See App. G for details.
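As a small illustration of the programmatic answer extraction mentioned above, the helper below (a hypothetical name, not from the paper's codebase) pulls the answer out of a generated CoT and falls back to returning the full generation when the expected pattern is absent.

import re

def extract_answer(generation: str) -> str:
    """Return the text after the last 'answer is:' marker in a generated
    CoT; if the marker is missing, return the full generation instead."""
    matches = re.findall(r"answer is:?\s*(.+)", generation, flags=re.IGNORECASE)
    if matches:
        return matches[-1].strip().rstrip(".")
    return generation.strip()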
4 Experimental Setup

We evaluate our method on 4 multi-step QA datasets in the open-domain setting: HotpotQA (Yang et al., 2018), 2WikiMultihopQA (Ho et al., 2020), the answerable subset of MuSiQue (Trivedi et al., 2022), and the answerable subset of IIRC (Ferguson et al., 2020). For HotpotQA, we use the Wikipedia corpus that comes with it for the open-domain setting. For each of the other three datasets, which originally come in a reading comprehension or mixed setting, we used the associated contexts to construct a corpus for our open-domain setting (see App. A for details). For each dataset, we use 100 randomly sampled questions from the original development set for tuning hyperparameters, and 500 other randomly sampled questions as our test set.

4.1 Models

Retriever. We use BM25 (Robertson et al., 2009) implemented in Elasticsearch[6] as our base retriever. We compare two retriever systems:
(i) One-step Retriever (OneR) uses the question as a query to retrieve K paragraphs. We select K ∈ {5, 7, 9, 11, 13, 15} that's best on the dev set.
(ii) IRCoT Retriever is our method described in §3. We use BM25 as its underlying retriever and experiment with OpenAI GPT3 (code-davinci-002) (Brown et al., 2020; Ouyang et al., 2022; Chen et al., 2021) and Flan-T5 (Chung et al., 2022) of different sizes as its CoT generator.

For demonstrating in-context examples to these LMs, we wrote CoTs for 20 questions for all the datasets (see App. §G). We then create 3 demonstration ("training") sets by sampling 15 questions each for each dataset. For each experiment, we search for the best hyperparameters for the dev set using the first demonstration set and evaluate each demonstration set on the test set using the selected hyperparameters. We report the mean and standard deviation of these 3 results for each experiment.

At test time, we pack as many demonstrations as possible within the model's context length limit. The context limit for GPT3 (code-davinci-002) is 8K word pieces. Flan-T5-* doesn't have any hard limit as it uses relative position embeddings. But we limit Flan-T5's context to 6K word pieces, which is the maximum we could fit in the memory of our 80G A100 GPUs.

IRCoT Retriever has one key hyperparameter: K ∈ {2, 4, 6, 8}, the number of paragraphs to retrieve at each step. Additionally, when creating "training" demonstrations for IRCoT's Reasoner module, we use gold paragraphs and a smaller number M ∈ {1, 2, 3} of distractor paragraphs (§3.1).

Retrieval Metric: We allow a maximum of 15 paragraphs for all retriever systems and measure the recall of the gold paragraphs among the retrieved set of paragraphs. We search for the hyperparameter K (and M for IRCoT) that maximizes the recall on the dev set and use it on the test set. The reported metric can thus be viewed as the fixed-budget optimal recall for each system considered.[7]

QA Reader. To implement the reader, we use the same LMs as used in the reason-step of the IRCoT Retriever. We found that QA readers implemented with Flan-T5-* perform better with the Direct Prompting strategy and GPT3 performs better with the CoT Prompting strategy (see App. E). Hence we use the Direct Prompting strategy for QA with Flan-T5-* and CoT with GPT3 for the experiments.[8]

The QA reader has one hyperparameter M: the number of distractor paragraphs in the in-context demonstrations. We search for M in {1, 2, 3}. When used in conjunction with the IRCoT retriever, M is tied for the CoT generator and the reader.

Open-Domain QA (ODQA) Models. Putting retrievers and readers together, we experiment with ODQA models constructed from the various language models, denoted as OneR QA and IRCoT QA. For IRCoT QA, the choice of LM for the CoT generator and the reader is kept the same. We also experiment with retriever-less QA readers, NoR QA, to assess how well LMs can answer the question from their parametric knowledge alone. To select the best hyperparameters for the ODQA model, we search for the hyperparameters K and M that maximize the answer F1 on the development set.

IIRC is structured slightly differently from the other datasets, in that its questions are grounded in a main passage and other supporting paragraphs come from the Wikipedia pages of entities mentioned in this passage. We slightly modify the retrievers and readers to account for this (see App. B).

[6] https://www.elastic.co/
[7] Note that our retrieved documents are not ranked, making standard information retrieval metrics such as MAP and DCG inapplicable. Further, we can only limit the number of retrieved paragraphs per step to K. Since the total number of reasoning steps varies for questions, and in some cases, we don't even obtain all K paragraphs in a given step, the total number of retrieved paragraphs also varies (even though capped at 15). This makes Recall@k, Precision@k, etc., also not applicable as metrics for any given k.
[8] IRCoT, by construction, produces a CoT as a part of its retrieval process. Thus, instead of having a separate post-hoc reader, one can also just extract the answer from the CoT generated during retrieval. However, we found this to be a suboptimal choice, so we always use a separate reader (see App. F).
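The fixed-budget recall used as the retrieval metric reduces to a simple set computation over the (unranked) retrieved paragraphs; a minimal sketch, under the assumption that paragraphs are identified by unique ids or titles:

def fixed_budget_recall(retrieved_ids, gold_ids):
    """Fraction of gold paragraphs present in the retrieved set, which is
    capped at 15 paragraphs in total and is not ranked."""
    if not gold_ids:
        return 1.0
    retrieved = set(retrieved_ids)
    return sum(1 for gold in gold_ids if gold in retrieved) / len(gold_ids)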
Figure 3: Retrieval recall for one-step retriever (OneR) and IRCoT instantiated from Flan-T5-XXL (left) and GPT3
(right) models. IRCoT outperforms OneR for both models and all datasets.

Figure 4: Answer F1 for ODQA models made using (i) no retriever (NoR QA), (ii) a one-step retriever (OneR QA), and (iii) IRCoT QA, instantiated from Flan-T5-XXL (left) and GPT3 (right) models. IRCoT QA outperforms OneR QA and NoR QA for both models on all datasets, except for GPT3 on IIRC.

5 Results

IRCoT retrieval is better than one-step. Fig. 3 compares OneR with IRCoT retrievers made from Flan-T5-XXL and GPT3 LMs. For both models, IRCoT significantly outperforms one-step retrieval across all datasets. For Flan-T5-XXL, IRCoT improves our recall metric relative to one-step retrieval on HotpotQA by 7.9, on 2WikiMultihopQA by 14.3, on MuSiQue by 3.5, and on IIRC by 10.2 points. For GPT3, this improvement is by 11.3, 22.6, 12.5, and 21.2 points, respectively.

IRCoT QA outperforms NoR and OneR QA. Fig. 4 compares ODQA performance using NoR, OneR and IRCoT retrievers made from Flan-T5-XXL and GPT3 LMs. For Flan-T5-XXL, IRCoT QA outperforms OneR QA on HotpotQA by 9.4, on 2WikiMultihopQA by 15.3, on MuSiQue by 5.0 and on IIRC by 2.5 F1 points. For GPT3, the corresponding numbers (except for IIRC) are 7.1, 13.2, and 7.1 F1 points. For GPT3, IRCoT doesn't improve the QA score on IIRC, despite significantly improved retrieval (21 points as shown in Fig. 3). This is likely because IIRC-relevant knowledge may already be present in GPT3, as also evidenced by its NoR QA score being similar. For other datasets and model combinations, NoR QA is much worse than IRCoT QA, indicating the limits of the models' parametric knowledge.

IRCoT is effective in OOD setting. Since CoT may not always be easy to write for new datasets, we evaluate NoR, OneR, and IRCoT on generalization to new datasets, i.e., the OOD setting. To do so, we use prompt demonstrations from one dataset to evaluate on another dataset.[9] For all pairs of the datasets[10] and for both Flan-T5-XXL and GPT3, we find the same trend as in the IID setting: IRCoT retrieval outperforms OneR (Fig. 5), and IRCoT QA outperforms both OneR QA and NoR QA (Fig. 6).

IRCoT generates CoT with fewer factual errors. To assess whether our approach also improves the factuality of generated CoTs, we manually annotated CoTs generated by NoR QA, OneR QA, and IRCoT QA using GPT3 for 40 randomly sampled questions from each of the four datasets. We considered a CoT to have a factual error if at least one of the facts[11] is not true.[12]

[9] We use the evaluation dataset's corpus for retrieval.
[10] We skip IIRC in this exploration as the task is structured a bit differently and requires special handling (see App. B).
[11] All sentences before the final "answer is:" sentence.
[12] Note that a factual error doesn't necessarily mean the predicted answer is incorrect and vice-versa. This is because the model can generate a wrong answer despite all correct facts, and vice-versa. We also account for the possibility of answer annotation errors in the original datasets.
Figure 5: Retrieval recall for OneR and IRCoT using Flan-T5-XXL (Left) and GPT3 (Right) in out-of-distribution
(OOD) setting. HQ (HotpotQA), 2W (2WikiMultihopQA), MQ (MuSiQue). The result X→Y indicates prompt
demonstrations are from dataset X and evaluation is on dataset Y. IRCoT outperforms OneR in such an OOD setting.

Figure 6: Answer F1 for NoR QA, OneR QA and IRCoT QA using Flan-T5-XXL (Left) and GPT3 (Right) in
out-of-distribution (OOD) setting. HQ (HotpotQA), 2W (2WikiMultihopQA), MQ (MuSiQue). The result X→Y
indicates prompt demonstrations are from dataset X and evaluation is on dataset Y. IRCoT QA outperforms OneR
QA and NoR QA in such an OOD setting.

As Fig. 7 shows, NoR makes the most factual errors, OneR makes fewer, and IRCoT the least. In particular, IRCoT reduces the factual errors over OneR by 50% on HotpotQA and 40% on 2WikiMultihopQA.

Figure 7: Number of questions, out of 40, where the CoT generated by GPT3 using different methods has at least 1 factual error. Factual errors: IRCoT < OneR < NoR.

Table 2 illustrates how the CoT predictions for different methods vary qualitatively. Since NoR relies completely on parametric knowledge, it often makes a factual error in the first sentence, which derails the full CoT. OneR can retrieve relevant information closest to the question and is less likely to make such errors early on, but it still makes errors later in the CoT. IRCoT, on the other hand, is often able to prevent such errors in each step.
Figure 8: Retrieval recall for OneR (bottom) and IRCoT (top) for LMs of increasing sizes: Flan-T5 {base (0.2B),
large (0.7B), XL (3B), XXL (11B)} and GPT3 (175B) on HotpotQA, 2WikiMultihopQA, MuSiQue. IRCoT
outperforms OneR for all model sizes, including the 0.2B model, and the difference roughly grows with model size.
Note: OneR doesn't use an LM in its retrieval and so has a fixed score.

Figure 9: Answer F1 for ODQA models made using OneR (bottom) and IRCoT (top) for LMs of increasing sizes:
Flan-T5 {base (0.2B), large (0.7B), XL (3B), XXL (11B)} and GPT3 (175B) on HotpotQA, 2WikiMultihopQA and
MuSiQue. IRCoT QA outperforms OneR QA for all model sizes except for the smallest, 0.2B. IRCoT with the 3B
model even outperforms OneR with the 58X larger GPT3 model, showing the value of improved retrieval.

IRCoT is also effective for smaller models. To see how effective IRCoT is at different LM sizes, we show the scaling plots in Fig. 8.[13] We compare the recall for OneR and IRCoT using Flan-T5 {base (0.2B), large (0.7B), XL (3B), XXL (11B)} and GPT3 code-davinci-002 (175B). IRCoT with even the smallest model (0.2B) is better than OneR, and the performance roughly improves with the model size. This shows the CoT generation capabilities of even small models can be leveraged for improving retrieval. Furthermore, we show the effect of model size on the QA score in Fig. 9. For all sizes except the smallest (0.2B), we see IRCoT QA is better than OneR QA. Moreover, IRCoT with a 3B model even outperforms OneR and NoR with a 58X larger 175B GPT3 model in all datasets.

IRCoT is SOTA for few-shot multistep ODQA.[14] We compare IRCoT QA with five recent approaches to using LLMs for ODQA: Internet-Augmented QA (Lazaridou et al., 2022), RECITE (Sun et al., 2022), ReAct (Yao et al., 2022), SelfAsk (Press et al., 2022), and DecomP (Khot et al., 2022). Although these are not head-to-head comparisons as different methods use different APIs, knowledge sources, and even LLMs (see App. C for details), it is still informative to explore, in a leaderboard-style fashion, how IRCoT performs relative to the best numbers published for these recent systems.

Model HpQABr HpQA 2WikiMQA MQ2H
InterAug − | − 30.3 | − − | − − | −
RECITE − | − 37.1 | 48.4 − | − − | −
ReAct − | − 35.1 | − − | − − | −
SelfAsk − | − − | − 40.1 | − 15.2 | −
DecomP − | 50.0 − | − − | 59.3 − | −
IRCoT QA 45.8 | 58.5 49.3 | 60.7 57.7 | 68.0 34.2 | 43.8

Table 1: Comparison with other LLM-based ODQA systems on EM and F1 scores. '−': score is unavailable. HpQABr: Bridge questions subset of HotpotQA. MQ2H: MuSiQue 2-hop questions. IRCoT QA with GPT3 (ours) outperforms other systems by a large margin. Note: Comparisons aren't head-to-head as discussed in the text. App. §C reports updated SOTA numbers, including contemporaneous and newer works.

As shown in Table 1, IRCoT QA significantly outperforms all of these recent systems by a large margin, setting a new state of the art in terms of what's achievable via retrieval-augmented LLMs (without supervised training).

[13] We skip IIRC here as the smaller models are not good at identifying Wikipedia titles from a paragraph and a question, which is necessary for IIRC (see App. B).
[14] App. §C reports updated SOTA numbers, including contemporaneous and newer works.
Example (HotpotQA)
Question: The tap dancing soloist in My Feet Keep Dancing also starred in what 1935 MGM musical?
Gold Facts: The tap dancing soloist in My Feet Keep Dancing is Fayard Nicholas. Fayard Nicholas also starred in the 1935 MGM musical An All-Colored Vaudeville Show.
NoR QA: The tap dancing soloist in My Feet Keep Dancing is Eleanor Powell. Eleanor Powell also starred in the 1935 MGM musical Broadway Melody of 1936. So the answer is: Broadway Melody of 1936.
OneR QA: The tap dancing soloist in My Feet Keep Dancing is Fayard Nicholas. Fayard Nicholas also starred in the 1935 MGM musical Top Hat. So the answer is: Top Hat.
IRCoT QA: The tap dancing soloist in My Feet Keep Dancing is Fayard Nicholas. Fayard Nicholas starred in the 1935 MGM musical An All-Colored Vaudeville Show. So the answer is: An All-Colored Vaudeville Show.

Example (2WikiMQA)
Question: Who is the father-in-law of Queen Hyojeong?
Gold Facts: Queen Hyojeong is the wife of King Heonjong of Joseon. King Heonjong of Joseon is the son of Crown Prince Hyomyeong.
NoR QA: Queen Hyojeong has a husband named King Gojong. King Gojong has a father named Heungseon Daewongun. Thus, Queen Hyojeong has a father-in-law named Heungseon Daewongun. So the answer is: Heungseon Daewongun.
OneR QA: Queen Hyojeong is the wife of King Heonjong of Joseon. King Heonjong of Joseon is the son of King Sejo of Joseon. Thus, King Sejo of Joseon is the father-in-law of Queen Hyojeong. So the answer is: King Sejo of Joseon.
IRCoT QA: Queen Hyojeong is the wife of King Heonjong of Joseon. King Heonjong of Joseon is the son of Crown Prince Hyomyeong. Thus, Crown Prince Hyomyeong is the father-in-law of Queen Hyojeong. So the answer is: Crown Prince Hyomyeong.

Example (MuSiQue)
Question: What is the name of the castle in the city where the performer of A Collection 1984–1989 was born?
Gold Facts: A Collection 1984–1989 was performed by Jane Siberry. Jane Siberry was born in Toronto. The castle in Toronto is the Casa Loma.
NoR QA: The performer of A Collection 1984–1989 is The The. The The was born in London. The name of the castle in London is the Tower of London. So the answer is: the Tower of London.
OneR QA: A Collection 1984–1989 was performed by Jane Siberry. Jane Siberry was born in Toronto. The castle in Toronto is Peqin Castle. So the answer is: Peqin Castle.
IRCoT QA: A Collection 1984–1989 was performed by Jane Siberry. Jane Siberry was born in Toronto. The castle in Toronto is the Casa Loma. So the answer is: Casa Loma.

Table 2: Example CoTs generated by GPT3 with different methods. Since NoR relies on parametric knowledge, it often makes a factual error in the first sentence, derailing the full CoT. OneR can retrieve relevant information closest to the question and is less likely to make such errors early on, but it still makes errors later in the CoT. As IRCoT performs retrieval after each step, it is often able to prevent such errors in each step. More examples are in App. D.

6 Conclusions

Chain-of-thought prompting has significantly improved LLMs' ability to perform multi-step reasoning. We leveraged this ability to improve retrieval, and in turn, improve QA performance for complex knowledge-intensive open-domain tasks in a few-shot setting. We argued that one-step question-based retrieval is insufficient for such tasks, and introduced IRCoT, which uses interleaved CoT reasoning and retrieval steps that guide each other step-by-step. On four datasets, IRCoT significantly improves both retrieval and QA performance when compared to one-step retrieval, for both large and relatively smaller-scale LMs. Additionally, CoTs generated by IRCoT contain fewer factual errors.

Limitations

IRCoT relies on the base LM to have a zero or few-shot CoT-generation ability. While this is commonly available in large LMs (over 100B), it's not as common for small LMs (under 20B), which to some extent limits IRCoT's adoptability. Given the recent surge of interest (Tay et al., 2023; Magister et al., 2022; Ho et al., 2022), however, smaller LMs will likely increasingly acquire such ability, making IRCoT compatible with many more LMs.

IRCoT also relies on the base LM to support long inputs, as multiple retrieved paragraphs need to fit in the LM's input, in addition to at least a few demonstrations of QA or CoT with paragraphs. This was supported by the models we used, as code-davinci-002 (GPT3) allows 8K tokens and Flan-T5-* uses relative position embeddings, making it as extensible as the GPU memory constraints allow. Future work can explore strategies to rerank and select the retrieved paragraphs instead of passing all of them to the LM to alleviate the need for the LM to support long input.

The performance gains of the IRCoT retriever and QA (over the OneR and ZeroR baselines) come with an additional computational cost. This is because IRCoT makes a separate call to an (L)LM for each sentence of the CoT. Future work can focus on, for instance, dynamically deciding when to retrieve more information and when to perform additional reasoning with the current information.
Lastly, a portion of our experiments was carried out using a commercial LLM API from OpenAI (code-davinci-002). This model was deprecated by OpenAI after our submission, making the reproduction of these experiments challenging despite our best efforts, just like any other work using such APIs. The trends discussed in the paper (IRCoT > OneR > NoR), we believe, would still hold. Additionally, all our experiments using Flan-T5-*, which exhibit similar trends as that of GPT3, will remain reproducible, thanks to its publicly available model weights.

Ethical Considerations

Language models are known to hallucinate incorrect and potentially biased information. This is especially problematic when the questions asked to it are of a sensitive nature. While retrieval-augmented approaches such as ours are expected to alleviate this issue to some extent by grounding generation in external text, this by no means solves the problem of generating biased or offensive statements. Appropriate care should thus be taken if deploying such systems in user-facing applications.

All the datasets and models used in this work are publicly available with permissible licenses. HotpotQA has a CC BY-SA 4.0 license[15], 2WikiMultihopQA has an Apache-2.0 license[16], MuSiQue and IIRC have a CC BY 4.0 license[17], and Flan-T5-* models have an Apache-2.0 license.

[15] https://creativecommons.org/licenses/by-sa/4.0/
[16] https://www.apache.org/licenses/LICENSE-2.0
[17] https://creativecommons.org/licenses/by/4.0

Acknowledgments

We thank the reviewers for their valuable feedback and suggestions. We also thank OpenAI for providing access to the code-davinci-002 API. This material is based on research supported in part by the Air Force Research Laboratory (AFRL), DARPA, for the KAIROS program under agreement number FA8750-19-2-1003, in part by the National Science Foundation under the award IIS #2007290, and in part by an award from the Stony Brook Trustees Faculty Awards Program.

References

Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2020. Learning to retrieve reasoning paths over wikipedia graph for question answering. In International Conference on Learning Representations.

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego De Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack Rae, Erich Elsen, and Laurent Sifre. 2022. Improving language models by retrieving from trillions of tokens. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 2206–2240. PMLR.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.

Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, and Andrew McCallum. 2019. Multi-step retriever-reader interaction for scalable open-domain question answering. In International Conference on Learning Representations.

Yair Feldman and Ran El-Yaniv. 2019. Multi-hop paragraph retrieval for open-domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2296–2309, Florence, Italy. Association for Computational Linguistics.

James Ferguson, Matt Gardner, Hannaneh Hajishirzi, Tushar Khot, and Pradeep Dasigi. 2020. IIRC: A dataset of incomplete information reading comprehension questions. In EMNLP.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3929–3938. PMLR.
Namgyu Ho, Laura Schmid, and Se-Young Yun. 2022. Large language models are reasoning teachers. arXiv preprint arXiv:2212.10071.

Xanh Ho, A. Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. In COLING.

Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2022. Atlas: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.

Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Yu, Dragomir Radev, Noah A Smith, Yejin Choi, and Kentaro Inui. 2022. RealTime QA: What's the answer right now? arXiv preprint arXiv:2207.13332.

Omar Khattab, Christopher Potts, and Matei Zaharia. 2021. Baleen: Robust multi-hop reasoning at scale via condensed retrieval. In Advances in Neural Information Processing Systems.

Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2023. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP.

Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2022. Decomposed prompting: A modular approach for solving complex tasks.

Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2023. Decomposed prompting: A modular approach for solving complex tasks. In The Eleventh International Conference on Learning Representations.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In ICML 2022 Workshop on Knowledge Retrieval and Language Models.

Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. 2022. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474. Curran Associates, Inc.

Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. 2022. Teaching small language models to reason. arXiv preprint arXiv:2212.08410.

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems.

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. 2022. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350.

Peng Qi, Xiaowen Lin, Leo Mehr, Zijian Wang, and Christopher D. Manning. 2019. Answering complex open-domain questions through iterative query generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2590–2602, Hong Kong, China. Association for Computational Linguistics.

Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389.

Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. 2022. Recitation-augmented language models. arXiv preprint arXiv:2210.01296.

Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Steven Zheng, Denny Zhou, Neil Houlsby, and Donald Metzler. 2023. UL2: Unifying language learning paradigms. In The Eleventh International Conference on Learning Representations.

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. MuSiQue: Multi-hop questions via single-hop question composition. TACL, 10:539–554.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le,
and Denny Zhou. 2022. Chain of thought prompt-
ing elicits reasoning in large language models. In
Advances in Neural Information Processing Systems.
Wenhan Xiong, Xiang Li, Srini Iyer, Jingfei Du, Patrick
Lewis, William Yang Wang, Yashar Mehdad, Scott
Yih, Sebastian Riedel, Douwe Kiela, and Barlas
Oguz. 2021. Answering complex open-domain ques-
tions with multi-hop dense retrieval. In International
Conference on Learning Representations.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Ben-
gio, William W. Cohen, Ruslan Salakhutdinov, and
Christopher D. Manning. 2018. HotpotQA: A dataset
for diverse, explainable multi-hop question answer-
ing. In EMNLP.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak
Shafran, Karthik Narasimhan, and Yuan Cao. 2022.
ReAct: Synergizing reasoning and acting in language
models. arXiv preprint arXiv:2210.03629.
Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu,
Mingxuan Ju, Soumya Sanyal, Chenguang Zhu,
Michael Zeng, and Meng Jiang. 2023. Generate
rather than retrieve: Large language models are
strong context generators. In The Eleventh Inter-
national Conference on Learning Representations.

Fengbin Zhu, Wenqiang Lei, Chao Wang, Jianming Zheng, Soujanya Poria, and Tat-Seng Chua. 2021. Retrieving and reading: A comprehensive survey on open-domain question answering. arXiv preprint arXiv:2101.00774.
A Constructing Retrieval Corpora

HotpotQA already comes with the associated Wikipedia corpus for the open-domain setting, so we use it directly. 2WikiMultihopQA and MuSiQue, however, are originally reading comprehension datasets. Questions in 2WikiMultihopQA and MuSiQue are associated with 10 and 20 paragraphs respectively, 2-4 of which are supporting and others are non-supporting. To turn these datasets into an open-domain setting, we make two corpora, one for each dataset, by combining all supporting and non-supporting paragraphs for all its questions in the train, development, and test sets. IIRC is originally a mix between reading comprehension and an open-domain setting. Each question is grounded in one main paragraph, which contains links to multiple Wikipedia pages with several paragraphs each. We create a corpus out of all the paragraphs from all the Wikipedia pages present in the dataset.[18] We do assume the availability of the main passage, which doesn't need to be retrieved and is always present. We don't assume the availability of Wikipedia links in the main passage, however, to keep the retrieval problem challenging.[19]

[18] Following are the corpus sizes for the datasets: HotpotQA (5,233,329), 2WikiMultihopQA (430,225), MuSiQue (139,416), and IIRC (1,882,415).
[19] The IIRC corpus has a positional bias, i.e., the majority of supporting paragraphs are always within the first few positions of the Wikipedia page. To keep the retrieval problem challenging enough we shuffle the paragraphs before indexing the corpus, i.e., we don't use positional information in any way.

B Special Handling of Models for IIRC

IIRC is slightly different from the other datasets, in that the question is grounded in the main passage and other supporting paragraphs come from the Wikipedia pages of entities mentioned in this passage. We modify the retrievers and readers to account for this difference: (i) We always keep the main passage as part of the input to the model regardless of the retrieval strategy used. (ii) For all the retrieval methods, we first prompt the model to generate a list of Wikipedia page titles using the main passage and the question. We map these generated titles to the nearest Wikipedia page titles in the corpus (found using BM25), and then the rest of the paragraph retrieval queries are scoped within only those Wikipedia pages.

To prompt the model to generate Wikipedia page titles using the main passage and the question for IIRC, we use the following template.

Wikipedia Title: <Main Page Title>
<Main Paragraph Text>

Q: The question is: '<Question>'. Generate titles of <N> Wikipedia pages that have relevant information to answer this question.
A: ["<Title-1>", "<Title-2>", ...]

For "training", i.e., for demonstrations, N (≤ 3) is the number of supporting Wikipedia page titles for the question. At test time, since the number of supporting page titles is unknown, we use a fixed value of 3. We found this trick of prompting the model to generate more titles at the test time improves its recall over letting the model decide by itself how many titles to generate.
C Comparison with Previous Systems for ODQA with LLMs

We showed a leaderboard-style comparison with previous approaches to using large language models for open-domain QA in §5. We noted though that the comparison is not head-to-head given various differences. We briefly describe each method and the differences in API, LLM, retrieval corpus, and other choices here.

Internet-Augmented QA (Lazaridou et al., 2022) does (one-step) Google Search retrieval, performs additional LLM-based filtering on it, and then prompts an LLM to answer the question using the resulting context. It uses the Gopher 280B language model. RECITE (Sun et al., 2022) bypasses the retrieval and instead prompts an LLM to first generate (recite) one or several relevant passages from its own memory, and generate the answer conditioned on this generation. They experiment with many LLMs, the highest performing of which is code-davinci-002, which we report here. ReAct (Yao et al., 2022) prompts LLMs to produce reasoning and action traces where actions are calls to a Wikipedia API to return the summary for a given Wikipedia page title. It uses the PALM 540B model. SelfAsk (Press et al., 2022) prompts LLMs to decompose a question into subquestions and answers these subquestions by issuing separate calls to the Google Search API. It uses the GPT3 (text-davinci-002) model. DecomP (Khot et al., 2023) is a general framework that decomposes a task and delegates sub-tasks to appropriate sub-models. Similar to our system, it uses BM25 Search and the GPT3 (code-davinci-002) model. And lastly, DSP (Khattab et al., 2023) provides a way to programmatically define interactions between LLM and retrieval for ODQA (e.g., via question decomposition), bootstrap demonstrations for such a program, and use them to make the answer prediction. It uses the GPT3.5 LLM with ColBERT-based retrieval.

Since most of these methods use different knowledge sources or APIs and are built using different LLMs and retrieval models, it's difficult to make a fair scientific comparison across these systems. Additionally, the evaluations in the respective papers are on different random subsets (from the same distribution) of test instances.

Despite these differences, it is still informative to explore, in a leaderboard-style fashion, how IRCoT performs relative to the best numbers published for these recent systems. Table 3 shows results from different systems, including contemporaneous and newer numbers. The two new systems in this table (relative to Table 1) are DecomP (newer version) and DSP. While IRCoT remains SOTA on MuSiQue, DSP outperforms it on HotpotQA by 2.0 points and the newer version of DecomP outperforms IRCoT on 2WikiMultihopQA by 2.8 points. We speculate DecomP performs well on 2WikiMultihopQA because it has only a few easy-to-predict decomposition patterns, which DecomP's question decomposition can leverage. The lack of such patterns in HotpotQA and MuSiQue causes it to underperform compared to IRCoT. Lastly, it will be useful to assess whether DSP, which is hardcoded for 2-hop questions like that of HotpotQA, will work well for a dataset with a varied number of hops like that of MuSiQue. We leave this further investigation to future work.

Model HpQABr HpQA 2WikiMQA MQ2H MQ
InterAug (Lazaridou et al., 2022) − | − 30.3 | − − | − − | − − | −
RECITE (Sun et al., 2022) − | − 37.1 | 48.4 − | − − | − − | −
ReAct (Yao et al., 2022) − | − 35.1 | − − | − − | − − | −
SelfAsk (Press et al., 2022) − | − − | − 40.1 | − 15.2 | − − | −
DecomP (Khot et al., 2022) − | 50.0 − | − − | 59.3 − | − − | −
DecomP (Khot et al., 2023) * − | − − | 53.5 − | 70.8 − | − − | 30.9
DSP (Khattab et al., 2023) * − | − 51.4 | 62.9 − | − − | − − | −
IRCoT QA (ours) 45.8 | 58.5 49.3 | 60.7 57.7 | 68.0 34.2 | 43.8 26.5 | 36.5

Table 3: Extended comparison with published LLM-based ODQA systems (as of May 25, 2023) on EM and F1 scores (with new numbers marked with *). '−': score is unavailable. HpQABr: Bridge questions subset of HotpotQA. MQ2H: MuSiQue 2-hop questions. IRCoT remains SOTA for MuSiQue and is close to SOTA for HotpotQA and 2WikiMultihopQA. Note the comparisons here are not head-to-head as discussed in the text.

D Additional CoT Generation Examples

Table 5 provides illustrations, in addition to the ones provided in Table 2, for how the CoT generations for the NoR QA, OneR QA, and IRCoT QA methods vary. This gives an insight into how IRCoT improves QA performance. Since NoR relies completely on parametric knowledge, it often makes a factual error in the first sentence, which derails the full reasoning chain. Some of this factual information can be fixed by OneR, especially information closest to the question (i.e., that can be retrieved using the question). This is insufficient for fixing all the mistakes. Since IRCoT involves retrieval after each step, it can fix errors at each step.
Example (HotpotQA)
Question: What was the island, on which Marinelli Glacier is located, formerly known as?
Gold Facts: Marinelli Glacier is located on the island of Tierra del Fuego. The island of Tierra del Fuego was formerly known as Isla de Xativa.
NoR QA: Marinelli Glacier is located on the island of Graham Land. Graham Land was formerly known as Graham's Land. So the answer is: Graham's Land.
OneR QA: Marinelli Glacier is located on Isla Grande de Tierra del Fuego. Isla Grande de Tierra del Fuego was formerly known as Tierra del Fuego. So the answer is: Tierra del Fuego.
IRCoT QA: Marinelli Glacier is located on the island of Tierra del Fuego. The island of Tierra del Fuego was formerly known as Isla de Xativa. So the answer is: Isla de Xativa.

Example (2WikiMQA)
Question: Who is the child of the director of film Mukhyamantri (1996 Film)?
Gold Facts: The director of Mukhyamantri (1996 film) is Anjan Choudhury. Anjan Choudhury has a child named Sandip Choudhury.
NoR QA: The film Mukhyamantri was directed by S. V. Rajendra Singh Babu. S. V. Rajendra Singh Babu has a child named S. V. Rajendra Singh Babu. So the answer is: S. V. Rajendra Singh Babu.
OneR QA: Mukhyamantri (1996 film) was directed by Anjan Choudhury. Anjan Choudhury has a child named Aniruddha Choudhury. So the answer is: Aniruddha Choudhury.
IRCoT QA: The director of Mukhyamantri (1996 film) is Anjan Choudhury. Anjan Choudhury has a child named Sandip Choudhury. So the answer is: Sandip Choudhury.

Example (MuSiQue)
Question: When did the Governorship end of the city where the author of Sacerdotii Nostri Primordia died?
Gold Facts: Sacerdotii Nostri Primordia was written by Pope John XXIII. Pope John XXIII died in Vatican City. The Governorship of Vatican City ended in 1952.
NoR QA: The author of Sacerdotii Nostri Primordia is Pope Pius IX. Pope Pius IX died in the city of Rome. The Governorship of Rome ended in 1870. So the answer is: 1870.
OneR QA: Sacerdotii Nostri Primordia was written by Pope John XXIII. Pope John XXIII died in the city of Rome. The Governorship of Rome ended in 1870. So the answer is: 1870.
IRCoT QA: Sacerdotii Nostri Primordia was written by Pope John XXIII. Pope John XXIII died in Vatican City. The Governorship of Vatican City ended in 1952. So the answer is: 1952.

Table 5: Additional CoTs generated by GPT3 with different methods. ZeroR is most prone to factual errors. OneR often fixes some of the factual information which is closest to the question but doesn't always fix it all the way. Since IRCoT retrieves after each step, it can also fix the errors at each step. More examples are in Table 2.

Model                      HotpotQA    2WikiMQA    MuSiQue     IIRC
Flan-T5-XXL  IRCoT QA      59.1 ± 0.9  66.5 ± 1.4  30.8 ± 0.2  42.5 ± 2.1
             w/o reader    52.6 ± 0.3  60.9 ± 0.6  24.9 ± 0.2  40.3 ± 0.2
GPT3         IRCoT QA      60.7 ± 1.1  68.0 ± 1.5  36.5 ± 1.2  49.9 ± 1.1
             w/o reader    61.0 ± 0.7  70.4 ± 1.5  31.5 ± 0.6  48.4 ± 1.0

Table 6: Answer F1 of IRCoT QA with and without a separate reader for Flan-T5-XXL (top two rows) and GPT3 (bottom two rows). When the reader is not used, the answer is extracted from the CoT generated by IRCoT while doing the retrieval. Ablating the reader usually hurts the performance.

E Direct vs CoT Prompting Readers

Table 4 compares reader choice (Direct vs CoT Prompting) for Flan-T5-XXL and GPT3. We find that Flan-T5-XXL works better with Direct Prompting as a reader and GPT3 works better with CoT Prompting as a reader. Therefore, for the experiments in the main paper, we go with this choice. Note, though, that the trends discussed in § 5 (IRCoT QA > OneR QA > ZeroR QA) hold regardless of the choice of the reader.

F Separate Reader in IRCoT QA

IRCoT, by construction, produces a CoT as a part of its retrieval process. So, instead of having a separate post-hoc reader, one can also just extract the answer from the CoT generated during retrieval. Table 6 shows the effect of such an ablation. For Flan-T5-XXL, having a separate reader is significantly better. For GPT3, this is not always true, but, at the least, a model with a separate reader is always better than, or close to, the one without. So, overall, we go with the choice of using the reader for the experiments in this paper.
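As a rough illustration of this "w/o reader" variant, the answer can be read off the final CoT sentence, which in our demonstrations always ends with "So the answer is: ...". The helper below is a hypothetical sketch of such an extraction rule, not the exact logic of the released code.

# Hypothetical sketch: extract the answer from a generated CoT that follows
# the "So the answer is: <answer>." pattern used in the demonstrations.
def extract_answer_from_cot(cot: str) -> str:
    marker = "answer is:"
    idx = cot.lower().rfind(marker)
    if idx == -1:
        return cot.strip()                       # fall back to the full CoT
    return cot[idx + len(marker):].strip().rstrip(".")

# Example:
# extract_answer_from_cot("Mack Rides is a German company. So the answer is: Germany.")
# -> "Germany"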
G Prompts

Our manually written chain-of-thought annotations for HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC are given in Listings 1, 2, 3, and 4, respectively. Our prompts for GPT3 CoT Prompting are the same as these, except that they also have Wikipedia paragraphs on top of the questions, as shown in § 3.1 (we do not show the paragraphs in the paper for brevity, but they can be obtained from the released code). Our prompts for GPT3 Direct Prompting are the same as those for CoT Prompting, except that they have the answer after "A:" directly. Our prompts for Flan-T5-* are slightly different from those for GPT3. For CoT Prompting, we prefix the question line with "Q: Answer the following question by reasoning step-by-step. <actual-question>". For Direct Prompting, we prefix the question line with "Q: Answer the following question. <actual-question>". We did this to follow Flan-T5-*'s training format and found it to help its CoT generation.
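A minimal sketch of how these prompt variants could be assembled is shown below. The helper name (build_prompt) and its argument layout are our own illustrative assumptions; the exact demonstration strings and paragraph formatting come from the released code.

from typing import List

# Illustrative sketch of the prompt formats described above (an assumption
# about the layout, not the released implementation).
def build_prompt(demonstrations: str, paragraphs: List[str], question: str,
                 model_family: str = "gpt3", use_cot: bool = True) -> str:
    # Flan-T5-* prompts prefix the question with an instruction line to match
    # its training format; GPT3 prompts use the bare question.
    if model_family == "flan-t5":
        instruction = ("Answer the following question by reasoning step-by-step. "
                       if use_cot else "Answer the following question. ")
    else:
        instruction = ""
    context = "\n\n".join(paragraphs)   # retrieved Wikipedia paragraphs on top
    return f"{demonstrations}\n\n{context}\n\nQ: {instruction}{question}\nA:"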
Listing 1: Chain-of-Thought annotations for HotpotQA.
Q: Jeremy Theobald and Christopher Nolan share what profession?
A: Jeremy Theobald is an actor and producer. Christopher Nolan is a director, producer, and screenwriter. Therefore, they
both share the profession of being a producer. So the answer is: producer.

Q: What film directed by Brian Patrick Butler was inspired by a film directed by F.W. Murnau?
A: Brian Patrick Butler directed the film The Phantom Hour. The Phantom Hour was inspired by the films such as Nosferatu
and The Cabinet of Dr. Caligari. Of these Nosferatu was directed by F.W. Murnau. So the answer is: The Phantom Hour.

Q: How many episodes were in the South Korean television series in which Ryu Hye−young played Bo−ra?
A: The South Korean television series in which Ryu Hye−young played Bo−ra is Reply 1988. The number of episodes Reply
1988 has is 20. So the answer is: 20.

Q: Vertical Limit stars which actor who also played astronaut Alan Shepard in "The Right Stuff"?
A: The actor who played astronaut Alan Shepard in "The Right Stuff" is Scott Glenn. The movie Vertical Limit also starred
Scott Glenn. So the answer is: Scott Glenn.

Q: What was the 2014 population of the city where Lake Wales Medical Center is located?
A: Lake Wales Medical Center is located in the city of Polk County, Florida. The population of Polk County in 2014 was
15,140. So the answer is: 15,140.

Q: Who was born first? Jan de Bont or Raoul Walsh?


A: Jan de Bont was born on 22 October 1943. Raoul Walsh was born on March 11, 1887. Thus, Raoul Walsh was born the
first. So the answer is: Raoul Walsh.

Q: In what country was Lost Gravity manufactured?


A: The Lost Gravity (roller coaster) was manufactured by Mack Rides. Mack Rides is a German company. So the answer is:
Germany.

Q: Which of the following had a debut album entitled "We Have an Emergency": Hot Hot Heat or The Operation M.D.?
A: The debut album of the band "Hot Hot Heat" was "Make Up the Breakdown". The debut album of the band "The Operation
M.D." was "We Have an Emergency". So the answer is: The Operation M.D..

Q: In which country did this Australian who was detained in Guantanamo Bay detention camp and published "Guantanamo:
My Journey" receive para−military training?
A: The Australian who was detained in Guantanamo Bay detention camp and published "Guantanamo: My Journey" is David
Hicks. David Hicks received his para−military training in Afghanistan. So the answer is: Afghanistan.

Q: Does The Border Surrender or Unsane have more members?


A: The Border Surrender band has following members: Keith Austin, Simon Shields, Johnny Manning and Mark Austin. That
is, it has 4 members. Unsane is a trio of 3 members. Thus, The Border Surrender has more members. So the answer is:
The Border Surrender.

Q: Which band formed first, Sponge Cola or Hurricane No. 1?


A: Sponge Cola band was formed in 1998. Hurricane No. 1 was formed in 1996. Thus, Hurricane No. 1 band formed the first.
So the answer is: Hurricane No. 1.

Q: James Paris Lee is best known for investing the Lee−Metford rifle and another rifle often referred to by what acronymn?
A: James Paris Lee is best known for investing the Lee−Metford rifle and Lee–Enfield series of rifles. Lee–Enfield is
often referred to by the acronym of SMLE. So the answer is: SMLE.

Q: Who was born first, James D Grant, who uses the pen name of Lee Child, or Bernhard Schlink?
A: James D Grant, who uses the pen name of Lee Child, was born in 1954. Bernhard Schlink was born in 1944. Thus,
Bernhard Schlink was born first. So the answer is: Bernhard Schlink.

Q: Which American neo−noir science fiction has Pierce Gagnon starred?


A: Pierce Gagnon has starred in One Tree Hill, Looper, Wish I Was Here and Extant. Of these, Looper is an American neo−
noir science fiction. So the answer is: Looper.

Q: What year did Edburga of Minster−in−Thanet's father die?


A: The father of Edburga of Minster−in−Thanet is King Centwine. Centwine died after 685. So the answer is: after 685.

Q: Were Lonny and Allure both founded in the 1990s?


A: Lonny (magazine) was founded in 2009. Allure (magazine) was founded in 1991. Thus, of the two, only Allure was
founded in 1990s. So the answer is: no.

Q: The actor that stars as Joe Proctor on the series "Power" also played a character on "Entourage" that has what last name?
A: The actor that stars as Joe Proctor on the series "Power" is Jerry Ferrara. Jerry Ferrara also played a character on Entourage
named Turtle Assante. Thus, Turtle Assante's last name is Assante. So the answer is: Assante.

Q: Nobody Loves You was written by John Lennon and released on what album that was issued by Apple Records, and was
written, recorded, and released during his 18 month separation from Yoko Ono?
A: The album issued by Apple Records, and written, recorded, and released during John Lennon's 18 month separation from
Yoko Ono is Walls and Bridges. Nobody Loves You was written by John Lennon on Walls and Bridges album. So the
answer is: Walls and Bridges.

Q: How many awards did the "A Girl Like Me" singer win at the American Music Awards of 2012?
A: The singer of "A Girl Like Me" singer is Rihanna. In the American Music Awards of 2012, Rihana won one award. So the
answer is: one.

Q: who is older Jeremy Horn or Renato Sobral ?


A: Jeremy Horn was born on August 25, 1975. Renato Sobral was born on September 7, 1975. Thus, Jeremy Horn is older. So
the answer is: Jeremy Horn.

Listing 2: Chain-of-Thought annotations for 2WikiMultihopQA.


Q: When did the director of film Hypocrite (Film) die?
A: The film Hypocrite was directed by Miguel Morayta. Miguel Morayta died on 19 June 2013. So the answer is: 19 June
2013.

Q: Do director of film Coolie No. 1 (1995 Film) and director of film The Sensational Trial have the same nationality?
A: Coolie No. 1 (1995 film) was directed by David Dhawan. The Sensational Trial was directed by Karl Freund. David
Dhawan's nationality is India. Karl Freund's nationality is Germany. Thus, they do not have the same nationality. So the
answer is: no.

Q: Are both Kurram Garhi and Trojkrsti located in the same country?
A: Kurram Garhi is located in the country of Pakistan. Trojkrsti is located in the country of Republic of Macedonia. Thus,
they are not in the same country. So the answer is: no.

Q: Who was born first out of Martin Hodge and Ivania Martinich?
A: Martin Hodge was born on 4 February 1959. Ivania Martinich was born on 25 July 1995. Thus, Martin Hodge was born
first. So the answer is: Martin Hodge.

Q: Which film came out first, The Night Of Tricks or The Genealogy?
A: The Night of Tricks was published in the year 1939. The Genealogy was published in the year 1979. Thus, The Night of
Tricks came out first. So the answer is: The Night Of Tricks.

Q: When did the director of film Laughter In Hell die?


A: The film Laughter In Hell was directed by Edward L. Cahn. Edward L. Cahn died on August 25, 1963. So the answer is:
August 25, 1963.

Q: Which film has the director died later, The Gal Who Took the West or Twenty Plus Two?
A: The film Twenty Plus Two was directed by Joseph M. Newman. The Gal Who Took the West was directed by Frederick de
Cordova. Joseph M. Newman died on January 23, 2006. Fred de Cordova died on September 15, 2001. Thus, the person
to die later from the two is Twenty Plus Two. So the answer is: Twenty Plus Two.

Q: Who is Boraqchin (Wife Of Ögedei)'s father−in−law?
A: Boraqchin is married to Ögedei Khan. Ögedei Khan's father is Genghis Khan. Thus, Boraqchin's father−in−law is
Genghis Khan. So the answer is: Genghis Khan.

Q: What is the cause of death of Grand Duke Alexei Alexandrovich Of Russia's mother?
A: The mother of Grand Duke Alexei Alexandrovich of Russia is Maria Alexandrovna. Maria Alexandrovna died from
tuberculosis. So the answer is: tuberculosis.

Q: Which film has the director died earlier, When The Mad Aunts Arrive or The Miracle Worker (1962 Film)?
A: When The Mad Aunts Arrive was directed by Franz Josef Gottlieb. The Miracle Worker (1962 film) was directed by
Arthur Penn. Franz Josef Gottlieb died on 23 July 2006. Arthur Penn died on September 28, 2010. Thus, of the two, the
director to die earlier is Franz Josef Gottlieb, who directed When The Mad Aunts Arrive. So the answer is: When The
Mad Aunts Arrive.

Q: Which album was released earlier, What'S Inside or Cassandra'S Dream (Album)?
A: What's Inside was released in the year 1995. Cassandra's Dream (album) was released in the year 2008. Thus, of the two,
the album to release earlier is What's Inside. So the answer is: What's Inside.

Q: Are both mountains, Serre Mourene and Monte Galbiga, located in the same country?
A: Serre Mourene is located in Spain. Monte Galbiga is located in Italy. Thus, the two countries are not located in the same
country. So the answer is: no.

Q: What is the date of birth of the director of film Best Friends (1982 Film)?
A: The film Best Friends was directed by Norman Jewison. Norman Jewison was born on July 21, 1926. So the answer is:
July 21, 1926.

Q: Which film has the director born first, Two Weeks With Pay or Chhailla Babu?
A: Two Weeks with Pay was directed by Maurice Campbell. Chhailla Babu was directed by Joy Mukherjee. Maurice
Campbell was born on November 28, 1919. Joy Mukherjee was born on 24 February 1939. Thus, from the two directors,
Chhailla Babu was born first, who directed Two Weeks With Pay. So the answer is: Two Weeks With Pay.

Q: Who is the grandchild of Krishna Shah (Nepalese Royal)?


A: Krishna Shah has a child named Rudra Shah. Rudra Shah has a child named Prithvipati Shah. Thus, Krishna Shah has a
grandchild named Prithvipati Shah. So the answer is: Prithvipati Shah.

Q: When was the director of film P.S. Jerusalem born?


A: P.S. Jerusalem was directed by Danae Elon. Danae Elon was born on December 23, 1970. So the answer is: December 23,
1970.

Q: Which album was released more recently, If I Have to Stand Alone or Answering Machine Music?
A: If I Have to Stand Alone was published in the year 1991. Answering Machine Music was released in the year 1999. Thus,
of the two, the album to release more recently is Answering Machine Music. So the answer is: Answering Machine
Music.

Q: Where did the director of film Maddalena (1954 Film) die?


A: The film Maddalena is directed by Augusto Genina. Augusto Genina died in Rome. So the answer is: Rome.

Q: When did the director of film The Boy And The Fog die?
A: The director of The Boy and the Fog is Roberto Gavaldón. Roberto Gavaldón died on September 4, 1986. So the answer
is: September 4, 1986.

Q: Are the directors of films The Sun of the Sleepless and Nevada (1927 film) both from the same country?
A: The director of Sun of the Sleepless is Temur Babluani. The director of Nevada (1927 film) is John Waters. John Waters is
from the country of America. Temur Babluani is from the country of Georgia. Thus, John Walters and Temur Babluani
are not from the same country. So the answer is: no.

Listing 3: Chain-of-Thought annotations for MuSiQue.


Q: When did the first large winter carnival take place in the city where CIMI−FM is licensed to broadcast?
A: CIMI−FM is licensed to broadcast in Quebec City. The first large winter carnival in Quebec City took place in 1894. So
the answer is: 1894.

Q: When was Neville A. Stanton's employer founded?


A: The employer of Neville A. Stanton is University of Southampton. The University of Southampton was founded in 1862.
So the answer is: 1862.

Q: What county is Hebron located in, in the same province the Heritage Places Protection Act applies to?
A: Heritage Places Protection Act applies to the jurisdiction of Prince Edward Island. Hebron, Prince Edward Island is located
in the Prince County. So the answer is: Prince County.

Q: What weekly publication in the Connecticut city with the most Zagat rated restaurants is issued by university of America−
Lite: How Imperial Academia Dismantled Our Culture's author?
A: The author of America−Lite: How Imperial Academia Dismantled Our Culture is David Gelernter. David Gelernter was
educated at the Yale University. The city in Connecticut that has the highest number of Zagat−rated restaurants is New
Haven. The weekly publication in New Haven that is issued by Yale University is Yale Herald. So the answer is: Yale
Herald.

Q: What is the headquarters for the organization who sets the standards for ISO 21500?
A: The standards for ISO 21500 were set by International Organization for Standardization. The International Organization
for Standardization has headquarters in Geneva. So the answer is: Geneva.

Q: What did the publisher of Banjo−Tooie rely primarily on for its support?
A: The publisher of Banjo−Tooie is Nintendo. Nintendo relied primarily for its support on first−party games. So the answer is:
first−party games.

Q: In which county was the birthplace of the Smoke in tha City performer?
A: The performer of Smoke in tha City is MC Eiht. MC Eiht's birthplace is Compton. Compton is located in the county of Los
Angeles County. So the answer is: Los Angeles County.

Q: What region of the state where Guy Shepherdson was born, contains SMA Negeri 68?
A: Guy Shepherdson was born in Jakarta. SMA Negeri 68 Jakarta is located in Central Jakarta. So the answer is: Central
Jakarta.

Q: When did Britain withdraw from the country containing Hoora?


A: Hoora is in the country of Bahrain. Britain withdrew from Bahrain in 1971. So the answer is: 1971.

Q: Where does the Snake River start, in the state where Lima Mountain is located?
A: Lima Mountain is located in the state of Minnesota. The snake river in Minnesota starts in southern Aitkin County. So the
answer is: southern Aitkin County.

Q: What shares a border with Rivière−Verte in the province WRSU−FM broadcasts in?
A: WRSU−FM was licensed to broadcast to New Brunswick. Rivière−Verte, New Brunswick shares border with
Edmundston. So the answer is: Edmundston.

Q: When was the state of emergency declared in the country where the Senate is located?
A: The Senate is in the country of Kenya. The state of emergency was declared in Kenya on 20 October 1952. So the answer
is: 20 October 1952.

Q: How long is the US border with the country that borders the state where Finding Dory takes place?
A: Finding Dory is supposed to take place in California. The country that shares a border with California is Mexico. The
length of the us border with Mexico is 1,989 mi. So the answer is: 1,989 mi.

Q: What genre is the record label of the performer of So Long, See You Tomorrow associated with?
A: The performer of So Long, See You Tomorrow is Bombay Bicycle Club. The record label of Bombay Bicycle Club is
Island Records. The genre of Island Records is jazz. So the answer is: jazz.

Q: When did the first large winter carnival happen in Olivier Robitaille's place of birth?
A: Olivier Robitaille was born in Quebec City. The first large winter carnival in Quebec City happened in the 1894. So the
answer is: 1894.

Q: What is the genre of the record label of the band that performed on the Crush Tour?
A: The Crush Tour is performed by the band Bon Jovi. The record label of Bon Jovi is Island Records. The genre of Island
Records is jazz. So the answer is: jazz.

Q: When was the first railway line constructed between Kotri and the city where Marie Adelaide Leprosy Centre is located?
A: Marie Adelaide Leprosy Centre is located in Karachi. The first railway line between Kotri and Karachi was constructed in
April 1858. So the answer is: April 1858.

Q: Where is the crying stone found in the country in which Raphael Tuju holds citizenship?
A: Raphael Tuju is a citizen of Kenya. The crying stone in Kenya is found along the highway towards Kisumu. So the answer
is: along the highway towards Kisumu.

Q: When did Britain withdraw from the country where the village of Wadyan is found?
A: Wadyan is in the country of Bahrain. Britain withdraw from Bahrain in 1971. So the answer is: 1971.

Q: How many countries in Pacific National University's continent are recognized by the organization that mediated the truce
ending the Iran−Iraq war?
A: Pacific National University is located in Khabarovsk, Russia Khabarovsk, Russian is in the continent of Asia. The entity
that mediated the truce which ended the Iran−Iraq War is the UN. The number of member states that UN recognises in
Asia is 53. So the answer is: 53.

Listing 4: Chain-of-Thought annotations for IIRC.


Q: What is the age difference between the kicker and the quarterback for the Chargers?
A: The kicker for the Chargers is Nate Kaeding. The quarterback (QB) for the Chargers is Philip Rivers. Nate Kaeding was
born in the year 1982. Philip Rivers was born in the year 1981. Thus, the age difference between them is of 1 year. So
the answer is: 1.

Q: How many years was the ship that took the battalion from New South Wales to Ceylon in service?
A: The ship that took the battalion from New South Wales to Ceylon is General Hewitt. General Hewitt was launched in
Calcutta in 1811. General Hewitt was sold for a hulk or to be broken up in 1864. So she served for a total of 1864 −
1811 = 53 years. So the answer is: 53.

Q: What year was the theatre that held the 2016 NFL Draft built?
A: The theatre that held the 2016 NFL Draft is Auditorium Theatre. The Auditorium Theatre was built in 1889. So the answer
is: 1889.

Q: How long had Milan been established by the year that Nava returned there as a reserve in the first team's defense?
A: Nava returned to Milan as a reserve in the first team's defense in the year 1990. Milan had been established in the year
1899. Thus, Milan had been established for 1990 − 1899 = 91 years when Milan returned to Milan as a reserve in the
first team's defense. So the answer is: 91.

Q: When was the town Scott was born in founded?


A: Scott was born in the town of Cooksville, Illinois. Cooksville was founded in the year 1882. So the answer is: 1882.

Q: In what country did Wright leave the French privateers?


A: Wright left the French privateers in Bluefield's river. Bluefields is the capital of the South Caribbean Autonomous Region (
RAAS) in the country of Nicaragua. So the answer is: Nicaragua.

Q: Who plays the A−Team character that Dr. Hibbert fashioned his hair after?
A: Dr. Hibbert fashioned his hair after Mr. T from The A−Team. Mr T.'s birthname is Lawrence Tureaud. So the answer is:
Lawrence Tureaud.

Q: How many people attended the conference held near Berlin in January 1942?
A: The conference held near Berlin in January 1942 is Wannsee Conference. Wannsee Conference was attended by 15 people.
So the answer is: 15.

Q: When did the country Ottwalt went into exile in founded?


A: Ottwalt went into exile in the country of Denmark. Denmark has been inhabited since around 12,500 BC. So the answer is:
12,500 BC.

Q: When was the J2 club Uki played for in 2001 founded?


A: The J2 club that Uki played for is Montedio Yamagata. Montedio Yamagata was founded in 1984. So the answer is: 1984.

Q: When was the person who produced A Little Ain't Enough born?
A: A Little Ain't Enough was produced by Bob Rock. Bob Rock was born on April 19, 1954. So the answer is: April 19, 1954.

Q: Which of the schools Fiser is affiliated with was founded first?


A: The schools that Fiser is affiliated with (1) Academy of Music, University of Zagreb (2) Mozarteum University of Salzburg
(3) Croatian Music Institute orchestra. Academy of Music, University of Zagreb was founded in the year 1829.
Mozarteum University of Salzburg was founded in the year 1841. Croatian Music Institute was founded in the year 1827.
Thus, the school founded earliest of these is Croatian Music Institute. So the answer is: Croatian Music Institute.

Q: How many casualties were there at the battle that Dearing fought at under Jubal Early?
A: Under Jubal Early, Dearing fought the First Battle of Bull Run. First Battle of Bull Run has 460 union casualties and 387
confederate casualties. Thus, in total the First Battle of Bull Run had 460 + 387 = 847 casualties. So the answer is: 847.

Q: Which of the two congregations which provided leadership to the Pilgrims was founded first?
A: The congregations which provided leadership to the Pilgrims are Brownists and Separatist Puritans. Brownist was founded
in 1581. The Separatist Puritans was founded in 1640. Thus, Brownist was founded first. So the answer is: Brownist.

Q: How long had the Rock and Roll Hall of Fame been open when the band was inducted into it?
A: The band was inducted into Rock and Roll Hall of Fame in the year 2017. Rock and Roll Hall of Fame was established in
the year of 1983. Thus, Rock and Roll Hall of Fame been open for 2018 − 1983 = 34 years when the band was inducted
into it. So the answer is: 34.

Q: Did the Lord Sewer who was appointed at the 1509 coronation live longer than his king?
A: Lord Sewer who was appointed at the 1509 coronation was Robert Radcliffe, 1st Earl of Sussex. Lord Sever's king in 1509
was Henry VIII of England. Robert Radcliffe, 1st Earl of Sussex was born in the year 1483, and died in the year 1542.
So Robert lived for 1542 − 1483 = 59 years. Henry VIII of England was born in the year 1491 and died in the year 1547.
So Henry VIII lived for 1547 − 1491 = 56 years. Thus, Robert Radcliffe lived longer than Henry VIII. So the answer is:
yes.

Q: When was the place near where Manuchar was defeated by Qvarqvare established?
A: Manuchar was defeated by Qvarqvare near Erzurum. Erzurum was founded during the Urartian period. So the answer is:
Urartian period.

Q: What year was the man who implemented the 46 calendar reform born?
A: The man who implemented the 46 calendar reform is Julius Caesar. Julius Caesar was born in the year 100 BC. So the
answer is: 100 BC.

Q: How many years after the first recorded Tommy John surgery did Scott Baker undergo his?
A: The first recorded Tommy John surgery happened when it was invented in the year 1974. Scott Baker underwent Tommy
John surgery in the year 2012. Thus, Scott Baker underwent Tommy John surgery 2012 − 1974 = 38 years after it was
first recorded. So the answer is: 38.

Q: Which was the older of the two players who found the net in the Double−Headed Eagle of the North in the sixth final for
PAOK?
A: The two players who found the net in the Double−Headed Eagle of the North in the sixth final for PAOK are Koudas and
Matzourakis. Koudas was born on 23 November 1946. Matzourakis was born on 6 June 1949. Thus, the older person
among the two is Koudas. So the answer is: Koudas.
