
Preliminary Work.

QUERY-DEPENDENT PROMPT EVALUATION AND OPTIMIZATION WITH OFFLINE INVERSE RL

Hao Sun∗, Alihan Hüyük, Mihaela van der Schaar
DAMTP, University of Cambridge
arXiv:2309.06553v3 [cs.CL] 21 Oct 2023

ABSTRACT

In this study, we aim to enhance the arithmetic reasoning ability of Large Language Models (LLMs)
through zero-shot prompt optimization. We identify a previously overlooked objective of query
dependency in such optimization and elucidate two ensuing challenges that impede the successful
and economical design of prompt optimization techniques. One primary issue is the absence of an
effective method to evaluate prompts during inference when the golden answer is unavailable.
Concurrently, learning via interactions with the LLMs to navigate the expansive natural language
prompting space proves to be resource-intensive. To address this, we introduce Prompt-OIRL, which
harnesses offline inverse reinforcement learning to draw insights from offline prompting demonstration
data. Such data exist as by-products when diverse prompts are benchmarked on openly accessible
datasets. With Prompt-OIRL, the query-dependent prompt optimization objective is achieved by first
learning an offline reward model. This model can evaluate any query-prompt pair without accessing
LLMs. Subsequently, a best-of-N strategy is deployed to recommend the optimal prompt. Our
experimental evaluations across various LLM scales and arithmetic reasoning datasets underscore
both the efficacy and economic viability of the proposed approach.

Figure 1: A motivating example. (Left, Right) No single prompt works well for all queries; the optimal
prompt is query-dependent. Yet the search for such prompts is hindered by the Challenges 1-2 we identify.
Our method optimizes the prompt during inference at a query-dependent level, effectively and cost-efficiently.
1 INTRODUCTION
Recent advances in Large Language Models (LLMs), such as ChatGPT (Ouyang et al., 2022), un-
derscore the value of aligning with human preferences (Ouyang et al., 2022; OpenAI, 2023). Using
an offline alignment dataset generated by annotators, the harmlessness, helpfulness, and ability to
follow instructions of LLMs can be largely enhanced (Bai et al., 2022) through the technique of
RLHF (Christiano et al., 2017). That said, even for state-of-the-art general-purpose LLMs such as
GPT-4, solving intricate problems such as arithmetic reasoning can still be challenging. Further
unlocking LLMs’ full potential for those complex tasks remains an open challenge.
Query-Dependent Zero-Shot Prompting for Arithmetic Reasoning: Opportunities and Chal-
lenges. Among the many attempts, prompting — a natural language prefix or instruction that ex-
plains how to complete the task — stands out as a lightweight, promising solution for eliciting
the capabilities of LLMs without model parameter tuning (Wei et al., 2022). While advances
in zero-shot prompting strategies (Kojima et al., 2022) highlight the potential of finding effective
query-independent solutions, the reliance on manual crafting efforts and the vast search space over
natural language intensify the difficulty of discovering effective prompts (Deng et al., 2022). This
study aims to augment the arithmetic reasoning capacities of LLMs, a capability considered cru-
cial (OpenCompass, 2023; Liu et al., 2023a). While there is a wealth of both expert-devised and

∗ [email protected]. Code is available at: https://github.com/holarissun/Prompt-OIRL


machine-generated prompts in this domain (Kojima et al., 2022; Zhou et al., 2022b; Yang et al.,
2023), our endeavor is to advance further via prompt optimization. Our approach begins with the
articulation of an adjusted objective, subsequently highlighting two inherent challenges.
Adjusted Objective: Query-Dependent Prompt Optimization. In the literature, research on zero-
shot prompting has mainly focused on finding prompts that work better at a distributional level rather
than at an instance level. For instance, while multi-agent debate (Liang et al., 2023; Du et al., 2023)
can improve reasoning in general, there are cases where not using a prompt can be even better than
using generally effective prompts. Figure 1 showcases a motivating example. Motivated by the fact
that no prompt is perfect for all queries, in this work, we set an adjusted objective of query-dependent
prompt optimization, rather than distributional-level prompt optimization.
Challenge 1: Inference-Time Evaluation is Hard. A primary challenge arises in evaluating the
effectiveness of prompts during inference, especially when the true answer to a query is not known
a priori. For arithmetic reasoning, enumerating many potentially effective prompts and querying
language models with those prompted queries only yields a batch of answers; determining which
response is correct in the absence of a ground-truth label requires extra, non-trivial effort.
Additionally, the computational cost associated with this prompt enumeration process is substantial.
Challenge 2: Online Prompt Evaluation and Optimization is Expensive. On the other hand,
searching for potentially effective prompts relies heavily on expensive online evaluation. In the lit-
erature, evaluating the effectiveness of a proposed prompt requires assessing its performance on
multiple datasets and LLMs to show its distributional superiority over the others. In the arith-
metic reasoning task, using the GPT-3.5-turbo API (OpenAI, 2023) to evaluate a single prompt on
a medium-sized dataset with 10k query-answer pairs costs roughly $1, yet learning from
trial and error often requires millions of interactions even for tasks with only a few actions to choose
from (Schulman et al., 2017). The vast action space of natural language further increases the pro-
hibitive cost of such an interactive search (Zhang et al., 2022a; Deng et al., 2022).
Solution: Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL.
Having witnessed the significant advancements achieved in expert-crafted and algorithm-discovered
prompting in arithmetic reasoning (Zhou et al., 2022b; Pryzant et al., 2023; Kojima et al., 2022),
our work seeks a solution that combines existing human knowledge in prompting systematically,
effectively, and cost-efficiently. Our contributions can be summarized as follows:

• Formally, we identify the overlooked query-dependent prompt optimization objective and
  its challenges, and introduce Offline Inverse Reinforcement Learning as a systematic approach
  to integrate rich human expertise in prompting.
• Methodologically, we introduce Prompt-OIRL, which first performs query-dependent offline
  prompt evaluation with a learned reward model and then performs offline prompt optimiza-
  tion to improve prompting performance.
• Practically, we highlight that offline datasets generated as by-products when existing
  prompting strategies are benchmarked on open-access tasks can be directly adopted for
  Prompt-OIRL learning.
• Empirically, we validate the efficacy and efficiency of Prompt-OIRL in offline prompt eval-
  uation and optimization through experiments with 3 distinct LLMs, namely GPT-3.5-turbo,
  LLaMA-2-7B-Chat, and TigerBot-13B-Chat, across 3 arithmetic datasets: GSM8K (Cobbe
  et al., 2021a), MAWPS (Roy & Roth, 2016), and SVAMP (Patel et al., 2021).

2 THE QUERY-DEPENDENT PROMPTING PROBLEM

We first introduce preliminaries to provide a formal definition of the query-dependent prompting
problem. We then introduce our solution in the next section.
Queries & Answers  We consider the task of answering queries x ∈ X = V^∞ expressed in a
natural language with vocabulary V. We assume that each query x has an expected answer y∗ ∈ Y.
Such an answer can be a real-valued number or be composed of natural language. For example, x can
be an arithmetic question and y∗ the mathematically correct answer to that question, or x can be the
review of a movie and y∗ a sentiment label for that movie. We denote a dataset containing queries and
answers with D.


Language Model  Such tasks can be performed using a language model ℓ : X → Y by feeding
queries x to the language model and getting answers ŷ = ℓ(x). We assume that the quality of these
answers can be evaluated using a metric r : Y × Y → R. For example, in tasks where golden
labels are available, this metric can be r(y∗, ŷ) = 1{ŷ = y∗}. Generally, a higher value of r(y∗, ŷ)
indicates ŷ matches the expected answer y∗ better.
Prompting  It is well known that the performance of a language model in answering queries can
be boosted through appropriate prompting — a natural language prefix or instruction that explains
how to complete the task. We consider prompts π : X → X that map an original query x to a
modified prompted query π(x). (For clarity, we use prompt to denote the prefix and instructions,
and prompted query to denote the final input to language models after prompting.) These prompted
queries are then fed into the language model to get better answers ŷ = ℓ(π(x)).
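To make these definitions concrete, the sketch below treats a prompt as a plain-text instruction attached to a query and scores an answer with the exact-match metric r(y∗, ŷ) = 1{ŷ = y∗}. The suffix-style template and the helper names (make_prompt, exact_match_reward) are illustrative assumptions rather than the paper's implementation.

```python
from typing import Callable

def make_prompt(instruction: str) -> Callable[[str], str]:
    """Return a prompt pi: X -> X that attaches an instruction to the raw query.
    Whether the instruction is a prefix or a suffix is an assumption here."""
    def pi(x: str) -> str:
        return f"{x}\n{instruction}"
    return pi

def exact_match_reward(y_star: str, y_hat: str) -> float:
    """r(y*, y_hat) = 1{y_hat == y*}, usable only when the golden label is known."""
    return float(y_hat.strip() == y_star.strip())

pi_cot = make_prompt("Let's think step by step:")
prompted_query = pi_cot("A farmer has 3 pens with 4 sheep each. How many sheep in total?")
# `prompted_query` is what would be sent to the language model l(.)
```

At inference time only the prompted query is available; the exact-match reward above cannot be computed because y∗ is unknown, which is exactly Challenge 1.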
Objective  Given a dataset D = {(x^(i), y^∗(i))}_{i∈[N]} of queries and their expected answers, the
objective of query-agnostic zero-shot prompt optimization usually studied in the literature is to find
the distributionally optimal prompting strategy π̄∗ that maximizes the expected quality of answers
w.r.t. the metric r:

\bar{\pi}^* = \arg\max_{\pi} \ \mathbb{E}_{(x^{(i)}, y^{*(i)}) \sim \mathcal{D}} \left[ r\big(y^{*(i)}, \ell(\pi(x^{(i)}))\big) \right]  (1)

Instead, in this work we meet the Adjusted Objective by using a query-dependent approach:

\pi^* = \arg\max_{\pi} \ r\big(y^{*(i)}, \ell(\pi(x^{(i)}))\big)  (2)

In a nutshell, Equation (1) seeks a single prompting strategy that achieves good performance on the
dataset, whereas Equation (2) seeks a different prompt for each query in the dataset. Clearly, π∗
should be at least as good as π̄∗ in the sense that

\mathbb{E}_{(x^{(i)}, y^{*(i)}) \sim \mathcal{D}} \left[ r\big(y^{*(i)}, \ell(\pi^*(x^{(i)}))\big) \right] \geq \mathbb{E}_{(x^{(i)}, y^{*(i)}) \sim \mathcal{D}} \left[ r\big(y^{*(i)}, \ell(\bar{\pi}^*(x^{(i)}))\big) \right]  (3)
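The gap between the two objectives can be illustrated with a small numerical sketch: given a hypothetical table of rewards r[i, k] for query i under prompt k, Equation (1) commits to the single column with the best mean, while Equation (2) takes the row-wise maximum, which by construction satisfies Equation (3).

```python
import numpy as np

# Hypothetical rewards r[i, k] for 3 queries and 3 prompts (illustration only).
r = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0]])

# Eq. (1): one prompt for the whole dataset (distributional-level optimum).
k_bar = int(r.mean(axis=0).argmax())
score_eq1 = r[:, k_bar].mean()      # 2/3 here

# Eq. (2): a possibly different prompt per query (query-dependent optimum).
score_eq2 = r.max(axis=1).mean()    # 1.0 here

# Eq. (3): the query-dependent optimum is never worse.
assert score_eq2 >= score_eq1
```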

3 PROMPTING WITH OFFLINE INVERSE RL


Having established the learning objective in Equation (2), applying RL to the search for an optimal
prompting strategy π may seem a natural approach. Yet, as depicted in Figure 2, such a method
confronts two distinct challenges: Challenge 1 pertains to the inability to compute rewards during
inference, while Challenge 2 concerns the necessity for extensive interactions with LLMs. Different
from conventional RL tasks on which algorithms are developed and optimized, the vast action space
of the vocabulary, the sparse trajectory-level feedback that is only available when the final answer is
generated, and the significant overheads associated with LLM interactions all complicate the pursuit
of an ideal prompting strategy. Furthermore, the absence of definitive labels during the inference
phase renders the prompt evaluation phase infeasible. With this context, we now present Prompt-OIRL
as our proposed solution.

Figure 2: The Adjusted Objective and Challenges in prompt optimization. We use blue to denote fixed
functions, pink for datasets, and green for functions to be optimized. Solid lines show the flow of outputs,
and dashed lines denote the learning process.

Prompt-OIRL: Prompting as Offline Inverse RL  In pursuit of the new objective and to address
the challenges, we draw inspiration from the successes achieved by RLHF. We highlight that the
prompting problem is essentially one of aligning prompts with LLM preferences, a departure from
RLHF's goal of aligning LLMs with human preferences. For a more in-depth exploration of this
perspective, we direct readers to Appendix A.
Aiming for an efficient and economical prompting strategy, we sidestep the complexities of exten-
sive LLM interactions by leveraging readily available datasets and propose offline IRL as a viable
learning solution. We outline its three fundamental steps: Step 1, construction of the offline dataset;
Step 2, offline query-dependent prompt evaluation through reward modeling; and Step 3, offline
query-dependent prompt optimization using the learned reward model.


Step 1: Existence of Offline Prompt Demonstrations  We start by emphasizing the presence and
significance of the prompt demonstrations used in benchmarking existing prompts. In the realm of
research dedicated to improving the arithmetic reasoning capacities of LLMs, several zero-shot
prompts have been proposed. Notable examples include CoT (Kojima et al., 2022), APE (Zhou et al.,
2022b), and the concurrent work OPRO (Yang et al., 2023), among others. It is important to highlight
that none of these prompts uniformly enhances the LLMs' arithmetic reasoning capabilities. Their
efficacy is assessed at a distributional level, as per Equation (1), and different language models might
exhibit varied preferences for prompts (Yang et al., 2023). To see that the best prompting strategy is
dependent on the query, readers may hark back to the motivating example in Figure 1 or find more
case studies in Appendix E.1.

Figure 3: The offline demonstration dataset is generated as a by-product of evaluating existing
(query-agnostic) prompts.

To benchmark the effectiveness of the proposed prompts, prior researchers have utilized standardized,
openly accessible arithmetic reasoning datasets and reported the overall success rate of obtaining
correct answers. Denoting different existing prompts with superscripts, e.g., π^(1)(x) = x for no
prompting, π^(2) for CoT prompting, π^(3) for APE prompting, and so on — when K prompts have
been evaluated, the following demonstrations are constructed as by-products:

\mathcal{D}_{\mathrm{dem}} = \left\{ x^{(i)},\ \pi^{(k)},\ r^{(i,k)} = r\big(y^{*(i)}, \ell(\pi^{(k)}(x^{(i)}))\big) \right\}_{i \in [N],\ k \in [K]}  (4)

This process is visualized in Figure 3, depicting the generation of prompt-alignment demonstration
datasets from evaluating existing prompts. Notably, the rewards in D_dem are influenced by ℓ.
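A minimal sketch of how such a demonstration set can be assembled as a by-product of benchmarking is shown below; call_llm and extract_answer are placeholder helpers, and the record layout is an assumption rather than the released format.

```python
def build_demonstrations(dataset, prompts, call_llm, extract_answer):
    """Assemble D_dem of Eq. (4) from benchmarking K prompts on (query, answer) pairs.

    dataset: iterable of (x, y_star); prompts: list of callables pi^(k);
    call_llm / extract_answer: placeholders for querying l(.) and parsing its output.
    """
    demos = []
    for x, y_star in dataset:
        for k, pi in enumerate(prompts):
            y_hat = extract_answer(call_llm(pi(x)))   # prompted answer l(pi^(k)(x))
            r_ik = float(y_hat == y_star)             # binary reward r(y*, l(pi^(k)(x)))
            demos.append({"query": x, "prompt_id": k, "reward": r_ik})
    return demos
```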

Step 2: Offline Reward Modeling: Inverse RL without an Environment  Upon initial inspection,
it might seem that reward modeling is unnecessary given the reward metric, i.e., r(y∗, ŷ) = 1{ŷ = y∗}.
However, a more thorough examination of this metric reveals two concerns that limit its broader
utilization: first, the reward metric is a function of the language model ℓ, making every single call of
such a metric costly; second, the computation of the reward necessitates access to the ground-truth
answer y∗, which is not available at inference (deployment) time. These are essentially the challenges
associated with the input dependency of this reward metric demonstrated in Figure 2.
To address those issues, we introduce a parameterized proxy reward model that is a function of the
query x and the prompt π. We denote the proxy reward as Υθ(x, π(x)) with parameters θ. Notably,
this proxy reward omits the language model ℓ and the ground-truth labels from its calculations. By
design, its output should align with the true reward r. Therefore, we establish a supervised learning
objective to minimize the discrepancies between the proxy reward model and the true reward.

Figure 4: Prompt-OIRL addresses the specified Objective and challenges. It first learns a proxy reward model
from the offline demonstration dataset constructed in the previous step. Such a learned reward model can be
applied at inference time to evaluate prompts in a query-dependent manner without access to the language
model; hence, optimizing the prompt w.r.t. such a proxy reward model resolves all of the identified issues.

In the case of arithmetic reasoning tasks, the reward signal is binary, hence the proxy reward model
can be trained via a classification task, predicting whether the prompt results in a correct answer
when fed to the language model. Specifically, we consider the cross-entropy loss given the
demonstration data collected in the previous section:

\mathcal{L}^{\ell}_{\mathrm{CE}}(\theta; \mathcal{D}_{\mathrm{dem}}) = -\mathbb{E}_{i \in [N],\ k \sim [K]} \left[ r^{(i,k)} \log \sigma\big(\Upsilon_{\theta}^{(i,k)}\big) + \big(1 - r^{(i,k)}\big) \log\big(1 - \sigma(\Upsilon_{\theta}^{(i,k)})\big) \right]  (5)


where we use Υθ^(i,k) to denote Υθ(x^(i), π^(k)(x^(i))) for conciseness, and σ is the sigmoid function.
Different from the online reward signal provided by r(y∗, ℓ(x′)), which requires expensive interactions
with black-box LLMs, the proxy reward model Υθ(x, π(x)) can provide offline, white-box feedback
on prompts. Therefore, it is more accessible in guiding the search for a better prompting strategy.
Figure 4 illustrates how such a reward model can be learned and leveraged to provide feedback for
prompting strategy optimization.
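As a rough illustration of Equation (5), the sketch below trains a proxy reward model on precomputed query and prompt embeddings with a binary cross-entropy loss. The MLP architecture, the use of embeddings as inputs, and the training step are our assumptions for illustration; Appendix C.2 discusses the implementation choices actually used.

```python
import torch
import torch.nn as nn

class ProxyReward(nn.Module):
    """Upsilon_theta(x, pi(x)): scores a (query, prompted-query) pair from embeddings."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
        # Returns a logit; sigma(logit) estimates the probability of a correct answer.
        return self.net(torch.cat([x_emb, p_emb], dim=-1)).squeeze(-1)

def train_step(model, optimizer, x_emb, p_emb, rewards):
    """One step on the cross-entropy objective of Eq. (5); rewards r^(i,k) are in {0, 1}."""
    loss = nn.functional.binary_cross_entropy_with_logits(model(x_emb, p_emb), rewards.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```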
Step 3: Offline Prompt Optimization with the Learned Reward Model  Given that the learned
reward model is an effective proxy of the performance evaluator for query-prompting-strategy pairs,
we can then solve the prompt optimization problem of Equation (2) with an alternative offline
objective that is feasible to execute at inference time — without requiring the language model ℓ or
the inference-time golden label y∗, which is inaccessible:

\pi^* = \arg\max_{\pi} \ \Upsilon_{\theta}(x, \pi(x)) \approx \arg\max_{\pi} \ r\big(y^*, \ell(\pi(x))\big)  (6)

In general, any policy optimization technique can potentially be applied to solve Equation (6). To
name a few examples, such a technique can be any of the previous approaches used in online prompt
optimization, such as RL (Deng et al., 2022), beam search (Pryzant et al., 2023), or evolution
strategies (Zhou et al., 2022b).
In this work, we would like to highlight the effectiveness and cost-efficiency of optimizing prompts
with regard to such a learned reward model as an accessible offline proxy to the true reward. We
choose a minimalist approach and isolate the source of gains by using a best-of-N strategy: we
generate a batch of candidate prompts with a general-purpose language model and select the best
one according to the learned reward model. We leave the investigation of other approaches, which
could be more tailored and better-performing yet computationally more expensive, as a promising
direction for future exploration.
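A best-of-N selection against the learned reward model can be sketched as below; reward_model and embed stand in for the trained proxy Υθ and an embedding function, and are assumptions made for illustration.

```python
def select_prompt(query, candidate_prompts, reward_model, embed):
    """Pick arg-max_pi Upsilon_theta(x, pi(x)) over a batch of candidate prompts.

    candidate_prompts: list of callables pi mapping a query to a prompted query;
    reward_model(x_emb, p_emb): learned proxy reward; embed: text -> vector.
    Only the selected prompt is subsequently sent to the LLM.
    """
    x_emb = embed(query)
    scores = [reward_model(x_emb, embed(pi(query))) for pi in candidate_prompts]
    best_k = max(range(len(candidate_prompts)), key=lambda k: scores[k])
    return candidate_prompts[best_k]
```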

4 RELATED WORK
We discuss the most related literature on learning-based prompt optimization and reinforcement
learning from human knowledge in this section. Extended discussions are deferred to Appendix B.

Table 1: Prompt-OIRL mainly differentiates from existing literature on prompt optimization by (1) consid-
ering the Adjusted Objective and optimizing query-dependent prompts; (2) being able to perform offline
prompt evaluation to address Challenge 1; (3) optimizing prompts in the offline setting without access to the
LLMs to address Challenge 2; (4) using the natural language prompt space; (5) using offline inverse reinforcement
learning to solve the problem. More unique properties of Prompt-OIRL are discussed in Table 4.

Method               (1) Query-Dependent   (2) Offline Prompt   (3) Offline Prompt    (4) Prompt    (5) Solver Used
                         Prompt                Evaluation           Optimization          Space*
Soft-Prompt          ✓                     ✗                    ✗                     Embeddings    Gradient-Guided Search
APO                  ✗                     ✗                    ✗                     X = V∞        Beam Search
APE                  ✗                     ✗                    ✗                     X = V∞        Evolution Strategy
TEMPERA              ✓                     ✗                    ✗                     Edit          RL
RLPrompt             ✗                     ✗                    ✗                     {V^2, V^5}    RL
Prompt-OIRL (ours)   ✓                     ✓                    ✓                     X = V∞        Offline Inverse-RL

* Embeddings: the language of LMs; Edit: including operations like swap, delete, etc.; V is the vocabulary, and the superscript over it
denotes the length of prompts. V∞ denotes the natural language space — the most general interpretable format.

Learning-Based Prompt Optimization  Automatic Prompt Optimization (APO) (Pryzant et al.,
2023) leverages training data and initial prompts to perform a "gradient descent" process operated
in the natural language space and guided by large language models. A generate-and-select process
is then applied to update the prompting mechanism. Automatic Prompt Engineer (APE) (Zhou et al.,
2022b) asks LLMs to generate a set of instruction candidates based on demonstrations and then
evaluates the performance of those instructions by computing corresponding scores; finally, APE
improves the best candidates by proposing semantically similar instruction variants for specific tasks.
There are also works that explore the usage of auxiliary models or differentiable representations of
the prompt (Shin et al., 2020; Li & Liang, 2021; Qin & Eisner, 2021; Vu et al., 2021; Gu et al.,
2021). Yet those methods require access to the embeddings of the LMs for soft prompt optimiza-
tion, hence falling short in reusability across LMs. In the work of Hao et al. (2022), reinforcement
learning is used to optimize a prompt policy in text-to-image generation tasks in pursuit of more
aesthetically pleasing images that preserve the original user intentions. Zhang et al. (2022a) propose
TEMPERA to perform test-time prompt editing using reinforcement learning. TEMPERA designs
the action space to be editing operations, such as swapping, deleting, adding, or changing verbaliz-
ers. RLPrompt (Deng et al., 2022) uses reinforcement learning to optimize LLMs to specialize as
prompting agents. Yet the prompts it generates are task-agnostic and limited to combinations of
words that cannot be easily transferred into insights for human prompt engineers.
Reinforcement Learning with Human Knowledge and Demonstrations  Human knowledge and
demonstrations have pushed forward the progress of machine learning in many domains, including
natural language processing (Rajpurkar et al., 2018; Ouyang et al., 2022) and many others (Deng
et al., 2009; Redmon et al., 2016; Atkeson & Schaal, 1997; Zhang et al., 2018; Mandlekar et al.,
2020). By injecting prior knowledge of the task from humans, reinforcement learning algorithms
are able to achieve human-level performance or even outperform humans (Silver et al., 2016; Vinyals
et al., 2019). Our solver of the prompting problem is related to inverse RL (Ng et al., 2000; Abbeel &
Ng, 2004; Ho & Ermon, 2016), which aims at inferring rewards and learning policies from a batch of
behavioral demonstrations; this is unique compared with the existing literature on prompt optimization.
The main distinctions between this study and those related works are summarized and highlighted
in Table 1. For an extended discussion of further distinctions, please refer to Table 4 in Appendix B.

5 EXPERIMENT
In this section, we present empirical evidence illustrating the efficacy of Prompt-OIRL in addressing
the previously highlighted challenges and meeting the revised objective. We first outline the general
experimental setup. Subsequently, we delve into fulfilling the Adjusted Objective and detail how the
Challenges are addressed in Sections 5.1 through 5.3.

Tasks We use the tasks of MultiArith (Roy & Roth, 2016), GSM8K (Cobbe et al., 2021a),
SVAMP (Patel et al., 2021) in the arithmetic reasoning domain because they are widely studied
in zero-shot prompting, and hence rich expert-crafted and machine-generated prompting knowledge
is available. Leveraging such knowledge facilitates our offline data collection procedure.
Table 2: Prompts used in offline training dataset collection.

No.  Effective Prompts Discovered by Experts and Algorithms                          Explanation
1    "The answer is:"                                                                direct prompting
2    "Let's think step by step:"                                                     zero-shot CoT (Kojima et al., 2022)
3    "Let's work this out in a step by step way to be sure we have the right         APE discovered (Zhou et al., 2022b)
     answer:"
4    "First, decompose the question into several sub-questions that need to be       Least-to-most (Zhou et al., 2022a)
     solved, and then solve each question step by step:"
5    "Imagine three different experts are answering this question. All experts       Tree-of-thought (Hulbert, 2023)
     will write down 1 step of their thinking, and then share it with the group.
     Then all experts will go on to the next step, etc. If any expert realizes
     they're wrong at any point then they leave."
6    "3 experts are discussing the question, trying to solve it step by step, and    multi-agent debate (Liang et al., 2023)
     make sure the result is correct:"

Prompts  As shown in Table 2, we consider 6 existing zero-shot prompts in arithmetic reasoning
tasks. We construct our offline demonstration dataset using interaction logs between LLMs and
queries prompted by those strategies.

LLMs  To demonstrate the general applicability of Prompt-OIRL, we experiment with datasets
generated by LLMs of different abilities, scaling from the GPT-3.5-turbo model (Ouyang et al.,
2022) to TigerBot-13B-chat (TigerResearch, 2023) and LLaMA2-7B-chat (Touvron et al., 2023).
All created offline demonstration datasets, including the query-prompt pairs, prompted answers
from different LLMs, and the correctness of those answers, will be released as a publicly accessible
dataset. Detailed discussions on the usage of embeddings, data processing, and training details are
deferred to Appendix C.
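For concreteness, a single record in such an offline demonstration log could look like the following; the field names and values are hypothetical and do not represent the official schema of the released dataset.

```python
import json

record = {
    "llm": "gpt-3.5-turbo",                 # which model produced the prompted answer
    "dataset": "SVAMP",                     # source arithmetic benchmark
    "query": "A pen costs $2 and a notebook costs $3. How much do 2 pens and 1 notebook cost?",
    "prompt": "Let's think step by step:",  # one of the prompts in Table 2
    "prompted_answer": "2 * 2 + 3 = 7, so the total cost is $7.",
    "correct": 1,                           # binary reward r(y*, y_hat)
}
print(json.dumps(record, indent=2))
```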


5.1 IMPROVING ARITHMETIC REASONING BY FULFILLING THE ADJUSTED OBJECTIVE

Experiment Setup  In this section, we demonstrate the superiority of query-dependency in
prompting and show that Prompt-OIRL effectively fulfills the Adjusted Objective. In the experiment,
we use BoTr Eqn.(1) to denote the setting of selecting the best-of-training-time prompt according to
objective Eqn.(1); in this case, the prompt chosen at test or inference time is the one that achieves
the overall best training performance. We then consider the setting BoTr Eqn.(2), where Eqn.(2)
becomes the objective; in this setting, the proxy reward model is used for the query-dependent
best-of-N selection among training prompts. Alternatively, we compare against an LLM-confidence-
based baseline for approaching Eqn.(2), which uses LLMs to report how confident they are in
different prompted answers. LLM Confidence then selects the most confident answer among those
produced by training prompts, according to the confidence scores provided by the LLMs. Finally,
our method Prompt-OIRL selects the best prompt from both training-time prompts and held-out
prompts with regard to the learned reward model.
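The compared selection rules can be summarized with the following sketch; the function names are ours, r_train is a table of training-time rewards, and rhat(x, k) stands in for the learned proxy reward.

```python
import numpy as np

def bo_tr_eqn1(r_train: np.ndarray) -> int:
    """BoTr Eqn.(1): the single training prompt with the best average training reward."""
    return int(r_train.mean(axis=0).argmax())

def bo_tr_eqn2(x, train_prompt_ids, rhat):
    """BoTr Eqn.(2): query-dependent choice among training prompts, scored by rhat."""
    return max(train_prompt_ids, key=lambda k: rhat(x, k))

def prompt_oirl(x, train_prompt_ids, heldout_prompt_ids, rhat):
    """Prompt-OIRL: best-of-N over training and held-out prompts under rhat."""
    return max(list(train_prompt_ids) + list(heldout_prompt_ids), key=lambda k: rhat(x, k))
```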

Figure 5: Performance of Prompt-OIRL under two typical settings. In both the scarce and the rich
demonstration data settings, Prompt-OIRL achieves better performance.
Results  We present results for two typical settings in Figure 5 under different demonstration avail-
ability: (1) the scarce demonstration setting (left panel) with K = 1, where only 1 prompt is
available for reward model learning; and (2) the rich demonstration setting (right panel) with
K = 5, where more prompt demonstrations are available. We note that when K = 1, there is
no difference between using Eqn.(1) and Eqn.(2) in the best-of-training strategy.
From the results, we can conclude: (1) in the scarce demonstration setting, where demonstration
data is collected with a single prompt, Prompt-OIRL significantly (+24.3%) outperforms the best-
of-training strategy; (2) in the rich demonstration setting, where multiple prompts are available,
Prompt-OIRL is significantly better (+8.8%) than using the query-agnostic objective of Eqn.(1);
(3) in the rich demonstration setting, Prompt-OIRL is slightly better (+1.8%) than selecting the
best training strategy according to the learned reward model — this is not surprising, as the prompts
we used for training have all been verified to be helpful themselves; (4) in both settings, Prompt-OIRL
achieves significantly better results (by +24.3% and +14.7%, respectively) than the LLM confidence
score-based baseline — without further interactions with the LLMs.
Additionally, we vary the availability of prompts used for training and provide a performance
comparison averaged over all tasks and LLMs we benchmarked in Figure 6. In general, using more
prompts in training leads to better results. The results under the different individual settings are
provided in Appendix D.1.

Figure 6: The averaged performance of different methods under varying demonstration data availability in
arithmetic reasoning.
Take-Aways Prompt-OIRL fulfills the Adjusted Objective and achieves superior perfor-
mance over the baselines under various demonstration data availability. When demonstration
data is limited, Prompt-OIRL achieves a remarkably higher success rate in getting correct an-
swers, and its performance can be further improved with an increase in data availability.


5.2 ADDRESSING CHALLENGE 1 THROUGH OFFLINE PROMPT EVALUATION

Experiment Setup  In this section, we study the effectiveness of the learned reward model by
verifying its generalization ability from two perspectives: (1) seen-prompt + unseen-query: we
evaluate the effectiveness of the learned reward model with training-time prompts on held-out
queries; and (2) unseen-prompt + unseen-query: we evaluate the effectiveness of the learned reward
model with held-out prompts on held-out queries. To enable the performance evaluation on held-out
prompts, we collect interaction logs on each language model for 10 held-out prompts. Details of
those held-out prompts are provided in Appendix C.3.
We benchmark our approach against the language-models-for-self-criticism (LMSC) baseline (Wang
et al., 2023). In this method, LLMs are tasked with verifying the accuracy of the answer, given
both the query and the prompted response. Throughout our experiments, we gauge performance
using held-out test queries. In our subsequent results, the notation Ours (Q) refers to evaluation
outcomes employing training-time prompts (with "Q" representing test queries), while Ours (P)
refers to evaluations conducted on held-out prompts (with "P" representing test prompts). The
same notations are applied to LMSC as LMSC (Q) and LMSC (P). Moreover, we compare results
under varying prompt demonstration data availability. Specifically, we change the number of prompts,
denoted K, used in the training of the reward model. The result we report for each K is averaged
over the $\binom{6}{K}$ combinations. Accuracy and precision are used as the evaluation metrics.
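For reference, the two reported metrics reduce to the following computations over binary labels (whether a prompt actually yielded a correct answer) and the reward model's binary predictions; this minimal NumPy sketch is equivalent to standard library implementations.

```python
import numpy as np

def accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Fraction of (query, prompt) pairs whose correctness is predicted correctly."""
    return float((y_true == y_pred).mean())

def precision(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Among pairs predicted to yield a correct answer, the fraction that actually do."""
    predicted_pos = y_pred == 1
    if predicted_pos.sum() == 0:
        return 0.0
    return float((y_true[predicted_pos] == 1).mean())
```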

Table 3: Effectiveness of the learned reward model is demonstrated through a comparison with the LLM-based
self-critic. Accuracies and precisions in predicting whether correct answers can be obtained on held-out queries
are reported as evaluation metrics. Higher is better.

Task    Method           LLaMA2-7B-Chat      TigerBot-13B-Chat    GPT-3.5-turbo
                         Acc.    Precision   Acc.    Precision    Acc.    Precision
MAWPS   LMSC (Q)         0.47    0.433       0.776   0.175        0.662   0.664
        Ours (Q) K = 1   0.784   0.621       0.952   0.593        0.96    0.965
        Ours (Q) K = 5   0.795   0.632       0.95    0.593        0.96    0.965
        LMSC (P)         0.457   0.441       0.414   0.692        0.661   0.661
        Ours (P) K = 1   0.621   0.569       0.467   0.737        0.767   0.82
        Ours (P) K = 5   0.658   0.595       0.496   0.795        0.803   0.825
SVAMP   LMSC (Q)         0.474   0.441       0.434   0.735        0.801   0.805
        Ours (Q) K = 1   0.791   0.646       0.735   0.764        0.964   0.975
        Ours (Q) K = 5   0.8     0.65        0.738   0.764        0.962   0.97
        LMSC (P)         0.464   0.447       0.421   0.739        0.795   0.796
        Ours (P) K = 1   0.634   0.602       0.685   0.741        0.824   0.883
        Ours (P) K = 5   0.655   0.593       0.707   0.751        0.861   0.885
GSM8K   LMSC (Q)         0.205   0.187       0.635   0.537        0.662   0.656
        Ours (Q) K = 1   0.773   0.289       0.642   0.53         0.659   0.663
        Ours (Q) K = 5   0.726   0.301       0.646   0.542        0.656   0.657
        LMSC (P)         0.207   0.195       0.636   0.515        0.653   0.648
        Ours (P) K = 1   0.747   0.269       0.631   0.516        0.653   0.658
        Ours (P) K = 5   0.647   0.246       0.643   0.54         0.648   0.648

Results Table 3 showcases the results across the 3 tasks employing the 3 LLMs. From the results,
we can deduce: (1). On held-out queries, we find the learned reward model is able to accurately
predict whether a certain prompt can lead to a correct answer. This holds true for both training
prompts and held-out test prompts. (2). The prediction accuracy and precision of the learned reward
model are significantly better than the LMSC baseline, demonstrating the superiority of leveraging
the learned model rather than the LM itself in evaluating prompts and prompted answers. (3). Incor-
porating more prompts into the training dataset generally improves performance, particularly when
evaluating held-out prompts. For more comprehensive results, please refer to Appendix D.2.

Take-Aways  Challenge 1 can be effectively solved by Prompt-OIRL, which learns a reward
model that is able to accurately evaluate prompts in an offline manner. The learned reward
model is effective in predicting whether a given prompt can solve a given query without access
to the language model. The prediction accuracy and precision are in general significantly better
than the self-critic baseline, in which online interactions with LLMs are needed.


5.3 ADDRESSING CHALLENGE 2: COST-EFFICIENT PROMPT OPTIMIZATION

Experiment Setup  As previously argued, utilizing the proxy reward model for prompt optimiza-
tion, as opposed to directly using language models, results in significantly reduced costs. Indeed,
the costs associated with inference time can be even more substantial (Chen et al., 2023). In this
section, we emphasize the cost-efficiency advantage of Prompt-OIRL during inference-time prompt
optimization.
Prompt-OIRL evaluates prompts through its offline reward model and obviates the need for extra
LLM interactions during inference. Picture a scenario where, for each query at inference, there are
K potential prompts to assess and select from. With Prompt-OIRL, the optimal prompt is chosen
without LLM interaction, ensuring that only the chosen prompt undergoes inference with the LLM to
produce a single response. Conversely, evaluating prompts via LLMs — by gauging their confidence
or dissecting the nuances of the various answers — demands extensive interaction prior to settling
on a choice. We provide a detailed cost analysis of these methods, outlining the monetary implica-
tions of employing the GPT-3.5-turbo and TigerBot-13B-Chat APIs. We report the inference time,
quantified in GPU hours, for LLaMA2-7B-chat, which is operated locally on an NVIDIA A4000 GPU.
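The per-query call counts behind this comparison can be summarized with a small sketch: the LLM-based baseline needs K calls to produce prompted answers plus K calls to self-evaluate them, whereas Prompt-OIRL sends only the selected prompt to the LLM. Dollar costs additionally depend on token counts and API pricing, which this sketch leaves out.

```python
def llm_calls_per_query(K: int):
    """Number of LLM interactions needed for one query with K candidate prompts."""
    prompt_oirl_calls = 1    # offline reward model picks the prompt; one generation call
    llm_based_calls = 2 * K  # K answer-generation calls + K confidence/self-check calls
    return prompt_oirl_calls, llm_based_calls

for K in (2, 4, 6):
    ours, baseline = llm_calls_per_query(K)
    print(f"K={K}: Prompt-OIRL uses {ours} LLM call, the LLM-based baseline uses {baseline}")
```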

Results  Figure 7 illustrates the inference-time costs associated with various LLMs and prompt
selection strategies. Employing Prompt-OIRL for prompt optimization proves to be substantially
more cost-efficient than methods reliant on LLMs. In the latter approach, all prompts need to be
processed to obtain distinct prompted answers, necessitating K interactions with the LLMs. Fol-
lowing this, the LLMs are queried to verify the correctness of those answers and provide their
confidence scores, incurring an additional K interactions. In contrast, Prompt-OIRL utilizes the
offline reward model to pinpoint the most suitable prompt and only forwards the chosen prompt to
the LLM for processing.

Figure 7: The inference cost of different methods for a single query under different numbers of prompt choices.

Take-Aways  Challenge 2 is effectively mitigated by Prompt-OIRL. Prompt-OIRL permits
offline evaluation and optimization of prompts through the learned reward model, whereas the
LLM-based methods rely on additional LLM interactions to evaluate prompted answers. Con-
sequently, Prompt-OIRL offers a distinct advantage in terms of cost efficiency.

6 CONCLUSION AND DISCUSSION


We propose Prompt-OIRL, a novel approach grounded in offline inverse reinforcement learning, de-
signed to reconcile effective and cost-efficient query-dependent prompt evaluation and optimization.
This method leverages offline datasets from existing evaluations, utilizing Inverse-RL to craft a re-
ward model tailored for offline, query-specific prompt evaluations. Prompt-OIRL offers several ben-
efits: it forecasts prompt efficacy, minimizes costs, and explores the prompt space more effectively
— all at a query-dependent level. We validate our approach across various LLMs and arithmetic
reasoning datasets, underscoring its viability as a formidable solution for query-dependent offline
prompt evaluation and optimization.
While our research primarily centers on arithmetic reasoning tasks, we wish to underscore the versa-
tility of Prompt-OIRL’s insights for broader applications, especially where there exists a prompting
demonstration dataset accompanied by ratings of the prompted responses. As a hypothetical ap-
proach to dataset construction with human annotators incorporated into the process, consider this:
human annotators could employ LLMs to accomplish specific tasks. They might offer multiple
prompts as instructions for the task, and the ensuing LLM responses can then be graded based on
proficiency in executing the given task. In fact, these annotators could be everyday LLM users keen
on evaluating diverse responses. We earmark this intriguing concept for subsequent exploration.


REFERENCES
Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In
Proceedings of the twenty-first international conference on Machine learning, pp. 1, 2004.
Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline
reinforcement learning. In International Conference on Machine Learning, pp. 104–114. PMLR,
2020.
Christopher G Atkeson and Stefan Schaal. Robot learning from demonstration. In ICML, volume 97,
pp. 12–20. Citeseer, 1997.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn
Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless
assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862,
2022.
Homanga Bharadhwaj, Aviral Kumar, Nicholas Rhinehart, Sergey Levine, Florian Shkurti, and Ani-
mesh Garg. Conservative safety critics for exploration. arXiv preprint arXiv:2010.14497, 2020.
Daniel Brown, Wonjoon Goo, Prabhat Nagarajan, and Scott Niekum. Extrapolating beyond sub-
optimal demonstrations via inverse reinforcement learning from observations. In International
conference on machine learning, pp. 783–792. PMLR, 2019.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while
reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023.
Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong
Chen, Rory Mitchell, Ignacio Cano, Tianyi Zhou, et al. Xgboost: extreme gradient boosting. R
package version 0.4-2, 1(4):1–4, 2015.
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario
Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. Von
Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Ad-
vances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.,
2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/
file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to
solve math word problems. arXiv preprint arXiv:2110.14168, 2021a.
Karl W Cobbe, Jacob Hilton, Oleg Klimov, and John Schulman. Phasic policy gradient. In Interna-
tional Conference on Machine Learning, pp. 2020–2027. PMLR, 2021b.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hi-
erarchical image database. In 2009 IEEE conference on computer vision and pattern recognition,
pp. 248–255. Ieee, 2009.
Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song,
Eric P Xing, and Zhiting Hu. Rlprompt: Optimizing discrete text prompts with reinforcement
learning. arXiv preprint arXiv:2205.12548, 2022.
Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improv-
ing factuality and reasoning in language models through multiagent debate. arXiv preprint
arXiv:2305.14325, 2023.
Pete Florence, Corey Lynch, Andy Zeng, Oscar A Ramirez, Ayzaan Wahid, Laura Downs, Adrian
Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. Implicit behavioral cloning. In
Conference on Robot Learning, pp. 158–168. PMLR, 2022.


Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep
data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without
exploration. arXiv preprint arXiv:1812.02900, 2018a.
Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing function approximation error in
actor-critic methods. arXiv preprint arXiv:1802.09477, 2018b.
Divyansh Garg, Shuvam Chakraborty, Chris Cundy, Jiaming Song, and Stefano Ermon. Iq-learn:
Inverse soft-q learning for imitation. Advances in Neural Information Processing Systems, 34:
4028–4039, 2021.
Ryan Greene, Ted Sanders, Lilian Weng, and Arvind Neelakantan. New and improved embedding
model, 2022.
Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. Ppt: Pre-trained prompt tuning for few-shot
learning. arXiv preprint arXiv:2109.04332, 2021.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-
policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint
arXiv:1801.01290, 2018.
Yaru Hao, Zewen Chi, Li Dong, and Furu Wei. Optimizing prompts for text-to-image generation.
arXiv preprint arXiv:2212.09611, 2022.
Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. Advances in neural
information processing systems, 29, 2016.
Dave Hulbert. Tree of knowledge: Tok aka tree of knowledge dataset for large language models llm.
https://github.com/dave1010/tree-of-thought-prompting, 2023.
Daniel Jarrett, Alihan Hüyük, and Mihaela Van Der Schaar. Inverse decision modeling: Learning
interpretable representations of behavior. In International Conference on Machine Learning, pp.
4755–4771. PMLR, 2021.
Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-
Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural
information processing systems, 30, 2017.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large
language models are zero-shot reasoners. Advances in neural information processing systems,
35:22199–22213, 2022.
Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via
bootstrapping error reduction. arXiv preprint arXiv:1906.00949, 2019.
Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tuto-
rial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal,
Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented genera-
tion for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:
9459–9474, 2020.
Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv
preprint arXiv:2101.00190, 2021.
Xiaonan Li and Xipeng Qiu. Mot: Pre-thinking and recalling enable chatgpt to self-improve with
memory-of-thoughts. arXiv preprint arXiv:2305.05181, 2023.
Zekun Li, Baolin Peng, Pengcheng He, Michel Galley, Jianfeng Gao, and Xifeng Yan. Guiding large
language models via directional stimulus prompting. arXiv preprint arXiv:2302.11520, 2023.


Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng
Tu, and Shuming Shi. Encouraging divergent thinking in large language models through multi-
agent debate. arXiv preprint arXiv:2305.19118, 2023.
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa,
David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv
preprint arXiv:1509.02971, 2015.
Jiacheng Liu, Alisa Liu, Ximing Lu, Sean Welleck, Peter West, Ronan Le Bras, Yejin Choi, and Han-
naneh Hajishirzi. Generated knowledge prompting for commonsense reasoning. arXiv preprint
arXiv:2110.08387, 2021.
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan,
Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around
player? arXiv preprint arXiv:2307.06281, 2023a.
Zhihan Liu, Hao Hu, Shenao Zhang, Hongyi Guo, Shuqi Ke, Boyi Liu, and Zhaoran Wang. Reason
for future, act for now: A principled framework for autonomous llm agents with provable sample
efficiency. arXiv preprint arXiv:2309.17382, 2023b.
Jieyi Long. Large language model guided tree-of-thought. arXiv preprint arXiv:2305.08291, 2023.
Ajay Mandlekar, Danfei Xu, Roberto Martı́n-Martı́n, Silvio Savarese, and Li Fei-Fei. Learn-
ing to generalize across long-horizon tasks from human demonstrations. arXiv preprint
arXiv:2003.06085, 2020.
Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-
Fei, Silvio Savarese, Yuke Zhu, and Roberto Martı́n-Martı́n. What matters in learning from offline
human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298, 2021.
Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke
Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? arXiv
preprint arXiv:2202.12837, 2022.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Belle-
mare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level
control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Over-
coming exploration in reinforcement learning with demonstrations. In 2018 IEEE international
conference on robotics and automation (ICRA), pp. 6292–6299. IEEE, 2018.
Andrew Y Ng, Stuart Russell, et al. Algorithms for inverse reinforcement learning. In Icml, vol-
ume 1, pp. 2, 2000.
OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
OpenCompass. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong
Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow
instructions with human feedback. Advances in Neural Information Processing Systems, 35:
27730–27744, 2022.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math
word problems? arXiv preprint arXiv:2103.07191, 2021.
Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. Advances in neural
information processing systems, 1, 1988.
Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt
optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023.


Guanghui Qin and Jason Eisner. Learning how to ask: Querying lms with mixtures of soft prompts.
arXiv preprint arXiv:2104.06599, 2021.
Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions
for squad. arXiv preprint arXiv:1806.03822, 2018.
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified,
real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pp. 779–788, 2016.
Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and struc-
tured prediction to no-regret online learning. In Proceedings of the fourteenth international con-
ference on artificial intelligence and statistics, pp. 627–635. JMLR Workshop and Conference
Proceedings, 2011.
Subhro Roy and Dan Roth. Solving general arithmetic word problems. arXiv preprint
arXiv:1608.01413, 2016.
Stefan Schaal. Learning from demonstration. Advances in neural information processing systems,
9, 1996.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region
policy optimization. In International conference on machine learning, pp. 1889–1897, 2015.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Autoprompt:
Eliciting knowledge from language models with automatically generated prompts. arXiv preprint
arXiv:2010.15980, 2020.
Noah Y Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Ne-
unert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller. Keep doing
what worked: Behavioral modelling priors for offline reinforcement learning. arXiv preprint
arXiv:2002.08396, 2020.
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche,
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering
the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
TigerResearch. Tigerbot: A cutting-edge foundation for your very own llm. https://github.com/TigerResearch/TigerBot, 2023.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Niko-
lay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open founda-
tion and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Juny-
oung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster
level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou, and Daniel Cer. Spot: Better frozen model
adaptation through soft prompt transfer. arXiv preprint arXiv:2110.07904, 2021.
Rui Wang, Hongru Wang, Fei Mi, Yi Chen, Ruifeng Xu, and Kam-Fai Wong. Self-critique prompt-
ing with large language models for inductive instructions. arXiv preprint arXiv:2305.13733,
2023.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdh-
ery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.
arXiv preprint arXiv:2203.11171, 2022.


Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du,
Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint
arXiv:2109.01652, 2021.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny
Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in
Neural Information Processing Systems, 35:24824–24837, 2022.
Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning.
arXiv preprint arXiv:1911.11361, 2019a.
Yueh-Hua Wu, Nontawat Charoenphakdee, Han Bao, Voot Tangkaratt, and Masashi Sugiyama. Imi-
tation learning from imperfect demonstration. In International Conference on Machine Learning,
pp. 6818–6827. PMLR, 2019b.
Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun
Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik
Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv
preprint arXiv:2305.10601, 2023.
Chi Zhang, Sanmukh Rao Kuppannagari, and Viktor Prasanna. Brac+: Going deeper with behavior
regularized offline reinforcement learning. 2020.
Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Xi Chen, Ken Goldberg, and Pieter Abbeel.
Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In
2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 5628–5635. IEEE,
2018.
Tianjun Zhang, Xuezhi Wang, Denny Zhou, Dale Schuurmans, and Joseph E Gonzalez. Tempera:
Test-time prompt editing via reinforcement learning. In The Eleventh International Conference
on Learning Representations, 2022a.
Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in
large language models. arXiv preprint arXiv:2210.03493, 2022b.
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuur-
mans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex
reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022a.
Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan,
and Jimmy Ba. Large language models are human-level prompt engineers. arXiv preprint
arXiv:2211.01910, 2022b.


Ethics Statement This paper introduces the pioneering application of offline inverse reinforce-
ment learning to query-dependent prompting, enhancing the arithmetic reasoning capabilities of
LLMs. Yet, deploying such a method inappropriately could yield undesirable results. Addition-
ally, the method’s dependence on LLMs and prompting demonstration data underscores potential
concerns about data privacy and consent, especially as it leans on future offline datasets for human-
centered tasks.

Reproducibility Statement  All code and offline demonstration datasets will be released upon
acceptance. For the sake of full reproducibility, we provide supplemental experiment details in
Appendix C and attach example code in the supplementary material of the submission.

APPENDIX: TABLE OF CONTENTS

A  Extended Discussion on Prompt-OIRL and RLHF
   A.1  Connections and Difference from a Descriptive Perspective
   A.2  Prompt-OIRL and RLHF with the Formal RL Language

B  Extended Related Work
   B.1  Learning from Human Expertise and Imitation Learning
   B.2  Zero-Shot and Few-Shot Prompting
   B.3  Chain-of-Thought Prompting
   B.4  Other Prompting Strategies

C  Supplemental Experiment Details
   C.1  Offline Data Processing and Embeddings
   C.2  Reward Modeling: Implementation Matters
   C.3  Held-Out Test Prompts
   C.4  Code, Hardware and Training Time

D  Detailed Experiment Results
   D.1  Improving Arithmetic Reasoning: How good are the optimized prompts?
   D.2  Reward Modeling: Accuracy and Precision

E  Case Studies
   E.1  Optimal Prompting Strategy is Query Dependent
   E.2  Expert-Crafted Prompts Can All Fail
   E.3  Generated Human-Readable Prompts


A EXTENDED DISCUSSION ON PROMPT-OIRL AND RLHF

A.1 CONNECTIONS AND DIFFERENCE FROM A DESCRIPTIVE PERSPECTIVE

To better understand our contribution, it would be helpful to link and contrast Prompt-OIRL with
the framework of reinforcement learning from human feedback (RLHF) or AI feedback (RLAIF).
In general, the success of RLHF and RLAIF motivates our idea of integrating human expertise in
prompting, and Prompt-OIRL can be interpreted as a special type of prompt policy learning from
offline AI feedback, which differentiates from existing literature.
The fact that Prompt-OIRL first learns a reward model and then performs policy optimization using that reward model makes the method related to RLHF and RLAIF. From this perspective, Prompt-OIRL treats prompt evaluation and optimization as a process of aligning prompting strategies with the preferences of the LLMs, in contrast to aligning LLM responses with human preferences. In analogy with RLHF, which improves the performance of LLMs with regard to human preferences, we hypothesize that this perspective can lead to more effective prompts while avoiding the computationally expensive random exploration of the prompt space.
That said, several key differences distinguish Prompt-OIRL from the existing RLHF and RLAIF literature. RLHF (or RLAIF) often includes the following steps: 1. sampling from pre-trained language models to generate dialogue data; 2. human (or AI) raters are then asked to rank those generated data according to given criteria such as harmlessness and helpfulness; 3. a preference model is trained on the ranked dataset as a proxy for the human (or AI) raters; 4. the language model is fine-tuned with the learned preference model using reinforcement learning.
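As a concrete reference for step 3 above, the preference model in RLHF-style pipelines is commonly trained with a Bradley-Terry style objective over ranked response pairs; a standard form of that loss (general RLHF background, not specific to Prompt-OIRL) is:

```latex
% Standard Bradley-Terry preference-modeling loss used in RLHF-style pipelines:
% r_phi is the learned reward model, x a query, and (y_w, y_l) a response pair
% in which the rater prefers y_w over y_l.
\mathcal{L}(\phi) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}
\left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]
```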
By contrast, we would like to highlight the differences in objective, in the non-interactive offline learning setting, and in the flexibility of the optimization procedure. First, Objective: the objective of Prompt-OIRL is to evaluate and optimize prompting strategies, rather than to enhance LLMs' alignment with non-differentiable objectives. Importantly, the learned reward model in Prompt-OIRL can work in isolation as a prompt evaluator. Second, Non-Interactive Offline Setting: Prompt-OIRL works in a purely offline setting, rather than assuming access to human or AI raters who actively generate feedback. Third, Flexible Optimization Procedure: the optimization procedure in Prompt-OIRL is highly flexible and need not be a language model as in RLHF and RLAIF. For instance, the prompt optimization process can collaborate with a human prompt engineer, who proposes potential prompting strategies and selects the best one according to the learned reward model.

A.2 PROMPT-OIRL AND RLHF WITH THE FORMAL RL LANGUAGE.

Reinforcement Learning and Offline-RL In Reinforcement Learning (RL), an agent learns


through interacting with an environment and receiving feedback in the form of rewards (Sutton
& Barto, 2018). The fundamental objective of RL is to find a policy, which is a mapping from states
to actions, that maximizes the expected cumulative reward over time.
Such a learning paradigm can be formally represented using the Markov Decision Processes
(MDPs), where decisions are made in discrete time steps, and each decision affects the state of
the environment in the subsequent step.
Formally, we denote the MDP as $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \rho_0, \gamma\}$, where $\mathcal{S} \subset \mathbb{R}^d$ denotes the $d$-dimensional state space and $\mathcal{A}$ is the action space. Broadly, the environment includes $\mathcal{T}$ and $\mathcal{R}$: the former denotes the transition dynamics $\mathcal{T}: \mathcal{S} \times \mathcal{A} \mapsto \mathcal{S}$ that controls transitions between states, and the reward function $\mathcal{R}: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$ provides feedback. $\rho_0 = p(s_0) \in \Delta(\mathcal{S})$ denotes the initial state distribution, and $\gamma$ is the discount factor that trades off between short-term and long-term returns.
Online RL considers the problem of learning a policy $\pi \in \Pi: \mathcal{S} \mapsto \Delta(\mathcal{A})$ such that the expected cumulative reward $\mathbb{E}_{a_t \sim \pi,\, s_{t+1} \sim \mathcal{T},\, s_0 \sim \rho_0}\left[\sum_{t=0}^{T} \gamma^t r_t(s_t, a_t)\right]$ in the MDP is maximized. In the online RL setting, an agent learns through trial and error: it actively interacts with the environment, including both the transition dynamics $\mathcal{T}$ and the reward function $\mathcal{R}$, and optimizes its policy through either on-policy (Schulman et al., 2015; 2017; Cobbe et al., 2021b) or off-policy algorithms (Lillicrap et al., 2015; Mnih et al., 2015; Fujimoto et al., 2018b; Haarnoja et al., 2018).
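Written as a single objective in the notation defined above, the online RL problem can be summarized as the following display:

```latex
% Online RL objective with the MDP notation above: find the policy that
% maximizes the expected discounted return.
J(\pi) \;=\; \mathbb{E}_{s_0 \sim \rho_0,\; a_t \sim \pi(\cdot\mid s_t),\; s_{t+1} \sim \mathcal{T}(\cdot\mid s_t, a_t)}
\left[\, \sum_{t=0}^{T} \gamma^{t}\, r_t(s_t, a_t) \right],
\qquad
\pi^{*} \;=\; \arg\max_{\pi \in \Pi} \; J(\pi).
```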


Figure 8: Pictorial illustration of Online RL: an agent actively interacts with the environment, which
is composed of the dynamics model that controls transition, and the reward model that provides
feedback.

At each time step $t$, an agent observes a state $s_t$ from the environment and selects an action $a_t$ according to its policy $\pi$. Upon taking the action, the agent receives a reward $r_t$ and transitions to a new state $s_{t+1}$. The agent's objective is to maximize its expected return.

Figure 9: Pictorial illustration of Offline RL: a decision dataset is collected through interaction logs between a behavior agent and the environment. Such an offline dataset is then used to optimize a parameterized policy, the learnable agent.

Offline RL, also known as batch-RL or data-driven RL, focuses on learning optimal policies from a
fixed dataset of interaction data without further interaction with the environment. This approach is
particularly relevant in scenarios where online interaction is expensive, risky, or impractical. The
primary challenge in Offline RL is that the dataset may not sufficiently cover important regions
of the state-action space, which can lead to extrapolation errors and suboptimal policies (Fujimoto
et al., 2018a).
Offline RL algorithms aim to leverage this fixed dataset to derive a policy that would perform well if deployed in the real environment. Techniques often employed in Offline RL include constraining the policy to only consider data in the dataset, regularizing against drastic changes from a behavior policy, and employing uncertainty estimation to avoid regions where the dataset provides limited information (Kumar et al., 2019; Wu et al., 2019a; Siegel et al., 2020). In general, such a dataset can be generated by rolling out an expert that produces high-quality solutions to the task (Fujimoto et al., 2018a; Zhang et al., 2020; Fu et al., 2020), by a non-expert that executes sub-optimal behaviors (Fu et al., 2020; Wu et al., 2019a; Kumar et al., 2019; Agarwal et al., 2020; Jarrett et al., 2021), or by a mixture of both (Bharadhwaj et al., 2020).
That said, existing offline-RL research mainly focuses on dense-reward tasks like robotics control (Fu et al., 2020; Levine et al., 2020). Optimizing for trajectory-level rewards, as is the case in LLM research where responses may only be evaluated at the sentence level rather than token-wise, is relatively underexplored.

Imitation Learning, Behavior Cloning, and Inverse RL Imitation learning and learning from
demonstrations are widely studied in the field of reinforcement learning and robotic learning (Ng
et al., 2000; Abbeel & Ng, 2004; Ho & Ermon, 2016; Garg et al., 2021), aiming at learning policies
with a batch of expert trajectories. Working as the most straightforward solution, Behavior Cloning
(BC) (Pomerleau, 1988; Florence et al., 2022) optimizes a policy through supervised learning, yet
may suffer from compounding error or instability in learning from sub-optimal demonstrations (Ross
et al., 2011; Wu et al., 2019b). It is shown in Mandlekar et al. (2021) that data quality is of great
importance in the success of BC. It is worth noting that, while imitation learning assumes the un-
derlying reward mechanism is unknown, the learning from demonstration literature always uses the


demonstrations as a warm-start for reward-sparse tasks like robotics manipulation (Schaal, 1996;
Nair et al., 2018).

Figure 10: Pictorial illustration of IL: a dynamics model is always needed to obtain observations.
The decision dataset can be used to provide a reference for the desired behavior. Learning (the pink
dashed line) can be achieved by minimizing the behavior divergence between the decision dataset
and the agent.

Figure 11: Pictorial illustration of IRL: Inverse RL first learns a reward model, with which the
imitation learning now becomes an “online” RL problem — the accessible dynamics provide obser-
vations, and the learned reward model provides feedback.

LLM Alignment with Human Preference The task of LLM alignment under human preference can be framed as RL, yet it faces the challenge that online learning is impractical. See Figure 12 for a more detailed explanation.

Figure 12: Pictorial illustration of LLM alignment as an online RL task: humans act as the reward model, and the LLM itself acts as the dynamics model. However, this is only the idealized case, as it is normally unaffordable to keep humans in the loop at all times.

In practice, it is feasible to collect a limited amount of human feedback and save it as a decision dataset. In such cases, LLM alignment turns from an online RL problem into an offline RL problem. As discussed above, offline RL also suffers from learning difficulties. However, a unique property of the LLM alignment problem enables imitation learning with such alignment (human-feedback) data.

Figure 13: Pictorial illustration of LLM alignment as an offline RL task: The alignment dataset is
collected through a limited number of interactions between humans and LLMs.


RLHF: Solving an Offline-RL Task with Online Inverse RL We highlight that RLHF solves the offline RL task of aligning LLMs with human preferences from an offline dataset via online IRL. Figure 14 and Figure 15 illustrate such an interpretation.

Figure 14: Pictorial illustration of RLHF as IL: as LLM itself acts as the dynamics model, the offline
RL problem of LLM alignment with human preference becomes an imitation learning problem.

Figure 15: Pictorial illustration of RLHF as Online IRL: Approaching the imitation learning problem
through IRL, RLHF first learns a reward model, with which it has both components in an environ-
ment — the dynamics model (itself) and the reward model (learned). Therefore, RLHF can apply
conventional online RL algorithms like PPO as its solver.

Prompting as Policy Learning: Alignment with LLM Preference While prompt optimization is by nature an RL task and can be approached by learning from trial and error, randomly exploring the action space of natural language is infeasible. An analogy is the task of LLM alignment with human feedback, where an LLM agent interacts with human annotators to collect feedback on its responses and seeks better alignment with the annotators' preferences. In such a task, it is also unrealistic and infeasible to always keep humans in the loop. Therefore, RLHF techniques are proposed to find solutions when only an offline alignment dataset is available.
On the other hand, prompt optimization is an LLM-centric type of alignment, as opposed to human-centric alignment: it aligns the prompter's prompting strategies with the LLMs' preferences and feedback. It faces the same type of challenge: keeping the LLM in the loop throughout optimization is not feasible. Figure 16 provides a pictorial illustration of how prompting can be interpreted as an (expensive) online RL task.

Figure 16: Pictorial illustration of prompting as an (expensive) online RL task: the reward calcula-
tion requires extensive usage of the LLM, which is expensive in practice.

Prompt-OIRL: Solving an Offline RL Task with Offline Inverse RL The key observation and claim we emphasize is that Prompt-OIRL can be interpreted as solving the offline RL problem of aligning prompting strategies with LLMs' feedback through offline IRL.


Offline Data as By-Products In the main text, we have highlighted the existence and importance of prompt demonstrations generated in benchmarking existing prompts. For instance, in the research on enhancing the arithmetic reasoning abilities of LLMs, multiple zero-shot prompts have been proposed, such as CoT (Kojima et al., 2022), APE (Zhou et al., 2022b), and more recently OPRO (Yang et al., 2023).
To benchmark the effectiveness of those proposed prompts, previous researchers use standardized, openly accessible arithmetic reasoning datasets and report the overall success rate of obtaining correct answers. Figure 17 provides a pictorial illustration of how the evaluation of existing prompts generates prompt-alignment demonstration datasets as its by-product.
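To make this by-product nature concrete, the following sketch shows how benchmarking a set of zero-shot prompts against an LLM on a dataset with golden answers naturally produces (query, prompt, reward) triples. The functions `query_llm` and `extract_answer` are hypothetical placeholders for the actual LLM call and answer parsing, not part of our released code.

```python
# Illustrative sketch only: benchmarking zero-shot prompts yields an offline
# demonstration dataset as a by-product. `query_llm` and `extract_answer` are
# hypothetical placeholders to be replaced by the actual LLM API call and the
# answer-parsing routine.
def query_llm(prompt: str, question: str) -> str:
    raise NotImplementedError  # e.g., call GPT-3.5-turbo / LLaMA-2-chat here

def extract_answer(response: str) -> str:
    raise NotImplementedError  # parse the final numeric answer from the response

def collect_demonstrations(questions, golden_answers, prompts):
    """Log one (query, prompt, reward) triple per query-prompt pair."""
    demonstrations = []
    for x, y_star in zip(questions, golden_answers):
        for pi in prompts:  # e.g., CoT, APE, ToT, ... zero-shot instructions
            response = query_llm(pi, x)
            reward = int(extract_answer(response) == y_star)  # 1 if correct, else 0
            demonstrations.append({"query": x, "prompt": pi, "reward": reward})
    return demonstrations
```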

Figure 17: Pictorial illustration of prompting as an offline RL task.

Solving Offline RL by Offline Inverse RL The key observation is that zero-shot prompting is a single-step decision-making problem. Therefore, even without the transition dynamics model (the LLM), we are still able to conduct inverse RL in an offline manner. Prompt-OIRL circumvents the difficulty of evaluating prompts through offline reward modeling. Figure 18 provides a pictorial illustration of this interpretation.

Figure 18: Pictorial illustration of prompting as an Offline IRL task: the offline prompt demonstration dataset is used to learn a reward model, as a function of the query and the prompt, to avoid the requirement of an LLM during evaluation. Prompt-OIRL then leverages this learned reward model for prompt optimization in a purely offline manner.
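For illustration, a minimal sketch of this offline optimization step is given below: given a learned reward model that scores a (query, prompt) pair, a best-of-N strategy simply ranks the candidate prompts for each query. The names `reward_model` and `embed` are assumptions standing in for the learned proxy reward and the text-embedding function; this is a sketch, not the exact released implementation.

```python
# Minimal sketch of offline best-of-N prompt selection with a learned reward
# model. `reward_model` is assumed to be any callable mapping a feature vector
# to a predicted reward; `embed` maps text to a fixed-length embedding.
import numpy as np

def select_prompt(query, candidate_prompts, reward_model, embed):
    """Return the candidate prompt with the highest predicted reward for `query`."""
    e_x = embed(query)
    scores = [reward_model(np.concatenate([e_x, embed(p)]))
              for p in candidate_prompts]
    return candidate_prompts[int(np.argmax(scores))]  # no LLM call is needed here
```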

Conclusive Remark RLHF solves the problem of learning from an offline LLM-alignment dataset with human feedback by turning the offline RL problem into an online IRL problem, and then into an online RL problem once the reward model is learned. Prompt-OIRL solves the problem of learning from an offline prompting dataset with LLM feedback by turning the offline RL problem into an offline IRL problem, thereby removing the dependency on LLMs during evaluation and optimization.

B EXTENDED RELATED WORK


B.1 LEARNING FROM HUMAN EXPERTISE AND IMITATION LEARNING

Recent advances in Large Language Models (LLMs) have shown impressive performance as general-purpose agents, especially with the remarkable success of ChatGPT, where human knowledge is injected via reinforcement learning from human feedback (RLHF) (Christiano et al., 2017; Ouyang et al., 2022; OpenAI, 2023). During the training of those models, human feedback is leveraged to reduce harmfulness and at the same time improve the instruction-following ability of the models. In Liu et al. (2023b), LLMs are instantiated as decision-making agents to master various skills.
Imitation learning and learning from demonstrations are widely studied in the field of reinforce-
ment learning and robotic learning (Ng et al., 2000; Abbeel & Ng, 2004; Ho & Ermon, 2016; Garg


et al., 2021), aiming at learning policies with a batch of expert trajectories. Working as the most
straightforward solution, Behavior Cloning (BC) (Pomerleau, 1988; Florence et al., 2022) optimizes
a policy through supervised learning, yet may suffer from compounding error or instability in learn-
ing from sub-optimal demonstrations (Ross et al., 2011; Wu et al., 2019b). It is shown in Mandlekar
et al. (2021) that data quality is of great importance in the success of BC. It is worth noting that,
while imitation learning assumes the underlying reward mechanism is unknown, the learning from
demonstration literature always uses the demonstrations as a warm-start for reward-sparse tasks like
robotics manipulation (Schaal, 1996; Nair et al., 2018).
In prompt generation tasks, the demonstrations are always imperfect. T-REX (Brown et al., 2019) extrapolates beyond a learned reward function to achieve super-demonstration performance. We build our work on T-REX to address the difficulty of learning from imperfect demonstrations. We also benchmark the performance of BC, with specific consideration of the quality of demonstrations (Wu et al., 2019b; Mandlekar et al., 2021).

B.2 ZERO-SHOT AND FEW-SHOT PROMPTING

The ability of zero-shot prompting emerges in language models trained on large amounts of data, such as GPT-3 and GPT-4 (Ouyang et al., 2022; OpenAI, 2023), and it was shown in Wei et al. (2021) that instruction fine-tuning improves the zero-shot learning ability of language models.
Notwithstanding their impressive zero-shot performance, large language models often perform suboptimally on more complex tasks in a zero-shot setting. Few-shot prompting presents a viable approach for facilitating in-context learning (Brown et al., 2020; Min et al., 2022). This technique includes demonstrations within the prompt, effectively guiding the model toward enhanced performance. These demonstrations act as conditioning for subsequent examples, leading the model to generate better responses.

B.3 CHAIN-OF-THOUGHT PROMPTING

In more challenging tasks like complex arithmetic, commonsense, and symbolic reasoning, chain-of-thought (CoT) prompting is shown to be more effective in helping language models reach correct answers (Wei et al., 2022). CoT includes additional reasoning steps in the few-shot prompting examples. Kojima et al. (2022) further introduce zero-shot CoT, showing that adding a task-agnostic instruction can improve model performance on specific tasks. Auto-CoT (Zhang et al., 2022b) combines the universality of zero-shot CoT with the capability of the original demonstration-driven CoT, and proposes to automatically construct demonstrations based on clustering and diversity-based sampling that are beneficial for CoT reasoning.

B.4 OTHER PROMPTING STRATEGIES

Wang et al. (2022) further improve the few-shot CoT method by sampling multiple diverse reasoning paths and marginalizing over those paths, choosing the most consistent answer among all sampled reasoning paths. Li et al. (2023) introduce Directional-Stimulus-Prompting, a framework that uses a tunable language model to provide guidance for the black-box frozen large language model toward desirable properties.
Generated Knowledge Prompting (Liu et al., 2021) improves commonsense reasoning by incorporating knowledge or information related to the questions to make more accurate predictions. Tree-of-Thoughts (ToT) methods (Long, 2023; Yao et al., 2023) combine tree-based planning with the reasoning skills of language models and solve hard reasoning problems step by step via multi-round conversations. Hulbert (2023) also puts forward a related idea that leverages multiple thoughts of a language model in a single prompt. Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) combines parametric memory with non-parametric memory, such as Wikipedia, to complete knowledge-intensive tasks. MoT (Li & Qiu, 2023) pre-thinks over an external unlabeled dataset and then recalls the related knowledge during inference. In the concurrent work of Yang et al. (2023), LLMs are used as optimizers to solve a variety of optimization problems, including prompt optimization. It is worth noting that the two most important differences lie in the offline nature and the query-dependent zero-shot prompting strategy of our approach.


Table 4: Prompt-OIRL differentiates from existing literature on prompt optimization by (1) considering the
Adjusted Objective and optimizing query-dependent prompt; (2) being able to perform offline prompt
evaluation to address Challenge 1; (3) optimizing prompt in the offline setting without access to the LLMs
to address Challenge 2; (4) utilizing existing expert knowledge to reduce the difficulty in RL; (5) generating
human-readable prompts; (6) working on the most general natural language prompt space; (7) free from
gradient information of LLMs; (8) using offline inverse reinforcement learning as the solver of the problem.
Method | (1) Query-Dependent Prompt | (2) Offline Prompt Evaluation | (3) Offline Prompt Optimization | (4) Expert Knowledge Inspired | (5) Human-Readable Prompt | (6) Prompt Space | (7) LLM Gradient Free | (8) Solver Used
Soft-Prompt | ✓ | ✗ | ✗ | ✗ | ✗ | Embeddings | ✗ | Gradient-Guided Search
APO | ✗ | ✗ | ✗ | ✓ | ✓ | X = V^∞ | ✓ | Beam Search
APE | ✗ | ✗ | ✗ | ✗ | ✓ | X = V^∞ | ✗ | Evolution Strategy
TEMPERA | ✓ | ✗ | ✗ | ✓ | ✓ | Edit | ✓ | RL
RLPrompt | ✗ | ✗ | ✗ | ✗ | ✗ | {V^2, V^5} | ✓ | RL
Prompt-OIRL (ours) | ✓ | ✓ | ✓ | ✓ | ✓ | X = V^∞ | ✓ | Offline Inverse-RL
Embeddings: the language of LMs; Edit: includes operations like swap, delete, etc.; V is the vocabulary, and the superscript over it denotes the length of prompts. V^∞ denotes the natural language space, the most general interpretable format.

C SUPPLEMENTAL EXPERIMENT DETAILS

C.1 OFFLINE DATA PROCESSING AND EMBEDDINGS

For LLM $\ell$ and task $k$, we re-organize the offline demonstration dataset using the embedding function $E: \mathcal{V}^{\infty} \mapsto \mathbb{R}^{1536}$, which maps a sequence of natural language context to a fixed-length vector of size 1536 (Greene et al., 2022). Therefore, using $e_x^{(i)} = E(x^{(i)})$, $e_\pi^{(j)} = E(\pi^{(j)})$, and $r^{(ij)} = \mathbb{1}\{y^{*} = \ell(\pi^{(j)}(x^{(i)}))\}$ to denote the embeddings and the reward instantiation, our demonstration dataset of Equation (4) can, in implementation, be expressed as follows:
1. Training Data: $\mathcal{D}^{(\mathrm{train})}_{\ell,k} = \{e_x^{(i)}, e_\pi^{(j)}, r^{(ij)}\}_{i\in\{1,\dots,N\},\, j\in\{1,\dots,K\}}$
2. Test Data on Queries: $\mathcal{D}^{(\mathrm{test\ q})}_{\ell,k} = \{e_x^{(i)}, e_\pi^{(j)}, r^{(ij)}\}_{i\in\{N+1,\dots,N+M\},\, j\in\{1,\dots,K\}}$
3. Test Data on Prompts: $\mathcal{D}^{(\mathrm{test\ p})}_{\ell,k} = \{e_x^{(i)}, e_\pi^{(j)}, r^{(ij)}\}_{i\in\{N+1,\dots,N+M\},\, j\in\{K+1,\dots,K+P\}}$
The GSM8K task contains 7473 queries with golden answers for training and 1319 held-out queries with golden answers for testing; the SVAMP task contains 19690 examples, which are split into a training query-answer set of size 15000 and a testing query-answer set of size 4690; the MAWPS task contains 7685 examples, which are split into a training query-answer set of size 6000 and a testing query-answer set of size 1685. Therefore, for each prompting strategy, 7473, 15000, and 6000 demonstrative examples can be collected from the interaction logs with a given language model on the three tasks, respectively, and used for training.
Our processed offline datasets and code for processing will be provided as open-source assets to
facilitate future studies.
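As a reading aid, the sketch below shows one way the raw interaction logs can be re-organized into the three embedded splits defined above; `embed` stands in for the 1536-dimensional embedding function E, and the function and variable names are illustrative rather than taken from our released code.

```python
# Illustrative sketch of building the train / test-query / test-prompt splits
# defined in Appendix C.1. `embed` is a placeholder for the embedding function E;
# rewards[i][j] = 1 if prompt j led to a correct answer on query i, else 0.
def build_splits(queries, prompts, rewards, embed, n_train, k_train):
    e_x = [embed(x) for x in queries]   # query embeddings e_x^(i)
    e_p = [embed(p) for p in prompts]   # prompt embeddings e_pi^(j)

    def rows(i_range, j_range):
        return [(e_x[i], e_p[j], rewards[i][j]) for i in i_range for j in j_range]

    train  = rows(range(n_train), range(k_train))                              # D^(train)
    test_q = rows(range(n_train, len(queries)), range(k_train))                # D^(test q)
    test_p = rows(range(n_train, len(queries)), range(k_train, len(prompts)))  # D^(test p)
    return train, test_q, test_p
```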

C.2 REWARD MODELING: IMPLEMENTATION MATTERS

In our experiments, we find that gradient boosting methods (Chen et al., 2015; Ke et al., 2017) are significantly better than neural networks for reward modeling, in terms of both computational efficiency and performance. We demonstrate this effectiveness with a minimalist approach, directly applying XGBoost models and leaving further investigation of model selection to future work. That said, to exclude some potential alternatives, we have explored using fully connected neural networks as the reward model and using a ranked dataset mimicking the conventional RLHF setting, yet both of those choices perform worse than our presented implementation.
To enhance replicability, we use the following hyper-parameters for the gradient boosting model (Chen et al., 2015) in all experiment settings:
param = {'max_depth': 10, 'eta': 0.001, 'objective': 'binary:logistic'}
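For completeness, a minimal training sketch with these hyper-parameters is shown below. The feature construction (concatenating query and prompt embeddings) and the number of boosting rounds are assumptions for illustration; the released code should be treated as the reference.

```python
# Minimal sketch of reward-model training with the hyper-parameters above.
# The concatenated-embedding features and num_boost_round are illustrative
# assumptions, not the exact released configuration.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
query_emb  = rng.normal(size=(1000, 1536))   # placeholder for e_x^(i)
prompt_emb = rng.normal(size=(1000, 1536))   # placeholder for e_pi^(j)
labels     = rng.integers(0, 2, size=1000)   # placeholder for r^(ij)

features = np.concatenate([query_emb, prompt_emb], axis=1)
dtrain = xgb.DMatrix(features, label=labels)

param = {'max_depth': 10, 'eta': 0.001, 'objective': 'binary:logistic'}
reward_model = xgb.train(param, dtrain, num_boost_round=500)  # round count: assumed

# The trained booster scores any (query, prompt) pair without querying an LLM.
scores = reward_model.predict(xgb.DMatrix(features[:5]))
```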


Further investigation into hyper-parameter tuning will lead to further improvement, especially in the pursuit of higher precision values on imbalanced prompt demonstration datasets (e.g., when using smaller LLMs like LLaMA-7B for more challenging tasks like GSM8K).

C.3 HELD-OUT TEST PROMPTS

The following 10 prompts are used as a held-out test set for offline prompt evaluation. To enable performance evaluation on held-out prompts, we collect interaction logs on each language model for the 10 held-out novel prompting strategies. We will release the corresponding prompted answers on test queries for those held-out prompts as part of our offline dataset.

1. Approaching this logically, the steps to find the answer are:


2. Let’s break this down into manageable steps to fully understand the problem:
3. Consider this as a puzzle, each piece contributing to the final answer. Let’s place each
piece, one by one:
4. Three scholars are analyzing this query from various perspectives, working collaboratively
to build a comprehensive answer. Each contributes a step:
5. Let’s solve this like a detective would solve a mystery, gathering clues and building up to
the final solution step by step:
6. Imagine we’re navigating a maze; each decision brings us closer to the center. Let’s map
our route:
7. Envision a round table meeting of expert problem solvers. Each participant suggests a step,
building towards a consensus answer:
8. Like an architect constructing a building, let’s design our answer carefully, layer by layer:
9. As if we are assembling a complex machine, let’s put it together piece by piece:
10. Three wise philosophers are debating this question, each contributing a different aspect of
the answer. Let’s follow their discourse:

In order to further verify the generalization ability of the learned reward model, we extend the evaluation to more held-out prompts. We use GPT-4 to generate those prompts through in-context learning. Specifically, we provided the 6 human-crafted prompting strategies to GPT-4 and asked it to generate other potential prompting strategies. Our original chat history and prompts are available at the following anonymous link: https://chat.openai.com/share/2c7652d2-2f96-48e8-b34f-efa4b15f8a61.

1. Like a bee pollinating flowers, let’s gather the essence from each point:
2. Picture this as a chain. Each link strengthens the whole:
3. As a librarian categorizes books, let’s sort the information accordingly:
4. This is like unfolding origami to understand each crease and fold:
5. Imagine it as an echo in a valley, every sound has an origin and meaning:
6. Let’s approach this like a chef creating a new recipe, ingredient by ingredient:
7. As a cartographer maps terrains, let’s chart the nuances and details:
8. Like an athlete training for an event, every exercise has a purpose:
9. Think of it as wind chimes, each note contributing to the melody:
10. It’s like flying a kite. Every tug and adjustment affects its flight:
11. Imagine we’re in a lab, every test and observation is crucial for the conclusion:
12. Like a blacksmith forging metal, let’s shape our understanding with precision:
13. As a botanist identifies plants, let’s classify each detail:
14. This is like a radio tuning to different frequencies, let’s find the right wavelength:
15. Imagine it’s a theater play. Scene by scene, we unfold the plot:
16. Let’s dive into this like a marine biologist exploring the ocean’s depths:


17. As an electrician wires a circuit, each connection powers the system:


18. This is like a gardener pruning a plant, every cut makes it flourish:
19. Think of it as the gears of a bicycle, each turn propelling us forward:
20. Like a mountain climber uses tools, let’s leverage each resource for understanding:
21. As a composer orchestrates a symphony, let’s integrate each instrument:
22. This is like a pot of stew, simmering to meld the flavors:
23. Imagine building a sandcastle, each grain matters for the structure:
24. It’s like peeling an onion, layer by layer, revealing the core:
25. Like a captain steering a ship, each decision adjusts our course:
26. As a mason lays bricks, every placement supports the structure:
27. This is like a game of chess, anticipating each move ahead:
28. Let’s tackle this as a mechanic examines an engine, part by part:
29. Like a bird building a nest, every twig and feather matters:
30. Imagine crafting a sculpture, every detail brings it to life:
31. Like a student taking notes, we’ll highlight the key points:
32. It’s like sifting for gold, discarding the dirt to find the nuggets:
33. Let’s weave through this as a spider spins its intricate web:
34. As a farmer tills the soil, every effort nurtures the crop:
35. Like a director shoots a film, each scene tells a part of the story:
36. Think of it as a flashlight in the dark, illuminating step by step:
37. Like a butterfly metamorphosing, we’ll transition through stages of understanding:
38. Let’s construct our insights as an engineer designs machinery:
39. Imagine it as the flow of a river, every bend and ripple shapes the course:
40. It’s like a hawk soaring in the sky, observing details from a vantage point:
41. Like a detective following leads, every hint brings us closer to the truth:
42. As a tailor sews a garment, every stitch defines the shape:
43. Think of it as blending flavors in a dish, balancing to perfection:
44. Like a maestro leading an orchestra, every cue creates harmony:
45. Let’s mold our comprehension as a child shapes clay into forms:
46. Imagine this as a marathon; every mile, every step takes us closer to the finish:
47. It’s like mining for gems; we dig deep, assessing each discovery:
48. Let’s examine this as a doctor diagnoses symptoms to understand the ailment:
49. Think of it as a tree growing, each branch reaching out to new insights:
50. Like a beekeeper tends the hive, let’s understand each aspect of the colony:
51. As a carpenter joins wood, every angle and joint is crucial:
52. This is like creating a mosaic, each shard contributes to the whole:
53. Let’s distill our thoughts as a brewer crafts a fine beverage:
54. Imagine as if we’re setting a trap, each component is pivotal:
55. It’s like a locksmith crafting a key, every cut unlocking understanding:
56. Like a painter blending colors, let’s mix our ideas for clarity:
57. Let’s tread as an explorer maps uncharted territory, noting every landmark:
58. Think of this as a puzzle box, each mechanism revealing deeper layers:
59. Like an artisan molds pottery, our hands shape the outcome:
60. Let’s embark on this quest as a knight faces challenges, overcoming each obstacle:


61. Imagine we’re piecing together an ancient manuscript; each fragment reveals more:
62. This is like building a bridge. Each segment must connect perfectly. Let’s build:
63. Picture a gardener planting seeds. Each step, from soil preparation to watering, matters:
64. Think of this as weaving a tapestry. Every thread adds to the whole image:
65. Let’s explore this like an astronaut discovering a new planet, detail by detail:
66. Envision this as a mosaic. Each tile contributes to the final artwork:
67. Like navigating a ship through a storm, each decision is vital. Let’s chart the course:
68. It’s as if we’re restoring a masterpiece painting. Layer by layer, let’s uncover:
69. Imagine being an archaeologist, unearthing an artifact. Bit by bit, we reveal:
70. Think of this as a dance, each move flowing into the next. Let’s choreograph:
71. As a detective collecting clues for a case, let’s unravel the mystery:
72. Like a novelist crafting a story, let’s develop our narrative chapter by chapter:
73. Envision assembling a jigsaw puzzle, connecting pieces to see the whole picture:
74. Imagine sculpting from a block of marble; every chisel matters. Let’s sculpt:
75. As a chemist mixing solutions, each ingredient alters the result. Let’s mix:
76. This is like tuning an instrument. Every adjustment leads to harmony:
77. Like an actor rehearsing a script, let’s understand our lines and their meaning:
78. Think of it as navigating a map. Each route we explore gives more insight:
79. As if we’re crafting a potion, each herb and element has a purpose. Brew with me:
80. Let’s approach this like a mathematician proving a theorem, step by logical step:
81. Picture us on a safari, observing every species. Detail by detail, we document:
82. Like setting up dominoes for a cascade, each placement is crucial. Let’s set up:
83. Imagine you’re a jeweler, evaluating a gem. Every facet reflects light differently:
84. Like a historian deciphering an old text, let’s understand its context:
85. Think of this as a musical composition. Each note leads to the next movement:
86. It’s like setting a table for a grand feast. Every detail adds to the ambiance:
87. As a photographer captures moments, let’s focus on each element:
88. Imagine we’re tailoring a suit. Each stitch, cut, and measurement counts:
89. This is like assembling a watch; every gear and spring is vital. Let’s assemble:
90. Envision ourselves as geologists, studying layers of rock. Layer by layer, we analyze:
91. Like drafting an architectural blueprint, every line and measurement matters:
92. As if we’re brewing a perfect cup of tea, each ingredient and timing is crucial:
93. This is like a relay race. Each leg of the race builds upon the last:
94. Think of it as mixing colors for a painting. Every shade adds depth and nuance:
95. It’s like being a conductor, ensuring each instrument plays its part:
96. Imagine decoding an encrypted message. Every symbol has a meaning:
97. Like planning a trip, each destination and route makes the journey:
98. This is like lighting a sequence of candles. Each one illuminates more:
99. Envision it as a waterfall, each drop contributing to the flow:
100. Like a potter shaping clay, let’s mold our understanding step by step:


C.4 CODE, HARDWARE AND TRAINING TIME

Our code, as well as the offline datasets, will be released as openly accessible. During the review process, we provide source code in the supplementary material. We highlight that Prompt-OIRL can be reproduced within a few hours on a single laptop using only a CPU. With our implementation, conducting OIRL for GSM8K takes 50 minutes on a MacBook Air with an 8-core M2 chip, and only 5 minutes on a server using 16 (out of 64) cores of AMD 3995WX CPUs.

D DETAILED EXPERIMENT RESULTS

D.1 IMPROVING ARITHMETIC REASONING: HOW GOOD ARE THE OPTIMIZED PROMPTS?

We provide detailed performance comparisons on all datasets and LLMs used in the experiments. Averaging the performance across settings yields Figure 6. In most cases, Prompt-OIRL consistently achieves superior performance, especially when the number of training prompts is limited. Additionally, it is not surprising that in some cases, when the number of training prompts is sufficient, selecting the best prompt from training, rather than seeking further optimization over held-out prompts, can be the best strategy.
That said, we also observe some failure cases of Prompt-OIRL when weaker LLMs attempt the most challenging tasks. In those cases, the difficulties mainly result from the inability to accurately evaluate prompts; for more details, please refer to the following section.

Figure 19: Prompting Success Rate on GPT 3.5 Turbo

Figure 20: Prompting Success Rate on LLaMA-2-7B-Chat

Figure 21: Prompting Success Rate on TigerBot-13B-Chat


D.2 REWARD MODELING: ACCURACY AND PRECISION.

Accuracy We present the accuracy of different methods on different models and datasets in Figure 22 - Figure 24. The learned reward model consistently achieves higher prediction accuracy than using LLMs as critics. The prediction accuracy does not change much as a function of the number of training prompts.

Figure 22: Accuracy on GPT 3.5 Turbo

Figure 23: Accuracy on LLaMA-2-7B-Chat

Figure 24: Accuracy on TigerBot-13B-Chat


Precision We present the precision of different methods on different models and datasets in Figure 25 - Figure 27. In most cases, the learned reward model achieves significantly higher precision than using LLMs, meaning there is a higher probability that a prompt predicted to perform well actually yields a correct answer; this is the basis of our choice of precision as the metric. However, the reward model struggles when predicting the performance of LLMs with weaker arithmetic reasoning capabilities: when LLaMA-2-7B-Chat and TigerBot-13B-Chat face the most challenging GSM8K tasks, their answers to the queries are wrong in most cases, and achieving high precision becomes more difficult. In this work, we set hyper-parameters uniformly to enhance reproducibility and to demonstrate the superiority of the proposed method with minimal engineering optimization. In practice, this issue can potentially be alleviated further through, e.g., sub-sampling or re-balancing the training dataset, as sketched below.
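One possible instantiation of this re-balancing idea is sketched below: down-sampling the majority label (incorrect answers) before fitting the reward model. The function and its exact balancing choice are illustrative assumptions rather than part of the reported experimental protocol.

```python
# Illustrative sketch of re-balancing an imbalanced demonstration dataset by
# down-sampling the majority label before reward-model training. The exact
# balancing strategy is an assumption, not the protocol used in the paper.
import numpy as np

def rebalance(features, labels, seed=0):
    """Down-sample the majority label so both labels appear equally often."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pos, neg = np.where(labels == 1)[0], np.where(labels == 0)[0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    kept = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, kept])
    rng.shuffle(idx)
    return np.asarray(features)[idx], labels[idx]
```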

Figure 25: Precision on GPT 3.5 Turbo

Figure 26: Precision on LLaMA-2-7B-Chat

Figure 27: Precision on TigerBot-13B-Chat


E CASE STUDIES

E.1 OPTIMAL PROMPTING STRATEGY IS QUERY DEPENDENT

In this section, we provide case studies to demonstrate that expert-crafted prompting strategies are verified and optimized at a statistical level, rather than an instance-wise level. There exist many cases in which some of those prompts work better than others on one query but worse on another. We now present results on the GSM8K dataset with GPT-3.5-turbo (v.2023.05).

WHEN NO-PROMPT IS THE BEST. CASE 1


Question: Shiela bought five cell phones for $150 each for a 3-month installment. A 2%
interest will be charged for each unit. How much will Shiela pay each month for 3 months?
Answer: A $150 x 2/100 = $〈〈150*2/100=3〉〉3 interest will be charged for each unit. So,
each cell phone costs $150 + $3 = $〈〈150+3=153〉〉153. The total amount to be paid for 5 cell
phones is $153 x 5 = $〈〈153*5=765〉〉765. Thus, Shiela will pay $765/3 = $〈〈765/3=255〉〉255
each month.
#### 255〈—endoftext—〉

PROMPTED ANSWERS

No prompt: The total cost of the five cell phones is $150 x 5 = $〈〈150*5=750〉〉750. The
interest for each unit is $150 x 2/100 = $〈〈150*2/100=3〉〉3. So, the total interest for the five
units is $3 x 5 = $〈〈3*5=15〉〉15. Therefore, Shiela will pay $750 + $15 = $〈〈750+15=765〉〉765
for the five cell phones. Hence, Shiela will pay $765/3 = $〈〈765/3=255〉〉255 each month for
3 months. Answer: 255 .
CoT prompt: First, let’s calculate the total cost of the five cell phones. Since each phone
costs $150, the total cost would be 5 x $150 = $〈〈5*150=750〉〉750. Next, let’s calculate
the interest for each unit. The interest for each unit is 2% of $150, which is 0.02 x $150 =
$〈〈0.02*150=3〉〉3. Now, let’s calculate the total interest for all five units. The total interest
would be 5 x $3 = $〈〈5*3=15〉〉15. Finally, let’s calculate the total amount Shiela needs to pay
each month for 3 months. The total amount would be the cost of the phones plus the interest,
which is $750 + $15 = $〈〈750+15=765〉〉765. Therefore, Shiela needs to pay $765 each month
for 3 months.
APE prompt: First, let’s calculate the total cost of the five cell phones. Since each phone
costs $150, the total cost would be 5 x $150 = $〈〈5*150=750〉〉750. Next, let’s calculate
the interest for each unit. The interest for each unit is 2% of $150, which is 0.02 x $150 =
$〈〈0.02*150=3〉〉3. Now, let’s calculate the total interest for all five units. The total interest
would be 5 x $3 = $〈〈5*3=15〉〉15. Finally, let’s calculate the total amount Shiela needs to pay
each month for 3 months. The total amount would be the cost of the phones plus the interest,
which is $750 + $15 = $〈〈750+15=765〉〉765. Therefore, Shiela needs to pay $765 each month
for 3 months.
ToT prompt: Expert 1: To find out how much Shiela will pay each month for 3 months, we
need to calculate the total cost of the cell phones first. Since she bought five cell phones for
$150 each, the total cost would be $150 x 5 = $750.
Expert 2: Now, let’s calculate the interest charged for each unit. Since the interest rate is 2%,
we can calculate the interest for each unit as $150 x 0.02 = $3.
Expert 3: Finally, to find out how much Shiela will pay each month for 3 months, we need to
divide the total cost by the number of months. So, Shiela will pay $750 / 3 = $250 each month
for 3 months.


WHEN NO-PROMPT IS THE BEST. CASE 2


Question: Mary is an avid gardener. Yesterday, she received 18 new potted plants from her
favorite plant nursery. She already has 2 potted plants on each of the 40 window ledges of her
large country home. Feeling generous, she has decided that she will give 1 potted plant from
each ledge to friends and family tomorrow. How many potted plants will Mary remain with?
Answer: Yesterday, before receiving the plants, Mary had 2*40 = 〈〈2*40=80〉〉80 potted plants
After receiving an additional 18 plants, she therefore had a total of 80 + 18 = 〈〈80+18=98〉〉98
potted plants Tomorrow, Mary’s plant giveaway will be 40 *1 = 〈〈40*1=40〉〉40 potted plants.
She will therefore remain with 98 - 40 = 〈〈98-40=58〉〉58 potted plants.
#### 58〈—endoftext—〉

PROMPTED ANSWERS

No prompt: Mary currently has 40 window ledges with 2 potted plants each, which totals 40
* 2 = 〈〈40*2=80〉〉80 potted plants. If she received 18 new potted plants, her total number of
potted plants will be 80 + 18 = 〈〈80+18=98〉〉98 potted plants. If she gives away 1 potted plant
from each ledge, she will give away 40 * 1 = 〈〈40*1=40〉〉40 potted plants. Therefore, Mary
will remain with 98 - 40 = 〈〈98-40=58〉〉58 potted plants. Answer: 58 .
CoT prompt: Step 1: Calculate the total number of potted plants Mary currently has on her
window ledges. Mary has 2 potted plants on each of the 40 window ledges, so she has 2 * 40
= 〈〈2*40=80〉〉80 potted plants.
Step 2: Calculate the total number of potted plants Mary will give away. Mary plans to give
away 1 potted plant from each of the 40 window ledges, so she will give away 1 * 40 =
〈〈1*40=40〉〉40 potted plants.
Step 3: Calculate the number of potted plants Mary will remain with. Mary currently has
80 potted plants and plans to give away 40 potted plants, so she will remain with 80 - 40 =
〈〈80-40=40〉〉40 potted plants.
Therefore, Mary will remain with 40 potted plants.
APE prompt: Step 1: Calculate the total number of potted plants Mary currently has on her
window ledges. Mary has 2 potted plants on each of the 40 window ledges, so she has 2 * 40
= 〈〈2*40=80〉〉80 potted plants.
Step 2: Calculate the total number of potted plants Mary will give away. Mary plans to give
away 1 potted plant from each of the 40 window ledges, so she will give away 1 * 40 =
〈〈1*40=40〉〉40 potted plants.
Step 3: Calculate the number of potted plants Mary will remain with. Mary currently has
80 potted plants and plans to give away 40 potted plants, so she will remain with 80 - 40 =
〈〈80-40=40〉〉40 potted plants.
Therefore, Mary will remain with 40 potted plants.
ToT prompt: Expert 1: First, let’s calculate the total number of potted plants that Mary cur-
rently has. She has 2 potted plants on each of the 40 window ledges, so that would be 2 x 40 =
80 potted plants.
Expert 2: Next, let’s calculate the number of potted plants that Mary will give away to friends
and family. Since she has decided to give away 1 potted plant from each ledge, she will give
away a total of 40 potted plants.
Expert 3: Finally, let’s calculate the number of potted plants that Mary will remain with. She
initially had 80 potted plants and gave away 40, so she will remain with 80 - 40 = 40 potted
plants.


WHEN NO-PROMPT IS THE BEST. CASE 3


Question:
Debra is monitoring a beehive to see how many bees come and go in a day. She sees 30 bees
leave the hive in the first 6 hours, and then she sees 1/2 that many bees return in the next 6
hours. She sees two times as many bees as she saw first leave the hive fly from the hive and
leave in the next 6 hours. Then every bee that left before that hadn’t already returned returns
to the hive in the next 6 hours. How many bees did Debra see return to the hive in the last 6
hours of the day?
Answer: Debra first sees 30 bees leave the hive, then sees 1/2 that many return, 30 x 1/2 =
〈〈30*1/2=15〉〉15 bees that return and 15 bees that are still away from the hive. Debra sees
twice as many bees as first left the hive–which was 30–fly from the hive in the next 6 hours,
30 x 2 = 〈〈30*2=60〉〉60 bees that leave the hive. In total, 30 + 60 = 〈〈30+60=90〉〉90 bees
left the hive. There were 90 bees total that left the hive - 15 bees that already returned =
〈〈90-15=75〉〉75 bees that return to the hive in the last 6 hours of the day.
#### 75〈—endoftext—〉

PROMPTED ANSWERS

No prompt: Debra saw 30/2 = 〈〈30/2=15〉〉15 bees return to the hive in the next 6 hours. Then
she saw 2*30 = 〈〈2*30=60〉〉60 bees leave the hive in the next 6 hours. So she saw 60+30 =
〈〈60+30=90〉〉90 bees leave the hive in total. Therefore, she saw 90-15 = 〈〈90-15=75〉〉75 bees
return to the hive in the last 6 hours of the day. Answer: 75 .
CoT prompt: Step 1: Debra sees 30 bees leave the hive in the first 6 hours. Step 2: Debra sees
1/2 that many bees return in the next 6 hours, so she sees 30/2 = 〈〈30/2=15〉〉15 bees return.
Step 3: Debra sees two times as many bees as she saw leave the hive fly from the hive and
leave in the next 6 hours, so she sees 30*2 = 〈〈30*2=60〉〉60 bees leave. Step 4: Every bee that
left before that hadn’t already returned returns to the hive in the next 6 hours, so Debra sees
30+15 = 〈〈30+15=45〉〉45 bees return. Answer: 45 .
APE prompt: Step 1: Debra saw 30 bees leave the hive in the first 6 hours. Step 2: In the
next 6 hours, she saw 1/2 that many bees return, so she saw 30/2 = 〈〈30/2=15〉〉15 bees return.
Step 3: She saw two times as many bees as she saw leave the hive in the first 6 hours leave in
the next 6 hours, so she saw 2 * 30 = 〈〈2*30=60〉〉60 bees leave. Step 4: In the last 6 hours,
every bee that left before that hadn’t already returned returns to the hive. Since 30 bees left in
the first 6 hours and 60 bees left in the next 6 hours, a total of 30 + 60 = 〈〈30+60=90〉〉90 bees
left before the last 6 hours. Step 5: Therefore, Debra saw 90 bees return to the hive in the last
6 hours of the day.
ToT prompt: Expert 1: First, let’s calculate how many bees returned to the hive in the second
6-hour period. We know that Debra saw 30 bees leave the hive in the first 6 hours. So, in the
next 6 hours, she saw 1/2 of that number return. Therefore, she saw 30/2 = 15 bees return to
the hive in the second 6-hour period.
Expert 2: In the next 6-hour period, Debra saw two times as many bees leave the hive as she
saw in the first 6 hours. So, if she saw 30 bees leave in the first 6 hours, she saw 2 * 30 = 60
bees leave in the next 6 hours.
Expert 3: Now, let’s calculate how many bees returned to the hive in the last 6 hours of the
day. We know that every bee that left before that hadn’t already returned returns to the hive in
the next 6 hours. So, the total number of bees that returned in the last 6 hours is 30 + 15 + 60
= 105 bees.


WHEN COT IS THE BEST. CASE 1


Question: John runs 60 miles a week. He runs 3 days a week. He runs 3 hours the first day
and half as much the other two days he runs. How fast does he run?
Answer: He has 7-1=〈〈7-1=6〉〉6 possible running days. He runs half of those so he runs
6/2=〈〈6/2=3〉〉3 days. So for two days, he runs 3/2=〈〈3/2=1.5〉〉1.5 hours. So in total, he runs
3+1.5+1.5=〈〈3+1.5+1.5=6〉〉6 hours. So he runs 60/6=〈〈60/6=10〉〉10 mph.
#### 10〈—endoftext—〉

PROMPTED ANSWERS

No prompt: To find out how fast John runs, we need to divide the total distance he runs by the
total time he spends running.
First, let’s calculate the total distance John runs in a week. Since he runs 60 miles a week and
runs 3 days a week, he runs 60/3 = 〈〈60/3=20〉〉20 miles each day.
Next, let’s calculate the total time John spends running. On the first day, he runs for 3 hours,
and on the other two days, he runs for half as much, which is 3/2 = 〈〈3/2=1.5〉〉1.5 hours each
day. So the total time John spends running in a week is 3 + 1.5 + 1.5 = 〈〈3+1.5+1.5=6〉〉6
hours.
Finally, let’s calculate John’s speed by dividing the total distance he runs by the total time he
spends running: 20 miles / 6 hours = 〈〈20/6=3.33〉〉3.33 miles per hour.
Therefore, John runs at a speed of 3.33 miles per hour.
CoT prompt: Step 1: Calculate the total distance John runs in a week. John runs 60 miles a
week.
Step 2: Calculate the total number of hours John runs in a week. John runs 3 days a week. On
the first day, he runs 3 hours. On the other two days, he runs half as much, which is 3/2 = 1.5
hours. So, the total number of hours John runs in a week is 3 + 1.5 + 1.5 = 6 hours.
Step 3: Calculate John’s average speed. Average speed = Total distance / Total time Average
speed = 60 miles / 6 hours Average speed = 10 miles per hour
Therefore, John runs at an average speed of 10 miles per hour.
APE prompt: First, let’s find out how many miles John runs each day. Since he runs 60 miles
a week and runs 3 days a week, he runs 60/3 = 〈〈60/3=20〉〉20 miles each day. Next, let’s find
out how long John runs each day. We know that he runs 3 hours the first day and half as much
the other two days. So, the second and third days he runs 3/2 = 〈〈3/2=1.5〉〉1.5 hours each day.
Finally, let’s calculate John’s speed. We know that speed is equal to distance divided by time.
So, on the first day, his speed is 20 miles / 3 hours = 〈〈20/3=6.67〉〉6.67 miles per hour. On
the second and third days, his speed is 20 miles / 1.5 hours = 〈〈20/1.5=13.33〉〉13.33 miles per
hour. Therefore, John runs at a speed of 6.67 miles per hour on the first day and 13.33 miles
per hour on the second and third days.
ToT prompt: Expert 1: To find out how fast John runs, we need to calculate his average speed.
We know that he runs a total of 60 miles in a week, so we can divide that by the total number
of hours he runs to get his average speed.
Expert 2: John runs 3 days a week, with the first day being 3 hours and the other two days
being half as much. So, the total number of hours he runs in a week is 3 + (0.5 * 2) = 4 hours.
Expert 3: Now we can calculate John’s average speed by dividing the total distance he runs
(60 miles) by the total time he runs (4 hours). So, his average speed is 60 miles / 4 hours = 15
miles per hour.
Therefore, John runs at an average speed of 15 miles per hour.


WHEN COT IS THE BEST. CASE 2


Question: According to its nutritional info, a bag of chips has 250 calories per serving. If a 300g bag
has 5 servings, how many grams can you eat if your daily calorie target is 2000 and you have already
consumed 1800 calories?
Answer: If the total calorie target is 2000 and I have consumed 1800 calories then I have 2000-1800 =
〈〈2000-1800=200〉〉200 calories left to eat If each serving of chips has 250 calories and I only have 200
calories left to eat, then I can only eat 200/250 of a serving = 4/5 of a serving We also know that a 300g
bag of chips has 5 servings, hence each serving has 300g/5 = 〈〈300/5=60〉〉60 grams If I can only eat 4/5
of a serving, then I can eat only 60g * 4/5 = 240g/5 = 48 grams
#### 48〈—endoftext—〉

PROMPTED ANSWERS
No prompt: To find out how many grams of chips you can eat, you need to calculate the remaining
calories you can consume and then convert it to grams. First, calculate the remaining calories you can
consume: Daily calorie target - Consumed calories = Remaining calories 2000 calories - 1800 calories =
200 calories Next, calculate the remaining servings of chips you can eat: Remaining calories ÷ Calories
per serving = Remaining servings 200 calories ÷ 250 calories per serving = 0.8 servings Finally, calculate
the remaining grams of chips you can eat: Remaining servings × Grams per serving = Remaining grams
0.8 servings × 300 grams per serving = 240 grams
Therefore, you can eat 240 grams of chips if your daily calorie target is 2000 and you have already
consumed 1800 calories.
CoT prompt: Step 1: Calculate the total number of calories in the bag of chips. Since the bag has 5
servings and each serving has 250 calories, the total number of calories in the bag is 5 servings * 250
calories/serving = 1250 calories.
Step 2: Calculate the remaining number of calories you can consume. If your daily calorie target is 2000
and you have already consumed 1800 calories, then you have 2000 - 1800 = 200 calories remaining.
Step 3: Calculate the maximum number of grams you can eat from the bag of chips. To calculate the
maximum number of grams you can eat, you need to find out how many calories are in 1 gram of chips.
Divide the total number of calories in the bag (1250 calories) by the weight of the bag (300 grams) to
find the calories per gram: 1250 calories / 300 grams = 4.17 calories/gram.
Now, divide the remaining number of calories (200 calories) by the calories per gram (4.17 calories/gram)
to find the maximum number of grams you can eat: 200 calories / 4.17 calories/gram = 48 grams.
Therefore, you can eat a maximum of 48 grams of chips from the bag.
APE prompt: First, we need to calculate the total number of calories in the bag of chips. Since there are
5 servings in a 300g bag and each serving has 250 calories, the total number of calories in the bag is 5
servings * 250 calories/serving = 1250 calories.
Next, we need to calculate how many calories you have left to consume for the day. If your daily calorie
target is 2000 and you have already consumed 1800 calories, then you have 2000 - 1800 = 200 calories
left to consume.
To determine how many grams of chips you can eat, we need to find out how many servings of chips
equal 200 calories. Since each serving has 250 calories, we can calculate this by dividing 200 calories by
250 calories/serving: 200 calories / 250 calories/serving = 0.8 servings.
Finally, we can calculate the grams of chips you can eat by multiplying the number of servings by the
serving size. Since each serving is 300g, we can calculate this by multiplying 0.8 servings by 300g/serv-
ing: 0.8 servings * 300g/serving = 240g.
Therefore, you can eat 240 grams of chips if your daily calorie target is 2000 and you have already
consumed 1800 calories.
ToT prompt: Expert 1: Let’s start by calculating the total calories in the bag of chips. Since there are 5
servings in a 300g bag, and each serving has 250 calories, the total calories in the bag would be 5 servings
* 250 calories per serving = 1250 calories.


Expert 2: Now, let’s calculate the remaining calories you can consume based on your daily calorie target.
If your daily calorie target is 2000 and you have already consumed 1800 calories, then you have 2000 -
1800 = 200 calories left.
Expert 3: To determine how many grams of chips you can eat, we need to find out how many calories are
in 1 gram of chips. Since the bag has 1250 calories and weighs 300g, we can calculate the calories per
gram by dividing the total calories by the weight: 1250 calories / 300g = 4.17 calories per gram.
Expert 1: Finally, to find out how many grams of chips you can eat, we divide the remaining calories
(200) by the calories per gram (4.17): 200 calories / 4.17 calories per gram = 47.94 grams.
Expert 2: Therefore, you can eat approximately 47.94 grams of chips if your daily calorie target is 2000
and you have already consumed 1800 calories.

WHEN COT IS THE BEST. CASE 3


Question: Jean is two years older than Mark. Two years ago Mark was 5 years older than half Jan’s age.
If Jan is 30 how old is Jean?
Answer: 2 years ago Jan was 30-2=〈〈30-2=28〉〉28 years old So Marks age two years ago
was 5+28/2=〈〈5+28/2=19〉〉19 years old Mark is now 19+2=〈〈19+2=21〉〉21 years old So Jean is
21+2=〈〈21+2=23〉〉23 years old
#### 23〈—endoftext—〉

PROMPTED ANSWERS
No prompt: Jean is 32 years old.
CoT prompt: First, let’s find out how old Mark is. We know that Jean is two years older than Mark, so
if we let Mark’s age be x, then Jean’s age would be x + 2.
Two years ago, Mark was 5 years older than half of Jan’s age. Two years ago, Jan would have been 30 -
2 = 28 years old. Half of Jan’s age two years ago would be 28/2 = 14 years old.
So, two years ago, Mark was 5 years older than 14, which means Mark’s age two years ago was 14 + 5 =
19 years old.
If Mark’s age two years ago was 19, then his current age would be 19 + 2 = 21 years old.
Since Jean is two years older than Mark, Jean’s age would be 21 + 2 = 23 years old.
Therefore, Jean is 23 years old.
APE prompt: If Jan is 30 years old, then two years ago she would have been 28 years old. If Mark was
5 years older than half of Jan’s age two years ago, then half of Jan’s age two years ago would have been
23 years old. If Mark was 5 years older than half of Jan’s age two years ago, then Mark would have been
28 + 5 = 33 years old two years ago. If Jean is two years older than Mark, then Jean would have been 33
+ 2 = 35 years old two years ago. If Jean was 35 years old two years ago, then Jean is currently 35 + 2 =
37 years old.
ToT prompt: Expert 1: Let’s start by finding Mark’s age. If two years ago Mark was 5 years older than
half of Jan’s age, then we can subtract 5 from Mark’s age two years ago to find half of Jan’s age two years
ago.
Expert 2: If Mark was 5 years older than half of Jan’s age two years ago, then half of Jan’s age two years
ago would be Mark’s age minus 5.
Expert 3: If half of Jan’s age two years ago was Mark’s age minus 5, then we can add 5 to half of Jan’s
age two years ago to find Mark’s age two years ago.
Expert 1: Now that we know Mark’s age two years ago, we can add 2 to find his current age.
Expert 2: Since Jean is two years older than Mark, we can add 2 to Mark’s current age to find Jean’s age.
Expert 3: Therefore, if Jan is 30 years old, Jean would be 32 years old.
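For reference, the golden answer's arithmetic for this question can be reproduced in a few lines of Python (the variable names are ours):
# Golden-answer arithmetic: Jan's age two years ago, then Mark's, then Jean's.
jan_two_years_ago = 30 - 2                        # 28
mark_two_years_ago = 5 + jan_two_years_ago / 2    # 5 + 14 = 19
mark_now = mark_two_years_ago + 2                 # 21
jean_now = mark_now + 2                           # 23
assert jean_now == 23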


WHEN APE IS THE BEST. CASE 1


Question:
Blake and Kelly are having a contest to see who can run the most in 15 minutes. They decide to do it
on a football field that is 100 yards long. Blake runs back and forth 15 times. Kelly runs back and forth
once, and then decides that she doesn’t want to run next to Blake, so she starts to run to the 40-yard line
and back. She does this 34 times. How much farther does the winner run than the loser?
Answer: Blake runs 200 yards for each back and forth because 100 x 2 = 200 He runs 3000 yards
in total because 200 x 15 = 〈〈200*15=3000〉〉3000 Kelly runs 200 yards at first because 100 x 2 =
〈〈100*2=200〉〉200 She runs 80 yards each time she does a back and forth to the 40-yard line because 40
x 2 = 〈〈40*2=80〉〉80 She runs 2720 from these back and forth because 34 x 80 = 〈〈34*80=2720〉〉2720
She runs 2920 in total because 2720 + 200 = 〈〈2720+200=2920〉〉2920 Blake won because 3000 > 2920
The winner ran 80 yards more than the loser because 3000 - 2920 = 〈〈3000-2920=80〉〉80
#### 80〈—endoftext—〉

PROMPTED ANSWERS
No prompt: To find out how far Blake runs, we need to multiply the number of times he runs back and
forth by the length of the football field: 15 times * 100 yards/time = 〈〈15*100=1500〉〉1500 yards To find
out how far Kelly runs, we need to multiply the number of times she runs back and forth by the length
of the football field: 34 times * 40 yards/time = 〈〈34*40=1360〉〉1360 yards The difference in distance is
1500 yards - 1360 yards = 〈〈1500-1360=140〉〉140 yards. Answer: 140 .
CoT prompt: Step 1: Calculate the total distance Blake runs. Blake runs back and forth 15 times, so he
runs a total of 15 * 100 = 〈〈15*100=1500〉〉1500 yards.
Step 2: Calculate the total distance Kelly runs. Kelly runs back and forth once, so she runs a total of
1 * 100 = 〈〈1*100=100〉〉100 yards. Then she runs to the 40-yard line and back 34 times, so she runs
an additional 34 * 80 = 〈〈34*80=2720〉〉2720 yards. Therefore, Kelly runs a total of 100 + 2720 =
〈〈100+2720=2820〉〉2820 yards.
Step 3: Calculate the difference in distance between the winner and the loser. The winner runs 2820 -
1500 = 〈〈2820-1500=1320〉〉1320 yards farther than the loser.
APE prompt: First, let’s calculate how far Blake runs in total. Since he runs back and forth 15 times,
and the football field is 100 yards long, Blake runs a total of 15 * 2 * 100 = 3000 yards.
Next, let’s calculate how far Kelly runs in total. She runs back and forth once, which is a total of 2 * 100
= 200 yards. Then she runs to the 40-yard line and back 34 times, which is a total of 34 * 2 * 40 = 2720
yards. So Kelly runs a total of 200 + 2720 = 2920 yards.
Finally, let’s calculate the difference in distance between the winner and the loser. The winner, in this
case, is Blake, who runs 3000 yards. The loser is Kelly, who runs 2920 yards. Therefore, the difference
in distance is 3000 - 2920 = 80 yards.
So the winner runs 80 yards farther than the loser.
ToT prompt: Expert 1: To find out how much farther the winner runs than the loser, we need to calculate
the total distance each person runs. Let’s start by calculating the distance Blake runs.
Expert 2: Blake runs back and forth 15 times on a 100-yard football field. So, the total distance Blake
runs is 15 times the length of the field, which is 15 * 100 = 1500 yards.
Expert 3: Now let’s calculate the distance Kelly runs. Kelly runs back and forth once on the football
field, which is a total distance of 2 * 100 = 200 yards. Then she runs to the 40-yard line and back 34
times, which is a total distance of 34 * 2 * 40 = 2720 yards.
Expert 1: So, the total distance Kelly runs is 200 + 2720 = 2920 yards.
Expert 2: To find out how much farther the winner runs than the loser, we subtract the total distance Blake
runs from the total distance Kelly runs. Therefore, the winner runs 2920 - 1500 = 1420 yards farther than
the loser.
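A quick Python check of the totals in the golden answer (illustrative variable names):
# Golden-answer totals for the running contest.
blake = 15 * (2 * 100)            # 15 back-and-forths on a 100-yard field = 3000 yards
kelly = 2 * 100 + 34 * (2 * 40)   # 200 yards + 2720 yards = 2920 yards
assert blake - kelly == 80        # Blake wins by 80 yards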


WHEN APE IS THE BEST. CASE 2


Question: James is counting his Pokemon cards. He has 30 fire type, 20 grass type, and 40 water type.
If he loses 8 of the water type and buys 14 grass type, what’s the percentage chance (rounded to the
nearest integer) that a randomly picked card will be a water type?
Answer: First find the final number of grass type Pokemon: 20 grass type + 14 grass type =
〈〈20+14=34〉〉34 grass type Then find the final number of water type Pokemon: 40 water type - 8 water
type = 〈〈40-8=32〉〉32 water type Then add the number of each type of Pokemon to find the total number:
34 Pokemon + 32 Pokemon + 30 Pokemon = 〈〈34+32+30=96〉〉96 Pokemon Then divide the number of
water type Pokemon by the total number of Pokemon and multiply by 100% to express the answer as a
percentage: 32 water type / 96 Pokemon * 100% = 33.333...%, which rounds down to 33%
#### 33〈—endoftext—〉

PROMPTED ANSWERS
No prompt: The total number of water type cards after losing 8 and buying 14 is 40 - 8 + 14 = 46. The
total number of cards is 30 + 46 + 20 = 96. The percentage chance of picking a water type card is (46/96)
* 100 = 47.92%. Rounded to the nearest integer, the percentage chance is 48%.
CoT prompt: First, let’s calculate the total number of water type cards James has after losing 8 and
buying 14. He initially had 40 water type cards. After losing 8, he will have 40 - 8 = 32 water type cards.
After buying 14 grass type cards, the number of water type cards remains the same at 32. Now, let’s
calculate the total number of cards James has after losing 8 water type cards and buying 14 grass type
cards. He initially had 30 fire type cards, 20 grass type cards, and 40 water type cards. After losing 8
water type cards, he will have 30 + 20 + 32 = 82 cards. The percentage chance of picking a water type
card is (32 / 82) * 100 = 39.02%.
Rounded to the nearest integer, the percentage chance is 39%.
APE prompt: First, let’s calculate the total number of water type cards James has after losing 8 and
buying 14. He initially had 40 water type cards, so after losing 8, he has 40 - 8 = 32 water type cards.
After buying 14 grass type cards, the total number of water type cards remains the same at 32.
Next, let’s calculate the total number of cards James has after losing 8 water type cards and buying 14
grass type cards. He initially had 30 fire type cards, 20 grass type cards, and 40 water type cards, so the
total number of cards is 30 + 20 + 40 = 90 cards. After losing 8 water type cards and buying 14 grass
type cards, the total number of cards becomes 90 - 8 + 14 = 96 cards.
Finally, let’s calculate the percentage chance of picking a water type card. The percentage chance is
calculated by dividing the number of water type cards by the total number of cards and multiplying
by 100. So, the percentage chance is (32 / 96) * 100 = 33.33%. Rounded to the nearest integer, the
percentage chance is 33%.
Therefore, the percentage chance that a randomly picked card will be a water type is 33%.
ToT prompt: Expert 1: To find the percentage chance of picking a water type card, we need to calculate
the total number of water type cards after James loses 8 and buys 14 grass type cards.
Expert 2: James initially has 40 water type cards. After losing 8 and buying 14 grass type cards, he will
have 40 - 8 + 14 = 46 water type cards.
Expert 3: To find the percentage chance, we need to divide the number of water type cards by the total
number of cards and multiply by 100. The total number of cards is 30 + 46 + 20 = 96. So the percentage
chance of picking a water type card is (46 / 96) * 100 = 47.92%.
Final answer: The percentage chance (rounded to the nearest integer) of picking a water type card is 48%.
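The golden answer's counting for this question, as a short sanity check (names are ours):
# Golden-answer counts after the card changes.
grass = 20 + 14                            # 34 grass type
water = 40 - 8                             # 32 water type
total = 30 + grass + water                 # 96 cards in total
assert round(water / total * 100) == 33    # roughly a 33% chance of a water type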


WHEN APE IS THE BEST. CASE 3


Question: Steve decides to start eating more tomatoes and decides to grows his own cherry
tomatoes. He eats twice as much as his girlfriend. He eats 6 per day. If a vine can produce 3
tomatoes per week how many vines does he need?
Answer: His girlfriend eats 6/2=〈〈6/2=3〉〉3 tomatoes per day So together they eat
6+3=〈〈6+3=9〉〉9 tomatoes a day That means they eat 9*7=〈〈9*7=63〉〉63 tomatoes a day So
he needs 63/3=〈〈63/3=21〉〉21 tomato vines
#### 21〈—endoftext—〉

PROMPTED ANSWERS

No prompt: To determine how many vines Steve needs, we need to calculate the number of
tomatoes he eats in a week. Since he eats 6 tomatoes per day, he eats 6 * 7 = 〈〈6*7=42〉〉42
tomatoes in a week. Since each vine can produce 3 tomatoes per week, Steve needs 42 / 3 =
〈〈42/3=14〉〉14 vines.
CoT prompt: 1. Steve eats 6 cherry tomatoes per day. 2. His girlfriend eats half as much
as Steve, so she eats 6/2 = 3 cherry tomatoes per day. 3. Together, they eat 6 + 3 = 9 cherry
tomatoes per day. 4. A vine can produce 3 cherry tomatoes per week. 5. To meet their daily
consumption, Steve needs 9 cherry tomatoes per day. 6. Therefore, he needs 9/3 = 3 vines to
meet their daily consumption.
APE prompt: First, we need to determine how many tomatoes Steve’s girlfriend eats per day.
Since Steve eats twice as much as his girlfriend, we can divide his daily consumption by 2: 6
tomatoes / 2 = 3 tomatoes per day.
Next, we need to determine how many tomatoes Steve and his girlfriend eat together per day:
6 tomatoes + 3 tomatoes = 9 tomatoes per day.
Now, we need to determine how many tomatoes Steve and his girlfriend eat together per week:
9 tomatoes per day * 7 days = 63 tomatoes per week.
Finally, we need to determine how many vines Steve needs to grow in order to produce 63
tomatoes per week. Since each vine produces 3 tomatoes per week, we can divide the total
number of tomatoes needed by the number of tomatoes produced per vine: 63 tomatoes / 3
tomatoes per vine = 21 vines.
Therefore, Steve needs to grow 21 vines in order to produce enough cherry tomatoes for his
increased consumption.
ToT prompt: Expert 1: To determine how many vines Steve needs, we first need to find out
how many tomatoes he eats in a week. Since he eats 6 tomatoes per day, he eats a total of 6 *
7 = 42 tomatoes in a week.
Expert 2: Now, let’s calculate how many tomatoes a single vine can produce in a week. If
a vine produces 3 tomatoes per week, then Steve would need 42 / 3 = 14 vines to meet his
weekly tomato consumption.
Expert 3: However, since Steve wants to grow cherry tomatoes, which are smaller than reg-
ular tomatoes, he might need more vines to produce the same amount. Let’s assume that
cherry tomatoes are half the size of regular tomatoes. In that case, Steve would need double
the number of vines, so he would need 14 * 2 = 28 vines to meet his weekly cherry tomato
consumption.
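The golden answer's weekly accounting, reproduced as a brief check (variable names are ours):
# Golden-answer accounting: daily consumption, weekly consumption, vines needed.
per_day = 6 + 6 / 2        # Steve eats 6, his girlfriend eats 3
per_week = per_day * 7     # 63 tomatoes per week
assert per_week / 3 == 21  # 21 vines at 3 tomatoes per vine per week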


WHEN TOT IS THE BEST. CASE 1


Question:
Candice put 80 post-it notes in her purse before she headed out to her job at the coffee shop. On her way,
she stopped off at the store and purchased a package of Post-it notes; At work, she placed a single Post-it
note on each of 220 different cups of coffee. If she had 23 post-it notes remaining overall, how many
Post-it notes were in the package that she purchased?
Answer: Let’s let x be the number of post-it notes in the package that Candice purchased. Then, the
total number of post-it notes she had when she left the store was 80+x. And if she used 220 Post-it notes
at work and had 23 left, then 80+x=220+23 Subtracting 80 from each side of the equation, we see that
x=220+23-80=〈〈220+23-80=163〉〉163 Post-it notes.
#### 163〈—endoftext—〉

PROMPTED ANSWERS
No prompt: Let’s break down the information given: Candice started with 80 post-it notes in her purse.
She placed a single post-it note on each of 220 cups of coffee. She had 23 post-it notes remaining overall.
To find out how many post-it notes were in the package she purchased, we need to subtract the number
of post-it notes she had remaining from the total number of post-it notes she started with and placed on
the cups of coffee. Total post-it notes = Starting post-it notes + Post-it notes placed on cups of coffee -
Post-it notes remaining Total post-it notes = 80 + 220 - 23 Total post-it notes = 300
Therefore, there were 300 post-it notes in the package that Candice purchased.
CoT prompt: Step 1: Calculate the total number of post-it notes Candice had before purchasing the
package. Candice had 80 post-it notes in her purse.
Step 2: Calculate the total number of post-it notes Candice used at work. Candice placed a single post-it
note on each of 220 different cups of coffee.
Step 3: Calculate the total number of post-it notes Candice had remaining after work. Candice had 23
post-it notes remaining overall.
Step 4: Calculate the total number of post-it notes Candice had before purchasing the package. Total
post-it notes before purchasing = post-it notes in purse + post-it notes remaining after work Total post-it
notes before purchasing = 80 + 23
Step 5: Calculate the number of post-it notes in the package that Candice purchased. Number of post-it
notes in the package = Total post-it notes before purchasing - post-it notes used at work Number of post-it
notes in the package = (80 + 23) - 220
Step 6: Calculate the final answer. Number of post-it notes in the package = 103 - 220 Number of post-it
notes in the package = -117
Therefore, Candice did not purchase any post-it notes.
APE prompt: First, we need to determine how many post-it notes Candice had before she purchased the
package. We know she started with 80 post-it notes and had 23 remaining, so she used 80 - 23 = 〈〈80-
23=57〉〉57 post-it notes at work. Next, we need to determine how many post-it notes were in the package
she purchased. We can do this by subtracting the number of post-it notes she had before purchasing the
package from the total number of post-it notes she used at work: 57 - 80 = 〈〈57-80=-23〉〉-23. Since we
can’t have a negative number of post-it notes, there must have been an error in our calculations. Let’s
try again. If Candice had 80 post-it notes before purchasing the package and used 57 at work, then she
must have had 80 - 57 = 〈〈80-57=23〉〉23 post-it notes remaining. Since we know she had 23 post-it notes
remaining overall, this means she used all the post-it notes she purchased at work. Therefore, the package
she purchased must have contained 220 - 23 = 〈〈220-23=197〉〉197 post-it notes.
ToT prompt: Expert 1: Let’s start by figuring out how many post-it notes Candice had before she
purchased the package. She initially had 80 post-it notes in her purse.
Expert 2: Next, we need to subtract the number of post-it notes she used at work. She placed a single
post-it note on each of the 220 cups of coffee. So, she used 220 post-it notes at work.


Expert 3: Now, we can calculate the total number of post-it notes Candice had remaining. We subtract
the number of post-it notes used at work (220) from the initial number of post-it notes (80). This gives
us 80 - 220 = -140.
Expert 1: Since Candice had 23 post-it notes remaining overall, we can set up an equation to find the
number of post-it notes in the package she purchased. Let’s call the number of post-it notes in the package
”x”. We can write the equation as -140 + x = 23.
Expert 2: To solve for x, we need to isolate it on one side of the equation. We can do this by adding 140
to both sides of the equation. This gives us x = 23 + 140 = 163.
Expert 3: Therefore, the number of post-it notes in the package that Candice purchased is 163.
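The golden answer's equation for this question, as a one-line check:
# Golden-answer equation: 80 + x = 220 + 23, solved for the package size x.
x = 220 + 23 - 80
assert x == 163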

WHEN TOT IS THE BEST. CASE 2


Question: Christina is planning a birthday party and needs .75 gift bags per invited guest, because 1/4 of
attendees don’t show up. She invited 16 friends. Gift bags are $2 each. How much will she spend?
Answer: She needs 12 gift bags because 16 x .75 = 〈〈16*.75=12〉〉12 She will spend $24 because 12 x 2
= 〈〈12*2=24〉〉24
#### 24〈—endoftext—〉

PROMPTED ANSWERS
No prompt: Christina invited 16 friends, but 1/4 of them won’t show up, so 16 * 1/4 = 〈〈16*1/4=4〉〉4
friends won’t attend. So the number of guests who will attend is 16 - 4 = 〈〈16-4=12〉〉12 guests. She
needs .75 gift bags per guest, so she will need 12 * .75 = 〈〈12*.75=9〉〉9 gift bags. Each gift bag costs
$2, so she will spend 9 * $2 = $〈〈9*2=18〉〉18. Answer: 18 .
CoT prompt: Step 1: Calculate the number of guests who will actually attend the party. Since 1/4 of the
invited guests don’t show up, we need to subtract 1/4 of the total number of guests from the total number
of guests invited. 1/4 of 16 guests is (1/4) * 16 = 4 guests. So, the number of guests who will actually
attend the party is 16 - 4 = 12 guests.
Step 2: Calculate the number of gift bags needed. Since Christina needs 0.75 gift bags per guest, we can
multiply the number of guests by 0.75 to find the total number of gift bags needed. 0.75 * 12 guests = 9
gift bags.
Step 3: Calculate the total cost of the gift bags. Since each gift bag costs $2, we can multiply the number
of gift bags by the cost per gift bag to find the total cost. 9 gift bags * $2/gift bag = $18.
Therefore, Christina will spend $18 on gift bags for the birthday party.
APE prompt: First, we need to calculate how many guests will actually show up. We know that 1/4 of
the invited guests won’t attend, so we can calculate this by multiplying the number of invited guests by
3/4: 16 guests * 3/4 = 12 guests.
Next, we need to calculate how many gift bags Christina will need. We know that she needs 0.75 gift
bags per guest, so we can calculate this by multiplying the number of guests by 0.75: 12 guests * 0.75
gift bags = 9 gift bags.
Finally, we can calculate the total cost of the gift bags. We know that each gift bag costs $2, so we can
calculate this by multiplying the number of gift bags by the cost per gift bag: 9 gift bags * $2/gift bag =
$18.
Therefore, Christina will spend $18 on gift bags for her birthday party.
ToT prompt: Expert 1: To find out how many gift bags Christina needs, I’ll multiply the number of
invited guests by 0.75 since 1/4 of attendees won’t show up. Expert 2: So, if Christina invited 16 friends,
we’ll multiply 16 by 0.75 to get the number of gift bags needed. Expert 3: That means Christina needs
12 gift bags. To find out how much she will spend, I’ll multiply the number of gift bags by the cost per
bag, which is $2. Expert 1: So, Christina will spend 12 gift bags multiplied by $2 per bag, which equals
$24 in total.
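A minimal check of the golden answer's arithmetic (names are ours):
# Golden-answer arithmetic: bags needed, then total cost.
bags = 16 * 0.75          # 12 gift bags
assert bags * 2 == 24     # $24 at $2 per bag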


WHEN TOT IS THE BEST. CASE 3


Question:
Julia was preparing for a dinner party at her house, where she intended to serve stew. She noticed that
she was out of plastic spoons, so she bought a new package of spoons. Later, her husband also bought a
package of 5 new spoons and gave them to Julia. While Julia was making the stew, she used three of the
spoons to sample her stew. Later, when she went to set the table, she had a total of 12 spoons. How many
spoons were in the package that Julia bought?
Answer: The total number of spoons from Julia and her husband was 12+3=〈〈12+3=15〉〉15 spoons.
Since the husband bought a package of five spoons, then Julia’s package contained 15-5=〈〈15-5=10〉〉10
spoons.
#### 10〈—endoftext—〉

PROMPTED ANSWERS
No prompt: To find the number of spoons in the package that Julia bought, we need to subtract the
spoons that Julia used and the spoons that her husband bought from the total number of spoons on the
table.
Julia used 3 spoons to sample her stew. Her husband bought 5 spoons. So, the total number of spoons
used is 3 + 5 = 8.
If there were 12 spoons on the table, and 8 of them were used, then the number of spoons in the package
that Julia bought is 12 - 8 = 4.
CoT prompt: 1. Julia bought a new package of spoons. 2. Her husband also bought a package of 5 new
spoons and gave them to Julia. 3. Julia used three of the spoons to sample her stew. 4. When she went to
set the table, she had a total of 12 spoons.
To find out how many spoons were in the package that Julia bought, we need to subtract the spoons used
for sampling and the spoons given by her husband from the total number of spoons on the table.
Let’s denote the number of spoons in the package Julia bought as ”x”.
So, the equation becomes: x - 3 - 5 = 12
Simplifying the equation: x - 8 = 12
Adding 8 to both sides of the equation: x = 20
Therefore, there were 20 spoons in the package that Julia bought.
APE prompt: 1. Julia bought a new package of spoons. 2. Her husband also bought a package of 5 new
spoons and gave them to Julia. 3. Julia used 3 spoons to sample her stew. 4. Julia had a total of 12 spoons
when she went to set the table.
To find out how many spoons were in the package that Julia bought, we can subtract the spoons used for
sampling and the spoons given by her husband from the total number of spoons she had.
Total spoons - spoons used for sampling - spoons given by husband = spoons in the package Julia bought
12 - 3 - 5 = spoons in the package Julia bought
Therefore, there were 4 spoons in the package that Julia bought.
ToT prompt: Expert 1: Let’s start by setting up an equation to represent the given information. Let’s
say the number of spoons Julia bought is x. Her husband bought 5 spoons, so the total number of spoons
they have is x + 5. Julia used 3 spoons to sample the stew, so the number of spoons left is (x + 5) - 3.
Expert 2: We know that Julia had a total of 12 spoons left after setting the table. So we can set up the
equation (x + 5) - 3 = 12.
Expert 3: Simplifying the equation, we have x + 2 = 12. To isolate x, we subtract 2 from both sides of
the equation. This gives us x = 10.
Expert 1: Therefore, Julia bought a package of 10 spoons.
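The golden answer's bookkeeping, as a short check (variable names are ours):
# Golden-answer bookkeeping: spoons on the table plus spoons used, minus the husband's package.
total_from_both_packages = 12 + 3          # 15 spoons came from the two packages
assert total_from_both_packages - 5 == 10  # Julia's package held 10 spoons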


E.2 EXPERT-CRAFTED PROMPTS CAN ALL FAIL

In this section, we show cases in which all expert-crafted prompts fail. In some of these cases, prompt optimization with Prompt-OIRL discovers prompting strategies that elicit the correct response. The examples below are from the GSM8K dataset and use GPT-3.5-turbo (v.2023.05) as the LLM.
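For concreteness, the selection step that produces the "Prompt-OIRL discovered" prompts below can be sketched as query-dependent best-of-N selection with the offline-learned reward model. The snippet is an illustrative sketch only; the function and variable names (reward_model, embed, candidate_prompts) are ours and do not denote the exact implementation.
# Query-dependent best-of-N prompt selection with an offline-learned reward model (sketch).
# reward_model maps a (query, prompt) representation to a predicted success score;
# candidate_prompts is a pool of zero-shot prompts; embed builds the representation.
def select_prompt(query, candidate_prompts, reward_model, embed):
    # Score every candidate prompt for this specific query -- no LLM call is needed here.
    scores = [reward_model(embed(query, prompt)) for prompt in candidate_prompts]
    # Recommend the prompt with the highest predicted reward.
    best_index = max(range(len(candidate_prompts)), key=lambda i: scores[i])
    return candidate_prompts[best_index]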

PROMPT-OIRL DISCOVERS BETTER PROMPTS. CASE 1


Question: Jamal’s phone can hold 6 times more photographs than can Brittany’s phone. The maximum
number of photographs that Brittany’s phone can hold is 50 times more than the number of birds in
Jamal’s photograph of the ducks at the zoo. If Jamal’s phone can hold 1800 photographs, how many
ducks can be seen in Jamal’s photograph of ducks at the zoo?
Answer: Since Jamal’s phone can hold 1800 photos, which is 6 times more than Brittany’s phone can
hold, Brittany’s phone can hold 1800/6=〈〈1800/6=300〉〉300 photographs. Since Brittany’s phone can
hold 50 times more photos than the number of birds in Jamal’s photograph, then 300/50=6 ducks can be
seen in Jamal’s photograph of ducks at the zoo.
#### 6〈—endoftext—〉

EXPERT-CRAFTED PROMPTS AND ANSWERS


No prompt: To find the number of ducks in Jamal’s photograph, we need to first find the maximum
number of photographs that Brittany’s phone can hold. We know that Jamal’s phone can hold 1800
photographs, so Brittany’s phone can hold 1800 * 6 = 〈〈1800*6=10800〉〉10800 photographs. We also
know that the maximum number of photographs that Brittany’s phone can hold is 50 times more than
the number of birds in Jamal’s photograph of the ducks at the zoo. So, the number of birds in Jamal’s
photograph is 10800 / 50 = 〈〈10800/50=216〉〉216. Answer: 216 .
CoT prompt: Step 1: Let’s find out how many photographs Brittany’s phone can hold. Since Jamal’s
phone can hold 1800 photographs, and Brittany’s phone can hold 6 times more photographs, we can
calculate that Brittany’s phone can hold 1800 * 6 = 〈〈1800*6=10800〉〉10800 photographs.
Step 2: Let’s find out how many birds are in Jamal’s photograph of the ducks at the zoo. Since Brittany’s
phone can hold 50 times more photographs than the number of birds in Jamal’s photograph, we can
calculate that the number of birds in Jamal’s photograph is 10800 / 50 = 〈〈10800/50=216〉〉216.
Therefore, there are 216 ducks in Jamal’s photograph of ducks at the zoo.
APE prompt: First, let’s find out how many photographs Brittany’s phone can hold. We know that
Jamal’s phone can hold 1800 photographs, so Brittany’s phone can hold 6 times more photographs,
which is 1800 * 6 = 〈〈1800*6=10800〉〉10800 photographs.
Next, we are told that the maximum number of photographs Brittany’s phone can hold is 50 times more
than the number of birds in Jamal’s photograph of the ducks at the zoo. Let’s call the number of birds in
Jamal’s photograph ”x”. This means that 10800 = 50x.
To find the value of x, we can divide both sides of the equation by 50: x = 10800 / 50 =
〈〈10800/50=216〉〉216.
Therefore, there are 216 ducks in Jamal’s photograph of the ducks at the zoo.
ToT prompt: Expert 1: Let’s start by finding out how many photographs Brittany’s phone can hold. We
know that Jamal’s phone can hold 1800 photographs, so Brittany’s phone can hold 6 times more, which
means it can hold 1800 * 6 = 10,800 photographs.
Expert 2: Now, we need to figure out the number of birds in Jamal’s photograph of the ducks at the zoo.
We are told that the maximum number of photographs Brittany’s phone can hold is 50 times more than
the number of birds in Jamal’s photograph. So, the number of birds in Jamal’s photograph is 10,800 / 50
= 216.
Expert 3: Therefore, there are 216 ducks in Jamal’s photograph of the ducks at the zoo.


PROMPT-OIRL DISCOVERED SUCCESSFUL PROMPTS AND ANSWERS


Prompt-OIRL Discovered Prompt 1: Let’s break this down into manageable steps to fully understand
the problem:
Prompted Answer: 1. Jamal’s phone can hold 6 times more photographs than Brittany’s phone. 2. The
maximum number of photographs that Brittany’s phone can hold is 50 times more than the number of
birds in Jamal’s photograph of the ducks at the zoo. 3. Jamal’s phone can hold 1800 photographs. 4. We
need to find the number of ducks in Jamal’s photograph.
To find the number of ducks in Jamal’s photograph, we need to find the number of birds in the photograph
and then divide it by 50.
Let’s assume the number of birds in Jamal’s photograph is x.
According to the problem, Brittany’s phone can hold 50 times more photographs than the number of birds
in Jamal’s photograph. So, Brittany’s phone can hold 50x photographs.
Since Jamal’s phone can hold 6 times more photographs than Brittany’s phone, we can set up the equation:
1800 = 6 * 50x
Simplifying the equation:
1800 = 300x
Dividing both sides by 300:
x=6
Therefore, there are 6 ducks in Jamal’s photograph.
Prompt-OIRL Discovered Prompt 2: Three wise philosophers are debating this question, each con-
tributing a different aspect of the answer. Let’s follow their discourse:
Prompted Answer: Philosopher 1: Let’s call the number of birds in Jamal’s photograph of ducks at the
zoo ”x”. According to the information given, Brittany’s phone can hold 50 times more photographs than
the number of birds in Jamal’s photograph. So, the maximum number of photographs Brittany’s phone
can hold is 50x.
Philosopher 2: We also know that Jamal’s phone can hold 6 times more photographs than Brittany’s
phone. So, if Brittany’s phone can hold 50x photographs, then Jamal’s phone can hold 6 times that
amount, which is 6 * 50x = 300x.
Philosopher 3: We are given that Jamal’s phone can hold 1800 photographs. So, we can set up the
equation 300x = 1800 to find the value of x.
Solving the equation, we find that x = 1800 / 300 = 6.
Therefore, there are 6 ducks in Jamal’s photograph of ducks at the zoo.
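As a compact check, the relations used in the golden answer can be reproduced as follows (variable names are ours):
# Golden-answer relations: Jamal's capacity = 6 x Brittany's; Brittany's capacity = 50 x the number of ducks.
brittany_capacity = 1800 / 6       # 300 photographs
ducks = brittany_capacity / 50     # 6 ducks
assert ducks == 6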

PROMPT-OIRL DISCOVERS BETTER PROMPTS. CASE 2


Question: Adrien’s total salary was 30 percent higher than Lylah’s. Four years later, his salary had
increased, and he was earning 40% more than what he was making four years ago. If Adrien’s and
Lylah’s salary increased simultaneously, and Adrien earned $40000 four years ago, calculate the total
salary the two were receiving four years later?
Answer: Since Adrien was earning $40000 four years ago and received a raise that makes him earn
40% more, he received a 40/100*$40000 = $〈〈40/100*40000=16000〉〉16000 raise. In total, four years
later, Adrien’s salary is $40000+$16000 = $56000 If four years ago Adrien was earning $40000, and
Lylah’s salary was 30% less, then Lylah’s salary was 30/100*$40000= $12000 less than Adrien’s salary
four years ago. Four years ago, Lylah was earning $40000-$12000 =$28000 After receiving a 40% raise,
Lylah earns 40/100*$28000 = $〈〈40/100*28000=11200〉〉11200 In total, four years later, Lylah earns
$28000+$11200= $〈〈28000+11200=39200〉〉39200. The total amount of salary the two earn four years
later is $39200+$56000 = $〈〈39200+56000=95200〉〉95200
#### 95200〈—endoftext—〉


EXPERT-CRAFTED PROMPTS AND ANSWERS


No prompt: Adrien’s salary four years ago was $40000, and since he was earning 40% more four
years later, he was earning 40/100*$40000 = $〈〈40000*40/100=16000〉〉16000 more. The total amount
of money Adrien was earning four years later is $40000+$16000 = $〈〈40000+16000=56000〉〉56000
Since Adrien’s salary was 30 percent higher than Lylah’s, Lylah was earning 30/100*$40000 =
$〈〈30/100*40000=12000〉〉12000 less than Adrien four years ago. The total amount of money Lylah was
earning four years ago is $40000-$12000 = $〈〈40000-12000=28000〉〉28000 Four years later, Lylah’s
salary increased by 40/100*$40000 = $〈〈40/100*40000=16000〉〉16000 The total amount of money Ly-
lah was earning four years later is $28000+$16000 = $〈〈28000+16000=44000〉〉44000 Together, the two
were earning $44000+$56000 = $〈〈44000+56000=100000〉〉100000 four years later. Answer: 100000 .
CoT prompt: First, we need to find out how much Adrien was earning four years ago. We are told that
Adrien’s salary four years ago was $40000.
Next, we need to calculate how much Adrien’s salary increased by. We are told that his salary increased
by 40% from what he was making four years ago. To find the increase, we multiply his salary four years
ago by 40/100: $40000 * 40/100 = $16000.
To find out how much Adrien is currently earning, we add the increase to his salary four years ago:
$40000 + $16000 = $56000.
Now, we need to find out how much Lylah was earning four years ago. We are told that Adrien’s salary
was 30% higher than Lylah’s. To find Lylah’s salary, we divide Adrien’s salary four years ago by 1.3
(since 100% + 30% = 130% = 1.3): $40000 / 1.3 = $30769.23 (rounded to the nearest cent).
Next, we need to calculate how much Lylah’s salary increased by. We are told that Adrien’s and Lylah’s
salary increased simultaneously. Therefore, Lylah’s salary increased by the same percentage as Adrien’s,
which is 40%. To find the increase, we multiply Lylah’s salary four years ago by 40/100: $30769.23 *
40/100 = $12307.69 (rounded to the nearest cent).
To find out how much Lylah is currently earning, we add the increase to her salary four years ago:
$30769.23 + $12307.69 = $43076.92 (rounded to the nearest cent).
Finally, to find the total salary the two were receiving four years later, we add Adrien’s and Lylah’s
current salaries: $56000 + $43076.92 = $99076.92 (rounded to the nearest cent).
Therefore, the total salary the two were receiving four years later is $99076.92.
APE prompt: First, let’s find out how much Adrien was earning four years ago. We know that his salary
had increased by 40%, so we can calculate his salary four years ago by dividing his current salary by 1.4:
$40000 / 1.4 = $28571.43
Now, let’s find out how much Lylah was earning four years ago. We know that Adrien’s salary was 30%
higher than Lylah’s, so we can calculate Lylah’s salary four years ago by dividing Adrien’s salary four
years ago by 1.3: $28571.43 / 1.3 = $21978.02
Now, let’s find out how much Adrien is earning currently. We know that his salary has increased by 40%,
so we can calculate his current salary by multiplying his salary four years ago by 1.4: $40000 * 1.4 =
$56000
Now, let’s find out how much Lylah is earning currently. We know that Adrien’s salary is 30% higher than
Lylah’s, so we can calculate Lylah’s current salary by dividing Adrien’s current salary by 1.3: $56000 /
1.3 = $43076.92
Finally, let’s calculate the total salary the two are receiving four years later by adding Adrien’s and Lylah’s
current salaries: $56000 + $43076.92 = $99076.92
Therefore, the total salary the two were receiving four years later is $99076.92.
ToT prompt: Expert 1: Let’s start by finding out Adrien’s salary four years ago. We know that his salary
increased by 40% since then, so we can calculate his salary four years ago by dividing his current salary
by 1.4.
Expert 2: Adrien’s current salary is $40,000, and we know that it increased by 40% since four years ago.
So, to find out his salary four years ago, we can divide $40,000 by 1.4.
Expert 3: $40,000 divided by 1.4 is approximately $28,571.43. Therefore, Adrien’s salary four years ago
was $28,571.43.


Expert 1: Now that we know Adrien’s salary four years ago, we can calculate his current salary by adding
a 40% increase to his salary four years ago.
Expert 2: Adrien’s salary four years ago was $28,571.43, and we know that it increased by 40%. To find
out his current salary, we can multiply $28,571.43 by 1.4.
Expert 3: $28,571.43 multiplied by 1.4 is approximately $40,000. Therefore, Adrien’s current salary is
$40,000.
Expert 1: Now that we have Adrien’s current salary, we can calculate Lylah’s salary by finding out what
30% of Adrien’s salary is and subtracting it from Adrien’s salary.
Expert 2: Adrien’s current salary is $40,000, and we know that it is 30% higher than Lylah’s salary. To
find out Lylah’s salary, we can calculate 30% of $40,000 and subtract it from $40,000.
Expert 3: 30% of $40,000 is $12,000. Therefore, Lylah’s salary is $40,000 - $12,000 = $28,000.
Expert 1: Finally, to calculate the total salary the two were receiving four years later, we can add Adrien’s
current salary and Lylah’s current salary.
Expert 2: Adrien’s current salary is $40,000, and Lylah’s current salary is $28,000. Therefore, the total
salary the two were receiving four years later is $40,000 + $28,000 = $68,000.

PROMPT-OIRL DISCOVERED SUCCESSFUL PROMPTS AND ANSWERS


Prompt-OIRL Discovered Prompt 1: Imagine we’re navigating a maze; each decision brings us closer
to the center. Let’s map our route:
Prompted Answer: 1. Let’s start by finding out how much Adrien was earning four years ago. We know
that his salary was $40000 at that time.
2. Adrien’s salary increased by 40% over the next four years. To calculate his current salary, we need to
add 40% of $40000 to $40000. 40% of $40000 is (40/100) * $40000 = $16000. Adding this to his initial
salary, we get $40000 + $16000 = $56000.
3. Adrien’s total salary four years later is $56000.
4. We also know that Adrien’s salary was 30% higher than Lylah’s. To find Lylah’s salary, we need
to calculate 30% of Adrien’s salary and subtract it from Adrien’s salary. 30% of $56000 is (30/100) *
$56000 = $16800. Subtracting this from Adrien’s salary, we get $56000 - $16800 = $39200.
5. Lylah’s salary four years later is $39200.
6. To find the total salary of Adrien and Lylah four years later, we need to add their individual salaries.
$56000 + $39200 = $95200.
7. Therefore, the total salary of Adrien and Lylah four years later is $95200.
Prompt-OIRL Discovered Prompt 2: Like an architect constructing a building, let’s design our answer
carefully, layer by layer:
Prompted Answer: 1. Adrien’s salary four years ago was $40000. 2. Adrien’s salary increased by 40%
four years later, so his current salary is $40000 + ($40000 * 40%) = $40000 + $16000 = $56000. 3.
Lylah’s salary four years ago was 30% lower than Adrien’s salary, so her salary was $40000 - ($40000
* 30%) = $40000 - $12000 = $28000. 4. Lylah’s salary also increased by 40% four years later, so her
current salary is $28000 + ($28000 * 40%) = $28000 + $11200 = $39200. 5. The total salary the two
were receiving four years later is $56000 + $39200 = $95200.
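A short script reproducing the golden answer's salary arithmetic (whole dollars; variable names are ours):
# Golden-answer arithmetic in whole dollars.
adrien_raise = 40000 * 40 // 100                       # $16000
adrien_after = 40000 + adrien_raise                    # $56000
lylah_before = 40000 - 40000 * 30 // 100               # $28000, i.e. $12000 less than Adrien four years ago
lylah_after = lylah_before + lylah_before * 40 // 100  # $28000 + $11200 = $39200
assert adrien_after + lylah_after == 95200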

PROMPT-OIRL DISCOVERS BETTER PROMPTS. CASE 3


Question:
Sadie slept 8 hours on Monday. For the next two days, she slept 2 hours less, each, because she had to
complete some assignments. If the rest of the week she slept 1 hour more than those two days, how many
hours did she sleep in total throughout the week?


Answer: If on Monday she slept 8 hours, then she slept 8-2 = 〈〈8-2=6〉〉6 hours on each of the next two
days. The total number of hours slept for the two days will be 6*2 = 〈〈6*2=12〉〉12 hours. If she slept 1
hour more for the rest of the week each day, then each day she slept 6+1 = 〈〈6+1=7〉〉7 hours. The total
for the rest of the week will be 4*7 = 〈〈4*7=28〉〉28 hours. The total number of hours slept for the whole
week will be 8+12+28 = 〈〈8+12+28=48〉〉48 hours.
#### 48〈—endoftext—〉

EXPERT-CRAFTED PROMPTS AND ANSWERS


No prompt: On Tuesday and Wednesday, Sadie slept 8 - 2 = 〈〈8-2=6〉〉6 hours each day. On Thursday
and Friday, she slept 6 + 1 = 〈〈6+1=7〉〉7 hours each day. Throughout the week, Sadie slept 8 + 6 + 6 + 7
+ 7 = 〈〈8+6+6+7+7=34〉〉34 hours. Answer: 34 .
CoT prompt: First, let’s calculate how many hours Sadie slept on Tuesday and Wednesday. She slept
2 hours less than on Monday, so she slept 8 - 2 = 〈〈8-2=6〉〉6 hours on each of those days. Next, let’s
calculate how many hours Sadie slept on the rest of the week. She slept 1 hour more than on Tuesday
and Wednesday, so she slept 6 + 1 = 〈〈6+1=7〉〉7 hours on each of those days. Finally, let’s calculate the
total number of hours Sadie slept throughout the week. On Monday, she slept 8 hours. On Tuesday and
Wednesday, she slept 6 hours each, so that’s 6 + 6 = 〈〈6+6=12〉〉12 hours. On the rest of the week, she
slept 7 hours each day, so that’s 7 + 7 + 7 + 7 + 7 = 〈〈7+7+7+7+7=35〉〉35 hours. In total, Sadie slept 8 +
12 + 35 = 〈〈8+12+35=55〉〉55 hours throughout the week. Answer: 55 .
APE prompt: First, let’s calculate how many hours Sadie slept on Tuesday and Wednesday. She slept 2
hours less each day, so she slept 8 - 2 = 〈〈8-2=6〉〉6 hours on Tuesday. And on Wednesday, she also slept
2 hours less, so she slept 8 - 2 = 〈〈8-2=6〉〉6 hours. Now let’s calculate how many hours she slept for
the rest of the week. She slept 1 hour more than the previous two days, so she slept 6 + 1 = 〈〈6+1=7〉〉7
hours for the rest of the week. Now let’s add up all the hours she slept throughout the week: 8 + 6 + 6 +
7 = 〈〈8+6+6+7=27〉〉27 hours. So, Sadie slept a total of 27 hours throughout the week.
ToT prompt: Expert 1: Step 1 - Calculate the total hours of sleep for the next two days. Sadie slept 8
hours on Monday. For the next two days, she slept 2 hours less each day. So, on Tuesday and Wednesday,
she slept 8 - 2 = 6 hours each day.
Expert 2: Step 2 - Calculate the total hours of sleep for the rest of the week. The rest of the week consists
of 5 days (Thursday, Friday, Saturday, Sunday, and Monday). Sadie slept 1 hour more than the previous
two days, so she slept 6 + 1 = 7 hours each day for the rest of the week.
Expert 3: Step 3 - Calculate the total hours of sleep throughout the week. On Monday, Sadie slept 8
hours. On Tuesday and Wednesday, she slept 6 hours each day. And for the rest of the week, she slept 7
hours each day. So, the total hours of sleep throughout the week are 8 + 6 + 6 + (7 * 5) = 8 + 6 + 6 + 35
= 55 hours.

PROMPT-OIRL DISCOVERED SUCCESSFUL PROMPTS AND ANSWERS


Prompt-OIRL Discovered Prompt 1: Consider this as a puzzle, each piece contributing to the final
answer. Let’s place each piece, one by one:
Prompted Answer: Sadie slept 8 hours on Monday. On Tuesday and Wednesday, she slept 2 hours less
each day, so she slept 8 - 2 = 〈〈8-2=6〉〉6 hours each day. For the rest of the week, she slept 1 hour more
than those two days, so she slept 6 + 1 = 〈〈6+1=7〉〉7 hours each day. From Monday to Wednesday, she
slept a total of 8 + 6 + 6 = 〈〈8+6+6=20〉〉20 hours. From Thursday to Sunday, she slept a total of 7 + 7
+ 7 + 7 = 〈〈7+7+7+7=28〉〉28 hours. Therefore, Sadie slept a total of 20 + 28 = 〈〈20+28=48〉〉48 hours
throughout the week. Answer: 48 .
Prompt-OIRL Discovered Prompt 2: Three wise philosophers are debating this question, each con-
tributing a different aspect of the answer. Let’s follow their discourse:
Prompted Answer: Philosopher 1: Let’s start by calculating how many hours Sadie slept on Tuesday
and Wednesday. We know that she slept 2 hours less each day, so she slept 8 - 2 = 6 hours on Tuesday
and Wednesday.
Philosopher 2: That’s correct. Now, let’s calculate how many hours Sadie slept on the rest of the week.
We know that she slept 1 hour more than the previous two days, so she slept 6 + 1 = 7 hours on each of
the remaining days.


Philosopher 3: So, Sadie slept 8 hours on Monday, 6 hours on Tuesday and Wednesday, and 7 hours on
the remaining days. To find the total hours she slept throughout the week, we can add these numbers
together: 8 + 6 + 6 + 7 + 7 + 7 + 7 = 48 hours.
Conclusion: Sadie slept a total of 48 hours throughout the week.
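The golden answer's day-by-day accounting, as a brief check (variable names are ours):
# Golden-answer accounting for the week.
monday = 8
tuesday_and_wednesday = 2 * (8 - 2)   # 6 hours on each of the next two days
rest_of_week = 4 * (6 + 1)            # 7 hours on each of the remaining four days
assert monday + tuesday_and_wednesday + rest_of_week == 48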

E.3 GENERATED HUMAN-READABLE PROMPTS

We provide case studies on the LLaMA-7B-HF model for a better understanding of Prompt-OIRL; the examples below are grouped by dataset (GSM8K, SVAMP, and MAWPS).
GSM8K

EXAMPLE
Question: For four hours, Patrick sold 15 cups of lemonade per hour at a price of $0.50 per cup. In the
next two hours, he sold 10 cups of lemonade per hour at a price of $0.60 per cup. How much money did
Patrick earn, in dollars, from selling lemonade for 6 hours?
Golden Answer: Patrick earned 0.50 × 15 = $7.50 from selling the lemonade for $0.50 per cup. So
for four hours, his total earnings are $7.50 × 4 = $30. He earned $0.60 × 10 = $6 from selling the
lemonade for $0.60 per cup. So for two hours, his total earnings are $6 × 2 = $12. Therefore, Patrick’s
total earnings: $30 + $12 = $42. Hence, Patrick earned a total of $42 from selling lemonade for 6 hours.
held-out prompt: As an electrician wires a circuit, each connection powers the system:
prompted answer: The electrician wires a circuit for four hours, selling 15 cups of lemonade per hour
at a price of $0.50 per cup. For the next two hours, the electrician sells 10 cups of lemonade per hour at
a price of $0.60 per cup. To find out how much money Patrick earned from selling lemonade for 6 hours,
we need to calculate the total number of cups of lemonade sold during each hour and then multiply it by
the price per cup. For the first 4 hours, Patrick sold 15 cups of lemonade per hour, so he sold a total of 60
cups of lemonade (15 x 4 = 60). At a price of $0.50 per cup, he earned a total of $30 (60 x $0.50 = $30).
For the next 2 hours, Patrick sold 10 cups of lemonade per hour, so he sold a total of 20 cups of lemonade
(10 x 2 = 20). At a price of $0.60 per cup, he earned a total of $12 (20 x $0.60 = $12). Therefore, Patrick
earned a total of $42 from selling lemonade for 6 hours ($30 + $12 = $42).
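A quick check of the golden answer's earnings calculation, in cents to avoid rounding (names are ours):
# Golden-answer earnings in cents.
first_four_hours = 4 * 15 * 50   # 3000 cents = $30 from the $0.50 cups
next_two_hours = 2 * 10 * 60     # 1200 cents = $12 from the $0.60 cups
assert first_four_hours + next_two_hours == 4200   # $42 in total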

EXAMPLE
Question: Kelly has 5 quarters and 2 dimes. If she buys a can of pop for 55 cents, how many cents will
she have left?
Golden Answer: Her quarters are worth 5 × 25 = 125 cents. The dimes are worth 2 × 10 = 20 cents.
Kelly has a total of 125 + 20 = 145 cents. After buying the can of pop, Kelly will have 145 − 55 = 90
cents left.
held-out prompt: As an electrician wires a circuit, each connection powers the system:
prompted answer: The electrician has 5 quarters, which is equal to 5 × 25 = 125 cents. She also has 2
dimes, which is equal to 2 × 10 = 20 cents. In total, the electrician has 125 + 20 = 145 cents. If she
buys a can of pop for 55 cents, she will have 145 − 55 = 90 cents left.


SVAMP

EXAMPLE
Question: Henry bought 2 new fireworks and his friend bought 3. If they already had 6 saved from last
year, how many do they have now?
Golden Answer: 11.0 fireworks.
held-out prompt: As a librarian categorizes books, let’s sort the information accordingly:
prompted answer: “Henry bought 2 new fireworks” means “Add 2 to the total number of fireworks
Henry has.” “His friend bought 3” means “Add 3 to the total number of fireworks his friend has.”
So, let’s add the total number of fireworks Henry and his friend have: 6 (from last year) + 2 (new ones
Henry bought) + 3 (new ones his friend bought) = 11 fireworks in total.
Therefore, they have 11 fireworks in total.

EXAMPLE
Question: A trivia team had 5.0 members in total, but during a game, 2.0 members didn’t show up. If
each member that did show up scored 6.0 points, how many points were scored in total?
Golden Answer: 18.0 points.
held-out prompt: Let’s approach this like a mathematician proving a theorem, step by logical step:
prompted answer:
Step 1: Identify the given information: The trivia team had 5.0 members total. 2.0 members did not show
up. Each member that did show up scored 6.0 points.
Step 2: Use the information provided to find the total number of members that showed up: Total number
of members that showed up = 5.0 − 2.0 = 3.0.
Step 3: Find the total number of points scored by the members who showed up: Total points scored =
3.0 × 6.0 = 18.0.
Therefore, the trivia team scored 18.0 points in the game.


MAWPS

EXAMPLE
Question: While shopping, Emily bought 5.0 green towels and 30.0 white towels. If she gave her mother
26.0 of them, how many towels did Emily end up with?
Golden Answer: Emily ended up with 9.0 towels.
held-out prompt: As a librarian categorizes books, let’s sort the information accordingly:
prompted answer: Emily bought 5.0 green towels. Emily bought 30.0 white towels. Emily gave her
mother 26.0 towels (a combination of green and white towels).
Now, let’s deduce how many towels Emily ended up with: Total towels Emily bought = 5.0 + 30.0 =
35.0. After giving towels to her mother, Emily was left with: 35.0 − 26.0 = 9.0 towels.
Therefore, Emily ended up with 9.0 towels in total.

EXAMPLE
Question: A construction company is repaving a damaged road. So far, they have repaved a total of
4938.0 inches of the road. Today, they repaved 805.0 inches of the road. How many inches of the road
had they repaved before today?
Golden Answer: The company had repaved 4133.0 inches before today.
held-out prompt: Think of it as mixing colors for a painting. Every shade adds depth and nuance:
prompted answer: The total repaved so far is 4938.0 inches. Today, they added another layer (shade) of
805.0 inches.
Using the analogy of mixing colors, we remove today’s shade to reveal the previous one: Inches repaved
before today = 4938.0 − 805.0 = 4133.0 inches.
Thus, before today, the company had repaved 4133.0 inches of the road.
