
Learning to Retrieve Iteratively for In-Context Learning

Yunmo Chen*, Tongfei Chen, Harsh Jhamtani, Patrick Xia, Richard Shin†
Jason Eisner, Benjamin Van Durme
Microsoft
[email protected], {tongfeichen,hjhamtani,patrickxia,jeisner,ben.vandurme}@microsoft.com

arXiv:2406.14739v1 [cs.CL] 20 Jun 2024

Abstract

We introduce iterative retrieval, a novel framework that empowers retrievers to make iterative decisions through policy optimization. Finding an optimal portfolio of retrieved items is a combinatorial optimization problem, generally considered NP-hard. This approach provides a learned approximation to such a solution, meeting specific task requirements under a given family of large language models (LLMs). We propose a training procedure based on reinforcement learning, incorporating feedback from LLMs. We instantiate an iterative retriever for composing in-context learning (ICL) exemplars and apply it to various semantic parsing tasks that demand synthesized programs as outputs. By adding only 4M additional parameters for state encoding, we convert an off-the-shelf dense retriever into a stateful iterative retriever, outperforming previous methods in selecting ICL exemplars on semantic parsing datasets such as SMCalFlow, TreeDST, and MTOP. Additionally, the trained iterative retriever generalizes across different inference LLMs beyond the one used during training.

Figure 1: Above: ICL under a single retriever call. Below: ICL under our proposed iterative retriever.

1 Introduction

A significant emergent capability of large language models (LLMs) is in-context learning (ICL; Brown et al., 2020), which facilitates few-shot learning. In ICL, a set of exemplars¹ is usually provided to build the mapping relationship between inputs and outputs. These exemplars can either be hand-crafted and fixed or retrieved from a training set. However, if retrieving from the dataset, the retrievers used in such applications are typically off-the-shelf models (e.g., Contriever (Izacard et al., 2022)) that do not consider interactions among retrieved items when multiple targets are required, nor the specific characteristics of the inference LLMs and downstream task requirements. Research (Gao et al., 2021; Liu et al., 2022; Lu et al., 2022, i.a.) has shown that ICL is sensitive to both the exemplars provided and their order within prompts. Off-the-shelf retrievers, which generally rank items based solely on semantic similarity (Lee et al., 2019; Reimers and Gurevych, 2019a, i.a.), do not ensure optimal conditions for either criterion, leading to suboptimal performance in downstream LLM generation. Hence, there is a need for a retriever capable of constructing a portfolio of items tailored to achieve optimal generation with LLMs.

We propose iterative retrieval to address this problem. Unlike traditional retrievers that perform a single call to obtain a list of similar items ordered by their similarities, iterative retrieval involves a sequence of retrieval calls, each using a different query vector. This makes the retriever stateful, maintaining an internal state. The process can be likened to navigating the encoding space of exemplars, with each step adjusting direction based on previously selected exemplars, thus building a trajectory of exemplar selections.

* Johns Hopkins University; performed while interning at Microsoft.
† Google; performed while at Microsoft.
¹ An exemplar is a tuple of input and output, demonstrating the mapping relationship between the two.
This approach can be formulated as a Markov decision process (MDP). At each step, the action taken by the retriever is a retrieval call that fetches (potentially multiple) documents from the dataset D.² The policy is trained to optimally select exemplars at each step so that the overall trajectory maximizes the reward, leading to better ICL performance. By leveraging the LLMs as environments, we create simulators that allow a policy to roll out in the environment and receive feedback on the effectiveness of the composed prompts, measured by a reward (metric). Thus, exemplar selection and prompt composition can be framed as policy optimization aimed at maximizing rewards, which can be addressed through reinforcement learning.

We situate our study in in-context semantic parsing due to its difficulty, popularity, and practical value.³ We instantiate an iterative retriever and investigate the performance of policy learning under this setup. Our contributions include:

• We propose a novel iterative retrieval framework that builds a portfolio of exemplars for ICL, considering both interactions among retrieved exemplars and their relationship with LLMs;

• We instantiate this iterative retriever for the in-context semantic parsing task and train its policy via reinforcement learning, demonstrating superior performance over strong baselines from prior work, thereby proving its effectiveness;

• Through a series of analyses, we provide insights into the behaviors of an iterative retriever initialized with an off-the-shelf retriever.

² The action space is at least as large as D.
³ Code generation is considered one of the most useful but challenging techniques in the era of LLMs. Some semantic parsing tasks share structural similarity with code generation and program synthesis.
2 Overview of an Iterative Retriever

We consider the problem of in-context learning (ICL): given a dataset D = {(x_i, y_i)}_i of exemplars, a retriever R retrieves a sequence of exemplars R(x) based on the input query x, and the answer y is generated from the distribution P_LM(· | x; R(x)). This retriever R : X → D^K retrieves an ordered list (of length K) of exemplars for the LM. The goal of the retriever R is to select a sequence of exemplars ((x_i, y_i))_{1≤i≤K} such that the probability of the expected output y is maximized:

    arg max_{(x_i, y_i) ∈ D} P_LM(y | x; ((x_i, y_i))_{1≤i≤K}).    (1)

However, this is a combinatorial optimization problem that is computationally infeasible to solve exactly. Much prior work resorts to selecting the top-k exemplars based on a scoring function S:

    R(x) = arg top-k_{(x', y') ∈ D} S(x, (x', y')).    (2)

Prior work has differed on the choice of the scoring function S: BM25 (Roy et al., 2023), coverage (Gupta et al., 2022), etc. However, such methods do not model the interaction between the retrieved exemplars and the language model. We propose an iterative version, where we create a retrieval state s, and for each step i one exemplar (x, y) ∈ D is retrieved. This is an approximation to the optimization problem in Equation 1.

    (x_i, y_i) ← R_step(s_i)    (3)
    s_{i+1} ← τ(s_i, (x_i, y_i))    (4)

After K steps, the retrieved sequence would be R_iter(x) = ((x_i, y_i))_{1≤i≤K}. This formulation of an iterative retriever naturally fits the definition of a Markov decision process (MDP). Here, our decision process comprises (D*, D, τ, r), where

• The state set D* contains exemplar sequences whose elements are in D;

• The action set is just D: each action selects one exemplar from D. In theory, more than one exemplar can be selected at each step, but we proceed with just one exemplar for simplicity;

• The transition function τ : D* × D → D* appends an exemplar to the existing sequence;

• The reward function r : D* × D → ℝ funnels signal from the LLM back to the retriever. It will be discussed in §4.

By situating our proposed iterative retriever under this RL scenario, we can utilize all sorts of RL techniques to train this retriever from the environment, which is the LLM itself. In the next section, we instantiate a neural iterative retriever and situate it under a common task, namely semantic parsing, under this ICL framework.
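The decision process defined by Equations 3 and 4 can be outlined in a few lines of Python. In this sketch, retrieve_step is a placeholder for R_step (any policy that scores exemplars in D), and the transition τ is simply appending to the exemplar sequence; it is an illustrative outline, not the released implementation.

from dataclasses import dataclass
from typing import Callable, List, Tuple

Exemplar = Tuple[str, str]       # an (input x, output y) pair from D
State = List[Exemplar]           # an element of D*: exemplars chosen so far

@dataclass
class IterativeRetriever:
    retrieve_step: Callable[[State], Exemplar]   # R_step, Equation 3

    def rollout(self, num_steps: int) -> State:
        state: State = []                        # the empty sequence is the initial state
        for _ in range(num_steps):
            exemplar = self.retrieve_step(state) # action: pick one (x_i, y_i) from D
            state = state + [exemplar]           # tau appends the exemplar, Equation 4
        return state                             # R_iter(x) after K steps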
<latexit sha1_base64="4kwaA5pwv9kDLucouR3yTnL/CPg=">AAADYXicpVJda9swFJXjfbTZV9o99kUsDFIYwR6j22PX7qEPDc3Y0hbiECTlJhGVZCPJ24zq/M8+D/Y7Krt+aNoxxnbAcHzvubr36ohmghsbRVdBK3zw8NHjjc32k6fPnr/obG2fmjTXDEYsFak+p8SA4ApGllsB55kGIqmAM3pxWOXPvoE2PFVfbZHBRJKF4nPOiPWhaecXHk5doiU+HpTJAV8senhV4FVNL/Hqr4Hf4ITyqvofUZfv/u8ptw9is9QaXG+yO+10o35UA98ncUO6qMFwuhWwZJayXIKyTBBjxnGU2Ykj2nImoGwnuYGMsAuygLGnikgwE1cbUuLXPjLD81T7T1lcR29XOCKNKST1Skns0tzNVcHf5ca5nX+YOK6y3IJiN43mucA2xZW7eMY1MCsKTwjT3M+K2ZJowqx/A2tdLPcDr6/h2woBOs38Igq+2x/1GO3kE/gb0DDwfx9FtiQUrEuqHCOidCeDL6Vj0hSlk6VTf9JTWrqRlxnaaL0t8V0T7pPTt/14r7/3+V13v9cYtIF20CvUQzF6j/bRERqiEWLBcaADF1y2foabYSfcvpG2gqbmJVpDuHMNPK8qRA==</latexit>
(do
(Yield 1 y
(Yield
(Execute y2 !
Thank you, can (Execute (ReviseConstraint …
What about the What about in
%LM H x
tailgating party?
, x1 you also decline (ConfirmAndReturnAction)))
(Yield
, x2 January?
(EventDuringRange
(FullMonthOfMonth
,···
the Tailgate Party (EventAttendence … (January))))))
(?~= “Tailgate Party”)))))

Fenc ·
<latexit sha1_base64="0anXxibT8mkEgPd43ALq1L7SyJU=">AAAChHicfVFNbxMxEHUW2oZQaApHLiuiSjmgaBfox6kKogcuFUE0H1ISVbYzaaz4Y2XPtkTW/gyu8Lv4NzjJIpEEMZKl55k3mjdvWCaFwyT5VYkePd7bP6g+qT09fPb8qH78oudMbjl0uZHGDhh1IIWGLgqUMMgsUMUk9Nn847LevwfrhNE3uMhgrOidFlPBKYbUcKQozjiV/qq4rTeSVrKKeBekJWiQMjq3xxU+mhieK9DIJXVumCYZjj21KLiEojbKHWSUz+kdDAPUVIEb+5XmIj4JmUk8NTY8jfEq+3eHp8q5hWKBudTotmvL5L9qwxynF2MvdJYjaL4eNM1ljCZeGhBPhAWOchEA5VYErTGfUUs5Bps2pqAIgjfXCGOlBGuysIiGB/y2klEbXUFwwMJ1+H2Q2YwyQP/H28J/vv5aeK7covCq8Pp/fMYK3w00x0puOEu6fYRd0HvbSs9aZ1/eN9rN8kBV8oq8Jk2SknPSJp9Ih3QJJ4Z8Jz/Iz2g/ehO9i07X1KhS9rwkGxFd/gbd5snH</latexit>

<latexit sha1_base64="4kwaA5pwv9kDLucouR3yTnL/CPg=">AAADYXicpVJda9swFJXjfbTZV9o99kUsDFIYwR6j22PX7qEPDc3Y0hbiECTlJhGVZCPJ24zq/M8+D/Y7Krt+aNoxxnbAcHzvubr36ohmghsbRVdBK3zw8NHjjc32k6fPnr/obG2fmjTXDEYsFak+p8SA4ApGllsB55kGIqmAM3pxWOXPvoE2PFVfbZHBRJKF4nPOiPWhaecXHk5doiU+HpTJAV8senhV4FVNL/Hqr4Hf4ITyqvofUZfv/u8ptw9is9QaXG+yO+10o35UA98ncUO6qMFwuhWwZJayXIKyTBBjxnGU2Ykj2nImoGwnuYGMsAuygLGnikgwE1cbUuLXPjLD81T7T1lcR29XOCKNKST1Skns0tzNVcHf5ca5nX+YOK6y3IJiN43mucA2xZW7eMY1MCsKTwjT3M+K2ZJowqx/A2tdLPcDr6/h2woBOs38Igq+2x/1GO3kE/gb0DDwfx9FtiQUrEuqHCOidCeDL6Vj0hSlk6VTf9JTWrqRlxnaaL0t8V0T7pPTt/14r7/3+V13v9cYtIF20CvUQzF6j/bRERqiEWLBcaADF1y2foabYSfcvpG2gqbmJVpDuHMNPK8qRA==</latexit>
(Yield
(Execute 1 y
(Yield
(Execute y 2 !
(ReviseConstraint …) (ReviseConstraint …)
What about the what about the OK, how about
%LM H x
tailgating party?
, x1
video camp?
(ConstraintTypeIntension
(Event.subject_? (
, x2 the circus?
(ConstraintTypeIntension)
(Event.subject_? (
,···
(?~= “video camp”)))))) (?~= “circus”))))))

D · D ·
<latexit sha1_base64="0anXxibT8mkEgPd43ALq1L7SyJU=">AAAChHicfVFNbxMxEHUW2oZQaApHLiuiSjmgaBfox6kKogcuFUE0H1ISVbYzaaz4Y2XPtkTW/gyu8Lv4NzjJIpEEMZKl55k3mjdvWCaFwyT5VYkePd7bP6g+qT09fPb8qH78oudMbjl0uZHGDhh1IIWGLgqUMMgsUMUk9Nn847LevwfrhNE3uMhgrOidFlPBKYbUcKQozjiV/qq4rTeSVrKKeBekJWiQMjq3xxU+mhieK9DIJXVumCYZjj21KLiEojbKHWSUz+kdDAPUVIEb+5XmIj4JmUk8NTY8jfEq+3eHp8q5hWKBudTotmvL5L9qwxynF2MvdJYjaL4eNM1ljCZeGhBPhAWOchEA5VYErTGfUUs5Bps2pqAIgjfXCGOlBGuysIiGB/y2klEbXUFwwMJ1+H2Q2YwyQP/H28J/vv5aeK7covCq8Pp/fMYK3w00x0puOEu6fYRd0HvbSs9aZ1/eN9rN8kBV8oq8Jk2SknPSJp9Ih3QJJ4Z8Jz/Iz2g/ehO9i07X1KhS9rwkGxFd/gbd5snH</latexit>

<latexit sha1_base64="0anXxibT8mkEgPd43ALq1L7SyJU=">AAAChHicfVFNbxMxEHUW2oZQaApHLiuiSjmgaBfox6kKogcuFUE0H1ISVbYzaaz4Y2XPtkTW/gyu8Lv4NzjJIpEEMZKl55k3mjdvWCaFwyT5VYkePd7bP6g+qT09fPb8qH78oudMbjl0uZHGDhh1IIWGLgqUMMgsUMUk9Nn847LevwfrhNE3uMhgrOidFlPBKYbUcKQozjiV/qq4rTeSVrKKeBekJWiQMjq3xxU+mhieK9DIJXVumCYZjj21KLiEojbKHWSUz+kdDAPUVIEb+5XmIj4JmUk8NTY8jfEq+3eHp8q5hWKBudTotmvL5L9qwxynF2MvdJYjaL4eNM1ljCZeGhBPhAWOchEA5VYErTGfUUs5Bps2pqAIgjfXCGOlBGuysIiGB/y2klEbXUFwwMJ1+H2Q2YwyQP/H28J/vv5aeK7covCq8Pp/fMYK3w00x0puOEu6fYRd0HvbSs9aZ1/eN9rN8kBV8oq8Jk2SknPSJp9Ih3QJJ4Z8Jz/Iz2g/ehO9i07X1KhS9rwkGxFd/gbd5snH</latexit>

Fenc Fenc Q Fenc


Q

s0 GRU s1 GRU s2 GRU

Figure 2: ICL prompt construction for an example in SMC AL F LOW. Above: ICL with BM25 as the retriever. Below:
An instance of our iterative retriever. BM25 retrieves examples that overlaps lexically with the query, whereas the
trained iterative retriever is better at retrieving structurally similar exemplars since it is trained to maximize the
probability of the LM generating the reference parse.

CalFlow
  x: When is my next staff meeting scheduled for?
  y: (Yield :output (Event.start :obj (FindNumNextEvent :constraint (Event.subject_? :obj (?~= "staff meeting")) :number 1L)))

TreeDST
  x: Hey assistant, what is the price range of Stazione restaurant?
  y: (plan (Find :focus (Restaurant.priceRange_? always) :object (Restaurant.restaurantName_? (?= "Stazione restaurant"))))

MTOP
  x: Set up a reminder to message Mike at 7pm tonight.
  y: [IN:CREATE_REMINDER [SL:TODO [IN:SEND_MESSAGE [SL:METHOD_MESSAGE message] [SL:RECIPIENT Mike]]] [SL:DATE_TIME at 7pm tonight]]

Figure 3: Samples of (x, y) pairs for semantic parsing under different datasets used in this paper.

We instantiate a neural iterative retriever based on the formulation we proposed above:

• The state of the MDP, i.e., the sequence of exemplars, is modeled by a fixed-length vector s ∈ ℝ^d. The initial state s_0 is a parameter.

• At each step one exemplar is retrieved. We define a policy distribution that picks one exemplar from the training set D, similar to Lu et al. (2023):

    π((x_i, y_i) | s_i) ∝ exp(Q(s_i) · F_enc(x_i) / β)    (5)

where Q : ℝ^d → ℝ^d maps a state vector s_i to a query vector q_i, F_enc : V* → ℝ^d is a text embedder that maps a text sequence into a vector, and β is a temperature hyperparameter. In our experiments, F_enc is initialized with the weights of Contriever (Izacard et al., 2022), a general-purpose text embedder trained for retrieval. Under this policy, if we take greedy decoding, the retrieval step would just be

    (x_i, y_i) ← R_step(s_i) = arg max_{(x', y') ∈ D} π((x', y') | s_i) = arg max_{(x', y') ∈ D} Q(s_i) · F_enc(x').    (6)

This is a maximum inner product search (MIPS) problem, and thus can be solved with a vector index such as FAISS (Douze et al., 2024).

• State transition is modeled by a gated recurrent unit (GRU; Chung et al., 2014) update:

    s_{i+1} ← GRU(s_i, F_enc(x_i))    (7)

where the encoded vector of the retrieved exemplar x_i is passed to the GRU to update the state.⁴

Note that the only additional parameters we include in this neural iterative retriever are those of the state transition model, which we instantiate as a GRU. This is different from a regular retriever, where a single retrieval call to the training set, R(x) = arg max_{(x, y) ∈ D} q · F_enc(x), is made. The iterative retriever navigates the encoding space of exemplars, adjusting the query vector q_i at each step based on previously selected exemplars, thus steering the search process to find new candidates. Figure 2 demonstrates the process of such an iterative retriever. This stateful design allows for optimized retrieval results through iterative interactions, incorporating signals from both external sources (LLMs) and internal states (previously retrieved items tracked via state transitions).

⁴ Using a Transformer decoder here results in more unstable training, as we discovered in our experiments. See §6.1.
sources (LLMs) and internal states (previously re- given the existing exemplar sequence 𝑠𝑖 :
trieved items tracked via state transitions).
𝑟 (𝑠𝑖 , 𝑥𝑖 ) = 𝑃LM (𝑦 ∗ | 𝑥; 𝑠𝑖 , (𝑥𝑖 , 𝑦 𝑖 ))
4 Training − 𝑃LM (𝑦 ∗ | 𝑥; 𝑠𝑖 ). (8)

Policy Optimization We employ proximal pol-


Environment Simulator To construct feedback
icy optimization (PPO; Schulman et al., 2017) to
(or reward) from the underlying LLMs, we treat
train an iterative retriever for its stability and ef-
LLMs as environments where actions are per-
ficiency.7 One core idea of PPO is to define a
formed and evaluated. We design an iterative
clipping term that controls the policy optimization
prompting schedule within this LLM environment
process, so that variance is reduced. Given a trajec-
to simulate the process of iterative retrieval and
tory (𝑥 1 , · · · , 𝑥𝑇 ), we have
 
corresponding ICL prompt execution. At each step
𝑖, the current sequence of chosen exemplars, 𝑠𝑖 , is L𝑖 (𝜃) = Ê𝑖 min(𝜌𝑖 , clip 𝜀 (𝜌𝑖 )) · 𝐴ˆ 𝑖 ) , (9)
clip

turned into an LLM prompt using a predefined tem-


plate,5 then used for LLM generation. This sched- 𝜋 𝜃 ( 𝑥𝑖 |𝑠𝑖 )
where 𝜌𝑖 = 𝜋 𝜃old ( 𝑥𝑖 |𝑠𝑖 ) is a probability ratio be-
ule effectively simulates the real-world scenario of
tween action 𝑥𝑖 8 performed against the current pol-
prompting LLMs, allowing us to observe various
icy 𝜋 𝜃 and old policy 𝜋 𝜃old at state 𝑠𝑖 , clip 𝜀 (𝜌) clips
𝜌 to be within (1 − 𝜀, 1 + 𝜀) and 𝐴ˆ is the advantage.
execution dynamics, such as generated hypotheses
Advantage 𝐴ˆ 𝑖 at step 𝑖 describes how much bet-
and their probabilities.
ter it is to take a specific action 𝑥 𝑖 at state 𝑠𝑖 , over
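A sketch of how the reward in Equation 8 could be computed against the LLM environment follows. The function reference_prob is a hypothetical helper that returns P_LM(y* | prompt), e.g., from the per-token probabilities exposed by a vLLM-style inference server; both it and the exact whitespace of the prompt builder are assumptions for illustration, not the authors' implementation.

from typing import Callable, List, Tuple

Exemplar = Tuple[str, str]

def build_prompt(exemplars: List[Exemplar], query: str) -> str:
    """Instantiate the ICL template of Appendix A.3 with the chosen exemplars."""
    lines = ["Let's translate what a human user says into what a computer might say.", ""]
    for x, y in exemplars:
        lines += [f"Human: {x}", f"Computer: {y}", ""]
    lines += [f"Human: {query}", "Computer:"]
    return "\n".join(lines)

def step_reward(
    prior_exemplars: List[Exemplar],              # s_i
    new_exemplar: Exemplar,                       # (x_i, y_i), the action taken
    query: str,                                   # x
    reference: str,                               # y*
    reference_prob: Callable[[str, str], float],  # hypothetical: P_LM(y* | prompt)
) -> float:
    """Equation 8: gain in reference likelihood from appending one exemplar."""
    p_with = reference_prob(build_prompt(prior_exemplars + [new_exemplar], query), reference)
    p_without = reference_prob(build_prompt(prior_exemplars, query), reference)
    return p_with - p_without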
Policy Optimization  We employ proximal policy optimization (PPO; Schulman et al., 2017) to train the iterative retriever, for its stability and efficiency.⁷ One core idea of PPO is to define a clipping term that controls the policy optimization process so that variance is reduced. Given a trajectory (x_1, ···, x_T), we have

    L_i^clip(θ) = Ê_i[ min(ρ_i, clip_ε(ρ_i)) · Â_i ],    (9)

where ρ_i = π_θ(x_i | s_i) / π_θold(x_i | s_i) is the probability ratio of action x_i⁸ under the current policy π_θ and the old policy π_θold at state s_i, clip_ε(ρ) clips ρ to lie within (1 − ε, 1 + ε), and Â_i is the advantage.

The advantage Â_i at step i describes how much better it is to take a specific action x_i at state s_i than to randomly select an action according to π(x_i | s_i). To compute it, besides the neural model defined in §3, we follow common practice in reinforcement learning from human feedback (RLHF; Huang et al., 2024) and add a single linear layer to serve as a state-value function V(s) = v · s that maps states to values. Generalized advantage estimation (GAE; Schulman et al., 2016) is then used for variance-reduced advantage estimation atop the learned state-value function:

    Â_i = δ_i + (γλ)δ_{i+1} + ··· + (γλ)^{T−i+1} δ_{T−1}    (10)
    δ_i = r_i + γV(s_{i+1}) − V(s_i)    (11)

where r_i is the reward obtained at step i, γ is the discount factor, and λ downweighs rewards corresponding to delayed effects. Following Schulman et al. (2017) on PPO in the Actor-Critic style, we minimize the value function error term with a squared-error loss, and add an entropy bonus term −H:

    L_i^PPO = Ê_i[ L_i^clip(θ) + c_1 Â_i²(θ) − c_2 H^{π_θ}(s_t) ]    (12)

where c_1 and c_2 are coefficients.

⁷ We experimented with various other RL algorithms (including policy gradient (Sutton et al., 1999) and advantage actor-critic (A2C; Mnih et al., 2016)) and found that PPO is the most stable one for our scenario.
⁸ x_i describes the action of the iterative retriever, which is to retrieve an exemplar from a candidate set; hence x_i ∈ D.
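A compact PyTorch sketch of Equations 9 to 12, written with the common backward GAE recursion (equivalent to the discounted sum of δ terms above) and the standard clipped-surrogate form from Schulman et al. (2017); coefficient names mirror c_1, c_2, γ, and λ from Appendix A.2. This is an illustrative implementation consistent with PPO as described, not the authors' training code.

import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation (Equations 10-11), computed backwards in time.

    `values` has length T+1: V(s_0), ..., V(s_T), including a bootstrap for the final state.
    """
    T = rewards.shape[0]
    adv = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # Eq. 11
        running = delta + gamma * lam * running                  # discounted sum of deltas
        adv[t] = running
    return adv

def ppo_loss(logp_new, logp_old, advantages, values_pred, returns, entropy,
             eps=0.2, c1=0.5, c2=0.01):
    """Loss to minimize: negative clipped surrogate + value error, minus entropy bonus."""
    ratio = torch.exp(logp_new - logp_old)                       # rho_i
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)           # clip_eps(rho_i)
    policy_term = torch.min(ratio * advantages, clipped * advantages).mean()
    value_term = ((values_pred - returns) ** 2).mean()           # squared-error critic loss
    return -policy_term + c1 * value_term - c2 * entropy.mean()

# Example with a toy trajectory of length T = 4:
rewards = torch.tensor([0.1, 0.0, 0.2, 0.05])
values = torch.tensor([0.3, 0.25, 0.2, 0.1, 0.0])   # V(s_0), ..., V(s_4)
adv = gae_advantages(rewards, values)
returns = adv + values[:-1]                          # regression targets for the value head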
Sampling & Collecting Experience  In a single retrieval step, the retriever selects an exemplar from a candidate set, with the policy π_θ(x_i | s_i) defining a probability distribution over candidates x_i ∈ D. In this RL simulation, it is crucial to sample different actions at each step to enable the model to explore various trajectories and benefit from those that yield higher rewards. However, constructing the entire distribution and sampling from it at each step is computationally infeasible, especially when the number of candidates exceeds 100K. Furthermore, these distributions often exhibit a long-tailed nature, where many candidates have low scores, suggesting that a significant portion of candidates may be less similar and potentially less useful for ICL.

To address these challenges and reduce the computational cost of sampling trajectories while managing the trade-off between exploration and exploitation, we propose a stratified sampling method (with N_s strata) to construct a modified policy π̃ that contains k candidates. To start, we construct a buffer with the top-B exemplars retrieved with Equation 5. We retain the top k/N_s samples in the policy, split the rest into (N_s − 1) strata, and sample k/N_s from each. We then combine all these selected exemplars and renormalize their scores with a softmax (with temperature β_renorm). This method enables the model to focus on more promising candidates while still allowing for exploration (see Figure 4 for an illustration).

Figure 4: Stratified sampling employed in our approach. Our sampling method retains the top k/N_s samples and splits the rest into (N_s − 1) strata to perform stratified sampling. The resulting k samples are renormalized to construct the action distribution.
During training, experience replay (Lin, 1992) is employed to improve training efficiency. To collect experience, we run inference with the current policy fixed on several training examples to generate trajectories. At each step, information such as policy, reward, and value is recorded. These trajectories are stored in a replay buffer, then shuffled and split into mini-batches for policy optimization. This approach allows the same experiences to be replayed multiple times, reducing the number of required simulation runs.
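The experience-collection loop can be outlined as follows. This is a schematic sketch under the description above (a frozen rollout policy, a shuffled replay buffer split into mini-batches); rollout_fn is a hypothetical helper that runs one ICL episode, not a function from the authors' codebase.

import random
from dataclasses import dataclass
from typing import List

@dataclass
class Transition:
    state: object       # s_i (e.g., the GRU state)
    action: int         # index of the retrieved exemplar x_i
    logprob: float      # log pi_theta_old(x_i | s_i), needed for the PPO ratio
    reward: float       # r_i from Equation 8
    value: float        # V(s_i) from the learned state-value head

def collect_experience(rollout_fn, queries: List[str], buffer: List[Transition]) -> None:
    """Roll out the current (frozen) policy on training queries and store transitions."""
    for q in queries:
        buffer.extend(rollout_fn(q))        # hypothetical: one episode per query

def minibatches(buffer: List[Transition], batch_size: int):
    """Shuffle the replay buffer and yield mini-batches for policy optimization."""
    random.shuffle(buffer)
    for i in range(0, len(buffer), batch_size):
        yield buffer[i : i + batch_size]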
5 Experimental Setup

Datasets  We validate our pilot iterative retriever for ICL on a set of semantic parsing datasets, namely SMCalFlow (Andreas et al., 2020), TreeDST (Cheng et al., 2020), and MTOP (English portion only; Li et al., 2021), following the BenchClamp benchmark (Roy et al., 2023). Samples of representations are shown in Figure 3. For statistics, see Appendix A.1.

Baselines  We compare our iterative retriever (henceforth denoted as IterR) with a range of off-the-shelf retrievers, including BM25 (Robertson and Zaragoza, 2009) and a dense encoder, Contriever (Izacard et al., 2022). Additionally, we benchmark against two strong baselines from prior work on improving exemplar selection: EPR (Rubin et al., 2022) and CEIL (Ye et al., 2023). EPR is an efficient exemplar retrieval method for in-context learning (ICL) that leverages a scoring LM to label positive and negative training examples, then uses this dataset to contrastively train a dense retriever. CEIL uses determinantal point processes (DPPs) to model the interaction between the given input and in-context exemplars.

For the EPR baseline, we replace the base dense retrieval encoder with Contriever instead of S-BERT (Reimers and Gurevych, 2019b) for a fair comparison. Following Ye et al. (2023), we use the trained EPR model as initialization for CEIL. Similarly, the same EPR checkpoint is used to initialize the text encoder in IterR. Note that in IterR, we freeze the weights of the EPR encoder and only train the GRU-based state transition function, policy network, and value network, resulting in 4M more parameters compared to the original Contriever (110M → 114M).

For retrievers without iterative capabilities, we adapt them by taking only the top-k retrieved items and keeping their original ranks. For EPR, CEIL, and IterR, we selected the best performing model checkpoints on the validation set. All generation is run with 10 exemplars; i.e., k = 10.

Generation with LLMs  The inference LLM is essential for executing input prompts to generate responses. In our experiments, we use Llama-2-7b to build the environment simulator and train the policy using its signals. With the learned policy, we investigate both intra-family and inter-family generalization by replacing the inference LLMs. For models within the same Llama-2 family, we explore various model sizes and finetuned versions, including CodeLlama-70b-Instruct (Rozière et al., 2023), a model further fine-tuned for code generation. For inter-family experiments, we choose Mistral-7b (Jiang et al., 2023). For decoding configurations, we consistently use beam search with beam size 3 and sampling temperature 0.5.
Hyperparameters  Please refer to Appendix A.2.

Evaluation Metrics  We follow prior work in evaluating semantic parsing (Roy et al., 2023), where exact match at k (EM@k) is used. Exact match over the top-k decoded hypotheses reflects the beam search decoding used with LLMs, where multiple parsing results are generated simultaneously. However, EM is a stringent metric, penalizing even minor mismatches. For instance, a parse with a substructure reordered differently (a && b) from the reference (b && a) is still correct but would score zero under EM. This is problematic in semantic parsing, where target parses are compositional, making it important to assess the correctness of substructures. Since SMCalFlow and TreeDST involve deeply nested structures, we also adopt SMatch (Cai and Knight, 2013), following Chen et al. (2023), to evaluate performance on substructures. SMatch was designed to evaluate AMRs (Langkilde and Knight, 1998). Generated code can be transformed to AMRs by treating each function's return value as an entity and each argument to a function as a value, where the parameter name is the relation. See Appendix B for details.
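A small sketch of the EM@k computation described above, assuming each test example comes with k beam hypotheses; the whitespace normalization applied before comparison is an assumption here, not a documented detail of the evaluation.

from typing import List

def exact_match_at_k(hypotheses: List[List[str]], references: List[str], k: int) -> float:
    """EM@k: fraction of examples whose reference appears among the top-k hypotheses."""
    assert len(hypotheses) == len(references)
    norm = lambda s: " ".join(s.split())   # assumed normalization: collapse whitespace
    hits = 0
    for beams, ref in zip(hypotheses, references):
        if any(norm(h) == norm(ref) for h in beams[:k]):
            hits += 1
    return hits / len(references)

# Example: beam size 3, as in the decoding configuration above.
preds = [["(Yield (A))", "(Yield (B))", "(Yield (C))"]]
golds = ["(Yield (B))"]
print(exact_match_at_k(preds, golds, k=3))  # 1.0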
Retriever     EM@1  EM@2  EM@3   SMatch-P  SMatch-R  SMatch-F

SMCalFlow
BM25          39.8  43.7  44.0   66.6      64.2      65.3
Contriever    44.0  48.5  48.9   68.4      66.8      67.6
EPR           48.5  52.0  52.3   73.3      76.7      75.0
CEIL          51.1  54.2  55.8   74.9      75.2      75.1
IterR         54.1  58.4  58.5   76.6      78.1      77.3

TreeDST
BM25          50.8  56.1  56.6   81.8      81.8      81.8
Contriever    54.7  60.4  61.0   83.3      82.5      82.9
EPR           54.0  58.2  58.8   84.7      83.4      84.0
CEIL          56.2  58.3  61.6   81.3      84.4      84.9
IterR         58.2  63.4  63.8   85.5      85.8      85.7

MTOP
BM25          57.4  63.2  63.9   -         -         -
Contriever    59.3  64.2  64.7   -         -         -
EPR           62.3  68.8  69.2   -         -         -
CEIL          63.6  69.4  69.8   -         -         -
IterR         63.9  70.9  71.0   -         -         -

Table 1: Comparison of our approach, IterR, against baselines. "EM@k" denotes exact match at top-k; "P", "R", and "F" denote precision, recall, and F1 score, respectively. Experiment results are run with 10 exemplars in the prompt and averaged over 3 inference runs; significance tests using a paired t-test confirm that the improvements over Contriever, EPR, and CEIL are statistically significant (p < 0.05).

6 Results & Analyses

We evaluate the performance of different retrievers by comparing their downstream ICL performance on semantic parsing (Table 1). IterR outperforms all baselines across the three datasets on all metrics.

The gain in EM is intuitive since it aligns with the training objective, which involves the probability of generating target parses. The improvement in SMatch indicates that IterR optimizes retrieval results to improve compositionality to some extent, even with a simple objective.⁹

⁹ A more dedicated reward design, such as incorporating various linearizations of target structures, might further enhance IterR's performance. This work focuses on demonstrating the framework's effectiveness rather than dedicatedly optimizing for a specific task design.

Generalization across Inference LLMs  IterR benefits from interactive training with an underlying LLM. While training incurs costs, these can be minimized by training only once, ideally using a smaller LM. Hence, in this section we investigate the generalization capabilities of IterR trained with a smaller LM A but used for generation under a larger LM B.

In the following experiments, IterR is trained with Llama-2-7b as the environment, but used for (a) intra-family LMs: variants within the Llama-2 model family; and (b) inter-family LMs: Mistral (Jiang et al., 2023), from a different model family. We follow the setups described in §6, substituting only the LLM. As shown in Figure 5, IterR significantly outperforms baselines (> 1% gain) for 75% of the settings and is comparable to a prior strong baseline (within 1% in absolute performance) for 15% of the settings, demonstrating its generalization within and beyond its own model family.

In intra-family generalization, performance metrics improve with larger model sizes, and IterR consistently outperforms all baselines. This improvement is most evident with larger models such as Llama-2-70b and CodeLlama-70b-Instruct.
For inter-family generalization, IterR maintains its advantage across datasets, though this is less pronounced than within the same model family. This is expected, as the signal from the LLM simulator is more representative for models sharing the same pre-training procedure. Notably, with Mistral, Contriever performs worse than BM25 on MTOP, but IterR still shows improvement. This suggests that IterR, comprising a frozen EPR and additional GRU layers, can learn task-specific abilities not present in the vanilla EPR.

Figure 5: Performance comparisons using various LLMs for inference (top row: SMCalFlow; mid: TreeDST; bottom: MTOP). The IterR used in these experiments is trained with Llama-2-7b but performs retrieval of ICL exemplars used with other LLMs.

ICL & Number of Exemplars  We investigated how the performance of IterR changes with the number of exemplars ({1, ···, 10}) used for ICL on the SMCalFlow dataset (Figure 6). IterR consistently outperforms baseline models across various metrics and numbers of exemplars, with one exception for the EM@3 metric when using 6 exemplars. This aligns with our training objective, where actions that boost performance at each step receive higher advantages. IterR achieves comparable performance with fewer exemplars.

CEIL shows a similar trend in EM, but its SMatch performance lags significantly, indicating poorer quality in near-miss predictions compared to IterR. Practically, this means our method allows for a trade-off between performance and cost, enabling effective ICL with fewer exemplars and reducing the number of tokens processed by LLMs.

Figure 6: Performance comparisons across the various numbers of exemplars used for ICL (panels: EM@1, EM@3, and SMatch on SMCalFlow).

Variant                          EM@1   SMatch-F
Contriever                       44.0   67.6
IterR                            54.1   77.3
  − EPR initialization           45.1   68.8
  − GRU; + Transformer decoder   50.1   75.1
  − Stratified sampling          52.3   75.7

Table 2: Results of the ablation study. "− EPR initialization" indicates that the model is trained from Contriever instead of an EPR-finetuned checkpoint. "+ Transformer decoder" replaces the GRU with a Transformer decoder. "− Stratified sampling" replaces the stratified sampling described in Figure 4 with sampling directly from the buffer.

6.1 Ablation Study

We further conduct an ablation study on the components of an iterative retriever, focusing on the SMCalFlow dataset and using Llama-2-7b while changing the configuration of the iterative retriever. Results are reported in Table 2.

EPR Initialization  Although we follow prior work in using EPR as the initialization for F_enc, our iterative retriever is agnostic to the choice of base encoder for similarity search. Even without EPR initialization, our training procedure still improves performance against Contriever (≈ 1% gain under Contriever, but ≈ 6% gain under EPR). We see that IterR benefits more when using EPR initialization, significantly outperforming the baselines. We hypothesize that this advantage stems from two sources: (1) EPR is fine-tuned on the target dataset, making it more domain-specific; (2) EPR restructures the action space, subsequently enhancing sample efficiency in RL training.
State Transition with Transformer Decoder  In §3, we parameterize the state transition function in the iterative retriever with a GRU. To explore alternatives, we conducted an ablation experiment by replacing the GRU with a more powerful Transformer decoder, configured with 3 layers, 1024 hidden dimensions, and learnable positional encodings. Despite the increased expressiveness of the Transformer decoder, we observed a performance drop. During training, employing the warmup technique (Xiong et al., 2020) led to a trivial solution where the policy learned to predict a nearly fixed trajectory across test examples. Disabling the warmup stabilized the training but did not improve performance. Developing a stabilized approach to training the Transformer decoder as a state encoder is beyond the scope of this work, as our focus is on demonstrating the overall framework of iterative retrieval rather than optimizing a specific model for the state transition function. Notably, even with the less powerful GRU, our iterative retriever successfully learns a policy that retrieves a more optimized sequence of ICL exemplars.

Effectiveness of Stratified Sampling  To collect diverse experience from policy rollouts, we introduce a stratified sampling method (described in §4) that balances the trade-off between exploration and exploitation. We found that sampling from the raw policy in Equation 5 results in a significant performance drop. Additionally, qualitative examination of several such distributions revealed a preference for exploitation over exploration, as similar items at the top of the retrieved list all had higher probabilities.

7 Additional Related Work

LLMs as Environment in RL  Lu et al. (2023) used policy gradient to learn a dense retriever for ICL exemplar retrieval, but the state does not contain previously selected examples, so the method is not iterative and cannot model exemplar order. Zhang et al. (2022) used Q-learning for ICL exemplar reordering, with a reward design similar to ours. However, the proposed method does not extend to exemplar retrieval, since the policy space is too large to be handled by Q-learning.

Few-shot Semantic Parsing  Few-shot semantic parsing using LLMs has shown impressive capabilities in understanding new examples with minimal training data (Shin et al., 2021; Shin and Van Durme, 2022). However, these parsers often struggle with generalization and fail to parse unobserved local structures due to their limited access to information encoded through exemplars (Bogin et al., 2022). To this end, recent research has explored various approaches to improving exemplar selection. EPR (Rubin et al., 2022) used a proxy LM to score outputs from an unsupervised retriever, enabling better training of a dense retriever. Oren et al. (2021), Gupta et al. (2022), and Levy et al. (2023) emphasize learning to select exemplars based on particular criteria, such as diversity measures and coverage of local structures, to enhance compositional generalization. While these approaches have shown performance improvements in semantic parsing tasks, they are largely based on heuristics constructed from researchers' experience. Our approach could be seen as an automated version (through RL) of seeking information useful for semantic parsing.

8 Conclusion

We proposed iterative retrievers, which iteratively build a prompt to perform in-context learning. Such retrievers are framed as Markov decision processes and trained via policy optimization from LLM feedback, where the policy directs which exemplar to append to the existing exemplar sequence. Experiments on semantic parsing demonstrated performance gains of iterative retrievers across various datasets and state-of-the-art baselines, showing that they are able to construct prompts that improve in-context learning and downstream LLM generation.
Limitations

In our instantiation of the iterative retriever, a single exemplar is retrieved at each step. One could envision multiple exemplars being retrieved at each step, thus making the RL trajectory shorter. This could make RL training easier and inference faster.

Our reward design depends on a particular linearization of the target structure. A more structured reward function may exhibit better training behavior and lead to better performance.

The encoder for queries in the iterative retriever is frozen in our current setup. A trainable query encoder that receives feedback from LLMs may be desirable, but we leave that for future work.

While we believe that semantic parsing / code generation is one of the most useful but challenging tasks for LLMs and, as such, is a representative task for ICL research, we have not tested the effectiveness of iterative retrievers on other LLM tasks.

Acknowledgements

This work has been supported by the U.S. National Science Foundation under grant 2204926. Any opinions, findings, and conclusions or recommendations expressed in this article are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

Jacob Andreas, John Bufe, David Burkett, Charles Chen, Josh Clausman, Jean Crawford, Kate Crim, Jordan DeLoach, Leah Dorner, Jason Eisner, Hao Fang, Alan Guo, David Hall, Kristin Hayes, Kellie Hill, Diana Ho, Wendy Iwaszuk, Smriti Jha, Dan Klein, Jayant Krishnamurthy, Theo Lanman, Percy Liang, Christopher H. Lin, Ilya Lintsbakh, Andy McGovern, Aleksandr Nisnevich, Adam Pauls, Dmitrij Petters, Brent Read, Dan Roth, Subhro Roy, Jesse Rusak, Beth Short, Div Slomin, Ben Snyder, Stephon Striplin, Yu Su, Zachary Tellman, Sam Thomson, Andrei Vorobev, Izabela Witoszko, Jason Wolfe, Abby Wray, Yuchen Zhang, and Alexander Zotov. 2020. Task-oriented dialogue as dataflow synthesis. Transactions of the Association for Computational Linguistics, 8:556–571.

Ben Bogin, Shivanshu Gupta, and Jonathan Berant. 2022. Unobserved local structures make compositional generalization hard. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2731–2747, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Shu Cai and Kevin Knight. 2013. Smatch: an evaluation metric for semantic feature structures. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 748–752, Sofia, Bulgaria. Association for Computational Linguistics.

Yunmo Chen, William Gantt, Tongfei Chen, Aaron White, and Benjamin Van Durme. 2023. A unified view of evaluation metrics for structured prediction. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12868–12882, Singapore. Association for Computational Linguistics.

Jianpeng Cheng, Devang Agrawal, Héctor Martínez Alonso, Shruti Bhargava, Joris Driesen, Federico Flego, Dain Kaplan, Dimitri Kartsaklis, Lin Li, Dhivya Piraviperumal, Jason D. Williams, Hong Yu, Diarmuid Ó Séaghdha, and Anders Johannsen. 2020. Conversational semantic parsing for dialog state tracking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8107–8117, Online. Association for Computational Linguistics.

Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.

Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The faiss library.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3816–3830, Online. Association for Computational Linguistics.
and the 11th International Joint Conference on Natu- Haoran Li, Abhinav Arora, Shuohui Chen, Anchit
ral Language Processing (Volume 1: Long Papers), Gupta, Sonal Gupta, and Yashar Mehdad. 2021.
pages 3816–3830, Online. Association for Computa- MTOP: A comprehensive multilingual task-oriented
tional Linguistics. semantic parsing benchmark. In Proceedings of the
16th Conference of the European Chapter of the Asso-
Shivanshu Gupta, Sameer Singh, and Matt Gardner. ciation for Computational Linguistics: Main Volume,
2022. Structurally diverse sampling for sample- pages 2950–2962, Online. Association for Computa-
efficient training and comprehensive evaluation. In tional Linguistics.
Findings of the Association for Computational Lin-
guistics: EMNLP 2022, pages 4966–4979, Abu Long Ji Lin. 1992. Self-improving reactive agents based
Dhabi, United Arab Emirates. Association for Com- on reinforcement learning, planning and teaching.
putational Linguistics. Mach. Learn., 8:293–321.

Shengyi Huang, Michael Noukhovitch, Arian Hosseini, Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan,
Kashif Rasul, Weixun Wang, and Lewis Tunstall. Lawrence Carin, and Weizhu Chen. 2022. What
2024. The N+ implementation details of RLHF with makes good in-context examples for GPT-3? In
PPO: A case study on tl;dr summarization. CoRR, Proceedings of Deep Learning Inside Out (DeeLIO
abs/2403.17031. 2022): The 3rd Workshop on Knowledge Extrac-
tion and Integration for Deep Learning Architectures,
Gautier Izacard, Mathilde Caron, Lucas Hosseini, Se- pages 100–114, Dublin, Ireland and Online. Associa-
bastian Riedel, Piotr Bojanowski, Armand Joulin, tion for Computational Linguistics.
and Edouard Grave. 2022. Unsupervised dense in-
formation retrieval with contrastive learning. Trans. Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu,
Mach. Learn. Res., 2022. Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark,
and Ashwin Kalyan. 2023. Dynamic prompt learn-
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men- ing via policy gradient for semi-structured mathe-
sch, Chris Bamford, Devendra Singh Chaplot, Diego matical reasoning. In The Eleventh International
de Las Casas, Florian Bressand, Gianna Lengyel, Conference on Learning Representations, ICLR 2023,
Guillaume Lample, Lucile Saulnier, Lélio Re- Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
nard Lavaud, Marie-Anne Lachaux, Pierre Stock,
Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel,
Teven Le Scao, Thibaut Lavril, Thomas Wang, Timo-
and Pontus Stenetorp. 2022. Fantastically ordered
thée Lacroix, and William El Sayed. 2023. Mistral
prompts and where to find them: Overcoming few-
7b. CoRR, abs/2310.06825.
shot prompt order sensitivity. In Proceedings of the
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying 60th Annual Meeting of the Association for Compu-
Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonza- tational Linguistics (Volume 1: Long Papers), pages
lez, Hao Zhang, and Ion Stoica. 2023. Efficient mem- 8086–8098, Dublin, Ireland. Association for Compu-
ory management for large language model serving tational Linguistics.
with pagedattention. In Proceedings of the 29th Sym- Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi
posium on Operating Systems Principles, SOSP 2023, Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley,
Koblenz, Germany, October 23-26, 2023, pages 611– David Silver, and Koray Kavukcuoglu. 2016. Asyn-
626. ACM. chronous methods for deep reinforcement learning.
CoRR, abs/1602.01783.
Irene Langkilde and Kevin Knight. 1998. Generation
that exploits corpus-based statistical knowledge. In Inbar Oren, Jonathan Herzig, and Jonathan Berant. 2021.
36th Annual Meeting of the Association for Compu- Finding needles in a haystack: Sampling structurally-
tational Linguistics and 17th International Confer- diverse training sets from synthetic data for compo-
ence on Computational Linguistics, Volume 1, pages sitional generalization. In Proceedings of the 2021
704–710, Montreal, Quebec, Canada. Association for Conference on Empirical Methods in Natural Lan-
Computational Linguistics. guage Processing, pages 10793–10809, Online and
Punta Cana, Dominican Republic. Association for
Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Computational Linguistics.
2019. Latent retrieval for weakly supervised open
domain question answering. In Proceedings of the Nils Reimers and Iryna Gurevych. 2019a. Sentence-
57th Annual Meeting of the Association for Computa- bert: Sentence embeddings using siamese bert-
tional Linguistics, pages 6086–6096, Florence, Italy. networks. In Proceedings of the 2019 Conference on
Association for Computational Linguistics. Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Nat-
Itay Levy, Ben Bogin, and Jonathan Berant. 2023. Di- ural Language Processing, EMNLP-IJCNLP 2019,
verse demonstrations improve in-context composi- Hong Kong, China, November 3-7, 2019, pages 3980–
tional generalization. In Proceedings of the 61st An- 3990. Association for Computational Linguistics.
nual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 1401– Nils Reimers and Iryna Gurevych. 2019b. Sentence-
1422, Toronto, Canada. Association for Computa- BERT: Sentence embeddings using Siamese BERT-
tional Linguistics. networks. In Proceedings of the 2019 Conference on
Stephen E. Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389.

Subhro Roy, Samuel Thomson, Tongfei Chen, Richard Shin, Adam Pauls, Jason Eisner, and Benjamin Van Durme. 2023. BenchClamp: A benchmark for evaluating language models on syntactic and semantic parsing. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023.

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton-Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2023. Code Llama: Open foundation models for code. CoRR, abs/2308.12950.

Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2022. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655–2671, Seattle, United States. Association for Computational Linguistics.

John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. 2016. High-dimensional continuous control using generalized advantage estimation. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. CoRR, abs/1707.06347.

Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. REPLUG: Retrieval-augmented black-box language models. CoRR, abs/2301.12652.

Richard Shin, Christopher Lin, Sam Thomson, Charles Chen, Subhro Roy, Emmanouil Antonios Platanios, Adam Pauls, Dan Klein, Jason Eisner, and Benjamin Van Durme. 2021. Constrained language models yield few-shot semantic parsers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7699–7715, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Richard Shin and Benjamin Van Durme. 2022. Few-shot semantic parsing with language models trained on code. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5417–5425, Seattle, United States. Association for Computational Linguistics.

Richard S. Sutton, David A. McAllester, Satinder Singh, and Yishay Mansour. 1999. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, NIPS Conference, Denver, Colorado, USA, November 29 - December 4, 1999, pages 1057–1063. The MIT Press.

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. 2020. On layer normalization in the transformer architecture. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 10524–10533. PMLR.

Jiacheng Ye, Zhiyong Wu, Jiangtao Feng, Tao Yu, and Lingpeng Kong. 2023. Compositional exemplars for in-context learning. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 39818–39833. PMLR.

Yiming Zhang, Shi Feng, and Chenhao Tan. 2022. Active example selection for in-context learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9134–9148, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
A Experiment Details

A.1 Dataset Statistics

Dataset       Train     Dev                 Test
SMCalFlow     108,753   12,271 (500 used)   13,496
TreeDST       121,652   22,910 (500 used)   22,841
MTOP          15,667    2,235 (500 used)    4,386

Table 3: Dataset statistics.

A.2 Hyperparameters

Name                 Search Bounds
Encoder              {GTR-T5, Contriever, SBert}
Learning rate        {5 × 10^-6, 1 × 10^-5, 3 × 10^-5, 5 × 10^-5}
LR scheduler         {reduce-on-plateau, cosine-annealing}
State transition     {GRU, LSTM, Transformer Decoder}
β_renorm             {0.5, 1.0, 5.0, 10.0}
c_1                  {0.1, 0.3, 0.5, 0.7}
c_2                  {0, 0.005, 0.01, 0.05, 0.1, 0.15}
γ                    0.99
λ                    0.95
Action buffer size   768
PPO ratio cutoff     1.2
PPO batch size       128
Replay buffer size   2048
Avg. training time   24 hrs
GPU used             4 Nvidia V100 32 GB
# of parameters*     ~114M (w/ 110M frozen)

Table 4: Hyperparameters and other reproducibility information for IterR. β_renorm is the temperature used to create a renormalized action distribution. c_1 and c_2 are coefficients used in the PPO loss. γ and λ are discount factors used in GAE.
A.3 Prompt Template

Let's translate what a human user says
into what a computer might say.

Human: x_1
Computer: y_1

···

Human: x_N
Computer: y_N

Human: x
Computer:

Table 5: Prompt template used in our experiments. This template will be instantiated as prompts when filled with the retrieved exemplars R(x) = ((x_1, y_1), ···, (x_N, y_N)) and the test example x.

B SMatch Evaluation

For evaluation of semantic parses or code generation on partial results, we utilize SMatch (Cai and Knight, 2013). Generated code can be transformed to AMRs by treating each function's return value as an entity and each argument to a function as a value, where the parameter name is the relation. An example is given below. Consider the following parse in SMCalFlow, expressed in Lisp:

(Yield
  :output (Event.start
    :obj (FindNumNextEvent
      :constraint (Event.subject_?
        :obj (?~= "staff meeting"))
      :number 1L)))

This will be transformed into the AMR shown in Figure 7: $0 (Yield) has an output edge to $1 (Event.start); $1 has an obj edge to $2 (FindNumNextEvent); $2 has a constraint edge to $3 (Event.subject_?) and a number edge to 1L; $3 has an obj edge to $4 (?~=); and $4 has an ARG0 edge to "staff meeting".

Figure 7: Example AMR based on the previous parse.

This AMR can be easily converted to the following triples.

instance($0, Yield)
output($0, $1)
instance($1, Event.start)
obj($1, $2)
instance($2, FindNumNextEvent)
constraint($2, $3)
instance($3, Event.subject_?)
obj($3, $4)
instance($4, ?~=)
ARG0($4, "staff meeting")
number($2, 1L)
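To illustrate the conversion just described, the following Python sketch turns a Lisp-style parse into instance/relation triples and scores precision, recall, and F1 over triple sets. It is a simplified stand-in for SMatch: real SMatch additionally searches for the best variable alignment between the two graphs, which this sketch omits, and the tokenizer and the ARG0 convention for positional arguments are assumptions for illustration, not the authors' implementation.

import re
from itertools import count
from typing import List, Tuple

def parse_sexpr(text: str):
    """Parse a Lisp-style expression into nested Python lists and atom strings."""
    tokens = re.findall(r'"[^"]*"|[()]|[^\s()]+', text)
    pos = 0
    def read():
        nonlocal pos
        tok = tokens[pos]; pos += 1
        if tok == "(":
            node = []
            while tokens[pos] != ")":
                node.append(read())
            pos += 1
            return node
        return tok
    return read()

def to_triples(tree, counter=None, triples=None):
    """Each call's return value becomes an entity $n; each :keyword argument a relation."""
    if counter is None:
        counter, triples = count(), []
    if not isinstance(tree, list):                 # literal leaf, e.g. 1L or "staff meeting"
        return tree, triples
    var = f"${next(counter)}"
    triples.append(("instance", var, tree[0]))     # function name labels the entity
    i = 1
    while i < len(tree):
        if isinstance(tree[i], str) and tree[i].startswith(":"):
            relation, arg = tree[i][1:], tree[i + 1]
            i += 2
        else:                                      # positional argument, labeled ARG0 here
            relation, arg = "ARG0", tree[i]
            i += 1
        child, _ = to_triples(arg, counter, triples)
        triples.append((relation, var, child))
    return var, triples

def triple_f1(pred: List[Tuple[str, str, str]], gold: List[Tuple[str, str, str]]):
    """Simplified SMatch-style P/R/F over triple sets (no variable-alignment search)."""
    p, g = set(pred), set(gold)
    tp = len(p & g)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

parse = '(Yield :output (Event.start :obj (FindNumNextEvent ' \
        ':constraint (Event.subject_? :obj (?~= "staff meeting")) :number 1L)))'
_, trips = to_triples(parse_sexpr(parse))
for t in trips:
    print(t)   # reproduces the instance/output/obj/constraint/number/ARG0 triples above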
