
The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

ExpeL: LLM Agents Are Experiential Learners


Andrew Zhao¹, Daniel Huang², Quentin Xu², Matthieu Lin², Yong-Jin Liu², Gao Huang¹*
¹Department of Automation, BNRist, Tsinghua University
²Department of Computer Science, BNRist, Tsinghua University
{zqc21,huang-jy22,xgd22,lyh21}@mails.tsinghua.edu.cn, {liuyongjin,gaohuang}@tsinghua.edu.cn

Abstract

The recent surge in research interest in applying large language models (LLMs) to decision-making tasks has flourished by leveraging the extensive world knowledge embedded in LLMs. While there is a growing demand to tailor LLMs for custom decision-making tasks, finetuning them for specific tasks is resource-intensive and may diminish the model's generalization capabilities. Moreover, state-of-the-art language models like GPT-4 and Claude are primarily accessible through API calls, with their parametric weights remaining proprietary and unavailable to the public. This scenario emphasizes the growing need for new methodologies that allow learning from agent experiences without requiring parametric updates. To address these problems, we introduce the Experiential Learning (ExpeL) agent. Our agent autonomously gathers experiences and extracts knowledge using natural language from a collection of training tasks. At inference, the agent recalls its extracted insights and past experiences to make informed decisions. Our empirical results highlight the robust learning efficacy of the ExpeL agent, indicating a consistent enhancement in its performance as it accumulates experiences. We further explore the emerging capabilities and transfer learning potential of the ExpeL agent through qualitative observations and additional experiments.

1 Introduction

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Tom Mitchell

Machine learning research has long been captivated by the potential of autonomous agents and their capabilities. In recent times, incorporating large language models into these agents (Wang et al. 2023a; Xi et al. 2023) has unveiled a broad spectrum of applications, even extending beyond academia (Yang et al. 2023a; Significant-Gravitas 2023). One of the significant advantages of LLMs lies in their world knowledge, which makes them inherently versatile across various scenarios (Zhao et al. 2023b).

On the one hand, previous works investigated finetuning LLMs with a large number of environment interactions (Yao et al. 2023c) or with large amounts of human-labeled data (Nakano et al. 2021; Shaw et al. 2023). This class of methods incurs high computational costs and needs access to the LLM's parametric weights. Furthermore, finetuning an LLM restricts its functionality and can hurt its generalization abilities (Du et al. 2022). On the other hand, prompting methods can augment an LLM with better sequential decision-making and planning abilities with only a few in-context examples (Hao et al. 2023; Lin et al. 2023b; Sun et al. 2023). However, since current LLMs are bounded by context window size (Tworkowski et al. 2023), these agents have no recollection of what they have seen, and therefore no learning can be done beyond a few demonstrations. So, how can we strike a balance between these paradigms?

We present the Experiential Learning (ExpeL) agent as a solution. Our agent autonomously gathers experiences from a collection of training tasks through trial and error. From these experiences, it derives natural language insights and employs its own successful experiences as in-context examples during test time. Our agent's learning process is analogous to a student studying for an exam and then taking it in a single attempt, reflecting many real-world situations. Unlike self-improvement methods such as Reflexion (Shinn et al. 2023), our approach emphasizes the importance of retaining experiences across multiple tasks to enhance agent performance. Moreover, ExpeL learns without parameter updates, making it compatible with powerful closed-source models like GPT-4 or Claude. Lastly, the experience-gathering step does not require a large amount of data or human labels. We evaluated ExpeL on three vastly different domains and consistently outperformed strong baselines. Additionally, we showcased a transfer learning scenario in which an agent that accumulated knowledge from source tasks exhibited positive forward transfer to target tasks. Finally, we highlighted some unexpected emergent abilities the ExpeL agent gained.

In summary, our key contributions are as follows: (1) we introduce ExpeL, a novel LLM agent that autonomously learns from experience without gradient updates; (2) we evaluate ExpeL on a diverse set of tasks to showcase its learning abilities and its improvement on top of existing planning methods; (3) we show a novel transfer learning setting for our LLM agent and demonstrate forward transferability from source tasks to target tasks. Lastly, we believe that as planning algorithms and foundation models continue to improve, ExpeL's paradigm stands to gain significant benefits from their enhanced performance.¹

* Corresponding author.
Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
¹ Visit https://andrewzh112.github.io/#expel for prompts and demos, and https://github.com/LeapLabTHU/ExpeL for code.


Figure 1: ExpeL Agent Overview. Left: ExpeL operates in three stages: (1) Collection of success and failure experiences into
a pool. (2) Extraction/abstraction of cross-task knowledge from these experiences. (3) Application of the gained insights and
recall of past successes in evaluation tasks. Right: (A) Illustrates the experience gathering process via Reflexion (Shinn et al.
2023), enabling task reattempt after self-reflection on failures. (B) Illustrates the insight extraction step. When presented with
success/failure pairs or a list of L successes, the agent dynamically modifies an existing list of insights ι̂ using operations ADD,
UPVOTE, DOWNVOTE, and EDIT. This process has an emphasis on extracting prevalent failure patterns or best practices.

2 Related Work

Prompt-based Learning: Prompt-based learning refines label prediction tasks by modifying the input context, facilitating swift adaptation to new tasks with minimal data (Liu et al. 2023a). This approach capitalizes on LLMs providing answers without parameter tuning, as they can be augmented using in-context learning (Brown et al. 2020). LAMA (Petroni et al. 2019) and GPT-3 (Brown et al. 2020) are early works that promoted this formulation. Efforts to reduce the intricacies of prompt design include automatic reasoning chains for NLP (Kojima et al. 2022; Zhang et al. 2023). Similarly, the ExpeL agent also autonomously learns from experiences, using extracted insights and self-generated in-context trajectories to alter the execution prompt.

Retrieval Augmented Generation (RAG): Retrieval allows LLMs to access databases, mitigating hallucinations (Li et al. 2022; Wang, Yang, and Wei 2023; Rubin, Herzig, and Berant 2022; Liu et al. 2022). Retrieval has also been used to enhance the capabilities of decision-making agents (Humphreys et al. 2022; Zhao et al. 2023a). In contrast to these works, we focus on retrieving the ExpeL agent's self-generated experiences, thus reducing the dependency on gold examples and leveraging a domain-specific corpus.

Planning for LLM Agents: The application of LLM agents in fields like robotics, the natural sciences, game-playing, and workflows has surged, with an emphasis on their world knowledge in fewshot settings (Ha, Florence, and Song 2023; Mu et al. 2023; Bran et al. 2023; Boiko, MacKnight, and Gomes 2023; Yang et al. 2023b; Lin et al. 2023a; Nakano et al. 2021; Wang et al. 2023b; Liu et al. 2023b). Moreover, LLMs have demonstrated promising zero/few-shot planning and reasoning capabilities in various configurations (Sumers et al. 2023), including embodied environments and reasoning tasks (Huang et al. 2022; Yao et al. 2023a; Wei et al. 2022b; Yao et al. 2023b; Gong et al. 2023).

Self-improvement and Memory for LLM Agents: Agents like Reflexion showcase feedback-based improvement, yet often lack cross-task memory (Shinn et al. 2023). Other agents exhibit potential for persistent memory within multi-agent contexts (Park et al. 2023; Maas et al. 2023). Our ExpeL agent combines these approaches, focusing on task-solving while benefiting from self-generated in-context examples and abstracted insights from memory.


3 Preliminaries

Complex Interactive Tasks We work with complex interactive tasks where, at each time step i ∈ {0, . . . , H}, the agent receives an observation o ∈ O and, from its observation history Ht, decides to perform an action a ∈ A. The objective of the agent is to achieve some goal g ∈ G. We only deal with deterministic environments in this work.

Large Language Models A large language model is a statistical model of natural language, typically a neural network. In our setting, we use an autoregressive language model (Brown et al. 2020; Touvron et al. 2023b,a; Chowdhery et al. 2023), which, given an ordered list of existing tokens x = {x1, x2, ..., xl−1}, outputs the probability of the next token p(xl | x<l). An instruction-following LLM (Thoppilan et al. 2022; Chung et al. 2022; Wei et al. 2022a) is typically finetuned on various NLP tasks that are formatted into instruction, input, response tuples (Taori et al. 2023). Instruction-tuned models are better at following natural language instructions, which alleviates the need for heavy prompt engineering (Wei et al. 2022a).
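For completeness, these next-token conditionals chain into the standard autoregressive factorization of a full sequence x = {x1, ..., xL} assumed above:

    p(x) = \prod_{l=1}^{L} p(x_l \mid x_{<l}),

which is the quantity such an LLM is trained to model.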
ReAct and Reflexion ReAct (Yao et al. 2023b) and Reflexion (Shinn et al. 2023) are promising frameworks enabling the aforementioned proficiency of LLMs in reasoning and self-improvement. ReAct explicitly intertwines observations, actions, and thoughts, providing a foundation for robust planning and reasoning capabilities. Building upon it, Reflexion introduces an additional reflective step before reattempting the subsequent trial of the same task, enhancing the model's adaptive learning process.

4 ExpeL: An Experiential Learning Agent

Recent advancements in generative LLMs suggest an intriguing approach. Rather than altering the LLM parameters, adjusting the prompts may be more beneficial: this strategy ensures that the LLM's inherent common-sense knowledge remains intact, allowing for superior generalization (Liu et al. 2023a). Furthermore, some of the most potent language models are proprietary. Thus, focusing on prompt-based methods seems promising as a way to harness the strengths of these advanced LLMs. Additionally, previous works on learning in LLM agents have primarily been trained on extensive human-labeled datasets (Lin et al. 2023a; Shaw et al. 2023) or improved via iterative retries (Shinn et al. 2023) on a single task. A relatively less explored area is facilitating agents to learn autonomously from their own experiences, similar to a student gaining insights from practicing for an exam. The student tackles practice problems multiple times to derive insights; at the exam, the student relies solely on these insights and draws on memories of similar problems to answer the questions in one attempt. With this in mind, we wish to design an LLM agent that autonomously gathers experiences and extracts insights, then uses these cross-task insights and memories of similar tasks to aid its decision-making.

We aim to enhance a planning LLM agent, such as ReAct, with learning abilities that allow it to improve through inter-task experiences without any parameter updates. Inspired by the cognitive abilities inherent in human learning, as well as the benefits observed in self-learning autonomous agents and the progress made in prompt-based methods, we developed the Experiential Learning (ExpeL) agent. During the training stage, the agent interacts with the environment, gathering experiences via trial and error. These experiences are stored in an experience pool (Lin 1992). From this pool, the agent later extracts insights, similar to off-policy learning (Watkins and Dayan 1992), in which the agent can learn from the experiences of a behavior policy. During the evaluation stage, the agent attempts unseen tasks with a single try, augmented with the extracted insights and the successful trajectories in the experience pool gathered during the training stage. Refer to Fig. 1 for detailed information on our agent framework.

4.1 Gathering Experiences

To gather diverse experiences that are useful for extracting information, we leverage Reflexion (Shinn et al. 2023) to continuously retry each training task at most Z times. In particular, at the z-th trial the agent is given a training task tn, fewshot examples Fmanual, and past reflections νn,z (initially, νn,0 is the empty string). At first, the agent attempts the task with the fewshot examples concatenated with its current trajectory τn,0 as the context, using ReAct (Yao et al. 2023b) as the base planning algorithm, LLMReAct(· | τn,0, Fmanual, νn,0). On the z-th trial, when the agent finishes the task or the maximum number of steps H is reached, the ExpeL agent's experience pool B ingests the trajectory τn,z. Then, if the agent succeeds, it moves on to the next task. However, if the agent fails, it looks at its failed trajectory and self-reflects, producing νn,z+1 = concat(νn,z, LLMreflect(τn,z)), i.e., a new reflection concatenated with the previous ones, to see where it can do better on the next retry. In the next retry, the agent augments its context with the reflections νn,z+1 as input to the LLM policy, LLMReAct(· | τn,z+1, Fmanual, νn,z+1).

To highlight, this trial-and-error way of gathering experiences not only improves the chances of getting more positive examples for experience recall during evaluation but also allows for collecting valuable success/failure pairs used for comparisons during insight extraction (Sec. 4.2). The pseudo-code can be found in Alg. 1.
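To make the loop above concrete, the following is a minimal Python sketch of the experience-gathering stage (Alg. 1). The names env, llm_react, and llm_reflect are hypothetical stand-ins for an environment wrapper and the two LLM calls; this is a sketch under those assumptions, not the released implementation.

```python
def gather_experiences(train_tasks, env, llm_react, llm_reflect,
                       fewshots, max_trials=3, max_steps=30):
    """Reflexion-style trial-and-error collection of an experience pool."""
    experience_pool = []                      # B: stores every trajectory, success or failure
    for task in train_tasks:
        reflections = ""                      # nu_{n,0}: empty before the first trial
        for trial in range(max_trials + 1):
            obs = env.reset(task)
            trajectory = [obs]
            done, success = False, False
            for _ in range(max_steps):
                # ReAct policy conditioned on the trajectory, fewshots, and past reflections
                action = llm_react(trajectory, fewshots, reflections)
                obs, reward, done = env.step(action)
                trajectory += [action, obs]
                if done:
                    success = reward == 1
                    break
            experience_pool.append(
                {"task": task, "trajectory": trajectory,
                 "success": success, "trial": trial})
            if success:
                break                         # move on to the next task
            # otherwise self-reflect on the failure and retry with the extra reflection
            reflections += "\n" + llm_reflect(trajectory)
    return experience_pool
```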
4.2 Learning from Experiences

Human learning occurs mainly either by storing successful trajectories in memory, which can later be recalled as specific examples, or by extracting high-level insights from experiences, enabling generalization to novel situations. ExpeL considers both of these learning modes to boost task performance. Concretely, an instruction I given to an LLM agent can be broken down into task specifications and fewshot examples. We can augment the task specifications with the agent's extracted insights from past experiences, which an instruction-following LLM can be leveraged to follow closely. For the fewshot examples, we can allow the agent to retrieve the top-k relevant examples from its experience pool to aid its decisions. Next, we detail our experience recall and insight extraction mechanisms.


Similar Experiences as Demonstrations Prior work has shown that using in-context examples that are semantically similar to the task at hand results in better performance (Liu et al. 2022). Moreover, when faced with a novel situation, humans also recall similar tasks they have solved from memory as references when attempting the task (Kahneman 2011). Motivated by these observations, we propose experience recall: retrieving successful trajectories from the experience pool gathered during training based on task similarity. Concretely, we used the Faiss vectorstore (Johnson, Douze, and Jégou 2019) as the experience pool, a kNN retriever, and the all-mpnet-base-v2 (Song et al. 2020) embedder to obtain the top-k successful trajectories with the maximum inner-product task similarity to the evaluation task. The advantage of using task similarity as the retrieval rank is that if the agent repeats a task, or performs a task similar to an existing successful trajectory from the experience pool, the agent only needs to closely imitate that successful trajectory, placing less burden on ability extrapolation.
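As an illustration of this retrieval step, here is a minimal sketch using sentence-transformers and Faiss, matching the components named above (all-mpnet-base-v2 embeddings and maximum inner-product search). The layout of successful_trajectories is an assumption for illustration, not the repository's actual data schema.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-mpnet-base-v2")

def build_index(successful_trajectories):
    """Index successful trajectories by an embedding of their task description."""
    tasks = [t["task"] for t in successful_trajectories]
    vecs = embedder.encode(tasks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])   # inner-product similarity
    index.add(np.asarray(vecs, dtype="float32"))
    return index

def retrieve_topk(eval_task, index, successful_trajectories, k=3):
    """Return the k successful trajectories whose tasks are most similar to eval_task."""
    query = embedder.encode([eval_task], normalize_embeddings=True)
    _, ids = index.search(np.asarray(query, dtype="float32"), k)
    return [successful_trajectories[i] for i in ids[0]]
```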
Learning from Successes and Failures To leverage the diverse outcomes gathered during the experience collection phase, we believe the agent should analyze experiences in two distinct ways. First, we let the agent compare a failed trajectory with a successful trajectory for the same task. This comparison offers a concrete understanding of the agent's shortcomings, highlighting the correct and incorrect actions. Second, we let the agent identify patterns within a set of successful trajectories from different tasks. This approach sheds light on common "good practices" that the agent can adopt to ensure success in evaluation tasks.

For the implementation, we give the agent's instruction-following LLMinsights several operators to apply to an existing set of insights ι̂. We initialize the set of insights to an empty set ι̂ = ∅ and iteratively provide the LLM with fail/success pairs or lists of L successes (created by sampling without replacement) from the experience pool. The operations the LLM can perform are: ADD a new insight, EDIT the content of an existing insight, DOWNVOTE to disagree with an existing insight, or UPVOTE to agree with an existing insight. A newly added insight has an initial importance count of two associated with it; the count is incremented when subsequent UPVOTE or EDIT operations are applied to it and decremented when DOWNVOTE is applied to it. If an insight's importance count reaches zero, it is removed. This particular design choice robustifies the process, since even successful trajectories can be suboptimal and mislead the generated insights. The prompt template we used can be found in Fig. 2. We kept the maximum size of a list of successes at L and used gpt-4-0613 as the default LLMinsights. We empirically found that gpt-4-0613 is better than gpt-3.5-turbo-0613 at following instructions on how to use the insight extraction operators and hallucinated less. Pseudo-code for this process can be found in Alg. 2. Finally, ExpeL utilizes these generated insights ι̂ in the task inference phase, described next.

Figure 2: Insight Extraction Prompt Template. The prompt template ExpeL agents use for insight extraction. The same template is used both for success/fail pairs (A, in yellow) and L-sized lists of successes (B, in green).
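For concreteness, the following is a minimal sketch of the importance-count bookkeeping described above, assuming the LLM's output has already been parsed into (operation, insight id, text) tuples; the parsing itself and the exact prompt wording are omitted, and the data structures are illustrative only.

```python
def apply_insight_operations(insights, operations):
    """insights: dict id -> {"text": str, "importance": int}; operations parsed from LLM_insights."""
    next_id = max(insights, default=0) + 1
    for op, idx, text in operations:
        if op == "ADD":
            insights[next_id] = {"text": text, "importance": 2}  # new insights start at count 2
            next_id += 1
        elif idx in insights and op == "EDIT":
            insights[idx]["text"] = text
            insights[idx]["importance"] += 1
        elif idx in insights and op == "UPVOTE":
            insights[idx]["importance"] += 1
        elif idx in insights and op == "DOWNVOTE":
            insights[idx]["importance"] -= 1
            if insights[idx]["importance"] <= 0:                  # prune disputed insights
                del insights[idx]
    return insights
```

Iterating this update first over fail/success pairs and then over L-sized lists of successes mirrors the structure of Alg. 2.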
4.3 Task Inference

After the agent gathers experiences, extracts insights from them, and sets up a vectorstore of successful trajectories, it can proceed to the evaluation. For each task, the task specifications are augmented with the concatenation of the full list of extracted insights ι̂ = concat(ι1, ι2, ι3, ...), and the top-k trajectories with the highest task similarity are retrieved and used as fewshot in-context examples, Fsimilar tasks. Fig. 3 shows an example prompt template structure, and pseudo-code for this step can be found in Alg. 3. We believe that as the list of extracted insights grows, retrieval could be a feasible solution for managing the context window size.
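A minimal sketch of how such an evaluation prompt could be assembled is shown below; it reuses the hypothetical structures from the earlier sketches, the section markers follow Fig. 3 only loosely, and the wording of ExpeL's actual template differs.

```python
def build_eval_prompt(task_spec, insights, retrieved_trajectories):
    """Concatenate task spec, extracted insights, and retrieved fewshot examples."""
    insight_block = "\n".join(
        f"{i}. {ins['text']}" for i, ins in sorted(insights.items()))
    fewshot_block = "\n\n".join(
        t["trajectory_text"] for t in retrieved_trajectories)  # assumed field name
    return (
        f"{task_spec}\n\n"
        f"Insights from past experience:\n{insight_block}\n\n"
        f"Here are some examples of solved tasks:\n{fewshot_block}\n\n"
        f"Now solve the new task."
    )
```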


Figure 3: Task Inference Prompt Template. We illustrate ExpeL's prompt template during evaluation. The areas with a white background are identical to the base ReAct agent's inputs. We differ by having additional extracted insights from past experience (purple areas) and dynamically retrieved successful in-context examples from past experiences based on task similarity.

Algorithm 1: ExpeL - Experience Gathering
Initialize:
    Policy LLMReAct
    Self-reflection model LLMreflect
    Collection of tasks Ttrain
    Fewshot examples Fmanual
    Experience pool B ← Fmanual
    Number of training tasks N
    Maximum retry number Z
    Maximum step number H
    Current task index n ← 1
while task n ≤ N do
    tn ← Ttrain[n]
    Reflection νn,0 ← ""
    for trial z = 0 to Z do
        o0 ← env.reset(tn)
        Initialize trajectory τn,z ← o0
        for timestep i = 0 to H do
            ai ← LLMReAct(ai | τn,z, Fmanual, νn,z)
            oi+1, ri+1, done ← env.step(ai)
            τn,z ← τn,z ∪ {(oi, ai, oi+1, ri+1)}
            if done then break
        end for
        B ← B ∪ τn,z
        if done or z = Z then
            n ← n + 1
            break
        else
            νn,z+1 ← concat(νn,z, LLMreflect(τn,z))
        end if
    end for
end while
return B
Algorithm 2: ExpeL - Insight Extraction
Initialize:
    Experience pool B (from Alg. 1)
    Insight extraction model LLMinsights
    Set of insights ι̂ ← ∅
Divide the successes in B into L-sized chunks:
    Csuccess = {{τ1^success, τ2^success, ..., τL^success}, {τL+1^success, τL+2^success, ..., τ2L^success}, ...}
Construct fail/success tuples of the same tasks in B:
    Ccompare = {(τ1^success, τ1,0^fail), (τ1^success, τ1,1^fail), ..., (τ2^success, τ2,0^fail), ...}
for each ccompare in Ccompare do
    ι̂ ← LLMinsights(ccompare, ι̂)
end for
for each csuccess in Csuccess do
    ι̂ ← LLMinsights(csuccess, ι̂)
end for
return ι̂


Algorithm 3: ExpeL - Evaluation
Initialize:
    ExpeL agent LLMExpeL
    Text embedder E
    Experience pool B (from Alg. 1)
    Set of insights ι̂ (from Alg. 2)
    Collection of evaluation tasks Tevaluation
    Number of evaluation tasks M
    Number of fewshots k
    Number of successes S ← 0
for task m = 1 to M do
    tm ← Tevaluation[m]
    o0 ← env.reset(tm)
    Initialize trajectory τm ← o0
    Fsimilar tasks ← Faiss(tm, B, E, k)
    for timestep i = 1 to H do
        ai ← LLMExpeL(ai | τm, Fsimilar tasks, ι̂)
        oi+1, ri+1, done ← env.step(ai)
        τm ← τm ∪ {(oi, ai, oi+1, ri+1)}
        if done then break
    end for
    if ri+1 = 1 then
        S ← S + 1
    end if
end for
return S / M

4.4 Transfer Learning

After demonstrating how learning from the experiences of a training set can benefit an LLM agent in solving unseen tasks from the same task distribution, we investigate another interesting setting, in which knowledge accumulated from a source task distribution could be useful for a target task distribution with minimal target task examples for the ExpeL agent. Like most transfer learning settings, we assume that the source and target tasks exhibit common knowledge. Therefore, experiences accumulated from the source tasks can benefit the agent in solving a new set of target tasks.

Similar to pretraining on a source task and finetuning on a target task in the transfer learning literature (Zhuang et al. 2020), we propose to use the extracted insights ι̂ from the source task and fewshot examples from the target task to "finetune" the insights so that they are more applicable to the target task. We hypothesize that using target task fewshot examples can better ground the insights in the target task and mitigate hallucinations. An example prompt template to "finetune" extracted insights from a source domain to tailor them to a target domain is illustrated in Fig. 4.
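A minimal sketch of how such a "finetuning" call could look is given below; the prompt wording is illustrative rather than the template in Fig. 4, and llm stands in for a gpt-4-0613 call.

```python
def finetune_insights(llm, source_insights, source_desc, target_desc, target_fewshots):
    """Ask an LLM to adapt insights extracted on source tasks to a target task distribution."""
    prompt = (
        f"You previously extracted the following insights while solving {source_desc} tasks:\n"
        f"{source_insights}\n\n"
        f"You will now solve {target_desc} tasks. Here are a few examples of such tasks:\n"
        f"{target_fewshots}\n\n"
        "Rewrite the insights so that they apply to the new tasks. "
        "Remove or rephrase anything that is specific to the old tasks."
    )
    return llm(prompt)
```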
Figure 4: Transfer Learning Finetuning Prompt Template. The prompt template used to finetune knowledge from the source to the target domain. The parts highlighted in grey should be formatted with concise descriptions of the tasks.

4.5 ExpeL's Strengths

In this section, we outline the key strengths of our framework. First and foremost, ExpeL offers inherent interpretability, as both the extracted experiences and the successful trajectories are presented in natural language. This design allows users to easily inspect, modify, or remove potentially harmful trajectories/insights, a challenge for finetuned models. Moreover, users can seamlessly add expert insights or trajectories to an ExpeL agent. Additionally, our learning approach is highly accessible: it demands less data, reduces computational resources, and is straightforward to implement. Furthermore, self-improvement methods like Reflexion (Shinn et al. 2023) facilitate intra-task improvements, whereas ExpeL enables inter-task learning; ExpeL also does not rely on retries during deployment, a property that certain domains require. On the flexibility front, the ExpeL agent boasts a significant level of versatility. It is not restricted to specific language models and complements existing strategies aimed at enhancing LLM agent planning capabilities. Moreover, when applied in conjunction with them, ExpeL might even improve the capabilities of finetuned agents. Another strength lies in continuous improvement: our method stands to benefit from the ongoing enhancements of foundation models. As an illustration, our experiments show that using gpt-4 to extract insights outperforms gpt-3.5-turbo (refer to Sec. 5.6). Lastly, we introduced a method for transferring extracted insights across domains using only a small number of finetuning examples, demonstrating the advantage of our approach in diverse settings with limited data.

5 Experiments

5.1 Experimental Setup

In line with ReAct (Yao et al. 2023b), the experiments are designed around four text-based benchmarks: HotpotQA (Yang et al. 2018), a knowledge-intensive dataset that challenges an agent to perform reasoning and question answering using a Wikipedia Docstore API as its search tool; ALFWorld and WebShop (Shridhar et al. 2021; Yao et al. 2022), which require the agent to perform interactive multi-step decision-making tasks in, respectively, a household environment and an online shopping website; and FEVER (Thorne et al. 2018), which focuses on fact verification tasks using the same API as HotpotQA, making it suitable for knowledge transfer (Sec. 5.4). All experiments use four-fold validation, and we report the mean and standard error over the folds. Following ReAct, for all environments we use success rate as the evaluation metric: exact matching for HotpotQA and FEVER, completing the task in time for ALFWorld, and purchasing the item that matches all attributes for WebShop. Additional metrics are reported when the environment offers them: a mean reward score r ∈ [0, 1] for WebShop and a score breakdown per task type for ALFWorld. We use ReAct and Act as the main baseline planning LLM agents (Yao et al. 2023b), where Act lacks the reasoning steps of ReAct. All agents, including ExpeL, used gpt-3.5-turbo-0613 when performing actions during evaluation. All text generations were done with temperature 0 and greedy decoding. Imitation learning (IL) results were taken from the ReAct paper (Yao et al. 2023b).
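As a hedged illustration of this decoding setup (temperature 0, i.e., greedy decoding), a call through the legacy OpenAI Python SDK might look as follows; the wrapper name and message framing are ours and are not part of the ExpeL codebase.

```python
import openai  # legacy (<1.0) SDK interface

def greedy_completion(prompt, model="gpt-3.5-turbo-0613"):
    """Single deterministic completion, as used for agent actions during evaluation."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,          # greedy decoding
    )
    return response["choices"][0]["message"]["content"]
```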


Figure 5: Main Results. Average task success rates (std. error in gray arrows) across three different domains: HotpotQA,
ALFWorld, and WebShop. ReAct and Act are used as baselines. ExpeL consistently outperforms the baselines on all domains,
highlighting the importance of learning from experience. Additionally, we compare ExpeL with ExpeL (retrieve-only) and
ExpeL (insights-only) to highlight that both insight extraction and task similarity retrieval are essential and synergistic.

5.2 Main Results

The primary findings of this study are presented in Fig. 5. The IL-based method struggles to perform efficiently in WebShop and ALFWorld, possibly due to their demand for more substantial priors and reasoning abilities, which conventional training from scratch fails to provide. This limitation shows the promise of leveraging knowledge-based language models to address these challenges. The following claims were made based on (1) a deep understanding of each environment; (2) the extracted insights and retrievable in-context examples; and (3) statistics (e.g., the number of invalid actions per trial) of the runs.

Experiential learning Augmenting agents with abstracted insights and the ability to recall successful trajectories improves performance across all environments compared to the baseline agents. When restricting the ExpeL agent to only one mode of learning (insights-only or retrieval-only), the HotpotQA and ALFWorld environments demonstrate contrasting quantitative distinctions (36%/31% and 50%/55% for HotpotQA and ALFWorld, respectively). The prominent influence of insights on HotpotQA can be attributed to its reliance on the ability to analyze Wikipedia search results, which highlights the need for general guidelines across various question types. Conversely, ALFWorld's task completion, dependent on specific action sets, benefits more from past experiential trajectories. Furthermore, WebShop presents a unique challenge, requiring both website-based reasoning (price comparisons, query reformulation, etc.) and precise execution of actions (searching, clicking, option selection, etc.). Consequently, the performance across these tasks shows a near equilibrium, as reflected in both the success rate and the score (37%/38% and 0.675/0.67 for insights-only/retrieve-only, respectively). These observations highlight the synergistic interplay between abstraction and recollection in experiential learning, with ExpeL showing a quantitative advantage over the baseline and restricted-learning-mode agents.

Cross-task learning Another important finding is the comparison with the Reflexion agent (Shinn et al. 2023). ExpeL matches Reflexion's performance on HotpotQA (40% at R3 vs. 39%) and even outperforms it on ALFWorld (54% at R3 vs. 59%) without repeated attempts. While Reflexion improves results by iteratively refining insights through repeated task execution (R1, R2, R3, ...), our ExpeL agent leverages cross-task learning by accumulating task experience. However, it is noteworthy that there remains room for improvement on WebShop tasks, where ExpeL approaches the lower end of Reflexion's success rates.

5.3 Agent Behavioral Analysis

In this section, we highlight some observations made by manually inspecting the trajectories of ReAct agents and ExpeL agents, pinpointing possible causes of how some unexpected behaviors might have emerged. Please visit the paper's webpage, https://andrewzh112.github.io/#expel, for full trajectory demos illustrating the following findings.


Hypothesis Formulation & Constraints Adaptation After extracting insights from the experiences gathered in the training set, we noticed that the agent subsequently gained the ability to reassess its whole trajectory in the last steps and conclusively end the task rather than expressing its inability to provide a solution. This ability was particularly observed in HotpotQA, where a likely influential insight stated that the agent should "consider the answer might be in the observations already made". The agent would therefore finish by proposing the most probable answer given its past observations rather than concluding with "Unknown" or "Information not available".

World Model Belief Update We noticed that our ExpeL agent updated its beliefs through the insights and over its gained experience. This belief update enables the agent to avoid unnecessary actions and increases its efficiency in solving a given task. For example, in ALFWorld, the agent completely changed the priors it had in ReAct about the likely locations of a pan (from drawers/countertops/cabinets to stoveburners). This behavior emerged from the extracted insight claiming that "when searching for an item" the agent needs to "consider its nature and its typical usage", leading the agent to promptly and accurately find the correct item at the first step, while the ReAct agent could not find it in time.

Self-correction Although ReAct was sometimes unable to reassess its situation when attempting to solve a task, ExpeL demonstrated proficiency in identifying and rectifying missteps. Notably, after incorrectly taking an object in ALFWorld, the agent showed its ability to put it back and resume the task by searching for the proper object. This highlights ExpeL's capacity to recover from errors and stay on course without hallucinating when completing tasks. This behavior is possibly encouraged by the generated insight to "reassess the situation and consider alternative actions" if "an attempt does not progress the task".

5.4 Transfer Learning

In this experiment, we use the HotpotQA dataset (Yang et al. 2018) as the source tasks and the FEVER dataset (Thorne et al. 2018) as the target tasks. As in HotpotQA, we equip the agent with the ability to navigate Wikipedia using a Docstore API; we therefore hypothesize that some of the knowledge obtained from the HotpotQA tasks should also be beneficial when transferred to the FEVER tasks. We use gpt-4-0613 for adapting the HotpotQA insights into FEVER insights. We use the same fewshot examples to finetune the insights as the ones used during task execution. We compare our ExpeL Transfer agent's transfer learning ability with (1) ReAct; (2) Act; and (3) an agent that "finetunes" insights without task demonstrations. Notice that since the source and target tasks are inherently different, we do not have an experience pool to retrieve from; thus, the ExpeL Transfer agents use the existing fixed fewshot examples as in-context examples.

Tab. 1 showcases the transfer learning results. Both agents that transferred knowledge from the source domain saw performance gains. Notably, the agent with a few in-context examples had a more significant improvement than the one without, indicating the effectiveness of the proposed "finetuning" method in transfer learning scenarios.

                                  FEVER (SR %)
Act                               58 ± 0.0
ReAct                             63 ± 0.4
ExpeL Transfer w/o Task Demos     65 ± 1.7
ExpeL Transfer                    70 ± 0.7

Table 1: Transfer Results. We transfer insights extracted from HotpotQA to FEVER. Act and ReAct are baseline agents; ExpeL Transfer w/o Task Demos does not utilize fewshot examples when adapting the insights to the target task.

                        R0       R1       R2       R3
ReAct+Reflexion         40.3%    47.8%    52.2%    54.4%
ExpeL retrieve-only     54.5%    57.5%    59.7%    60.4%
ExpeL+Reflexion         59.0%    60.4%    63.4%    64.2%

Table 2: Success rate on ALFWorld with Reflexion rounds. ExpeL and Reflexion appear to be synergistic in the ALFWorld environment. R1-R3 were obtained from failed R0 checkpoints.

5.5 ExpeL with Task Reattempts

While not the central focus of our study, we present preliminary findings on the effectiveness of incorporating task reattempts into the evaluation phase with ExpeL by resuming the failed checkpoints from R0. The performance of ExpeL combined with Reflexion, alongside two baselines (ReAct with Reflexion and ExpeL without insights, i.e., ExpeL retrieve-only), is detailed in Table 2. The results demonstrate a notable improvement in success rate when ExpeL is paired with Reflexion, with the success rate increasing as the number of task reattempts grows.

5.6 Ablation Studies

One main component of ExpeL is the agent's ability to autonomously gather valuable experiences that benefit its own learning. Therefore, we wish to investigate whether the number of useful experiences impacts the downstream performance of ExpeL. We designed two different agents to compare against ours. The first only has access to the initial fewshot examples and extracts insights from them. The second gathers experience using ReAct, where the agent has no retries; thus, this agent not only obtains fewer successful trajectories but also lacks any success/failure comparison pairs during insight extraction. We conducted experiments in the HotpotQA environment and present the results in Fig. 6. As we can see, the agent that extracts insights from the existing fewshots has no advantage over the ReAct agent, illustrating that experience is essential for ExpeL to learn from. This is reflected in the significantly better performance of the two other agents, which have access to more experience. Furthermore, the ExpeL agent with access to a diverse set of experiences (failure and success pairs obtained using Reflexion) performs better than the agent using only ReAct during experience gathering.


Figure 6: Effects of Experience on Performance. We highlight the correlation between the number of diverse experience samples and the final performance. Concretely, we compare ExpeL with (1) ReAct, (2) ExpeL that only has access to fewshot examples, and (3) ExpeL that only uses ReAct during the experience-gathering step. It is evident that the extra autonomously collected experiences are essential to ExpeL's success and that the diversity of success/failure data gathered using Reflexion was superior to using ReAct only.

Next, we scrutinize the efficacy of the insight extraction step of ExpeL. Since insights had the most significant impact on the HotpotQA environment (Fig. 5), we performed the insight ablations in this environment. We use three dimensions to ablate the design choices for insight extraction, creating the following variants of ExpeL agents: (1) human-crafted insights, which were manually engineered by carefully studying the agent's mistakes during the experience-gathering step; (2) adding reflections ν to the insight construction step in addition to using fail/success pairs and lists of successes; and (3) using gpt-3.5-turbo-0613 as the LLMinsights. The results in Tab. 3 show several significant findings: (1) insights learned by the agent are more advantageous than hand-crafted ones; (2) using reflections in addition to success/failure pairs and lists of successes is disadvantageous, possibly because reflections sometimes contain hallucinations that mislead the insight extraction stage; and (3) a better LLM is more advantageous at improving ExpeL's performance, suggesting our agent will enjoy free performance boosts from the ever-improving base foundation models.

Lastly, we investigated the design choice of using task similarity as the ranking score for retrieving successful in-context examples in ALFWorld. In particular, we compare against (1) reason similarity, which retrieves the top-k trajectories whose reasoning steps are most similar to the latest reasoning step in the current trajectory, and (2) randomly sampling successful trajectories from the experience pool. We clearly observe in Tab. 3 that retrieving with task similarity (ExpeL) performs best. Reason similarity is still advantageous but drops slightly in performance, possibly because dynamically changing the fewshots during a single trajectory causes instabilities. Random sampling shows a significant drop in performance, suggesting that our design choice of selecting the most pertinent in-context examples is advantageous.

HotpotQA (SR %)
ReAct                         28.0 ± 1.4
Hand-crafted insights         32.0 ± 1.1
Insights with reflections     29.0 ± 0.4
gpt-3.5-turbo insights        32.0 ± 0.4
ExpeL (ours)                  39.0 ± 1.7

ALFWorld (SR %)
ReAct                         40.0 ± 0.3
Reasoning similarity          48.5 ± 2.1
Random sampled                42.5 ± 0.8
ExpeL (ours)                  59.0 ± 0.3

Table 3: Ablation Results. Upper: ablations on insight extraction. Hand-crafted insights enjoyed a performance boost over ReAct but were less effective than LLM-generated ones. Furthermore, adding reflections to the insight-generating process hurt performance. Lastly, better LLM base models give better insights. Lower: ablations on the in-context example selection strategy. The randomly sampled baseline has a significant drop in performance, while ranking by reasoning similarity also shows a noticeable dip.

6 Conclusion and Limitations

Limitations In this work, we investigated tasks with textual observations, which is limiting in real-world scenarios. Incorporating image observations would thus make our method more generally applicable; using vision-language models or captioning models to supplement the LLM and enable image observations could be an interesting new avenue of research. Additionally, we investigated the efficacy of our method using closed-source API LLMs, which can be off-limits in some applications. Exploring LLM agents built on open-source LLMs is another promising direction for future work (Zeng et al. 2023). Furthermore, since our extracted insights do not exceed the current LLM's token limit, we can fit them into the agent's context window. However, extra retrieval steps over the insights might be needed for truly lifelong learning agents to ensure a manageable context window size. Lastly, unlike reinforcement learning methods, prompting techniques lack theoretical underpinnings, which could potentially impact the efficiency of the resulting policies. Future research should explore the integration of these approaches to yield more effective and optimal solutions.

In summary, we introduced ExpeL, a novel learning LLM agent that autonomously gathers experience from a set of training tasks to improve its ability to solve evaluation tasks without access to model parameters. We demonstrated its learning abilities by showing its performance gains compared to vanilla ReAct and Act agents. Furthermore, we investigated a transfer learning scenario in which extracting insights from a set of source tasks can benefit the ExpeL agent in solving a target task. Lastly, we presented several unexpected emergent abilities our agent developed at the end of its training. We believe that autonomously learning from experience is essential for developing human-like intelligent agents, and our ExpeL agent is a step toward that goal.


Acknowledgements

This work is supported in part by the National Key R&D Program of China (2022ZD0114900), the National Natural Science Foundation of China under Grants 62022048, U2336214, and 62332019, and the Guoqiang Institute of Tsinghua University.

References

Boiko, D. A.; MacKnight, R.; and Gomes, G. 2023. Emergent Autonomous Scientific Research Capabilities of Large Language Models. arXiv preprint.
Bran, A. M.; Cox, S.; White, A. D.; and Schwaller, P. 2023. ChemCrow: Augmenting Large-Language Models with Chemistry Tools. arXiv preprint.
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language Models are Few-Shot Learners. NeurIPS.
Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H. W.; Sutton, C.; Gehrmann, S.; Schuh, P.; Shi, K.; Tsvyashchenko, S.; Maynez, J.; Rao, A.; Barnes, P.; Tay, Y.; Shazeer, N.; Prabhakaran, V.; Reif, E.; Du, N.; Hutchinson, B.; Pope, R.; Bradbury, J.; Austin, J.; Isard, M.; Gur-Ari, G.; Yin, P.; Duke, T.; Levskaya, A.; Ghemawat, S.; Dev, S.; Michalewski, H.; Garcia, X.; Misra, V.; Robinson, K.; Fedus, L.; Zhou, D.; Ippolito, D.; Luan, D.; Lim, H.; Zoph, B.; Spiridonov, A.; Sepassi, R.; Dohan, D.; Agrawal, S.; Omernick, M.; Dai, A. M.; Pillai, T. S.; Pellat, M.; Lewkowycz, A.; Moreira, E.; Child, R.; Polozov, O.; Lee, K.; Zhou, Z.; Wang, X.; Saeta, B.; Diaz, M.; Firat, O.; Catasta, M.; Wei, J.; Meier-Hellstern, K.; Eck, D.; Dean, J.; Petrov, S.; and Fiedel, N. 2023. PaLM: Scaling Language Modeling with Pathways. JMLR.
Chung, H. W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, E.; Wang, X.; Dehghani, M.; Brahma, S.; et al. 2022. Scaling Instruction-Finetuned Language Models. arXiv preprint.
Du, M.; He, F.; Zou, N.; Tao, D.; and Hu, X. 2022. Shortcut Learning of Large Language Models in Natural Language Understanding: A Survey. arXiv preprint.
Gong, R.; Huang, Q.; Ma, X.; Vo, H.; Durante, Z.; Noda, Y.; Zheng, Z.; Zhu, S.-C.; Terzopoulos, D.; Fei-Fei, L.; et al. 2023. MindAgent: Emergent Gaming Interaction. arXiv preprint.
Ha, H.; Florence, P.; and Song, S. 2023. Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition. In CoRL. PMLR.
Hao, S.; Gu, Y.; Ma, H.; Hong, J. J.; Wang, Z.; Wang, D. Z.; and Hu, Z. 2023. Reasoning with Language Model is Planning with World Model. arXiv preprint.
Huang, W.; Abbeel, P.; Pathak, D.; and Mordatch, I. 2022. Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. In ICML. PMLR.
Humphreys, P.; Guez, A.; Tieleman, O.; Sifre, L.; Weber, T.; and Lillicrap, T. 2022. Large-scale Retrieval for Reinforcement Learning. NeurIPS.
Johnson, J.; Douze, M.; and Jégou, H. 2019. Billion-scale Similarity Search with GPUs. IEEE Transactions on Big Data.
Kahneman, D. 2011. Thinking, Fast and Slow. Farrar, Straus and Giroux.
Kojima, T.; Gu, S. S.; Reid, M.; Matsuo, Y.; and Iwasawa, Y. 2022. Large Language Models are Zero-Shot Reasoners. NeurIPS.
Li, H.; Su, Y.; Cai, D.; Wang, Y.; and Liu, L. 2022. A Survey on Retrieval-Augmented Text Generation. arXiv preprint.
Lin, B. Y.; Fu, Y.; Yang, K.; Ammanabrolu, P.; Brahman, F.; Huang, S.; Bhagavatula, C.; Choi, Y.; and Ren, X. 2023a. SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks. NeurIPS.
Lin, K.; Agia, C.; Migimatsu, T.; Pavone, M.; and Bohg, J. 2023b. Text2Motion: From Natural Language Instructions to Feasible Plans. Autonomous Robots.
Lin, L.-J. 1992. Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching. Machine Learning.
Liu, J.; Shen, D.; Zhang, Y.; Dolan, B.; Carin, L.; and Chen, W. 2022. What Makes Good In-Context Examples for GPT-3? In DeeLIO. Association for Computational Linguistics.
Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; and Neubig, G. 2023a. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Computing Surveys.
Liu, X.; Yu, H.; Zhang, H.; Xu, Y.; Lei, X.; Lai, H.; Gu, Y.; Ding, H.; Men, K.; Yang, K.; et al. 2023b. AgentBench: Evaluating LLMs as Agents. arXiv preprint.
Maas; Carey; Wheeler; Saatchi; Billington; and Shamash. 2023. To Infinity and Beyond: SHOW-1 and Showrunner Agents in Multi-Agent Simulations. arXiv preprint.
Mu, Y.; Zhang, Q.; Hu, M.; Wang, W.; Ding, M.; Jin, J.; Wang, B.; Dai, J.; Qiao, Y.; and Luo, P. 2023. EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought. NeurIPS.
Nakano, R.; Hilton, J.; Balaji, S. A.; Wu, J.; Ouyang, L.; Kim, C.; Hesse, C.; Jain, S.; Kosaraju, V.; Saunders, W.; Jiang, X.; Cobbe, K.; Eloundou, T.; Krueger, G.; Button, K.; Knight, M.; Chess, B.; and Schulman, J. 2021. WebGPT: Browser-Assisted Question-Answering with Human Feedback. arXiv preprint.
Park, J. S.; O'Brien, J.; Cai, C. J.; Morris, M. R.; Liang, P.; and Bernstein, M. S. 2023. Generative Agents: Interactive Simulacra of Human Behavior. In ACM Symposium on User Interface Software and Technology.
Petroni, F.; Rocktäschel, T.; Riedel, S.; Lewis, P.; Bakhtin, A.; Wu, Y.; and Miller, A. 2019. Language Models as Knowledge Bases? In EMNLP-IJCNLP. Association for Computational Linguistics.
Rubin, O.; Herzig, J.; and Berant, J. 2022. Learning To Retrieve Prompts for In-Context Learning. In NAACL. Association for Computational Linguistics.


Shaw, P.; Joshi, M.; Cohan, J.; Berant, J.; Pasupat, P.; Hu, H.; Khandelwal, U.; Lee, K.; and Toutanova, K. 2023. From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces. NeurIPS.
Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K. R.; and Yao, S. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. In NeurIPS.
Shridhar, M.; Yuan, X.; Côté, M.-A.; Bisk, Y.; Trischler, A.; and Hausknecht, M. 2021. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. In ICLR.
Significant-Gravitas. 2023. AutoGPT. https://github.com/Significant-Gravitas/Auto-GPT.
Song, K.; Tan, X.; Qin, T.; Lu, J.; and Liu, T.-Y. 2020. MPNet: Masked and Permuted Pre-training for Language Understanding. NeurIPS.
Sumers, T. R.; Yao, S.; Narasimhan, K.; and Griffiths, T. L. 2023. Cognitive Architectures for Language Agents. arXiv preprint.
Sun, H.; Zhuang, Y.; Kong, L.; Dai, B.; and Zhang, C. 2023. AdaPlanner: Adaptive Planning from Feedback with Language Models. NeurIPS.
Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; and Hashimoto, T. B. 2023. Stanford Alpaca: An Instruction-Following LLaMA Model. https://github.com/tatsu-lab/stanford_alpaca.
Thoppilan, R.; De Freitas, D.; Hall, J.; Shazeer, N.; Kulshreshtha, A.; Cheng, H.-T.; Jin, A.; Bos, T.; Baker, L.; Du, Y.; et al. 2022. LaMDA: Language Models for Dialog Applications. arXiv preprint.
Thorne, J.; Vlachos, A.; Christodoulopoulos, C.; and Mittal, A. 2018. FEVER: a Large-scale Dataset for Fact Extraction and VERification. In NAACL.
Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. 2023a. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint.
Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. 2023b. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint.
Tworkowski, S.; Staniszewski, K.; Pacek, M.; Wu, Y.; Michalewski, H.; and Miłoś, P. 2023. Focused Transformer: Contrastive Training for Context Scaling. In NeurIPS.
Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. 2023a. A Survey on Large Language Model Based Autonomous Agents. arXiv preprint.
Wang, L.; Yang, N.; and Wei, F. 2023. Learning to Retrieve In-Context Examples for Large Language Models. arXiv preprint.
Wang, S.; Liu, C.; Zheng, Z.; Qi, S.; Chen, S.; Yang, Q.; Zhao, A.; Wang, C.; Song, S.; and Huang, G. 2023b. Avalon's Game of Thoughts: Battle Against Deception through Recursive Contemplation. arXiv preprint.
Watkins, C. J.; and Dayan, P. 1992. Q-learning. Machine Learning.
Wei, J.; Bosma, M.; Zhao, V.; Guu, K.; Yu, A. W.; Lester, B.; Du, N.; Dai, A. M.; and Le, Q. V. 2022a. Finetuned Language Models are Zero-Shot Learners. In ICLR.
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q. V.; Zhou, D.; et al. 2022b. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS.
Xi, Z.; Chen, W.; Guo, X.; He, W.; Ding, Y.; Hong, B.; Zhang, M.; Wang, J.; Jin, S.; Zhou, E.; et al. 2023. The Rise and Potential of Large Language Model Based Agents: A Survey. arXiv preprint.
Yang, S.; Nachum, O.; Du, Y.; Wei, J.; Abbeel, P.; and Schuurmans, D. 2023a. Foundation Models for Decision Making: Problems, Methods, and Opportunities. arXiv preprint.
Yang, Z.; Li, L.; Wang, J.; Lin, K.; Azarnasab, E.; Ahmed, F.; Liu, Z.; Liu, C.; Zeng, M.; and Wang, L. 2023b. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action. arXiv preprint.
Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.; Salakhutdinov, R.; and Manning, C. D. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In EMNLP. Association for Computational Linguistics.
Yao, S.; Chen, H.; Yang, J.; and Narasimhan, K. 2022. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. In NeurIPS.
Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T. L.; Cao, Y.; and Narasimhan, K. 2023a. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS.
Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; and Cao, Y. 2023b. ReAct: Synergizing Reasoning and Acting in Language Models. In ICLR.
Yao, W.; Heinecke, S.; Niebles, J. C.; Liu, Z.; Feng, Y.; Xue, L.; Murthy, R.; Chen, Z.; Zhang, J.; Arpit, D.; Xu, R.; Mui, P.; Wang, H.; Xiong, C.; and Savarese, S. 2023c. Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization.
Zeng, A.; Liu, M.; Lu, R.; Wang, B.; Liu, X.; Dong, Y.; and Tang, J. 2023. AgentTuning: Enabling Generalized Agent Abilities for LLMs. arXiv preprint.
Zhang, Z.; Zhang, A.; Li, M.; and Smola, A. 2023. Automatic Chain of Thought Prompting in Large Language Models. In ICLR.
Zhao, A.; Zhu, E.; Lu, R.; Lin, M.; Liu, Y.-J.; and Huang, G. 2023a. Augmenting Unsupervised Reinforcement Learning with Self-Reference. arXiv preprint.
Zhao, W. X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. 2023b. A Survey of Large Language Models. arXiv preprint.
Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; and He, Q. 2020. A Comprehensive Survey on Transfer Learning. Proceedings of the IEEE.
