Figure 1: ExpeL Agent Overview. Left: ExpeL operates in three stages: (1) Collection of success and failure experiences into
a pool. (2) Extraction/abstraction of cross-task knowledge from these experiences. (3) Application of the gained insights and
recall of past successes in evaluation tasks. Right: (A) Illustrates the experience gathering process via Reflexion (Shinn et al.
2023), enabling task reattempt after self-reflection on failures. (B) Illustrates the insight extraction step. When presented with
success/failure pairs or a list of L successes, the agent dynamically modifies an existing list of insights ι̂ using operations ADD,
UPVOTE, DOWNVOTE, and EDIT. This process has an emphasis on extracting prevalent failure patterns or best practices.
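Because the caption above is where the four insight operations are named in this excerpt, a small sketch may help make them concrete. The following is our own illustration, not the authors' implementation: the initial vote weight of a new insight and the rule that prunes fully downvoted insights are assumptions.

```python
# Minimal sketch (not the authors' code) of the insight list and the four
# operations named in Figure 1. The starting weight of a new insight and the
# pruning rule for downvoted insights are assumptions made for illustration.
from dataclasses import dataclass, field


@dataclass
class Insight:
    text: str
    votes: int = 2  # assumed initial weight for a freshly ADDed insight


@dataclass
class InsightList:
    insights: list = field(default_factory=list)

    def add(self, text: str) -> None:
        """ADD: append a newly extracted insight."""
        self.insights.append(Insight(text))

    def upvote(self, idx: int) -> None:
        """UPVOTE: strengthen an insight that keeps proving useful."""
        self.insights[idx].votes += 1

    def downvote(self, idx: int) -> None:
        """DOWNVOTE: weaken an insight; drop it once its weight reaches zero."""
        self.insights[idx].votes -= 1
        if self.insights[idx].votes <= 0:
            del self.insights[idx]

    def edit(self, idx: int, new_text: str) -> None:
        """EDIT: rewrite the content of an existing insight."""
        self.insights[idx].text = new_text
```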
ferability from source tasks to target tasks. Lastly, we believe that as planning algorithms and foundational models continue to improve, ExpeL's paradigm stands to gain significant benefits from their enhanced performances.1

1 Visit https://andrewzh112.github.io/#expel for prompts and demos, and https://github.com/LeapLabTHU/ExpeL for code.

2 Related Work
Prompt-based Learning: Prompt-based learning refines label prediction tasks by modifying the input context, facilitating swift adaptation to new tasks with minimal data (Liu et al. 2023a). This approach capitalizes on LLMs for answers without parameter tuning as they can be augmented using in-context learning (Brown et al. 2020). LAMA (Petroni et al. 2019) and GPT-3 (Brown et al. 2020) are early works that promoted this formulation. Efforts to reduce the intricacies of prompt design include automatic reasoning chains for NLP (Kojima et al. 2022; Zhang et al. 2023). Similarly, the ExpeL agent also autonomously learns from experiences using extracted insights and self-generated in-context trajectories by altering the execution prompt.

Retrieval Augmented Generation (RAG): Retrieval allows LLMs to access databases, mitigating hallucinations (Li et al. 2022; Wang, Yang, and Wei 2023; Rubin, Herzig, and Berant 2022; Liu et al. 2022). Retrieval has also been used to enhance the capabilities of decision-making agents (Humphreys et al. 2022; Zhao et al. 2023a). In contrast to these works, we focus on retrieving the ExpeL agent's self-generated experiences, thus reducing the dependency on gold examples and domain-specific corpora.

Planning for LLM Agents: Application of LLM agents in fields like robotics, natural sciences, game-playing, and workflows has surged, with emphasis on their world knowledge in fewshot settings (Ha, Florence, and Song 2023; Mu et al. 2023; Bran et al. 2023; Boiko, MacKnight, and Gomes 2023; Yang et al. 2023b; Lin et al. 2023a; Nakano et al. 2021; Wang et al. 2023b; Liu et al. 2023b). Moreover, LLMs have demonstrated promising zero/few-shot planning and reasoning capabilities in various configurations (Sumers et al. 2023), including embodied environments and reasoning tasks (Huang et al. 2022; Yao et al. 2023a; Wei et al. 2022b; Yao et al. 2023b; Gong et al. 2023).

Self-improvement and Memory for LLM Agents: Agents like Reflexion showcase feedback-based improvement, yet often lack cross-task memory (Shinn et al. 2023). Other agents exhibit potential in persistent memory within multi-agent contexts (Park et al. 2023; Maas et al. 2023). Our ExpeL agent combines these approaches, focusing on task-solving while benefiting from self-generated in-context examples and abstracted insights from memory.

3 Preliminaries
Complex Interactive Tasks  We work with complex interactive tasks where at each time step i ∈ {0, . . . , H}, the agent receives an observation o ∈ O and, from its observation history Ht, decides to perform action a ∈ A. The objective of the agent is to achieve some goal g ∈ G. We only deal with deterministic environments in this work.

Large Language Models  A large language model is a statistical model of natural language, typically a neural network. In our setting, we use an autoregressive language model (Brown et al. 2020; Touvron et al. 2023b,a; Chowdhery et al. 2023), which, given an ordered list of existing tokens x = {x1, x2, ..., xl−1}, outputs the probability of the next token p(xl | x<l). An instruction-following LLM (Thoppilan et al. 2022; Chung et al. 2022; Wei et al. 2022a) is typically finetuned on various NLP tasks that are formatted into (instruction, input, response) tuples (Taori et al. 2023). Instruction-tuned models are better at following natural language instructions, which alleviates the need for heavy prompt engineering (Wei et al. 2022a).

ReAct and Reflexion  ReAct (Yao et al. 2023b) and Reflexion (Shinn et al. 2023) are promising frameworks enabling the aforementioned proficiency of LLMs in reasoning and self-improvement. ReAct explicitly intertwines observations, actions, and thoughts, providing a foundation for robust planning and reasoning capabilities. Building upon it, Reflexion introduces an additional reflective step before reattempting the subsequent trial of the same task, enhancing the model's adaptive learning process.
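To make the interaction setting above concrete, the sketch below shows one ReAct-style episode under the formalism of Sec. 3: the agent alternates observations, thoughts, and actions for at most H steps, conditioning the LLM on the running trajectory. The `env` and `llm` interfaces and the "Action:" output convention are assumptions made for illustration; this is not the ReAct reference implementation.

```python
# Rough sketch of one ReAct-style episode under the formalism above: at each
# step i <= H the agent observes, generates a thought/action with the LLM, and
# executes the action. The env/llm interfaces are assumed for illustration.
from typing import Callable, Tuple


def react_episode(env, llm: Callable[[str], str], task: str,
                  fewshot: str, horizon: int) -> Tuple[str, bool]:
    """Returns the trajectory text and whether the goal g was reached."""
    trajectory = f"{fewshot}\nTask: {task}\n"
    observation = env.reset(task)            # assumed: returns the first observation o
    success = False
    for _ in range(horizon):                 # at most H steps
        trajectory += f"Observation: {observation}\n"
        step = llm(trajectory + "Thought:")  # LLM continues with a thought and an action
        trajectory += f"Thought:{step}\n"
        # assumed output convention: the LLM ends its turn with "Action: <action>"
        action = step.split("Action:")[-1].strip()
        observation, success, done = env.step(action)  # assumed environment interface
        if done:
            break
    return trajectory, success
```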
4 ExpeL: An Experiential Learning Agent
Recent advancements in generative LLMs suggest an intriguing approach. Rather than altering the LLM parameters, adjusting the prompts may be more beneficial: this strategy ensures that the LLM's inherent common sense knowledge remains intact, allowing for superior generalization (Liu et al. 2023a). Furthermore, some of the most potent language models are proprietary. Thus, focusing on prompt-based methods seems promising as a way to harness the strengths of these advanced LLMs. Additionally, previous works on learning in LLM agents have primarily been trained on extensive human-labeled datasets (Lin et al. 2023a; Shaw et al. 2023) or improved via iterative retries (Shinn et al. 2023) on a single task. A relatively less explored area is facilitating agents to learn autonomously from their own experiences, similar to a student gaining insights from practicing for an exam. The student tackles practice problems multiple times to derive insights. At the exam, the student relies solely on these insights and draws on memories of similar problems to answer the questions in one attempt. With this in mind, we wish to design an LLM agent that autonomously gathers experiences and extracts insights, then uses these cross-task insights and memories of similar tasks to aid its decision-making.

We aim to enhance a planning LLM agent, such as ReAct, with learning abilities that allow it to improve through inter-task experiences without any parameter updates. Inspired by the cognitive abilities inherent in human learning, as well as the benefits observed in self-learning autonomous agents and the progress made in prompt-based methods, we developed the Experiential Learning (ExpeL) agent. During the training stage, the agent interacts with the environment, gathering experiences via trial and error. These experiences are stored in an experience pool (Lin 1992). From this pool, the agent later extracts insights, similar to off-policy learning (Watkins and Dayan 1992), in which the agent can learn from the experiences of a behavior policy. During the evaluation stage, the agent attempts unseen tasks with a single try, augmented with extracted insights and successful trajectories in its experience pool gathered from the training stage. Refer to Fig. 1 for detailed information on our agent framework.

4.1 Gathering Experiences
To gather diverse experiences that are useful to extract information from, we leverage Reflexion (Shinn et al. 2023) to continuously retry the training task at most Z times. In particular, the agent will be given a training task tn at the z-th trial, fewshot examples Fmanual, and past reflections νn,z (initially, νn,0 is the empty string). At first, the agent will attempt the task with the fewshot examples concatenated with its current trajectory τn,0 as the context, and use ReAct (Yao et al. 2023b) as the base planning algorithm, LLMReAct(· | τn,0, Fmanual, νn,0). On the z-th trial, when the agent finishes the task or the maximum number of steps H is reached, the ExpeL agent's experience pool B ingests the trajectory τn,z. Then, if the agent succeeds, it moves on to the next task. However, if the agent fails, it will look at its failed trajectory and self-reflect, producing νn,z+1 = concat(νn,z, LLMreflect(τn,z)), i.e., the new reflection concatenated with the previous ones, to see where it can do better on the next retry. In the next retry, the agent will augment its context with the reflections νn,z+1 as input to the LLM policy, LLMReAct(· | τn,z+1, Fmanual, νn,z+1).

To highlight, this trial-and-error way of gathering experiences not only improves the chances of getting more positive examples for experience recall during evaluation, but also allows for collecting valuable success/failure pairs used for comparisons during insight extraction (Sec. 4.2). The pseudo-code can be found in Alg. 1.
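Since Alg. 1 itself is not reproduced in this excerpt, the following is a condensed, hedged sketch of the gathering loop just described: each task is retried at most Z times, every trajectory is added to the experience pool B, and failures trigger a self-reflection that is concatenated with earlier ones. The `run_react` and `reflect_llm` callables are assumed stand-ins (for example, the episode sketch above and a reflection prompt).

```python
# Condensed sketch of the experience-gathering loop of Sec. 4.1 (a paraphrase,
# not the paper's Alg. 1). run_react and reflect_llm are assumed callables.
from typing import Callable, Dict, List


def gather_experiences(tasks: List[str], run_react: Callable,
                       reflect_llm: Callable[[str], str],
                       fewshot_manual: str, max_trials: int) -> List[Dict]:
    pool: List[Dict] = []                  # the experience pool B
    for task in tasks:
        reflections = ""                   # nu_{n,0} starts as the empty string
        for trial in range(max_trials):    # at most Z attempts per task
            trajectory, success = run_react(task, fewshot_manual, reflections)
            pool.append({"task": task, "trial": trial,
                         "trajectory": trajectory, "success": success})
            if success:
                break                      # move on to the next task
            # on failure: self-reflect and concatenate with past reflections
            reflections = reflections + "\n" + reflect_llm(trajectory)
    return pool
```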
4.2 Learning from Experiences
Human learning occurs mainly either by storing successful trajectories in memory, which can later be recalled as specific examples, or by extracting high-level insights from experiences, enabling generalization to novel situations. ExpeL considers both of these learning modes to boost task performance. Concretely, an instruction I given to an LLM agent can be broken down into task specifications and fewshot examples. We can augment the task specifications with the agent's extracted insights from past experiences, where an instruction-following LLM can be leveraged to follow them closely. For the fewshot examples, we can allow the agent to retrieve the top-k relevant examples from its experience pool to aid its decisions. Next, we detail our experience recall and insight extraction mechanisms.

Similar Experiences as Demonstrations  Works have shown that using in-context examples that are semantically similar to the task at hand results in better performance (Liu
Figure 5: Main Results. Average task success rates (std. error in gray arrows) across three different domains: HotpotQA,
ALFWorld, and WebShop. ReAct and Act are used as baselines. ExpeL consistently outperforms the baselines on all domains,
highlighting the importance of learning from experience. Additionally, we compare ExpeL with ExpeL (retrieve-only) and
ExpeL (insights-only) to highlight that both insight extraction and task similarity retrieval are essential and synergistic.
edge transfer (Sec. 5.4). All experiments use four-fold validation, and we report the mean and standard error over the folds. Following ReAct, for all environments we use success rate as the evaluation metric: exact matching for HotpotQA and FEVER, completing the task in time for ALFWorld, and purchasing an item that matches all attributes for WebShop. Some additional metrics are introduced when the environment offers them: a mean reward score r ∈ [0, 1] for WebShop and a score breakdown per task type for ALFWorld. We use ReAct and Act as the main baseline planning LLM agents (Yao et al. 2023b), where Act does not have the reasoning steps of ReAct. All agents, including ExpeL, used gpt-3.5-turbo-0613 when performing actions during evaluation. All text generations were done with temperature 0 and greedy decoding. Imitation learning (IL) results were taken from the ReAct paper (Yao et al. 2023b).
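As a small worked example of the reporting protocol above (mean and standard error over the four folds), with made-up per-fold success rates:

```python
# Worked example of the aggregation described above: per-fold success rates
# are averaged and reported as mean +/- standard error. Values are made up.
import statistics

fold_success_rates = [0.38, 0.41, 0.37, 0.40]   # hypothetical four-fold results
mean = statistics.mean(fold_success_rates)
stderr = statistics.stdev(fold_success_rates) / len(fold_success_rates) ** 0.5
print(f"{mean:.3f} +/- {stderr:.3f}")            # -> 0.390 +/- 0.009
```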
5.2 Main Results
The primary findings of this study are presented in Fig. 5. IL-based methods struggle to perform efficiently in WebShop and ALFWorld, possibly due to these environments' demand for more substantial priors and reasoning abilities, which conventional training from scratch fails to provide. This limitation shows the promise of leveraging knowledge-based language models to address these challenges. The following claims were made based on (1) a deep understanding of each environment; (2) the extracted insights and retrievable in-context examples; and (3) statistics (e.g., the number of invalid actions per trial) of the runs.

Experiential learning  Augmenting agents with abstracted insights and the ability to recall successful trajectories improves performance across all environments compared to the baseline agents. When restricting the ExpeL agent to only one mode of learning (insights-only or retrieval-only), the HotpotQA and ALFWorld environments demonstrate contrasting quantitative distinctions (36%/31% and 50%/55% for HotpotQA and ALFWorld, respectively). The prominent influence of insights on HotpotQA can be attributed to its reliance on the ability to analyze Wikipedia search results, which highlights the need for general guidelines across various question types. Conversely, ALFWorld's task completion, dependent on specific action sets, is better derived from past experiential trajectories. Furthermore, WebShop presents a unique challenge, requiring both website-based reasoning (price comparisons, query reformulation, etc.) and precise execution of actions (searching, clicking, option selection, etc.). Consequently, the performance of the two learning modes is near equilibrium, as reflected in both the success rate and the score (37%/38% and 0.675/0.67 for insights-only/retrieve-only, respectively). These observations highlight the synergistic interplay between abstraction and recollection in experiential learning, with ExpeL showing a quantitative advantage over baseline/restricted learning mode agents.

Cross-task learning  Another important finding is the comparison with the Reflexion agent (Shinn et al. 2023). ExpeL matches Reflexion's performance on HotpotQA (Reflexion: 40% at R3 vs. ExpeL: 39%) and even outperforms it on ALFWorld (Reflexion: 54% at R3 vs. ExpeL: 59%) without repeated attempts. While Reflexion improves results by iteratively refining insights through repeated task execution (R1, R2, R3, ...), our ExpeL agent leverages cross-task learning by accumulating task experience. However, it is noteworthy that there remains room for improvement in the context of WebShop tasks, where ExpeL approaches the lower side of Reflexion's success rates.

5.3 Agent Behavioral Analysis
In this section, we highlight some observations made by manually inspecting the trajectories of ReAct and ExpeL agents, and pinpoint possible causes of how some unexpected behaviors might have emerged. Please visit the paper's webpage, https://andrewzh112.github.io/#expel, for full trajectory demos illustrating the following findings.
Hypothesis Formulation & Constraints Adaptation  After extracting the insights from experiences gathered in the training set, we noticed that the agent subsequently gained the ability to reassess its whole trajectory in the last steps and conclusively end the task rather than expressing its inability to provide a solution. This ability was particularly observed in HotpotQA, where a likely influential insight stated that the agent should "consider the answer might be in the observations already made". The agent would therefore finish by proposing the most probable answer given its past observations rather than concluding with "Unknown" or "Information not available".

World Model Belief Update  We noticed our ExpeL agent updated its beliefs through the insights and over its gained experience. This belief update thereby enables the agent to avoid unnecessary actions and increase efficiency in solving a given task. For example, in ALFWorld, the agent completely changed the priors it had in ReAct on the likely locations of a pan (from drawers/countertops/cabinets to stoveburners). This behavior emerged from the extracted insight claiming that "when searching for an item" it needs to "consider its nature and its typical usage", leading the agent to promptly and accurately find the correct item at the first step while the ReAct agent could not find it in time.

Self-correction  Although ReAct was sometimes not able to reassess its situation when attempting to solve a task, ExpeL demonstrated proficiency in identifying and rectifying missteps. Notably, when incorrectly taking an object in ALFWorld, the agent has shown its ability to put it back and resume the task by searching for the proper object. This highlights ExpeL's capacity to recover from errors and stay on course without hallucinating when completing tasks. This behavior is possibly encouraged by the generated insight to "reassess the situation and consider alternative actions" if "an attempt does not progress the task".

5.4 Transfer Learning
In this experiment, we use the HotpotQA dataset (Yang et al. 2018) as the source tasks and the FEVER dataset (Thorne et al. 2018) as the target tasks. As with the HotpotQA tasks, we equip the agent with the ability to navigate Wikipedia using a Docstore API; therefore, we hypothesize that some of the knowledge obtained from the HotpotQA tasks should also be beneficial when transferred to the FEVER tasks. We use gpt-4-0613 for adapting the HotpotQA insights into FEVER insights. We use the same fewshot examples to finetune the insights as the ones that will be used during task execution. We compare our ExpeL Transfer agent's transfer learning ability with (1) ReAct; (2) Act; and (3) an agent that "finetunes" insights without task demonstrations. Notice that since the source and target tasks are inherently different, we do not have an experience pool to retrieve from; thus, the ExpeL Transfer agents use the existing fixed fewshot examples as in-context examples.

Tab. 1 showcases the transfer learning results. Both agents that transferred knowledge from the source domain saw performance gains. Notably, the agent with a few in-context examples had a more significant improvement than the one without, indicating the effectiveness of the proposed "finetuning" method in transfer learning scenarios.
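To make the insight "finetuning" step concrete, here is a hedged sketch of how source-task insights might be adapted to the target task, with and without the task demonstrations ablated in Tab. 1. The prompt wording and the `llm` callable are hypothetical; only the inputs and the with/without-demonstrations switch follow the description above.

```python
# Hedged sketch of the insight-transfer ("finetuning") step of Sec. 5.4: an LLM
# rewrites source-task insights for the target task, optionally conditioning on
# the target-task fewshot examples. The prompt wording here is hypothetical.
from typing import Callable, List


def adapt_insights(source_insights: List[str], target_fewshot: str,
                   llm: Callable[[str], str], use_task_demos: bool = True) -> str:
    prompt = "Rewrite the following insights so that they apply to the new task.\n"
    prompt += "Insights:\n" + "\n".join(f"- {i}" for i in source_insights) + "\n"
    if use_task_demos:   # dropped for the "w/o Task Demos" row in Tab. 1
        prompt += "Examples of the new task:\n" + target_fewshot + "\n"
    return llm(prompt)
```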
                                 FEVER (SR %)
Act                              58 ± 0.0
ReAct                            63 ± 0.4
ExpeL Transfer w/o Task Demos    65 ± 1.7
ExpeL Transfer                   70 ± 0.7

Table 1: Transfer Results. We transfer insights extracted from HotpotQA to FEVER. Act and ReAct are baseline agents; ExpeL Transfer w/o Task Demos does not utilize fewshot examples when altering the insights for the target task.

                       R0       R1       R2       R3
ReAct+Reflexion        40.3%    47.8%    52.2%    54.4%
ExpeL retrieve only    54.5%    57.5%    59.7%    60.4%
ExpeL+Reflexion        59.0%    60.4%    63.4%    64.2%

Table 2: Success Rate on ALFWorld with Reflexion Rounds. ExpeL and Reflexion appear to be synergistic in the ALFWorld environment. R1-R3 were obtained from failed R0 checkpoints.

5.5 ExpeL with Task Reattempts
While not the central focus of our study, we present preliminary findings on the effectiveness of incorporating task reattempts into the evaluation phase using ExpeL by resuming the failed checkpoints from R0. Table 2 details the performance of ExpeL combined with Reflexion, alongside two baselines: ReAct+Reflexion and ExpeL without insights (ExpeL retrieve only). The results demonstrate a notable improvement in the success rate when ExpeL is paired with Reflexion, with the success rate increasing as the number of task reattempts grows.

5.6 Ablation Studies
One main component of ExpeL is the agent's ability to autonomously gather valuable experiences that benefit its own learning. Therefore, we wish to investigate whether the number of useful experiences impacts the downstream performance of ExpeL. We designed two different agents to compare our agent with. The first one only has access to the initial fewshot examples and extracts insights from them. The second gathers experience using ReAct, where the agent has no retries. Thus, this agent will not only collect fewer successful trajectories but will also lack any success/failure comparison pairs during insight extraction. We conducted experiments in the HotpotQA environment and present the results in Fig. 6. As we can see, the agent that extracts insights from the existing fewshots has no advantage compared to the ReAct agent, illustrating that experience is essential for ExpeL to learn from. This was reflected in a significantly better performance for the two other agents having access to more experience.
Furthermore, the ExpeL agent with access to a diverse set of experiences (failure and success pairs obtained using Reflexion) performs better than the agent using only ReAct during experience gathering.

Figure 6: Effects of Experience on Performance. We highlight the correlation between the number of diverse experience samples and the final performance. Concretely, we compare ExpeL with (1) ReAct, (2) ExpeL that only has access to fewshot examples, and (3) ExpeL that only uses ReAct during the experience gathering step. It is evident that extra autonomously collected experiences are essential to ExpeL's success, and that the diversity of success/failure data gathered using Reflexion was superior to using ReAct only.

                            HotpotQA (SR %)
ReAct                       28.0 ± 1.4
Hand-crafted insights       32.0 ± 1.1
Insights with reflections   29.0 ± 0.4
gpt-3.5-turbo insights      32.0 ± 0.4
ExpeL (ours)                39.0 ± 1.7

                            ALFWorld (SR %)
ReAct                       40.0 ± 0.3
Reasoning similarity        48.5 ± 2.1
Random sampled              42.5 ± 0.8
ExpeL (ours)                59.0 ± 0.3

Table 3: Ablation Results. Upper: ablations on insight extraction. Hand-crafted insights enjoyed a performance boost over ReAct but were less effective than LLM-generated ones. Furthermore, adding reflections to the insight-generating process hurt performance. Lastly, better LLM base models give better insights. Lower: ablations on the in-context example selection strategy. The randomly sampled baseline has a significant drop in performance, while ranking by reasoning similarity also shows a noticeable dip.
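The in-context example selection strategy ablated in the lower half of Table 3 is the task-similarity retrieval described in Sec. 4.2: successful trajectories in the experience pool are ranked by embedding similarity to the new task, and the top-k are used as demonstrations. Below is a rough sketch under assumed interfaces; the `embed` callable stands in for whatever sentence encoder is used in practice, and the pool format mirrors the earlier gathering sketch.

```python
# Rough sketch of top-k retrieval by task similarity (Sec. 4.2; ablated in
# Table 3, lower). The embed() callable is an assumed sentence encoder.
from typing import Callable, Dict, List, Sequence
import numpy as np


def retrieve_demonstrations(task: str, pool: Sequence[Dict],
                            embed: Callable[[str], np.ndarray],
                            k: int = 3) -> List[str]:
    successes = [exp for exp in pool if exp["success"]]
    query = embed(task)

    def cosine(v: np.ndarray, w: np.ndarray) -> float:
        return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w) + 1e-8))

    # rank stored successes by similarity of their task to the new task
    ranked = sorted(successes,
                    key=lambda exp: cosine(query, embed(exp["task"])),
                    reverse=True)
    return [exp["trajectory"] for exp in ranked[:k]]
```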
Shaw, P.; Joshi, M.; Cohan, J.; Berant, J.; Pasupat, P.; Hu, H.; Khandelwal, U.; Lee, K.; and Toutanova, K. 2023. From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces. NeurIPS.
Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K. R.; and Yao, S. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. In NeurIPS.
Shridhar, M.; Yuan, X.; Côté, M.-A.; Bisk, Y.; Trischler, A.; and Hausknecht, M. 2021. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. In ICLR.
Significant-Gravitas. 2023. AutoGPT. https://github.com/Significant-Gravitas/Auto-GPT.
Song, K.; Tan, X.; Qin, T.; Lu, J.; and Liu, T.-Y. 2020. MPNet: Masked and Permuted Pre-training for Language Understanding. NeurIPS.
Sumers, T. R.; Yao, S.; Narasimhan, K.; and Griffiths, T. L. 2023. Cognitive Architectures for Language Agents. arXiv preprint.
Sun, H.; Zhuang, Y.; Kong, L.; Dai, B.; and Zhang, C. 2023. AdaPlanner: Adaptive Planning from Feedback with Language Models. NeurIPS.
Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; and Hashimoto, T. B. 2023. Stanford Alpaca: An Instruction-Following LLaMA Model. https://github.com/tatsu-lab/stanford_alpaca.
Thoppilan, R.; De Freitas, D.; Hall, J.; Shazeer, N.; Kulshreshtha, A.; Cheng, H.-T.; Jin, A.; Bos, T.; Baker, L.; Du, Y.; et al. 2022. LaMDA: Language Models for Dialog Applications. arXiv preprint.
Thorne, J.; Vlachos, A.; Christodoulopoulos, C.; and Mittal, A. 2018. FEVER: a Large-scale Dataset for Fact Extraction and VERification. In NAACL.
Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. 2023a. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint.
Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. 2023b. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint.
Tworkowski, S.; Staniszewski, K.; Pacek, M.; Wu, Y.; Michalewski, H.; and Miłoś, P. 2023. Focused Transformer: Contrastive Training for Context Scaling. In NeurIPS.
Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. 2023a. A Survey on Large Language Model Based Autonomous Agents. arXiv preprint.
Wang, L.; Yang, N.; and Wei, F. 2023. Learning to Retrieve In-Context Examples for Large Language Models. arXiv preprint.
Wang, S.; Liu, C.; Zheng, Z.; Qi, S.; Chen, S.; Yang, Q.; Zhao, A.; Wang, C.; Song, S.; and Huang, G. 2023b. Avalon's Game of Thoughts: Battle Against Deception through Recursive Contemplation. arXiv preprint.
Watkins, C. J.; and Dayan, P. 1992. Q-learning. Machine Learning.
Wei, J.; Bosma, M.; Zhao, V.; Guu, K.; Yu, A. W.; Lester, B.; Du, N.; Dai, A. M.; and Le, Q. V. 2022a. Finetuned Language Models are Zero-Shot Learners. In ICLR.
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q. V.; Zhou, D.; et al. 2022b. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS.
Xi, Z.; Chen, W.; Guo, X.; He, W.; Ding, Y.; Hong, B.; Zhang, M.; Wang, J.; Jin, S.; Zhou, E.; et al. 2023. The Rise and Potential of Large Language Model Based Agents: A Survey. arXiv preprint.
Yang, S.; Nachum, O.; Du, Y.; Wei, J.; Abbeel, P.; and Schuurmans, D. 2023a. Foundation Models for Decision Making: Problems, Methods, and Opportunities. arXiv preprint.
Yang, Z.; Li, L.; Wang, J.; Lin, K.; Azarnasab, E.; Ahmed, F.; Liu, Z.; Liu, C.; Zeng, M.; and Wang, L. 2023b. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action. arXiv preprint.
Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.; Salakhutdinov, R.; and Manning, C. D. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In EMNLP. Association for Computational Linguistics.
Yao, S.; Chen, H.; Yang, J.; and Narasimhan, K. 2022. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. In NeurIPS.
Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T. L.; Cao, Y.; and Narasimhan, K. 2023a. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS.
Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; and Cao, Y. 2023b. ReAct: Synergizing Reasoning and Acting in Language Models. In ICLR.
Yao, W.; Heinecke, S.; Niebles, J. C.; Liu, Z.; Feng, Y.; Xue, L.; Murthy, R.; Chen, Z.; Zhang, J.; Arpit, D.; Xu, R.; Mui, P.; Wang, H.; Xiong, C.; and Savarese, S. 2023c. Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization.
Zeng, A.; Liu, M.; Lu, R.; Wang, B.; Liu, X.; Dong, Y.; and Tang, J. 2023. AgentTuning: Enabling Generalized Agent Abilities for LLMs. arXiv preprint.
Zhang, Z.; Zhang, A.; Li, M.; and Smola, A. 2023. Automatic Chain of Thought Prompting in Large Language Models. In ICLR.
Zhao, A.; Zhu, E.; Lu, R.; Lin, M.; Liu, Y.-J.; and Huang, G. 2023a. Augmenting Unsupervised Reinforcement Learning with Self-Reference. arXiv preprint.
Zhao, W. X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. 2023b. A Survey of Large Language Models. arXiv preprint.
Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; and He, Q. 2020. A Comprehensive Survey on Transfer Learning. Proceedings of the IEEE.