
Dynamic Planning with a LLM

Gautier Dagan Frank Keller Alex Lascarides


School of Informatics
University of Edinburgh, UK
[email protected], {keller, alex}@inf.ed.ac.uk

Abstract

While Large Language Models (LLMs) can solve many NLP tasks in zero-shot settings, applications involving embodied agents remain problematic. In particular, complex plans that require multi-step reasoning become difficult and too costly as the context window grows. Planning requires understanding the likely effects of one's actions and identifying whether the current environment satisfies the goal state. While symbolic planners find optimal solutions quickly, they require a complete and accurate representation of the planning problem, severely limiting their use in practical scenarios. In contrast, modern LLMs cope with noisy observations and high levels of uncertainty when reasoning about a task. Our work presents LLM Dynamic Planner (LLM-DP): a neuro-symbolic framework where an LLM works hand-in-hand with a traditional planner to solve an embodied task. Given action-descriptions, LLM-DP solves Alfworld faster and more efficiently than a naive LLM ReAct baseline.

1 Introduction

Large Language Models (LLMs), like GPT-4 (OpenAI, 2023), have proven remarkably effective at various natural language processing tasks, particularly in zero-shot or few-shot settings (Brown et al., 2020). However, employing LLMs in embodied agents, which interact with dynamic environments, presents substantial challenges. LLMs tend to generate incorrect or spurious information, a phenomenon known as hallucination, and their performance is brittle to the phrasing of prompts (Ji et al., 2022). Moreover, LLMs are ill-equipped for naive long-term planning, since managing an extensive context over multiple steps is complex and resource-consuming (Silver et al., 2022; Liu et al., 2023).

Various approaches have aimed to mitigate some of these limitations. For instance, methods like Chain-of-Thought (Wei et al., 2022) and Self-Consistency (Wang et al., 2023b) augment the context with reasoning traces. Other, agent-based approaches, such as ReAct (Yao et al., 2023), integrate feedback from the environment iteratively, giving the agent the ability to take 'thinking' steps to augment its context with a reasoning trace. However, these approaches frequently involve high computational costs due to the iterated invocations of LLMs, and they still face challenges in dealing with the limits of the context window and in recovering from hallucinations, which can compromise the quality of the plans.

Conversely, traditional symbolic planners, such as the Fast-Forward planner (Hoffmann and Nebel, 2001) or the BFS(f) planner (Lipovetzky et al., 2014), excel at finding optimal plans efficiently. But symbolic planners require problem and domain descriptions as prerequisites (McDermott, 2000), which hampers their applicability in real-world scenarios where it may be infeasible to meet these high informational demands. For instance, knowing a complete and accurate description of the goal may not be possible before exploring the environment through actions.

Previous work by Liu et al. (2023) has shown that LLMs can generate valid problem files in the Planning Domain Definition Language (PDDL) for many simple examples. Yet the problem of incomplete information remains: agents often need to interact with the world to discover their surroundings before optimal planning can be applied. Some versions of PDDL have been proposed in the past to deal with probabilities or Task and Motion Planning, such as PPDDL and PDDLStream (Younes and Littman, 2004; Garrett et al., 2018), but these still assume a human designer encoding the agent's understanding of the domain and the planning problem, rather than the agent learning from interactions. Therefore, where modern LLMs need minimal information to figure out a task, e.g. through few-shot or in-context learning (Honovich et al., 2022; Chen et al., 2022; Min et al., 2022), traditional planners need maximal information.
Figure 1: LLM Dynamic Planner (LLM-DP). The LLM grounds observations and processes natural language instructions into PDDL to use with a symbolic planner. This model can solve plans for unobserved or previously unknown objects because the LLM generates plausible predicates for relevant objects through semantic and pragmatic inference. Through sampling possible predicates, multiple plans can be found, and an Action Selector decides whether to act, review its understanding of the problem, or ask clarification questions.

In this work, we introduce the LLM Dynamic Planner (LLM-DP), a neuro-symbolic framework that integrates an LLM with a symbolic planner to solve embodied tasks (our code is available at github.com/itl-ed/llm-dp). LLM-DP capitalises on the LLM's ability to understand actions and their impact on the environment and combines it with the planner's efficiency in finding solutions. Using domain knowledge, LLM-DP solves the Alfworld test set faster and more efficiently than an LLM-only (ReAct) approach. The remainder of this paper explores the architecture of LLM-DP, discusses how to combine the strengths of LLMs and symbolic planning, and presents potential research avenues for future work in LLM-driven agents.

2 Related Work

Symbolic Planners. Symbolic planners have been a cornerstone of automated planning and artificial intelligence for decades (Fikes and Nilsson, 1971). Based on formal logic, they operate over symbolic representations of the world to find a sequence of actions that transition from an initial state to a goal state. Since the introduction of PDDL (McDermott, 2000), the AI planning community has developed an array of efficient planning algorithms. For example, the Fast-Forward planner (FF) (Hoffmann and Nebel, 2001) employs heuristics derived from a relaxed version of the planning problem. Similarly, the BFS(f) planner (Lipovetzky et al., 2014) combines breadth-first search and specialised heuristics. These planners find high-quality or optimal solutions quickly in well-defined domains. However, their up-front requirement for comprehensive problem and domain descriptions limits their applicability in complex real-world settings where complete information may not be available.

LLMs in Planning and Reasoning. In contrast to symbolic planners, LLMs have shown promise in adapting to noisy planning and reasoning tasks through various methods. Some general approaches, such as Chain-of-Thought (Wei et al., 2022), Self-Consistency (Wang et al., 2023b), and Reasoning via Planning (Hao et al., 2023), augment the context with a reasoning trace that the LLM generates to improve its final prediction. Alternatively, an LLM can be given access to tools/APIs (Schick et al., 2023; Patil et al., 2023), outside knowledge or databases (Peng et al., 2023; Hu et al., 2023), code (Surís et al., 2023), and even symbolic reasoners (Yang et al., 2023) to enrich its context and ability to reason. The LLM can trigger these external sources of information or logic (through fine-tuning or prompting) to obtain additional context and improve its downstream performance.

Embodied Agents with LLMs. In a parallel direction, recent works such as ReAct (Yao et al., 2023), Reflexion (Shinn et al., 2023), AutoGPT (Significant-Gravitas, 2023), and Voyager (Wang et al., 2023a) take an agent-based approach and augment the reasoning process through a closed 'while' loop that feeds environment observations back to the LLM. ReAct (Yao et al., 2023) allows the LLM agent to either take an action or a 'thinking' step. This allows the LLM to augment its context with its own reasoning, which can be seen as agent-driven Chain-of-Thought prompting.
Voyager (Wang et al., 2023a) incrementally builds an agent's capabilities from its interactions with the environment and an accessible memory component (skill library). While many of these works show promising results in building general executable agents in embodied environments (Wang et al., 2023a), they still require many expensive calls to the LLMs, are limited by the LLM's context window, and do not guarantee optimal plans.

3 Alfworld

Alfworld (Shridhar et al., 2020) is a text-only home environment where an agent is given one of seven possible task types, such as interacting with one or more objects and placing them in a specific receptacle. At the start of each episode, the goal is given in natural language, and the initial observation does not include the location of any objects. Therefore, an agent must navigate the environment to search for the relevant objects and perform the correct actions. The possible locations of the environment are known, and the agent can navigate to any receptacle by using a 'go to' action. However, since none of the objects' locations are initially observed, the agent must be able to plan around uncertainty, estimate where objects are likely to be observed, and adjust accordingly.

4 LLM-DP

To tackle an embodied environment like Alfworld, we introduce the Large Language Model Dynamic Planner (LLM-DP), which operates as a closed-loop agent. LLM-DP uses a combination of language understanding and symbolic reasoning to plan and solve tasks in the simulated environment. The model tracks a world state W and beliefs B about predicates in the environment, uses an LLM to translate the task description into an executable goal state, and samples its beliefs to generate plausible world states. We describe the working of the LLM-DP agent as pseudo-code in Algorithm 1.

Algorithm 1: LLM-DP pseudo-code

Require: LLM, PG, AS, Domain, task, obs_0
goal ← LLM(Domain, task)
W, B ← observe(goal, obs_0)
while goal not reached do
    plans ← ∅
    for i = 1, ..., N do
        w_belief ← LLM(B, W)
        plans ← plans ∪ PG(w_belief ∪ W)
    end for
    action ← AS(plans)
    obs ← Env(action)
    W, B ← observe(action, obs)
end while

4.1 Assumptions

We make several simplifying assumptions when applying the LLM-DP framework to Alfworld:

1. Known action-descriptions and predicates: our input to the planner and the LLM requires the PDDL domain file, which describes what actions can be taken, their pre- and post-conditions, and what predicates exist.

2. Perfect observations: the Alfworld environment provides a perfect textual description of the current location. This observation also contains the intrinsic attributes of observed objects and receptacles, such as whether or not a given receptacle can be opened.

3. Causal environment: changes in the environment are entirely caused by the agent.

4. Valid actions always succeed.

4.2 Generating the Goal State

LLM-DP uses an LLM to generate a PDDL goal, given the natural language instruction (task) and the valid predicates defined by the PDDL domain file. Figure 1 shows an example task converted to a valid PDDL goal. For each episode, we use a set of three in-context examples that are fixed for the entire evaluation duration. We use the OpenAI gpt-3.5-turbo-0613 model with a temperature of 0 in all our LLM-DP experiments.
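As an illustration only (not the authors' exact implementation), the goal-generation call could look roughly like the sketch below, assuming the pre-1.0 openai Python client; DOMAIN_PREDICATES and FEW_SHOT_EXAMPLES stand in for the prompts shown in Table 3 and Table 4, and generate_goal is our own illustrative helper name.

    import openai

    # System prompt: the PDDL domain predicates shown in Table 3 (truncated here).
    DOMAIN_PREDICATES = "(define (domain alfred) (:predicates (isReceptacle ?o - object) ...))"

    # Fixed few-shot examples mapping a task to a PDDL (:goal ...), as in Table 4.
    FEW_SHOT_EXAMPLES = [
        ("Your task is to: put a clean plate in microwave.",
         "(:goal (exists (?t - plate ?r - microwave) (and (inReceptacle ?t ?r) (isClean ?t))))"),
    ]

    def generate_goal(task: str) -> str:
        """Translate a natural-language task into a PDDL :goal using the chat model."""
        messages = [{"role": "system", "content": DOMAIN_PREDICATES}]
        for example_task, example_goal in FEW_SHOT_EXAMPLES:
            messages.append({"role": "user", "content": example_task})
            messages.append({"role": "assistant", "content": example_goal})
        messages.append({"role": "user", "content": f"Your task is to: {task}"})
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-0613",  # temperature 0 for deterministic output
            messages=messages,
            temperature=0,
        )
        return response["choices"][0]["message"]["content"]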
Average Accuracy (%)

Model                     clean  cool   examine  heat   put    puttwo  overall (↑)  LLM Tokens (↓)
LLM-DP                    0.94   1.00   1.00     0.87   1.00   0.94    0.96         633k
LLM-DP-random             0.94   1.00   1.00     0.87   0.96   1.00    0.96         67k
ReAct (Yao et al., 2023)  0.61   0.81   0.89     0.30   0.79   0.47    0.64         —*
ReAct (ours)              0.35   0.90   0.33     0.65   0.71   0.29    0.54         9.16M

(a) The average accuracy and number of LLM tokens processed (context + generation) for each model. *Not reported.

Average Episode Length

Model          clean  cool   examine  heat   put    puttwo  overall (↓)
LLM-DP         12.00  13.67  12.06    12.30  12.75  17.59   13.16
LLM-DP-random  15.06  17.14  10.56    14.04  14.62  18.94   15.02
ReAct (ours)   25.10  9.86   21.67    14.70  15.33  24.94   18.69

(b) The average episode length for each model, where the length of an episode denotes how many actions the agent has taken or attempted to take to complete a task. We do not count the 'thinking' action of ReAct as an action in this metric.

Table 1: Summary of model performance on the Alfworld test set. LLM-DP and LLM-DP-random differ in the sampling strategy of the belief: LLM-DP uses an LLM to generate n = 3 plausible world states, while LLM-DP-random randomly samples n = 3 plausible world states.

4.3 Sampling Beliefs

We parse the initial scene description into a structured representation of the environment W and a set of beliefs B. The internal representation of the world W contains all known information: for instance, all receptacles (possible locations) in the scene from the initial observation and their intrinsic attributes are known (e.g. a fridge holds the isFridge predicate). The set of beliefs B, in contrast, contains possible valid predicates that can be true or false and which the model does not have enough information to disambiguate. In Alfworld, the objects' locations are unknown; therefore, the set of possible predicates for each object includes all possible locations.

LLM-DP uses the stored observations W, the beliefs B, and an LLM to construct different planning problem files in PDDL. A PDDL problem file includes the objects observed (:objects), a representation of the current state of the world and the object attributes (:init), and the goal to be achieved (:goal). The goal is derived from the LLM (Section 4.2), while the objects and their attributes are obtained from W (observations) and the beliefs B holds about the objects.

Since B includes possible predicates whose truth values are unknown, we sample from B using an LLM to obtain w_belief. For instance, our belief could be that (inReceptacle tomato ?x), where ?x can be countertop, cabinet, fridge, etc. Since we want to condition the sampling of where the tomato can appear, we pass the known world state W, along with the predicate (in this case inReceptacle) and its options, to the LLM. This sampling leverages the LLM to complete a world state and is extendable to any unknown predicate from which a set of beliefs can be deduced. We also compare LLM sampling with random sampling (llmdp-random).
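The following is a minimal sketch of this sampling step under our own simplified representation (sets of predicate tuples); it is not the released LLM-DP code, and llm_choose_location is a stub standing in for the conditioned LLM call described above.

    import random

    # W: ground predicates known to be true from observations.
    world_state = {("isFridge", "fridge-1"), ("openable", "fridge-1")}

    # B: unknown predicates, each paired with the options it could take, e.g. the
    # tomato could be in any receptacle that has not been checked yet.
    beliefs = {("inReceptacle", "tomato-1"): ["countertop-1", "cabinet-1", "fridge-1"]}

    def llm_choose_location(predicate, obj, options, world_state):
        """Stub for the LLM call: LLM-DP prompts the chat model with W and the
        candidate options and asks for a plausible completion; here we simply
        return the first option so the sketch runs without an API key."""
        return options[0]

    def sample_world(beliefs, world_state, use_llm=True):
        """Complete W into one plausible world state (w_belief united with W)."""
        sampled = set(world_state)
        for (predicate, obj), options in beliefs.items():
            if use_llm:
                value = llm_choose_location(predicate, obj, options, world_state)
            else:
                value = random.choice(options)  # the llmdp-random baseline
            sampled.add((predicate, obj, value))
        return sampled

    # n = 3 sampled world states give three candidate planning problems.
    candidate_worlds = [sample_world(beliefs, world_state) for _ in range(3)]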
We describe our likely world state as the union of a sampled set of beliefs w_belief and the known world state W. By sampling i = 1, ..., N different sets of beliefs during the planning loop, we obtain N likely world states. Finally, we convert each likely world state to a list of predicates to interface with the PDDL planner.

4.4 Plan Generator

Upon constructing the different PDDL problems, the agent uses a Plan Generator (PG) to solve each problem and obtain a plan. We use the BFS(f) solver (Lipovetzky et al., 2014), implemented as an executable by LAPKT (Ramirez et al., 2015). A generated plan is a sequence of actions, where each action is represented in a symbolic form which, if executed, would lead to the goal state from the initial state.

4.5 Action Selector

The Action Selector (AS) module decides the agent's immediate next action. It takes the planner's output, a set of plans, and selects an action from them. In our Alfworld experiments, the Action Selector simply selects the shortest plan returned. If no valid plans are returned, then either all sampled world states already satisfy the goal, there is a mistake in the constructed domain or problem files, or the planner has failed to find a path to the goal. In the first case, we re-sample random world states and re-run the planners once.

We also propose exploring different strategies for when valid plans cannot be found. For instance, similarly to self-reflection (Shinn et al., 2023), the Action Selector could prompt an update to the agent's belief about the world state if none of the generated problem descriptions are solvable. The Action Selector could also interact with a human teacher or oracle to adjust its understanding of the environment (problem) or its logic (domain).
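A rough sketch of how the Plan Generator and Action Selector could be wired together is shown below. The planner invocation assumes a LAPKT BFS(f) binary named bfs_f whose exact command-line flags may differ from a given build, and to_pddl_problem is our own illustrative helper (object types omitted for brevity).

    import subprocess
    import tempfile

    def to_pddl_problem(world, goal: str) -> str:
        """Serialise one sampled world state into a PDDL problem string."""
        objects = " ".join(sorted({item for fact in world for item in fact[1:]}))
        init = " ".join("(" + " ".join(fact) + ")" for fact in sorted(world))
        return (f"(define (problem alfworld) (:domain alfred)\n"
                f"  (:objects {objects})\n  (:init {init})\n  {goal})")

    def plan(domain_file: str, problem_pddl: str) -> list[str] | None:
        """Plan Generator: run the external BFS(f) planner on one problem."""
        with tempfile.NamedTemporaryFile("w", suffix=".pddl", delete=False) as f:
            f.write(problem_pddl)
            problem_file = f.name
        # Flag names are assumed; adjust to the actual CLI of your planner build.
        result = subprocess.run(
            ["bfs_f", "--domain", domain_file, "--problem", problem_file],
            capture_output=True, text=True, timeout=30,
        )
        lines = result.stdout.strip().splitlines()
        return lines or None  # None means no plan was found for this world state

    def select_action(plans: list) -> str | None:
        """Action Selector: first action of the shortest valid plan (or None)."""
        valid = [p for p in plans if p]
        return min(valid, key=len)[0] if valid else None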
4.6 Observation Processing

LLM-DP uses the result of each action to update its internal state representation. It uses the symbolic effects of the action to infer changes in the state of the objects and receptacles. It then integrates the information from the new observation, which might reveal additional details not directly inferred from the action itself. For instance, opening an unseen drawer might reveal new objects inside. Observing also updates the beliefs: if an object is observed at a location, it cannot be elsewhere, and if an object is not observed at a location, it cannot be there. In this way, observations incorporate beliefs into W.

If the agent detects new information from the scene, such as discovering new objects, it triggers a re-planning process. The agent then generates a new set of possible PDDL problems using the updated state representation, and corresponding plans using the Plan Generator. This approach is similar to some Task and Motion Planning (TAMP) methods (Garrett et al., 2018; Chen et al., 2023), enabling the agent to adapt to environmental changes and unexpected outcomes of actions.
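Continuing the simplified representation from the earlier sketches, the belief update described above can be expressed as plain set operations (again our own illustration rather than the released code):

    def process_observation(location, seen_objects, world_state, beliefs):
        """Update W and B after observing a receptacle, e.g. after 'go to' or 'open'.
        Returns True if new information was discovered, which triggers re-planning."""
        new_information = False
        for obj in seen_objects:
            fact = ("inReceptacle", obj, location)
            if fact not in world_state:
                world_state.add(fact)                      # the object is here...
                beliefs.pop(("inReceptacle", obj), None)   # ...so it cannot be elsewhere
                new_information = True
        # Objects we believed might be here but did not see cannot be at this location.
        for (predicate, obj), options in beliefs.items():
            if predicate == "inReceptacle" and obj not in seen_objects and location in options:
                options.remove(location)
        return new_information

    # Example: the agent opens fridge-1 and only sees a potato inside.
    # process_observation("fridge-1", {"potato-1"}, world_state, beliefs)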
5 Results

We contrast the LLM-DP approach with ReAct (an LLM-only baseline) from the original implementation by Yao et al. (2023). Since we use a different backbone LLM (gpt-3.5-turbo rather than text-davinci-002) than the ReAct baseline for cost purposes, we also reproduce their results using gpt-3.5-turbo and adapt the ReAct prompts to a chat format.

As shown in Table 1, LLM-DP solves Alfworld almost perfectly (96%) compared to our baseline reproduction of ReAct (53%). LLM-DP can translate the task description into an executable PDDL goal 97% of the time, but sampling reduces the accuracy further when it fails to select a valid set of possible world states, for instance by sampling states where the goal is already satisfied.

We note that the ReAct baseline makes different assumptions about the problem: while it does not require a domain file containing the action-descriptions and object predicates, it uses two separate human-annotated episodes per example to bootstrap its in-context logic. ReAct also switches out which examples to use in-context based on the type of task, such that two examples of the same type of task being solved are always shown. We also find that our reproduction of ReAct is worse than the original and attribute this to the gpt-3.5-turbo model being more conversational than text-davinci-002, and thus less likely to output valid actions as it favours fluency over following the templated action language.

We also measure the length of each successful episode and find that LLM-DP reaches the goal state faster on average (13.16 actions) than ReAct (18.69 actions) and a random search strategy (15.02 actions). The Average Episode Length measures the number of actions taken in the environment and thus how efficient the agent is.

6 Conclusion

The LLM-DP agent effectively integrates language understanding, symbolic planning, and state tracking in a dynamic environment. It uses the language model to understand tasks and scenes expressed in natural language, constructs and solves planning problems to decide on a course of action, and keeps track of the world state to adapt to changes and make informed decisions. This workflow enables the agent to perform complex tasks in the Alfworld environment, making it a promising approach for embodied tasks that involve language understanding, reasoning, and decision-making.

LLM-DP offers a cost and efficiency trade-off between a wholly symbolic solution and an LLM-only model. The LLM's semantic knowledge of the world is leveraged to translate the problem into PDDL while guiding the search process through belief instantiation. We find that not only is LLM-DP cheaper on a per-token comparison, but it is also faster and more successful at long-term planning in an embodied environment. LLM-DP validates the need for LLM research to incorporate specialised tools, such as PDDL solvers, in embodied agents to promote valid planning.

Despite these promising results, numerous topics and unresolved issues remain open for future investigation. Key among these is devising strategies to encode the world model and belief, currently handled symbolically, and managing uncertain observations, particularly from an image model, along with propagating any uncertainty to the planner and Action Selector. We intentionally kept the Action Selector simple for our experiments, but future work may also explore different strategies to encourage self-reflection within the agent loop.
For instance, if all plans prove invalid, beliefs may be updated, or it might indicate an incorrect domain definition. Such instances may necessitate agents to interact with an instructor who can provide insights about action pre-conditions and effects. This direction could lead us from a static domain file towards an agent truly adaptable to new environments, fostering continual learning and adaptation.

Acknowledgements

This work was supported in part by the UKRI Centre for Doctoral Training in Natural Language Processing, funded by the UKRI (grant EP/S022481/1) at the University of Edinburgh, School of Informatics and School of Philosophy, Psychology & Language Sciences, and by the UKRI-funded TAS Governance Node (grant number EP/V026607/1).

References

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Yanda Chen, Ruiqi Zhong, Sheng Zha, George Karypis, and He He. 2022. Meta-learning via language model in-context tuning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 719-730, Dublin, Ireland. Association for Computational Linguistics.

Yongchao Chen, Jacob Arkin, Yang Zhang, Nicholas A. Roy, and Chuchu Fan. 2023. AutoTAMP: Autoregressive task and motion planning with LLMs as translators and checkers. ArXiv, abs/2306.06531.

Richard E. Fikes and Nils J. Nilsson. 1971. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2(3):189-208.

Caelan Reed Garrett, Tomas Lozano-Perez, and Leslie Pack Kaelbling. 2018. PDDLStream: Integrating symbolic planners and blackbox samplers via optimistic adaptive planning. In International Conference on Automated Planning and Scheduling.

Shibo Hao, Yilan Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. ArXiv, abs/2305.14992.

Jörg Hoffmann and Bernhard Nebel. 2001. The FF planning system: Fast plan generation through heuristic search. Journal of Artificial Intelligence Research, 14:253-302.

Or Honovich, Uri Shaham, Samuel R. Bowman, and Omer Levy. 2022. Instruction induction: From few examples to natural language task descriptions. ArXiv, abs/2205.10782.

Chenxu Hu, Jie Fu, Chenzhuang Du, Simian Luo, Junbo Jake Zhao, and Hang Zhao. 2023. ChatDB: Augmenting LLMs with databases as their symbolic memory. ArXiv, abs/2306.03901.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Wenliang Dai, Andrea Madotto, and Pascale Fung. 2022. Survey of hallucination in natural language generation. ACM Computing Surveys, 55:1-38.

Nir Lipovetzky, Miquel Ramirez, Christian Muise, and Hector Geffner. 2014. Width and inference based planners: SIW, BFS(f), and PROBE. In Proceedings of the 8th International Planning Competition (IPC-2014), page 43.

B. Liu, Yuqian Jiang, Xiaohan Zhang, Qian Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. 2023. LLM+P: Empowering large language models with optimal planning proficiency. ArXiv, abs/2304.11477.

Drew McDermott. 2000. The 1998 AI planning systems competition. AI Magazine, 21(2):35-55.

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? In Conference on Empirical Methods in Natural Language Processing.

OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2023. Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334.

Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Lidén, Zhou Yu, Weizhu Chen, and Jianfeng Gao. 2023. Check your facts and try again: Improving large language models with external knowledge and automated feedback. ArXiv, abs/2302.12813.

Miquel Ramirez, Nir Lipovetzky, and Christian Muise. 2015. Lightweight Automated Planning ToolKiT. http://lapkt.org/. Accessed: 2020.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. ArXiv, abs/2302.04761.

Noah Shinn, Beck Labash, and Ashwin Gopinath. 2023. Reflexion: An autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366.

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew J. Hausknecht. 2020. ALFWorld: Aligning text and embodied environments for interactive learning. CoRR, abs/2010.03768.

Significant-Gravitas. 2023. An experimental open-source attempt to make GPT-4 fully autonomous. https://github.com/significant-gravitas/auto-gpt. Accessed: 2023-06-09.

Tom Silver, Varun Hariprasad, Reece S. Shuttleworth, Nishanth Kumar, Tomás Lozano-Pérez, and Leslie Pack Kaelbling. 2022. PDDL planning with pretrained large language models. In NeurIPS 2022 Foundation Models for Decision Making Workshop.

Dídac Surís, Sachit Menon, and Carl Vondrick. 2023. ViperGPT: Visual inference via Python execution for reasoning. ArXiv, abs/2303.08128.

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi (Jim) Fan, and Anima Anandkumar. 2023a. Voyager: An open-ended embodied agent with large language models. ArXiv, abs/2305.16291.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Huai hsin Chi, and Denny Zhou. 2023b. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations (ICLR).

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.

Zhun Yang, Adam Ishay, and Joohyung Lee. 2023. Coupling large language models with logic programming for robust and general reasoning from text. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 5186-5219. Association for Computational Linguistics.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR).

Håkan L. S. Younes and Michael L. Littman. 2004. PPDDL1.0: An extension to PDDL for expressing planning domains with probabilistic effects. Techn. Rep. CMU-CS-04-162, 2:99.

A Prompts and Few-shot details

See Table 3 and Table 4 for the LLM-DP prompts used.

B ReAct

B.1 Reproduction with Chat Model

We slightly modify the 'system' prompt of the original ReAct (see Table 5) to guide the model away from its conversational tendencies. gpt-3.5-turbo apologises significantly more than the text-davinci-002 model, and we found that it would often get stuck in loops of apologising. We also modify the code so that we replace all generated instances of 'in' and 'on' with 'in/on' if the model did not generate it correctly, since Alfworld expects 'in/on' but gpt-3.5-turbo tends to generate only the correct preposition. Without these changes, ReAct would be significantly worse than our reported metric.
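This 'in'/'on' normalisation could be done with a small regular-expression post-processing step over the generated action string, roughly as in the sketch below (our own illustration, not the exact code; here the replacement is applied only once per action string).

    import re

    def normalise_preposition(action: str) -> str:
        """Rewrite the first bare 'in' or 'on' to the 'in/on' form Alfworld expects."""
        return re.sub(r"\b(in|on)\b(?!/)", "in/on", action, count=1)

    # Example: "put mug 1 in coffeemachine 1" -> "put mug 1 in/on coffeemachine 1"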

C LLM-DP

C.1 Generated Goal Examples

See Table 6 for examples of generated goals, both valid and invalid.

C.2 Varying n

See Table 2 for results when varying n and the fallback. The fallback applies when no plans are sampled successfully through the LLM; LLM-DP then re-samples n plans randomly.

Model                    SR    EL
LLM-DP (n=3)             0.96  13.16
LLM-DP (n=3) - fallback  0.92  12.80
LLM-DP (n=5)             0.96  12.54
LLM-DP (n=5) - fallback  0.94  12.24

Table 2: We compare the average Success Rate (SR) and average Episode Length (EL) for different sampling sizes n, with or without a fallback to random sampling. The random sampling fallback affects the success rate, as the LLM sampler can more often sample n world states which are already satisfied. However, as n increases, it becomes more likely for the sampling procedure to find at least one plan, and therefore the SR increases when no fallback (- fallback) is used.
(define (domain alfred)
(:predicates
(isReceptacle ?o - object) ; true if the object is a receptacle
(atReceptacleLocation ?r - object) ; true if the robot is at the receptacle location
(inReceptacle ?o - object ?r - object) ; true if object ?o is in receptacle ?r
(openable ?r - object) ; true if a receptacle is openable
(opened ?r - object) ; true if a receptacle is opened
(isLight ?o - object) ; true if an object is light source
(examined ?o - object ?l - object) ; whether the object has been looked at with light
(holds ?o - object) ; object ?o is held by robot
(isClean ?o - object) ; true if the object has been cleaned in sink
(isHot ?o - object) ; true if the object has been heated up
(isCool ?o - object) ; true if the object has been cooled
(isSink ?o - object) ; true if the object is a sink
(isMicrowave ?o - object) ; true if the object is a microwave
(isFridge ?o - object) ; true if the object is a fridge
))

Table 3: System Prompt used by gpt-3.5-turbo for generating the :goal in LLM-DP

Your task is to: put a clean plate in microwave.


(:goal
(exists (?t - plate ?r - microwave)
(and (inReceptacle ?t ?r)
(isClean ?t)
)))

Your task is to: examine an alarmclock with the desklamp


(:goal
(exists (?t - alarmclock ?l - desklamp)
(and (examined ?t ?l) (holds ?t)
)))

Your task is to: put two cellphone in bed


(:goal
(exists (?t1 - cellphone ?t2 - cellphone ?r - bed)
(and (inReceptacle ?t1 ?r)
(inReceptacle ?t2 ?r)
(not (= ?t1 ?t2))
)))

Table 4: Fixed Few-shot examples used by gpt-3.5-turbo for generating the :goal in LLM-DP

Interact with a household to solve a task.


Only reply with > followed by the action to take or 'think'.
Do not apologize.
Follow the format of the two examples below.

Table 5: System Prompt used by gpt-3.5-turbo in our reproduction of ReAct


task: put a clean mug in coffeemachine.
Generated:
(:goal
 (exists (?t - mug ?r - coffeemachine)
  (and (inReceptacle ?t ?r)
       (isClean ?t)
)))
VALID ✓

task: put some peppershaker on drawer.
Generated:
(:goal
 (exists (?t - peppershaker ?r - drawer)
  (inReceptacle ?t ?r)
))
VALID ✓

task: put two cd in safe.
Generated:
(:goal
 (exists (?t1 - cd ?t2 - cd ?r - safe)
  (and (inReceptacle ?t1 ?r)
       (inReceptacle ?t2 ?r)
       (not (= ?t1 ?t2))
)))
VALID ✓

task: heat some mug and put it in coffeemachine.
Generated:
(:goal
 (exists (?m - mug ?c - coffeemachine)
  (and (isReceptacle ?m)
       (isHot ?m)
       (inReceptacle ?m ?c)
)))
INVALID ✗

Table 6: Sample of generated PDDL goals from LLM-DP. The generation gets confused by the semantics of 'receptacle' and identifies a mug as a receptacle. While a mug is a receptacle in everyday terms, in our defined logic receptacles are fixed, immovable objects which can contain other objects; a mug is therefore not a Receptacle, which subsequently causes planning to fail.
