Dynamic Planning With A LLM

[Figure 1 (diagram): the task "Heat a potato" is translated into a PDDL goal (e.g. (and (inReceptacle ?t ?r) ... (isHot ?t))); the Generator samples plausible init states (e.g. potato-1 in fridge-1 or on a countertop), a symbolic planner returns candidate plans such as <go to fridge-1, open fridge-1, pick-up potato-1, ...> (or no plan for some samples), and the Action Selector picks the next action to execute in the Env, whose Observation is fed back to the agent.]
Figure 1: LLM Dynamic Planner (LLM-DP). The LLM grounds observations and processes natural language
instructions into PDDL to use with a symbolic planner. This model can solve plans for unobserved or previously
unknown objects because the LLM generates plausible predicates for relevant objects through semantic and
pragmatic inference. Through sampling possible predicates, multiple plans can be found, and an Action Selector
decides whether to act, review its understanding of the problem, or ask clarification questions.
2022; Chen et al., 2022; Min et al., 2022), traditional planners need maximal information.
In this work, we introduce the LLM Dynamic Planner (LLM-DP), a neuro-symbolic framework that integrates an LLM with a symbolic planner to solve embodied tasks.¹ LLM-DP capitalises on the LLM's ability to understand actions and their impact on their environment and combines it with the planner's efficiency in finding solutions. Using domain knowledge, LLM-DP solves the Alfworld test set faster and more efficiently than an LLM-only (ReAct) approach. The remainder of this paper explores the architecture of LLM-DP, discusses how to combine the strengths of LLMs and symbolic planning, and presents potential research avenues for future work in LLM-driven agents.

¹ Our code is available at github.com/itl-ed/llm-dp

2 Related Work

Symbolic Planners Symbolic planners have been a cornerstone in automated planning and artificial intelligence for decades (Fikes and Nilsson, 1971). Based on formal logic, they operate over symbolic representations of the world to find a sequence of actions that transition from an initial state to a goal state. Since the introduction of PDDL (McDermott, 2000), the AI planning community has developed an array of efficient planning algorithms. For example, the Fast-Forward planner (FF) (Hoffmann and Nebel, 2001) employs heuristics derived from a relaxed version of the planning problem. Similarly, the BFS(f) planner (Lipovetzky et al., 2014) combines breadth-first search and specialised heuristics. These planners find high-quality or optimal solutions quickly in well-defined domains. However, their up-front requirement for comprehensive problem and domain descriptions limits their applicability in complex real-world settings where complete information may not be available.

LLMs in Planning and Reasoning In contrast to symbolic planners, LLMs have shown promise in adapting to noisy planning and reasoning tasks through various methods. Some general approaches such as Chain-of-Thought (Wei et al., 2022), Self-Consistency (Wang et al., 2023b), and Reasoning via Planning (Hao et al., 2023) augment the context with a reasoning trace that the LLM generates to improve its final prediction. Alternatively, giving access to tools/APIs (Schick et al., 2023; Patil et al., 2023), outside knowledge or databases (Peng et al., 2023; Hu et al., 2023), code (Surís et al., 2023), and even symbolic reasoners (Yang et al., 2023) can enrich an LLM's context and ability to reason. The LLM can trigger these external sources of information or logic (through fine-tuning or prompting) to obtain additional context and improve its downstream performance.

Embodied Agents with LLMs In a parallel direction, recent works such as ReAct (Yao et al., 2023), Reflexion (Shinn et al., 2023), AutoGPT (Significant-Gravitas, 2023), and Voyager (Wang et al., 2023a) take an agent-based approach and augment the reasoning process through a closed 'while' loop that feeds environment observations back to the LLM. ReAct (Yao et al., 2023) allows the LLM agent to either take an action or a 'thinking' step. This allows the LLM to augment its context with its reasoning, which can be seen as
agent-driven Chain-of-Thought prompting. Voyager (Wang et al., 2023a) incrementally builds an agent's capabilities from its interactions with the environment and an accessible memory component (skill library). While many of these works show promising results in building general executable agents in embodied environments (Wang et al., 2023a), they still require many expensive calls to the LLMs, are limited by the LLM's context window, and do not guarantee optimal plans.

3 Alfworld

Alfworld (Shridhar et al., 2020) is a text-only home environment where an agent is tasked with seven possible tasks, such as interacting with one or more objects and placing them in a specific receptacle. At the start of each episode, the goal is given in natural language, and the initial observation does not include the location of any objects. Therefore, an agent must navigate the environment to search for the relevant objects and perform the correct actions. The possible locations of the environment are known, and the agent can navigate to any receptacle by using a 'go to' action. However, since none of the objects' locations are initially observed, the agent must be able to plan around uncertainty, estimate where objects are likely to be observed, and adjust accordingly.

4 LLM-DP

To tackle an embodied environment like Alfworld, we introduce the Large Language Model Dynamic Planner (LLM-DP), which operates as a closed-loop agent. LLM-DP uses a combination of language understanding and symbolic reasoning to plan and solve tasks in the simulated environment. The model tracks a World State W and beliefs B about predicates in the environment, uses an LLM to translate the task description into an executable goal state, and samples its beliefs to generate plausible world states. We describe the working of the LLM-DP agent as pseudo-code in Algorithm 1.

Algorithm 1 LLM-DP Pseudo-code
Require: LLM, PG, AS, Domain, task, obs_0
  goal ← LLM(Domain, task)
  W, B ← observe(goal, obs_0)
  while goal not reached do
      plans ← ∅
      for i in N do
          w_belief ← LLM(B, W)
          plans ← plans ∪ PG(w_belief ∪ W)
      end for
      action ← AS(plans)
      obs ← Env(action)
      W, B ← observe(action, obs)
  end while

4.1 Assumptions

We make several simplifying assumptions when applying the LLM-DP framework to Alfworld:

1. Known action-descriptions and predicates: Our input to the planner and the LLM requires the PDDL domain file, which describes what actions can be taken, their pre- and post-conditions, and what predicates exist.

2. Perfect observations: The Alfworld environment provides a perfect textual description of the current location. This observation also contains the intrinsic attributes of observed objects and receptacles, such as whether or not a given receptacle can be opened.

3. Causal Environment: changes in the environment are entirely caused by the agent.

4. Valid actions always succeed.

4.2 Generating the Goal State

LLM-DP uses an LLM to generate a PDDL goal, given the natural language instruction (task) and the valid predicates defined by the PDDL domain file. Figure 1 shows an example task converted to a valid PDDL goal. For each episode, we use a set of three in-context examples that are fixed for the entire evaluation duration. We use the OpenAI gpt-3.5-turbo-0613 LLM model with a temperature of 0 in all our LLM-DP experiments.

4.3 Sampling Beliefs

We parse the initial scene description into a structured representation of the environment W and a set of beliefs B. The internal representation of the world W contains all known information; for instance, all receptacles (possible locations) in the scene from the initial observation and their intrinsic attributes are known (i.e. a fridge holds the isFridge predicate). The set of beliefs B, in contrast, contains possible valid predicates that can be true or false and which the model does not have enough information to disambiguate. In Alfworld, the objects' locations are unknown; therefore, the set of possible predicates for each object includes all possible locations.
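For readers who prefer code to pseudo-code, the sketch below restates Algorithm 1 in Python. It is a minimal illustration, not the actual LLM-DP implementation: every helper (llm_generate_goal, llm_sample_world, plan, action_selector, env_step, observe, goal_satisfied) is a hypothetical placeholder that would have to be backed by a real LLM client, a PDDL planner such as BFS(f), and the Alfworld environment, and the simple shortest-plan selector stands in for the richer Action Selector described in Figure 1.

"""Illustrative Python restatement of Algorithm 1 (not the authors' code).

All helpers are hypothetical placeholders; W and B are modelled as sets of
ground PDDL facts for simplicity.
"""
from typing import List, Optional, Set, Tuple

N = 3  # number of sampled world states per step (n = 3 in the paper)


def llm_generate_goal(domain_pddl: str, task: str) -> str:
    """Placeholder: prompt the LLM with the domain predicates, fixed in-context
    examples, and the task, and return a PDDL :goal (Section 4.2)."""
    raise NotImplementedError


def llm_sample_world(beliefs: Set[str], world: Set[str]) -> Set[str]:
    """Placeholder: ask the LLM for a plausible assignment of the unknown
    predicates, e.g. likely object locations (Section 4.3)."""
    raise NotImplementedError


def plan(facts: Set[str], goal: str, domain_pddl: str) -> Optional[List[str]]:
    """Placeholder: run a symbolic planner on the grounded problem; None if unsolvable."""
    raise NotImplementedError


def action_selector(plans: List[List[str]]) -> str:
    """Placeholder strategy: first action of the shortest candidate plan."""
    return min(plans, key=len)[0]


def env_step(action: str) -> str:
    """Placeholder: execute the action in Alfworld, return the textual observation."""
    raise NotImplementedError


def observe(world: Set[str], beliefs: Set[str], obs: str) -> Tuple[Set[str], Set[str]]:
    """Placeholder: parse the observation, add new facts to W, prune beliefs in B."""
    raise NotImplementedError


def goal_satisfied(world: Set[str], goal: str) -> bool:
    """Placeholder: evaluate the PDDL goal formula against the known facts."""
    raise NotImplementedError


def llm_dp(domain_pddl: str, task: str, initial_obs: str, max_steps: int = 50) -> bool:
    goal = llm_generate_goal(domain_pddl, task)            # goal <- LLM(Domain, task)
    world, beliefs = observe(set(), set(), initial_obs)    # W, B <- observe(goal, obs_0)

    for _ in range(max_steps):                             # while goal not reached do
        if goal_satisfied(world, goal):
            return True

        plans: List[List[str]] = []
        for _ in range(N):                                 # sample N plausible world states
            w_belief = llm_sample_world(beliefs, world)    # w_belief <- LLM(B, W)
            candidate = plan(world | w_belief, goal, domain_pddl)
            if candidate:
                plans.append(candidate)                    # plans <- plans U PG(w_belief U W)
        if not plans:
            continue  # no sampled world was solvable; re-sample (cf. the fallback in Appendix C.2)

        action = action_selector(plans)                    # action <- AS(plans)
        obs = env_step(action)                             # obs <- Env(action)
        world, beliefs = observe(world, beliefs, obs)      # W, B <- observe(action, obs)

    return False

With real implementations substituted for the placeholders, one call of llm_dp per Alfworld episode reproduces the control flow of Algorithm 1.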
Average Accuracy (%)
Model clean cool examine heat put puttwo overall (↑) LLM Tokens (↓)
LLM-DP 0.94 1.00 1.00 0.87 1.00 0.94 0.96 633k
LLM-DP-random 0.94 1.00 1.00 0.87 0.96 1.00 0.96 67k
ReAct (Yao et al., 2023) 0.61 0.81 0.89 0.30 0.79 0.47 0.64 —*
ReAct (ours) 0.35 0.90 0.33 0.65 0.71 0.29 0.54 9.16M
(a) The average accuracy and number of LLM Tokens processed (context + generation) for each model. *Not reported.
Average Episode Length
Model clean cool examine heat put puttwo overall (↓)
LLM-DP 12.00 13.67 12.06 12.30 12.75 17.59 13.16
LLM-DP-random 15.06 17.14 10.56 14.04 14.62 18.94 15.02
ReAct (ours) 25.10 9.86 21.67 14.70 15.33 24.94 18.69
(b) The average episode length for each model, where the length of an episode denotes how many actions the agent has taken or
attempted to take to complete a task. We do not count the ‘thinking’ action of ReAct as an action in this metric.
Table 1: Summary of model performance on the Alfworld test set. LLM-DP and LLM-DP-random differ in the sampling strategy of the belief. LLM-DP uses an LLM to generate n = 3 plausible world states, while LLM-DP-random randomly samples n = 3 plausible world states.
Yongchao Chen, Jacob Arkin, Yang Zhang, Nicholas A. Roy, and Chuchu Fan. 2023. AutoTAMP: Autoregressive task and motion planning with LLMs as translators and checkers. ArXiv, abs/2306.06531.

Richard E. Fikes and Nils J. Nilsson. 1971. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2(3):189–208.

Caelan Reed Garrett, Tomas Lozano-Perez, and Leslie Pack Kaelbling. 2018. PDDLStream: Integrating symbolic planners and blackbox samplers via optimistic adaptive planning. In International Conference on Automated Planning and Scheduling.

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2023. Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334.

Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Lidén, Zhou Yu, Weizhu Chen, and Jianfeng Gao. 2023. Check your facts and try again: Improving large language models with external knowledge and automated feedback. ArXiv, abs/2302.12813.

Miquel Ramirez, Nir Lipovetzky, and Christian Muise. 2015. Lightweight Automated Planning ToolKiT. http://lapkt.org/. Accessed: 2020.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. ArXiv, abs/2302.04761.

Zhun Yang, Adam Ishay, and Joohyung Lee. 2023. Coupling large language models with logic programming for robust and general reasoning from text. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 5186–5219. Association for Computational Linguistics.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR).

Håkan LS Younes and Michael L Littman. 2004. PPDDL1.0: An extension to PDDL for expressing planning domains with probabilistic effects. Techn. Rep. CMU-CS-04-162, 2:99.

C LLM-DP

C.1 Generated Goal Examples

See Table 6 for examples of generated goals, both valid and invalid.

C.2 Varying n

See Table 6 for results when varying n and the fallback strategy. Fallback occurs when no plans are sampled successfully through the LLM; in that case, LLM-DP re-samples n plans randomly.
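The fallback can be made concrete with a short, self-contained sketch (illustrative only, not the authors' code). It assumes beliefs are stored as a mapping from each object with an unknown location to its candidate location facts, and that the LLM sampler and the planner are supplied as callables, mirroring the hypothetical helpers of the earlier Algorithm 1 sketch.

# Sketch of the sampling fallback described above (illustrative; not the authors' code).
# Assumption: `beliefs` maps each object with an unknown location to its candidate
# (inReceptacle obj loc) facts; `llm_sampler` and `planner` are supplied callables.
import random
from typing import Callable, Dict, List, Optional, Set

Plan = List[str]


def sample_plans(
    world: Set[str],
    beliefs: Dict[str, Set[str]],
    llm_sampler: Callable[[Dict[str, Set[str]], Set[str]], Set[str]],
    planner: Callable[[Set[str]], Optional[Plan]],
    n: int = 3,
) -> List[Plan]:
    """Try n LLM-sampled world states; if none yields a plan, re-sample n randomly."""
    plans: List[Plan] = []
    for _ in range(n):
        w_belief = llm_sampler(beliefs, world)           # LLM proposes plausible locations
        candidate = planner(world | w_belief)
        if candidate:
            plans.append(candidate)

    if not plans:                                        # fallback: random world states
        for _ in range(n):
            w_random = {random.choice(sorted(options)) for options in beliefs.values()}
            candidate = planner(world | w_random)
            if candidate:
                plans.append(candidate)
    return plans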
(define (domain alfred)
(:predicates
(isReceptacle ?o - object) ; true if the object is a receptacle
(atReceptacleLocation ?r - object) ; true if the robot is at the receptacle location
(inReceptacle ?o - object ?r - object) ; true if object ?o is in receptacle ?r
(openable ?r - object) ; true if a receptacle is openable
(opened ?r - object) ; true if a receptacle is opened
(isLight ?o - object) ; true if an object is a light source
(examined ?o - object ?l - object) ; whether the object has been looked at with light
(holds ?o - object) ; object ?o is held by robot
(isClean ?o - object) ; true if the object has been cleaned in sink
(isHot ?o - object) ; true if the object has been heated up
(isCool ?o - object) ; true if the object has been cooled
(isSink ?o - object) ; true if the object is a sink
(isMicrowave ?o - object) ; true if the object is a microwave
(isFridge ?o - object) ; true if the object is a fridge
))
Table 3: System Prompt used by gpt-3.5-turbo for generating the :goal in LLM-DP
Table 4: Fixed Few-shot examples used by gpt-3.5-turbo for generating the :goal in LLM-DP
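A sketch of how such a goal-generation call might be assembled is shown below. It is illustrative only: it assumes the pre-1.0 openai Python client, takes the system prompt to be the predicate listing above (the actual prompt in Table 3 may contain additional instructions), leaves the few-shot examples of Table 4 as an unreproduced placeholder, and uses the gpt-3.5-turbo-0613 model at temperature 0 as reported in Section 4.2.

# Illustrative sketch of the :goal generation call (Section 4.2); not the authors' code.
# Assumes the pre-1.0 `openai` Python client. DOMAIN_PREDICATES stands for the PDDL
# predicate listing shown above; FEW_SHOT_EXAMPLES stands in for the fixed examples of
# Table 4, which are not reproduced here.
import openai

DOMAIN_PREDICATES = "(define (domain alfred) (:predicates ...))"  # full listing above
FEW_SHOT_EXAMPLES = []  # list of {"role": ..., "content": ...} messages (Table 4)


def generate_goal(task: str) -> str:
    """Ask the LLM to translate a natural-language task into a PDDL :goal."""
    messages = (
        [{"role": "system", "content": DOMAIN_PREDICATES}]
        + FEW_SHOT_EXAMPLES
        + [{"role": "user", "content": task}]
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",  # model and temperature as reported in Section 4.2
        temperature=0,
        messages=messages,
    )
    return response["choices"][0]["message"]["content"]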
Table 6: Sample of generated PDDL goals from LLM-DP. The generation gets confused by the semantics of 'receptacle' and identifies a mug as a receptacle. While a mug is a receptacle in the everyday sense, in our defined logic receptacles are fixed, immovable objects which can contain other objects; therefore, a mug is not a Receptacle, and planning subsequently fails.
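To make this failure mode concrete, the following hypothetical pair of goals (expressed as Python strings; the object names and formulas are illustrative and not taken from Table 6, whose body is not reproduced here) contrasts a goal the planner can solve with one in which a movable mug is used in the receptacle slot.

# Hypothetical PDDL goals illustrating the failure mode described in Table 6's caption.
# The object names and formulas below are illustrative only.

# A goal of the intended kind: the target object ends up hot and inside a fixed
# receptacle such as a countertop.
valid_goal = """(:goal (exists (?t - object)
    (and (isHot ?t)
         (inReceptacle ?t countertop-1))))"""

# A confused goal: mug-1 is a movable object, not a fixed receptacle in the defined
# logic, so no action sequence can satisfy (inReceptacle ?t mug-1) and planning fails.
invalid_goal = """(:goal (exists (?t - object)
    (and (isHot ?t)
         (inReceptacle ?t mug-1))))"""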