Large Language Models

The document proposes a framework called LLM+P that allows large language models to solve planning problems by leveraging classical planners. LLM+P takes a natural language description of a planning problem, converts it to a format understandable by planners, uses a planner to find a solution, and translates the solution back to natural language. Experiments show LLM+P can solve most planning problems correctly while language models alone fail on most. However, LLM+P currently does not recognize when a prompt requires its planning capabilities.

LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

† Bo Liu∗, † Yuqian Jiang∗, ‡ Xiaohan Zhang
† Qiang Liu, ‡ Shiqi Zhang, † Joydeep Biswas, † Peter Stone
Department of Computer Science
† University of Texas at Austin, ‡ State University of New York at Binghamton
{bliu, lqiang, joydeep, pstone}@cs.utexas.edu
[email protected], {xzhan244,zhangs}@binghamton.edu
arXiv:2304.11477v2 [cs.AI] 5 May 2023

Abstract: Large language models (LLMs) have demonstrated remarkable zero-shot generalization
abilities: state-of-the-art chatbots can provide plausible answers to many common questions
that arise in daily life. However, so far, LLMs cannot reliably solve long-horizon planning
problems. By contrast, classical planners, once a problem is given in a formatted way, can use
efficient search algorithms to quickly identify correct, or even optimal, plans. In an effort
to get the best of both worlds, this paper introduces LLM+P, the first framework that
incorporates the strengths of classical planners into LLMs. LLM+P takes in a natural language
description of a planning problem, then returns a correct (or optimal) plan for solving that
problem in natural language. LLM+P does so by first converting the language description into
a file written in the planning domain definition language (PDDL), then leveraging classical
planners to quickly find a solution, and then translating the found solution back into natural
language. Along with LLM+P, we define a diverse set of benchmark problems taken from common
planning scenarios. Via a comprehensive set of experiments on these benchmark problems, we
find that LLM+P is able to provide optimal solutions for most problems, while LLMs fail to
provide even feasible plans for most problems.²

Keywords: Planning, Large Language Model, In-context Learning

1 Introduction
Ever since the birth of the field, AI researchers have sought to create programs that can converse in
natural language with the same grace and flexibility as people. While even relatively simple models,
such as Eliza from 1966 [1], can generate responses to some prompts that seem reasonable, it has
always been relatively easy to generate prompts that expose their weaknesses compared to people
— their lack of true “understanding.”
While large language models (LLMs) such as GPT-4 [2] and ChatGPT [3] have far surpassed ex-
pectations of just a few years ago, they are no different in this respect. Indeed the internet is now
awash with examples of people reveling in getting ChatGPT to generate output that even a 5-year-old
human child would know to be ill-advised.
Given how LLMs are designed and trained, this phenomenon should come as no surprise. They
are not specifically built to demonstrate understanding. They are trained to generate sequences of
words that might be plausible to a human given a prior context. In the terms of Mahowald et al. [4],
LLMs have become amazingly proficient at linguistic competence — knowing how to say things;
but they are not nearly as good at functional competence — knowing what to say. Specifically, they
can be (relatively) easily fooled by, for example, asking for the result of a straightforward arithmetic
problem that does not appear in their training corpus or by asking for the solution to a novel planning
problem that requires knowledge about how the world works.
∗ Equal contribution.
2 The code and results are publicly available at https://github.com/Cranial-XIX/llm-pddl.git.
A Failure Example of GPT-4 in Planning

Problem (P1): You have 5 blocks. One cannot place more than one block on another block.
b5 is on top of b3. b4 is on top of b2. b2 is on top of b1. b3 is on top of b4. b1 is on the
table. b5 is clear. Your arm is empty. Your goal is to move the blocks. b1 should be on top
of b2. b3 should be on top of b5. b4 should be on top of b1.

GPT-4:
1. Move b5 from b3 to the table.
2. Move b3 from b4 to b5.
3. Move b4 from b2 to the table.
4. Move b2 from b1 to b5.
5. Move b1 from the table to b2.
6. Move b4 from the table to b1.

Does that mean that we should increase efforts to include all arithmetic and planning problems in
their training corpus? Clearly, that is a fool’s errand. On the other hand, why should it be necessary?
We already have calculators and general-purpose symbolic planners that are guaranteed to produce
correct answers. Thus a natural alternative approach, and one that we are admittedly not the first to
explore, is to connect LLMs to such tools.
With this motivation in mind, the objective of the research reported in this paper is, for the first time,
to enable LLMs to solve planning problems correctly. We aim to do so without altering the LLMs
themselves, not even through finetuning [5, 6]. Rather, we introduce a methodology, called LLM+P, by
which, when posed a natural language description of a planning problem, the LLM:

1. outputs a problem description suitable as input to a general-purpose planner;

2. solves the problem using the general-purpose planner; and
3. converts the output of the planner back to natural language.

Our extensive empirical evaluations indicate that LLM+P is able to generate correct solutions to
many more planning problems than are LLMs on their own. While demonstrated in this paper on
planning problems, this general methodology can be applied to any class of problems for which we
have a sound and complete solver, such as arithmetic problems (by leveraging calculators).

Limitation: In this paper, we do not ask the LLM to recognize that it has been posed a prompt that
is suitable for processing using the proposed LLM+P pipeline. A valuable future research direction
will be to consider recognizing when a prompt should be processed by LLM+P.

2 Background
This section introduces the notation we use for representing a planning problem to be solved by
LLMs, and recaps the standard representation of classical planners.

2.1 The Classical Planning Problem

Formally, the input of a planning problem $P$ is defined by a tuple $\langle S, s_{init}, S^G, A, f \rangle$:

• $S$ is a finite and discrete set of states used to describe the world's state (i.e., state space).
We assume a factored state space such that each state $s \in S$ is defined by the values of a
fixed set of variables.
• $s_{init} \in S$ is an initial world state.
• $S^G \subset S$ is a set of goal states. $S^G$ is usually specified as a list of goal conditions, all of
which must hold in a goal state.
• $A$ is a set of symbolic actions.
• $f$ is the underlying state transition function. $f$ takes the current state and an action as input
and outputs the corresponding next state.

[Figure 1 panels: LLM-As-Planner; LLM-As-Planner (In-context Learning); LLM + P (In-context Learning)]

Figure 1: LLM+P makes use of a large language model (LLM) to produce the PDDL description
of the given problem, then leverages a classical planner for finding an optimal plan, then translates
the raw plan back to natural language using the LLM again.

A solution to a planning problem $P$ is a symbolic plan $\pi$ in the form of $\langle a_1, a_2, \ldots, a_N \rangle$, such that the
preconditions of $a_1$ hold in $s_{init}$, the preconditions of $a_2$ hold in the state that results from applying
$a_1$, and so on, with the goal conditions all holding in the state that results after applying $a_N$.
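The definition of a valid plan can be made concrete with a short Python sketch. It is an illustration only, not the validation software used in the experiments. It assumes a STRIPS-style encoding in which a state is a set of ground facts and each action lists preconditions, add effects, and delete effects. The four Blocksworld operators below follow the common textbook formulation and are an assumption here; the paper's own domain file is in its Appendix A.

```python
from typing import FrozenSet, Iterable, NamedTuple


class Action(NamedTuple):
    """A ground action: facts it requires, facts it adds, facts it removes."""
    name: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    delete_effects: FrozenSet[str]


def apply(state: FrozenSet[str], action: Action) -> FrozenSet[str]:
    """Transition function f: return the next state or raise if not applicable."""
    missing = action.preconditions - state
    if missing:
        raise ValueError(f"{action.name}: unmet preconditions {sorted(missing)}")
    return (state - action.delete_effects) | action.add_effects


def is_valid_plan(s_init: Iterable[str], goal: Iterable[str], plan: Iterable[Action]) -> bool:
    """True if every action is applicable in turn and the goal holds at the end."""
    state = frozenset(s_init)
    for action in plan:
        try:
            state = apply(state, action)
        except ValueError:
            return False
    return frozenset(goal) <= state


# Assumed Blocksworld operators (textbook formulation, not taken from the paper).
def pickup(x: str) -> Action:
    pre = frozenset({f"on-table {x}", f"clear {x}", "arm-empty"})
    return Action(f"pickup {x}", pre, frozenset({f"holding {x}"}), pre)


def putdown(x: str) -> Action:
    add = frozenset({f"on-table {x}", f"clear {x}", "arm-empty"})
    return Action(f"putdown {x}", frozenset({f"holding {x}"}), add, frozenset({f"holding {x}"}))


def stack(x: str, y: str) -> Action:
    pre = frozenset({f"holding {x}", f"clear {y}"})
    add = frozenset({f"on {x} {y}", f"clear {x}", "arm-empty"})
    return Action(f"stack {x} {y}", pre, add, pre)


def unstack(x: str, y: str) -> Action:
    pre = frozenset({f"on {x} {y}", f"clear {x}", "arm-empty"})
    add = frozenset({f"holding {x}", f"clear {y}"})
    return Action(f"unstack {x} {y}", pre, add, pre)


if __name__ == "__main__":
    # Two-block example: a sits on b, and the goal is to have b on a.
    init = {"on a b", "on-table b", "clear a", "arm-empty"}
    plan = [unstack("a", "b"), putdown("a"), pickup("b"), stack("b", "a")]
    print(is_valid_plan(init, {"on b a"}, plan))
```

A plan is rejected as soon as one action's preconditions are unmet, which is the kind of failure the paper attributes to plans written directly by an LLM.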

2.2 Planning Domain Definition Language (PDDL)

The planning domain definition language (PDDL) serves as a standardized encoding of classical
planning problems [7, 8]. The PDDL representation of a planning problem $P$ is separated into two
files: a domain file and a problem file. The domain PDDL file provides a lifted representation
of the underlying rules of the world. It includes a set of predicates that define the state space $S$
and the actions (i.e., $A$) with their preconditions and effects (i.e., the transition function $f$). The
problem PDDL file provides a list of objects to ground the domain, the problem's initial state $s_{init}$,
and the goal conditions $S^G$. We provide an example of the two PDDL files in Appendix A. There
exists a rich set of symbolic planners that implement efficient search algorithms to solve planning
problems formalized in PDDL. For a more detailed introduction to PDDL, we refer the readers to
the following slides.³ In this work, we aim to take a natural language prompt which describes the
initial state $s_{init}$ and goal conditions $S^G$, formulate it in PDDL, and leverage symbolic planners to
output correct plans. We assume the domain rules are available (see the assumptions in Section 3.3).

3 Method
This section first uses a running example to investigate ways for formulating planning prompts in
PDDL and then introduces the LLM+P method.

3 We refer the readers to https://www.cs.toronto.edu/~sheila/2542/s14/A1/introtopddl2.pdf
as a good introduction to PDDL.

3.1 Large Language Model as a PDDL Writer

LLMs are bad at planning (or long-horizon reasoning) [9] but they are good at describing and trans-
lating textual inputs, including re-writing planning prompts in the PDDL format. The intuition is
that we can view PDDL as a different language than English, so re-writing planning prompts in
PDDL is essentially a “machine translation” task that LLMs are known to excel at. Here we pro-
vide an example of a PDDL problem file written by GPT-4 without any prompt-engineering. As we
can see, the generated file appears to have the correct PDDL syntax, but uses a made-up predicate
(empty) and misses the initial condition that b1 is on the table.
An Example PDDL Problem File Written by GPT-4

Prompt: Description of Problem (P1) + Provide me with the problem PDDL file that
describes the planning problem directly without further explanations.

GPT-4 (the generated problem PDDL):

(:objects b1 b2 b3 b4 b5 - block)
(:init (on b5 b3) (on b4 b2) (on b2 b1) (on b3 b4) (clear b5)
(empty))
(:goal (and (on b1 b2) (on b3 b5) (on b4 b1)))
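Both defects in the generated file can be caught mechanically before a planner is called. The sketch below is a hypothetical pre-check, not part of LLM+P. It assumes the domain declares the predicates `on`, `on-table`, `clear`, `arm-empty`, and `holding` (the first four appear in the in-context example in Section 3.2; `holding` is an assumption), and it uses regular expressions rather than a full PDDL parser.

```python
import re
from typing import Set

# Assumption: predicates declared by the Blocksworld domain file.
DECLARED_PREDICATES = {"on", "on-table", "clear", "arm-empty", "holding"}
CONNECTIVES = {"and", "or", "not"}

# The problem file GPT-4 produced without any context (copied from the example above).
GENERATED = """
(:objects b1 b2 b3 b4 b5 - block)
(:init (on b5 b3) (on b4 b2) (on b2 b1) (on b3 b4) (clear b5)
(empty))
(:goal (and (on b1 b2) (on b3 b5) (on b4 b1)))
"""


def used_predicates(pddl: str) -> Set[str]:
    """Names that directly follow '(' and are neither keywords nor connectives."""
    names = set(re.findall(r"\(\s*([A-Za-z][\w-]*)", pddl))
    return names - CONNECTIVES


def undeclared_predicates(pddl: str, declared: Set[str]) -> Set[str]:
    """Predicates the problem file uses but the domain never declares."""
    return used_predicates(pddl) - declared


def unsupported_blocks(pddl: str) -> Set[str]:
    """Objects that :init places on no block, not on the table, and not in the arm."""
    match = re.search(r"\(:objects([^)]*)\)", pddl)
    tokens = match.group(1).split() if match else []
    objects, skip = set(), False
    for token in tokens:
        if skip:            # token after '-' is a type name, not an object
            skip = False
        elif token == "-":
            skip = True
        else:
            objects.add(token)
    init = pddl.split("(:init", 1)[1].split("(:goal", 1)[0]
    supported = set(re.findall(r"\(on\s+(\S+)\s+\S+\)", init))
    supported |= set(re.findall(r"\((?:on-table|holding)\s+([^\s)]+)\)", init))
    return objects - supported


if __name__ == "__main__":
    print(undeclared_predicates(GENERATED, DECLARED_PREDICATES))
    print(unsupported_blocks(GENERATED))
```

On the file above, the first check flags `empty` and the second flags `b1`, matching the two problems noted in the text. The second check is specific to Blocksworld and would need a counterpart in every other domain.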

3.2 In-Context Learning

LLMs are known to be capable of in-context learning without finetuning their parameters. By in-
context learning, we mean LLMs’ ability to perform unseen downstream tasks by simply condition-
ing on a few input-label pairs (demonstrations) [10]. Here is a pair of problem descriptions in natural
language and in PDDL provided to GPT-4 as context. When the context is included with the prompt
from the example above, the resulting PDDL problem file is directly solvable by the planner.
An Example PDDL Problem File Written by GPT-4 with Context

Context (shaded):
An example planning problem is: You have 5 blocks. b2 is on top of b5. b5 is on top of b1.
b1 is on top of b4. b3 is on top of b2. b4 is on the table. b3 is clear. Your arm is empty. Your
goal is to move the blocks. b4 should be on top of b3. The problem PDDL is:
(:objects b1 b2 b3 b4 b5)
(:init (arm-empty) (on b1 b4) (on b2 b5) (on b3 b2)
(on-table b4) (on b5 b1) (clear b3))
(:goal (and (on b4 b3))))
Prompt: Description of Problem (P1) + Context + Provide me with the problem PDDL file
that describes the planning problem directly without further explanations.

GPT-4 (the generated problem PDDL):

(:objects b1 b2 b3 b4 b5 )
(:init (arm-empty) (on-table b1) (on b2 b1) (on b3 b4)
(on b4 b2) (on b5 b3) (clear b5)))
(:goal (and (on b1 b2) (on b3 b5) (on b4 b1)))

Planner:
(unstack b5 b3) (putdown b5) (unstack b3 b4) (stack b3 b5)
(unstack b4 b2) (putdown b4) (unstack b2 b1) (putdown b2)
(pickup b1) (stack b1 b2) (pickup b4) (stack b4 b1)
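The prompt layout in the example above (problem description, then context, then the instruction) can be assembled with a few lines of Python. This is a minimal sketch; the function name and the separator between parts are assumptions, and the prompt format in the released code may differ. The instruction sentence and the wording around the example are quoted from the example above.

```python
from typing import Optional, Tuple

# Instruction sentence quoted from the example prompt above.
INSTRUCTION = (
    "Provide me with the problem PDDL file that describes the planning "
    "problem directly without further explanations."
)


def build_translation_prompt(
    problem_nl: str, example: Optional[Tuple[str, str]] = None
) -> str:
    """Assemble: problem description + optional (example, PDDL) context + instruction.

    Passing example=None yields the no-context variant of the prompt.
    """
    parts = [problem_nl.strip()]
    if example is not None:
        example_nl, example_pddl = example
        parts.append(
            f"An example planning problem is: {example_nl.strip()} "
            f"The problem PDDL is:\n{example_pddl.strip()}"
        )
    parts.append(INSTRUCTION)
    return "\n\n".join(parts)


if __name__ == "__main__":
    example = (
        "You have 5 blocks. b2 is on top of b5. Your goal is to move the blocks.",
        "(:objects b1 b2 b3 b4 b5)",
    )
    print(build_translation_prompt("You have 5 blocks. b5 is on top of b3.", example))
```

Keeping the context optional makes it easy to reproduce both settings compared later in the experiments, with and without the demonstration pair.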

3.3 Large Language Model + Classical Planner (LLM+P)

Having introduced the LLM's ability to encode problems in PDDL and in-context learning, we are
ready to introduce the proposed LLM+P solution (the bottom of Figure 1). In particular, we assume
that for each problem domain, a human expert can provide a domain description (i.e., action
preconditions and effects) that will be fixed for all problem instances that happen in that domain.
While the problem of automatically generating the description is another valuable research question,
in this work we assume that the description is available as a PDDL domain file. The LLM+P
method is directly applicable as a natural language interface for giving tasks to robot systems. For
instance, assume we want a robot to act as a bartender to make cocktails. It is reasonable to tell it
what actions it can take, but leave it to infer how to make new cocktails most efficiently given
a set of ingredients to combine. Moreover, we assume the agent is provided with a minimal example
that demonstrates what a problem PDDL looks like for a simple problem inside that
domain. Next, the agent is provided with a new (potentially quite complicated) problem (P). The
LLM then uses in-context learning to infer the problem PDDL file corresponding to P. Once the
problem PDDL file is generated, we feed it into any classical planner, together with the provided
domain PDDL file, to generate a PDDL plan [11]. In the end, the LLM translates the PDDL plan
back into natural language to finish up the LLM+P pipeline.
To summarize, the assumptions we need for LLM+P are:

1. A chatbot knows when to trigger LLM+P based on its conversation with a human user.
2. A domain PDDL file is provided for the problem the user asks for.
3. A simple problem description in natural language and its corresponding problem PDDL
file are also provided beforehand.
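The three stages described in this section can be sketched as one function. This is a hedged illustration, not the authors' implementation: `llm` and `planner` are hypothetical callables supplied by the caller (for example, wrappers around an LLM API and a classical planner), and the wording of the back-translation prompt is an assumption, since the paper does not quote it in this section.

```python
from typing import Callable


def llm_plus_p(
    translation_prompt: str,
    domain_pddl: str,
    llm: Callable[[str], str],
    planner: Callable[[str, str], str],
) -> str:
    """Run the LLM+P stages: NL -> problem PDDL -> PDDL plan -> NL plan.

    translation_prompt: problem description, context example, and instruction.
    domain_pddl: the expert-provided domain file (assumption 2 above).
    llm: hypothetical callable mapping a prompt to the model's text response.
    planner: hypothetical callable mapping (domain PDDL, problem PDDL) to a plan.
    """
    # Stage 1: the LLM writes the problem PDDL file via in-context learning.
    problem_pddl = llm(translation_prompt)

    # Stage 2: a classical planner searches for a plan.
    pddl_plan = planner(domain_pddl, problem_pddl)

    # Stage 3: the LLM translates the raw plan back to natural language.
    # The prompt wording below is illustrative only.
    back_translation_prompt = (
        "Translate the following PDDL plan into natural language steps:\n" + pddl_plan
    )
    return llm(back_translation_prompt)
```

Passing the model and the planner in as callables keeps the sketch independent of any particular API, which reflects the claim that any classical planner can be plugged in at stage 2.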

4 Related Work
This section first provides a brief overview of classical planning algorithms. Then it summarizes
recent advances in using large language models for planning tasks. It concludes with a discussion
of recent research on augmenting LLMs with external modules.

Classical Planning. Automated planning (or classical planning) techniques can be used for
computing a sequence of actions that achieves a given goal [12, 13, 14]. Automated planning algorithms
have been widely used in robot systems. Shakey was the first robot equipped with a planning
component, which was constructed using STRIPS [15]. Some previous general-purpose planning
architectures were also demonstrated to be useful for robot planning, such as PRODIGY [16]
and HTN [17]. Recent classical planning systems designed for robotics frequently use the planning
domain definition language (PDDL) or answer set programming (ASP) as the underlying action
language for the planners [18, 19, 20, 21]. For example, researchers have used classical planning
algorithms for sequencing actions for a mobile robot working on delivery tasks [22], reasoning about
safe and efficient urban driving behaviors for autonomous vehicles [23], and planning actions for a
team of mobile robots [24]. Task and motion planning (TAMP) is a hierarchical planning framework
that combines classical planning in discrete spaces and robot motion planning in continuous
space [25, 26].
Most of the above-mentioned planning methods require domain-specific programming languages as
the underlying representation of the problems and their solutions. LLM+P, on the other hand, takes
advantage of LLMs to use natural language descriptions for solving complex planning tasks. The
main feature that motivates us to use such classical planning systems is that most of these planners
are sound and complete, meaning that they are guaranteed to be logically correct and will output a
plan if one exists. Many are also able to find optimal (shortest) plans, at least if given sufficient time.

Planning with Large Language Models. Various large language models (LLMs) have been developed
in recent years, such as BERT [27], Codex [28], OPT [29], GPT-3 [10], ChatGPT [30], GPT-4 [2],
LLaMA [31], and PaLM [32]. As LLMs are pretrained on a tremendous amount of offline text
data, they exhibit surprising zero-shot generalization ability, which can be leveraged for
robot planning tasks [33, 34, 35, 36, 37, 38, 39, 40, 41, 42]. Several recent methods have succeeded in
extracting task knowledge from LLMs to decompose commands or instructions for robots in natural
language. For instance, the work of Huang et al. showed that LLMs can be used for task planning
in household domains by iteratively augmenting prompts [38]. SayCan is another approach that
enabled robot planning with affordance functions to account for action feasibility, where the service
requests are specified in natural language [33]. Vemprala et al. [3] recently studied how ChatGPT can
be applied to generalized robotics domains.

However, a major drawback of existing LLMs is their lack of long-horizon reasoning and planning
ability for complex tasks (see [9, 43] and Section 8.2 of [2]). Specifically, the output they produce
when presented with such a task is often incorrect in the sense that following the output plan will
not actually solve the task. Therefore, in this work, we focus on resolving this issue by leveraging
the properties of classical planners. Similarly, some recent work also investigates approaches for
combining classical planning with LLMs [44, 45]. They use either prompting or fine-tuning to make
LLMs capable of solving PDDL planning problems. Improvements to long-horizon planning
capabilities have also been made by iteratively querying LLMs, as demonstrated in Minecraft [46]. In
contrast, we do not rely solely on the LLM as the problem solver, but instead take advantage
of both the planner (i.e., generating accurate and optimal plans) and the LLM itself (i.e., 1-shot
generalization for translating natural-language problem descriptions into PDDL).

Augmenting LLMs with External Modules. Recently developed methods have shown that the
performance of LLMs on downstream tasks can be improved by combining them with external
modules. For instance, WebGPT [47] is a version of GPT-3 fine-tuned to draw on web knowledge
when answering open-ended questions. Lazaridou et al. studied how search engines like Google can be
utilized as external tools for LLMs. MemPrompt [49] presented a human-in-the-loop system where
a growing memory of errors and user feedback serves as past experience that is added to the prompts
to answer new questions more accurately. REPLUG [50] is another retrieval-augmented language
modeling paradigm that treats the language model as a black box and augments it with a tuneable
retrieval model. Calculators have also been investigated as tools for computation [51, 52].
In very recent work related to ours, Schick et al. trained a model called ToolFormer that can decide
when and how to call certain tool APIs by in-line augmentation on prompts for LLMs. In this work,
we propose that classical planners can be another particularly useful external module. In
comparison, LLM+P does not rely on any fine-tuning or re-training of LLMs. By simply incorporating
knowledge from classical planners, LLM+P adds long-horizon reasoning and planning
capabilities to existing LLMs.
We are aware of a concurrent work [54] that presents preliminary results of integrating
LLMs with PDDL using the SayCan dataset [33]. However, the SayCan dataset has a limited scope,
as it contains only three predefined actions. Consequently, all model variants evaluated in the
original paper achieved a success rate of approximately 90%. Due to the homogeneity of the SayCan
dataset, Lyu et al. did not need a rigorous definition of the domain PDDL, which can lead to
infeasible plans. As a result, we consider our LLM+P method a more comprehensive investigation
into enhancing LLMs with optimal planning proficiency.

5 Experiments

We conduct extensive experiments to answer the following three questions:

1. How well does LLM-AS-P work? That is, to what extent can LLMs be directly used for
planning? (Not at all)
2. How well does LLM+P work compared to LLM-AS-P? (Much better)
3. What role does the context play in the success of LLM+P? (It's crucial)

To address these questions, we first introduce a set of benchmark problems borrowed from the
classical planning community. Then, we comprehensively evaluate both LLM-AS-P and LLM+P
on the proposed benchmark problems. In the end, we provide the results, analysis, and example
failure cases that could lead to future improvement.

5.1 Benchmark Problems

We present seven planning domains that are frequently utilized in classical planning competitions.
Moreover, we have developed an automatic generation script for each domain, which can generate a
virtually unlimited number of tasks in natural language. To ensure fair and reproducible benchmark-
ing, we restrict each domain to a maximum of 20 tasks. Below is a list of the planning domains,
along with a brief summary of each.

1. BLOCKSWORLD: Given a set of piles of blocks on a table, a robot is tasked with rearranging
them into a specified target configuration while obeying the laws of physics.
2. BARMAN: A bartender is tasked with creating cocktails for a customer's order, utilizing
the available ingredients and aiming to use the minimum number of steps possible.
3. FLOORTILE: A set of robots are tasked to use different colors to paint patterns on floor
tiles. Robots can move around and change colors on their palettes but are restricted by
some motion constraints and color sequences.
4. GRIPPERS: A set of robots with grippers is given a task to move objects among different
rooms.
5. STORAGE: Given a set of hoists, the goal is to lift and drop crates using the hoists into a
depot. Crates are initially stored in different areas and hoists can be moved among storage
areas.
6. TERMES: A robot's goal is to build complex structures by carrying and placing blocks, and
also climbing on them so that it can move to neighboring blocks.
7. TYREWORLD: The robot is given a task to replace flat tires by, for example, inflating tires,
tightening nuts, and moving tools back to the trunk when done, all in the proper order.

For each domain, we implement an automatic and procedural generation script to produce 20
different planning problems for reproducibility. The generation scripts are made publicly
available in our codebase for the convenience of future research. Each problem P comes with
a natural language description and a ground-truth problem PDDL file. We assume each problem
domain has its own domain PDDL file given by the user or a domain expert prior to addressing any
planning problems in that domain.

5.2 Experiment Setup

We use the TEXT-DAVINCI-003 model⁴ provided by OpenAI⁵ for all experiments, as it is the
latest text completion model and has the highest text completion quality among all models we have
access to. We set the max token number to 2048 and the temperature to 0, and use the top probability
response. As a result, the response returned from the LLM is deterministic. Once a text PDDL
response is generated, we feed it into the FAST-DOWNWARD planner,⁶ using the alias LAMA with a
maximum search time of 200 seconds. When the planner finds multiple solutions for a given problem,
we pick the solution with the fewest steps for validation. Note that we put a hard constraint on
the search time for the convenience of running experiments. Therefore, it is possible that some of
the found plans are near-optimal but not optimal. However, in theory, an optimal plan can always be
found in finite time.⁷ We run the experiment three times independently and report the average and
standard error of the success rate.
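The planner settings above can be sketched as two small helpers. This is an illustration under stated assumptions, not the authors' experiment script: it assumes the Fast Downward driver is reachable as `fast-downward.py`, that its `--alias lama` and `--search-time-limit` options are available in the installed version, and that each plan file lists one parenthesised action per line followed by a comment line starting with `;`. The helper names are hypothetical.

```python
from typing import List, Sequence


def fast_downward_command(
    domain_path: str,
    problem_path: str,
    search_seconds: int = 200,
    driver: str = "fast-downward.py",  # assumed location of the driver script
) -> List[str]:
    """Build the argument list for one planner run with the LAMA alias."""
    return [
        driver,
        "--alias", "lama",
        "--search-time-limit", f"{search_seconds}s",
        domain_path,
        problem_path,
    ]


def plan_steps(plan_text: str) -> List[str]:
    """Extract action lines, skipping comments such as the trailing cost line."""
    return [
        line.strip()
        for line in plan_text.splitlines()
        if line.strip().startswith("(")
    ]


def shortest_plan(plan_texts: Sequence[str]) -> List[str]:
    """Among several plans for one problem, keep the one with the fewest steps.

    Raises ValueError if plan_texts is empty (no plan was found).
    """
    return min((plan_steps(text) for text in plan_texts), key=len)


if __name__ == "__main__":
    print(" ".join(fast_downward_command("domain.pddl", "p01.pddl")))
```

The command list can be passed to `subprocess.run` on a machine where the planner is installed; the plan files it writes can then be read and handed to `shortest_plan`.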

5.3 Results and Analysis

The results of applying LLM-AS-P and LLM+P across 7 domains are provided in Table 1.
Findings (LLM-AS-P):

1. We observe that though LLM-AS-P provides a plan in natural language for every problem,
most of these plans are not feasible. The main reason is that LLM-AS-P lacks the ability
to reason about preconditions. Moreover, adding context (e.g., LLM-AS-P (w/ context))
does not increase the success rate unless the new solution can be built from the example
solution as a template.
4 At the time of writing, the authors cannot access the GPT-4 API, so we use TEXT-DAVINCI-003, as
it generates the best results on the benchmarked problems.
5 We use the OpenAI API following the instructions at
https://platform.openai.com/docs/quickstart/build-your-application.
6 See https://www.fast-downward.org/ for details on the FAST-DOWNWARD software. We use the
implementation from https://github.com/aibasel/downward.
7 We use automatic validation software to validate whether the predicted problem PDDL file is correct for
the given problem.

                                 Success Rate %
Domain        LLM-AS-P (no ctx.)   LLM-AS-P     LLM+P (no ctx.)   LLM+P
BARMAN        0                    0            0                 85.0 ±0.0
BLOCKSWORLD   0                    0            0                 56.7 ±2.9
FLOORTILE     0                    0            0                 53.3 ±2.9
GRIPPERS      10.0                 15.0         0                 95.0 ±0.0
STORAGE       0                    0            0                 65.0 ±0.0
TERMES        0                    0            0                 31.7 ±2.9
TYREWORLD     0                    40.0 ±5.0    0                 80.0 ±0.0

Table 1: Success rate of applying LLM-AS-P, LLM+P without context, and LLM+P. Main
messages: 1) LLM+P performs significantly better than LLM-AS-P and can solve most complicated
planning problems; 2) Without conditioning on the context, LLM+P fails completely; 3) LLM+P
is able to solve many of the problems correctly. Each experiment is conducted three times and we
report the mean and the standard error. The randomness exhibited in the results comes solely from
the planner's search algorithm. We bold the best result in each domain.

2. The TYREWORLD domain has the largest performance boost from LLM-AS-P (w/o context)
(0% success) to LLM-AS-P (w/ context) (40%). This domain has a fixed sequence of
actions for replacing each tire, and the problems only vary in the number of tires that have
to be replaced.
3. In domains other than TYREWORLD, LLM-AS-P fails in the same way with or without the
example plan as context. In particular, in the BLOCKSWORLD domain, LLM-AS-P cannot
keep track of properties like ON and CLEAR. In the GRIPPERS domain, the robot can only
pick up balls when they are in the same room, but most of the LLM-AS-P plans violate
this condition (see Appendix C).
4. The hardest domains are the ones with complex spatial relationships, e.g., FLOORTILE,
TERMES, and STORAGE require reasoning about connectivity and directions in a grid
world. The LLM-AS-P methods (with or without context) completely fail at this type of
problem. For example, LLM-AS-P generated "move right to tile 0-4 and paint tile 1-2
black" but the robot can only paint neighboring tiles.

Findings (LLM+P):

1. The proposed LLM+P produces an optimal plan for the majority of problems. Most failed
cases are due to mis-specified problem files, such as missing one of the initial conditions
causing the planning problem to be unsolvable.
2. Without the context (i.e., an example problem and its corresponding problem PDDL), we
observe that LLMs fail to produce correct problem PDDL files. The failures of LLM+P
(no context) come entirely from incorrect problem encodings. Therefore, the context is
important for LLM+P to work.

For a more detailed case study on the failure plans produced by both the baseline methods and
LLM-AS-P, please refer to Appendix C.

6 Conclusion and Future Work

In this work, we propose to leverage classical planners to empower large language models with
optimal planning capabilities. The key design choice of the proposed LLM+P framework is to
focus LLMs on translating the planning problem from natural language to structured PDDL format.
Moreover, we show that it is important to also make LLMs aware of a simple (problem, PDDL) pair
as a demonstration (or the context) for in-context learning. Some interesting directions to further
extend the LLM+P framework include: 1) enabling the LLM to auto-detect when and how to apply
LLM+P; and 2) reducing LLM+P's dependency on information provided by humans, potentially
involving finetuning.

References
[1] J. Weizenbaum. Eliza—a computer program for the study of natural language communication
between man and machine. Communications of the ACM, 9(1):36–45, 1966.
[2] OpenAI. GPT-4 technical report, 2023.
[3] S. Vemprala, R. Bonatti, A. Bucker, and A. Kapoor. ChatGPT for robotics: Design
principles and model abilities. Technical Report MSR-TR-2023-8, Microsoft, February 2023. URL
https://www.microsoft.com/en-us/research/publication/chatgpt-for-robotics-design-principles-and-model-abilities/.
[4] K. Mahowald, A. A. Ivanova, I. A. Blank, N. Kanwisher, J. B. Tenenbaum, and E. Fedorenko.
Dissociating language and thought in large language models: a cognitive perspective. arXiv
preprint arXiv:2301.06627, 2023.
[5] C. Lee, K. Cho, and W. Kang. Mixout: Effective regularization to finetune large-scale pre-
trained language models. arXiv preprint arXiv:1909.11299, 2019.
[6] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le.
Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
[7] D. McDermott, M. Ghallab, A. Howe, C. Knoblock, A. Ram, M. Veloso, D. Weld, and
D. Wilkins. PDDL: The planning domain definition language. 1998.
[8] P. Haslum, N. Lipovetzky, D. Magazzeni, and C. Muise. An introduction to the planning do-
main definition language. Synthesis Lectures on Artificial Intelligence and Machine Learning,
13(2):1–187, 2019.
[9] K. Valmeekam, A. Olmo, S. Sreedharan, and S. Kambhampati. Large language models still
can’t plan (a benchmark for llms on planning and reasoning about change). arXiv preprint
arXiv:2206.10498, 2022.
[10] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances
in neural information processing systems, 33:1877–1901, 2020.
[11] M. Helmert. The fast downward planning system. Journal of Artificial Intelligence Research,
26:191–246, 2006.
[12] T. Bylander. The computational complexity of propositional STRIPS planning. Artificial
Intelligence, 69(1-2):165–204, 1994.
[13] J. McCarthy. Situations, actions, and causal laws. Technical report, Stanford University Tech-
nical Report, 1963.
[14] R. E. Fikes and N. J. Nilsson. Strips: A new approach to the application of theorem proving to
problem solving. Artificial intelligence, 2(3-4):189–208, 1971.
[15] N. J. Nilsson et al. Shakey the robot. 1984.
[16] J. Carbonell, O. Etzioni, Y. Gil, R. Joseph, C. Knoblock, S. Minton, and M. Veloso. Prodigy:
An integrated architecture for planning and learning. ACM SIGART Bulletin, 2(4):51–55, 1991.
[17] D. S. Nau, T.-C. Au, O. Ilghami, U. Kuter, J. W. Murdock, D. Wu, and F. Yaman. Shop2: An
htn planning system. Journal of artificial intelligence research, 2003.
[18] Y.-q. Jiang, S.-q. Zhang, P. Khandelwal, and P. Stone. Task planning in robotics: an empirical
comparison of pddl-and asp-based systems. Frontiers of Information Technology & Electronic
Engineering, 20:363–373, 2019.
[19] G. Brewka, T. Eiter, and M. Truszczyński. Answer set programming at a glance.
Communications of the ACM, 54(12):92–103, 2011.
[20] V. Lifschitz. Answer set programming and plan generation. Artificial Intelligence, 138(1-2):
39–54, 2002.

[21] M. Fox and D. Long. PDDL2.1: An extension to PDDL for expressing temporal planning domains.
Journal of artificial intelligence research, 20:61–124, 2003.
[22] S. Zhang, F. Yang, P. Khandelwal, and P. Stone. Mobile robot planning using action language
bc with an abstraction hierarchy. In International Conference on Logic Programming and
Nonmonotonic Reasoning, pages 502–516. Springer, 2015.
[23] Y. Ding, X. Zhang, X. Zhan, and S. Zhang. Task-motion planning for safe and efficient urban
driving. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS),
2020.
[24] Y. Jiang, H. Yedidsion, S. Zhang, G. Sharon, and P. Stone. Multi-robot planning with conflicts
and synergies. Autonomous Robots, 43(8):2011–2032, 2019.
[25] F. Lagriffoul, N. T. Dantam, C. Garrett, A. Akbari, S. Srivastava, and L. E. Kavraki. Platform-
independent benchmarks for task and motion planning. IEEE Robotics and Automation Letters,
3(4):3765–3772, 2018.
[26] L. P. Kaelbling and T. Lozano-Pérez. Integrated task and motion planning in belief space. The
International Journal of Robotics Research, 32(9-10):1194–1227, 2013.
[27] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[28] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda,
N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv
preprint arXiv:2107.03374, 2021.
[29] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li,
X. V. Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint
arXiv:2205.01068, 2022.
[30] OpenAI. ChatGPT. Accessed: 2023-02-08, 2023. URL https://openai.com/blog/chatgpt/.
[31] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière,
N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models.
arXiv preprint arXiv:2302.13971, 2023.
[32] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W.
Chung, C. Sutton, S. Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv
preprint arXiv:2204.02311, 2022.
[33] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, K. Gopalakrishnan,
K. Hausman, A. Herzog, et al. Do as i can, not as i say: Grounding language in robotic
affordances. arXiv preprint arXiv:2204.01691, 2022.
[34] Y. Ding, X. Zhang, C. Paxton, and S. Zhang. Task and motion planning with large language
models for object rearrangement. arXiv preprint arXiv:2303.06247, 2023.
[35] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson,
Q. Vuong, T. Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint
arXiv:2303.03378, 2023.
[36] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch,
Y. Chebotar, et al. Inner monologue: Embodied reasoning through planning with language
models. arXiv preprint arXiv:2207.05608, 2022.
[37] Y. Ding, X. Zhang, S. Amiri, N. Cao, H. Yang, C. Esselink, and S. Zhang. Robot task planning
and situation handling in open worlds. arXiv preprint arXiv:2210.01287, 2022.
[38] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch. Language models as zero-shot planners:
Extracting actionable knowledge for embodied agents. In International Conference on Machine
Learning, pages 9118–9147. PMLR, 2022.

[39] Y. Kant, A. Ramachandran, S. Yenamandra, I. Gilitschenski, D. Batra, A. Szot, and
H. Agrawal. Housekeep: Tidying virtual households using commonsense reasoning. In
Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022,
Proceedings, Part XXXIX, pages 355–373. Springer, 2022.
[40] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and
A. Garg. Progprompt: Generating situated robot task plans using large language models. arXiv
preprint arXiv:2209.11302, 2022.
[41] K. Lin, C. Agia, T. Migimatsu, M. Pavone, and J. Bohg. Text2motion: From natural language
instructions to feasible plans. arXiv preprint arXiv:2303.12153, 2023.
[42] Y. Yang, J.-R. Gaglione, C. Neary, and U. Topcu. Automaton-based representations of task
knowledge from generative language models. arXiv preprint arXiv:2212.01944, 2023.
[43] K. Valmeekam, S. Sreedharan, M. Marquez, A. Olmo, and S. Kambhampati. On the planning
abilities of large language models (a critical investigation with a proposed benchmark). arXiv
preprint arXiv:2302.06706, 2023.
[44] T. Silver, V. Hariprasad, R. S. Shuttleworth, N. Kumar, T. Lozano-Pérez, and L. P. Kaelbling.
PDDL planning with pretrained large language models. In NeurIPS 2022 Foundation Models
for Decision Making Workshop, 2022. URL https://openreview.net/forum?id=1QMMUB4zfl.
[45] V. Pallagani, B. Muppasani, K. Murugesan, F. Rossi, L. Horesh, B. Srivastava, F. Fabiano,
and A. Loreggia. Plansformer: Generating symbolic plans using transformers. arXiv preprint
arXiv:2212.08681, 2022.
[46] Z. Wang, S. Cai, A. Liu, X. Ma, and Y. Liang. Describe, explain, plan and select: Interactive
planning with large language models enables open-world multi-task agents. arXiv preprint
arXiv:2302.01560, 2023.
[47] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju,
W. Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv
preprint arXiv:2112.09332, 2021.
[48] A. Lazaridou, E. Gribovskaya, W. Stokowiec, and N. Grigorev. Internet-augmented language
models through few-shot prompting for open-domain question answering. arXiv preprint
arXiv:2203.05115, 2022.
[49] A. Madaan, N. Tandon, P. Clark, and Y. Yang. Memory-assisted prompt editing to improve
gpt-3 after deployment, 2023.
[50] W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, and W.-t. Yih.
Replug: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652,
2023.
[51] W. Chen, X. Ma, X. Wang, and W. W. Cohen. Program of thoughts prompting: Disentangling
computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588,
2022.
[52] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig. Pal: Program-
aided language models. arXiv preprint arXiv:2211.10435, 2022.
[53] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and
T. Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint
arXiv:2302.04761, 2023.
[54] Q. Lyu, S. Havaldar, A. Stein, L. Zhang, D. Rao, E. Wong, M. Apidianaki, and C. Callison-
Burch. Faithful chain-of-thought reasoning. arXiv preprint arXiv:2301.13379, 2023.

A Problem and Domain PDDL Files

Problem PDDL Example

(define (problem BW-rand-5)

(:domain blocksworld-4ops)
(:objects b1 b2 b3 b4 b5 )
(:init
(arm-empty)
(on b1 b2)
(on b2 b3)
(on b3 b5)
(on b4 b1)
(on-table b5)
(clear b4)
)
(:goal
(and
(on b1 b3)
(on b3 b5))
)
)

Domain PDDL Example

(define (domain blocksworld-4ops)

(:requirements :strips)
(:predicates (clear ?x)
(on-table ?x)
(arm-empty)
(holding ?x)
(on ?x ?y))

(:action pickup
:parameters (?ob)
:precondition (and (clear ?ob) (on-table ?ob) (arm-empty))
:effect (and (holding ?ob) (not (clear ?ob))
(not (on-table ?ob)) (not (arm-empty))))

(:action putdown
:parameters (?ob)
:precondition (holding ?ob)
:effect (and (clear ?ob) (arm-empty) (on-table ?ob)
(not (holding ?ob))))

(:action stack
:parameters (?ob ?underob)
:precondition (and (clear ?underob) (holding ?ob))
:effect (and (arm-empty) (clear ?ob) (on ?ob ?underob)
(not (clear ?underob)) (not (holding ?ob))))

(:action unstack
:parameters (?ob ?underob)
:precondition (and (on ?ob ?underob) (clear ?ob) (arm-empty))
:effect (and (holding ?ob) (clear ?underob)
(not (on ?ob ?underob)) (not (clear ?ob))
(not (arm-empty)))))
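
Taken together, the two files above determine which action sequences are valid plans. As a minimal sketch (not the planner used in the paper), the Python snippet below hand-encodes the four blocksworld-4ops actions as precondition, add, and delete sets and checks a candidate plan against BW-rand-5. The names `ground` and `validate` and the candidate plan itself are illustrative assumptions; the plan is only shown to be valid, with no claim that it is optimal.

```python
from typing import List, Set, Tuple

Fact = Tuple[str, ...]


def ground(name: str, args: Tuple[str, ...]) -> Tuple[Set[Fact], Set[Fact], Set[Fact]]:
    """Return (preconditions, add effects, delete effects) of a grounded action.

    In blocksworld-4ops every action deletes exactly its own preconditions,
    so the delete set is a copy of the precondition set.
    """
    if name == "pickup":
        (ob,) = args
        pre = {("clear", ob), ("on-table", ob), ("arm-empty",)}
        add = {("holding", ob)}
    elif name == "putdown":
        (ob,) = args
        pre = {("holding", ob)}
        add = {("clear", ob), ("arm-empty",), ("on-table", ob)}
    elif name == "stack":
        ob, under = args
        pre = {("clear", under), ("holding", ob)}
        add = {("arm-empty",), ("clear", ob), ("on", ob, under)}
    elif name == "unstack":
        ob, under = args
        pre = {("on", ob, under), ("clear", ob), ("arm-empty",)}
        add = {("holding", ob), ("clear", under)}
    else:
        raise ValueError(f"unknown action: {name}")
    return pre, add, set(pre)


def validate(init: Set[Fact], goal: Set[Fact], plan: List[Tuple[str, ...]]) -> Tuple[bool, str]:
    """Apply the plan step by step and report the first problem, if any."""
    state = set(init)
    for step, (name, *args) in enumerate(plan, start=1):
        pre, add, delete = ground(name, tuple(args))
        if not pre <= state:
            return False, f"step {step} ({name}) has unmet preconditions"
        state = (state - delete) | add
    if not goal <= state:
        return False, "plan ends without reaching the goal"
    return True, "plan is valid"


# Initial state and goal copied from the BW-rand-5 problem file.
INIT = {("arm-empty",), ("on", "b1", "b2"), ("on", "b2", "b3"), ("on", "b3", "b5"),
        ("on", "b4", "b1"), ("on-table", "b5"), ("clear", "b4")}
GOAL = {("on", "b1", "b3"), ("on", "b3", "b5")}

# A hand-written candidate plan (an assumption for illustration).
PLAN = [("unstack", "b4", "b1"), ("putdown", "b4"),
        ("unstack", "b1", "b2"), ("putdown", "b1"),
        ("unstack", "b2", "b3"), ("putdown", "b2"),
        ("pickup", "b1"), ("stack", "b1", "b3")]

if __name__ == "__main__":
    print(validate(INIT, GOAL, PLAN))
```

A check of this kind only confirms validity; finding a plan, and in particular an optimal one, is the job of the classical planner.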

B Prompts for Different Methods

The Prompt for LLM-AS-P (no context)

DOMAIN-NL. Now consider a planning problem. The problem description is: TASK-NL.
Can you provide an optimal plan, in the way of a sequence of behaviors, to solve the
problem?

The Prompt for LLM-AS-P (with context)

DOMAIN-NL. An example planning problem is: EXAMPLE-NL. A plan for the example
problem is: EXAMPLE-SOL. Now I have a new planning problem and its description is:
TASK-NL. Can you provide an optimal plan, in the way of a sequence of behaviors, to
solve the problem?

The Prompt for LLM+P (no context)

DOMAIN-NL. Now consider a planning problem. The problem description is: TASK-NL.
Provide me with the problem PDDL file that describes the planning problem directly without
further explanations.

The Prompt for LLM+P (with context)

DOMAIN-NL. An example planning problem is: EXAMPLE-NL. The problem PDDL
file to this problem is: EXAMPLE-PDDL. Now I have a new planning problem and its
description is: TASK-NL. Provide me with the problem PDDL file that describes the planning
problem directly without further explanations.
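
The placeholders in the prompts above (DOMAIN-NL, EXAMPLE-NL, EXAMPLE-PDDL, TASK-NL) are filled with text before the prompt is sent to the LLM. A minimal sketch of that substitution for the LLM+P (with context) prompt follows; the function and constant names are hypothetical, and stripping a trailing period from each input is an assumption made here to avoid doubled punctuation.

```python
# Template text follows the "LLM+P (with context)" prompt; names are illustrative.
LLM_P_WITH_CONTEXT = (
    "{domain_nl}. An example planning problem is: {example_nl}. "
    "The problem PDDL file to this problem is: {example_pddl}. "
    "Now I have a new planning problem and its description is: {task_nl}. "
    "Provide me with the problem PDDL file that describes the planning "
    "problem directly without further explanations."
)


def _clean(text: str) -> str:
    """Trim whitespace and one trailing period so the template adds its own."""
    text = text.strip()
    return text[:-1] if text.endswith(".") else text


def build_llm_p_prompt(domain_nl: str, example_nl: str,
                       example_pddl: str, task_nl: str) -> str:
    """Fill the four placeholders of the with-context LLM+P prompt."""
    return LLM_P_WITH_CONTEXT.format(
        domain_nl=_clean(domain_nl),
        example_nl=_clean(example_nl),
        example_pddl=_clean(example_pddl),
        task_nl=_clean(task_nl),
    )


if __name__ == "__main__":
    print(build_llm_p_prompt(
        "The robot has four actions: pickup, putdown, stack, and unstack.",
        "You have 2 blocks. b1 is on top of b2.",
        "(define (problem example) (:domain blocksworld-4ops))",
        "You have 3 blocks. b3 is on top of b2.",
    ))
```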

C Example Failures
A list of representative failures for both the baseline method (LLM-AS-P) and LLM+P is provided in this
section.
Failure 1 of LLM-AS-P (without context)

Problem (Barman):

You have 1 shaker with 3 levels, 4 shot glasses, 3 dispensers for 3 ingredients. The shaker
and shot glasses are clean, empty, and on the table.

Your left and right hands are empty. The first ingredient of cocktail1 is ingredient1.
The second ingredient of cocktail1 is ingredient3. The first ingredient of cocktail2 is
ingredient2. The second ingredient of cocktail2 is ingredient3. The first ingredient of
cocktail3 is ingredient1. The second ingredient of cocktail3 is ingredient2.

Your goal is to make 3 cocktails. shot1 contains cocktail1. shot2 contains cocktail3.
shot3 contains cocktail2.

GPT-3.5:

1. Grasp the shaker with your left hand.
2. Grasp shot glass 1 with your right hand.
3. Fill shot glass 1 with ingredient 1 from the dispenser.
4. Pour ingredient 1 from shot glass 1 to the shaker.
5. Grasp shot glass 2 with your right hand.
6. Fill shot glass 2 with ingredient 2 from the dispenser.
7. Pour ingredient 2 from shot glass 2 to the shaker.
8. Grasp shot glass 3 with your right hand.
9. Fill shot glass 3 with ingredient 3 from the dispenser.
10. Pour ingredient 3 from shot glass 3 to the shaker.
11. Shake the shaker.
12. Pour the cocktail1 from the shaker to shot glass 1.
13. Empty shot glass 1.
14. Clean shot glass 1.
15. Pour the cocktail2 from the shaker to shot glass 2.
16. Empty shot glass 2.
17. Clean shot glass 2.
18. Pour the cocktail3 from the shaker to shot glass 3.
19. Empty shot glass 3.
20. Clean shot glass 3.
21. Leave the shaker on the table.

Failure 2 of LLM-AS-P (without context)

Problem (BlocksWorld):

You have 3 blocks. b3 is on top of b2. b1 is on top of b3. b2 is on the table. b1 is clear. Your
arm is empty. Your goal is to move the blocks. b2 should be on top of b3. b3 should be on
top of b1.

GPT-3.5:

Pickup b1
Stack b1 on top of b2 (Failed because b2 is not clear)
Unstack b3 from b2
Stack b3 on top of b1
Putdown b1

Failure 3 of LLM-AS-P (without context)

Problem (Grippers):

There are 3 rooms and 2 balls. robot1 is in room1. robot4 is in room1. robot2 is in room1.
robot3 is in room1. ball2 is in room3. ball1 is in room2. The robots’ grippers are free.

Your goal is to transport the balls to their destinations. ball1 should be in room2.
ball2 should be in room3.

GPT-3.5:

1. Robot1 picks up ball1 with its left gripper. # (Failed because robot1 and
ball1 are in different rooms)
2. Robot1 moves to room2.
3. Robot1 drops ball1 in room2.
4. Robot2 picks up ball2 with its right gripper.
5. Robot2 moves to room3.
6. Robot2 drops ball2 in room3.
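
The failure flagged in step 1 is a violated location precondition. The sketch below encodes the initial state from the problem text and checks two things: whether a robot may pick up a ball, assuming that a pick requires both to be in the same room (the rule implied by the annotation), and whether the goal already holds. All names are illustrative. Read literally, the stated initial state already places both balls in their goal rooms, so under this reading no action is needed at all.

```python
from typing import Dict

# Initial state and goal transcribed from the Grippers problem text.
ROBOT_ROOM: Dict[str, str] = {"robot1": "room1", "robot2": "room1",
                              "robot3": "room1", "robot4": "room1"}
BALL_ROOM: Dict[str, str] = {"ball1": "room2", "ball2": "room3"}
GOAL: Dict[str, str] = {"ball1": "room2", "ball2": "room3"}


def can_pick(robot: str, ball: str) -> bool:
    """Assumed precondition: the robot and the ball must share a room."""
    return ROBOT_ROOM[robot] == BALL_ROOM[ball]


def goal_satisfied(ball_room: Dict[str, str], goal: Dict[str, str]) -> bool:
    """True when every ball named in the goal is in its destination room."""
    return all(ball_room[ball] == room for ball, room in goal.items())


if __name__ == "__main__":
    print("robot1 can pick ball1:", can_pick("robot1", "ball1"))
    print("goal holds initially:", goal_satisfied(BALL_ROOM, GOAL))
```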

Failure 1 of LLM-AS-P (with in-context-learning)

Problem (Storage):

Depot storeareas are: depot48-1-1 depot48-1-2 depot48-2-1 depot48-2-2

Container storeareas are: container-0-0 container-0-1
Here is a map of depot storeareas:

depot48-1-1 depot48-1-2
depot48-2-1 depot48-2-2

According to the map, adjacent depot storeareas are connected. All depot storeareas are in
depot48. crate1 is on container-0-1. crate0 is on container-0-0. All crates and container
storeareas are in container0. All container storeareas are connected to loadarea. depot48-2-1
and loadarea are connected. depot48-2-2 depot48-2-1 depot48-1-1 are clear. hoist0 is in
depot48-1-2. All hoists are available.

Your goal is to move all crates to depot48.

GPT-3.5:

Go out with hoist0 from depot48-1-2 to loadarea. # (Failed because depot48-1-2
and loadarea are not connected)
Lift crate0 in container-0-0 in container 0 with hoist0 from
loadarea.
Go in with hoist0 from loadarea to depot48-2-1.
Drop crate0 with hoist0 from depot48-2-1 to depot48-1-1 in depot48.
Go out with hoist0 from depot48-2-1 to loadarea.
Lift crate1 in container-0-1 in container 0 with hoist0 from
loadarea.
Drop crate1 with hoist0 from loadarea to depot48-2-2 in depot48.
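
The first step fails on connectivity, which can be checked mechanically. The sketch below builds the connection graph from the problem text, assuming that "adjacent" in the 2x2 depot map means horizontally or vertically neighboring cells, and uses breadth-first search to find a shortest route for hoist0 from depot48-1-2 to loadarea. Function and variable names are illustrative, and the search ignores whether intermediate areas are occupied.

```python
from collections import deque
from typing import Dict, List, Optional, Set

# Connections transcribed from the problem text (grid adjacency is an assumption).
EDGES = [
    ("depot48-1-1", "depot48-1-2"),
    ("depot48-2-1", "depot48-2-2"),
    ("depot48-1-1", "depot48-2-1"),
    ("depot48-1-2", "depot48-2-2"),
    ("container-0-0", "loadarea"),
    ("container-0-1", "loadarea"),
    ("depot48-2-1", "loadarea"),
]


def build_graph(edges) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from a list of area pairs."""
    graph: Dict[str, Set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph: Dict[str, Set[str]], start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search; returns one shortest path or None if unreachable."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(graph.get(node, ())):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


GRAPH = build_graph(EDGES)

if __name__ == "__main__":
    print("directly connected:", "loadarea" in GRAPH["depot48-1-2"])
    print("route:", shortest_path(GRAPH, "depot48-1-2", "loadarea"))
```

Under these assumptions hoist0 needs three moves to reach loadarea, and every such route passes through depot48-2-1.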

Failure 2 of LLM-AS-P (with in-context-learning)

Problem (Termes):

The robot is on a grid with 4 rows and 3 columns.

pos-0-0 pos-0-1 pos-0-2
pos-1-0 pos-1-1 pos-1-2
pos-2-0 pos-2-1 pos-2-2
pos-3-0 pos-3-1 pos-3-2
The robot is at pos-2-0. The depot for new blocks is at pos-2-0. The maximum height of
blocks is 3.

Your goal is to build blocks so that the height at pos-1-2 is 3. You cannot have an
unplaced block at the end.

GPT-3.5:

Create block at pos-2-0
move to pos-2-1
place block at pos-1-2 # (Failed because pos-2-1 and pos-1-2 are not
adjacent)
move to pos-2-2
place block at pos-1-2
move to pos-2-1
place block at pos-1-2
move to pos-2-0
place block at pos-1-2
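
The annotated failure comes down to grid adjacency. A minimal sketch of that check follows, assuming positions are named pos-ROW-COL and that two cells are adjacent only when they differ by one step in exactly one coordinate (no diagonals), which matches the annotation above. The helper names are illustrative.

```python
from typing import Tuple


def parse_pos(name: str) -> Tuple[int, int]:
    """Split a name such as 'pos-1-2' into (row, column)."""
    _, row, col = name.split("-")
    return int(row), int(col)


def adjacent(a: str, b: str) -> bool:
    """Assumed rule: cells are adjacent when their Manhattan distance is 1."""
    (r1, c1), (r2, c2) = parse_pos(a), parse_pos(b)
    return abs(r1 - r2) + abs(c1 - c2) == 1


if __name__ == "__main__":
    # The robot stands at pos-2-1 and tries to place a block at pos-1-2.
    print(adjacent("pos-2-1", "pos-1-2"))
    # Cells from which pos-1-2 can be reached on the 4x3 grid.
    cells = [f"pos-{r}-{c}" for r in range(4) for c in range(3)]
    print([cell for cell in cells if adjacent(cell, "pos-1-2")])
```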

Failure 1 of LLM+P (without context)

Problem (Tyreworld):

You have a jack, a pump, a wrench, a boot, 3 hubs, 3 nuts, 3 flat tyres, and 3 intact tyres.
The jack, pump, wrench, and intact tyres are in the boot. The boot is unlocked but is closed.
The intact tyres are not inflated. The flat tyres are on the hubs. The hubs are on the ground.
The nuts are tight on the hubs. The hubs are fastened.

Your goal is to replace flat tyres with intact tyres on the hubs. Intact tyres should be
inflated. The nuts should be tight on the hubs. The flat tyres, wrench, jack, and pump should
be in the boot. The boot should be closed.

GPT-3.5:

(define (problem replace-tyres)

(:domain tyre-replacement)
(:objects
jack pump wrench boot hub nut flat-tyre intact-tyre - object
)
(:init
(in jack boot)
(in pump boot)
(in wrench boot)
(in intact-tyre boot)
(inflat intact-tyre)
(on flat-tyre hub)
(on-ground hub)
(tight nut hub)
(fasten hub)
(closed boot)
)
(:goal (and
(not (on flat-tyre hub))
(on intact-tyre hub)
(inflat intact-tyre)
(tight nut hub)
(in jack boot)
(in pump boot)
(in wrench boot)
(in intact-tyre boot)
(closed boot)
))
)
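
One visible problem in the generated file is that it declares a single hub, nut, flat tyre, and intact tyre, whereas the problem statement lists three of each. The sketch below shows one way such a mismatch could be caught automatically: it extracts the names from the `:objects` section and compares their count with the count implied by the natural-language description. The parser is a simplified assumption (it handles only flat, optionally typed object lists), and all names are illustrative.

```python
import re
from typing import List

# The :objects section exactly as generated by GPT-3.5 above.
GENERATED_OBJECTS = """(:objects
jack pump wrench boot hub nut flat-tyre intact-tyre - object
)"""

# Count implied by the problem text: jack, pump, wrench, boot,
# plus 3 hubs, 3 nuts, 3 flat tyres, and 3 intact tyres.
EXPECTED_COUNT = 4 + 3 * 4


def declared_objects(pddl: str) -> List[str]:
    """Return object names from the first (:objects ...) section, skipping types."""
    match = re.search(r"\(:objects(.*?)\)", pddl, flags=re.DOTALL)
    if match is None:
        return []
    names: List[str] = []
    skip_type = False
    for token in match.group(1).split():
        if token == "-":
            skip_type = True      # the next token is a type name
        elif skip_type:
            skip_type = False     # drop the type name itself
        else:
            names.append(token)
    return names


if __name__ == "__main__":
    found = declared_objects(GENERATED_OBJECTS)
    print(f"declared {len(found)} objects, expected {EXPECTED_COUNT}")
```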