
WorldCoder, a Model-Based LLM Agent:

Building World Models by Writing Code and Interacting with the Environment

Hao Tang 1 Darren Key 1 Kevin Ellis 1

1 Cornell University, Department of Computer Science. Correspondence to: Hao Tang <[email protected]>, Kevin Ellis <[email protected]>.

arXiv:2402.12275v1 [cs.AI] 19 Feb 2024

Abstract

We give a model-based agent that builds a Python program representing its knowledge of the world based on its interactions with the environment. The world model tries to explain its interactions, while also being optimistic about what reward it can achieve. We do this by extending work on program synthesis via LLMs. We study our agent on gridworlds, finding our approach is more sample-efficient compared to deep RL, and more compute-efficient compared to ReAct-style agents.

1. Introduction

Consider yourself learning to use a new device or play a new game. Given the right prior knowledge, together with relatively few interactions, most people can acquire basic knowledge of how many devices or games work. This knowledge could then be used to achieve novel goals, can be transferred to similar devices or games in the future, and can be communicated symbolically to other humans. How could an AI system similarly acquire, transfer, and communicate its knowledge of how things work?

We cast this problem as one of learning a world model: a mapping, called the transition function, that predicts the next state of affairs, given a current state and action (Sutton & Barto, 2018; Schrittwieser et al., 2020; Hafner et al., 2021). Our proposed solution is an architecture that synthesizes a Python program to model its past experiences with the world, effectively learning the transition function, and which takes actions by planning using that world model.

Our method extends and advances recent program synthesis methods based on Large Language Models (LLMs), particularly (Qiu et al., 2023; Wang et al., 2023c; Chen et al., 2023; Ellis, 2023). Representing world knowledge as code, and generating it from LLMs, addresses several important problems. It allows prior world knowledge embedded in the LLM to bias Python code generation, facilitates sample-efficient transfer across tasks by reusing pieces of old programs, and allows inspecting and understanding the system's knowledge, because high-level programming languages are designed to be interpretable by humans.

Fig. 1 diagrams our architecture for building and using Python world models, which we cast as model-based reinforcement learning (RL). In Fig. 2 we position this work relative to deep RL as well as LLM agents. In contrast to deep RL (Schrittwieser et al., 2020, i.a.), we view the world model as something that should be rapidly learnable and transferable across environments. However, our work has different limitations from deep RL: we consider fully observed, deterministic, low-dimensional (symbolic) environments. This simplifies the problem, but not to the point of triviality: for instance, high-level robot task planning typically makes similar assumptions (Liang et al., 2022).

Central to our work is a particular claim about how an LLM should relate to world models. In our setup, the LLM does not simulate the world, but instead builds a simulation of the world. Therefore the LLM does not need to act as a world model, and different world models can be learned in different environments. This should be contrasted with LLM agents such as ReAct (Yao et al., 2022) where the LLM directly reasons about different actions and their consequences. We also do not expect the LLM to perform planning, which LLMs are known to struggle with (Valmeekam et al., 2023). Instead, we require the LLM to possess fuzzy prior knowledge of how the world might be, together with the programming skill to code and debug a transition function. Our use of LLMs is closest to (Guan et al., 2023; Wong et al., 2023; Zhu & Simmons, 2024), which generate planning productions, which can be seen as a particular kind of world model. Overall though, our problem statement is closer to (Das et al., 2023; Evans et al., 2021; Tsividis et al., 2021), which learn world models in domain-specific programming languages. We further show how to use Turing-complete languages like Python, which we believe is important for general-purpose learners, and we also study efficient transfer learning and exploration strategies for such agents.


[Figure 1: diagram of the agent architecture, showing the environment ("world"), the synthesized world_model.py (def update(state, action, goal): ... return new_state, reward), the planner, the LLM, and the replay buffer, connected by state/reward/goal and action signals.]

Figure 1. Overall agent architecture. The agent also inputs a goal in natural language.

                        Priors   Sample complexity   World model representation   Inputs     LLM calls/task
Deep Model-Based RL     Low      High                Neural, learned              High-dim   -
LLM Agents              High     Zero*               Neural, fixed                Symbolic   O(T)
Ours                    High     Low                 Symbolic, learned            Symbolic   O(1)
Figure 2. Qualitative comparison of our method against deep model-based RL and LLM agents (ReAct, RAP, etc.: Yao et al. (2022); Hao et al. (2023); Zhao et al. (2023); Liu et al. (2023b)). Sample complexity refers to the number of environment interactions needed to learn a world model (*LLM agents do not update their world model). LLM calls/task is the number of LLM calls needed to solve a new task in a fixed environment, amortized over many such tasks, as a function of the maximum episode length T. Asymptotically, after learning a world model, our method can accomplish new tasks with at most one LLM query, to update the reward function.

We make the following three contributions:

1. An architecture, which we call WorldCoder, for learning world models as code. The architecture supports learning that is more transferable, interpretable, and dramatically more sample-efficient compared to deep RL, and also more compute-efficient than prior LLM agents.

2. A new learning objective for program-structured world models that favors optimism in the face of uncertainty (Section 2.2). We show that this learning objective generates goal-driven exploratory behavior, which can reduce the number of environment actions needed to obtain reward by orders of magnitude.

3. As an auxiliary contribution, we also give an improved algorithm for using an LLM to debug and improve its code, based on a bandit formulation that balances exploiting promising programs against exploring new ones. We describe this new bandit algorithm in Section 2.5.

2. Methods

2.1. Problem statement and core representations

We start with the standard MDP formalism but modify it in three important ways. First, we assume a goal is given in natural language. The goal could be something specific such as "pickup the ball", or underspecified, such as "maximize reward." Second, we restrict ourselves to deterministic environments, and assume that environment dynamics are fixed across goals. Third, we assume an episodic MDP with the current episode terminating upon reaching the goal.

We formalize this as a Contextual Markov Decision Process (CMDP: Hallak et al. (2015)), which is a tuple (C, S, A, M) where C is a set of contexts (i.e. goals), S is a set of states, A is a set of actions, and M is a function mapping a context c ∈ C to a Markov Decision Process (MDP). The context-conditioned MDP, M(c), is a tuple (S, A, T, R_c, γ) with transition function T : S × A → S, discount factor γ, and reward function R_c : S × A × S → R × {0, 1}. Note the reward is nonstandard: it depends on the context, and returns an extra Boolean flag indicating whether the goal has been achieved (whether the episode is over). The transition function does not depend on the context. The objective is to select actions to maximize cumulative discounted future reward, which is given by ∑_{t=0}^∞ γ^t r_t, where r_t is the reward received t timesteps into the future.

State representation. Motivated by the maturity of object detection and the formalisms used in robotic task planning, we represent each state as a set of objects. Each object has a string-valued field name, integer-valued fields x and y indicating its position, and optionally additional fields depending on the type of the object. For example, if name="door", then the object has two additional Boolean fields indicating if the door is open/closed and locked/unlocked. This can be seen as a variation of Object Oriented MDPs (Diuk et al., 2008).

Representing world models as code. The agent uses Python code to model the transition and reward functions. Mathematically we think of this Python code as a tuple (T̂, R̂) of a transition function T̂ : S × A → S and a reward model R̂ : C → (S × A × S → R × {0, 1}). Note again that the reward depends on the context, and returns an extra Boolean indicating whether the goal has been reached, in which case the current episode terminates. Both functions are implemented as separate Python subroutines, which encourages disentangling the dynamics from the reward.
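
To make these representations concrete, the following minimal sketch (ours, not the paper's implementation; the Object container and its attrs field are illustrative assumptions) shows an object-based state and the two Python subroutines the agent synthesizes.

from dataclasses import dataclass

@dataclass(frozen=True)
class Object:
    name: str           # e.g. "agent", "key", or "door"
    x: int              # integer grid position
    y: int
    attrs: tuple = ()   # optional extra fields, e.g. (("is_open", False), ("is_locked", True))

# A state is a set of objects; hashability lets states be stored, compared, and replayed.
State = frozenset

def transition(state: State, action: str) -> State:
    """Synthesized T-hat: deterministically map (state, action) to the next state."""
    raise NotImplementedError("filled in by LLM-generated code")

def reward(goal: str, state: State, action: str, next_state: State) -> tuple:
    """Synthesized R-hat(c): return (reward, done) for the natural-language goal c."""
    raise NotImplementedError("filled in by LLM-generated code")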


2.2. The world model learning problem

What objectives and constraints should the world model satisfy? Clearly, the learned world model should explain the observed data by correctly predicting the observed state transitions and rewards. This is a standard training objective within model-based RL (Sutton, 1990).

One less obvious learning objective is that the world model should suffice to plan to the goal. Given two world models, both consistent with the observed data, the agent should prefer a model which implies it is possible to get positive reward, effectively being optimistic in the face of uncertainty about world dynamics. In the low-data regime, there will be multiple world models consistent with the data: preferring an optimistic one guarantees that the agent can at least start making progress toward the goal, even if it later has to update its beliefs because they turned out to be too optimistic.

Concretely, as standard, the agent collects a dataset D of past environment interactions, each formatted as a tuple (s, a, r, s′, c, d) of current state s, next state s′, action a, and reward r in context c, with d indicating if the episode ended upon reaching s′. (The variable d should be read as "done.") The agent also stores the initial state s0 and context c of each episode, so that it can prefer world models that can reach the goal from an initial state. The learning problem is to construct Python programs implementing a transition function T̂ and reward model R̂ satisfying constraints ϕ1 and ϕ2, defined below:

    ϕ1(D, T̂, R̂) = fit data:                                              (1)
        ∀(s, a, r, s′, c, d) ∈ D : (T̂, R̂) ⊢ (s, a, r, s′, c, d)
    where (T̂, R̂) ⊢ (s, a, r, s′, c, d) if and only if:
        T̂(s, a) = s′  ∧  R̂(c)(s, a, s′) = (r, d)

    ϕ2(s0, c, T̂, R̂) = optimism under uncertainty:                        (2)
        ∃ a_1, s_1, a_2, s_2, ..., a_ℓ, s_ℓ :
            (∀i ∈ [ℓ] : T̂(s_{i−1}, a_i) = s_i)  ∧
            (∃r > 0 : R̂(c)(s_{ℓ−1}, a_ℓ, s_ℓ) = (r, 1))

where s0, c is the initial state/context of the episode, and the turnstile (⊢) should be read as meaning that a given program entails a given logical constraint or predicts a given replayed experience.

Constructing a world model satisfying ϕ1 ∧ ϕ2 is a program synthesis problem. We solve it by prompting an LLM with a random subset of D and asking it to propose a candidate program. If the resulting program does not fit all the data (satisfy ϕ1), then the LLM is backprompted with members of D inconsistent with its latest program, and asked to revise/debug its code, inspired by Chen et al. (2023). (If it does not satisfy ϕ2, it is instead backprompted with s0, c: see Appendix D for prompts.) Iterating this self-debugging strategy introduces an interesting explore/exploit tradeoff between trying to debug the best programs (those consistent with most of D), and trying to debug new programs. In Sec. 2.5 we introduce a bandit algorithm for balancing this tradeoff.
experience.
engineer a special prompt for trying to enforce ϕ2 upon
Constructing a world model satisfying ϕ1 ∧ ϕ2 is a program encountering a new goal (Appendix Sec. D.5). This prompt
synthesis problem. We solve it by prompting an LLM with is based on retrieving previously learned reward functions.


Algorithm 1 WorldCoder Agent Architecture
  Hyperparam: ϵ, random exploration probability (default to 5%)
  Hyperparam: MinDataSize, min # actions before learning begins (default to 10)
  D ← ∅                                     ▷ replay buffer
  T̂, R̂ ← null, null                          ▷ init empty world model
  loop forever through episodes:
      c ← EpisodeGoal()                      ▷ get context (goal)
      s0 ← CurrentState()                    ▷ store init state for ϕ2
      loop until episode ends:
          s ← CurrentState()
          if not ϕ1 ∧ ϕ2 and |D| ≥ MinDataSize then
              T̂, R̂ ← Synthesize(T̂, R̂, D, c, s0)    ▷ Alg 2
          with probability ϵ do
              a ← RandomAction()             ▷ ϵ-greedy explore
          else
              a ← Plan(s, T̂, R̂(c))           ▷ Value Iteration
          s′, r, d ← Env.Step(a)             ▷ take action in state s
          D ← D ∪ {(s, a, r, s′, c, d)}      ▷ record experience

2.4. Overall architecture

Ultimately our world models exist to serve downstream decision-making; and in turn, taking actions serves to provide data for learning better world models. There are many architectures for combining acting with world-model learning, such as via planning (Schrittwieser et al., 2020), training policies in simulation (Hafner et al., 2021), or hybrid approaches (Sutton, 1990). We use a very basic agent architecture, shown in Algorithm 1. At a high level, it initially performs random exploration to initialize a dataset D of environmental interactions; it updates its world model using the program synthesis algorithm of Sec. 2.5; and, past its initial exploration phase, it performs planning via depth-limited value iteration.
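
The Plan step of Algorithm 1 is depth-limited value iteration over the learned deterministic model. A minimal sketch, with the horizon, discount, and explicit action list as our own assumptions:

from functools import lru_cache

def plan(state, transition, reward, goal, actions, horizon=20, gamma=0.95):
    """Pick an action by finite-horizon dynamic programming over (T-hat, R-hat(goal)),
    i.e. depth-limited value iteration on the deterministic learned model."""

    @lru_cache(maxsize=None)
    def value(s, depth):
        # Best discounted return reachable from s within `depth` further steps.
        if depth == 0:
            return 0.0
        best = float("-inf")
        for a in actions:
            nxt = transition(s, a)
            r, done = reward(goal, s, a, nxt)
            best = max(best, r if done else r + gamma * value(nxt, depth - 1))
        return best

    def q(a):
        nxt = transition(state, a)
        r, done = reward(goal, state, a, nxt)
        return r if done else r + gamma * value(nxt, horizon - 1)

    return max(actions, key=q)
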
2.5. Program Synthesis via Refinement as Arm-acquiring Bandits

A recent paradigm for problem-solving with LLMs is to have the language model refine, self-correct, or self-debug its initial outputs (Chen et al., 2023); we use the term refinement for this whole body of work. When synthesizing programs from a dataset of examples, this means prompting the LLM with an erroneous program it previously generated, together with failed example testcase(s), and finally prompting it to fix its code.

Complex programs typically require several rounds of refinement, however, and each round requires calling an LLM, which can return several samples. Therefore, this process naturally generates a tree of possible programs (Fig. 3). Recent work has cast doubt on the ability of LLMs to refine their own outputs (Olausson et al., 2023), concluding that, for a fixed LLM token budget, it is better to repeatedly sample from the root of the tree: essentially, repeatedly guessing from scratch.

Our investigations, however, suggest that refinement can work, and even generate 150+ line programs, if and only if the system can intelligently decide which program it should refine next. Should we maximally exploit by refining the program which passes the most test cases? Or should we explore by trying to improve the program which has been refined the least so far? This exploration-exploitation tradeoff is made more interesting by the fact that every time we refine a program, we create another brand-new program.

In what follows, we will first describe our method for general program synthesis problems, and then describe its specific instantiation for world models.

The general setting. We frame refinement as an arm-acquiring bandit problem: a bandit problem where new arms arrive over time (Whittle, 1981). Here each arm is a program, and "pulling" an arm corresponds to refining a program. New programs arrive over time, because with each refinement, we generate a new program. We receive a reward of 1 if pulling an arm yields a program that fits the data perfectly, and zero otherwise.

To formalize this, we write ρ to mean a program, and Φ to mean the logical constraint that the program should satisfy, such as being consistent with a dataset of input-output examples. The bandit problem assumes a binary reward: we receive reward 1 if the refined program satisfies Φ and reward 0 otherwise. Although we want to perfectly satisfy Φ, we assume access to a heuristic estimator of program quality, h(ρ), which returns a score between 0 and 1. For example, h(ρ) could be the fraction of testcases that ρ passes.

[Figure 3: a refinement tree; an initial program (0% correct) is refined into programs that are 60%, 40%, and 50% correct, with "exploit" pointing to refining the 60% program and "explore" pointing to refining the 50% program.]

Figure 3. Refinement tree. Exploit the best program passing 60% of the test cases, or explore refining a program passing just 50%?


Writing r for the bandit reward and P_refine for the distribution of LLM-generated refinements, the reward follows a Bernoulli distribution whose parameter we write θ_ρ:¹

    θ_ρ = P(r = 1 | ρ) = E_{ρ′ ∼ P_refine(·|ρ)} [ 1[ρ′ ⊢ Φ] ]            (3)

¹ Typewriter font is used to distinguish rewards in this bandit problem from MDP rewards.

We solve this bandit problem using Thompson Sampling (Russo et al., 2018; Chapelle & Li, 2011), meaning we maintain probabilistic beliefs about each arm's expected reward, θ_ρ. As standard, the distribution over θ_ρ is modeled as a Beta distribution, whose parameters we write α_ρ and β_ρ. Thompson sampling picks the next arm to pull by sampling an expected reward θ_ρ from each arm's Beta distribution, and then pulls the arm with the highest expected reward. The corresponding Beta parameters are then updated via Bayes Rule, which simply increments α_ρ or β_ρ depending on whether reward was received. As a side-effect of refining this program, the LLM generates a new program, which then becomes an arm that could be pulled in subsequent iterations. The whole process stops when a program is generated that perfectly satisfies the constraint Φ.

However, we would also like to prioritize programs which have higher heuristic value: for instance, if the heuristic measures how many test cases a program passes, then a program which passes more test cases is more likely to be refined into one which passes all tests. Mathematically this is straightforwardly compatible with Thompson Sampling by just defining a prior over θ_ρ:

    P(θ_ρ) = Beta(α_ρ^prior, β_ρ^prior)                                   (4)
    α_ρ^prior = 1 + C × h(ρ)                                              (5)
    β_ρ^prior = 1 + C × (1 − h(ρ))                                        (6)

where C is a hyperparameter. Increasing C encourages greedier behavior by increasing the importance of the prior. We set C = 5.

Bandit Refinement for World Models. Although this formalism is general, for our specific setting, the program ρ corresponds to a tuple (T̂, R̂) of a transition function and a reward model. Our heuristic measures what fraction of the replay buffer can be explained by a given program:

    h(T̂, R̂) = ( Σ_{x∈D} 1[ρ ⊢ x] ) / |D|                                 (7)

To satisfy ϕ1 ∧ ϕ2 we first run the bandit algorithm with Φ = ϕ1, and then run it with Φ = ϕ1 ∧ ϕ2. This 2-phase procedure first produces a program that is consistent with the data before next enforcing optimism under uncertainty.²

² Doing one synthesis phase is possible with Φ = ϕ1 ∧ ϕ2, but pilot experiments showed this took slightly more LLM calls.

Alg. 2 puts these ingredients together. Although the pseudocode loops forever until it finds a program that perfectly explains the data, we bound the number of iterations to 50, at which point, the best program found so far is returned.

Choice of LLM for refinement. We use GPT-4 because recent work (Olausson et al., 2023) suggests it is, by far, the best model for refining code. Although expensive, the point of our bandit algorithm is to minimize this cost, and indeed, we find that our new method is more effective at managing these costs than existing approaches to refinement (Fig. 4).

Figure 4. h-value at 50 refinements, contrasting our bandit method with greedily refining the most promising program so far.

Algorithm 2 Bandit formulation of program synthesis
  Input: logical constraint Φ, heuristic h(·), seed program ρ
  Hyperparameter: C > 0
  progs ← {ρ}                              ▷ initialize with one arm (one program)
  α, β ← dict(), dict()                    ▷ params for each arm
  α_ρ ← 1 + C × h(ρ)                       ▷ prior beliefs
  β_ρ ← 1 + C × (1 − h(ρ))                 ▷ prior beliefs
  repeat
      ▷ Thompson Sampling
      ∀ρ ∈ progs : θ_ρ ∼ Beta(α_ρ, β_ρ)    ▷ sample
      ρ ← argmax_{ρ∈progs} θ_ρ             ▷ select arm
      ρ′ ∼ P_refine(·|ρ)                   ▷ call LLM to refine
      r ← 1[ρ′ ⊢ Φ]                        ▷ get reward
      α_ρ ← α_ρ + r                        ▷ belief update
      β_ρ ← β_ρ + (1 − r)                  ▷ belief update
      ▷ Arm arrival
      progs ← progs ∪ {ρ′}                 ▷ add to set of arms
      α_ρ′ ← 1 + C × h(ρ′)                 ▷ prior beliefs
      β_ρ′ ← 1 + C × (1 − h(ρ′))           ▷ prior beliefs
  until ρ′ ⊢ Φ
  return ρ′
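
A Python rendering of Algorithm 2's Thompson-sampling loop might look like the sketch below; refine_with_llm stands in for the LLM call and is not a real API, and the iteration cap mirrors the 50-refinement budget described above.

import random

def bandit_synthesize(seed_program, h, satisfies, refine_with_llm, C=5.0, max_iters=50):
    """Arm-acquiring bandit over programs: every program is an arm with Beta(alpha, beta)
    beliefs about its refinement success rate; pulling an arm refines that program."""
    programs = [seed_program]
    alpha = {0: 1.0 + C * h(seed_program)}           # prior beliefs (Eqs. 4-6)
    beta = {0: 1.0 + C * (1.0 - h(seed_program))}

    best = seed_program
    for _ in range(max_iters):
        # Thompson sampling: sample a plausible success rate per arm, pull the argmax.
        theta = {i: random.betavariate(alpha[i], beta[i]) for i in range(len(programs))}
        i = max(theta, key=theta.get)

        new_program = refine_with_llm(programs[i])   # LLM proposes a refinement
        r = 1.0 if satisfies(new_program) else 0.0   # bandit reward: does it satisfy Phi?
        alpha[i] += r                                # posterior update for the pulled arm
        beta[i] += 1.0 - r

        j = len(programs)                            # arm arrival: the refinement is a new arm
        programs.append(new_program)
        alpha[j] = 1.0 + C * h(new_program)
        beta[j] = 1.0 + C * (1.0 - h(new_program))

        if h(new_program) > h(best):
            best = new_program
        if r == 1.0:
            return new_program
    return best
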
3. Experimental Results

We study our system in two grid worlds, Sokoban and Minigrid, with the goal of understanding the sample efficiency and computational efficiency of the learner, especially when transferring knowledge across similar environments, as well as the impact of our nonstandard world-model learning objective, which prefers optimistic models.


[Figure 5, panels: (A) Sokoban example gameplay with 3 boxes, showing rewards of +1, +0, and +10 over a sequence of about 20 actions; (B) learning curves, solve rate on test levels; (C) LLM token cost; (D) deep RL comparison.]

Figure 5. (A) Sokoban domain (per-step reward of -0.1 elided from figure). (B) Learning curves. ReAct has the same pretrained knowledge of Sokoban but cannot effectively play the game. (C) Our method has different asymptotic LLM cost compared to prior LLM agents, which consume LLM calls/tokens at every action. (D) Deep RL takes >1 million steps to learn 2-box Sokoban.

Sokoban is a puzzle-solving task where the agent pushes boxes around a 2d world, with the goal of pushing every box onto a target (Fig. 5A). Solving hard Sokoban levels is a challenging planning task that has received recent attention from the planning and RL communities (Chrestien et al., 2023; Chung et al., 2023; Kujanpää et al., 2023; Feng et al., 2022; 2020). Unlike these other works, our emphasis is not on solving the hardest Sokoban levels. Instead, we wish to show that our agent can rapidly achieve basic competence. Master-level play could then be achieved via any of the cited works that focus on sophisticated planning and search.

Starting with only the natural-language goal of "win the game", our agent builds a world model over the first 50 actions. The resulting code is human-understandable (Appendix C), and generalizes to solving levels with more boxes (Fig. 5B). While the system cannot solve very hard Sokoban levels (eg, 5+ boxes), that is an artifact of the difficulty of planning, and could be addressed by plugging the world model into any of the techniques cited above. In contrast to this work, both model-based and model-free deep RL require millions of experiences to solve basic levels (Fig. 5D).

Almost surely, an important reason why our system learns quickly is because the underlying LLM already knows about Sokoban,³ and can quickly infer that it is playing a similar game. However, simply knowing about Sokoban does not suffice for the LLM to play the game, as demonstrated by the poor performance of ReAct (Fig. 5B). ReAct is a baseline which prompts the LLM with the state-action history, then asks it to think step-by-step (Reason) before predicting an action (Act). Quantitatively, ReAct succeeds on only 15% ± 8% of basic levels, showing that pretrained knowledge of Sokoban does not, by itself, allow strong play.

³ Sokoban is a computer game from 1982, so descriptions of how to play it are probably in GPT-4's training data.

ReAct-style architectures (Hao et al., 2023; Zhao et al., 2023; Liu et al., 2023b, i.a.) also require expensive LLM calls at every action, and so asymptotically their cost grows linearly with the number of actions taken. Our approach has different asymptotics: after front-loading 400k LLM tokens (about $15), it can issue as many actions as needed without subsequent LLM queries, whereas ReAct demands many times more tokens (Fig. 5C).
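
For contrast, the per-action LLM cost of a ReAct-style baseline is visible directly in its control loop. The sketch below is schematic: llm() is a hypothetical text-completion call, the environment interface is gym-like, and the single-line action parse is a toy simplification, not the baseline's actual prompt or parser.

def react_episode(env, llm, goal, max_steps=100):
    """Schematic ReAct-style loop: at least one LLM call per environment step,
    so token cost grows linearly with the episode length T."""
    history = []
    obs = env.reset()
    for _ in range(max_steps):
        prompt = (f"Goal: {goal}\nHistory: {history}\nObservation: {obs}\n"
                  "Think step by step, then state your next action on the last line.")
        reply = llm(prompt)                       # one LLM call per action
        action = reply.strip().splitlines()[-1]   # toy parse: last line is the action
        obs, reward, done = env.step(action)
        history.append((action, reward))
        if done:
            break
    return history
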
On Sokoban, the optimism under uncertainty objective (ϕ2, orange curves in Fig. 5B) has little effect: Sokoban has a dense reward structure that allows easy learning through random exploration. Next we consider problems with sparse rewards and natural language instructions, which the optimism objective is designed to handle.

Minigrid. To better understand the transfer learning and instruction following aspects of our approach, we next study Minigrid (Chevalier-Boisvert et al., 2023), a suite of grid games designed for language-conditioned RL. Each minigrid environment introduces new objects with their own dynamics, such as keys, doors, walls, balls, and boxes.

Fig. 6 illustrates results for our agent playing a sequence of minigrid environments, while Appendix A.1 gives a walk-through of an example learning trajectory. The agent interacts with each environment episodically through different randomly-generated levels. We order the environments into a curriculum designed to illustrate different forms of transfer learning. For example, when transferring from the first to the second environment, the agent needs to extend its knowledge to incorporate new objects and their associated dynamics (keys and doors). Learning about these new objects requires extra environment interactions, during which the agent experiments with the objects to update its transition function. In contrast, the third environment presents no new dynamics, but instead introduces new natural-language goals. Our agent can follow these natural-language instructions by enforcing optimism under uncertainty (ϕ2).


[Figure 6, panel (A): minigrid environments and curriculum ordering, testing different forms of transfer. env 1: empty ("get to the green goal square"); env 2: door_key ("use the key to open the door and get to the goal"), new transitions, few-shot generalization; env 3: unlock ("open the door"), new reward, 0-shot generalization; env 4: fetch ("fetch a green ball"), new reward + new transition, few-shot generalization; env 5: unlock_pickup ("pick up the red box"), new transitions, few-shot generalization. Panel (B): curriculum learning results. Panel (C): no-curriculum results.]

Figure 6. (A) Minigrid environments ordered into a curriculum that tests different kinds of transfer learning. (B) Transfer learning performance, compared with (C) performance when solving each environment independently. Appendix Fig. 7: deep RL comparison.

We observe that transfer is especially helpful in quickly solving new environments (Fig. 6B). Without transfer, more episodes of exploration are needed to collect experience to build a good world model. However, optimism under uncertainty (ϕ2) helps in this regard for the non-transfer setting by promoting goal-directed exploration, and in fact, absent transfer, it is only with ϕ2 that WorldCoder can solve the harder environments. Optimism is also necessary for zero-shot generalization to new goals (Fig. 6B, env 3).

To better understand the importance of ϕ2, which both encourages exploration and enables natural-language instruction following, we contrast against a version of our approach which simply prompts GPT-4 to generate a new reward function upon receiving a new goal (green in Fig. 6B, labelled 'prompt4reward'). Theoretically this suffices to follow new natural-language instructions, provided the LLM is reliable. Surprisingly, this ablation struggles to follow new instructions (eg, transfer from env 2→3 in Fig. 6A), showing the importance of ϕ2 in correcting mistakes made by the LLM when generating reward functions.

4. Related Work

World models. World modeling is a classic paradigm for decision-making: it is the basis of model-based RL (Hafner et al., 2023; 2021) and of task-and-motion planning in robotics (Kaelbling & Lozano-Pérez, 2011), and it is corroborated by classic behavioral experiments suggesting biological decision-making features a similar architecture (Tolman & Honzik, 1930). At least in theory, an agent with a world model enjoys many advantages. Such an agent could reason in radically new situations by spending more time planning different actions, requiring zero retraining, and can similarly accomplish novel goals by just changing planning objectives, again zero-shot. In contrast, merely learning a policy can leave the agent vulnerable to catastrophic failure given even small changes to its initial conditions or goals (Kansky et al., 2017). But in practice, world models are hard-won, requiring either large volumes of training data (Hafner et al., 2021; Ha & Schmidhuber, 2018), or careful hand-engineering (Kaelbling & Lozano-Pérez, 2011) (cf. Mao et al. (2022); Konidaris et al. (2018)).

Neurosymbolic world models, such as Cosmos and NPS (Sehgal et al., 2023; Goyal et al., 2021), learn a factored, object-based neural world model. This factoring helps compositional generalization, like in our work, and, importantly, these models can learn from raw perception, but at the expense of transfer and sample complexity. Combining our work with these others might get the best of both.

LLMs as a world model. Whether LLMs can model the world is an open question, but there is evidence that, given the right training data in large quantities, transformers can act as decent world models, at least within certain situations (Li et al., 2023; Xiang et al., 2023; Micheli et al., 2023). These works aim to learn a rich but frozen world model from a relatively large volume of examples. We tackle a different problem: building a simple world model on-the-fly from a modest amount of data.


LLMs for building world models. Recent works (Zhu & Simmons, 2024; Wong et al., 2023; Guan et al., 2023) consider using LLMs to generate planning operators: a kind of world model that is abstract, symbolic, and expressed in a domain-specific programming language for planning (cf. DECKARD (Nottingham et al., 2023), another LLM system which generates state-machine world models). In these works, the primary driver of world-model generation, i.e. what the LLM first inputs, is natural language describing affordances and goals. Our work considers a different problem: building world models first and foremost from interacting with the environment. In practice, agents have knowledge both from language and from acting in the world, and so these families of works should be complementary.

LLMs for decision-making is an emerging paradigm that includes ReAct (Yao et al., 2022) and many others (Hao et al., 2023; Zhao et al., 2023; Liu et al., 2023b; Ahn et al., 2022, i.a.), which directly use LLMs to issue actions and reason about their consequences in the world. For instance, ReAct works by prompting the LLM to think step-by-step and then predict an action. To the extent that these methods use a world model, it is implicitly encoded within the weights of a neural network. We instead build an explicit world model, which has the advantage of not needing to query the LLM for every action, because the agent can just repeatedly execute the transition function: the cost of using the LLM is amortized, as it only needs to be paid once to get a good world model. However, ReAct-style approaches can handle partially observable environments (Sun et al., 2023b), which this paper does not consider.

Programs as Policies. Instead of learning a world model, one can learn a policy as a program. The first wave of these works (Verma et al., 2018; 2019) considered domain-specific languages, while recent LLM work (Wang et al., 2023b; Liang et al., 2022; Sun et al., 2023a) uses more flexible general-purpose languages like Python. An advantage of learning a policy is that it does not need to model all the details of the world, many of which may be irrelevant to decision making. A disadvantage is that policies cannot readily generalize to new goals, unlike world models, which can be used by a planner to achieve a variety of objectives. Relatedly, other recent work considers synthesizing programs that implement reward functions (Ma et al., 2023), and then generating a policy with conventional deep RL.

Programs as world models. We are strongly inspired by existing program synthesis algorithms for constructing world models from state-action trajectories (Das et al., 2023; Evans et al., 2021; Tsividis et al., 2021). We believe that this family of methods will not be generally applicable until they can support general-purpose Turing-complete programming languages: so far these works have used restricted domain-specific languages, but we show that a general-purpose computational language, like Python, can be used to learn world models, which we hope expands the scope of this paradigm. We also show how to bias learning toward goal-directed behaviors, and how to support transfer across environments and goals. Last, we simplify the core program synthesis algorithm: the cited prior works required relatively intricate synthesis algorithms, which we can avoid by using LLMs as general-purpose synthesizers. We hope our work can help make this paradigm simpler and more general.

Other works have also explored how humans can manually provide knowledge to RL agents via source code: e.g., RLang (Rodriguez-Sanchez et al., 2023) uses programs to specify parts of policies and world models, which could be combined with our system to integrate prior knowledge.

Exploration & Optimism under (model) uncertainty. Theoretical work on POMDPs (Liu et al., 2023a) proposed exploration strategies conceptually related to our optimism constraint, finding that they give theoretical improvements, which is synergistic with our empirical findings.

5. Limitations and Open Directions

Our work has important limitations, and naturally suggests next steps. Currently we assume deterministic dynamics, which could be addressed by synthesizing probabilistic programs (De Raedt et al., 2007; Goodman et al., 2008). Given recent advances in synthesizing probabilistic programs (Saad, 2022), together with advances in using LLMs for deterministic code, this limitation seems nontrivial but surmountable.

By representing knowledge as code, our approach delivers better sample efficiency and transferability, but at a high cost: our states must be symbolic and discrete, whereas the real world is messy and continuous. While the obvious response is that the agent can be equipped with pretrained object detectors, a common assumption in robotic task planning (Konidaris et al., 2018, i.a.), alternative routes include multimodal models (Hu et al., 2023) and using neurosymbolic programs (Tang & Ellis, 2023) to bridge the gap between perception and symbol processing, which might be more robust to missing or misspecified symbols.

Last, our method uses only a very basic mechanism for growing and transferring its knowledge. Instead of prompting to debug its code, we could have built a library of reusable subroutines and classes shared across different environments and goals, reminiscent of library learning systems (Ellis et al., 2023; Wang et al., 2023a; Grand et al., 2023; Bowers et al., 2023), which refactor their code to expose sharable components. Further developing that and other ways of managing and growing symbolic knowledge about the world remains a prime target for future work.


Acknowledgements. We received funding support from NSF grant #2310350 as well as gifts from Cisco and Joseph Bates.

References

Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Ho, D., Hsu, J., Ibarz, J., Ichter, B., Irpan, A., Jang, E., Ruano, R. J., Jeffrey, K., Jesmonth, S., Joshi, N., Julian, R., Kalashnikov, D., Kuang, Y., Lee, K.-H., Levine, S., Lu, Y., Luu, L., Parada, C., Pastor, P., Quiambao, J., Rao, K., Rettinghouse, J., Reyes, D., Sermanet, P., Sievers, N., Tan, C., Toshev, A., Vanhoucke, V., Xia, F., Xiao, T., Xu, P., Xu, S., Yan, M., and Zeng, A. Do as I can and not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.

Bowers, M., Olausson, T. X., Wong, L., Grand, G., Tenenbaum, J. B., Ellis, K., and Solar-Lezama, A. Top-down synthesis for library learning. Proc. ACM Program. Lang., 7(POPL), 2023. doi: 10.1145/3571234.

Chapelle, O. and Li, L. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.

Chen, X., Lin, M., Schärli, N., and Zhou, D. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128, 2023.

Chevalier-Boisvert, M., Dai, B., Towers, M., de Lazcano, R., Willems, L., Lahlou, S., Pal, S., Castro, P. S., and Terry, J. Minigrid & Miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks. CoRR, abs/2306.13831, 2023.

Chrestien, L., Edelkamp, S., Komenda, A., and Pevný, T. Optimize planning heuristics to rank, not to estimate cost-to-goal. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=Mgy6sgslPY.

Chung, S., Anokhin, I., and Krueger, D. Thinker: Learning to plan and act. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=mumEBl0arj.

Das, R., Tenenbaum, J. B., Solar-Lezama, A., and Tavares, Z. Combining functional and automata synthesis to discover causal reactive programs. Proc. ACM Program. Lang., 7(POPL), 2023. doi: 10.1145/3571249.

De Raedt, L., Kimmig, A., and Toivonen, H. ProbLog: A probabilistic Prolog and its application in link discovery. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pp. 2468-2473. Morgan Kaufmann Publishers Inc., 2007.

Diuk, C., Cohen, A., and Littman, M. L. An object-oriented representation for efficient reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 240-247, 2008.

Ellis, K. Human-like few-shot learning via Bayesian reasoning over natural language. NeurIPS, 2023.

Ellis, K., Wong, L., Nye, M., Sable-Meyer, M., Cary, L., Anaya Pozo, L., Hewitt, L., Solar-Lezama, A., and Tenenbaum, J. B. DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning. Philosophical Transactions of the Royal Society A, 381(2251):20220050, 2023.

Evans, R., Bošnjak, M., Buesing, L., Ellis, K., Pfau, D., Kohli, P., and Sergot, M. Making sense of raw input. Artificial Intelligence, 299:103521, 2021.

Feng, D., Gomes, C. P., and Selman, B. A novel automated curriculum strategy to solve hard Sokoban planning instances. In Advances in Neural Information Processing Systems, volume 33, pp. 3141-3152. Curran Associates, Inc., 2020.

Feng, D., Gomes, C., and Selman, B. Left heavy tails and the effectiveness of the policy and value networks in DNN-based best-first search for Sokoban planning, 2022.

Goodman, N. D., Mansinghka, V. K., Roy, D., Bonawitz, K., and Tenenbaum, J. B. Church: A language for generative models. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI'08, pp. 220-229. AUAI Press, 2008.

Goyal, A., Didolkar, A., Ke, N. R., Blundell, C., Beaudoin, P., Heess, N., Mozer, M., and Bengio, Y. Neural production systems: Learning rule-governed visual dynamics. arXiv preprint arXiv:2103.01937, 2021.


Grand, G., Wong, L., Bowers, M., Olausson, T. X., Liu, M., Tenenbaum, J. B., and Andreas, J. LILO: Learning interpretable libraries by compressing and documenting code, 2023.

Guan, L., Valmeekam, K., Sreedharan, S., and Kambhampati, S. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. NeurIPS, 2023.

Ha, D. and Schmidhuber, J. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31, pp. 2451-2463. Curran Associates, Inc., 2018. https://worldmodels.github.io.

Hafner, D., Lillicrap, T., Norouzi, M., and Ba, J. Mastering Atari with discrete world models. ICLR, 2021.

Hafner, D., Pasukonis, J., Ba, J., and Lillicrap, T. Mastering diverse domains through world models, 2023.

Hallak, A., Di Castro, D., and Mannor, S. Contextual Markov decision processes. arXiv preprint arXiv:1502.02259, 2015.

Hao, S., Gu, Y., Ma, H., Hong, J. J., Wang, Z., Wang, D. Z., and Hu, Z. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023.

Hu, Y., Lin, F., Zhang, T., Yi, L., and Gao, Y. Look before you leap: Unveiling the power of GPT-4V in robotic vision-language planning. arXiv preprint arXiv:2311.17842, 2023.

Kaelbling, L. P. and Lozano-Pérez, T. Hierarchical task and motion planning in the now. In 2011 IEEE International Conference on Robotics and Automation, pp. 1470-1477, 2011. doi: 10.1109/ICRA.2011.5980391.

Kansky, K., Silver, T., Mély, D. A., Eldawy, M., Lázaro-Gredilla, M., Lou, X., Dorfman, N., Sidor, S., Phoenix, S., and George, D. Schema networks: Zero-shot transfer with a generative causal model of intuitive physics. In International Conference on Machine Learning, pp. 1809-1818. PMLR, 2017.

Konidaris, G., Kaelbling, L. P., and Lozano-Perez, T. From skills to symbols: Learning symbolic representations for abstract high-level planning. Journal of Artificial Intelligence Research, 61:215-289, 2018.

Kujanpää, K., Pajarinen, J., and Ilin, A. Hybrid search for efficient planning with completeness guarantees. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=bY0c46ZtXa.

Li, K., Hopkins, A. K., Bau, D., Viégas, F., Pfister, H., and Wattenberg, M. Emergent world representations: Exploring a sequence model trained on a synthetic task. ICLR, 2023.

Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., and Zeng, A. Code as policies: Language model programs for embodied control. arXiv preprint arXiv:2209.07753, 2022.

Liu, Q., Netrapalli, P., Szepesvari, C., and Jin, C. Optimistic MLE: A generic model-based algorithm for partially observable sequential decision making. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pp. 363-376, 2023a.

Liu, Z., Hu, H., Zhang, S., Guo, H., Ke, S., Liu, B., and Wang, Z. Reason for future, act for now: A principled framework for autonomous LLM agents with provable sample efficiency. arXiv preprint arXiv:2309.17382, 2023b.

Ma, Y. J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., and Anandkumar, A. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023.

Mao, J., Lozano-Pérez, T., Tenenbaum, J., and Kaelbling, L. PDSketch: Integrated domain programming, learning, and planning. Advances in Neural Information Processing Systems, 35:36972-36984, 2022.

Micheli, V., Alonso, E., and Fleuret, F. Transformers are sample-efficient world models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=vhFu1Acb0xb.

Nottingham, K., Ammanabrolu, P., Suhr, A., Choi, Y., Hajishirzi, H., Singh, S., and Fox, R. Do embodied agents dream of pixelated sheep: Embodied decision making using language guided world modelling. In International Conference on Machine Learning (ICML), 2023.

Olausson, T. X., Inala, J. P., Wang, C., Gao, J., and Solar-Lezama, A. Is self-repair a silver bullet for code generation?, 2023.

Qiu, L., Jiang, L., Lu, X., Sclar, M., Pyatkin, V., Bhagavatula, C., Wang, B., Kim, Y., Choi, Y., Dziri, N., and Ren, X. Phenomenal yet puzzling: Testing inductive reasoning capabilities of language models with hypothesis refinement, 2023.

Raffin, A. RL Baselines3 Zoo. https://github.com/DLR-RM/rl-baselines3-zoo, 2020.


Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., and Dormann, N. Stable-Baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1-8, 2021. URL http://jmlr.org/papers/v22/20-1364.html.

Rodriguez-Sanchez, R., Spiegel, B. A., Wang, J., Patel, R., Tellex, S., and Konidaris, G. RLang: A declarative language for describing partial world knowledge to reinforcement learning agents, 2023.

Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., Wen, Z., et al. A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1-96, 2018.

Saad, F. Scalable Structure Learning, Inference, and Analysis with Probabilistic Programs. PhD thesis, Massachusetts Institute of Technology, 2022.

Schrader, M.-P. B. gym-sokoban. https://github.com/mpSchrader/gym-sokoban, 2018.

Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604-609, 2020.

Sehgal, A., Grayeli, A., Sun, J. J., and Chaudhuri, S. Neurosymbolic grounding for compositional world models, 2023.

Sun, H., Zhuang, Y., Kong, L., Dai, B., and Zhang, C. AdaPlanner: Adaptive planning from feedback with language models. NeurIPS, 2023a.

Sun, L., Jha, D. K., Hori, C., Jain, S., Corcodel, R., Zhu, X., Tomizuka, M., and Romeres, D. Interactive planning using large language models for partially observable robotics tasks, 2023b.

Sutton, R. S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pp. 216-224. Elsevier, 1990.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Tang, H. and Ellis, K. From perception to programs: Regularize, overparameterize, and amortize. International Conference on Machine Learning (ICML), 2023.

Tolman, E. C. and Honzik, C. H. Introduction and removal of reward, and maze performance in rats. University of California Publications in Psychology, 1930.

Tsividis, P. A., Loula, J., Burga, J., Foss, N., Campero, A., Pouncy, T., Gershman, S. J., and Tenenbaum, J. B. Human-level reinforcement learning through theory-based modeling, exploration, and planning. arXiv preprint arXiv:2107.12544, 2021.

Valmeekam, K., Marquez, M., Sreedharan, S., and Kambhampati, S. On the planning abilities of large language models - a critical investigation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=X6dEqXIsEW.

Verma, A., Murali, V., Singh, R., Kohli, P., and Chaudhuri, S. Programmatically interpretable reinforcement learning. In International Conference on Machine Learning, pp. 5045-5054. PMLR, 2018.

Verma, A., Le, H., Yue, Y., and Chaudhuri, S. Imitation-projected programmatic reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.

Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023a.

Wang, H., Gonzalez-Pumariega, G., Sharma, Y., and Choudhury, S. Demo2Code: From summarizing demonstrations to synthesizing code via extended chain-of-thought. NeurIPS, 2023b.

Wang, R., Zelikman, E., Poesia, G., Pu, Y., Haber, N., and Goodman, N. D. Hypothesis search: Inductive reasoning with language models, 2023c.

Whittle, P. Arm-acquiring bandits. The Annals of Probability, 9(2):284-292, 1981.

Wong, L., Mao, J., Sharma, P., Siegel, Z. S., Feng, J., Korneev, N., Tenenbaum, J. B., and Andreas, J. Learning adaptive planning representations with natural language guidance, 2023.

Xiang, J., Tao, T., Gu, Y., Shu, T., Wang, Z., Yang, Z., and Hu, Z. Language models meet world models: Embodied experiences enhance language models. NeurIPS, 2023.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.

Zhao, Z., Lee, W. S., and Hsu, D. Large language models as commonsense knowledge for large-scale task planning. NeurIPS, 2023.

Zhu, F. and Simmons, R. Bootstrapping cognitive agents with a large language model. In AAAI, 2024.


A. Theoretical analysis of Sample Efficiency using optimism under uncertainty (ϕ2 )


The optimism under uncertainty objective (ϕ1 ∧ ϕ2 in Sec. 2) is much more sample-efficient than the traditional data-consistency objective (ϕ1), as shown in Figure 6 in our experiments. We will later show a learning trajectory for it in the MiniGrid-UnlockPickup-v0 environment in Appendix A.1. This UnlockPickup environment is difficult for exploration due to its large search space. PPO failed to gain enough positive-reward supervision signal after 3 × 10^8 actions, as shown in Figure 7. Our method failed without the optimism objective as well. Nevertheless, our method with the optimism objective learned the correct world model from scratch with no more than 100 actions, thanks to its better sample efficiency in exploration. Here we provide a simple theorem stating that significantly better sample efficiency is theoretically guaranteed when using this objective (polynomial w.r.t. the size of the state space and the solution space), following an intuitive observation:

Observation. When planning with a world model that satisfies ϕ1 ∧ ϕ2, there are only two possible outcomes:

• Either the model is correct: the agent achieves the goal successfully;
• Or the model is incorrect: the agent then must find a counter-example to its world model's predictions and gains more knowledge of the environment.

This shows that a world model satisfying ϕ1 ∧ ϕ2 is either correct or efficient at guiding the agent to a counter-example of the current world model. Counter-examples are precious because, when searching for the correct world model, each interaction datum (s, a, s′, r, d) in the replay buffer imposes a constraint that the potentially correct world models must satisfy. Collecting data that can be explained by all potentially correct world models is useless, because it implies no stricter constraints than the current ones. Only counter-examples, which cannot be explained by the current world model, shrink the set of potentially correct world models, which eventually leads to a set that contains only the correct world model.
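
The role of counter-examples can be pictured with a toy version-space filter (purely illustrative; WorldCoder never enumerates the model space explicitly):

def surviving_models(candidate_models, replay_buffer, explains):
    """Keep only candidate world models that explain every replayed experience.
    A new datum shrinks this set only if it is a counter-example to at least one
    surviving model; data that all survivors already explain change nothing."""
    return [m for m in candidate_models
            if all(explains(m, datum) for datum in replay_buffer)]
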
The optimism under uncertainty objective is much more sample-efficient at obtaining these valuable counter-examples than random exploration. For deterministic environments, the number of actions needed to find a counter-example with a world model that satisfies ϕ1 ∧ ϕ2 is guaranteed to be smaller than the size of the state space for an MDP, or of the history space for a POMDP, as there is no need to revisit the same state/history. Random exploration may need exponentially many more actions to find such a counter-example in the worst case. (Note that we do not assume free resets to all possible states, so the agent needs to traverse trajectories from the initial state to each target state.)
More formally, we show that the maximum number of actions to learn/find the correct world model is polynomial w.r.t. the
size of the state space and the size of the model space for deterministic MDPs as follows:
Definition A.1. A set of data, D, is mutually independent w.r.t. a solution space, M, written D ⊥⊥ M, if and only if for every datum in the set there exists a solution in the solution space that explains every other datum in the set but not that one:
∀d ∈ D, ∃m ∈ M : m ⊬ d ∧ (∀d′ ∈ D − {d}, m ⊢ d′).

Denote the maximum size of a mutually independent data set w.r.t. a solution space M as K_{⊥⊥M} = max{|D| : D ∈ 𝒟, D ⊥⊥ M}, where 𝒟 is the space of data sets. It is then straightforward to prove:
Lemma A.2. K_{⊥⊥M} ≤ |M|, i.e., the maximum size of a mutually independent data set w.r.t. a solution space is at most the size of the solution space.

Proof: each data point d ∈ D must exclude at least one solution that no other data point in D excludes (namely, the witness solution that explains all other data but not d); otherwise the data set would not be mutually independent w.r.t. the solution space. Hence |D| ≤ |M|.
Note that this is the loosest upper bound on K_{⊥⊥M}: we assume nothing about the model space or the learning algorithm. In practice, K_{⊥⊥M} should be much smaller than the size of the solution space. For example, for linear models over R^n, only n independent data points are needed to characterize the correct solution.
Theorem A.3. (Guaranteed sample efficiency when using the optimism under uncertainty objective.) When using the optimism under uncertainty objective (ϕ1 ∧ ϕ2 in Sec. 2), the correct world model is guaranteed to be found in no more than |S| × (K_{⊥⊥T×R} + 1) ≤ |S| × (|T| × |R| + 1) actions for a deterministic MDP environment M = (S, A, T, R, γ), where T and R denote the spaces of transition functions and reward functions, respectively, and K_{⊥⊥T×R} denotes the maximum size of a data set {(s, a, s′, r, d)} that is mutually independent w.r.t. the world model space, T × R.


Informal Proof: Lemma α: Each replay buffer data set, D, can be represented by its maximum mutually independent data
subset, D⊥⊥T ×R , from the perspective of finding the correct world model, i.e.,
D⊥⊥T×R ⊆ D  ∧  D⊥⊥T×R ⊥⊥ T × R  ∧  ∀(T̂, R̂) ∈ T × R : (T̂, R̂) ⊢ D ⇔ (T̂, R̂) ⊢ D⊥⊥T×R.

Lemma β: For the replay buffer, D(t), at any step t, if a world model (T̂(t), R̂(t)) satisfies the optimism under uncertainty
objective as well as the traditional data-consistency objective, i.e., (T̂(t), R̂(t)) ⊢ D(t) ∧ (T̂(t), R̂(t)) ⊢ ϕ2, then either the world model
is correct, which means it can guide the agent to the goal successfully, or the agent finds a counter-example, d′, to the current
world model within |S| steps.
Lemma γ: This counter-example, d′, is mutually independent of the replay buffer, D(t), because there is a model (T̂(t), R̂(t))
such that (T̂(t), R̂(t)) ⊬ d′ ∧ (T̂(t), R̂(t)) ⊢ D(t).
Given these lemmas, we have
|D(t+|S|)⊥⊥T×R| ≥ |D(t)⊥⊥T×R| + 1
and therefore
|D(|S|×(K⊥⊥T×R+1))⊥⊥T×R| ≥ K⊥⊥T×R + 1 + |D(0)⊥⊥T×R| ≥ K⊥⊥T×R + 1 > K⊥⊥T×R.
Assume the world model after |S| × (K⊥⊥T×R + 1) steps were still incorrect; we would then have built a mutually independent
data set, D(|S|×(K⊥⊥T×R+1))⊥⊥T×R, of size larger than K⊥⊥T×R, which contradicts the definition of K⊥⊥T×R.
The proofs of Lemma α and Lemma γ follow straightforwardly from the definitions. We prove Lemma β by construction. Given the
definition of the optimism under uncertainty objective, ϕ2, in Sec. 2:

ϕ2(s0, c, T̂(t), R̂(t)) = ∃ a1, s1, a2, s2, ..., aℓ, sℓ : (∀i ∈ [ℓ] : T̂(t)(si−1, ai) = si) ∧ (∃ r > 0 : R̂(t)(c)(sℓ−1, aℓ, sℓ) = (r, 1)),

there exists a trajectory a1, s1, a2, s2, ..., aℓ, sℓ such that either the model correctly leads the agent to the goal, i.e., ∃ r >
0 : R(c)(sℓ−1, aℓ, sℓ) = (r, 1), or there exists a counter-example, i.e., ∃ i ∈ [ℓ] : T̂(t)(si−1, ai) ̸= T(si−1, ai) ∨
R̂(t)(c)(si−1, ai, si) ̸= R(c)(si−1, ai, si), which means (T̂(t), R̂(t)) ⊬ (si−1, ai, si, ri, di). Since the environment is
deterministic, a shortest such trajectory never revisits a state, so ℓ ≤ |S|. We also have (T̂(t), R̂(t)) ⊢ D(t) because
(T̂(t), R̂(t)) ⊢ ϕ1. This proves Lemma β.

A.1. Example learning trajectory using (ϕ1 ∧ ϕ2) in MiniGrid-UnlockPickup


To demonstrate the effectiveness of the optimism objective (ϕ2 in Sec. 2) in improving sample efficiency through guided
exploration, we show here an example learning trajectory in the MiniGrid-UnlockPickup-v0 environment.
In comparison with agents that only use the traditional data-consistency objective (ϕ1), and therefore rely merely on random
exploration to gain new knowledge of the world from new data, agents that additionally use the optimism under uncertainty
objective can (a minimal sketch of this loop follows the list)

• imagine the functionality of actions and how they interact with the necessary tools to achieve the goal, without
any real interactions with the environment;

• explore much more efficiently guided by the imagined world model;

• and correct the imagined world model given the newly collected data after the efficient exploration.
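The sketch below outlines this imagine / explore / refine loop in Python. It is hypothetical: synthesize_world_model, plan, and explains are placeholder names standing in for the components described in Sec. 2, and the environment interface is simplified, so this is not the exact implementation.

def optimistic_model_learning(env, mission, replay_buffer, max_rounds=20):
    world_model = None
    for _ in range(max_rounds):
        # Imagine: synthesize code that explains the replay buffer (phi_1)
        # and is optimistic about reaching the goal (phi_2).
        world_model = synthesize_world_model(replay_buffer, mission)
        state, done = env.reset(), False
        while not done:
            # Explore: act according to a plan computed on the imagined model.
            action = plan(world_model, state, mission)
            next_state, reward, done = env.step(action)
            experience = (state, action, next_state, reward, done)
            replay_buffer.append(experience)
            # Refine: re-synthesize as soon as a counter-example appears.
            if not explains(world_model, experience):
                break
            state = next_state
        if done and reward > 0:
            return world_model  # the imagined model led the agent to the goal
    return world_model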

The goal-driven exploration is guaranteed to be efficient, as proved in Appendix A. The learning trajectory is as follows:

Original reward function: NULL. Original transit function: NULL.
Random exploration: no feasible move yet except for rotation.
Data-consistent transit-func (figured out how to rotate):
def turn_right(direction):
if direction == (-1, 0):
return (0, -1)
elif direction == (1, 0):
...
def turn_left(direction):
if direction == (-1, 0):
return (0, 1)
elif direction == (1, 0):
...
def transition(state, action):
agents = get_entities_by_name(state, "Agent")
assert len(agents) == 1, "There should be exactly one agent
in the state"
agent = agents[0]
if action == "turn right":
agent.direction = turn_right(agent.direction)
elif action == "turn left": # Add a condition for the "turn
left" action
agent.direction = turn_left(agent.direction)
# The other actions does not seem to affect the state
return state

Data-consistent reward-func
(random guess given the textual mission)

def reward_func(state, action, next_state):


# If the state does not change after action, the reward is
0.0 and not done
if state == next_state:
return 0.0, False
agent_n, agent_o = get_entities_by_name(next_state, "Agent")
[0], get_entities_by_name(state, "Agent")[0]
box_n, box_o = get_entities_by_name(next_state, "Box")[0],
get_entities_by_name(state, "Box")[0]
# If the agent’s position gets closer to the box, give a
bigger reward
if abs(agent_n.x - box_n.x) + abs(agent_n.y - box_n.y) < abs
(agent_o.x - box_o.x) + abs(agent_o.y - box_o.y):
return 1.0, False
# If the agent changes direction but not position, the
reward is 0.0 and not done
elif agent_o.x == agent_n.x and agent_o.y == agent_n.y and
agent_o.direction != agent_n.direction:
return 0.0, False
# If the agent picks up the green box, give a big reward and
finish
elif agent_n.carrying and agent_n.carrying.color == "green":
return 100.0, True
# If the agent moves but does not make significant progress,
give small penalty to make it efficient
else:
return -0.01, False

Goal-driven transit-func
(imagines how to move forward without interactions)
(imperfect, though, as it does not consider the locked door)

def transition(state, action):


...
elif action == ’move forward’:
dx, dy = agent.direction
new_x, new_y = X + dx, Y + dy
if not get_entities_by_name(get_entities_by_position(
next_state, new_x, new_y), ’Wall’):
agent.x, agent.y = new_x, new_y
...

Goal-driven reward-func
(a more efficient representation; already correct)

def reward_func(state, action, next_state):


# Get all agent entities in the next state
next_agents = get_entities_by_name(next_state, ’Agent’)
# Check if any agent is carrying a Box
for next_agent_carrying in [agent.carrying for agent in
next_agents]:
if isinstance(next_agent_carrying, Box) and
next_agent_carrying.color == ’green’:
return 1.0, True
return 0.0, False

Guided by imagined transit-func: trying to cross a locked door.
Data-consistent transit-func (figured out the agent cannot step on the wall or the locked door,
but still does not know how to reach the goal):

def transition(state, action):


...
elif action == ’move forward’:
dx, dy = agent.direction
new_x, new_y = X + dx, Y + dy
door_at_new_loc = get_entities_by_name(
get_entities_by_position(next_state, new_x, new_y), ’
Door’)
if door_at_new_loc:
door_at_new_loc = door_at_new_loc[0]
if door_at_new_loc.state == "locked":
return next_state
if not get_entities_by_name(get_entities_by_position(
next_state, new_x, new_y), ’Wall’):
agent.x, agent.y = new_x, new_y
...

Goal-driven transit-func
(imagines how to pick up, toggle, and drop, without interactions)
(imperfect, though, as it tries to pick up at the agent's own position)

def transition(state, action):


...
elif action == ’pick up’:
items_at_agent_location = get_entities_by_position(
next_state, agent.x, agent.y)
pickable_items = [item for item in
items_at_agent_location if item.name not in [’Door’,’
Wall’, ’Agent’]]
if pickable_items:
agent.carrying = pickable_items[0]
next_state.remove(pickable_items[0])
elif action == ’drop’:
if agent.carrying:
agent.carrying.x, agent.carrying.y = agent.x, agent.
y
next_state.append(agent.carrying)
agent.carrying = None
elif action == ’toggle’:
if isinstance(agent.carrying, Key):
door_at_right = [door for door in
get_entities_by_position(next_state, right_x,
right_y) if door.name == ’Door’]
if door_at_right and door_at_right[0].color == agent
.carrying.color:
door_at_right[0].state = ’unlocked’
return next_state

Guided by imagined transit-func: trying to step on the key to pick it up.
Data-consistent transit-func (figured out the agent cannot step on the key,
but still does not know how to reach the goal):

...
def transition(state, action):
...
elif action == ’move forward’:
entities_at_front = [entity for entity in
get_entities_by_position(next_state, front_x, front_y)
if entity.name != ’Agent’]
if entities_at_front:
if entities_at_front[0].name == "Door" and
entities_at_front[0].state == "unlocked":
agent.x, agent.y = front_x, front_y
# other entities should prevent agent from moving
forward
else:
agent.x, agent.y = front_x, front_y
elif action == ’pick up’:
items_at_agent_location = get_entities_by_position(
next_state, agent.x, agent.y)
pickable_items = [item for item in
items_at_agent_location if item.name not in [’Door’,’
Wall’, ’Agent’]]
if pickable_items:
agent.carrying = pickable_items[0]
next_state.remove(pickable_items[0])
...

Goal-driven transit-func
(imagines the agent can also pick up the object in front of it, which is correct, without interactions)

def transition(state, action):


...
elif action == ’pick up’:
items_at_agent_location = get_entities_by_position(
next_state, agent.x, agent.y)
items_at_front_location =
get_entities_by_position(next_state, front_x, front_y)
pickable_items = [item for item in
items_at_agent_location +
items_at_front_location if item.name not in [’Door’,’
Wall’, ’Agent’]]
if pickable_items:
agent.carrying = pickable_items[0]
next_state.remove(pickable_items[0])
...

Guided by imagined transit-func: trying to toggle the door.
Data-consistent transit-func (corrects some details about opening the door):

...
def toggle_door(agent, next_state, next_x, next_y):
doors_in_next_position = [door for door in
get_entities_by_position(next_state, next_x, next_y) if door
.name == ’Door’]
if doors_in_next_position and doors_in_next_position[0].
color == agent.carrying.color :
doors_in_next_position[0].state = ’open’
...

Guided by imagined transit-func: trying to pick up the box while carrying the key.
Goal-driven transit-func (figured out it cannot pick up objects while carrying others; improves the
drop action that was imagined before) (SUCCEEDS!):

...
def drop_item(agent, next_state, next_x, next_y):
entities_in_next_position = get_entities_by_position(
next_state, next_x, next_y)
if not entities_in_next_position and agent.carrying:
# Drop can only drop object if there’s no obstacle and
agent carries something.
agent.carrying.x, agent.carrying.y = next_x, next_y
next_state.append(agent.carrying)
agent.carrying = None
...
def check_no_obstacle_between(agent, next_state, x, y):
dx, dy = x - agent.x, y - agent.y
for i in range(min(abs(dx), abs(dy))):
entities_at_next_position = get_entities_by_position(
next_state, agent.x + i * dx, agent.y + i * dy)
if any(isinstance(entity, Wall) or (isinstance(entity,
Door) and entity.state != ’open’) for entity in
entities_at_next_position):
return False
return True
def pickup_item(agent, next_state):
items_in_current_location = get_entities_by_position(
next_state, agent.x, agent.y)
pickable_items = [item for item in items_in_current_location
if item.name not in [’Door’, ’Wall’, ’Agent’]]
if agent.carrying is None: # Agent can only pick up an item
when it is not carrying an item
if not pickable_items:
dx, dy = agent.direction
facing_x, facing_y = agent.x + dx, agent.y + dy
if check_no_obstacle_between(agent, next_state,
facing_x, facing_y): # Make sure there is no wall
or door between the agent and the item
items_in_facing_location =
get_entities_by_position(next_state, facing_x,
facing_y)
pickable_items = [item for item in
items_in_facing_location if item.name not in [’
Door’, ’Wall’]]
if pickable_items:
agent.carrying = pickable_items[0]
next_state.remove(pickable_items[0])
...

The final synthesized world model code

class Entity:
def __init__(self, x, y, **kwargs):
self.name = self.__class__.__name__
self.x = x
self.y = y
for key, value in kwargs.items():
setattr(self, key, value)
def __repr__(self):
attr = ’, ’.join(f’{key}={value}’ for key, value in self.__dict__.items() if key not in (’
name’, ’x’, ’y’))
if attr: return f"{self.name}({self.x}, {self.y}, {attr})"

else: return f"{self.name}({self.x}, {self.y})"


def __eq__(self, other):
return all(getattr(self, key) == getattr(other, key, None) for key in self.__dict__.keys()
)
def __hash__(self):
return hash(tuple(sorted(self.__dict__.items())))
class Agent(Entity): pass
class Key(Entity): pass
class Door(Entity): pass
class Goal(Entity): pass
class Wall(Entity): pass
class Box(Entity): pass
class Ball(Entity): pass
class Lava(Entity): pass
def update_direction(agent, action):
all_directions = [(0, -1), (-1, 0), (0, 1), (1, 0)]
current_dir_idx = all_directions.index(agent.direction)
if action == ’turn right’:
agent.direction = all_directions[(current_dir_idx - 1) % 4]
else: # turn left
agent.direction = all_directions[(current_dir_idx + 1) % 4]
def drop_item(agent, next_state, next_x, next_y):
entities_in_next_position = get_entities_by_position(next_state, next_x, next_y)
if not entities_in_next_position and agent.carrying:
# Drop can only drop object if there’s no obstacle and agent carries something.
agent.carrying.x, agent.carrying.y = next_x, next_y
next_state.append(agent.carrying)
agent.carrying = None
def toggle_door(agent, next_state, next_x, next_y):
doors_in_next_position = [door for door in get_entities_by_position(next_state, next_x, next_y
) if door.name == ’Door’]
if doors_in_next_position and doors_in_next_position[0].color == agent.carrying.color :
doors_in_next_position[0].state = ’open’
def get_entities_by_name(entities, name):
return [ entity for entity in entities if entity.name == name ]
def get_entities_by_position(entities, x, y):
return [ entity for entity in entities if entity.x == x and entity.y == y ]
def check_no_obstacle_between(agent, next_state, x, y):
dx, dy = x - agent.x, y - agent.y
for i in range(min(abs(dx), abs(dy))):
entities_at_next_position = get_entities_by_position(next_state, agent.x + i * dx, agent.y
+ i * dy)
if any(isinstance(entity, Wall) or (isinstance(entity, Door) and entity.state != ’open’)
for entity in entities_at_next_position):
return False
return True
def pickup_item(agent, next_state):
items_in_current_location = get_entities_by_position(next_state, agent.x, agent.y)
pickable_items = [item for item in items_in_current_location if item.name not in [’Door’, ’
Wall’, ’Agent’]]
if agent.carrying is None: # Agent can only pick up an item when it is not carrying an item
if not pickable_items:
dx, dy = agent.direction
facing_x, facing_y = agent.x + dx, agent.y + dy
if check_no_obstacle_between(agent, next_state, facing_x, facing_y): # Make sure
there is no wall or door between the agent and the item
items_in_facing_location = get_entities_by_position(next_state, facing_x, facing_y
)
pickable_items = [item for item in items_in_facing_location if item.name not in [’
Door’, ’Wall’]]
if pickable_items:
agent.carrying = pickable_items[0]
next_state.remove(pickable_items[0])
def transition(state, action):

next_state = list(state)
agent = get_entities_by_name(next_state, ’Agent’)[0]
dx, dy = agent.direction
front_x, front_y = agent.x + dx, agent.y + dy
if action == ’turn right’ or action == ’turn left’:
update_direction(agent, action)
elif action == ’move forward’:
update_position(agent, next_state, front_x, front_y)
elif action == ’pick up’:
pickup_item(agent, next_state)
elif action == ’drop’:
drop_item(agent, next_state, front_x, front_y)
elif action == ’toggle’:
toggle_door(agent, next_state, front_x, front_y)
return next_state
def update_position(agent, next_state, next_x, next_y):
entities_at_next_position = get_entities_by_position(next_state, next_x, next_y)
if not any(
(
isinstance(entity, Wall) or
isinstance(entity, Box) or
isinstance(entity, Ball) or
isinstance(entity, Lava) or
(isinstance(entity, Door) and entity.state != ’open’) or
isinstance(entity, Key)
)
for entity in entities_at_next_position
):
agent.x, agent.y = next_x, next_y
else:
agent.x, agent.y = agent.x, agent.y # Agent stays in place

B. More Experimental Results


B.1. PPO for MiniGrid

[Figure 7 plot: learning curves for PPO on the MiniGrid environments empty, doorkey, unlock, fetch, and unlockpickup;
x-axis: environment steps (log scale, 10^0 to 10^9), y-axis: solve rate (0.0 to 1.0).]

Figure 7. Performance of PPO in MiniGrid environments.

We evaluate PPO, as a deep RL baseline, in the MiniGrid experiments. We use the tuned hyper-parameters from RL Baselines3
Zoo (Raffin, 2020) for each environment and use the same symbolic memory state as ours. As shown in Figure 7, PPO
is far less sample-efficient than our approach. It needs 10^4-10^5 steps to learn valid (though not perfect) policies in most
environments, and it cannot solve the UnlockPickup environment even within 3 × 10^8 steps.
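For reference, a rough sketch of how such a PPO baseline can be run with Stable Baselines3 is shown below. The standard MiniGrid image wrappers stand in for the symbolic memory state used in our experiments, and the tuned Zoo hyperparameters are omitted, so this is an approximation rather than the exact setup.

import gymnasium as gym
from minigrid.wrappers import FullyObsWrapper, ImgObsWrapper
from stable_baselines3 import PPO

# Fully observed grid rendered as an image observation (an approximation of the
# symbolic state; the actual experiments use our own symbolic memory wrapper).
env = ImgObsWrapper(FullyObsWrapper(gym.make("MiniGrid-UnlockPickup-v0")))
model = PPO("MlpPolicy", env, verbose=1)  # tuned RL Baselines3 Zoo hyperparameters would go here
model.learn(total_timesteps=1_000_000)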

B.2. PPO for Sokoban


The PPO baseline was implemented using Stable Baselines3 (Raffin et al., 2021) and the gym-sokoban library (Schrader,
2018) with 256 parallel environments, a batch_size of 2048, a horizon of 50 steps, and otherwise the default hyperparameters
of the Stable Baselines3 library (n_steps 2048, learning rate 0.0003, gamma 0.99, n_epochs 10, gae_lambda 0.95,
clip_range 0.2, normalized advantage, ent_coef 0, vf_coef 0.5).
The input to the network is a (3, 7, 7) tensor corresponding to the 7-by-7 Sokoban grid, with each cell of the grid represented
as an RGB pixel.
The policy network was a convolutional neural network with three convolutional layers of 16, 32, and 64 filters, each
with a (2, 2) kernel and a stride of 1, followed by a linear layer with 9 outputs corresponding to the 9 possible
actions. All layers are separated by rectified linear units (ReLU).
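A sketch of this policy network in PyTorch follows; the layer sizes are taken from the description above, while everything else (such as the omitted value head) is a simplification.

import torch
import torch.nn as nn

class SokobanPolicyNet(nn.Module):
    def __init__(self, n_actions: int = 9):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=2, stride=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=2, stride=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=2, stride=1), nn.ReLU(),
            nn.Flatten(),
            # a 7x7 grid shrinks to 4x4 after three 2x2 convolutions with stride 1
            nn.Linear(64 * 4 * 4, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, 3, 7, 7), each grid cell rendered as one RGB pixel
        return self.net(obs)

logits = SokobanPolicyNet()(torch.zeros(1, 3, 7, 7))  # -> shape (1, 9)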

B.3. DreamerV3 for Sokoban


The DreamerV3 baseline was implemented using the simple training script provided by the open-source DreamerV3 library
created by Danijar Hafner (Hafner et al., 2023), and the gym-sokoban library (Schrader, 2018) with default hyperparameters
and 1 environment.

C. Example of synthesized world models for Sokoban


Synthesized transition function for Sokoban

def transition(state, action):


"""
Args:
state: a set of entities representing the state of the environment
action: the action can be "move right", "move left", "move up", "move down"
Returns:
next_state: the next state of the environment
"""
# here we define how the player coordinates change for each action
action_to_delta = {
"move right": (1, 0),
"move left": (-1, 0),
"move up": (0, -1),
"move down": (0, 1)
}
# Here we get the player and the boxes in the current state
player = get_entities_by_name(state, ’Player’)[0]
boxes = get_entities_by_name(state, ’Box’)
walls = get_entities_by_name(state, ’Wall’)
# Then, we calculate the new player position according to the action
delta_x, delta_y = action_to_delta[action]
new_player_x = player.x + delta_x
new_player_y = player.y + delta_y
# We check if the new player position is a Wall
if get_entities_by_position(walls, new_player_x, new_player_y):
# If so, the player does not move
pass
else:
# If not, the player moves to the new position
pushed_box = get_entities_by_position(boxes, new_player_x, new_player_y)
if pushed_box:
pushed_box_x = pushed_box[0].x + delta_x
pushed_box_y = pushed_box[0].y + delta_y
# Check if there is a wall or other box at the pushed box destination
if get_entities_by_position(boxes + walls, pushed_box_x, pushed_box_y):
# If so, the player and the box do not move

pass
else:
# If not, the box moves to the new position
pushed_box[0].x += delta_x
pushed_box[0].y += delta_y
player.x += delta_x
player.y += delta_y
else:
player.x += delta_x
player.y += delta_y
return state

Synthesized reward function for Sokoban

def reward_func(state, action, next_state):


reward = -0.1
done = False
boxes_prev = get_entities_by_name(state, ’Box’)
targets_prev = get_entities_by_name(state, ’Target’)
boxes_next = get_entities_by_name(next_state, ’Box’)
targets_next = get_entities_by_name(next_state, ’Target’)
for box in boxes_next:
if any(box.x == target.x and box.y == target.y for target in targets_next):
if not any(box.x == prev_box.x and box.y == prev_box.y for prev_box in boxes_prev
if any(prev_box.x == prev_target.x and prev_box.y == prev_target.y for
prev_target in targets_prev)):
reward += 1
for box in boxes_prev:
if any(box.x == target.x and box.y == target.y for target in targets_prev):
if not any(box.x == next_box.x and box.y == next_box.y for next_box in boxes_next
if any(next_box.x == next_target.x and next_box.y == next_target.y for
next_target in targets_next)):
reward -= 1
if all(any(box.x == target.x and box.y == target.y for target in targets_next) for box in
boxes_next):
reward += 10
done = True
return reward, done
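Once synthesized, functions like these can be consumed directly by a planner. Below is a minimal, hypothetical breadth-first planner over such a world model (our illustration, not the planner used in the experiments); it treats transition and reward_func as black boxes and uses repr(state) as a visited key.

from collections import deque
import copy

def bfs_plan(transition, reward_func, init_state, actions, max_depth=50):
    # Search action sequences in the synthesized world model until one reaches
    # a terminal transition (done == True); return that sequence or None.
    frontier = deque([(init_state, [])])
    visited = {repr(init_state)}
    while frontier:
        state, plan = frontier.popleft()
        if len(plan) >= max_depth:
            continue
        for action in actions:
            next_state = transition(copy.deepcopy(state), action)
            _, done = reward_func(state, action, next_state)
            if done:
                return plan + [action]
            key = repr(next_state)
            if key not in visited:
                visited.add(key)
                frontier.append((next_state, plan + [action]))
    return None

For the Sokoban model above, actions would be the four "move ..." strings and init_state a list of Player, Box, Wall, and Target entities.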

D. Prompts
We list all the prompts used in our experiments in this section. The functionality of each prompt is stated in
its subsection title. We highlight the dynamic information in yellow and the main instruction in blue. The dynamic
information includes the data collected so far in the replay buffer and the code synthesized so far by previous LLM calls.

D.1. Initializing the transition function


It asks LLMs to generate a transition function (s, a) → s′, following the code template, that models seven experience
tuples (s, a, s′) sampled uniformly at random from the replay buffer.
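A hypothetical sketch of how such a prompt could be assembled is shown below; build_transition_prompt, render_state, and describe_diff are placeholder names, not the code used in the experiments. The actual prompt follows.

import random

def build_transition_prompt(replay_buffer, instructions, template, n_examples=7):
    # Sample seven (state, action, next_state) experiences uniformly at random
    # and splice them, with their textual diffs, between the instructions and
    # the code template the LLM must follow.
    # render_state and describe_diff are placeholders for the state printer and
    # the natural-language diff summarizer.
    examples = random.sample(replay_buffer, k=min(n_examples, len(replay_buffer)))
    rendered = []
    for (state, action, next_state) in examples:
        rendered.append(
            f'The action "{action}" transforms the state from\n'
            f"{render_state(state)}\nto\n{render_state(next_state)}\n"
            f'The difference is\n"""\n{describe_diff(state, next_state)}\n"""'
        )
    return "\n\n".join([instructions] + rendered + [template])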

You are a robot exploring in an object-centric environment. Your goal is to model the logic of
the world in python. You will be provided experiences in the format of (state, action,
next_state) tuples. You will also be provided with a short natural language description that
briefly summarizes the difference between the state and the next state for each (state,
next_state,) pair. You need to implement the python code to model the logic of the world, as
seen in the provided experiences. Please follow the template to implement the code. The code

needs to be directly runnable on the state and return the next state in python as provided in
the experiences.

You need to implement python code to model the logic of the world as seen in the following expe
riences:

The action "toggle" transforms the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, -1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, -1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
Nothing happened
"""

The action "toggle" transforms the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, 1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, 1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
Nothing happened
"""

The action "turn right" transforms the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(1, 0), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘

Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;


Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, 1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
The agent (direction=(1, 0)) at pos (1, 2) becomes an agent (direction=(0, 1)).
"""

The action "turn left" transforms the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(-1, 0), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, 1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
The agent (direction=(-1, 0)) at pos (1, 2) becomes an agent (direction=(0, 1)).
"""

The action "turn right" transforms the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, 1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(-1, 0), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
The agent (direction=(0, 1)) at pos (1, 2) becomes an agent (direction=(-1, 0)).
"""

The action "turn right" transforms the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;

Wall(0, 2) ; Agent(1, 2, direction=(-1, 0), carrying=None) ; Door(2, 2, color=yellow, state=


locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, -1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
The agent (direction=(-1, 0)) at pos (1, 2) becomes an agent (direction=(0, -1)).
"""

The action "turn left" transforms the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, 1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(1, 0), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
The agent (direction=(0, 1)) at pos (1, 2) becomes an agent (direction=(1, 0)).
"""

Please implement code to model the logic of the world as demonstrated by the experiences. Here
is the template for the transition function. Please implement the transition function following
the template. The code needs to be directly runnable on the inputs of (state, action) and
return the next state in python as provided in the experiences.

‘‘‘

class Entity:
def __init__(self, x, y, **kwargs):
self.name = self.__class__.__name__
self.x = x
self.y = y
for key, value in kwargs.items():
setattr(self, key, value)
def __repr__(self):
attr = ’, ’.join(f’{key}={value}’ for key, value in self.__dict__.items() if key not in
(’name’, ’x’, ’y’))
if attr: return f"{self.name}({self.x}, {self.y}, {attr})"
else: return f"{self.name}({self.x}, {self.y})"
def __eq__(self, other):
return all(getattr(self, key) == getattr(other, key, None) for key in self.__dict__.

keys())
def __hash__(self):
return hash(tuple(sorted(self.__dict__.items())))
class Agent(Entity): pass
class Key(Entity): pass
class Door(Entity): pass
class Goal(Entity): pass
class Wall(Entity): pass
class Box(Entity): pass
class Ball(Entity): pass
class Lava(Entity): pass
def get_entities_by_name(entities, name):
return [ entity for entity in entities if entity.name == name ]
def get_entities_by_position(entities, x, y):
return [ entity for entity in entities if entity.x == x and entity.y == y ]
def transition(state, action):
"""
Args:
state: the state of the environment
action: the action to be executed
Returns:
next_state: the next state of the environment
"""
raise NotImplementedError

‘‘‘

Please implement code to model the logic of the world as demonstrated by the experiences.
Please implement the code following the template. Feel free to implement the helper functions
you need. You can also implement the logic for difference actions in different helper functions
. However, you must implement the ‘ transition ‘ function as the main function to be called by
the environment. The code needs to be directly runnable on the inputs as (state, action) and
return the next state in python as provided in the experiences. Let’s think step by step.

D.2. Initializing the reward function


It asks LLMs to generate a reward function (s, a, s′) → (r, d) for the mission c, following the code template, that models
seven experience tuples (s, a, s′, r, d) for that mission sampled uniformly at random from the replay buffer.

You are a robot exploring in an object-centric environment. Your goal is to model the logic of
the world in python. You will be provided experiences in the format of (state, action,
next_state, reward, done) tuples. You will also be provided with a short natural language
description that briefly summarizes the difference between the state and the next state for
each (state, next_state) pair. You need to implement the python code to model the logic of the
world, as seen in the provided experiences. Please follow the template to implement the code.
The code needs to be directly runnable on the (state, action, next_state) tuple and return the
(reward, done) tuple in python as provided in the experiences.

You need to implement python code to model the logic of the world as seen in the following expe
riences for mission "use the key to open the door and then get to the goal":

The action "turn left" transforms the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, -1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;

‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(-1, 0), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
The agent (direction=(0, -1)) at pos (1, 2) becomes an agent (direction=(-1, 0)).
"""
, the returned reward is ‘ 0.0 ‘ and the returned done is ‘ False ‘

The action "toggle" transforms the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, -1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, -1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
Nothing happened
"""
, the returned reward is ‘ 0.0 ‘ and the returned done is ‘ False ‘

The action "turn right" transforms the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(1, 0), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, 1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
The agent (direction=(1, 0)) at pos (1, 2) becomes an agent (direction=(0, 1)).
"""
, the returned reward is ‘ 0.0 ‘ and the returned done is ‘ False ‘

The action "nothing" transforms the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, -1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, -1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
Nothing happened
"""
, the returned reward is ‘ 0.0 ‘ and the returned done is ‘ False ‘

The action "drop" transforms the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, -1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, -1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
Nothing happened
"""
, the returned reward is ‘ 0.0 ‘ and the returned done is ‘ False ‘

The action "turn left" transforms the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(-1, 0), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;

Wall(0, 2) ; Agent(1, 2, direction=(0, 1), carrying=None) ; Door(2, 2, color=yellow, state=


locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
The agent (direction=(-1, 0)) at pos (1, 2) becomes an agent (direction=(0, 1)).
"""
, the returned reward is ‘ 0.0 ‘ and the returned done is ‘ False ‘

The action "turn right" transforms the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(-1, 0), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, -1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
The agent (direction=(-1, 0)) at pos (1, 2) becomes an agent (direction=(0, -1)).
"""
, the returned reward is ‘ 0.0 ‘ and the returned done is ‘ False ‘

Please implement code to model the logic of the world as demonstrated by the experiences. Here
is the template for the reward function. Please implement the reward function following the
template. The code needs to be directly runnable on the inputs of (state, action, next_state)
and return (reward, done) in python as provided in the experiences.

‘‘‘

class Entity:
def __init__(self, x, y, **kwargs):
self.name = self.__class__.__name__
self.x = x
self.y = y
for key, value in kwargs.items():
setattr(self, key, value)
def __repr__(self):
attr = ’, ’.join(f’{key}={value}’ for key, value in self.__dict__.items() if key not in
(’name’, ’x’, ’y’))
if attr: return f"{self.name}({self.x}, {self.y}, {attr})"
else: return f"{self.name}({self.x}, {self.y})"
def __eq__(self, other):
return all(getattr(self, key) == getattr(other, key, None) for key in self.__dict__.
keys())
def __hash__(self):
return hash(tuple(sorted(self.__dict__.items())))
class Agent(Entity): pass
class Key(Entity): pass
class Door(Entity): pass
class Goal(Entity): pass

class Wall(Entity): pass


class Box(Entity): pass
class Ball(Entity): pass
class Lava(Entity): pass
def get_entities_by_name(entities, name):
return [ entity for entity in entities if entity.name == name ]
def get_entities_by_position(entities, x, y):
return [ entity for entity in entities if entity.x == x and entity.y == y ]
def reward_func(state, action, next_state):
"""
Args:
state: the state of the environment
action: the action to be executed
next_state: the next state of the environment
Returns:
reward: the reward of the action
done: whether the episode is done
"""
raise NotImplementedError

‘‘‘

Please implement code to model the logic of the world as demonstrated by the
experiences. Please implement the code following the template. You must implement the ‘
reward_func ‘ function as the main function to be called by the environment. The code needs to
be directly runnable on the inputs as (state, action, next_state) and return (reward, done) in
python as provided in the experiences. Let’s think step by step.

D.3. Refining the transition function


It asks LLMs to refine a partially correct transition function by providing it with a data point that the function fails to model,
as well as a few other data points that it models successfully. We also provide the incorrect prediction made by the partially
correct code, or the error message raised during execution.

You are a robot exploring in an object-centric environment. Your goal is to model the logic of
the world in python. You have tried it before and came up with one partially correct solution.
However, it is not perfect. They can model the logic for some experiences but failed for others
. You need to improve your code to model the logic of the world for all the experiences. The
new code needs to be directly runnable on the (state, action) pair and return the next state in
python as provided in the experiences.

Here is the partially correct solution you came up with. It can model the logic for some
experiences but failed for others. You need to improve your code to model the logic of the
world for all the experiences. The new code needs to be directly runnable on the (state, action
) pair and return the next state in python as provided in the experiences.

‘‘‘

class Entity:
def __init__(self, x, y, **kwargs):
self.name = self.__class__.__name__
self.x = x
self.y = y
for key, value in kwargs.items():
setattr(self, key, value)
def __repr__(self):
attr = ’, ’.join(f’{key}={value}’ for key, value in self.__dict__.items() if key not in
(’name’, ’x’, ’y’))
if attr: return f"{self.name}({self.x}, {self.y}, {attr})"

else: return f"{self.name}({self.x}, {self.y})"


def __eq__(self, other):
return all(getattr(self, key) == getattr(other, key, None) for key in self.__dict__.
keys())
def __hash__(self):
return hash(tuple(sorted(self.__dict__.items())))
class Agent(Entity): pass
class Key(Entity): pass
class Door(Entity): pass
class Goal(Entity): pass
class Wall(Entity): pass
class Box(Entity): pass
class Ball(Entity): pass
class Lava(Entity): pass
import copy
def get_entities_by_name(entities, name):
return [ entity for entity in entities if entity.name == name ]
def get_entities_by_position(entities, x, y):
return [ entity for entity in entities if entity.x == x and entity.y == y ]
def transition(state, action):
next_state = copy.deepcopy(state)
agent = get_entities_by_name(next_state, ’Agent’)[0]
# Determine agent’s next position based on action
if action == ’move forward’:
next_pos = (agent.x + agent.direction[0], agent.y + agent.direction[1])
# Check if the next position isn’t a wall
if not any(isinstance(entity, Wall) for entity in get_entities_by_position(state, *
next_pos)):
# If agent is in front of a door and has the right color key, unlock the door
if any(isinstance(entity, Door) and entity.color == agent.carrying.color for entity
in get_entities_by_position(state, *next_pos)):
if action == ’toggle’:
agent.carrying = None # Drop the key
else:
agent.x, agent.y = next_pos # Move forward
elif action == ’pick up’:
# Pick up a key if there is a key at the agent’s position
for entity in get_entities_by_position(next_state, agent.x, agent.y):
if isinstance(entity, Key):
agent.carrying = entity
next_state.remove(entity)
break
elif action == ’drop’:
# Drop the key at the agent’s position if the agent is carrying a key
if agent.carrying is not None:
dropped_key = Key(agent.x, agent.y, color=agent.carrying.color)
next_state.append(dropped_key)
agent.carrying = None
elif action in [’turn left’, ’turn right’]:
# Existing code for turn left/right here
pass
elif action == ’toggle’:
# Existing code for toggle here
pass
return next_state

‘‘‘

The given code cannot model the logic of the world for all the experiences. Here are some
experiences that the code have successfully modeled.

The action "toggle" transforms the state from


‘‘‘

Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;


Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, -1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, -1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
Nothing happened
"""

The action "drop" transforms the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, -1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, -1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
Nothing happened
"""

The action "nothing" transforms the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, -1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, -1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is

"""
Nothing happened
"""

Here is an example of experiences that the code failed to model.

The action "turn left" should transform the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, 1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(1, 0), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
The agent (direction=(0, 1)) at pos (1, 2) becomes an agent (direction=(1, 0)).
"""
However, the implementation is wrong because it returns state as
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, 1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘

For this failed experience, do you know what is different between the true transitions from the
environment and the predictions from the code? Do you know why the environment behaves in this
way? Do you know why the code behaves differently from the environment? Which part of the code
causes the problem? How to fix it? Please improve your code to model the logic of the world
for all the experiences, accordingly. Please implement the code following the template. Feel
free to implement any helper functions you need. You can also implement the logic for
difference actions in different helper functions. However, you must implement the ‘ transition
‘ function as the main function to be called by the environment. The code needs to be directly
runnable on the (state, action) tuple and return the new state in python as provided in the
experiences. If the code is too long, try to refactor it to be shorter.

D.4. Refine the reward function


It asks LLMs to refine a partially correct reward function by providing it with a data point that the function fails to model,
as well as a few other data points that it models successfully. We also provide the incorrect prediction made by the partially
correct code, or the error message raised during execution.

You are a robot exploring in an object-centric environment. Your goal is to model the logic of
the world in python. You have tried it before and came up with one partially correct solution.
However, it is not perfect. They can model the logic for some experiences but failed for others
. You need to improve your code to model the logic of the world for all the experiences. The

new code needs to be directly runnable on the (state, action, next_state) tuple and return the
(reward, done) tuple in python as provided in the experiences.

Here is the partially correct solution you came up with for mission "use the key to open the
door and then get to the goal". It can model the logic for some experiences but failed for
others. You need to improve your code to model the logic of the world for all the experiences.
The new code need to be directly runnable on the (state, action, next_state) tuple and return
the (reward, done) tuple in python as provided in the experiences.

‘‘‘

class Entity:
def __init__(self, x, y, **kwargs):
self.name = self.__class__.__name__
self.x = x
self.y = y
for key, value in kwargs.items():
setattr(self, key, value)
def __repr__(self):
attr = ’, ’.join(f’{key}={value}’ for key, value in self.__dict__.items() if key not in
(’name’, ’x’, ’y’))
if attr: return f"{self.name}({self.x}, {self.y}, {attr})"
else: return f"{self.name}({self.x}, {self.y})"
def __eq__(self, other):
return all(getattr(self, key) == getattr(other, key, None) for key in self.__dict__.
keys())
def __hash__(self):
return hash(tuple(sorted(self.__dict__.items())))
class Agent(Entity): pass
class Key(Entity): pass
class Door(Entity): pass
class Goal(Entity): pass
class Wall(Entity): pass
class Box(Entity): pass
class Ball(Entity): pass
class Lava(Entity): pass
def get_entities_by_position(entities, x, y):
return [ entity for entity in entities if entity.x == x and entity.y == y ]
def reward_func(state, action, next_state):
state_set = set(state)
next_state_set = set(next_state)
agent = [e for e in state_set if isinstance(e, Agent)][0]
agent_next = [e for e in next_state_set if isinstance(e, Agent)][0]
on_goal = any(isinstance(entity, Goal) for entity in get_entities_by_position(next_state,
agent_next.x, agent_next.y))
done = on_goal
if state_set == next_state_set:
reward = -0.1 # Small negative reward for no-op actions to encourage faster solution
elif done:
reward = 1.0 # Reward for reaching the goal
else:
reward = 0.0 # No reward in other cases
return reward, done

‘‘‘

The given code cannot model the logic of the world for all the experiences. Here are some
experiences that the code has successfully modeled.

The action "turn right" transforms the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;

Wall(0, 2) ; Agent(1, 2, direction=(-1, 0), carrying=None) ; Door(2, 2, color=yellow, state=


locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, -1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
The agent (direction=(-1, 0)) at pos (1, 2) becomes an agent (direction=(0, -1)).
"""
, the returned reward is ‘ 0.0 ‘ and the returned done is ‘ False ‘

The action "turn left" transforms the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(-1, 0), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, 1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
The agent (direction=(-1, 0)) at pos (1, 2) becomes an agent (direction=(0, 1)).
"""
, the returned reward is ‘ 0.0 ‘ and the returned done is ‘ False ‘

The action "turn left" transforms the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, -1), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(-1, 0), carrying=None) ; Door(2, 2, color=yellow, state=
locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is

"""
The agent (direction=(0, -1)) at pos (1, 2) becomes an agent (direction=(-1, 0)).
"""
, the returned reward is ‘ 0.0 ‘ and the returned done is ‘ False ‘

Here is an example of experiences that the code failed to model.

The action "toggle" should transform the state from


‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, 1), carrying=None) ; Door(2, 2, color=yellow, state=locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
to
‘‘‘
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, 1), carrying=None) ; Door(2, 2, color=yellow, state=locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
‘‘‘
The difference is
"""
Nothing happened
"""
, the returned reward should be ‘ 0.0 ‘ and the returned done should be ‘ False ‘.
However, the implementation is wrong because it returns the predicted reward as ‘ -0.1 ‘
instead of the correct reward as ‘ 0.0 ‘.

For this failed experience, do you know what is different between the true rewards and dones
from the environment and the predictions from the code? Do you know why the environment behaves
in this way? Do you know why the code behaves differently from the environment? Which part of
the code causes the problem? How to fix it? Please improve your code to model the logic of the
world for all the experiences, accordingly. Please implement the code following the template.
You must implement the ‘ reward_func ‘ function as the main function to be called by the
environment. The code needs to be directly runnable on the (state, action, next_state) tuple
and return (reward, done) in python as provided in the experiences. If the code is too long,
try to refactor it to be shorter.
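
To make the intended repair concrete, the following is a minimal illustrative sketch (not output actually produced by the agent) of a reward function consistent with all the experiences listed above: reaching the goal yields reward 1.0 with done=True, and every other transition, including the no-op "toggle" on a locked door, yields 0.0 rather than a -0.1 step penalty. It assumes the Entity classes and the get_entities_by_name / get_entities_by_position helpers from the code template used throughout these prompts.

‘‘‘
# Sketch only: assumes the Entity classes and get_entities_by_* helpers from the template above.
def reward_func(state, action, next_state):
    """Map (state, action, next_state) to (reward, done)."""
    # Locate the agent in the next state.
    next_agent = get_entities_by_name(next_state, 'Agent')[0]
    # The episode ends exactly when the agent stands on the Goal cell.
    on_goal = any(isinstance(e, Goal)
                  for e in get_entities_by_position(next_state, next_agent.x, next_agent.y))
    done = on_goal
    # Reaching the goal gives reward 1.0; all other transitions give 0.0,
    # rather than a -0.1 step penalty, which contradicts the experiences.
    reward = 1.0 if on_goal else 0.0
    return reward, done
‘‘‘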

D.5. Generating reward functions for new goals


It asks LLMs to generate new reward functions for new goals, given sample code synthesized for previous goals.

You are a robot exploring in an object-centric environment. Your goal is to model the logic of
the world in python, specifically the reward function that maps (state, action, next_state) to
(reward, done). You will be given the mission for the environment you are going to act in,
as well as a few code samples from other environments. You need to implement the new reward
function for the new environment you are going to act in. The new code needs to be directly
runnable on (state, action, next_state) and return (reward, done) in python.

Here are a few sample reward function implementations from other environments. Please check them in
detail and think about how to implement the reward function for mission "pick up the yellow box
" in the new environment. The code needs to be directly runnable on (state, action, next_state)
and return (reward, done) in python.

The reward function code for mission "pick up the grey box" is:


‘‘‘

state = [{"name":"Agent", "x":1, "y":1, "direction":(1,0), "carrying": {"name":"Key", "x":None, "y":None, "color":"red"}},
         {"name":"Door", "x":2, "y":2, "color":"red", "state":"locked"},
         {"name":"Wall", "x":0, "y":0}]
action = "toggle"
next_state = [{"name":"Agent", "x":1, "y":1, "direction":(1,0), "carrying": {"name":"Key", "x":None, "y":None, "color":"red"}},
              {"name":"Door", "x":2, "y":2, "color":"red", "state":"open"},
              {"name":"Wall", "x":0, "y":0}]

class Entity:
    def __init__(self, x, y, **kwargs):
        self.name = self.__class__.__name__
        self.x = x
        self.y = y
        for key, value in kwargs.items():
            setattr(self, key, value)
    def __repr__(self):
        attr = ', '.join(f'{key}={value}' for key, value in self.__dict__.items() if key not in ('name', 'x', 'y'))
        if attr: return f"{self.name}({self.x}, {self.y}, {attr})"
        else: return f"{self.name}({self.x}, {self.y})"
    def __eq__(self, other):
        return all(getattr(self, key) == getattr(other, key, None) for key in self.__dict__.keys())
    def __hash__(self):
        return hash(tuple(sorted(self.__dict__.items())))

class Agent(Entity): pass
class Key(Entity): pass
class Door(Entity): pass
class Goal(Entity): pass
class Wall(Entity): pass
class Box(Entity): pass
class Ball(Entity): pass
class Lava(Entity): pass

def get_entities_by_name(entities, name):
    return [ entity for entity in entities if entity.name == name ]

def get_entities_by_position(entities, x, y):
    return [ entity for entity in entities if entity.x == x and entity.y == y ]

def reward_func(state, action, next_state):
    """
    Args:
        state: the state of the environment
        action: the action to be executed
        next_state: the next state of the environment
    Returns:
        reward: the reward of the action
        done: whether the episode is done
    """
    reward = 0.0  # initialise reward as 0.0 for all actions
    done = False  # initialise done as False for all actions
    # extract the agent from the current and next state
    agent = get_entities_by_name(state, 'Agent')[0]
    next_agent = get_entities_by_name(next_state, 'Agent')[0]
    # If the agent picks up the grey box in the next state, the reward is 1.0 and the episode is done
    if next_agent.carrying and isinstance(next_agent.carrying, Box) and next_agent.carrying.color == 'grey':
        reward = 1.0
        done = True
    return reward, done

‘‘‘

The reward function code for mission "pick up the purple box" is:
‘‘‘

state = [{"name":"Agent", "x":1, "y":1, "direction":(1,0), "carrying": {"name":"Key", "x":None, "y":None, "color":"red"}},
         {"name":"Door", "x":2, "y":2, "color":"red", "state":"locked"},
         {"name":"Wall", "x":0, "y":0}]
action = "toggle"
next_state = [{"name":"Agent", "x":1, "y":1, "direction":(1,0), "carrying": {"name":"Key", "x":None, "y":None, "color":"red"}},
              {"name":"Door", "x":2, "y":2, "color":"red", "state":"open"},
              {"name":"Wall", "x":0, "y":0}]

class Entity:
    def __init__(self, x, y, **kwargs):
        self.name = self.__class__.__name__
        self.x = x
        self.y = y
        for key, value in kwargs.items():
            setattr(self, key, value)
    def __repr__(self):
        attr = ', '.join(f'{key}={value}' for key, value in self.__dict__.items() if key not in ('name', 'x', 'y'))
        if attr: return f"{self.name}({self.x}, {self.y}, {attr})"
        else: return f"{self.name}({self.x}, {self.y})"
    def __eq__(self, other):
        return all(getattr(self, key) == getattr(other, key, None) for key in self.__dict__.keys())
    def __hash__(self):
        return hash(tuple(sorted(self.__dict__.items())))

class Agent(Entity): pass
class Key(Entity): pass
class Door(Entity): pass
class Goal(Entity): pass
class Wall(Entity): pass
class Box(Entity): pass
class Ball(Entity): pass
class Lava(Entity): pass

def get_entities_by_name(entities, name):
    return [ entity for entity in entities if entity.name == name ]

def get_entities_by_position(entities, x, y):
    return [ entity for entity in entities if entity.x == x and entity.y == y ]

def reward_func(state, action, next_state):
    """
    Args:
        state: the state of the environment
        action: the action to be executed
        next_state: the next state of the environment
    Returns:
        reward: the reward of the action
        done: whether the episode is done
    """
    reward = 0.0  # initialise reward as 0.0 for all actions
    done = False  # initialise done as False for all actions
    # extract the agent from the current and next state
    agent = get_entities_by_name(state, 'Agent')[0]
    next_agent = get_entities_by_name(next_state, 'Agent')[0]
    # If the agent picks up the purple box in the next state, the reward is 1.0 and the episode is done
    if next_agent.carrying and isinstance(next_agent.carrying, Box) and next_agent.carrying.color == 'purple':
        reward = 1.0
        done = True
    return reward, done

‘‘‘

The reward function code for mission "pick up the green box" is:
‘‘‘

state = [{"name":"Agent", "x":1, "y":1, "direction":(1,0), "carrying": {"name":"Key", "x":None, "y":None, "color":"red"}},
         {"name":"Door", "x":2, "y":2, "color":"red", "state":"locked"},
         {"name":"Wall", "x":0, "y":0}]
action = "toggle"
next_state = [{"name":"Agent", "x":1, "y":1, "direction":(1,0), "carrying": {"name":"Key", "x":None, "y":None, "color":"red"}},
              {"name":"Door", "x":2, "y":2, "color":"red", "state":"open"},
              {"name":"Wall", "x":0, "y":0}]

class Entity:
    def __init__(self, x, y, **kwargs):
        self.name = self.__class__.__name__
        self.x = x
        self.y = y
        for key, value in kwargs.items():
            setattr(self, key, value)
    def __repr__(self):
        attr = ', '.join(f'{key}={value}' for key, value in self.__dict__.items() if key not in ('name', 'x', 'y'))
        if attr: return f"{self.name}({self.x}, {self.y}, {attr})"
        else: return f"{self.name}({self.x}, {self.y})"
    def __eq__(self, other):
        return all(getattr(self, key) == getattr(other, key, None) for key in self.__dict__.keys())
    def __hash__(self):
        return hash(tuple(sorted(self.__dict__.items())))

class Agent(Entity): pass
class Key(Entity): pass
class Door(Entity): pass
class Goal(Entity): pass
class Wall(Entity): pass
class Box(Entity): pass
class Ball(Entity): pass
class Lava(Entity): pass

def get_entities_by_name(entities, name):
    return [ entity for entity in entities if entity.name == name ]

def get_entities_by_position(entities, x, y):
    return [ entity for entity in entities if entity.x == x and entity.y == y ]

def reward_func(state, action, next_state):
    """
    Args:
        state: the state of the environment
        action: the action to be executed
        next_state: the next state of the environment
    Returns:
        reward: the reward of the action
        done: whether the episode is done
    """
    reward = 0.0  # initialise reward as 0.0 for all actions
    done = False  # initialise done as False for all actions
    # extract the agent from the current and next state
    agent = get_entities_by_name(state, 'Agent')[0]
    next_agent = get_entities_by_name(next_state, 'Agent')[0]
    # If the agent picks up the green box in the next state, the reward is 1.0 and the episode is done
    if next_agent.carrying and isinstance(next_agent.carrying, Box) and next_agent.carrying.color == 'green':
        reward = 1.0
        done = True
    return reward, done

‘‘‘

Now, you have entered a new environment. It shows a mission "pick up the yellow box". Do you
know what this mission means and how to implement it in a reward function? Analyze the
behaviors of the reward function case by case. In what situations will it return a positive
reward or not? In what situations will it return done=True or not? Why? Please implement the
code following the template in the sample code. You must implement the ‘ reward_func‘ function
as the main function to be called by the environment. The code needs to be directly runnable on
(state, action, next_state) and return (reward, done) in python.
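
For reference, a minimal sketch of the kind of answer this prompt is designed to elicit is shown below. It is obtained by analogy with the grey, purple, and green samples above, is not output actually produced by the agent, and reuses the Entity classes and helper functions from those samples.

‘‘‘
# Sketch only: assumes the Entity classes and get_entities_by_* helpers from the samples above.
def reward_func(state, action, next_state):
    """
    Args:
        state: the state of the environment
        action: the action to be executed
        next_state: the next state of the environment
    Returns:
        reward: the reward of the action
        done: whether the episode is done
    """
    reward = 0.0  # initialise reward as 0.0 for all actions
    done = False  # initialise done as False for all actions
    next_agent = get_entities_by_name(next_state, 'Agent')[0]
    # The mission is satisfied once the agent is carrying a yellow box.
    if next_agent.carrying and isinstance(next_agent.carrying, Box) and next_agent.carrying.color == 'yellow':
        reward = 1.0
        done = True
    return reward, done
‘‘‘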

D.6. Refine to satisfy optimism under uncertainty


It asks LLMs to reason about why the goal cannot be achieved for a mission from an initial state using the provided transition and
reward functions. We also tell LLMs the valid action space and the measurement for achieving the goal (r > 0 ∧ d = 1).

You are a robot exploring in an object-centric environment. Your goal is to model the logic of
the world in python. You have tried it before and came up with one partially correct solution.
However, it is not perfect. The code can model the logic for some experiences but failed to
model the logic to achieve the goal in another environment. You need to improve your code so
that the agent can achieve the objective as specified by the mission from the given initial
state as well as still modelling the original logic. The new code should still follow the same
template. The ‘ transition ‘ function needs to be directly runnable on (state, action) and
return the next state in python. The ‘ reward_func ‘ function needs to be directly runnable on
(state, action, next_state) and return (reward, done) in python.

Here is the partially correct solution you came up with:

‘‘‘

class Entity:
    def __init__(self, x, y, **kwargs):
        self.name = self.__class__.__name__
        self.x = x
        self.y = y
        for key, value in kwargs.items():
            setattr(self, key, value)
    def __repr__(self):
        attr = ', '.join(f'{key}={value}' for key, value in self.__dict__.items() if key not in ('name', 'x', 'y'))
        if attr: return f"{self.name}({self.x}, {self.y}, {attr})"
        else: return f"{self.name}({self.x}, {self.y})"
    def __eq__(self, other):
        return all(getattr(self, key) == getattr(other, key, None) for key in self.__dict__.keys())
    def __hash__(self):
        return hash(tuple(sorted(self.__dict__.items())))

class Agent(Entity): pass
class Key(Entity): pass
class Door(Entity): pass
class Goal(Entity): pass
class Wall(Entity): pass
class Box(Entity): pass
class Ball(Entity): pass
class Lava(Entity): pass

import copy

def transition(state, action):
    """
    Args:
        state: the state of the environment
        action: the action to be executed
    Returns:
        next_state: the next state of the environment
    """
    # We'll make a deep copy of the state.
    # This is because we don't want to change the original state.
    next_state = copy.deepcopy(state)
    agent = get_entities_by_name(next_state, 'Agent')[0]
    if action == 'turn left':
        if agent.direction == (-1, 0):
            agent.direction = (0, 1)
        elif agent.direction == (0, 1):
            agent.direction = (1, 0)
        elif agent.direction == (1, 0):
            agent.direction = (0, -1)
        else:  # if agent.direction == (0, -1)
            agent.direction = (-1, 0)
    elif action == 'turn right':
        if agent.direction == (-1, 0):
            agent.direction = (0, -1)
        elif agent.direction == (0, -1):
            agent.direction = (1, 0)
        elif agent.direction == (1, 0):
            agent.direction = (0, 1)
        else:  # if agent.direction == (0, 1)
            agent.direction = (-1, 0)
    elif action == 'toggle':
        # We assume that the agent has the ability to toggle, regardless of what is in front
        # of him, because the experiences provided do not dictate otherwise.
        pass
    elif action == 'nothing':
        pass
    return next_state

def get_entities_by_name(entities, name):
    return [ entity for entity in entities if entity.name == name ]

def get_entities_by_position(entities, x, y):
    return [ entity for entity in entities if entity.x == x and entity.y == y ]

def reward_func(state, action, next_state):
    """
    Args:
        state: the state of the environment
        action: the action to be executed
        next_state: the next state of the environment
    Returns:
        reward: the reward of the action
        done: whether the episode is done
    """
    # Create sets of entities for easier comparison and access
    state_set = set(state)
    next_state_set = set(next_state)
    # Get agent's position in both states
    agent = [e for e in state_set if isinstance(e, Agent)][0]
    agent_next = [e for e in next_state_set if isinstance(e, Agent)][0]
    # Done condition
    on_goal = any(isinstance(entity, Goal) for entity in get_entities_by_position(next_state, agent_next.x, agent_next.y))
    done = on_goal
    # Reward calculation
    if state_set == next_state_set:
        # If state didn't change -> `nothing`, `toggle` when not carrying a key, or `drop` when carrying nothing happened
        reward = 0.0
    elif action == 'turn left' or action == 'turn right':
        # If direction of agent changes -> `turn left`, `turn right` happened
        reward = 0.0
    else:
        # In other cases, no reward. Can be modified when other scenarios are applied.
        reward = 0.0
    return reward, done

‘‘‘

However, the code failed to achieve the goal/objective as specified by the mission "use the key
to open the door and then get to the goal" from the following initial
state:

‘‘‘

Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(0, 1), carrying=None) ; Door(2, 2, color=yellow, state=locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;

‘‘‘

The measurement for achieving the goal/objective is as follows:

‘‘‘

def criterion(state, mission, action, next_state, reward, done):
    return reward > 0 and done

‘‘‘

The valid actions are {'turn right', 'nothing', 'move forward', 'turn left', 'toggle', 'drop', 'pick up'}.

Do you know why the mission cannot be achieved from the given initial state with the world
model as implemented in the code? What subgoals does the agent need to achieve in order to
achieve the final goal as specified by the mission? Can the agent achieve those subgoals using
the world model as implemented in the code? If not, what is missing or wrong? How can you
improve the code to achieve the goal/objective as specified by the mission from the given
initial state? Please improve the code as analyzed before so that the mission can be achieved
from the given initial state. Please implement the code following the template. Feel free to
implement any helper functions you need. You can also implement the logic for different
actions in different helper functions. However, you must implement the ‘ transition ‘ function
and the ‘ reward_func ‘ function as the main functions to be called by the environment. The ‘
transition ‘ function needs to be directly runnable on (state, action) and return the next
state in python. The ‘ reward_func ‘ function needs to be directly runnable on (state, action,
next_state) and return (reward, done) in python. The new code, by itself, should be
complete, compilable, and runnable.
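
For reference, a minimal sketch of the kind of revision this prompt is meant to elicit is shown below. It is an illustrative guess rather than output actually produced by the agent, and it reuses the Entity classes and the get_entities_by_name / get_entities_by_position helpers from the template above. The key additions are a 'move forward' action, a 'pick up' action for the key, and a 'toggle' action that unlocks a door when the agent carries a key of the matching color, which together make the goal reachable from the given initial state.

‘‘‘
# Sketch only: assumes the Entity classes and get_entities_by_* helpers defined in the template above.
import copy

def front_position(agent):
    # Cell directly in front of the agent, given its (dx, dy) direction.
    return agent.x + agent.direction[0], agent.y + agent.direction[1]

def transition(state, action):
    next_state = copy.deepcopy(state)
    agent = get_entities_by_name(next_state, 'Agent')[0]
    fx, fy = front_position(agent)
    front = get_entities_by_position(next_state, fx, fy)
    if action == 'turn left':
        agent.direction = {(-1, 0): (0, 1), (0, 1): (1, 0),
                           (1, 0): (0, -1), (0, -1): (-1, 0)}[agent.direction]
    elif action == 'turn right':
        agent.direction = {(-1, 0): (0, -1), (0, -1): (1, 0),
                           (1, 0): (0, 1), (0, 1): (-1, 0)}[agent.direction]
    elif action == 'move forward':
        # The agent can walk onto empty cells, the goal, and open doors.
        blocked = any(isinstance(e, (Wall, Key, Box, Ball)) or
                      (isinstance(e, Door) and e.state != 'open') for e in front)
        if not blocked:
            agent.x, agent.y = fx, fy
    elif action == 'pick up':
        # Pick up a carryable object in front if the agent's hands are free.
        pickable = [e for e in front if isinstance(e, (Key, Box, Ball))]
        if pickable and agent.carrying is None:
            obj = pickable[0]
            next_state.remove(obj)
            obj.x, obj.y = None, None
            agent.carrying = obj
    elif action == 'toggle':
        # Unlock a locked door in front when carrying a key of the same color.
        doors = [e for e in front if isinstance(e, Door)]
        if doors:
            door = doors[0]
            if door.state == 'locked' and isinstance(agent.carrying, Key) \
                    and agent.carrying.color == door.color:
                door.state = 'open'
            elif door.state == 'closed':
                door.state = 'open'
    return next_state

def reward_func(state, action, next_state):
    next_agent = get_entities_by_name(next_state, 'Agent')[0]
    on_goal = any(isinstance(e, Goal) for e in
                  get_entities_by_position(next_state, next_agent.x, next_agent.y))
    # Reaching the goal satisfies the criterion: reward > 0 and done.
    return (1.0 if on_goal else 0.0), on_goal
‘‘‘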
