WorldCoder, a Model-Based LLM Agent: Building World Models by Writing Code and Interacting with the Environment
Cornell University, Department of Computer Science. Correspondence to: Hao Tang <[email protected]>, Kevin Ellis <[email protected]>.

Abstract

We give a model-based LLM agent that builds a Python program representing its knowledge of the world based on its interactions with the environment. The world model tries to explain its interactions, while also being optimistic about what reward it can achieve. We do this by extending work on program synthesis via LLMs. We study our agent on gridworlds, finding our approach is more sample-efficient compared to deep RL, and more compute-efficient compared to ReAct-style agents.

1. Introduction

Consider yourself learning to use a new device or play a new game. Given the right prior knowledge, together with relatively few interactions, most people can acquire basic knowledge of how many devices or games work. This knowledge could then be used to achieve novel goals, can be transferred to similar devices or games in the future, and can be communicated symbolically to other humans. How could an AI system similarly acquire, transfer, and communicate its knowledge of how things work?

We cast this problem as one of learning a world model: a mapping, called the transition function, that predicts the next state of affairs, given a current state and action (Sutton & Barto, 2018; Schrittwieser et al., 2020; Hafner et al., 2021). Our proposed solution is an architecture that synthesizes a Python program to model its past experiences with the world, effectively learning the transition function, and which takes actions by planning using that world model.

Our method extends and advances recent program synthesis methods based on Large Language Models (LLMs), particularly (Qiu et al., 2023; Wang et al., 2023c; Chen et al., 2023; Ellis, 2023). Representing world knowledge as code, and generating it from LLMs, addresses several important problems. It allows prior world knowledge embedded in the LLM to transfer into the world model, supports updating that knowledge by the editing of old programs, and allows inspecting and understanding the system's knowledge, because high-level programming languages are designed to be interpretable by humans. Fig. 1 diagrams our architecture for building and using Python world models, which we cast as model-based reinforcement learning (RL). In Fig. 2 we position this work relative to deep RL as well as LLM agents. In contrast to deep RL (Schrittwieser et al., 2020, i.a.), we view the world model as something that should be rapidly learnable and transferable across environments. However, our work has different limitations from deep RL: we consider fully observed, deterministic, low-dimensional (symbolic) environments. This simplifies the problem, but not to the point of triviality: for instance, high-level robot task planning typically makes similar assumptions (Liang et al., 2022).

Central to our work is a particular claim about how an LLM should relate to world models. In our setup, the LLM does not simulate the world, but instead builds a simulation of the world. Therefore the LLM does not need to act as a world model, and different world models can be learned in different environments. This should be contrasted with LLM agents such as ReAct (Yao et al., 2022), where the LLM directly reasons about different actions and their consequences. We also do not expect the LLM to perform planning, which LLMs are known to struggle with (Valmeekam et al., 2023). Instead, we require the LLM to possess fuzzy prior knowledge of how the world might be, together with the programming skill to code and debug a transition function. Our use of LLMs is closest to (Guan et al., 2023; Wong et al., 2023; Zhu & Simmons, 2024), which generate planning productions, which can be seen as a particular kind of world model. Overall though, our problem statement is closer to (Das et al., 2023; Evans et al., 2021; Tsividis et al., 2021), which learn world models in domain-specific programming languages. We further show how to use Turing-complete languages like Python—which we believe important for general-purpose learners—and we also study efficient transfer learning and exploration strategies for such agents.
Figure 1. Overall agent architecture: a planner selects actions using a synthesized Python world model (world_model.py, with an update(state, action, goal) function returning the next state and reward), which an LLM writes and debugs against a replay buffer of environment interactions. The agent also inputs a goal in natural language.

Figure 2. Qualitative comparison of our method against deep model-based RL and LLM agents (ReAct, RAP, etc.: Yao et al. (2022); Hao et al. (2023); Zhao et al. (2023); Liu et al. (2023b)). Sample complexity refers to the number of environment interactions needed to learn a world model (*LLM agents do not update their world model). LLM calls/task is the number of LLM calls needed to solve a new task in a fixed environment, amortized over many such tasks, as a function of the maximum episode length T. Asymptotically, after learning a world model, our method can accomplish new tasks with at most one LLM query to update the reward function.
We make the following three contributions:

1. An architecture, which we call WorldCoder, for learning world models as code. The architecture supports learning that is more transferable, interpretable, and dramatically more sample-efficient compared to deep RL, and also more compute-efficient than prior LLM agents.

2. A new learning objective for program-structured world models that favors optimism in the face of uncertainty (Section 2.2). We show that this learning objective generates goal-driven exploratory behavior, which can reduce the number of environment actions needed to obtain reward by orders of magnitude.

3. As an auxiliary contribution, we also give an improved algorithm for using an LLM to debug and improve its code, based on a bandit formulation that balances exploiting promising programs against exploring new ones. We describe this new bandit algorithm in Section 2.5.

2. Methods

2.1. Problem statement and core representations

We start with the standard MDP formalism but modify it in three important ways. First, we assume a goal is given in natural language. The goal could be something specific, such as "pickup the ball", or underspecified, such as "maximize reward." Second, we restrict ourselves to deterministic environments, and assume that environment dynamics are fixed across goals. Third, we assume an episodic MDP with the current episode terminating upon reaching the goal.

We formalize this as a Contextual Markov Decision Process (CMDP: Hallak et al. (2015)), which is a tuple (C, S, A, M) where C is a set of contexts (i.e. goals), S is a set of states, A is a set of actions, and M is a function mapping a context c ∈ C to a Markov Decision Process (MDP). The context-conditioned MDP, M(c), is a tuple (S, A, T, R_c, γ) with transition function T : S × A → S, discount factor γ, and reward function R_c : S × A × S → R × {0, 1}. Note the reward is nonstandard: it depends on the context, and returns an extra Boolean flag indicating whether the goal has been achieved (whether the episode is over). The transition function does not depend on the context. The objective is to select actions to maximize cumulative discounted future reward, ∑_{t=0}^{∞} γ^t r_t, where r_t is the reward received t timesteps into the future.

State representation. Motivated by the maturity of object detection and the formalisms used in robotic task planning, we represent each state as a set of objects. Each object has a string-valued field name, integer-valued fields x and y indicating its position, and optionally additional fields depending on the type of the object. For example, if name="door", then the object has two additional Boolean fields indicating if the door is open/closed and locked/unlocked. This can be seen as a variation of Object-Oriented MDPs (Diuk et al., 2008).

Representing world models as code. The agent uses Python code to model the transition and reward functions. Mathematically we think of this Python code as a tuple (T̂, R̂) of a transition function T̂ : S × A → S and a reward model R̂ : C → (S × A × S → R × {0, 1}). Note again that the reward depends on the context, and returns an extra Boolean indicating whether the goal has been reached, in which case the current episode terminates. Both functions are implemented as separate Python subroutines, which encourages disentangling the dynamics from the reward.
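To make these representations concrete, here is a minimal sketch of the object-based state and the (T̂, R̂) interface; the class and function names are illustrative assumptions, not the paper's exact code.

    # Minimal sketch (illustrative names) of the object-based state and the
    # synthesized (T-hat, R-hat) interface.
    from dataclasses import dataclass, field

    @dataclass
    class Obj:
        name: str                                  # e.g. "agent", "door", "key"
        x: int
        y: int
        extra: dict = field(default_factory=dict)  # e.g. {"open": False, "locked": True}

    State = list[Obj]                              # a state is a set of objects

    def transition_hat(state: State, action: str) -> State:
        """T-hat : S x A -> S, a deterministic Python model of the dynamics."""
        ...

    def reward_hat(goal: str):
        """R-hat : C -> (S x A x S -> R x {0,1}); returns a per-goal reward function."""
        def reward(state: State, action: str, next_state: State) -> tuple[float, bool]:
            ...
        return reward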
2.2. The world model learning problem

What objectives and constraints should the world model satisfy? Clearly, the learned world model should explain the observed data by correctly predicting the observed state transitions and rewards. This is a standard training objective within model-based RL (Sutton, 1990).

One less obvious learning objective is that the world model should suffice to plan to the goal. Given two world models, both consistent with the observed data, the agent should prefer a model which implies it is possible to get positive reward, effectively being optimistic in the face of uncertainty about world dynamics. In the low-data regime, there will be multiple world models consistent with the data: preferring an optimistic one guarantees that the agent can at least start making progress toward the goal, even if it later has to update its beliefs because they turned out to be too optimistic.

Concretely, as standard, the agent collects a dataset D of past environment interactions, each formatted as a tuple (s, a, r, s′, c, d) of current state s, next state s′, action a, and reward r in context c, with d indicating if the episode ended upon reaching s′. (The variable d should be read as "done.") The agent also stores the initial state s0 and context c of each episode, so that it can prefer world models that can reach the goal from an initial state. The learning problem is to construct Python programs implementing a transition function T̂ and reward model R̂ satisfying constraints ϕ1 and ϕ2, defined below:

ϕ1(D, T̂, R̂) = fit the data:    (1)
    ∀(s, a, r, s′, c, d) ∈ D : (T̂, R̂) ⊢ (s, a, r, s′, c, d),
    where (T̂, R̂) ⊢ (s, a, r, s′, c, d) if and only if T̂(s, a) = s′ ∧ R̂(c)(s, a, s′) = (r, d)

ϕ2(s0, c, T̂, R̂) = optimism under uncertainty:    (2)
    ∃ a1, s1, a2, s2, ..., aℓ, sℓ :
        (∀i ∈ [ℓ] : T̂(s_{i−1}, a_i) = s_i) ∧ (∃r > 0 : R̂(c)(s_{ℓ−1}, a_ℓ, s_ℓ) = (r, 1))

where s0, c is the initial state/context of the episode, and the turnstile (⊢) should be read as meaning that a given program entails a given logical constraint or predicts a given replayed experience.

Constructing a world model satisfying ϕ1 ∧ ϕ2 is a program synthesis problem. We solve it by prompting an LLM with a random subset of D and asking it to propose a candidate program. If the resulting program does not fit all the data (satisfy ϕ1), then the LLM is backprompted with members of D inconsistent with its latest program, and asked to revise/debug its code, inspired by Chen et al. (2023). (If it does not satisfy ϕ2, it is instead backprompted with s0, c: see Appendix D for prompts.) Iterating this self-debugging strategy introduces an interesting explore/exploit tradeoff between trying to debug the best programs (those consistent with most of D), and trying to debug new programs. In Sec. 2.5 we introduce a bandit algorithm for balancing this tradeoff.
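Both constraints can be checked mechanically against a candidate (T̂, R̂). Below is a sketch of such checks (our own illustration, assuming a small action set and states with a canonical string representation; not the paper's implementation): ϕ1 replays the buffer, and ϕ2 searches T̂ for a reachable state with positive terminal reward.

    from collections import deque

    def satisfies_phi1(transition_hat, reward_hat, replay_buffer):
        # phi_1: the model must reproduce every observed (s, a, r, s', c, d) tuple.
        return all(transition_hat(s, a) == s_next and reward_hat(c)(s, a, s_next) == (r, d)
                   for (s, a, r, s_next, c, d) in replay_buffer)

    def satisfies_phi2(transition_hat, reward_hat, s0, c, actions, max_depth=50):
        # phi_2 (optimism): some action sequence from s0 reaches (r > 0, done = True).
        frontier, visited = deque([(s0, 0)]), {repr(s0)}
        while frontier:
            s, depth = frontier.popleft()
            if depth >= max_depth:
                continue
            for a in actions:
                s_next = transition_hat(s, a)
                r, done = reward_hat(c)(s, a, s_next)
                if r > 0 and done:
                    return True
                if repr(s_next) not in visited:
                    visited.add(repr(s_next))
                    frontier.append((s_next, depth + 1))
        return False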
2.3. Properties of optimism under uncertainty

Having the world model satisfy ϕ2 has two non-obvious consequences, described below.

Exploration guided by goal-driven behavior. Before ever receiving reward, the constraint of satisfying ϕ2 forces the agent to invent a reward function. The reward function, according to ϕ2, must be something that the agent believes it can actually achieve. This mechanism causes the agent to explore the environment by generating new goals, and then update its knowledge based on the experience of actually trying to achieve that goal. This emergent exploration continues until the agent learns how to finish an episode with positive reward. Appendix Sec. A proves this theoretically.

For example, suppose the context (goal) is "open the door with the key", and the agent has not yet achieved any reward, nor has it ever even found the key. Then ϕ2 would (for example) prefer R̂'s which reward touching the key. Even if touching the key does not actually give reward, the agent can make good progress by pretending it does.

Following natural-language instructions. After learning how the world works, our agent should be able to receive natural language instructions and begin following them immediately without further learning from environmental interactions. Mathematically, given a learned program R̂ that implements reward functions for previously seen goals, together with a new goal c where c ∉ domain(R̂), the agent should update its reward model in order to cover c.

This zero-shot generalization to new goals is forced to occur as an implicit consequence of enforcing optimism under uncertainty (ϕ2 in Eq. 2). Upon observing a new context c, the constraint ϕ2 is violated. This triggers program synthesis to debug R̂ so that it covers c, subject to the constraint that R̂(c) allows reaching a goal state.

Given the importance of instruction-following abilities, we engineer a special prompt for trying to enforce ϕ2 upon encountering a new goal (Appendix Sec. D.5). This prompt is based on retrieving previously learned reward functions.
This retrieval strategy assumes similar goals have similar reward structures, so that the LLM can generalize from old reward functions in R̂ when generating the new R̂(c). If the LLM makes a mistake in predicting the reward function, then the agent can recover by subsequent rounds of program synthesis, which update R̂(c) based on interactions with the environment.

Algorithm 1: WorldCoder Agent Architecture
    Hyperparam: ϵ, random exploration probability (default 5%)
    Hyperparam: MINDATASIZE, min # actions before learning begins (default 10)
    D ← ∅                                   ▷ replay buffer
    T̂, R̂ ← null, null                       ▷ init empty world model
    loop forever through episodes:
        c ← EPISODEGOAL()                   ▷ get context (goal)
        s0 ← CURRENTSTATE()                 ▷ store init state for ϕ2
        loop until episode ends:
            s ← CURRENTSTATE()
            if not (ϕ1 ∧ ϕ2) and |D| ≥ MINDATASIZE then
                T̂, R̂ ← SYNTHESIZE(T̂, R̂, D, c, s0)    ▷ Alg. 2
            with probability ϵ do
                a ← RANDOMACTION()          ▷ ϵ-greedy explore
            else
                a ← PLAN(s, T̂, R̂(c))        ▷ Value Iteration
            s′, r, d ← ENV.STEP(a)          ▷ take action in state s
            D ← D ∪ {(s, a, r, s′, c, d)}   ▷ record experience

2.4. Overall architecture

Ultimately our world models exist to serve downstream decision-making, and in turn, taking actions serves to provide data for learning better world models. There are many architectures for combining acting with world-model learning, such as via planning (Schrittwieser et al., 2020), training policies in simulation (Hafner et al., 2021), or hybrid approaches (Sutton, 1990). We use a very basic agent architecture, shown in Algorithm 1. At a high level, it initially performs random exploration to initialize a dataset D of environmental interactions; it updates its world model using the program synthesis algorithm of Sec. 2.5; and, past its initial exploration phase, it performs planning via depth-limited value iteration.
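The PLAN step can be realized as a finite-horizon lookahead in the spirit of depth-limited value iteration over the learned deterministic model. A minimal sketch under those assumptions (illustrative only, not the paper's implementation; a real planner would memoize repeated states):

    def plan(state, transition_hat, reward_c, actions, depth=10, gamma=0.95):
        # Returns the first action of the best bounded-depth rollout under (T-hat, R-hat(c)).
        def value(s, d):
            if d == 0:
                return 0.0
            best = float('-inf')
            for a in actions:
                s_next = transition_hat(s, a)
                r, done = reward_c(s, a, s_next)
                best = max(best, r if done else r + gamma * value(s_next, d - 1))
            return best

        scores = {}
        for a in actions:
            s_next = transition_hat(state, a)
            r, done = reward_c(state, a, s_next)
            scores[a] = r if done else r + gamma * value(s_next, depth - 1)
        return max(scores, key=scores.get)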
2.5. Program Synthesis via Refinement as Arm-acquiring Bandits

A recent paradigm for problem-solving with LLMs is to have the language model refine, self-correct, or self-debug its initial outputs (Chen et al., 2023); we use the term refinement for this whole body of work. When synthesizing programs from a dataset of examples, this means prompting the LLM with an erroneous program it previously generated, together with failed example testcase(s), and finally prompting it to fix its code.

Complex programs typically require several rounds of refinement, however, and each round requires calling an LLM, which can return several samples. Therefore, this process naturally generates a tree of possible programs (Fig. 3). Recent work has cast doubt on the ability of LLMs to refine their own outputs (Olausson et al., 2023), concluding that, for a fixed LLM token budget, it is better to repeatedly sample from the root of the tree: essentially, repeatedly guessing from scratch.

Our investigations, however, suggest that refinement can work, and even generate 150+ line programs, if and only if the system can intelligently decide which program it should refine next. Should we maximally exploit by refining the program which passes the most test cases? Or should we explore by trying to improve the program which has been refined the least so far? This exploration-exploitation tradeoff is made more interesting by the fact that every time we refine a program, we create another brand-new program.

In what follows, we will first describe our method for general program synthesis problems, and then describe its specific instantiation for world models.

Figure 3. Refinement tree. Exploit the best program passing 60% of the test cases, or explore refining a program passing just 50%? (Diagram: an initial program (0% correct) is refined into candidates passing 60%, 40%, and 50% of the test cases.)

The general setting. We frame refinement as an arm-acquiring bandit problem: a bandit problem where new arms arrive over time (Whittle, 1981). Here each arm is a program, and "pulling" an arm corresponds to refining a program. New programs arrive over time because, with each refinement, we generate a new program. We receive a reward of 1 if pulling an arm yields a program that fits the data perfectly, and zero otherwise.

To formalize this, we write ρ to mean a program, and Φ to mean the logical constraint that the program should satisfy, such as being consistent with a dataset of input-output examples. The bandit problem assumes a binary reward: we receive reward 1 if the refined program satisfies Φ and reward 0 otherwise. Although we want to perfectly satisfy Φ, we assume access to a heuristic estimator of program quality, h(ρ), which returns a score between 0 and 1.
For example, h(ρ) could be the fraction of testcases that ρ passes. Writing r for the bandit reward and P_refine for the distribution of LLM-generated refinements, the reward follows a Bernoulli distribution whose parameter we write θ_ρ:

    θ_ρ = P(r = 1 | ρ) = E_{ρ′ ∼ P_refine(·|ρ)} [ 1[ρ′ ⊢ Φ] ]    (3)

We solve this bandit problem using Thompson Sampling (Russo et al., 2018; Chapelle & Li, 2011), meaning we maintain probabilistic beliefs about each arm's expected reward, θ_ρ. As standard, the distribution over θ_ρ is modeled as a Beta distribution, whose parameters we write α_ρ and β_ρ. Thompson sampling picks the next arm to pull by sampling an expected reward θ_ρ from each arm's Beta distribution, and then pulls the arm whose sampled value is highest. The corresponding Beta parameters are then updated via Bayes' rule, which simply increments α_ρ or β_ρ depending on whether reward was received. As a side effect of refining this program, the LLM generates a new program, which then becomes an arm that could be pulled in subsequent iterations. The whole process stops when a program is generated that perfectly satisfies the constraint Φ.

Alg. 2 puts these ingredients together. Although the pseudocode loops forever until it finds a program that perfectly explains the data, we bound the number of iterations to 50, at which point the best program found so far is returned.

Figure 4. h-value at 50 refinements, contrasting our bandit method with greedily refining the most promising program so far.

Choice of LLM for refinement. We use GPT-4 because recent work (Olausson et al., 2023) suggests it is, by far, the best model for refining code. Although expensive, the point of our bandit algorithm is to minimize this cost, and indeed, we find that our new method is more effective at managing these costs than existing approaches to refinement (Fig. 4).
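The refinement loop just described can be written compactly. The sketch below is our own illustration of these ingredients, not the paper's Alg. 2; llm_refine, satisfies_phi, and h are stand-ins for the LLM call, the ϕ1 ∧ ϕ2 check, and the test-pass-rate estimator, and programs are assumed to be hashable (e.g., source strings).

    import random

    def refine(initial_program, llm_refine, satisfies_phi, h, max_iters=50):
        # Arm-acquiring bandit: each arm is a program; pulling an arm asks the LLM
        # to refine that program. Beta(alpha, beta) beliefs track how likely a
        # refinement of each program is to satisfy the constraint.
        beliefs = {initial_program: (1.0, 1.0)}
        best = initial_program
        for _ in range(max_iters):
            # Thompson sampling: sample a success probability per arm, pull the argmax.
            parent = max(beliefs, key=lambda p: random.betavariate(*beliefs[p]))
            child = llm_refine(parent)              # refinement creates a brand-new arm
            success = satisfies_phi(child)
            a, b = beliefs[parent]
            beliefs[parent] = (a + success, b + (1 - success))   # Bayes update on the pulled arm
            beliefs[child] = (1.0, 1.0)
            if h(child) > h(best):
                best = child
            if success:
                return child
        return best   # after the iteration budget, return the best program found so far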
Figure 5. (A) The Sokoban domain (a per-step reward of -0.1 is elided from the figure). (B) Learning curves: solve rate on test levels. ReAct has the same pretrained knowledge of Sokoban but cannot effectively play the game. (C) LLM token cost: our method has a different asymptotic LLM cost compared to prior LLM agents, which consume LLM calls/tokens at every action. (D) Deep RL comparison: deep RL takes >1 million steps to learn 2-box Sokoban.
3. Experiments

Sokoban is a puzzle-solving task where the agent pushes boxes around a 2d world, with the goal of pushing every box onto a target (Fig. 5A). Solving hard Sokoban levels is a challenging planning task that has received recent attention from the planning and RL communities (Chrestien et al., 2023; Chung et al., 2023; Kujanpää et al., 2023; Feng et al., 2022; 2020). Unlike these other works, our emphasis is not on solving the hardest Sokoban levels. Instead, we wish to show that our agent can rapidly achieve basic competence. Master-level play could then be achieved via any of the cited works that focus on sophisticated planning and search.

Starting with only the natural-language goal of "win the game", our agent builds a world model over the first 50 actions. The resulting code is human-understandable (Appendix C), and generalizes to solving levels with more boxes (Fig. 5B). While the system cannot solve very hard Sokoban levels (e.g., 5+ boxes), that is an artifact of the difficulty of planning, and could be addressed by plugging the world model into any of the techniques cited above. In contrast to this work, both model-based and model-free deep RL require millions of experiences to solve basic levels (Fig. 5D).

Almost surely, an important reason why our system learns quickly is that the underlying LLM already knows about Sokoban (a computer game from 1982, so descriptions of how to play it are probably in GPT-4's training data), and can quickly infer that it is playing a similar game. However, simply knowing about Sokoban does not suffice for the LLM to play the game, as demonstrated by the poor performance of ReAct (Fig. 5B). ReAct is a baseline which prompts the LLM with the state-action history, then asks it to think step-by-step (Reason) before predicting an action (Act). Quantitatively, ReAct succeeds on only 15% ± 8% of basic levels, showing that pretrained knowledge of Sokoban does not, by itself, allow strong play.

ReAct-style architectures (Hao et al., 2023; Zhao et al., 2023; Liu et al., 2023b, i.a.) also require expensive LLM calls at every action, and so asymptotically their cost grows linearly with the number of actions taken. Our approach has different asymptotics: after front-loading 400k LLM tokens (about $15), it can issue as many actions as needed without subsequent LLM queries, whereas ReAct demands many times more tokens (Fig. 5C).

On Sokoban, the optimism under uncertainty objective (ϕ2, orange curves in Fig. 5B) has little effect: Sokoban has a dense reward structure that allows easy learning through random exploration. Next we consider problems with sparse rewards and natural-language instructions, which the optimism objective is designed to handle.

Minigrid. To better understand the transfer learning and instruction following aspects of our approach, we next study Minigrid (Chevalier-Boisvert et al., 2023), a suite of grid games designed for language-conditioned RL. Each minigrid environment introduces new objects with their own dynamics, such as keys, doors, walls, balls, and boxes.

Fig. 6 illustrates results for our agent playing a sequence of minigrid environments, while Appendix A.1 gives a walkthrough of an example learning trajectory. The agent interacts with each environment episodically through different randomly-generated levels. We order the environments into a curriculum designed to illustrate different forms of transfer learning. For example, when transferring from the first to the second environment, the agent needs to extend its knowledge to incorporate new objects and their associated dynamics (keys and doors). Learning about these new objects requires extra environment interactions, during which the agent experiments with the objects to update its transition function. In contrast, the third environment presents no new dynamics, but instead introduces new natural-language goals.
Figure 6. (A) Minigrid environments ordered into a curriculum that tests different kinds of transfer learning: env 1 empty ("get to the green goal square"); env 2 door_key ("use the key to open the door and get to the goal"; new transitions, few-shot generalization); env 3 unlock ("open the door"; new reward, 0-shot generalization); env 4 fetch ("fetch a green ball"; new reward + new transition, few-shot generalization); env 5 unlock_pickup ("pick up the red box"; new transitions, few-shot generalization). (B) Transfer (curriculum) learning performance, compared with (C) performance when solving each environment independently. Appendix Fig. 7: deep RL comparison.
Our agent can follow these natural-language instructions by enforcing optimism under uncertainty (ϕ2).

We observe that transfer is especially helpful in quickly solving new environments (Fig. 6B). Without transfer, more episodes of exploration are needed to collect experience to build a good world model. However, optimism under uncertainty (ϕ2) helps in this regard for the non-transfer setting by promoting goal-directed exploration, and in fact, absent transfer, it is only with ϕ2 that WorldCoder can solve the harder environments. Optimism is also necessary for zero-shot generalization to new goals (Fig. 6B, env 3).

To better understand the importance of ϕ2—which both encourages exploration and enables natural-language instruction following—we contrast against a version of our approach which simply prompts GPT-4 to generate a new reward function upon receiving a new goal (green in Fig. 6B, labelled 'prompt4reward'). Theoretically this suffices to follow new natural-language instructions, provided the LLM is reliable. Surprisingly, this ablation struggles to follow new instructions (e.g., transfer from env 2→3 in Fig. 6A), showing the importance of ϕ2 in correcting mistakes made by the LLM when generating reward functions.

4. Related Work

World models. World modeling is a classic paradigm for decision-making: it is the basis of model-based RL (Hafner et al., 2023; 2021) and of task-and-motion planning in robotics (Kaelbling & Lozano-Pérez, 2011), and it is corroborated by classic behavioral experiments suggesting biological decision-making features a similar architecture (Tolman & Honzik, 1930). At least in theory, an agent with a world model enjoys many advantages. Such an agent could reason in radically new situations by spending more time planning different actions—requiring zero retraining—and can similarly accomplish novel goals by just changing planning objectives, again zero-shot. In contrast, merely learning a policy can leave the agent vulnerable to catastrophic failure given even small changes to its initial conditions or goals (Kansky et al., 2017). But in practice, world models are hard-won, requiring either large volumes of training data (Hafner et al., 2021; Ha & Schmidhuber, 2018) or careful hand-engineering (Kaelbling & Lozano-Pérez, 2011) (cf. Mao et al. (2022); Konidaris et al. (2018)).

Neurosymbolic world models, such as Cosmos and NPS (Sehgal et al., 2023; Goyal et al., 2021), learn a factored, object-based neural world model. This factoring helps compositional generalization—as in our work—and, importantly, these models can learn from raw perception, but at the expense of transfer and sample complexity. Combining our work with these others might be able to get the best of both.

LLMs as a world model. Whether LLMs can model the world is an open question, but there is evidence that, given the right training data in large quantities, transformers can act as decent world models, at least within certain situations (Li et al., 2023; Xiang et al., 2023; Micheli et al., 2023). These works aim to learn a rich but frozen world model from a relatively large volume of examples.
We tackle a different problem: building a simple world model on-the-fly from a modest amount of data.

LLMs for building world models. Recent works (Zhu & Simmons, 2024; Wong et al., 2023; Guan et al., 2023) consider using LLMs to generate planning operators: a kind of world model that is abstract, symbolic, and expressed in a domain-specific programming language for planning (cf. DECKARD (Nottingham et al., 2023), another LLM system which generates state-machine world models). In these works, the primary driver of world-model generation—what the LLM first inputs—is natural language describing affordances and goals. Our work considers a different problem: building world models first and foremost from interacting with the environment. In practice, agents have knowledge both from language and from acting in the world, and so these families of works should be complementary.

LLMs for decision-making is an emerging paradigm that includes ReAct (Yao et al., 2022) and many others (Hao et al., 2023; Zhao et al., 2023; Liu et al., 2023b; Ahn et al., 2022, i.a.), which directly use LLMs to issue actions and reason about their consequences in the world. For instance, ReAct works by prompting the LLM to think step-by-step and then predict an action. To the extent that these methods use a world model, it is implicitly encoded within the weights of a neural network. We instead build an explicit world model, which has the advantage of not needing to query the LLM for every action, because the agent can just repeatedly execute the transition function: the cost of using the LLM is amortized, as it only needs to be paid once to get a good world model. However, ReAct-style approaches can handle partially observable environments (Sun et al., 2023b), which this paper does not consider.

Programs as Policies. Instead of learning a world model, one can learn a policy as a program. The first wave of these works (Verma et al., 2018; 2019) considered domain-specific languages, while recent LLM work (Wang et al., 2023b; Liang et al., 2022; Sun et al., 2023a) uses more flexible general-purpose languages like Python. An advantage of learning a policy is that it does not need to model all the details of the world, many of which may be irrelevant to decision making. A disadvantage is that policies cannot readily generalize to new goals—unlike world models, which can be used by a planner to achieve a variety of objectives. Relatedly, other recent work considers synthesizing programs that implement reward functions (Ma et al., 2023), and then generating a policy with conventional deep RL.

Programs as world models. We are strongly inspired by existing program synthesis algorithms for constructing world models from state-action trajectories (Das et al., 2023; Evans et al., 2021; Tsividis et al., 2021). We believe that this family of methods will not be generally applicable until they can support general-purpose, Turing-complete programming languages: so far these works have used restricted domain-specific languages, but we show that a general-purpose computational language, like Python, can be used to learn world models, which we hope expands the scope of this paradigm. We also show how to bias learning toward goal-directed behaviors, and how to support transfer across environments and goals. Last, we simplify the core program synthesis algorithm: the cited prior works required relatively intricate synthesis algorithms, which we can avoid by using LLMs as general-purpose synthesizers. We hope our work can help make this paradigm simpler and more general.

Other works have also explored how humans can manually provide knowledge to RL agents via source code: e.g., RLang (Rodriguez-Sanchez et al., 2023) uses programs to specify parts of policies and world models, which could be combined with our system to integrate prior knowledge.

Exploration and optimism under (model) uncertainty. Theoretical work on POMDPs (Liu et al., 2023a) proposed exploration strategies conceptually related to our optimism constraint, finding that they give theoretical improvements, which is synergistic with our empirical findings.

5. Limitations and Open Directions

Our work has important limitations, and naturally suggests next steps. Currently we assume deterministic dynamics, which could be addressed by synthesizing probabilistic programs (De Raedt et al., 2007; Goodman et al., 2008). Given recent advances in synthesizing probabilistic programs (Saad, 2022), together with advances in using LLMs for deterministic code, this limitation seems nontrivial but surmountable.

By representing knowledge as code, our approach delivers better sample efficiency and transferability, but at a cost: our states must be symbolic and discrete, whereas the real world is messy and continuous. While the obvious response is that the agent can be equipped with pretrained object detectors—a common assumption in robotic task planning (Konidaris et al., 2018, i.a.)—alternative routes include multimodal models (Hu et al., 2023) and neurosymbolic programs (Tang & Ellis, 2023) that bridge the gap between perception and symbol processing, which might be more robust to missing or misspecified symbols.

Last, our method uses only a very basic mechanism for growing and transferring its knowledge. Instead of prompting to debug its code, we could have built a library of reusable subroutines and classes shared across different environments and goals, reminiscent of library learning systems (Ellis et al., 2023; Wang et al., 2023a; Grand et al., 2023; Bowers et al., 2023), which refactor their code to expose sharable components. Further developing that and other ways of managing and growing symbolic knowledge about the world remains a prime target for future work.
Acknowledgements. We received funding support from NSF grant #2310350 as well as gifts from Cisco and Joseph Bates.

References

Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Ho, D., Hsu, J., Ibarz, J., Ichter, B., Irpan, A., Jang, E., Ruano, R. J., Jeffrey, K., Jesmonth, S., Joshi, N., Julian, R., Kalashnikov, D., Kuang, Y., Lee, K.-H., Levine, S., Lu, Y., Luu, L., Parada, C., Pastor, P., Quiambao, J., Rao, K., Rettinghouse, J., Reyes, D., Sermanet, P., Sievers, N., Tan, C., Toshev, A., Vanhoucke, V., Xia, F., Xiao, T., Xu, P., Xu, S., Yan, M., and Zeng, A. Do as I can and not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.

Bowers, M., Olausson, T. X., Wong, L., Grand, G., Tenenbaum, J. B., Ellis, K., and Solar-Lezama, A. Top-down synthesis for library learning. Proc. ACM Program. Lang., 7(POPL), 2023. doi: 10.1145/3571234. URL https://doi.org/10.1145/3571234.

Chapelle, O. and Li, L. An empirical evaluation of Thompson sampling. In Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011. URL https://proceedings.neurips.cc/paper_files/paper/2011/file/e53a0a2978c28872a4505bdb51db06dc-Paper.pdf.

Chen, X., Lin, M., Schärli, N., and Zhou, D. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128, 2023.

Chevalier-Boisvert, M., Dai, B., Towers, M., de Lazcano, R., Willems, L., Lahlou, S., Pal, S., Castro, P. S., and Terry, J. Minigrid & Miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks. CoRR, abs/2306.13831, 2023.

Chrestien, L., Edelkamp, S., Komenda, A., and Pevný, T. Optimize planning heuristics to rank, not to estimate cost-to-goal. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=Mgy6sgslPY.

Chung, S., Anokhin, I., and Krueger, D. Thinker: Learning to plan and act. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=mumEBl0arj.

Das, R., Tenenbaum, J. B., Solar-Lezama, A., and Tavares, Z. Combining functional and automata synthesis to discover causal reactive programs. Proc. ACM Program. Lang., 7(POPL), 2023. doi: 10.1145/3571249. URL https://doi.org/10.1145/3571249.

De Raedt, L., Kimmig, A., and Toivonen, H. ProbLog: A probabilistic Prolog and its application in link discovery. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pp. 2468–2473, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.

Diuk, C., Cohen, A., and Littman, M. L. An object-oriented representation for efficient reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 240–247, 2008.

Ellis, K. Human-like few-shot learning via Bayesian reasoning over natural language. NeurIPS, 2023.

Ellis, K., Wong, L., Nye, M., Sable-Meyer, M., Cary, L., Anaya Pozo, L., Hewitt, L., Solar-Lezama, A., and Tenenbaum, J. B. DreamCoder: Growing generalizable, interpretable knowledge with wake–sleep Bayesian program learning. Philosophical Transactions of the Royal Society A, 381(2251):20220050, 2023.

Evans, R., Bošnjak, M., Buesing, L., Ellis, K., Pfau, D., Kohli, P., and Sergot, M. Making sense of raw input. Artificial Intelligence, 299:103521, 2021.

Feng, D., Gomes, C. P., and Selman, B. A novel automated curriculum strategy to solve hard Sokoban planning instances. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 3141–3152. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/2051bd70fc110a2208bdbd4a743e7f79-Paper.pdf.

Feng, D., Gomes, C., and Selman, B. Left heavy tails and the effectiveness of the policy and value networks in DNN-based best-first search for Sokoban planning, 2022.

Goodman, N. D., Mansinghka, V. K., Roy, D., Bonawitz, K., and Tenenbaum, J. B. Church: A language for generative models. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI'08, pp. 220–229, Arlington, Virginia, USA, 2008. AUAI Press. ISBN 0974903949.

Goyal, A., Didolkar, A., Ke, N. R., Blundell, C., Beaudoin, P., Heess, N., Mozer, M., and Bengio, Y. Neural production systems: Learning rule-governed visual dynamics. arXiv preprint arXiv:2103.01937, 2021.
Grand, G., Wong, L., Bowers, M., Olausson, T. X., Liu, M., Tenenbaum, J. B., and Andreas, J. LILO: Learning interpretable libraries by compressing and documenting code, 2023.

Guan, L., Valmeekam, K., Sreedharan, S., and Kambhampati, S. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. NeurIPS, 2023.

Ha, D. and Schmidhuber, J. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31, pp. 2451–2463. Curran Associates, Inc., 2018. https://worldmodels.github.io.

Hafner, D., Lillicrap, T., Norouzi, M., and Ba, J. Mastering Atari with discrete world models. ICLR, 2021.

Hafner, D., Pasukonis, J., Ba, J., and Lillicrap, T. Mastering diverse domains through world models, 2023.

Hallak, A., Di Castro, D., and Mannor, S. Contextual Markov decision processes. arXiv preprint arXiv:1502.02259, 2015.

Hao, S., Gu, Y., Ma, H., Hong, J. J., Wang, Z., Wang, D. Z., and Hu, Z. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023.

Hu, Y., Lin, F., Zhang, T., Yi, L., and Gao, Y. Look before you leap: Unveiling the power of GPT-4V in robotic vision-language planning. arXiv preprint arXiv:2311.17842, 2023.

Kaelbling, L. P. and Lozano-Pérez, T. Hierarchical task and motion planning in the now. In 2011 IEEE International Conference on Robotics and Automation, pp. 1470–1477, 2011. doi: 10.1109/ICRA.2011.5980391.

Kansky, K., Silver, T., Mély, D. A., Eldawy, M., Lázaro-Gredilla, M., Lou, X., Dorfman, N., Sidor, S., Phoenix, S., and George, D. Schema networks: Zero-shot transfer with a generative causal model of intuitive physics. In International Conference on Machine Learning, pp. 1809–1818. PMLR, 2017.

Konidaris, G., Kaelbling, L. P., and Lozano-Perez, T. From skills to symbols: Learning symbolic representations for abstract high-level planning. Journal of Artificial Intelligence Research, 61:215–289, 2018.

Kujanpää, K., Pajarinen, J., and Ilin, A. Hybrid search for efficient planning with completeness guarantees. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=bY0c46ZtXa.

Li, K., Hopkins, A. K., Bau, D., Viégas, F., Pfister, H., and Wattenberg, M. Emergent world representations: Exploring a sequence model trained on a synthetic task. ICLR, 2023.

Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., and Zeng, A. Code as policies: Language model programs for embodied control. arXiv preprint arXiv:2209.07753, 2022.

Liu, Q., Netrapalli, P., Szepesvari, C., and Jin, C. Optimistic MLE: A generic model-based algorithm for partially observable sequential decision making. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pp. 363–376, 2023a.

Liu, Z., Hu, H., Zhang, S., Guo, H., Ke, S., Liu, B., and Wang, Z. Reason for future, act for now: A principled framework for autonomous LLM agents with provable sample efficiency. arXiv preprint arXiv:2309.17382, 2023b.

Ma, Y. J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., and Anandkumar, A. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023.

Mao, J., Lozano-Pérez, T., Tenenbaum, J., and Kaelbling, L. PDSketch: Integrated domain programming, learning, and planning. Advances in Neural Information Processing Systems, 35:36972–36984, 2022.

Micheli, V., Alonso, E., and Fleuret, F. Transformers are sample-efficient world models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=vhFu1Acb0xb.

Nottingham, K., Ammanabrolu, P., Suhr, A., Choi, Y., Hajishirzi, H., Singh, S., and Fox, R. Do embodied agents dream of pixelated sheep: Embodied decision making using language guided world modelling. In International Conference on Machine Learning (ICML), 2023.

Olausson, T. X., Inala, J. P., Wang, C., Gao, J., and Solar-Lezama, A. Is self-repair a silver bullet for code generation?, 2023.

Qiu, L., Jiang, L., Lu, X., Sclar, M., Pyatkin, V., Bhagavatula, C., Wang, B., Kim, Y., Choi, Y., Dziri, N., and Ren, X. Phenomenal yet puzzling: Testing inductive reasoning capabilities of language models with hypothesis refinement, 2023.

Raffin, A. RL Baselines3 Zoo. https://github.com/DLR-RM/rl-baselines3-zoo, 2020.
Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., and Dormann, N. Stable-Baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021. URL http://jmlr.org/papers/v22/20-1364.html.

Rodriguez-Sanchez, R., Spiegel, B. A., Wang, J., Patel, R., Tellex, S., and Konidaris, G. RLang: A declarative language for describing partial world knowledge to reinforcement learning agents, 2023.

Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., Wen, Z., et al. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.

Saad, F. Scalable Structure Learning, Inference, and Analysis with Probabilistic Programs. PhD thesis, Massachusetts Institute of Technology, 2022.

Schrader, M.-P. B. gym-sokoban. https://github.com/mpSchrader/gym-sokoban, 2018.

Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.

Sehgal, A., Grayeli, A., Sun, J. J., and Chaudhuri, S. Neurosymbolic grounding for compositional world models, 2023.

Sun, H., Zhuang, Y., Kong, L., Dai, B., and Zhang, C. AdaPlanner: Adaptive planning from feedback with language models. NeurIPS, 2023a.

Sun, L., Jha, D. K., Hori, C., Jain, S., Corcodel, R., Zhu, X., Tomizuka, M., and Romeres, D. Interactive planning using large language models for partially observable robotics tasks, 2023b.

Sutton, R. S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pp. 216–224. Elsevier, 1990.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Tang, H. and Ellis, K. From perception to programs: Regularize, overparameterize, and amortize. International Conference on Machine Learning (ICML), 2023.

Tolman, E. C. and Honzik, C. H. Introduction and removal of reward, and maze performance in rats. University of California Publications in Psychology, 1930.

Tsividis, P. A., Loula, J., Burga, J., Foss, N., Campero, A., Pouncy, T., Gershman, S. J., and Tenenbaum, J. B. Human-level reinforcement learning through theory-based modeling, exploration, and planning. arXiv preprint arXiv:2107.12544, 2021.

Valmeekam, K., Marquez, M., Sreedharan, S., and Kambhampati, S. On the planning abilities of large language models - a critical investigation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=X6dEqXIsEW.

Verma, A., Murali, V., Singh, R., Kohli, P., and Chaudhuri, S. Programmatically interpretable reinforcement learning. In International Conference on Machine Learning, pp. 5045–5054. PMLR, 2018.

Verma, A., Le, H., Yue, Y., and Chaudhuri, S. Imitation-projected programmatic reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.

Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023a.

Wang, H., Gonzalez-Pumariega, G., Sharma, Y., and Choudhury, S. Demo2Code: From summarizing demonstrations to synthesizing code via extended chain-of-thought. NeurIPS, 2023b.

Wang, R., Zelikman, E., Poesia, G., Pu, Y., Haber, N., and Goodman, N. D. Hypothesis search: Inductive reasoning with language models, 2023c.

Whittle, P. Arm-acquiring bandits. The Annals of Probability, 9(2):284–292, 1981.

Wong, L., Mao, J., Sharma, P., Siegel, Z. S., Feng, J., Korneev, N., Tenenbaum, J. B., and Andreas, J. Learning adaptive planning representations with natural language guidance, 2023.

Xiang, J., Tao, T., Gu, Y., Shu, T., Wang, Z., Yang, Z., and Hu, Z. Language models meet world models: Embodied experiences enhance language models. NeurIPS, 2023.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.

Zhao, Z., Lee, W. S., and Hsu, D. Large language models as commonsense knowledge for large-scale task planning. NeurIPS, 2023.

Zhu, F. and Simmons, R. Bootstrapping cognitive agents with a large language model. In AAAI, 2024.
A. Theoretical Analysis of the Optimism Under Uncertainty Objective

Observation. When planning with a world model that satisfies ϕ1 ∧ ϕ2, there are only two possible outcomes:

• Either the model is correct: the agent achieves the goal successfully;

• Or the model is incorrect: the agent then must find a counter-example to its world model's predictions and thereby gain more knowledge of the environment.

This shows that a world model satisfying ϕ1 ∧ ϕ2 is either correct or efficient at guiding the agent to a counter-example of the current world model. Counter-examples are precious because, when searching for the correct world model, each interaction datum (s, a, s′, r, d) in the replay buffer imposes a constraint that the potentially correct world models must satisfy. Collecting data that can be explained by all potentially correct world models is useless, because it implies no stricter constraints than the current ones. Only through counter-examples that cannot be explained by the current world model does the set of potentially correct world models shrink, eventually leaving only the correct world model.

The optimism under uncertainty objective is much more sample-efficient at obtaining these valuable counter-examples than random exploration. For deterministic environments, the number of actions needed to find a counter-example with a world model satisfying ϕ1 ∧ ϕ2 is guaranteed to be smaller than the size of the state space for an MDP (or the history space for a POMDP), as there is no need to revisit the same state/history. Random exploration needs exponentially more actions to find such a counter-example in the worst case. (Note that we do not assume free resets to all possible states, so the agent needs to traverse trajectories from the initial state to each target state.)

More formally, we show that the maximum number of actions needed to learn the correct world model is polynomial in the size of the state space and the size of the model space for deterministic MDPs, as follows:

Definition A.1. A set of data D is mutually independent w.r.t. a solution space M, written D ⊥⊥ M, if and only if for every datum in the set there exists a solution in the solution space that explains all other data but not that datum: ∀d ∈ D, ∃m ∈ M : m ⊬ d ∧ (∀d′ ∈ D \ {d} : m ⊢ d′).

Denote the maximum size of a mutually independent data set w.r.t. a solution space M by K_{⊥⊥M} = max{|D| : D ∈ 𝒟, D ⊥⊥ M}, where 𝒟 is the space of data sets. It is then straightforward to prove:

Lemma A.2. K_{⊥⊥M} ≤ |M|, i.e., the maximum size of a mutually independent data set w.r.t. a solution space is no larger than the size of the solution space.

Proof: each data point d ∈ D must exclude one more unique solution from the solution space than the other data points do; otherwise, the data set is not mutually independent w.r.t. the solution space.

Note that this is the loosest upper bound on K_{⊥⊥M}: we assume nothing about the model space or the learning algorithm. In practice, K_{⊥⊥M} should be much smaller than the size of the solution space. For example, for linear models we only need n independent data points to characterize the correct solution for d ∈ ℝⁿ.

Theorem A.3 (Guaranteed sample efficiency when using the optimism under uncertainty objective). When using the optimism under uncertainty objective (ϕ1 ∧ ϕ2 in Sec. 2), the correct world model is guaranteed to be found in no more than |S| × (K_{⊥⊥T×R} + 1) ≤ |S| × (|T| × |R| + 1) actions for a deterministic MDP environment M = (S, A, T, R, γ), where T and R denote the space of transition functions and the space of reward functions, respectively, and K_{⊥⊥T×R} denotes the maximum size of a data set {(s, a, s′, r, d)} that is mutually independent w.r.t. the world-model space T × R.
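For intuition, a small worked instance of the bound (the numbers are hypothetical, chosen only to illustrate the counting argument):

    % Hypothetical instance of Theorem A.3: a gridworld with |S| = 100 reachable
    % states whose world-model space admits at most K = 20 mutually independent
    % experience tuples. The optimistic agent then needs at most
    \[
        |S| \times (K_{\perp\!\!\!\perp T\times R} + 1) = 100 \times (20 + 1) = 2100
    \]
    % environment actions before its world model is exactly correct, regardless of the
    % (possibly exponential) number of action sequences a random explorer would need.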
Informal proof. Lemma α: Each replay-buffer data set D can be represented, for the purpose of finding the correct world model, by its maximum mutually independent subset D_{⊥⊥T×R}, i.e.,

    D_{⊥⊥T×R} ⊆ D   ∧   D_{⊥⊥T×R} ⊥⊥ T×R   ∧   ∀(T̂, R̂) ∈ T×R : (T̂, R̂) ⊢ D ⇔ (T̂, R̂) ⊢ D_{⊥⊥T×R}.

Lemma β: For the replay buffer D^{(t)} at any step t, if a world model (T̂^{(t)}, R̂^{(t)}) satisfies the optimism under uncertainty objective as well as the traditional data-consistency objective, i.e., (T̂^{(t)}, R̂^{(t)}) ⊢ D^{(t)} ∧ (T̂^{(t)}, R̂^{(t)}) ⊢ ϕ2, then either the world model is correct, meaning it can guide the agent to the goal successfully, or the agent finds a counter-example d′ to the current world model within |S| steps.

Lemma γ: This counter-example d′ is mutually independent of the replay buffer D^{(t)}, because there is a model (T̂^{(t)}, R̂^{(t)}) such that (T̂^{(t)}, R̂^{(t)}) ⊬ d′ ∧ (T̂^{(t)}, R̂^{(t)}) ⊢ D^{(t)}.

Given these lemmas, we have

    |D^{(t+|S|)}_{⊥⊥T×R}| ≥ |D^{(t)}_{⊥⊥T×R}| + 1,

and therefore

    |D^{(|S|×(K_{⊥⊥T×R}+1))}_{⊥⊥T×R}| ≥ K_{⊥⊥T×R} + 1 + |D^{(0)}_{⊥⊥T×R}| ≥ K_{⊥⊥T×R} + 1 > K_{⊥⊥T×R}.

Assume the world model after |S| × (K_{⊥⊥T×R} + 1) steps is still incorrect; we would then have built a mutually independent data set D^{(|S|×(K_{⊥⊥T×R}+1))}_{⊥⊥T×R} with size larger than K_{⊥⊥T×R}, which contradicts its definition.

The proofs of Lemmas α and γ follow directly from the definitions. We prove Lemma β by construction. Given the definition of the optimism under uncertainty objective (ϕ2 in Sec. 2),

    ϕ2(s0, c, T̂^{(t)}, R̂^{(t)}) = ∃a1, s1, a2, s2, ..., aℓ, sℓ : (∀i ∈ [ℓ] : T̂^{(t)}(s_{i−1}, a_i) = s_i) ∧ (∃r > 0 : R̂^{(t)}(c)(s_{ℓ−1}, a_ℓ, s_ℓ) = (r, 1)),

there exists a trajectory a1, s1, a2, s2, ..., aℓ, sℓ such that either the model correctly leads the agent to the goal, i.e., ∃r > 0 : R(c)(s_{ℓ−1}, a_ℓ, s_ℓ) = (r, 1), or there exists a counter-example, i.e., ∃i ∈ [ℓ] : T̂^{(t)}(s_{i−1}, a_i) ≠ T(s_{i−1}, a_i) ∨ R̂^{(t)}(c)(s_{i−1}, a_i, s_i) ≠ R(c)(s_{i−1}, a_i, s_i), which means (T̂^{(t)}, R̂^{(t)}) ⊬ (s_{i−1}, a_i, s_i, r_i, d_i). We also have (T̂^{(t)}, R̂^{(t)}) ⊢ D^{(t)} due to (T̂^{(t)}, R̂^{(t)}) ⊢ ϕ1. This proves Lemma β.
A.1. Walkthrough of an example learning trajectory

With the optimism under uncertainty objective, the agent can:

• imagine the functionality of actions and their interactions with the necessary tools, in order to achieve the goal, without any real interactions with the environment;

• and correct the imagined world model given the newly collected data after the efficient exploration.

This goal-driven exploration is guaranteed to be efficient, as proved in Appendix A. The learning trajectory is as follows, starting from an empty world model (T̂ = NULL, R̂ = NULL):
Data-consistent reward-func
(random guess given the textual mission)
Goal-driven transit-func
(imagine how to move forward without interactions)
(imperfect though as not considering the locked door)
Goal-driven reward-func
(more efficient representation <- already correct)
Goal-driven transit-func
(imagine how to pickup, toggle, and drop, without interactions)
(imperfect though as trying to pickup at the same position as the agent)
    ...
    def transition(state, action):
        ...
        elif action == 'move forward':
            entities_at_front = [entity for entity in
                                 get_entities_by_position(next_state, front_x, front_y)
                                 if entity.name != 'Agent']
            if entities_at_front:
                if entities_at_front[0].name == "Door" and \
                   entities_at_front[0].state == "unlocked":
                    agent.x, agent.y = front_x, front_y
                # other entities should prevent agent from moving forward
            else:
                agent.x, agent.y = front_x, front_y
        elif action == 'pick up':
            items_at_agent_location = get_entities_by_position(
                next_state, agent.x, agent.y)
            pickable_items = [item for item in items_at_agent_location
                              if item.name not in ['Door', 'Wall', 'Agent']]
            if pickable_items:
                agent.carrying = pickable_items[0]
                next_state.remove(pickable_items[0])
        ...
Goal-driven transit-func
(imagine that the agent can also pick up the object in front of it, which is correct, without interactions)
    ...
    def toggle_door(agent, next_state, next_x, next_y):
        doors_in_next_position = [door for door in
                                  get_entities_by_position(next_state, next_x, next_y)
                                  if door.name == 'Door']
        if doors_in_next_position and doors_in_next_position[0].color == agent.carrying.color:
            doors_in_next_position[0].state = 'open'
    ...
...
def drop_item(agent, next_state, next_x, next_y):
    entities_in_next_position = get_entities_by_position(
        next_state, next_x, next_y)
    if not entities_in_next_position and agent.carrying:
        # Drop can only drop object if there's no obstacle and agent carries something.
        agent.carrying.x, agent.carrying.y = next_x, next_y
        next_state.append(agent.carrying)
        agent.carrying = None
...
def check_no_obstacle_between(agent, next_state, x, y):
    dx, dy = x - agent.x, y - agent.y
    for i in range(min(abs(dx), abs(dy))):
        entities_at_next_position = get_entities_by_position(
            next_state, agent.x + i * dx, agent.y + i * dy)
        if any(isinstance(entity, Wall) or (isinstance(entity, Door) and entity.state != 'open')
               for entity in entities_at_next_position):
            return False
    return True

def pickup_item(agent, next_state):
    items_in_current_location = get_entities_by_position(
        next_state, agent.x, agent.y)
    pickable_items = [item for item in items_in_current_location
                      if item.name not in ['Door', 'Wall', 'Agent']]
    if agent.carrying is None:  # Agent can only pick up an item when it is not carrying an item
        if not pickable_items:
            dx, dy = agent.direction
            facing_x, facing_y = agent.x + dx, agent.y + dy
            if check_no_obstacle_between(agent, next_state, facing_x, facing_y):
                # Make sure there is no wall or door between the agent and the item
                items_in_facing_location = get_entities_by_position(
                    next_state, facing_x, facing_y)
                pickable_items = [item for item in items_in_facing_location
                                  if item.name not in ['Door', 'Wall']]
        if pickable_items:
            agent.carrying = pickable_items[0]
            next_state.remove(pickable_items[0])
...
class Entity:
    def __init__(self, x, y, **kwargs):
        self.name = self.__class__.__name__
        self.x = x
        self.y = y
        for key, value in kwargs.items():
            setattr(self, key, value)

    def __repr__(self):
        attr = ', '.join(f'{key}={value}' for key, value in self.__dict__.items()
                         if key not in ('name', 'x', 'y'))
        if attr: return f"{self.name}({self.x}, {self.y}, {attr})"
        else: return f"{self.name}({self.x}, {self.y})"
def transition(state, action):
    next_state = list(state)
    agent = get_entities_by_name(next_state, 'Agent')[0]
    dx, dy = agent.direction
    front_x, front_y = agent.x + dx, agent.y + dy
    if action == 'turn right' or action == 'turn left':
        update_direction(agent, action)
    elif action == 'move forward':
        update_position(agent, next_state, front_x, front_y)
    elif action == 'pick up':
        pickup_item(agent, next_state)
    elif action == 'drop':
        drop_item(agent, next_state, front_x, front_y)
    elif action == 'toggle':
        toggle_door(agent, next_state, front_x, front_y)
    return next_state

def update_position(agent, next_state, next_x, next_y):
    entities_at_next_position = get_entities_by_position(next_state, next_x, next_y)
    if not any(
        (
            isinstance(entity, Wall) or
            isinstance(entity, Box) or
            isinstance(entity, Ball) or
            isinstance(entity, Lava) or
            (isinstance(entity, Door) and entity.state != 'open') or
            isinstance(entity, Key)
        )
        for entity in entities_at_next_position
    ):
        agent.x, agent.y = next_x, next_y
    else:
        agent.x, agent.y = agent.x, agent.y  # Agent stays in place
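The synthesized `transition` and `reward_func` are ordinary Python functions, so a standard search procedure can roll them out to select actions. Below is a minimal sketch of such a planner: a breadth-first search over action sequences that stops once the world model predicts a positive terminal reward. The function name, the `state_key` serialization, and the depth limit are illustrative assumptions rather than the exact planner used in our experiments.

```python
import copy
from collections import deque

def plan_with_world_model(state, actions, transition_fn, reward_fn, max_depth=20):
    """Breadth-first search over action sequences using a synthesized world model.

    transition_fn(state, action) -> next_state and
    reward_fn(state, action, next_state) -> (reward, done) are the two synthesized
    functions; state is a list of entity objects. The repr-based state key and the
    depth limit are illustrative assumptions.
    """
    def state_key(s):
        # Serialize a state so visited states can be de-duplicated.
        return tuple(sorted(repr(entity) for entity in s))

    frontier = deque([(state, [])])
    visited = {state_key(state)}
    while frontier:
        current, action_seq = frontier.popleft()
        if len(action_seq) >= max_depth:
            continue
        for action in actions:
            # The synthesized transitions mutate entity objects in place, so copy first.
            next_state = transition_fn(copy.deepcopy(current), action)
            reward, done = reward_fn(current, action, next_state)
            if done and reward > 0:
                return action_seq + [action]  # plan the model predicts reaches the goal
            key = state_key(next_state)
            if key not in visited:
                visited.add(key)
                frontier.append((next_state, action_seq + [action]))
    return None  # the current world model predicts no rewarding trajectory
```

Called as `plan_with_world_model(initial_state, valid_actions, transition, reward_func)`, such a search would return the first action sequence the learned model believes reaches a rewarding terminal state, or None when the model needs further refinement.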
[Figure 7: PPO baseline learning curves. Solve rate (y-axis, 0.0 to 1.0) versus environment steps (x-axis, log scale from 10^0 to 10^9) for the MiniGrid tasks empty, doorkey, unlock, fetch, and unlockpickup.]
We evaluate PPO as a deep RL baseline on the MiniGrid experiments. We use the tuned hyper-parameters from RL Baselines3 Zoo (Raffin, 2020) for each environment and give PPO the same symbolic memory state as our agent. As shown in Figure 7, PPO is far less sample-efficient than our approach: it needs 10^4 to 10^5 steps to learn valid (though not perfect) policies in most environments, and it cannot solve the UnlockPickup environment even within 3 × 10^8 steps.
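For reference, a rough version of this baseline can be set up with Stable-Baselines3 and the minigrid package as sketched below. The sketch uses default PPO hyper-parameters and the standard `FlatObsWrapper` observation encoding rather than the tuned hyper-parameters from RL Baselines3 Zoo and our exact symbolic state, so it approximates, but does not reproduce, the configuration behind Figure 7.

```python
import gymnasium as gym
import minigrid  # importing minigrid registers the MiniGrid-* environments
from minigrid.wrappers import FlatObsWrapper
from stable_baselines3 import PPO

# DoorKey is one of the tasks in Figure 7; the other tasks use different env IDs.
env = FlatObsWrapper(gym.make("MiniGrid-DoorKey-5x5-v0"))

# Default PPO hyper-parameters; the reported baseline instead uses the tuned
# values from RL Baselines3 Zoo for each environment.
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)

# Roll out the learned policy for one episode.
obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
```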
    ...
            pass
        else:
            # If not, the box moves to the new position
            pushed_box[0].x += delta_x
            pushed_box[0].y += delta_y
            player.x += delta_x
            player.y += delta_y
    else:
        player.x += delta_x
        player.y += delta_y
    return state
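The fragment above implements the standard Sokoban push rule: moving into a box pushes it one cell further only if the cell beyond is free; otherwise neither the box nor the player moves. A small self-contained sketch of this rule is given below; the `Player`, `Box`, and `Wall` classes and the `move` helper are illustrative stand-ins, not the code synthesized in our experiments.

```python
# Minimal, illustrative Sokoban-style push rule (hypothetical classes and helper).
class Entity:
    def __init__(self, x, y):
        self.x, self.y = x, y

class Player(Entity): pass
class Box(Entity): pass
class Wall(Entity): pass

def entities_at(state, x, y):
    return [e for e in state if e.x == x and e.y == y]

def move(state, delta_x, delta_y):
    player = next(e for e in state if isinstance(e, Player))
    target_x, target_y = player.x + delta_x, player.y + delta_y
    blockers = entities_at(state, target_x, target_y)
    pushed_box = [e for e in blockers if isinstance(e, Box)]
    if any(isinstance(e, Wall) for e in blockers):
        pass                                   # walls always block the player
    elif pushed_box:
        beyond = entities_at(state, target_x + delta_x, target_y + delta_y)
        if beyond:
            pass                               # box blocked: neither box nor player moves
        else:
            # If not, the box moves to the new position and the player follows.
            pushed_box[0].x += delta_x
            pushed_box[0].y += delta_y
            player.x += delta_x
            player.y += delta_y
    else:
        player.x += delta_x
        player.y += delta_y
    return state
```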
D. Prompts
We list all the prompts used in our experiments in this section. The functionality of each prompt is stated in its subsection name. We highlight the dynamic information in yellow and the main instruction in blue. The dynamic information includes the data collected so far in the replay buffer and the code synthesized so far by previous LLM calls.
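As an illustration of how states from the replay buffer can be rendered into the grid format used in these prompts, the helper below (a hypothetical name, not our exact implementation) lays out a list of entity objects row by row, writing `empty` for unoccupied cells, in the same semicolon-separated style as the state listings shown later in this appendix.

```python
def render_state(state, width, height):
    """Render a list of entity objects (each with .x, .y and a __repr__ such as
    Wall(0, 0) or Key(1, 3, color=yellow)) as the semicolon-separated grid used
    in the prompts; empty cells are written as 'empty'. Illustrative only."""
    rows = []
    for y in range(height):
        cells = []
        for x in range(width):
            here = [e for e in state if e.x == x and e.y == y]
            cells.append(repr(here[0]) if here else 'empty')
        rows.append(' ; '.join(cells) + ' ;')
    return '\n'.join(rows)
```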
You are a robot exploring in an object-centric environment. Your goal is to model the logic of
the world in python. You will be provided experiences in the format of (state, action,
next_state) tuples. You will also be provided with a short natural language description that
briefly summarizes the difference between the state and the next state for each (state,
next_state,) pair. You need to implement the python code to model the logic of the world, as
seen in the provided experiences. Please follow the template to implement the code. The code
needs to be directly runnable on the state and return the next state in python as provided in
the experiences.
You need to implement python code to model the logic of the world as seen in the following experiences:
Please implement code to model the logic of the world as demonstrated by the experiences. Here
is the template for the transition function. Please implement the transition function following
the template. The code needs to be directly runnable on the inputs of (state, action) and
return the next state in python as provided in the experiences.
```
class Entity:
    def __init__(self, x, y, **kwargs):
        self.name = self.__class__.__name__
        self.x = x
        self.y = y
        for key, value in kwargs.items():
            setattr(self, key, value)

    def __repr__(self):
        attr = ', '.join(f'{key}={value}' for key, value in self.__dict__.items()
                         if key not in ('name', 'x', 'y'))
        if attr: return f"{self.name}({self.x}, {self.y}, {attr})"
        else: return f"{self.name}({self.x}, {self.y})"

    def __eq__(self, other):
        return all(getattr(self, key) == getattr(other, key, None) for key in self.__dict__.keys())

    def __hash__(self):
        return hash(tuple(sorted(self.__dict__.items())))

class Agent(Entity): pass
class Key(Entity): pass
class Door(Entity): pass
class Goal(Entity): pass
class Wall(Entity): pass
class Box(Entity): pass
class Ball(Entity): pass
class Lava(Entity): pass

def get_entities_by_name(entities, name):
    return [ entity for entity in entities if entity.name == name ]

def get_entities_by_position(entities, x, y):
    return [ entity for entity in entities if entity.x == x and entity.y == y ]

def transition(state, action):
    """
    Args:
        state: the state of the environment
        action: the action to be executed
    Returns:
        next_state: the next state of the environment
    """
    raise NotImplementedError
```
Please implement code to model the logic of the world as demonstrated by the experiences.
Please implement the code following the template. Feel free to implement the helper functions
you need. You can also implement the logic for difference actions in different helper functions
. However, you must implement the ‘ transition ‘ function as the main function to be called by
the environment. The code needs to be directly runnable on the inputs as (state, action) and
return the next state in python as provided in the experiences. Let’s think step by step.
You are a robot exploring in an object-centric environment. Your goal is to model the logic of
the world in python. You will be provided experiences in the format of (state, action,
next_state, reward, done) tuples. You will also be provided with a short natural language
description that briefly summarizes the difference between the state and the next state for
each (state, next_state) pair. You need to implement the python code to model the logic of the
world, as seen in the provided experiences. Please follow the template to implement the code.
The code needs to be directly runnable on the (state, action, next_state) tuple and return the
(reward, done) tuple in python as provided in the experiences.
You need to implement python code to model the logic of the world as seen in the following experiences for mission "use the key to open the door and then get to the goal":
```
...
```
to
```
Wall(0, 0) ; Wall(1, 0) ; Wall(2, 0) ; Wall(3, 0) ; Wall(4, 0) ;
Wall(0, 1) ; empty ; Wall(2, 1) ; empty ; Wall(4, 1) ;
Wall(0, 2) ; Agent(1, 2, direction=(-1, 0), carrying=None) ; Door(2, 2, color=yellow, state=locked) ; empty ; Wall(4, 2) ;
Wall(0, 3) ; Key(1, 3, color=yellow) ; Wall(2, 3) ; Goal(3, 3) ; Wall(4, 3) ;
Wall(0, 4) ; Wall(1, 4) ; Wall(2, 4) ; Wall(3, 4) ; Wall(4, 4) ;
```
The difference is
"""
The agent (direction=(0, -1)) at pos (1, 2) becomes an agent (direction=(-1, 0)).
"""
, the returned reward is ‘ 0.0 ‘ and the returned done is ‘ False ‘
Please implement code to model the logic of the world as demonstrated by the experiences. Here
is the template for the reward function. Please implement the reward function following the
template. The code needs to be directly runnable on the inputs of (state, action, next_state)
and return (reward, done) in python as provided in the experiences.
```
class Entity:
    def __init__(self, x, y, **kwargs):
        self.name = self.__class__.__name__
        self.x = x
        self.y = y
        for key, value in kwargs.items():
            setattr(self, key, value)

    def __repr__(self):
        attr = ', '.join(f'{key}={value}' for key, value in self.__dict__.items()
                         if key not in ('name', 'x', 'y'))
        if attr: return f"{self.name}({self.x}, {self.y}, {attr})"
        else: return f"{self.name}({self.x}, {self.y})"

    def __eq__(self, other):
        return all(getattr(self, key) == getattr(other, key, None) for key in self.__dict__.keys())

    def __hash__(self):
        return hash(tuple(sorted(self.__dict__.items())))

class Agent(Entity): pass
class Key(Entity): pass
class Door(Entity): pass
class Goal(Entity): pass
...
```
Please implement code to model the logic of the world as demonstrated by the
experiences. Please implement the code following the template. You must implement the ‘
reward_func ‘ function as the main function to be called by the environment. The code needs to
be directly runnable on the inputs as (state, action, next_state) and return (reward, done) in
python as provided in the experiences. Let’s think step by step.
You are a robot exploring in an object-centric environment. Your goal is to model the logic of
the world in python. You have tried it before and came up with one partially correct solution.
However, it is not perfect. They can model the logic for some experiences but failed for others
. You need to improve your code to model the logic of the world for all the experiences. The
new code needs to be directly runnable on the (state, action) pair and return the next state in
python as provided in the experiences.
Here is the partially correct solution you came up with. It can model the logic for some
experiences but failed for others. You need to improve your code to model the logic of the
world for all the experiences. The new code needs to be directly runnable on the (state, action
) pair and return the next state in python as provided in the experiences.
```
class Entity:
    def __init__(self, x, y, **kwargs):
        self.name = self.__class__.__name__
        self.x = x
        self.y = y
        for key, value in kwargs.items():
            setattr(self, key, value)

    def __repr__(self):
        attr = ', '.join(f'{key}={value}' for key, value in self.__dict__.items()
                         if key not in ('name', 'x', 'y'))
        if attr: return f"{self.name}({self.x}, {self.y}, {attr})"
...
```
The given code cannot model the logic of the world for all the experiences. Here are some experiences that the code has successfully modeled.
"""
Nothing happened
"""
For this failed experience, do you know what is different between the true transitions from the
environment and the predictions from the code? Do you know why the environment behaves in this
way? Do you know why the code behaves differently from the environment? Which part of the code
causes the problem? How to fix it? Please improve your code to model the logic of the world
for all the experiences, accordingly. Please implement the code following the template. Feel
free to implement any helper functions you need. You can also implement the logic for
difference actions in different helper functions. However, you must implement the ‘ transition
‘ function as the main function to be called by the environment. The code needs to be directly
runnable on the (state, action) tuple and return the new state in python as provided in the
experiences. If the code is too long, try to refactor it to be shorter.
You are a robot exploring in an object-centric environment. Your goal is to model the logic of
the world in python. You have tried it before and came up with one partially correct solution.
However, it is not perfect. They can model the logic for some experiences but failed for others
. You need to improve your code to model the logic of the world for all the experiences. The
new code needs to be directly runnable on the (state, action, next_state) tuple and return the
(reward, done) tuple in python as provided in the experiences.
Here is the partially correct solution you came up with for mission "use the key to open the
door and then get to the goal". It can model the logic for some experiences but failed for
others. You need to improve your code to model the logic of the world for all the experiences.
The new code need to be directly runnable on the (state, action, next_state) tuple and return
the (reward, done) tuple in python as provided in the experiences.
```
class Entity:
    def __init__(self, x, y, **kwargs):
        self.name = self.__class__.__name__
        self.x = x
        self.y = y
        for key, value in kwargs.items():
            setattr(self, key, value)

    def __repr__(self):
        attr = ', '.join(f'{key}={value}' for key, value in self.__dict__.items()
                         if key not in ('name', 'x', 'y'))
        if attr: return f"{self.name}({self.x}, {self.y}, {attr})"
        else: return f"{self.name}({self.x}, {self.y})"

    def __eq__(self, other):
        return all(getattr(self, key) == getattr(other, key, None) for key in self.__dict__.keys())

    def __hash__(self):
        return hash(tuple(sorted(self.__dict__.items())))

class Agent(Entity): pass
class Key(Entity): pass
class Door(Entity): pass
class Goal(Entity): pass
class Wall(Entity): pass
class Box(Entity): pass
class Ball(Entity): pass
class Lava(Entity): pass

def get_entities_by_position(entities, x, y):
    return [ entity for entity in entities if entity.x == x and entity.y == y ]

def reward_func(state, action, next_state):
    state_set = set(state)
    next_state_set = set(next_state)
    agent = [e for e in state_set if isinstance(e, Agent)][0]
    agent_next = [e for e in next_state_set if isinstance(e, Agent)][0]
    on_goal = any(isinstance(entity, Goal) for entity in
                  get_entities_by_position(next_state, agent_next.x, agent_next.y))
    done = on_goal
    if state_set == next_state_set:
        reward = -0.1  # Small negative reward for no-op actions to encourage faster solution
    elif done:
        reward = 1.0  # Reward for reaching the goal
    else:
        reward = 0.0  # No reward in other cases
    return reward, done
```
The given code cannot model the logic of the world for all the experiences. Here are some
experiences that the code has successfully modeled.
"""
The agent (direction=(0, -1)) at pos (1, 2) becomes an agent (direction=(-1, 0)).
"""
, the returned reward is ‘ 0.0 ‘ and the returned done is ‘ False ‘
For this failed experience, do you know what is different between the true rewards and dones
from the environment and the predictions from the code? Do you know why the environment behaves
in this way? Do you know why the code behaves differently from the environment? Which part of
the code causes the problem? How to fix it? Please improve your code to model the logic of the
world for all the experiences, accordingly. Please implement the code following the template.
You must implement the ‘ reward_func ‘ function as the main function to be called by the
environment. The code needs to be directly runnable on the (state, action, next_state) tuple
and return (reward, done) in python as provided in the experiences. If the code is too long,
try to refactor it to be shorter.
You are a robot exploring in an object-centric environment. Your goal is to model the logic of
the world in python, specifically the reward function that maps (state, action, next_state) to
(reward, done). You will to be given the mission for the environment you are going to act in,
as well as a few sample code from the other environments. You need to implement the new reward
function for the new environment you are going to act in. The new code needs to be directly
runnable on (state, action, next_state) and return (reward, done) in python.
Here is a few sample code for the reward function in other environments. Please check them in
detail and think about how to implement the reward function for mission "pick up the yellow box
" in the new environment. The code needs to be directly runnable on (state, action, next_state)
and return (reward, done) in python.
The reward function code for mission "pick up the grey box" is:
```
...
```
The reward function code for mission "pick up the purple box" is:
```
    ...
    done = True
    return reward, done
```
The reward function code for mission "pick up the green box" is:
```
...
```
Now, you have entered a new environment. It shows a mission "pick up the yellow box". Do you
know what this mission means and how to implement it in a reward function? Analyze the
behaviors of the reward function case by case. In what situations will it return a positive
reward or not? In what situations will it return done=True or not? Why? Please implement the
code following the template in the sample code. You must implement the ‘ reward_func‘ function
as the main function to be called by the environment. The code needs to be directly runnable on
(mission, state, action, next_state) and return (reward, done) in python.
You are a robot exploring in an object-centric environment. Your goal is to model the logic of
the world in python. You have tried it before and came up with one partially correct solution.
However, it is not perfect. The code can model the logic for some experiences but failed to
model the logic to achieve the goal in another environment. You need to improve your code so
that the agent can achieve the objective as specified by the mission from the given initial
state as well as still modelling the original logic. The new code should still follow the same
template. The ‘ transition ‘ function needs to be directly runnable on (state, action) and
return the next state in python. The ‘ reward_func ‘ function needs to be directly runnable on
(state, action, next_state) and return (reward, done) in python.
```
class Entity:
    def __init__(self, x, y, **kwargs):
        self.name = self.__class__.__name__
        self.x = x
        self.y = y
        for key, value in kwargs.items():
            setattr(self, key, value)

    def __repr__(self):
        attr = ', '.join(f'{key}={value}' for key, value in self.__dict__.items()
                         if key not in ('name', 'x', 'y'))
        if attr: return f"{self.name}({self.x}, {self.y}, {attr})"
        else: return f"{self.name}({self.x}, {self.y})"

    def __eq__(self, other):
        return all(getattr(self, key) == getattr(other, key, None) for key in self.__dict__.keys())

    def __hash__(self):
        return hash(tuple(sorted(self.__dict__.items())))

class Agent(Entity): pass
class Key(Entity): pass
class Door(Entity): pass
class Goal(Entity): pass
class Wall(Entity): pass
class Box(Entity): pass
class Ball(Entity): pass
class Lava(Entity): pass
...
import copy
...
    # Reward calculation
    if state_set == next_state_set:
        # If state didn't change -> `nothing`, `toggle` when not carrying a key, or `drop` when carrying nothing happened
        reward = 0.0
    elif action == 'turn left' or action == 'turn right':
        # If direction of agent changes -> `turn left`, `turn right` happened
        reward = 0.0
    else:
        # In other cases, no reward. Can be modified when other scenarios are applied.
        reward = 0.0
    return reward, done
```
However, the code failed to achieve the goal/objective as specified by the mission "use the key
to open the door and then get to the goal" from the following initial
state:
```
...
```
The valid actions are {’turn right’, ’nothing’, ’move forward’, ’turn left’, ’toggle’, ’drop’,
’pick up’}.
Do you know why the mission cannot be achieved from the given initial state with the world
model as implemented in the code? What subgoals does the agent need to achieve in order to
achieve the final goal as specified by the mission? Can the agent achieve those subgoals using
the world model as implemented in the code? If not, what is missing or wrong? How can you
improve the code to achieve the goal/objective as specified by the mission from the given
initial state? Please improve the code as analyzed before so that the mission can be achieved
from the given initial state. Please implement the code following the template. Feel free to
implement any helper functions you need. You can also implement the logic for difference
actions in different helper functions. However, you must implement the ‘ transition ‘ function
and the ‘ reward_func ‘ function as the main functions to be called by the environment. The ‘
transition ‘ function needs to be directly runnable on (state, action) and return the next
state in python. The ‘ reward_func ‘ function needs to be directly runnable on (state, action,
next_state) and return (reward, done) in python. The new code, by themselves, should be
complete, compilable, and runnable.