Level Generation Through Large Language Models
Abstract
Large Language Models (LLMs) are powerful tools, capable of leveraging their training on natural language to write stories, generate code, and answer questions. But can they generate functional video game levels? Game levels, with their complex functional constraints and spatial relationships in more than one dimension, are very different from the kinds of data an LLM typically sees during training. Datasets of game levels are also hard to come by, potentially taxing the abilities of these data-hungry models. We investigate the use of LLMs to generate levels for the game Sokoban, finding that LLMs are indeed capable of doing so, and that their performance scales dramatically with dataset size. We also perform preliminary experiments on controlling LLM level generators and discuss promising areas for future work.

Figure 1. A level for the puzzle game Sokoban generated by GPT-3.
ACM Reference Format:
Graham Todd, Sam Earle, Muhammad Umair Nasir, Michael Cerny Green, and Julian Togelius. 2023. Level Generation Through Large Language Models. In Proceedings of ACM Conference (Conference'23). ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 Introduction
In recent years, attention-based large language models (LLMs) have taken the world by storm, demonstrating surprisingly high performance on a variety of natural language tasks. With the right tuning, LLMs have been shown to generate coherent text in a number of styles, produce working snippets of computer code, and even respond naturalistically to human questions and conversation. While the architectures underlying these models have been leveraged for tasks outside the realm of standard text generation, from music [6] to reinforcement learning [3], comparatively less effort has been spent on analyzing the capacity of the LLMs themselves to produce non-linguistic artifacts while still leveraging their vast amounts of training data. In this paper, we investigate the ability of LLMs to generate video game levels and the extent to which truths about these models taken from natural language processing apply to this new domain. We also conduct preliminary experiments on the capacity to control the levels generated by LLMs using simple data augmentation and prompting.

Despite their impressive performance, there are reasons to doubt that LLMs would be well suited to the task of level generation. The first is representational. For context, the last few years have seen a steady increase in the use of machine learning to generate novel game content, including game levels. This procedural content generation through machine learning (PCGML) has made use of a variety of methods, including cellular automata, Markov models, convolutional neural networks, and generative adversarial networks. While dissimilar in function, these methods are nonetheless unified in that they tend to represent game levels spatially, as arrangements of tiles or features in two or three dimensions. This is an intuitive approach, as it allows models to more
Finally, our target domain of Sokoban is a popular choice for PCG work, ranging from early rule- and template-based approaches [14, 26], to search-based methods [8] and recurrent neural networks [22]. Zakaria et al. present a particularly in-depth comparison of various PCG methods for Sokoban, including LSTMs [29]. They bootstrap an initially-small dataset of levels and use it to train their level generators, demonstrating that LSTMs are capable of producing a variety of novel and playable levels. They also perform experiments on the controllability of their generators. Their results indicate that while the task is challenging, LSTMs, in addition to other existing PCGML methods, can adhere to specified characteristics at rates substantially above chance. Our analysis is similar, though we instead focus our attention on a more modern class of large language model. In addition to a change in architecture, we also specifically interrogate some of the standard assumptions about the training and behavior of LLMs, and whether or not they are helpful for the task of level generation.
3 Data
We train our models on levels from the game Sokoban, a block-pushing puzzle game released in 1982 by Thinking Rabbit. In Sokoban, the player is tasked with navigating along a rectangular grid in order to push boxes into specified target squares. A level can easily be represented as a grid of ASCII characters, where each character is mapped to either: a wall, an empty space, the player, a box, a goal, a box on top of a goal, or a player on top of a goal. Despite its simplicity of representation, Sokoban levels can be very challenging for both human and artificial agents alike, owing to the game's rapidly-branching state space and the fact that certain game states are "unrecoverable" and, once reached, cannot be escaped from.
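For concreteness, the sketch below uses the common Sokoban text convention for these tiles (the exact character set is an assumption for illustration) and checks the structural properties a level must satisfy before any solvability search: a rectangular grid, known characters, exactly one player, and an equal, non-zero number of boxes and goals.

```python
# A minimal sketch of the ASCII level representation described above. The
# character set follows the common Sokoban text convention; the exact
# characters used in our datasets are an assumption for illustration.
TILES = {
    "#": "wall",
    " ": "empty",
    "@": "player",
    "$": "box",
    ".": "goal",
    "*": "box on goal",
    "+": "player on goal",
}

EXAMPLE_LEVEL = "\n".join([
    "#######",
    "#     #",
    "# $@. #",
    "#######",
])

def is_structurally_valid(level: str) -> bool:
    """Check the non-search parts of validity: rectangular grid, known
    characters, one player, and an equal, non-zero count of boxes and goals."""
    rows = level.split("\n")
    if len(set(len(row) for row in rows)) != 1:
        return False                          # not rectangular
    chars = [c for row in rows for c in row]
    if any(c not in TILES for c in chars):
        return False                          # unknown character
    players = chars.count("@") + chars.count("+")
    boxes = chars.count("$") + chars.count("*")
    goals = chars.count(".") + chars.count("*") + chars.count("+")
    return players == 1 and boxes == goals and boxes > 0

print(is_structurally_valid(EXAMPLE_LEVEL))   # True for the example above
```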
We use two sets of Sokoban levels to train our models. The first is the Microban¹ dataset, consisting of roughly 500 levels created by David W. Skinner. We restrict our dataset to the 282 levels for which an ASTAR search agent was able to find a solution. Levels in this set range in size from 5 by 3 to 27 by 12, with solution lengths ranging from 1 to 279.
The second set of Sokoban levels is the Boxoban dataset, which consists of 438,000 procedurally generated levels. Levels were generated using a combination of heuristic and pattern-based rules [28]. Unlike with the Microban set, levels in the Boxoban set are all 10 by 10 and contain 4 boxes / goals. We use the ASTAR agent to recover solutions for all 438,000 levels, and solution lengths range from 6 to 206.

¹ Microban dataset available here: https://tinyurl.com/yckwxd7k
4 Models
Our core experimental model is the Generative Pre-Trained Transformer (GPT), a class of attention-based language model [2, 16]. Both GPT-2 and GPT-3 are trained by attempting to predict the next "token" (typically a word, or piece of a word, in the standard context of natural language) given the context of preceding tokens. By default, these models are initially trained on vast corpora of natural language text scraped from the internet, before being fine-tuned on (typically) smaller datasets to specialize their performance to particular tasks. Owing to its greater availability, we focus the majority of our experiments on GPT-2 and variants thereof.
5 Metrics
To measure the ability of LLMs to generate game levels, we use the following four metrics:

• Playability: we measure the proportion of generated game levels that are "valid". In the case of Sokoban, this means that they are rectangular, contain only valid characters, contain an equal and non-zero number of boxes and goals, contain exactly one player, and are solvable. We determine solvability using an ASTAR tree-search agent. If, after running for 150,000 steps, the ASTAR agent fails to find a valid solution, we deem the level unplayable. This provides a lower bound on the true rate of playability.
• Novelty: we measure the proportion of generated game levels that are distinct from each level in the training dataset. While determining when two levels are "distinct" can be challenging, we opt to use a simplified approach that treats two levels as distinct if their string edit distance is above some threshold. For our experiments, we use a threshold of k = 5.
• Diversity: we measure the proportion of generated game levels that are mutually distinct from each other. Using the same definition of distinctness as above, we use a graph-based approach to find the largest subset of generated levels that are all at least k = 5 edit distance from each other. Specifically, we convert the set of generated levels into an undirected graph where two nodes (levels) have an edge if their edit distance is above the threshold k. We then find the largest clique (subset of fully connected nodes) on this graph, and report the diversity as the size of this clique divided by the number of generated levels (a set of levels on this graph is only fully connected if each level is at least k edit distance away from every other level in the set). Because finding a maximum clique can be computationally intractable, we terminate the clique-finding algorithm after a specified number of iterations (1 million) and report the size of the largest clique found. This provides a lower bound on the model's diversity.
• Accuracy: in the case of controllability experiments, we measure the proportion of generated game levels that adhere to the given prompt. Rather than enforce an exact match between prompt and output, we allow the model to generate levels that are within a certain experiment-specific tolerance of the specified characteristic (for instance, with a tolerance of 5 we would consider accurate a level with a solution length of 21 when the prompt called for a level with solution length 25).
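To make the novelty and diversity computations concrete, the sketch below implements the edit-distance test with the threshold k = 5 and approximates the clique-based diversity bound greedily; this is an illustrative stand-in for our implementation, whose iteration-bounded maximum-clique search may return a different (still lower-bound) value.

```python
# Sketch of the novelty and diversity metrics described above, assuming
# levels are plain strings. A greedy clique construction stands in for the
# iteration-bounded maximum-clique search used in our experiments, so it
# likewise yields a lower bound on diversity.
K = 5  # edit-distance threshold for treating two levels as distinct

def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def distinct(a: str, b: str) -> bool:
    return edit_distance(a, b) > K

def novelty(generated: list[str], training: list[str]) -> float:
    """Proportion of generated levels distinct from every training level."""
    novel = [g for g in generated if all(distinct(g, t) for t in training)]
    return len(novel) / len(generated)

def diversity(generated: list[str]) -> float:
    """Greedy lower bound on the largest mutually-distinct subset."""
    clique: list[str] = []
    for g in generated:
        if all(distinct(g, c) for c in clique):
            clique.append(g)
    return len(clique) / len(generated)
```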
6 Experiments

6.1 Effects of Pretraining
Our first experiment aims to answer two questions:
1. Are LLMs capable of generating novel, playable game levels?
2. Does the extensive pretraining given to LLMs affect their ability to generate game levels?

LLMs are typically trained on vast quantities of text collected from a variety of natural language contexts and then later "fine-tuned" on a smaller, more task-specific dataset. While this pretraining appears helpful for the models' capacity for natural language understanding (and thus performance on many downstream language-based tasks), it's not clear whether pretraining would offer models a benefit in the more specialized task of generating valid game levels. To examine this question, we consider three variants of the GPT-2 model: standard, java-adapted, and untrained. The standard GPT-2 model was pretrained on the WebText dataset, consisting of the content of 45 million links [16]; the java-adapted model was pretrained on the CodeSearchNet dataset of Java code, consisting of 1.6 million Java methods [12]; and the untrained model, unsurprisingly, received no pretraining (its weights are randomly initialized). As a note, both the standard and java-adapted models use specialized tokenizers which are trained to efficiently break input strings into sub-word tokens. Rather than use an existing tokenizer, we allowed the untrained model to train a custom tokenizer on the game level dataset, using the same byte-pair encoding scheme as GPT-2.
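The sketch below shows one way the three variants might be instantiated with the Hugging Face transformers and tokenizers libraries; the specific checkpoint names, vocabulary size, special tokens, and file paths are illustrative assumptions rather than a record of our setup.

```python
# Sketch (assuming the Hugging Face transformers and tokenizers libraries) of
# the three GPT-2 variants: pretrained on WebText, pretrained on Java code,
# and randomly initialized with a custom BPE tokenizer trained on levels.
from tokenizers import ByteLevelBPETokenizer
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

# Standard GPT-2 (WebText pretraining) with its stock tokenizer.
standard_model = GPT2LMHeadModel.from_pretrained("gpt2")
standard_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Java-adapted GPT-2: same idea, loaded from a code-pretrained checkpoint
# ("microsoft/CodeGPT-small-java" is an illustrative checkpoint name).
java_model = GPT2LMHeadModel.from_pretrained("microsoft/CodeGPT-small-java")

# Untrained GPT-2: random weights plus a byte-pair-encoding tokenizer trained
# directly on the level dataset ("levels.txt" is a hypothetical file holding
# one serialized level per line).
level_tokenizer = ByteLevelBPETokenizer()
level_tokenizer.train(files=["levels.txt"], vocab_size=5000,
                      special_tokens=["START", "END"])
untrained_model = GPT2LMHeadModel(GPT2Config(vocab_size=5000))
```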
Each model is trained for 100k steps with 5 random seeds on the Boxoban dataset with the following training hyperparameters: a learning rate of 0.0001, weight decay of 0.0001, a batch size of 32, and the AdamW optimizer. In order to evaluate a model, we provide it with some initial context (for this experiment, only the START token) and then use beam search with random sampling in order to generate one or more continuations. We then compute the proportion of generated levels that are novel, the proportion that are playable, and the proportion that are novel, playable, and diverse. For simplicity, we call the proportion of novel, playable, and diverse levels the model's score (e.g. if the model produces 54 levels that are playable and novel out of 100 samples, of which 47 are mutually distinct, we report a score of 0.47).
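A minimal fine-tuning configuration matching the hyperparameters above might look like the following sketch (using the Hugging Face Trainer); the dataset plumbing and any unstated settings are assumptions.

```python
# Sketch of the training configuration described above: 100k steps, learning
# rate 1e-4, weight decay 1e-4, batch size 32, and the AdamW optimizer.
# `level_dataset` (a tokenized dataset of level strings) is assumed to exist.
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

training_args = TrainingArguments(
    output_dir="sokoban-gpt2",
    max_steps=100_000,
    learning_rate=1e-4,
    weight_decay=1e-4,
    per_device_train_batch_size=32,
    optim="adamw_torch",                    # AdamW optimizer
    seed=0,                                 # one of the 5 random seeds
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=level_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```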
Because the outputs of an LLM are largely dependent on the hyperparameters used during generation, for each model and seed we perform an additional sweep over the generation temperature, the top-p value, and the number of beams. For each model, we select the evaluation hyperparameters which achieve the highest score when averaged over the 5 random seeds. We report these average scores, along with the average novelty, playability, and diversity rates, for each model in Table 1.

Model              Novelty  Playability  Diversity  Score
GPT-2              0.97     0.54         1.00       0.53
GPT-2 (Untrained)  0.96     0.60         1.00       0.56
Java GPT-2         0.97     0.54         1.00       0.53

Table 1. Novelty, playability, diversity, and overall "score" (defined as the diversity of the subset of generated levels that are both novel and playable) for each type of pretraining, using the best evaluation hyperparameters when averaged over 5 seeds.
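The evaluation-time sampling could be realized roughly as sketched below; the particular temperature, top-p, and beam values are placeholders for the swept ranges, and the single START prompt token follows the setup described above.

```python
# Sketch of evaluation-time generation: prompt the fine-tuned model with the
# START token, then sample continuations using beam search combined with
# random sampling. `model` and `tokenizer` are the fine-tuned model and
# tokenizer from the training sketch above; the specific temperature, top-p,
# and beam values are placeholders for the swept ranges.
inputs = tokenizer("START", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,               # random sampling ...
    num_beams=5,                  # ... on top of beam search
    temperature=1.0,
    top_p=0.9,
    max_new_tokens=256,
    num_return_sequences=5,
    pad_token_id=tokenizer.eos_token_id,
)
levels = [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs]
```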
6.2 Effects of Dataset Size
Another well-known property of LLMs is that their performance on a variety of NLP tasks tends to scale with the amount of training data [7]. Again, however, it is worth examining whether this trend holds for the specialized task of generating game levels. In many contexts, it might be difficult or impossible to collect a large set of high-quality game levels. Are LLMs still capable of learning to generate valid levels when their training data is severely limited? Conversely, in situations where large numbers of game levels are available (typically games for which heuristic or rule-based PCG approaches exist), do LLMs benefit from ever-increasing dataset sizes? Finally, can simple data augmentation approaches improve LLM performance?
First, we consider four "slices" of the Boxoban dataset consisting of 0.1%, 1%, 10%, and 100% (i.e. the complete dataset used above) of the data, randomly sampled. We take the standard GPT-2 model and re-train it on each of the slices for 100k steps, using the same training hyperparameters as above. We then evaluate each model's novelty, playability, and "score" using the same procedure as in Section 6.1. As before, we find the evaluation hyperparameters that achieve the highest average score for each model, and report the results in Table 2.
Next, we train a GPT-2 model on the Microban dataset, as well as on two augmented versions of the dataset: Microban-flip (levels flipped about the X and Y axes) and Microban-flip+rotate (levels rotated 90 degrees clockwise and counterclockwise). Each model is trained for 100k steps, and the highest achieved average score for each is reported in Table 3.
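The flip and rotation augmentations are simple grid transformations; a sketch follows, assuming each level is stored as a list of equal-length row strings (exactly how the augmented sets are composed and deduplicated is an assumption).

```python
# Sketch of the data augmentations described above, treating a level as a
# list of equal-length strings (one per row). Flips mirror the level about
# the X and Y axes; rotations turn it 90 degrees in either direction.
def flip_horizontal(level: list[str]) -> list[str]:
    return [row[::-1] for row in level]

def flip_vertical(level: list[str]) -> list[str]:
    return level[::-1]

def rotate_clockwise(level: list[str]) -> list[str]:
    return ["".join(col) for col in zip(*level[::-1])]

def rotate_counterclockwise(level: list[str]) -> list[str]:
    return ["".join(col) for col in zip(*level)][::-1]

def augment(levels: list[list[str]], rotations: bool = False) -> list[list[str]]:
    """Return the original levels plus both flips, and optionally both
    90-degree rotations (roughly the flip and flip+rotate conditions)."""
    out = []
    for lvl in levels:
        out.extend([lvl, flip_horizontal(lvl), flip_vertical(lvl)])
        if rotations:
            out.extend([rotate_clockwise(lvl), rotate_counterclockwise(lvl)])
    return out
```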
explained by the substantial dissimilarity between modeling natural language and Sokoban levels, meaning that models are required to effectively learn from scratch in this domain and are able to do so.

7.2 Effects of Dataset Size
The results in Table 2 and Table 3 indicate that dataset size is indeed an important factor for an LLM's ability to generate game levels. For small datasets (i.e. the 0.1% and 1% conditions of Boxoban, as well as the Microban conditions), the GPT-2 model seems unable to produce levels that are both novel and playable (though it can produce novel levels or playable levels in isolation), leading to low overall scores. Nevertheless, in all but the 0.1% Boxoban condition, sample diversity remains relatively high. While the effect is not especially pronounced, there does appear to be a correlation between the size of the dataset and the model's score. This supports the notion that, like with many natural language tasks, LLM performance on level generation scales effectively with the availability of training data. However, it is important to note that [29] demonstrate that simple LSTM generators are capable of generating novel game levels even when trained on bootstrapped datasets beginning with only 12 samples (and never reaching the order of magnitude of Boxoban). Thus, it seems very unlikely that LLMs are fully incapable of performing well at level generation when restricted to small datasets. What might account for this difference in performance, then? One possibility is expressivity: modern transformers are much better able to represent sequential data than LSTMs, and so are more likely to completely model the dynamics of their training datasets, to the detriment of their generative capabilities. However, as we will see, this explanation does not account for the performance of GPT-3 (see Section 7.4).
7.3 Controllability
In Table 4, we see the effects of sampling levels conditioned on simple prompts. In the first row, we see that the GPT-2 model is able to produce levels that are novel, playable, and within a single tile of the specified proportion of empty space (corresponding to perfect accuracy and a relatively high control score). However, when it comes to solution length, GPT-2 achieves an accuracy of only 17%. Given the tolerance of 5 and the fact that most solution lengths in the dataset fall within a relatively narrow band, this cannot be interpreted as anything more than the effect of random chance. A similar fact holds for the combined condition, where overall accuracy is determined by both the correct amount of empty space and solution length and does not rise above 3%. It is worth noting, however, that even in the conditions where GPT-2 failed to produce accurate levels, it nonetheless continued to generally produce novel and playable ones. In other words, the introduction of the prompt did not negatively affect the model's performance.
7.4 Preliminary Investigation on GPT-3
Table 5 contains the results of GPT-3 level generation when trained on the Microban dataset and its augmentations. While these results should be taken with a healthy amount of caution because they are generated from only a single training run and with a limited evaluation hyperparameter sweep, they nonetheless offer some reason for optimism. In contrast to GPT-2, GPT-3 is able to produce novel and playable levels when trained on both of the augmented forms of the Microban dataset, with its overall score on the final condition approaching that of GPT-2 trained on the entire Boxoban dataset. As with previous experiments, however, we observe that increasing dataset size (in this case adding rotations in addition to flips) does lead to increased overall performance with GPT-3. In future work, we intend to perform a more robust analysis of GPT-3's abilities, including its capacity for controllable level generation.

8 Future Work
In this paper, we examine the performance of LLMs on generating levels for a single game. However, one of the primary strengths of LLMs is their ability to rapidly adapt to a variety of contexts given the appropriate prompt. Consider a dataset of levels from many different games, where each level has been annotated with the natural language mapping from tiles to game objects (e.g. "@ represents the player, M represents a monster"), along with a description of the level objective; a sketch of such an annotated entry appears below. An LLM might be better equipped than other PCG systems to generate novel and playable levels from this variety of games, owing to its familiarity with natural language and capacity for rapid adaptation.
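As a purely hypothetical illustration, a single annotated entry in such a dataset might look like the following (the field names, legend text, and level are invented for this example):

```python
# Hypothetical example of an annotated, multi-game training instance of the
# kind described above. The format, field names, and level are invented for
# illustration only.
annotated_example = """GAME: Sokoban
LEGEND: # is a wall, @ is the player, $ is a box, . is a goal square
OBJECTIVE: push every box onto a goal square
LEVEL:
#######
#     #
# $@. #
#######"""
```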
However, our work also indicates that making effective use of LLMs for game level generation may require more consideration of dataset size: few games have available the massive number of levels present in the Boxoban set (though it is worth noting that, preliminarily, GPT-3 seemed better able to handle data scarcity than GPT-2). As mentioned in Section 7.2, prior work has demonstrated that bootstrapping larger training sets from initially small collections of levels is a viable technique. Another possibility is augmenting existing datasets beyond simple flips and rotations. In the case of Sokoban, for instance, this might involve collecting atomic puzzle components and re-arranging them.
Finally, there is room for much greater sophistication in the techniques used to control LLM outputs. Research in the area of controllable language model decoding [10] offers the opportunity to leverage existing work in PCG through reinforcement learning. More modern LLMs, especially, have also been shown to benefit from careful prompt engineering [30]. A combination of these approaches might allow for LLM generators that are better equipped to obey the functional constraints of game levels.
[10] Jiwei Li, Will Monroe, and Dan Jurafsky. 2017. Learning to decode for future success. arXiv preprint arXiv:1701.06549 (2017).
[11] Jialin Liu, Sam Snodgrass, Ahmed Khalifa, Sebastian Risi, Georgios N Yannakakis, and Julian Togelius. 2021. Deep learning for procedural content generation. Neural Computing and Applications 33, 1 (2021), 19–37.
[12] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021).
[13] Justin Mott, Saujas Nandi, and Luke Zeller. 2019. Controllable and coherent level generation: A two-pronged approach. In Experimental AI in games workshop.
[14] Yoshio Murase, Hitoshi Matsubara, and Yuzuru Hiraga. 1996. Automatic making of sokoban problems. In PRICAI'96: Topics in Artificial Intelligence: 4th Pacific Rim International Conference on Artificial Intelligence, Cairns, Australia, August 26–30, 1996, Proceedings 4. Springer, 592–600.
[15] Kyungjin Park, Bradford W Mott, Wookhee Min, Kristy Elizabeth Boyer, Eric N Wiebe, and James C Lester. 2019. Generating educational game levels with multistep deep convolutional generative adversarial networks. In 2019 IEEE Conference on Games (CoG). IEEE, 1–8.
[16] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[17] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022).
[18] Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. 2022. A generalist agent. arXiv preprint arXiv:2205.06175 (2022).
[19] Anurag Sarkar and Seth Cooper. 2018. Blending Levels from Different Games using LSTMs. In AIIDE Workshops.
[20] Jacob Schrum, Jake Gutierrez, Vanessa Volz, Jialin Liu, Simon Lucas, and Sebastian Risi. 2020. Interactive evolution and exploration within latent level-design space of generative adversarial networks. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference. 148–156.
[21] Sam Snodgrass and Anurag Sarkar. 2020. Multi-domain level generation and blending with sketches via example-driven bsp and variational autoencoders. In Proceedings of the 15th international conference on the foundations of digital games. 1–11.
[22] Muhammad Suleman, Farrukh Hasan Syed, Tahir Q Syed, Saqib Arfeen, Sadaf I Behlim, and Behroz Mirza. 2017. Generation of sokoban stages using recurrent neural networks. International Journal of Advanced Computer Science and Applications 8, 3 (2017).
[23] Adam Summerville, Matthew Guzdial, Michael Mateas, and Mark O Riedl. 2016. Learning player tailored content from observation: Platformer level generation from video traces using lstms. In Twelfth artificial intelligence and interactive digital entertainment conference.
[24] Adam Summerville and Michael Mateas. 2016. Super Mario as a String: Platformer Level Generation Via LSTMs. https://doi.org/10.48550/ARXIV.1603.00930
[25] Adam Summerville, Sam Snodgrass, Matthew Guzdial, Christoffer Holmgård, Amy K Hoover, Aaron Isaksen, Andy Nealen, and Julian Togelius. 2018. Procedural content generation via machine learning (PCGML). IEEE Transactions on Games 10, 3 (2018), 257–270.
[26] Joshua Taylor and Ian Parberry. 2011. Procedural generation of sokoban levels. In Proceedings of the International North American Conference on Intelligent Games and Simulation. 5–12.
[27] Ruben Rodriguez Torrado, Ahmed Khalifa, Michael Cerny Green, Niels Justesen, Sebastian Risi, and Julian Togelius. 2020. Bootstrapping conditional gans for video game level generation. In 2020 IEEE Conference on Games (CoG). IEEE, 41–48.
[28] Théophane Weber, Sébastien Racanière, David P. Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, Razvan Pascanu, Peter Battaglia, Demis Hassabis, David Silver, and Daan Wierstra. 2017. Imagination-Augmented Agents for Deep Reinforcement Learning. https://doi.org/10.48550/ARXIV.1707.06203
[29] Yahia Zakaria, Magda Fayek, and Mayada Hadhoud. 2022. Procedural Level Generation for Sokoban via Deep Learning: An Experimental Study. IEEE Transactions on Games (2022), 1–1. https://doi.org/10.1109/TG.2022.3175795
[30] Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning. PMLR, 12697–12706.