Level Generation Through Large Language Models

Graham Todd Sam Earle Muhammad Umair Nasir


[email protected] [email protected] [email protected]
New York University Tandon New York University Tandon University of the Witwatersrand
Brooklyn, New York, USA Brooklyn, New York, USA Johannesburg, South Africa

Michael Cerny Green Julian Togelius


[email protected] [email protected]
New York University Tandon New York University Tandon
Brooklyn, New York, USA Brooklyn, New York, USA
arXiv:2302.05817v1 [cs.AI] 11 Feb 2023

Abstract
Large Language Models (LLMs) are powerful tools, capable of leveraging their training on natural language to write stories, generate code, and answer questions. But can they generate functional video game levels? Game levels, with their complex functional constraints and spatial relationships in more than one dimension, are very different from the kinds of data an LLM typically sees during training. Datasets of game levels are also hard to come by, potentially taxing the abilities of these data-hungry models. We investigate the use of LLMs to generate levels for the game Sokoban, finding that LLMs are indeed capable of doing so, and that their performance scales dramatically with dataset size. We also perform preliminary experiments on controlling LLM level generators and discuss promising areas for future work.

Figure 1. A level for the puzzle game Sokoban generated by GPT-3

ACM Reference Format:
Graham Todd, Sam Earle, Muhammad Umair Nasir, Michael Cerny Green, and Julian Togelius. 2023. Level Generation Through Large Language Models. In Proceedings of ACM Conference (Conference’23). ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Conference’23, April 2023, Lisbon, Portugal
© 2023 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 Introduction
In recent years, attention-based large language models (LLMs) have taken the world by storm, demonstrating surprisingly high performance on a variety of natural language tasks. With the right tuning, LLMs have been shown to generate coherent text in a number of styles, produce working snippets of computer code, and even respond naturalistically to human questions and conversation. While the architectures underlying these models have been leveraged for tasks outside the realm of standard text generation, from music [6] to reinforcement learning [3], comparatively less effort has been spent on analyzing the capacity of the LLMs themselves to produce non-linguistic artifacts while still leveraging their vast amounts of training data. In this paper, we investigate the ability of LLMs to generate video game levels and the extent to which truths about these models taken from natural language processing apply to this new domain. We also conduct preliminary experiments on the capacity to control the levels generated by LLMs using simple data augmentation and prompting.

Despite their impressive performance, there are reasons to doubt that LLMs would be well suited to the task of level generation. The first is representational. For context, the last few years have seen a steady increase in the use of machine learning to generate novel game content, including game levels. This procedural content generation through machine learning (PCGML) has made use of a variety of methods, including cellular automata, Markov models, convolutional neural networks, and generative adversarial networks. While dissimilar in function, these methods are nonetheless unified in that they tend to represent game levels spatially, as arrangements of tiles or features in two or three dimensions. This is an intuitive approach, as it allows models to more
readily learn the local spatial dynamics present in game environments. By contrast, LLMs process inputs and generate outputs in a linear fashion. Game levels must be presented as a sequence of tokens in order to be fed into the model, and generated outputs must be reinterpreted as spatial data in order to be used. Further, variable-length tokenization schemes used by modern LLMs mean that two levels of the same size might be represented with different numbers of underlying tokens. Maintaining regularity and spatial relationships (e.g. attempting to place an enemy directly beneath a player in two dimensions) therefore requires more than simply counting the number of tokens. Nevertheless, prior work on n-grams and recurrent neural networks has demonstrated that game levels can be represented sequentially without the loss of critical spatial dependencies, albeit with some difficulty. We investigate the extent to which this holds for modern, attention-based models and their typical tokenization schemes.

(a) Playable level generated by GPT-2
(b) Nearest level (edit distance) in the Boxoban set
Figure 2. A novel and playable generated level, and its nearest neighbor in the training set.

The second potential issue with using LLMs to generate game levels is that of data. LLMs are notoriously data hungry, while datasets of game levels are notoriously small, difficult to obtain, and lacking in standardization. The first obvious question is whether the ability of an LLM to generate levels depends on its receiving vast amounts of high-quality training data. More subtly, it is also important to determine whether the vast amount of data used in pretraining actually assists the LLM in producing game levels. It is not clear whether the patterns and structures learned from exposure to natural language (or, in some cases, code) transfer to the functional and spatial constraints of game levels. The quality and even playability of a game level is often dependent on factors such as topology or the relative amounts of different tile types, a far cry from the syntax of English or Python!

Even so, LLMs also seem to have certain advantages when it comes to level generation, namely controllability and generalizability. Controllability here refers to the possibility of using natural language prompts to generate levels with particular characteristics. Recent work has demonstrated that natural language-guided generation is possible not only for text, but also for images [17] and music [1]. These systems leverage LLMs and are capable of accommodating a wide range of potential prompts, an impressive feat that provides some reason for optimism that current approaches for controllable level generation could be similarly improved. At the same time, LLMs have shown considerable promise in generalizing to unseen domains [2] or across a wide variety of tasks [18]. With respect to level generation, this might allow for a single LLM-based model to produce levels for multiple games or even, with sufficiently detailed prompting, a previously unencountered game.

In this paper, we aim to answer some of the initial questions surrounding the ability of LLMs to generate game levels using the iconic puzzle game Sokoban. We perform experiments on the effects of pretraining and dataset size, as well as a preliminary investigation on the controllability of LLM level generators. We conclude with a discussion of the results and the many fruitful avenues for future work.

2 Related Work
Procedural content generation (PCG) refers to the use of automated or algorithmic methods to create artifacts, typically for use in art or games. Techniques for PCG range from simple noise functions to complex neural models. Our work falls into the broad category of procedural content generation via machine learning (PCGML) [25], in which content generating functions are learned from extant datasets. Liu et al. present an overview of the PCGML field, with a specific focus on deep learning [11].

For the specific task of generating game levels, common model choices include variational autoencoders [21], generative adversarial networks [15, 27], evolution [20], and reinforcement learning [9]. In addition, however, there is a history of using autoregressive models typically found in natural language processing for game level generation. Dahlskog et al. present early work on this approach, using a simple n-gram model to generate novel Super Mario Bros. levels from an existing dataset by treating a level as a left-to-right sequence of “tokens” each representing a vertical slice [4]. This work was quickly expanded to use long short-term memory (LSTM) networks [24], a model choice which has also found success in generating levels based on human play traces [23] and combining levels from multiple games [19].

Our work also borrows from the literature on controllable PCG, in which specific parameters are provided to the generator in order to guide its outputs. Approaches for controllable PCG often involve manipulating a latent embedding vector, with prior work having made use of generative networks [13] and reinforcement learning [5].

Finally, our target domain of Sokoban is a popular choice for PCG work, ranging from early rule- and template-based approaches [14, 26], to search-based methods [8] and recurrent neural networks [22]. Zakaria et al. present a particularly in-depth comparison of various PCG methods for Sokoban, including LSTMs [29]. They bootstrap an initially small dataset of levels and use it to train their level generators, demonstrating that LSTMs are capable of producing a variety of novel and playable levels. They also perform experiments on the controllability of their generators. Their results indicate that while the task is challenging, LSTMs, in addition to other existing PCGML methods, can adhere to specified characteristics at levels substantially above chance. Our analysis is similar, though we instead focus our attention on a more modern class of large language model. In addition to a change in architecture, we also specifically interrogate some of the standard assumptions about the training and behavior of LLMs, and whether or not they are helpful for the task of level generation.

3 Data
We train our models on levels from the game Sokoban, a block-pushing puzzle game released in 1982 by Thinking Rabbit. In Sokoban, the player is tasked with navigating along a rectangular grid in order to push boxes into specified target squares. A level can easily be represented as a grid of ASCII characters, where each character is mapped to either: a wall, an empty space, the player, a box, a goal, a box on top of a goal, or a player on top of a goal. Despite their simplicity of representation, Sokoban levels can be very challenging for both human and artificial agents, owing to the game's rapidly branching state space and the fact that certain game states are “unrecoverable” and, once reached, cannot be escaped from.

We use two sets of Sokoban levels to train our models. The first is the Microban dataset (available at https://tinyurl.com/yckwxd7k), consisting of roughly 500 levels created by David W. Skinner. We restrict our dataset to 282 levels for which an A* search agent was able to find a solution. Levels in this set range in size from 5 by 3 to 27 by 12, with solution lengths ranging from 1 to 279.

The second set of Sokoban levels is the Boxoban dataset, which consists of 438,000 procedurally generated levels. Levels were generated using a combination of heuristic and pattern-based rules [28]. Unlike with the Microban set, levels in the Boxoban set are all 10 by 10 and contain 4 boxes / goals. We use the A* agent to recover solutions for all 438,000 levels, and solution lengths range from 6 to 206.

4 Models
Our core experimental model is the Generative Pre-Trained Transformer (GPT), a class of attention-based language model [2, 16]. Both GPT-2 and GPT-3 are trained by attempting to predict the next “token” (typically a word, or piece of a word, in the standard context of natural language) given the context of preceding tokens. By default, these models are initially trained on vast corpora of natural language text scraped from the internet, before being fine-tuned on (typically) smaller datasets to specialize their performance to particular tasks. Owing to its greater availability, we focus the majority of our experiments on GPT-2 and variants thereof.

5 Metrics
To measure the ability of LLMs to generate game levels, we use the following four metrics:

• Playability: we measure the proportion of generated game levels that are “valid”. In the case of Sokoban, this means that they are rectangular, contain only valid characters, contain an equal and non-zero number of boxes and goals, contain exactly one player, and are solvable. We determine solvability using an A* tree-search agent. If, after running for 150,000 steps, the A* agent fails to find a valid solution, we deem the level unplayable. This provides a lower bound on the true rate of playability.

• Novelty: we measure the proportion of generated game levels that are distinct from each level in the training dataset. While determining when two levels are “distinct” can be challenging, we opt to use a simplified approach that treats two levels as distinct if their string edit distance is above some threshold. For our experiments, we use a threshold of k = 5.

• Diversity: we measure the proportion of generated game levels that are mutually distinct. Using the same definition of distinctness as above, we use a graph-based approach to find the largest subset of generated levels that are all at least k = 5 edit distance from each other. Specifically, we convert the set of generated levels into an undirected graph where two nodes (levels) have an edge if their edit distance is above the threshold k. We then find the largest clique (subset of fully connected nodes) on this graph, and report the diversity as the size of this clique divided by the number of generated levels (a set of levels on this graph is only fully connected if each level is at least k edit distance away from every other level in the set). Because finding a maximum clique can be computationally intractable, we terminate the clique-finding algorithm after a specified number of iterations (1 million) and report the size of the largest clique found. This provides a lower bound on the model’s diversity.

• Accuracy: in the case of controllability experiments, we measure the proportion of generated game levels that adhere to the given prompt. Rather than enforce an exact match between prompt and output, we allow
the model to generate levels that are within a certain experiment-specific tolerance of the specified characteristic (for instance, with a tolerance of 5 we would consider accurate a level with a solution length of 21, when the prompt called for a level with solution length 25).

6 Experiments

6.1 Effects of Pretraining
Our first experiment aims to answer two questions:
1. Are LLMs capable of generating novel, playable game levels?
2. Does the extensive pretraining given to LLMs affect their ability to generate game levels?

LLMs are typically trained on vast quantities of text collected from a variety of natural language contexts and then later “fine-tuned” on a smaller, more task-specific dataset. While this pretraining appears helpful for the models’ capacity for natural language understanding (and thus performance on many downstream language-based tasks), it’s not clear whether pretraining would offer models a benefit in the more specialized task of generating valid game levels. To examine this question, we consider 3 variants of the GPT-2 model: standard, java-adapted, and untrained. The standard GPT-2 model was pretrained on the WebText dataset, consisting of the content of 45 million links [16]; the java-adapted model was pretrained on the CodeSearchNet dataset of Java code, consisting of 1.6 million Java methods [12]; and the untrained model, unsurprisingly, received no pretraining (weights are randomly initialized). As a note, both standard and java-adapted models use specialized tokenizers which are trained to efficiently break input strings into sub-word tokens. Rather than use an existing tokenizer, we allowed the untrained model to train a custom tokenizer on the game level dataset, using the same byte-pair encoding scheme as GPT-2.

Each model is trained for 100k steps with 5 random seeds on the Boxoban dataset with the following training hyperparameters: learning rate of 0.0001, weight decay of 0.0001, a batch size of 32, and the AdamW optimizer. In order to evaluate a model, we provide it with some initial context (for this experiment, only the START token) and then use beam search with random sampling in order to generate one or more continuations. We then compute the proportion of generated levels that are novel, the proportion that are playable, and the proportion that are novel, playable, and diverse. For simplicity, we call the proportion of novel, playable, and diverse levels the model’s score (e.g. if the model produces 54 levels that are playable and novel out of 100 samples, of which 47 are mutually distinct, we report a score of 0.47).

Because the outputs of an LLM are largely dependent on the hyperparameters used during generation, for each model and seed we perform an additional sweep over the generation temperature, the top-p value, and the number of beams. For each model, we select the evaluation hyperparameters which achieve the highest score when averaged over the 5 random seeds. We report these average scores, along with the average novelty, playability, and diversity rates, for each model in Table 1.

Model               Novelty   Playability   Diversity   Score
GPT-2               0.97      0.54          1.00        0.53
GPT-2 (Untrained)   0.96      0.60          1.00        0.56
Java GPT-2          0.97      0.54          1.00        0.53

Table 1. Novelty, playability, diversity, and overall “score” (defined as the diversity of the subset of generated levels that are both novel and playable) for each type of pretraining, using the best evaluation hyperparameters when averaged over 5 seeds.

6.2 Effects of Dataset Size
Another well-known property of LLMs is that their performance on a variety of NLP tasks tends to scale with the amount of training data [7]. Again, however, it is worth examining whether this trend holds for the specialized task of generating game levels. In many contexts, it might be difficult or impossible to collect a large set of high-quality game levels. Are LLMs still capable of learning to generate valid levels when their training data is severely limited? Conversely, in situations where large amounts of game levels are available (typically games for which heuristic or rule-based PCG approaches exist), do LLMs benefit from ever-increasing dataset sizes? Finally, can simple data augmentation approaches improve LLM performance?

First, we consider four “slices” of the Boxoban dataset consisting of 0.1%, 1%, 10%, and 100% (i.e. the complete dataset used above) of the data, randomly sampled. We take the standard GPT-2 model and re-train it on each of the slices for 100k steps, using the same training hyperparameters as above. We then evaluate each model’s novelty, playability, and “score” using the same procedure as in Section 6.1. As before, we find the evaluation hyperparameters that achieve the highest average score for each model, and report the results in Table 2.

Next, we train a GPT-2 model on the Microban dataset, as well as two augmented versions of the dataset: Microban_flip (levels flipped about the X and Y axes) and Microban_flip+rotate (levels rotated 90 degrees clockwise and counterclockwise). Each model is trained for 100k steps, and the highest achieved average score for each is reported in Table 3.

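The novelty, diversity, and score definitions from Sections 5 and 6.1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: levels are compared as raw strings, two levels count as distinct when their edit distance is at least k (the text uses both "above" and "at least"), the clique search is a simple branch-and-bound with a step budget, and `is_playable` is a caller-supplied predicate standing in for the structural checks and the solver.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def is_novel(level: str, training: Sequence[str], k: int = 5) -> bool:
    """A level is novel if it is at least k edits from every training level."""
    return all(edit_distance(level, t) >= k for t in training)


def largest_distinct_subset(levels: Sequence[str], k: int = 5,
                            max_steps: int = 1_000_000) -> int:
    """Size of the largest clique found in the 'distinctness' graph.

    The search stops after max_steps expansions, so the result is a lower
    bound on the true maximum clique size.
    """
    n = len(levels)
    adj = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if edit_distance(levels[i], levels[j]) >= k:
            adj[i].add(j)
            adj[j].add(i)

    best = 0
    steps = 0

    def expand(size: int, candidates: set) -> None:
        nonlocal best, steps
        steps += 1
        best = max(best, size)
        # Prune when the budget is spent or this branch cannot beat `best`.
        if steps >= max_steps or size + len(candidates) <= best:
            return
        for v in sorted(candidates, key=lambda u: -len(adj[u])):
            if steps >= max_steps:
                return
            expand(size + 1, candidates & adj[v])
            candidates = candidates - {v}

    expand(0, set(range(n)))
    return best


def score(generated: Sequence[str], training: Sequence[str],
          is_playable: Callable[[str], bool], k: int = 5) -> float:
    """Fraction of generated levels that are novel, playable, and mutually distinct."""
    kept: List[str] = [g for g in generated
                       if is_playable(g) and is_novel(g, training, k)]
    return largest_distinct_subset(kept, k) / len(generated)
```

With 100 samples, 54 of them playable and novel, and a largest mutually distinct subset of 47, `score` would return 0.47, matching the worked example in Section 6.1.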
% of Boxoban   Novelty   Playability   Diversity   Score
0.1%           0.00      0.80          0.01        0.01
1%             0.10      0.66          0.97        0.03
10%            0.90      0.55          1.00        0.47
100%           0.97      0.54          1.00        0.53

Table 2. Novelty, playability, diversity, and overall score for GPT-2 trained on increasing amounts of the Boxoban dataset, using the best hyperparameters averaged over 5 seeds. Increasing dataset size leads to increased performance.

Dataset                Novelty   Playability   Diversity   Score
Microban               0.59      0.30          0.83        0.02
Microban_flip          0.56      0.32          0.89        0.02
Microban_flip+rotate   0.24      0.54          0.82        0.04

Table 3. Novelty, playability, diversity, and overall score for GPT-2 trained on the Microban dataset and two augmentations, using the best hyperparameters averaged over 5 seeds. The model broadly overfits and fails to generate novel and playable levels.

6.3 Controllability
Arguably the most compelling reason to use LLMs for game level generation is the ability to prompt the model in natural language to generate levels with specific characteristics. For instance, it might be possible to create a level that has a specific difficulty (represented by the length of the solution), or with certain level topologies. Recent LLMs have demonstrated impressive abilities to leverage prompting in order to generalize from few or even zero examples on a variety of tasks [2]. However, zero-shot generalization is likely to be difficult for level generation owing to the many functional constraints on valid game levels and their dissimilarity from inputs encountered during pretraining. Thus, we instead focus on LLMs that have been trained specifically to adhere to prompts during level generation. We accomplish this by simply prepending an “annotation” to each level in the training dataset. The annotation consists of each property we intend to control, as well as its value for that specific level. At generation time, we provide the model with only the annotation and task it with generating the rest of the level.

For this experiment, we focus on two annotated characteristics: the proportion of empty space (i.e. percentage of level tiles that are not players, walls, boxes, or targets) and the solution length. Both of these are measurable characteristics of valid Sokoban levels, though they differ in complexity. The proportion of empty space is an observable characteristic of a level, requiring only the ability to count in order to compute. Solution length, by contrast, can typically only be computed by actually solving the level in question and not through direct observation. Even visually sparse or simple levels can require long solutions.

As with the dataset size experiment in Section 6.2, we use a standard pretrained GPT-2 model. We train a separate model on the Boxoban dataset annotated with the proportion of empty space and the Boxoban dataset annotated with level solution length. At test time, we provide the model with only the annotation, randomly sampled from the collection of annotations in the training set. In addition to novelty, playability, and diversity, we compute the model’s accuracy as described in Section 5. For the proportion of empty space condition, we use a tolerance of 0.01, and for the solution length condition we use a tolerance of 5. For this experiment, we report both the standard “score” defined above, as well as the “control score,” which is simply the diversity of levels that are accurate to the prompt, in addition to being novel and playable. We report these results, using the same evaluation procedure as in Section 6.1, in Table 4.

6.4 Preliminary Investigation on GPT-3
While the GPT-2 model has demonstrated very high performance on a variety of natural language tasks, it has nonetheless been largely eclipsed by its successor model: GPT-3, which boasts both substantially more parameters and a greatly increased amount of pretraining data. Access to GPT-3 is currently limited (especially with respect to the cost of training on large datasets), making it infeasible to perform direct comparisons with GPT-2 on all measures. Nevertheless, we perform some initial experiments on the performance of OpenAI’s Davinci model when trained on the Microban dataset and its augmentations.

We train the Davinci model for 10 epochs separately on each of the datasets using a single seed. At test time, we perform a limited hyperparameter sweep over generation temperature and top-p. As with GPT-2, we compute the model’s novelty, playability, and overall score. We report the GPT-3 results in Table 5.

7 Results

7.1 Effects of Pretraining
We see in Table 1 that all three models are able to generate novel, playable, and diverse levels. An average “score” of around 0.55 indicates that the language model is able to reliably generate Sokoban levels that are valid and solvable without directly copying from its Boxoban training dataset. We observe that the untrained GPT-2 model performs very slightly better than either of the pretrained models. The difference, however, is minute and likely due to random variance. Overall, this seems to indicate that the pretraining afforded to these LLMs neither particularly helps nor hinders their ability to generate game levels. This could be
explained by the substantial dissimilarity between modeling 7.4 Preliminary Investigation on GPT-3
natural language and Sokoban levels, meaning that models Table 5 contains the results of GPT-3 level generation when
are required to effectively learn from scratch in this domain trained on the Microban and its augmentations. While these
and are able to do so. results should be taken with a healthy amount of caution
because they are generated from only a single training run
7.2 Effects of Dataset Size
and with a limited evaluation hyperparameter sweep, they
The results in Table 2 and Table 3 indicate that dataset size is nonetheless offer some reason for optimism. In contrast to
indeed an important factor for an LLM’s ability to generate GPT-2, GPT-3 is able to produce novel and playable levels
game levels. For small datasets (i.e. the 0.1% and 1% condi- when trained on both the augmented forms of the Microban
tions of Boxoban, as well as also Microban conditions), the dataset, with its overall score on the final condition approach-
GPT-2 model seems unable to produce levels that are both ing that of GPT-2 trained on the entire Boxoban dataset. As
novel and playable (though it can produce novel levels or with previous experiments, however, we observe that increas-
playable levels in isolation), leading to low overall scores. ing dataset size (in this case adding rotations in addition to
Nevertheless, in all but the 0.1% Boxoban condition, sample flips) does lead to increased overall performance with GPT-3.
diversity remains relatively high. While the effect is not espe- In future work, we intend to perform a more robust analysis
cially pronounced, there does appear to be a correlation be- of GPT-3’s abilities, including its capacity for controllable
tween the size of the dataset and the model’s score. This sup- level generation.
ports the notion that, like with many natural language tasks,
LLM performance on level generation scales effectively with
the availability of training data. However, it is important to
note that [29] demonstrate that simple LSTM generators are 8 Future Work
capable of generating novel game levels even when trained In this paper, we examine the performance of LLMs on gen-
on bootstrapped datasets beginning with only 12 samples erating levels for a single game. However, one of the primary
(and never reaching the order of magnitude of Boxoban). strengths of LLMs is their ability to rapidly adapt to a variety
Thus, it seems very unlikely that LLMs are fully incapable of contexts given the appropriate prompt. Consider a dataset
of performing well at level generation when restricted to of levels from many different games, where each level has
small datasets. What might account for this difference in been annotated with the natural language mapping from tiles
performance, then? One possibility is expressivity: modern to game objects (e.g. “@ represents the player, M represents
transformers are much better able to represent sequential a monster), along with a description of the level objective.
data than LSTMs, and so are more likely to completely model An LLM might be better equipped than other PCG systems
the dynamics of their training datasets, to the detriment of to generate novel and playable levels from this variety of
their generative capabilities. However, as we will see, this games, owing to its familiarity with natural language and
explanation does not account for the performance of GPT-3 (see Section 7.4).

7.3 Controllability
In Table 4, we see the effects of sampling levels conditioned on simple prompts. In the first row, we see that the GPT-2 model is able to produce levels that are novel, playable, and within a single tile of the specified proportion of empty space (corresponding to perfect accuracy and a relatively high control score). However, when it comes to solution length, GPT-2 achieves an accuracy of only 17%. Given the tolerance of 5 and the fact that most solution lengths in the dataset fall within a relatively narrow band, this cannot be interpreted as anything more than the effect of random chance. The same holds for the combined condition, where overall accuracy is determined by both the correct amount of empty space and solution length and does not rise above 3%. It is worth noting, however, that even in the conditions where GPT-2 failed to produce accurate levels, it nonetheless continued to generally produce novel and playable ones. In other words, the introduction of the prompt did not negatively affect the model’s performance.

capacity for rapid adaptation.

However, our work also indicates that making effective use of LLMs for game level generation may require more consideration of dataset size: few games have available the massive number of levels present in the Boxoban set (though it is worth noting that, preliminarily, GPT-3 seemed better able to handle data scarcity than GPT-2). As mentioned in Section 7.2, prior work has demonstrated that bootstrapping larger training sets from initially small collections of levels is a viable technique. Another possibility is augmenting existing datasets beyond simple flips and rotations. In the case of Sokoban, for instance, this might involve collecting atomic puzzle components and re-arranging them.

Finally, there is room for much greater sophistication in the techniques used to control LLM outputs. Research in the area of controllable language model decoding [10] offers the opportunity to leverage existing work in PCG through reinforcement learning. More modern LLMs, especially, have also been shown to benefit from careful prompt engineering [30]. A combination of these approaches might allow for LLM generators that are better equipped to obey the functional constraints of game levels.
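As an illustration of the "simple flips and rotations" used as the baseline form of dataset augmentation, the following is a minimal sketch in Python. It assumes levels are stored in the common Sokoban ASCII convention (`#` wall, `@` player, `$` box, `.` goal, space for empty floor) as a list of row strings; the function names are hypothetical and are not taken from the paper's code.

```python
from typing import List

Level = List[str]  # one string per row


def pad(level: Level) -> Level:
    """Pad rows with spaces so the grid is rectangular (assumed safe for Sokoban ASCII)."""
    width = max(len(row) for row in level)
    return [row.ljust(width) for row in level]


def flip_horizontal(level: Level) -> Level:
    """Mirror the level left-to-right."""
    return [row[::-1] for row in level]


def flip_vertical(level: Level) -> Level:
    """Mirror the level top-to-bottom."""
    return level[::-1]


def rotate_clockwise(level: Level) -> Level:
    """Rotate 90 degrees clockwise: each column, read bottom-to-top, becomes a row."""
    return ["".join(col) for col in zip(*level[::-1])]


def augment(level: Level, rotations: bool = True) -> List[Level]:
    """Return the distinct variants of a level under flips and, optionally, rotations."""
    level = pad(level)
    if rotations:
        candidates = []
        current = level
        for _ in range(4):
            candidates.append(current)
            candidates.append(flip_horizontal(current))
            current = rotate_clockwise(current)
    else:
        candidates = [
            level,
            flip_horizontal(level),
            flip_vertical(level),
            flip_horizontal(flip_vertical(level)),
        ]
    # Symmetric levels produce duplicates, so keep only the first copy of each.
    seen = set()
    unique = []
    for variant in candidates:
        key = tuple(variant)
        if key not in seen:
            seen.add(key)
            unique.append(variant)
    return unique


corridor = ["#####", "#@$.#", "#####"]
flipped_only = augment(corridor, rotations=False)
all_variants = augment(corridor)
```

Because these transformations only move tiles around, every variant keeps the same tile counts and remains solvable whenever the original is. A square level with no symmetry yields at most eight variants, which is why such augmentation alone cannot close a large gap in dataset size.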
Level Generation Through Large Language Models Conference’23, April 2023, Lisbon, Portugal

Controls                     Novelty  Playability  Accuracy  Diversity  Score  Control Score
Prop. Empty                  0.96     0.57         1.00      0.97       0.53   0.53
Solution Len                 0.95     0.54         0.17      1.00       0.50   0.14
Prop. Empty & Solution Len   0.96     0.59         0.03      0.79       0.45   0.03
Table 4. Novelty, playability, diversity, and accuracy (along with the overall score and the “control score”, which accounts
for accuracy) for GPT-2 trained on Boxoban, annotated with the proportion of empty space, the solution length, and both
simultaneously, using the best hyperparameters when averaged over 5 seeds. The model is able to adhere to the empty space
controls, but not the solution length controls.
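To make the accuracy column in Table 4 more concrete, here is a hedged sketch of how a tolerance-based accuracy might be computed. The tolerances (one tile for empty space, 5 for solution length) come from Section 7.3; how the one-tile tolerance is applied to a proportion, the use of a space character for empty floor, and all function names are assumptions made for illustration only.

```python
from typing import Iterable, List, Tuple


def empty_within_one_tile(level: List[str], target_prop: float, empty_char: str = " ") -> bool:
    """Check whether the number of empty tiles is within one tile of the requested proportion.

    Assumption: the target proportion is converted to a tile count using the
    total number of tiles in the level grid.
    """
    n_tiles = sum(len(row) for row in level)
    n_empty = sum(row.count(empty_char) for row in level)
    return abs(n_empty - target_prop * n_tiles) <= 1


def control_accuracy(pairs: Iterable[Tuple[float, float]], tolerance: float) -> float:
    """Fraction of (target, measured) pairs whose absolute difference is within the tolerance."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for target, measured in pairs if abs(measured - target) <= tolerance)
    return hits / len(pairs)


# Hypothetical (target, measured) solution lengths for four generated levels.
solution_lengths = [(20, 23), (20, 31), (15, 15), (40, 34)]
accuracy = control_accuracy(solution_lengths, tolerance=5)  # 2 of 4 are within 5 steps

level = ["######", "#@ $.#", "######"]  # 18 tiles, 1 of them empty
close_enough = empty_within_one_tile(level, target_prop=0.1)
too_far = empty_within_one_tile(level, target_prop=0.2)
```

A sketch like this also shows why a narrow band of solution lengths in the training data makes the 17% figure hard to distinguish from chance: with a tolerance of 5, a model that ignores the prompt and samples typical lengths will still land inside the window for some targets.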

Dataset                   Novelty  Playability  Diversity  Score
Microban                  0.09     0.88         0.67       0.01
Microban (flip)           0.55     0.94         0.77       0.36
Microban (flip + rotate)  0.70     0.93         0.88       0.51
Table 5. Novelty, playability, diversity, and overall score for GPT-3 trained on the Microban dataset and two augmentations. GPT-3 is able to produce novel, playable, and diverse levels from a relatively small training set.

9 Conclusion
Large language models are highly versatile. Beyond merely predicting likely continuations of text, they are capable of an impressive range of natural language tasks. In this work, we show that generating video game levels can be added to that list. With sufficient data and training, LLMs are able to produce a diverse set of novel and playable Sokoban levels. We show that the pretraining generally afforded to these models does not hinder their ability to generate game levels, though any actual effect is unclear. We also demonstrate that, for GPT-2, the domain of game level generation is beholden to the same data scaling trends that apply to many natural language domains: model performance is strongly dependent on the availability of data. Cutting-edge LLMs like GPT-3 may have the potential to better generalize from small amounts of training data, though more work must be done before decisive conclusions can be drawn. With respect to controllability, we find that a simple prompting approach is sufficient for observable level characteristics like the proportion of empty tiles, but breaks down on more complicated metrics like solution length. Overall, the use of LLMs for game level generation shows promise despite it being a wildly different domain from natural language, complete with its own set of constraints and syntax. LLMs also seem poised to overcome the general lack of available game level data, potentially offering a new way forward for procedural content generation through machine learning.

Ethical Statement
Large language models often reproduce biases in their training data. Though our models are tuned to produce game levels, it could still be possible to extract harmful information from them given an adversarial approach.

References
[1] Andrea Agostinelli, Timo I Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, et al. 2023. MusicLM: Generating Music From Text. arXiv preprint arXiv:2301.11325 (2023).
[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. https://doi.org/10.48550/ARXIV.2005.14165
[3] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. 2021. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems 34 (2021), 15084–15097.
[4] Steve Dahlskog, Julian Togelius, and Mark J Nelson. 2014. Linear levels through n-grams. In Proceedings of the 18th International Academic MindTrek Conference: Media Business, Management, Content & Services. 200–206.
[5] Sam Earle, Maria Edwards, Ahmed Khalifa, Philip Bontrager, and Julian Togelius. 2021. Learning controllable content generators. In 2021 IEEE Conference on Games (CoG). IEEE, 1–9.
[6] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M Dai, Matthew D Hoffman, Monica Dinculescu, and Douglas Eck. 2018. Music transformer. arXiv preprint arXiv:1809.04281 (2018).
[7] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
[8] Bilal Kartal, Nick Sohre, and Stephen J Guy. 2016. Data driven Sokoban puzzle generation with Monte Carlo tree search. In Twelfth Artificial Intelligence and Interactive Digital Entertainment Conference.
[9] Ahmed Khalifa, Philip Bontrager, Sam Earle, and Julian Togelius. 2020. PCGRL: Procedural content generation via reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Vol. 16. 95–101.

[10] Jiwei Li, Will Monroe, and Dan Jurafsky. 2017. Learning to decode for future success. arXiv preprint arXiv:1701.06549 (2017).
[11] Jialin Liu, Sam Snodgrass, Ahmed Khalifa, Sebastian Risi, Georgios N Yannakakis, and Julian Togelius. 2021. Deep learning for procedural content generation. Neural Computing and Applications 33, 1 (2021), 19–37.
[12] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021).
[13] Justin Mott, Saujas Nandi, and Luke Zeller. 2019. Controllable and coherent level generation: A two-pronged approach. In Experimental AI in games workshop.
[14] Yoshio Murase, Hitoshi Matsubara, and Yuzuru Hiraga. 1996. Automatic making of Sokoban problems. In PRICAI’96: Topics in Artificial Intelligence: 4th Pacific Rim International Conference on Artificial Intelligence Cairns, Australia, August 26–30, 1996 Proceedings 4. Springer, 592–600.
[15] Kyungjin Park, Bradford W Mott, Wookhee Min, Kristy Elizabeth Boyer, Eric N Wiebe, and James C Lester. 2019. Generating educational game levels with multistep deep convolutional generative adversarial networks. In 2019 IEEE Conference on Games (CoG). IEEE, 1–8.
[16] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[17] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022).
[18] Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. 2022. A generalist agent. arXiv preprint arXiv:2205.06175 (2022).
[19] Anurag Sarkar and Seth Cooper. 2018. Blending Levels from Different Games using LSTMs. In AIIDE Workshops.
[20] Jacob Schrum, Jake Gutierrez, Vanessa Volz, Jialin Liu, Simon Lucas, and Sebastian Risi. 2020. Interactive evolution and exploration within latent level-design space of generative adversarial networks. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference. 148–156.
[21] Sam Snodgrass and Anurag Sarkar. 2020. Multi-domain level generation and blending with sketches via example-driven BSP and variational autoencoders. In Proceedings of the 15th international conference on the foundations of digital games. 1–11.
[22] Muhammad Suleman, Farrukh Hasan Syed, Tahir Q Syed, Saqib Arfeen, Sadaf I Behlim, and Behroz Mirza. 2017. Generation of Sokoban stages using recurrent neural networks. International Journal of Advanced Computer Science and Applications 8, 3 (2017).
[23] Adam Summerville, Matthew Guzdial, Michael Mateas, and Mark O Riedl. 2016. Learning player tailored content from observation: Platformer level generation from video traces using LSTMs. In Twelfth artificial intelligence and interactive digital entertainment conference.
[24] Adam Summerville and Michael Mateas. 2016. Super Mario as a String: Platformer Level Generation Via LSTMs. https://doi.org/10.48550/ARXIV.1603.00930
[25] Adam Summerville, Sam Snodgrass, Matthew Guzdial, Christoffer Holmgård, Amy K Hoover, Aaron Isaksen, Andy Nealen, and Julian Togelius. 2018. Procedural content generation via machine learning (PCGML). IEEE Transactions on Games 10, 3 (2018), 257–270.
[26] Joshua Taylor and Ian Parberry. 2011. Procedural generation of Sokoban levels. In Proceedings of the International North American Conference on Intelligent Games and Simulation. 5–12.
[27] Ruben Rodriguez Torrado, Ahmed Khalifa, Michael Cerny Green, Niels Justesen, Sebastian Risi, and Julian Togelius. 2020. Bootstrapping conditional GANs for video game level generation. In 2020 IEEE Conference on Games (CoG). IEEE, 41–48.
[28] Théophane Weber, Sébastien Racanière, David P. Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, Razvan Pascanu, Peter Battaglia, Demis Hassabis, David Silver, and Daan Wierstra. 2017. Imagination-Augmented Agents for Deep Reinforcement Learning. https://doi.org/10.48550/ARXIV.1707.06203
[29] Yahia Zakaria, Magda Fayek, and Mayada Hadhoud. 2022. Procedural Level Generation for Sokoban via Deep Learning: An Experimental Study. IEEE Transactions on Games (2022), 1–1. https://doi.org/10.1109/TG.2022.3175795
[30] Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning. PMLR, 12697–12706.
