resentations and a route planner generate real-time instructions to guide users through a treasure hunt in a virtual 3D world.

There is a resurgence of interest in Blocks World-like scenarios. Wang et al. (2017) let users define 3D voxel structures via a highly programmatic natural language. The interface learns to understand descriptions of increasing complexity, but does not engage in a back-and-forth dialogue with the user. Most closely related to our work are the corpora of Bisk et al. (2018, 2016a,b), which feature pairs of scenes involving simulated, uniquely labeled, 3D blocks annotated with single-shot instructions aimed at guiding an (imaginary) partner on how to transform an input scene into the target. In their scenario, the building area is always viewed from a fixed bird's-eye perspective. Simpler versions of the data retain the grid-based assumption over blocks, and structures consist solely of numeric digits procedurally reconstructed along the horizontal plane. Later versions increase the task complexity significantly by incorporating human-generated, truly 3D structures, removing the grid assumption, and allowing for rotations of individual blocks. Their blocks behave like physical blocks, disallowing the structures with floating blocks that are prevalent in our data. Our work differs considerably in a few other aspects: our corpus features two-way dialogue between an instructor and a real human partner; it also includes a wide range of perspectives as a result of using Minecraft avatars, rather than a fixed bird's-eye perspective; and we utilize blocks of different colors, allowing entire sub-structures to be identified (e.g., "the red pillar").

3 Minecraft Collaborative Building Task

Minecraft (https://minecraft.net/) is a popular multi-player game in which players control avatars to navigate in a 3D world and manipulate inherently block-like materials in order to build structures. Players can freely move, jump and fly, and they can choose between first- or third-person perspectives. Camera angles can be smoothly rotated by moving around or by turning one's avatar's head up, down, and side-to-side, resulting in a wide range of possible viewpoints.

Blocks World in Minecraft Minecraft provides an ideal setting for simulating Blocks World, although there are two key differences from physical toy blocks: Minecraft blocks can only be placed on a discrete 3D grid, and they do not need to obey gravity. That is, they do not need to be placed on the ground or on top of another block, but can be put anywhere as long as one of their sides touches another block. That neighboring block can later be removed, allowing the second block (and any structure supported by it) to "float". Players need to identify when such supporting blocks need to be added or removed.
Collaborative Building Task We define the Collaborative Building Task as a two-player game between an Architect (A) and a Builder (B). A is given a target structure (Target) and has to instruct B via a text chat interface to build a copy of Target on a given build region. A and B can communicate back and forth via chat throughout the game (e.g. to resolve confusions or to correct B's mistakes). B is given access to an inventory of 120 blocks of six given colors that it can place and remove. A can observe B and move around in its world, allowing it to provide instructions from varying perspectives. But A cannot move blocks, and remains invisible to B. The task is complete when the structure built by B (Built) matches Target, invariant to translations within the horizontal plane and rotations about the vertical axis. Built also needs to lie completely within the boundaries of the predefined build region.
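For concreteness, this completion check can be sketched as follows (illustrative Python; the helper names are ours, the build-region containment test is omitted, and the actual implementation may differ):

    def rotate_y(block, k):
        # rotate a (color, x, y, z) block by k * 90 degrees about the vertical (y) axis
        c, x, y, z = block
        for _ in range(k % 4):
            x, z = -z, x
        return (c, x, y, z)

    def is_complete(built, target):
        # True iff `built` equals `target` up to a horizontal translation and a
        # 90-degree rotation about the vertical axis; both are sets of (c, x, y, z)
        if len(built) != len(target):
            return False
        if not built:
            return True
        for k in range(4):
            rotated = {rotate_y(b, k) for b in target}
            c0, x0, y0, z0 = next(iter(rotated))  # anchor one target block
            for (c, x, y, z) in built:
                if c != c0 or y != y0:
                    continue  # anchor can only map onto a same-colored block at the same height
                dx, dz = x - x0, z - z0
                if {(cc, xx + dx, yy, zz + dz) for (cc, xx, yy, zz) in rotated} == built:
                    return True
        return False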
Although human players were able to complete each structure successfully, this task is not trivial. Figure 1 shows the perspectives seen by each player in the Minecraft client. This example from our corpus shows some of the challenges of this task. A often provides instructions that they think are sufficient, but leave B still clearly confused, indicated either by B's lack of initiative to start building or a confused response. Once a multi-step instruction is understood, B also needs to plan a sequence of steps to follow that instruction; in many cases, B chooses clearly suboptimal solutions, resulting in large amounts of redundancy in block movements. A misinterpreted instruction may also lead to a whole sequence of blocks being misplaced by B (either due to miscommunication, or because B made an educated guess on how to proceed) until A decides to intervene (in the example, this can be seen with the built yellow 6). A could also misinterpret the target structure, giving B incorrect instructions that would later need to be rectified. This illustrates the challenges involved
Figure 1: In the Minecraft Collaborative Building Task, the Architect (A) has to instruct a Builder (B) to build a
target structure. A can observe B, but remains invisible to B. Both players communicate via a chat interface. (NB:
We show B’s actions in the dialogue as a visual aid to the reader.)
in designing an interactive agent for this task: the Architect needs to provide clear instructions; the Builder needs to identify when more information is required; and both agents may need to design efficient plans to construct complex structures.

4 The Minecraft Dialogue Corpus

The Minecraft Dialogue Corpus consists of 509 human-human dialogues and game logs for the Collaborative Building Task. This section describes this corpus and our data collection process. Further details are in the supplementary materials.

4.1 Data Collection Procedure

Data was collected over the course of 3 weeks (approx. 62 hours overall). 40 volunteers, both undergraduate and graduate students with varying levels of proficiency with Minecraft, participated in 1.5-hour sessions in which they were paired up and asked to build various predefined structures within an 11 × 11 × 9 build region. Builders began with an inventory of 6 colors of blocks and 20 blocks of each color. After a brief warm-up round to become familiar with the interface, participants were asked to successfully build as many structures as they could manage within this time frame. On average, each game took 8.55 minutes.

Architects were encouraged not to overwhelm the Builder with instructions and to allow their partner a chance to respond or act before moving on. Builders were instructed not to place blocks outside the specified build region and to stay as faithful as possible to the Architect's instructions. Both players were asked to communicate as naturally as possible while avoiding idle chit-chat.

Participants were allowed to complete multiple sessions if desired; we ensured that an individual never saw the same target structure twice, and attempted as much as possible to pair them with a previously unseen partner. While some individuals indicated a preference towards either the Architect or Builder role, roles were, for the most part, assigned in such a way that each individual who participated in repeat sessions played both roles equally often. Each participant is assigned a unique anonymous ID across sessions.

4.2 Data Structures and Collection Platform

Microsoft's Project Malmo (Johnson et al., 2016) is an AI research platform that provides an API for Minecraft agents and the ability to log, save, and load game states. We have extended Malmo into a data collection platform. We represent the progression of each game (involving the construction of a single target structure by an Architect and
Builder pair) as a discrete sequence of game states. Although Malmo continuously monitors the game, we selectively discretize this data by only saving snapshots, or "observations," of the game state at certain triggering moments (whenever B picks up or puts down a block, or when either player sends a chat message). This allows us to reduce the amount of (redundant) data to be logged while preserving significant game state changes. Each observation is a JSON object that contains the following information: 1) a time stamp, 2) the chat history up until that point in time, 3) B's position (a tuple of real-valued x, y, z coordinates as well as pitch and yaw angles, representing the orientation of their camera), 4) B's block inventory, 5) the locations of the blocks in the build region, and 6) screenshots taken from A's and B's perspectives. Whenever B manipulates a block, we also capture screenshots from four invisible "Fixed Viewer" clients hovering around the build region at fixed angles.
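An abridged, hypothetical observation illustrating fields 1)-6) above might look as follows; all keys here are illustrative, not the actual field names used in the released logs:

    {
      "timestamp": "...",
      "chat_history": ["<Architect> place a red block in the middle",
                       "<Builder> like this?"],
      "builder_position": {"x": 2.5, "y": 1.0, "z": -3.5,
                           "pitch": 12.0, "yaw": 270.0},
      "builder_inventory": {"red": 19, "blue": 20, "orange": 20,
                            "purple": 20, "yellow": 20, "green": 20},
      "blocks_in_grid": [{"color": "red", "x": 0, "y": 1, "z": 0}],
      "screenshots": {"architect": "...", "builder": "..."}
    }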
multicolored worm”). To avoid slogging through
4.3 Data Statistics and Analysis block-by-block instructions, Architects frequently
Overall statistics The Minecraft Dialogue Cor- used such names to refer to sub-elements of the
pus contains 509 human-human dialogues (15,926 target structure. Some even defined new terms
utterances, 113,116 tokens) and game logs for 150 that get re-used across utterances: A: i will refer
target structures of varying complexity (min. 6 to this shape as r-windows from here on out... B:
blocks, max. 68 blocks, avg. 23.5 blocks). We okay A: please place the first green block in the
collected a minimum of three dialogues per struc- right open space of the blue r-window.
ture. The training, test and development sets con-
Builder utterances Even though the Architect
sist of 85 structures (281 dialogues), 39 structures
shouldered the large responsibility of describing
(137 dialogues), and 29 structures (101 dialogues)
the unseen structure, the Builder played an active
respectively. Dialogues for the same structure are
role in continuing and clarifying the dialogue, es-
fully contained within a single split; structures in
pecially for more complex structures. Builders
training are thus guaranteed to be unseen in test.
regularly took initiative during the course of a dia-
On average, dialogues contain 30.7 utterances:
logue in a variety of ways, including verification
22.5 Architect utterances (avg. length 7.9 tokens),
questions (“is this ok?”), clarification questions
8.2 Builder utterances (avg. length 2.9 tokens),
(“is it flat?” or “did I clean it up correctly?”),
and 49.5 Builder block movements. Dialogue
status updates (“i’m out of red blocks”), sugges-
length varies greatly with the complexity of the
tions (“feel free to give more than one direction at
target structure (not just the number of blocks, but
a time if you’re comfortable,” “i’ll stay in a fixed
whether it requires floating blocks or contains rec-
position so it’s easier to give me directions with
ognizable substructures).
respect to what i’m looking at”), or extrapolation
Floating blocks Blocks in Minecraft can be (“I think I know what you want. Let me try,” then
placed anywhere as long as they touch an existing continuing to build without explicit instruction).
block (or the ground). If such a supporting block is
5 Architect Utterance Generation Task
later removed, the remaining block (and any struc-
ture supported by it) will continue to “float” in Although the Minecraft Dialogue Corpus was mo-
place. This makes it possible to produce complex tivated by our ultimate goal of building agents that
designs. 53.6% of our target structures contain can successfully play an entire collaborative build-
such floating blocks. Instructions for these struc- ing game as Architect or Builder, we first con-
Figure 3: A target structure (left) and corresponding
built structure at a certain point in the game (right).
the Hamming distance between the built structure and the target (the total number of blocks of each color to be placed and removed), and only retain those alignments that have the smallest distance to the target. Once the game has progressed sufficiently far, there is often only one optimal alignment between built and target structures, but in the early stages, a number of different optimal alignments may be possible. Our world state representation captures this uncertainty.

Figure 3 depicts a target structure (left) and a point in the game at which a single red block has been placed (right). We can identify three potential paths (left, up, and down) to continue the structure by extending it along the four cardinal directions. A permissibility check disqualifies the option of extending to the right, as blocks would end up placed outside the build region. These remaining paths, considered equally likely, indicate the colors and locations of blocks to be placed (or removed). A summary of this information forms the basis of the input to our model.

Computing the distance between structures Computing the Hamming distance between the built and target structure under a given alignment also tells us which blocks need to be placed or removed. A structure S is a set of blocks (c, x, y, z). Each block has a color c and occupies a location (x, y, z) in absolute coordinate space (i.e., the coordinate system defined by the Minecraft client). A structure's position and orientation can be mutated by an alignment A in which S undergoes a translation A_T (shift) followed by a rotation A_R, denoted A(S) = A_R(A_T(S)). We only consider rotations about the vertical axis in 90-degree intervals, but allow all possible translations along the horizontal plane. The symmetric difference between the target T and a built structure S w.r.t. an alignment A, diff(T, S, A), consists of the set of blocks to be placed, B_p = A(T) − S, and the set of blocks to be removed from S, B_r = S − A(T):

diff(T, S, A) = B_p ∪ B_r

The cardinality |diff(T, S, A)| is the Hamming distance between A(T) and S.
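As a sketch (in Python, with our own helper names; the paper's actual implementation is not released as part of this text), diff and the Hamming distance under one candidate alignment follow directly from these set definitions:

    def apply_alignment(structure, dx, dz, k):
        # A(S) = A_R(A_T(S)): translate by (dx, dz) in the horizontal plane,
        # then rotate k * 90 degrees about the vertical axis
        aligned = set()
        for (c, x, y, z) in structure:
            x, z = x + dx, z + dz
            for _ in range(k % 4):
                x, z = -z, x
            aligned.add((c, x, y, z))
        return aligned

    def diff(target, built, dx, dz, k):
        # B_p: blocks still to be placed; B_r: blocks to be removed
        aligned = apply_alignment(target, dx, dz, k)
        b_p = aligned - built
        b_r = built - aligned
        return b_p, b_r  # |b_p| + |b_r| is the Hamming distance

Enumerating the four rotations and all in-region translations, and keeping the alignments that minimize |B_p| + |B_r|, yields the optimal alignments used below.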
Feasible next placements Architects' instructions often concern the immediate next blocks to be placed. Since new blocks can only be feasibly placed if one of their faces touches the ground or another block, we also wish to capture which blocks B_n can be placed in the immediate next action. B_n, the set of blocks that can be feasibly placed, is a subset of B_p.
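A corresponding sketch of the feasibility filter (`ground_y` is a placeholder for whatever vertical coordinate the build region's floor has in the actual client):

    def feasible_next(b_p, built, ground_y=0):
        # B_n: the subset of B_p whose blocks would rest on the ground or
        # share a face with an already-placed block
        faces = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                 (0, -1, 0), (0, 0, 1), (0, 0, -1)]
        occupied = {(x, y, z) for (_, x, y, z) in built}
        return {(c, x, y, z) for (c, x, y, z) in b_p
                if y == ground_y
                or any((x + dx, y + dy, z + dz) in occupied
                       for (dx, dy, dz) in faces)}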
Block counters To obtain a summary representation of the optimal alignments (without detailed spatial information), we represent each of the sets B_p and B_r (as well as B_n) of an alignment A (where diff(T, S, A) = B_p ∪ B_r) as sets of counters over block colors, indicating how many blocks of each color remain to be placed [next] and to be removed. We compute the set of expected block counters for each color c ∈ {red, blue, orange, purple, yellow, green} and action a ∈ {p, r, n} as the average over all k optimal alignments A* = argmin_A |diff(T, S, A)|:

E[count_{c,a}] = \frac{1}{k} \sum_{i=1}^{k} count^{i}_{c,a}

With six colors and three sets of blocks (all placements, next placements, removals), we obtain an 18-dimensional vector of expected block counts.
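A sketch of this averaging step, assuming the per-alignment sets (B_p, B_r, B_n) have already been computed:

    from collections import Counter

    COLORS = ("red", "blue", "orange", "purple", "yellow", "green")

    def expected_counters(alignments):
        # `alignments` is a list of (b_p, b_r, b_n) triples, one per optimal
        # alignment; returns the 18 expected counts E[count_{c,a}]
        k = len(alignments)
        expected = {}
        for idx, action in enumerate(("p", "r", "n")):
            totals = Counter()
            for triple in alignments:
                totals.update(c for (c, _, _, _) in triple[idx])
            for color in COLORS:
                expected[(color, action)] = totals[color] / k
        return expected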
7.1 Block Counter Models

We augment our basic seq2seq model with two variants of block counters that capture the current state of the built structure:

Global block counters are 18-dimensional vectors (capturing expected overall placements, next placements, and removals for each of the six colors) that are computed over the whole build region.

Local block counters Since many Builder actions involve locations immediately adjacent to their last action, we construct local block counters that focus on and encode spatial information about this concentrated region. Here, we consider a 3 × 3 × 3 cube of block locations: those directly surrounding the location of the last Builder action, as well as the last action itself. We compute a separate set of block counters for each of these 27 locations. Using the Builder's position and gaze, we deterministically assign to each location a relative direction that indicates its position relative to the last action from the Builder's perspective, e.g., "left", "top", "back-right", etc. The 27 18-dimensional block counters are then concatenated, using a fixed canonical ordering of the assigned directions.

Adding block counters to the model To add block counters to our models, we found the best results by feeding the concatenated global and local
counter vectors through a single fully-connected layer before concatenating them to the word embedding vector that is fed into the decoder at each time step (Figure 2).
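A minimal sketch of this wiring in PyTorch; the layer sizes and the ReLU are our assumptions, since the text only specifies a single fully-connected layer:

    import torch
    import torch.nn as nn

    class CounterAugmentedEmbedding(nn.Module):
        def __init__(self, vocab_size, emb_dim, proj_dim,
                     counter_dim=18 + 27 * 18):  # global (18) + local (27 x 18)
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.proj = nn.Linear(counter_dim, proj_dim)  # the single FC layer

        def forward(self, token_ids, global_counters, local_counters):
            emb = self.embed(token_ids)                    # (batch, emb_dim)
            counters = torch.cat([global_counters,
                                  local_counters], dim=-1)  # (batch, 504)
            proj = torch.relu(self.proj(counters))
            # concatenated to the word embedding at every decoder time step
            return torch.cat([emb, proj], dim=-1)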
8 Experimental Setup

Data Our training, test and dev splits contain 6,548, 2,855, and 2,251 Architect utterances, respectively.

Training We trained for a maximum of 40 epochs using the Adam optimizer (Kingma and Ba, 2015), minimizing the sum of the cross-entropy losses between each predicted and ground-truth token. We stopped training early when perplexity on the held-out validation set increased monotonically for two epochs. All word embeddings were initialized with pretrained GloVe vectors (Pennington et al., 2014). We first performed a grid search over model architecture hyperparameters (embedding layer sizes and RNN layer depths). Once the best-performing architecture was found, we then varied dropout parameters (Srivastava et al., 2014). More details can be found in the supplementary materials.

Decoding We use beam search decoding to generate the utterance with the maximum log-likelihood score according to our model, normalized by utterance length (beam size = 10). In order to promote diversity of generated utterances, we use a γ penalty (Li et al., 2016) of γ = 0.8. These parameters were found by a grid search on the validation set for our best model.
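For illustration, the two scoring components can be sketched as follows; this reflects our reading of the Li et al. (2016) intra-sibling rank penalty, not a verbatim reimplementation:

    def rescore_siblings(parent_score, sibling_logprobs, gamma=0.8):
        # expansions of the same parent beam are ranked by log-probability;
        # the r-th ranked sibling is penalized by gamma * r
        ranked = sorted(sibling_logprobs, reverse=True)
        return [parent_score + lp - gamma * r
                for r, lp in enumerate(ranked, start=1)]

    def length_normalized(total_logprob, num_tokens):
        # final hypothesis score: average log-likelihood per token
        return total_logprob / max(num_tokens, 1)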
9 Results and Analysis

We evaluate our models in three ways. First, we use automated metrics to assess how closely the generated utterances match the human utterances. Second, for a random sample of 100 utterances per model, we use human evaluators to identify dialogue acts and to evaluate whether the generated utterances are correct in the given game context. Finally, we perform a qualitative analysis of our best model.

9.1 Automated Evaluation

Metrics To evaluate how closely the generated utterances resemble the human utterances, we report standard BLEU scores (Papineni et al., 2002). We also compute (modified) precision and recall over a number of lists of domain-specific keywords that are instrumental to task success: colors, spatial relations, and other words that are highly indicative of dialogue acts (e.g., responding "yes" vs. "no", instructing to "place" vs. "remove", etc.). These lists also capture synonyms that are common in our data (e.g. "yes"/"yeah"), and were obtained by curating non-overlapping lists of words (with a frequency ≥ 10 across all data splits) that are appropriate to each category.²

We report precision and recall scores per category, and for an "all keywords" list consisting of the union of all category word lists. For each category, we reduce both human and generated utterances to those tokens that occur in the corresponding keyword list: "place another red left of the green" reduces to "red green" for color, to "left" for spatial relations, and to "place" for dialogue.

For a given (reduced) generated sentence S_g and its associated (reduced) human utterance S_h, we calculate term-specific precision (and recall) as follows. Any token t_g in S_g matches a token t_h in S_h if t_g and t_h are identical or synonyms. Similar to BLEU's modified unigram precision, once t_g is matched to one token t_h, it cannot be used for further matches to other tokens within S_h. Counts are accumulated over the entire corpus to compute the ratio of matched to total tokens in S_g (for precision) or S_h (for recall).

² These word lists are in the supplementary materials.
Ablation study Table 1 shows the results of an ablation study on the validation set. All model variants here share the same RNN parameters. While the individual addition of global and local block counters each yields a slight boost in precision and recall, respectively, combining them as in our final model shows a significant performance increase, especially on colors.

Test set results We fine-tune our most basic and most complex models via a grid search over all architectural parameters and dropout values on the validation set. The best models' results on the test set are shown in Table 2. Our full model shows noticeable improvements over the baseline on each of our metrics. Most promising is again the significant increase in performance on colors, indicating that the block counters capture necessary information about next Builder actions.

9.2 Human Evaluation

In order to better evaluate the quality of generated utterances as well as benchmark human performance, we performed a small-scale human evaluation of Architect utterances.
                   BLEU                     Precision / Recall
Metric             B-1   B-2   B-3   B-4    all keywords   colors        spatial      dialogue
seq2seq            14.9  6.9   3.8   2.1    12.0 / 10.3    8.4 / 12.1    9.9 / 9.1    16.5 / 19.1
+ global only      16.1  7.7   4.1   2.4    12.9 / 11.6    14.4 / 15.5   8.8 / 7.0    19.1 / 18.8
+ local only       16.0  7.9   4.5   2.6    13.5 / 13.8    13.3 / 23.5   9.5 / 11.3   19.3 / 22.0
+ global & local   16.2  8.1   4.7   2.8    14.5 / 13.8    14.8 / 23.3   10.7 / 9.5   17.9 / 20.6

Table 1: BLEU scores and term-specific precision and recall; ablation study on the validation set.
Table 2: BLEU and term-specific precision and recall scores of the seq2seq and the full model on the test set.
We asked 3 human participants who had previously completed the Minecraft Collaborative Building Task to evaluate 100 randomly sampled scenarios from the test set. Each scenario was reenacted from an actual human-human game by simulating the context of dialogue and Builder actions in Minecraft. Then, we presented 3 candidate Architect utterances to follow that context (one each generated from the models in Table 2, as well as the original human utterance) to the evaluators in randomized order.

Here, we analyze a subset of results on coarse annotation of dialogue acts and utterance correctness. More details on the full evaluation framework, including descriptions of evaluation criteria and inter-annotator agreement statistics, are included in the supplementary materials.

Dialogue acts Given a list of six predefined coarse-grained dialogue acts (including Instruct B, Describe Target, etc.; see the supplementary material for full details), evaluators were asked to choose all dialogue acts that categorized a candidate utterance. An utterance could belong to any number of categories; e.g., "great! now place a red block" is both a confirmation as well as an instruction. Results can be found in Table 3. These results show a significantly higher diversity of utterance types generated by humans. Humans provided instructions only about half of the time, and devoted more energy to providing higher-level descriptions of the target, responding to the Builder's actions and queries, and rectifying mistakes. On the other hand, even the improved model failed to capture this, mainly generating instructions even when it was inappropriate or unhelpful to do so.

Utterance correctness Given a window of game context (consisting of at least the last seven Builder and Architect actions, but always including the previous Architect utterance) and access to the target structure to be built, evaluators were asked to rate the correctness of an utterance immediately following that context with respect to task completion. For an utterance to be fully correct, the information contained within it must both be consistent with the current state of the world and not lead the Builder off-course from the target. Utterances could be considered partially correct if some described elements (e.g. colors) were accurate, but other incorrect elements precluded full correctness. Otherwise, utterances could be deemed incorrect (if wildly off-course) or N/A (if there was not enough information). Results can be found in Table 4. Unsurprisingly, without access to world state information, the baseline model performs poorly, conveying incorrect information about half of the time. With access to a simple world representation, our full model shows marked improvement on generating both fully and partially correct utterances. Finally, human performance sets a high bar; when not engaging in chitchat or correcting typos, humans consistently produce fully correct utterances constructive towards task completion.

9.3 Qualitative Analysis

Here, we use examples to illustrate different aspects of our best model's utterances.
Model              Instruct B   Describe Target   Answer question   Confirm B's actions/plans   Correct/clarify A/B   Other
seq2seq            76.0         12.0              7.0               9.0                         3.0                   4.0
+ global & local   72.0         14.0              8.0               9.0                         3.0                   4.0
human              47.0         14.0              12.0              17.0                        23.0                  8.0

Table 3: Percentage of utterances categorized as a given dialogue act. Labels were determined per dialogue act by majority vote across three human evaluators. An utterance can belong to multiple dialogue acts.
Model              Full   Partial   None   N/A
seq2seq            14.0   28.0      48.0   10.0
+ global & local   25.0   36.0      32.0   7.0
human              89.0   2.0       0.0    9.0

Table 4: Percentage of utterances deemed correct by human evaluators.

Identifying the game state In the course of a game, players progress through different states. In the human-human data, dialogue is peppered with context cues (greetings, questions, apologies, instructions to move or place blocks) that indicate the flow of a game. Our model is able to capture some of these aspects. It often begins games with an instruction like "we'll start with blue", and may end them with "ok we're done!" (although it occasionally continues with further instructions, e.g. "great! now we'll do the same thing on the other side"). It often says "perfect!" immediately followed by a new instruction, which indicates the model's ability to acknowledge a Builder's previous actions before continuing. The model often describes the type of the next required action correctly (even if it makes mistakes in the specifics of that action): it generated "remove the bottom row" when the ground truth was "okay so now get rid of the inner most layer of purple in the square".

Predicting block colors and spatial relations Generated utterances often identify the correct color of blocks, e.g. "then place a red block on top of that" in a context where the next placements include a layer of red blocks (ground truth utterance: "the second level of the structure consists wholly of red blocks. start by putting a red block on each orange block"). Less frequently, the model is also able to predict accurate spatial relations ("perfect! now place a red block to the left of that") for referent blocks.

Utterance diversity and repetition Generated utterances lack diversity: the pattern "a x b" (for a rectangle of size a × b) is almost exclusively used to describe squares (an extremely common shape in our data). Utterances are mostly fluent, but sometimes contain repeats: "okay, on top of the blue block, put a blue block on top of the blue" or "yes, now, purple, purple, purple, ...".

10 Conclusion and Future Work

The Minecraft Collaborative Building Task provides interesting challenges for interactive agents: they must understand and generate spatially-aware dialogue, execute instructions, and identify and recover from mistakes. As a first step towards the goal of developing fully interactive agents for this task, we considered the subtask of Architect utterance generation. To give accurate, high-level instructions, Architects need to align the Builder's world state to the target structure and identify complex substructures. We show that models that capture some world state information improve over naive baselines. Richer models (e.g. CNNs over world states, attention mechanisms (Bahdanau et al., 2015), memory networks (Bordes et al., 2017)) and/or explicit semantic representations should be able to generate better utterances. Clearly, much work remains to be done to create actual agents that can play either role interactively against a human. The Minecraft Dialogue Corpus, as well as the Malmo platform and our extension of it, enable many such future directions. Our platform can also be extended to support fully interactive scenarios that may involve a human player, measure task completion, or support other training regimes (e.g. reinforcement learning).

Acknowledgements

We would like to thank the reviewers for their valuable comments. This work was supported by Contract W911NF-15-1-0461 with the US Defense Advanced Research Projects Agency (DARPA) Communicating with Computers Program and the Army Research Office (ARO). Approved for Public Release, Distribution Unlimited. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.
References

Anne H. Anderson, Miles Bader, Ellen Gurman Bard, Elizabeth Boyle, Gwyneth Doherty, Simon Garrod, Stephen Isard, Jacqueline Kowtko, Jan McAllister, Jim Miller, et al. 1991. The HCRC map task corpus. Language and Speech, 34(4):351–366.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Yonatan Bisk, Daniel Marcu, and William Wong. 2016a. Towards a dataset for human computer communication via grounded language acquisition. In AAAI Workshop: Symbiotic Cognitive Systems.

Yonatan Bisk, Kevin Shih, Yejin Choi, and Daniel Marcu. 2018. Learning interpretable spatial operations in a rich 3D Blocks World. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 5028–5036.

Yonatan Bisk, Deniz Yuret, and Daniel Marcu. 2016b. Natural language communication with robots. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 751–761, San Diego, California. Association for Computational Linguistics.

Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026, Brussels, Belgium. Association for Computational Linguistics.

Joyce Y. Chai, Qiaozi Gao, Lanbo She, Shaohua Yang, Sari Saba-Sadiya, and Guangyue Xu. 2018. Language to action: Towards interactive task learning with physical agents. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), pages 2–9. International Joint Conferences on Artificial Intelligence Organization.

David Chen and Raymond Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pages 859–865.

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, and Dhruv Batra. 2017. Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 326–335.

Srinivasan Janarthanam, Oliver Lemon, and Xingkun Liu. 2012. A web-based evaluation framework for spatial instruction-giving systems. In Proceedings of the ACL 2012 System Demonstrations, pages 49–54, Jeju Island, Korea. Association for Computational Linguistics.

Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. 2016. The Malmo platform for artificial intelligence experimentation. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16), pages 4246–4247.

Seokhwan Kim, Luis Fernando D'Haro, Rafael E. Banchs, Jason D. Williams, and Matthew Henderson. 2017. The fourth dialog state tracking challenge. In Dialogues with Social Robots, pages 435–449. Springer.

Seokhwan Kim, Luis Fernando D'Haro, Rafael E. Banchs, Jason D. Williams, Matthew Henderson, and Koichiro Yoshino. 2016. The fifth dialog state tracking challenge. In 2016 IEEE Spoken Language Technology Workshop (SLT), pages 511–517.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Alexander Koller, Kristina Striegnitz, Donna Byron, Justine Cassell, Robert Dale, Johanna Moore, and Jon Oberlander. 2010. The first challenge on generating instructions in virtual environments. In Empirical Methods in Natural Language Generation, pages 328–352, Berlin, Heidelberg. Springer-Verlag.

Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. A simple, fast diverse decoding algorithm for neural generation. arXiv preprint arXiv:1611.08562.

Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 285–294, Prague, Czech Republic. Association for Computational Linguistics.

Dipendra K. Misra, Jaeyong Sung, Kevin Lee, and Ashutosh Saxena. 2016. Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions. The International Journal of Robotics Research, 35(1-3):281–300.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Ramakanth Pasunuru and Mohit Bansal. 2018. Game-based video-context dialogue. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 125–136, Brussels, Belgium. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Alan Ritter, Colin Cherry, and Bill Dolan. 2010. Unsupervised modeling of Twitter conversations. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 172–180, Los Angeles, California. Association for Computational Linguistics.

Nicolas Schrading, Cecilia Ovesdotter Alm, Ray Ptucha, and Christopher Homan. 2015. An analysis of domestic abuse discourse on Reddit. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2577–2583, Lisbon, Portugal. Association for Computational Linguistics.

M. Schuster and K. K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew Walter, Ashis Banerjee, Seth Teller, and Nicholas Roy. 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pages 1507–1514.

Jesse Thomason, Shiqi Zhang, Raymond J. Mooney, and Peter Stone. 2015. Learning to interpret natural language commands through human-robot dialog. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), pages 1923–1929.

Sida I. Wang, Samuel Ginn, Percy Liang, and Christopher D. Manning. 2017. Naturalizing a programming language via interactive learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 929–938, Vancouver, Canada. Association for Computational Linguistics.

Terry Winograd. 1971. Procedures as a representation for data in a computer program for understanding natural language. Technical report, MIT Center for Space Research.