
In Proceedings of the National Conference on Artificial Intelligence (AAAI 2011).

Understanding Natural Language Commands
for Robotic Navigation and Mobile Manipulation
Stefanie Tellex1 and Thomas Kollar1 and Steven Dickerson1 and
Matthew R. Walter and Ashis Gopal Banerjee and Seth Teller and Nicholas Roy
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, MA 02139

Abstract

This paper describes a new model for understanding natural language commands given to autonomous systems that perform navigation and mobile manipulation in semi-structured environments. Previous approaches have used models with fixed structure to infer the likelihood of a sequence of actions given the environment and the command. In contrast, our framework, called Generalized Grounding Graphs (G3), dynamically instantiates a probabilistic graphical model for a particular natural language command according to the command's hierarchical and compositional semantic structure. Our system performs inference in the model to successfully find and execute plans corresponding to natural language commands such as "Put the tire pallet on the truck." The model is trained using a corpus of commands collected using crowdsourcing. We pair each command with robot actions and use the corpus to learn the parameters of the model. We evaluate the robot's performance by inferring plans from natural language commands, executing each plan in a realistic robot simulator, and asking users to evaluate the system's performance. We demonstrate that our system can successfully follow many natural language commands from the corpus.

Commands from the corpus:
- Go to the first crate on the left and pick it up.
- Pick up the pallet of boxes in the middle and place them on the trailer to the left.
- Go forward and drop the pallets to the right of the first set of tires.
- Pick up the tire pallet off the truck and set it down.

(a) Robotic forklift (b) Sample commands

Figure 1: A target robotic platform for mobile manipulation and navigation (Teller et al., 2010), and sample commands from the domain, created by untrained human annotators. Our system can successfully follow these commands.
1 Introduction

To be useful teammates to human partners, robots must be able to robustly follow spoken instructions. For example, a human supervisor might tell an autonomous forklift, "Put the tire pallet on the truck," or the occupant of a wheelchair equipped with a robotic arm might say, "Get me the book from the coffee table." Such commands are challenging because they involve events ("Put"), objects ("the tire pallet"), and places ("on the truck"), each of which must be grounded to aspects of the world and which may be composed in many different ways. Figure 1 shows some of the wide variety of human-generated commands that our system is able to follow for the robotic forklift domain.

We frame the problem of following instructions as inferring the most likely robot state sequence from a natural language command. Previous approaches (Kollar et al., 2010; Shimizu and Haas, 2009) assume that natural language commands have a fixed and flat structure that can be exploited when inferring actions for the robot. However, this kind of fixed and flat sequential structure does not allow for variable arguments or nested clauses. At training time, when using a flat structure, the system sees the entire phrase "the pallet beside the truck" and has no way to separate the meanings of relations like "beside" from objects such as "the truck." Furthermore, a flat structure ignores the argument structure of verbs. For example, the command "put the box on the pallet beside the truck" has two arguments ("the box" and "on the pallet beside the truck"), both of which are necessary to learn an accurate meaning for the verb "put." In order to infer the meaning of unconstrained natural language commands, it is critical for the model to exploit these compositional and hierarchical linguistic structures at both learning and inference time.

To address these issues, we introduce a new model called Generalized Grounding Graphs (G3). A grounding graph is a probabilistic graphical model that is instantiated dynamically according to the compositional and hierarchical structure of a natural language command. Given a natural language command, the structure of the grounding graph model is induced using Spatial Description Clauses (SDCs), a semantic structure introduced by Kollar et al. (2010). Each SDC represents a linguistic constituent from the command that can be mapped to an aspect of the world or grounding, such as an object, place, path or event. In the G3 framework, the structure of each individual SDC and the random variables, nodes, and edges in the overall grounding graph depend on the specific words in the text.

The model is trained on a corpus of natural language commands paired with groundings for each part of the command, enabling the system to automatically learn meanings for words in the corpus, including complex verbs such as "put" and "take." We evaluate the system in the specific domain of natural language commands given to a robotic forklift, although our approach generalizes to any domain where linguistic constituents can be associated with specific actions and environmental features. Videos of example commands paired with inferred action sequences can be seen at http://spatial.csail.mit.edu/grounding.

¹ The first three authors contributed equally to this paper.

2 Related Work

Beginning with SHRDLU (Winograd, 1970), many systems have exploited the compositional structure of language to statically generate a plan corresponding to a natural language command (Dzifcak et al., 2009; Hsiao et al., 2008; MacMahon, Stankiewicz, and Kuipers, 2006; Skubic et al., 2004). Our work moves beyond this framework by defining a probabilistic graphical model according to the structure of the natural language command, inducing a distribution over plans and groundings. This approach enables the system to learn models for the meanings of words in the command and efficiently perform inference over many plans to find the best sequence of actions and groundings corresponding to each part of the command.

Others have used generative and discriminative models for understanding route instructions, but did not use the hierarchical nature of the language to understand mobile manipulation commands (Kollar et al., 2010; Matuszek, Fox, and Koscher, 2010; Vogel and Jurafsky, 2010). Shimizu and Haas (2009) use a flat, fixed action space to train a CRF that follows route instructions. Our approach, in contrast, interprets a grounding graph as a structured CRF, enabling the system to learn over a rich compositional action space.

The structure of SDCs builds on the work of Jackendoff (1983), Landau and Jackendoff (1993) and Talmy (2005), providing a computational instantiation of their formalisms. Katz (1988) devised ternary expressions to capture relations between words in a sentence. The SDC representation adds types for each clause, each of which induces a candidate space of groundings, as well as the ability to represent multiple landmark objects, making it straightforward to directly associate groundings with SDCs.

3 Approach

Our system takes as input a natural language command and outputs a plan for the robot. In order to infer a correct plan, it must find a mapping between parts of the natural language command and corresponding groundings (objects, paths, and places) in the world. We formalize this mapping with a grounding graph, a probabilistic graphical model with random variables corresponding to groundings in the world. Each grounding is taken from a semantic map of the environment, which consists of a metric map with the location, shape and name of each object and place, along with a topology that defines the environment's connectivity. At the top level, the system infers a grounding corresponding to the entire command, which is then interpreted as a plan for the robot to execute.

More formally, we define Γ to be the set of all groundings γi for a given command. In order to allow for uncertainty in candidate groundings, we introduce binary correspondence variables Φ; each φi ∈ Φ is true if γi ∈ Γ is correctly mapped to part of the natural language command, and false otherwise. Then we want to maximize the conditional distribution:

    argmax_Γ p(Φ = True | command, Γ)    (1)

This optimization is different from conventional CRF inference, where the goal is to infer the most likely hidden labels Φ. Although our setting is discriminative, we fix the correspondence variables Φ and search over features induced by Γ to find the most likely grounding. By formulating the problem in this way, we are able to perform domain-independent learning and inference.

3.1 Spatial Description Clauses

The factorization of the distribution in Equation 1 is defined according to the grounding graph constructed for a natural language command. To construct a probabilistic model according to the linguistic structure of the command, we decompose a natural language command into a hierarchy of Spatial Description Clauses or SDCs (Kollar et al., 2010). Each SDC corresponds to a constituent of the linguistic input and consists of a figure f, a relation r, and a variable number of landmarks li. A general natural language command is represented as a tree of SDCs. SDCs for the command "Put the tire pallet on the truck" appear in Figure 2a, and "Go to the pallet on the truck" in Figure 3a. Leaf SDCs in the tree contain only text in the figure field, such as "the tire pallet." Internal SDCs have other fields populated, such as "the tire pallet on the truck." The figure and landmark fields of internal SDCs are always themselves SDCs. The text in fields of an SDC does not have to be contiguous. For phrasal verbs such as "Put the tire pallet down," the relation field contains "Put down," and the landmark field is "the tire pallet."

Kollar et al. (2010) introduced SDCs and used them to define a probabilistic model that factors according to the sequential structure of language. Here we change the formalism slightly to collapse the verb and spatial relation fields into a single relation and exploit the hierarchical structure of SDCs in the factorization of the model.

The system infers groundings in the world corresponding to each SDC. To structure the search for groundings and limit the size of the search space, we follow Jackendoff (1983) and assign a type to each SDC:

• EVENT An action sequence that takes place (or should take place) in the world (e.g. "Move the tire pallet").
• OBJECT A thing in the world. This category includes people and the robot as well as physical objects (e.g. "Forklift," "the tire pallet," "the truck," "the person").
Figure 2: (a) SDC tree for "Put the pallet on the truck":

    EVENT1(r = Put,
           l = OBJ2(f = the pallet),
           l2 = PLACE3(r = on,
                       l = OBJ4(f = the truck)))

(b) Induced graphical model and factorization, with groundings γ1–γ4, correspondence variables φ1–φ4, and word fields λr1 ("Put"), λf2 ("the pallet"), λr3 ("on"), λf4 ("the truck").

Figure 3: (a) SDC tree for "Go to the pallet on the truck":

    EVENT1(r = Go,
           l = PATH2(r = to,
                     l = OBJ3(f = OBJ4(f = the pallet),
                              r = on,
                              l = OBJ5(f = the truck))))

(b) A different induced factor graph from Figure 2, with groundings γ1–γ5, correspondence variables φ1–φ5, and word fields λr1 ("Go"), λr2 ("to"), λf4 ("the pallet"), λr3 ("on"), λf5 ("the truck"). Structural differences between the two models are highlighted in gray.
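Concretely, the SDC trees in Figures 2a and 3a can be written down as a small recursive data structure. The following Python sketch is our illustration, not the authors' code; the class name, field names, and helper method are assumptions:

```python
# Minimal sketch of an SDC: a type, a figure f, a relation r, and
# landmarks l.  Leaf SDCs carry only text in the figure field;
# internal SDCs nest child SDCs in their figure and landmark fields.
from dataclasses import dataclass, field
from typing import List, Optional, Union

@dataclass
class SDC:
    type: str                                     # EVENT, OBJECT, PLACE, or PATH
    f: Optional[Union[str, "SDC"]] = None         # figure: text (leaf) or child SDC
    r: Optional[str] = None                       # relation text, e.g. "on"
    l: List["SDC"] = field(default_factory=list)  # landmark SDCs

    def is_leaf(self) -> bool:
        """A leaf SDC has only text in its figure field."""
        return isinstance(self.f, str) and self.r is None and not self.l

# Figure 2a: "Put the pallet on the truck" -- "on the truck" is an
# argument of the EVENT.
put = SDC("EVENT", r="Put",
          l=[SDC("OBJECT", f="the pallet"),
             SDC("PLACE", r="on", l=[SDC("OBJECT", f="the truck")])])

# Figure 3a: "Go to the pallet on the truck" -- here "on the truck"
# modifies "the pallet", so it nests inside the OBJECT SDC.
go = SDC("EVENT", r="Go",
         l=[SDC("PATH", r="to",
                l=[SDC("OBJECT", f=SDC("OBJECT", f="the pallet"),
                       r="on", l=[SDC("OBJECT", f="the truck")])])])
```

Note how "on the truck" attaches to the EVENT in the first tree but nests inside the OBJECT in the second; this structural difference is what induces the two different factor graphs.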

• PLACE A place in the world (e.g. "on the truck," or "next to the tire pallet").
• PATH A path or path fragment through the world (e.g. "past the truck," or "toward receiving").

Each EVENT and PATH SDC contains a relation with one or more core arguments. Since almost all relations (e.g. verbs) take two or fewer core arguments, we use at most two landmark fields l1 and l2 for the rest of the paper. We have built an automatic SDC extractor that uses the Stanford dependencies, which are extracted using the Stanford Parser (de Marneffe, MacCartney, and Manning, 2006).

3.2 Generalized Grounding Graphs

We present an algorithm for constructing a grounding graph according to the linguistic structure defined by a tree of SDCs. The induced grounding graph for a given command is a bipartite factor graph corresponding to a factorization of the distribution from Equation 1 with factors Ψi and normalization constant Z:

    p(Φ | command, Γ) = p(Φ | SDCs, Γ)              (2)
                      = (1/Z) ∏i Ψi(φi, SDCi, Γ)    (3)

The graph has two types of nodes: random variables and factors. First we define the following random variables:

• φi True if the grounding γi corresponds to the ith SDC, and false otherwise.
• λfi The words of the figure field of the ith SDC.
• λri The words of the relation field of the ith SDC.
• λl1i, λl2i The words of the first and second landmark fields of the ith SDC; if non-empty, always a child SDC.
• γfi, γl1i, γl2i ∈ Γ The groundings associated with the corresponding field(s) of the ith SDC: the state sequence of the robot (or an object), or a location in the semantic map.

For a phrase such as "the pallet on the truck," λri is the word "on," and γfi and γl1i correspond to objects in the world, represented as a location, a bounding box, and a list of labels. φi would be true if the induced features between γfi and γl1i correspond to "on," and false otherwise.

Each random variable connects to one or more factor nodes, Ψi. Graphically, there is an edge between a variable and a factor if the factor takes that variable as an argument. The specific factors created depend on the structure of the SDC tree. The factors Ψ fall into two types:

• Ψ(φi, λfi, γi) for leaf SDCs.
• Ψ(φi, λri, γfi, γl1i) or Ψ(φi, λri, γfi, γl1i, γl2i) for internal SDCs.

Leaf SDCs contain only λfi and a grounding γfi. For example, the phrase "the truck" is a leaf SDC that generates the subgraph in Figure 3 containing variables γ5, φ5 and λf5. The value of γ5 is an object in the world, and φ5 is true if the
object corresponds to the words "the truck" and false otherwise (for example, if γ5 was a pallet).

An internal SDC has text in the relation field and SDCs in the figure and landmark fields. For these SDCs, φi depends on the text of the relation field, and the groundings (rather than the text) of the figure and landmark fields. For example, "the pallet on the truck" is an internal SDC, with a corresponding grounding that is a place in the world. This SDC generates the subgraph in Figure 3 containing the variables γ4, γ5, φ3, and λr3. φ3 is true if γ4 is "on" γ5, and false otherwise.

Figures 2 and 3 show the SDC trees and induced grounding graphs for two similar commands: "Put the pallet on the truck" and "Go to the pallet on the truck." In the first case, "Put" is a two-argument verb that takes an OBJECT and a PLACE. The model in Figure 2b connects the grounding γ3 for "on the truck" directly to the factor for "Put." In the second case, "on the truck" modifies "the pallet." For this reason, the grounding γ4 for "on the truck" is connected to "the pallet." The differences between the two models are highlighted in gray.

In this paper we use generalized grounding graphs to define a discriminative model in order to train the model from a large corpus of data. However, the same graphical formalism can also be used to define factors for a generative graphical model, or even a constraint network that does not take a probabilistic approach at all. For example, the generative model described in Kollar et al. (2010) for following route instructions is a special case of this more general framework.

We model the distribution in Equation 2 as a conditional random field in which each potential function Ψ takes the following form (Lafferty, McCallum, and Pereira, 2001):

    Ψi(φi, SDCi, Γ) = exp( Σk µk sk(φi, SDCi, Γ) )    (4)

Here, sk are feature functions that take as input the binary correspondence variable, an SDC and a set of groundings, and output a binary decision. The µk are the weights corresponding to the output of a particular feature function.

At training time, we observe SDCs, their corresponding groundings Γ, and the output variable Φ. In order to learn the parameters µk that maximize the likelihood of the training dataset, we compute the gradient and use the Mallet toolkit (McCallum, 2002) to optimize the parameters of the model via gradient descent with L-BFGS (Andrew and Gao, 2007). When inferring a plan, we optimize over Γ by fixing Φ and the SDCs as in Equation 1.

3.3 Features

To train the model, the system extracts binary features sk for each factor Ψi. These features correspond to the degree to which each Γ correctly grounds SDCi. For a relation such as "on," a natural feature is whether the landmark grounding supports the figure grounding. However, the feature supports(γfi, γli) alone is not enough to enable the model to learn that "on" corresponds to supports(γfi, γli). Instead we need a feature that also takes into account the word "on":

    supports(γfi, γli) ∧ ("on" ∈ λri)    (5)

More generally, we implemented a set of base features involving geometric relations between the γi. Then to compute features sk we generate the Cartesian product of the base features with the presence of words in the corresponding fields of the SDC. A second problem is that many natural features between geometric objects are continuous rather than binary valued. For example, for the relation "next to," one feature is the normalized distance between γfi and γli. To solve this problem, we discretize continuous features into uniform bins. We use 49 base features for leaf OBJECT and PATH SDCs, 56 base features for internal OBJECT and PATH SDCs, 112 base features for EVENT SDCs and 47 base features for PATH SDCs. This translates to 147,274 binary features after the Cartesian product with words and discretization.

For OBJECTs and PLACEs, geometric features correspond to relations between two three-dimensional boxes in the world. All continuous features are first normalized so they are scale-invariant, then discretized to be a set of binary features. Examples include:

• supports(γfi, γli), for "on" and "pick up."
• distance(γfi, γli), for "near" and "by."
• avs(γfi, γli), for "in front of" and "to the left of." Attention Vector Sum or AVS (Regier and Carlson, 2001) measures the degree to which relations like "in front of" or "to the left of" are true for particular groundings.

In order to compute features for relations like "to the left" or "to the right," the system needs to compute a frame of reference, or the orientation of a coordinate system. We compute these features for frames of reference in all four cardinal directions at the agent's starting orientation, the agent's ending orientation, and the agent's average orientation during the action sequence.

For PATH and EVENT SDCs, groundings correspond to the location and trajectory of the robot and any objects it manipulates over time. Base features are computed with respect to the entire motion trajectory of a three-dimensional object through space. Examples include:

• The displacement of a path toward or away from a ground object.
• The average distance of a path from a ground object.

We also use the complete set of features described in Tellex (2010). Finally, we compute the same set of features as for OBJECTs and PLACEs using the state at the beginning of the trajectory, the end of the trajectory, and the average during the trajectory.

The system must map noun phrases such as "the wheel skid" to a grounding γfi for a physical object in the world with location, geometry, and a set of labels such as {"tires", "pallet"}. To address this issue we introduce a second class of base features that correspond to the likelihood that an unknown word actually denotes a known concept. The system computes word-label similarity in two ways: using WordNet, and from co-occurrence statistics obtained by downloading millions of images and corresponding tags from Flickr (Kollar et al., 2010).
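The base-feature construction described above, discretizing continuous geometric features into uniform bins and then taking the Cartesian product with the words of the relation field, can be sketched in a few lines. The following Python is our illustrative sketch, not the authors' implementation; the function names, bin count, and feature-name format are assumptions:

```python
# Illustrative sketch of Section 3.3's feature scheme: boolean base
# features pass through directly, continuous base features are binned,
# and every surviving indicator is crossed with each relation word,
# yielding binary features like supports AND ("on" in the relation).

def discretize(value, n_bins=5, lo=0.0, hi=1.0):
    """Map a normalized continuous feature into one of n_bins uniform bins."""
    value = min(max(value, lo), hi)
    return min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1)

def binary_features(base_features, relation_words, n_bins=5):
    """Cartesian product of (discretized) base features with relation words.

    base_features:  dict of feature name -> bool or normalized float
    relation_words: words of the SDC's relation field, e.g. ["on"]
    Returns the set of binary feature names that fire.
    """
    fired = set()
    for name, value in base_features.items():
        if isinstance(value, bool):
            if value:  # boolean base feature: fires as-is
                fired.update(f"{name}_AND_{w}" for w in relation_words)
        else:      # continuous base feature: one indicator per occupied bin
            b = discretize(value, n_bins)
            fired.update(f"{name}_bin{b}_AND_{w}" for w in relation_words)
    return fired

# For "the pallet on the truck": supports(...) holds and the normalized
# distance between the two groundings is small.
feats = binary_features({"supports": True, "distance": 0.1}, ["on"])
```

During training, each fired feature name would correspond to one sk with its own weight µk, which is how the model learns that "on" pairs with supports rather than, say, distance.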
3.4 Inference

Given a command, we want to find the set of most probable groundings. During inference, we fix the values of Φ and the SDCs and search for groundings Γ that maximize the probability of a match, as in Equation 1. Because the space of potential groundings includes all permutations of object assignments, as well as every feasible sequence of actions the agent might perform, the search space becomes large as the number of objects and potential manipulations in the world increases. To make the inference tractable, we use a beam search with a fixed beam width of twenty to bound the number of candidate groundings considered for any particular SDC.

A second optimization is that we search in two passes: the algorithm first finds and scores candidate groundings for OBJECT and PLACE SDCs, then uses those candidates to search the much larger space of robot action sequences, corresponding to EVENTs and PATHs. This optimization exploits the types and independence relations among SDCs to structure the search so that these candidates need to be computed only once, rather than for every possible EVENT.

Once a full set of candidate OBJECT and PLACE groundings is obtained up to the beam width, the system searches over possible action sequences for the agent, scoring each sequence against the language in the EVENT and PATH SDCs of the command. After searching over potential action sequences, the system returns a set of object groundings and a sequence of actions for the agent to perform. Figure 4 shows the actions and groundings identified in response to the command "Put the tire pallet on the truck."

4 Evaluation

To train and evaluate the system, we collected a corpus of natural language commands paired with robot actions and environment state sequences. We use this corpus both to train the model and to evaluate end-to-end performance of the system when following real-world commands from untrained users.

4.1 Corpus

To quickly generate a large corpus of examples of language paired with robot plans, we posted videos of action sequences to Amazon's Mechanical Turk (AMT) and collected language associated with each video. The videos showed a simulated robotic forklift engaging in an action such as picking up a pallet or moving through the environment. Paired with each video, we had a complete log of the state of the environment and the robot's actions. Subjects were asked to type a natural language command that would cause an expert human forklift operator to carry out the action shown in the video. We collected commands from 45 subjects for twenty-two different videos showing the forklift executing an action in a simulated warehouse. Each subject interpreted each video only once, but we collected multiple commands (an average of 13) for each video.

Actions included moving objects from one location to another, picking up objects, and driving to specific locations.

Subjects did not see any text describing the actions or objects in the video, leading to a wide variety of natural language commands, including nonsensical ones such as "Load the forklift onto the trailer" and misspelled ones such as "tyre" (tire) or "tailor" (trailer). Example commands from the corpus are shown in Figure 1.

To train the system, each SDC must be associated with a grounded object in the world. We manually annotated SDCs in the corpus, and then annotated each OBJECT and PLACE SDC with an appropriate grounding. Each PATH and EVENT grounding was automatically associated with the action or agent path from the log associated with the original video. This approximation is faster to annotate but leads to problems for compound commands such as "Pick up the right skid of tires and place it parallel and a bit closer to the trailer," where each EVENT SDC refers to a different part of the state sequence.

The annotations above provided positive examples of grounded language. In order to train the model, we also need negative examples. We generated negative examples by associating a random grounding with each SDC. Although this heuristic works well for EVENTs and PATHs, ambiguous object SDCs such as "the pallet" or "the one on the right" are often associated with a different, but still correct, object (in the context of that phrase alone). For these examples we re-annotated them as positive.

4.2 Cost Function Evaluation

Using the annotated data, we trained the model and evaluated its performance on a held-out test set in a similar environment. We assessed the model's performance at predicting the correspondence variable given access to SDCs and groundings. The test set pairs a disjoint set of scenarios from the training set with language given by subjects from AMT.

SDC type   Precision   Recall   F-score   Accuracy
OBJECT     0.93        0.94     0.94      0.91
PLACE      0.70        0.70     0.70      0.70
PATH       0.86        0.75     0.80      0.81
EVENT      0.84        0.73     0.78      0.80
Overall    0.90        0.88     0.89      0.86

Table 1: Performance of the learned model at predicting the correspondence variable φ.

Table 1 reports overall performance on this test set and performance broken down by SDC type. The performance of the model on this corpus indicates that it robustly learns to predict when SDCs match groundings from the corpus. We evaluated how much training was required to achieve good performance on the test dataset and found that the test error asymptotes at around 1,000 (of 3,000) annotated SDCs.

For OBJECT SDCs, correctly-classified high-scoring examples in the dataset include "the tire pallet," "tires," "pallet," "pallette [sic]," "the truck," and "the trailer." Low-scoring examples included SDCs with incorrectly annotated groundings that the system actually got right. A second class
(a) Object groundings (b) Pick up the pallet (c) Put it on the truck

Figure 4: A sequence of the actions that the forklift takes in response to the command, "Put the tire pallet on the truck." (a) The search grounds objects and places in the world based on their initial positions. (b) The forklift executes the first action, picking up the pallet. (c) The forklift puts the pallet on the trailer.

of low-scoring examples were due to words that did not appear many times in the corpus.

For PLACE SDCs, the system often correctly classifies examples involving the relation "on," such as "on the trailer." However, the model often misclassifies PLACE SDCs that involve frames of reference. For example, "just to the right of the furthest skid of tires" requires the model to have features for "furthest" and the principal orientation of the "skid of tires" to reason about which location should be grounded to the language "to the right," while "between the pallets on the ground and the other trailer" requires reasoning about multiple objects and a PLACE SDC that has two arguments.

For EVENT SDCs, the model generally performs well on "pick up," "move," and "take" commands. The model correctly predicts commands such as "Lift pallet box," "Pick up the pallets of tires," and "Take the pallet of tires on the left side of the trailer." We incorrectly predict plans for commands like "move back to your original spot" or "pull parallel to the skid next to it." The word "parallel" appeared in the corpus only twice, which was probably insufficient to learn a good model. "Move" had few good negative examples, since the training set did not contain, as contrast, paths in which the forklift did not move.

4.3 End-to-end Evaluation

The fact that the model performs well at predicting the correspondence variable from annotated SDCs and groundings is promising but does not necessarily translate to good end-to-end performance when inferring groundings associated with a natural language command (as in Equation 1).

To evaluate end-to-end performance, we inferred plans given only commands from the test set and a starting location for the robot. We segmented commands containing multiple top-level SDCs into separate clauses, and utilized the system to infer a plan and a set of groundings for each clause. Plans were then simulated on a realistic, high-fidelity robot simulator from which we created a video of the robot's actions. We uploaded these videos to AMT, where subjects viewed the video paired with a command and reported their agreement with the statement, "The forklift in the video is executing the above spoken command," on a five-point Likert scale. We report command-video pairs as correct if the subjects agreed or strongly agreed with the statement, and incorrect if they were neutral, disagreed or strongly disagreed. We collected five annotator judgments for each command-video pair.

To validate our evaluation strategy, we conducted the evaluation using known correct and incorrect command-video pairs. In the first condition, subjects saw a command paired with the original video that a different subject watched when creating the command. In the second condition, the subject saw the command paired with a random video that was not used to generate the original command. As expected, there was a large difference in performance in the two conditions, shown in Table 2. Despite the diverse and challenging language in our corpus, new annotators agree that commands in the corpus are consistent with the original video. These results show that language in the corpus is understandable by a different annotator.

                                  Precision
Command with original video       0.91 (±0.01)
Command with random video         0.11 (±0.02)

Table 2: The fraction of end-to-end commands considered correct by our annotators for known correct and incorrect videos. We show the 95% confidence intervals in parentheses.

We then evaluated our system by considering three different configurations. Serving as a baseline, the first consisted of ground truth SDCs and a random probability distribution, resulting in a constrained search over a random cost function. The second configuration involved ground truth SDCs and our learned distribution, and the third consisted of automatically extracted SDCs with our learned distribution.

Due to the overhead of the end-to-end evaluation, we consider results for the top 30 commands with the highest posterior probability of the final plan correctly corresponding to the command text for each configuration. In order to evaluate the relevance of the probability assessment, we also evaluate the entire test set for ground truth SDCs and our learned distribution. Table 3 reports the performance of each configuration.

                                            Precision
Constrained search, random cost             0.28 (±0.05)
Ground truth SDCs (top 30), learned cost    0.63 (±0.08)
Automatic SDCs (top 30), learned cost       0.54 (±0.08)
Ground truth SDCs (all), learned cost       0.47 (±0.04)

Table 3: The fraction of commands considered correct by our annotators for different configurations of our system. We show the 95% confidence intervals in parentheses.

with correct robot actions. We demonstrate promising performance at following natural language commands from a challenging corpus collected from untrained users.

Our work constitutes a step toward robust language understanding systems, but many challenges remain. One limitation of our approach is the need for annotated training data. Unsupervised or semi-supervised modeling frameworks in which the object groundings are latent variables have the potential to exploit much larger corpora without the expense of annotation. Another limitation is the size of the search space; more complicated task domains require deeper search and more sophisticated algorithms. In particular, we plan to extend our approach to perform inference over possible parses as well as groundings in the world.

Our model provides a starting point for incorporating dialog, because it not only returns a plan corresponding to the command, but also groundings (with confidence scores) for
uration along with their 95% confidence intervals. The rela- each component in the command. This information can en-
tively high performance of the random cost function config- able the system to identify confusing parts of the command
uration relative to the random baseline for the corpus is due in order to ask clarifying questions.
the fact that the robot is not acting completely randomly on There are many complex linguistic phenomena that our
account of the constrained search space. In all conditions, framework does not yet support, such as abstract objects,
the system performs statistically significantly better than a negation, anaphora, conditionals, and quantifiers. Many of
random cost function. these could be addressed with a richer model, as in Liang,
Jordan, and Klein (2011). For example, our framework does
The system performs noticeably better on the 30 most
not currently handle negation, such as “Don’t pick up the
probable commands than on the entire test set. This result
pallet,” but it might be possible to do so by fixing some cor-
indicates the validity of our probability measure, suggesting
respondence variables to false (rather than true) during in-
that the system has some knowledge of when it is correct and
ference. The system could represent anaphora such as “it”
incorrect. The system could use this information to decide
in “Pick up the pallet and put it on the truck” by adding a
when to ask for confirmation before acting.
factor linking “it” with its referent, “the pallet.” The system
The system qualitatively produces compelling end-to-end
could handle abstract objects such as “the row of pallets”
performance. Even when the system makes a mistake, it is
if all possible objects were added to the space of candidate
often partially correct. For example, it might pick up the left
groundings. Since each of these modifications would sub-
tire pallet instead of the right one. Other problems stem from
stantially increase the size of the search space, solving these
ambiguous or unusual language in the corpus commands,
problems will require efficient approximate inference tech-
such as “remove the goods” or “then swing to the right,”
niques combined with heuristic functions to make the search
that make the inference particularly challenging. Despite
problem tractable.
these limitations, however, our system successfully follows
commands such as “Put the tire pallet on the truck,” “Pick up
the tire pallet” and “put down the tire pallet” and “go to the 6 Acknowledgments
truck,” using only data from the corpus to learn the model. We would like to thank Alejandro Perez, as well as the an-
Although we conducted our evaluation with single SDCs, notators on Amazon Mechanical Turk and the members of
the framework supports multiple SDCs by performing beam the Turker Nation forum. This work was sponsored by the
search to find groundings for all components in both SDCs. Robotics Consortium of the U.S Army Research Laboratory
Using this algorithm, the system successfully followed the under the Collaborative Technology Alliance Program, Co-
commands listed in Figure 1. These commands are more operative Agreement W911NF-10-2-0016, and by the Office
challenging than those with single SDCs because the search of Naval Research under MURI N00014-07-1-0749.
space is larger, because there are often dependencies be-
tween commands, and because these commands often con- References
tain unresolved pronouns like “it.” Andrew, G., and Gao, J. 2007. Scalable training of L1-
regularized log-linear models. In Proc. Int’l Conf. on Ma-
5 Conclusion chine Learning (ICML).
In this paper, we present an approach for automatically de Marneffe, M.; MacCartney, B.; and Manning, C. 2006.
generating a probabilistic graphical model according to the Generating typed dependency parses from phrase struc-
structure of natural language navigation or mobile manip- ture parses. In Proc. Int’l Conf. on Language Resources
ulation commands. Our system automatically learns the and Evaluation (LREC), 449–454.
meanings of complex manipulation verbs such as “put” or Dzifcak, J.; Scheutz, M.; Baral, C.; and Schermerhorn, P.
“take” from a corpus of natural language commands paired 2009. What to do and how to do it: Translating natu-
ral language directives into temporal and dynamic logic McCallum, A. K. 2002. MALLET: A machine learning for
representation for goal management and action execution. language toolkit. http://mallet.cs.umass.edu.
In Proc. IEEE Int’l Conf. on Robotics and Automation Regier, T., and Carlson, L. A. 2001. Grounding spatial
(ICRA), 4163–4168. language in perception: An empirical and computational
Hsiao, K.; Tellex, S.; Vosoughi, S.; Kubat, R.; and Roy, investigation. J. of Experimental Psychology: General
D. 2008. Object schemas for grounding language in a 130(2):273–98.
responsive robot. Connection Science 20(4):253–276. Shimizu, N., and Haas, A. 2009. Learning to follow nav-
Jackendoff, R. S. 1983. Semantics and Cognition. MIT igational route instructions. In Proc. Int’l Joint Conf. on
Press. 161–187. Artificial Intelligence (IJCAI), 1488–1493.
Katz, B. 1988. Using English for indexing and retrieving. Skubic, M.; Perzanowski, D.; Blisard, S.; Schultz, A.;
In Proc. Conf. on Adaptivity, Personilization and Fusion Adams, W.; Bugajska, M.; and Brock, D. 2004. Spa-
of Heterogeneous Information (RIAO). MIT Press. tial language for human-robot dialogs. IEEE Trans. on
Kollar, T.; Tellex, S.; Roy, D.; and Roy, N. 2010. Systems, Man, and Cybernetics, Part C: Applications and
Toward understanding natural language directions. In Reviews 34(2):154–167.
Proc. ACM/IEEE Int’l Conf. on Human-Robot Interaction Talmy, L. 2005. The fundamental system of spatial schemas
(HRI), 259–266. in language. In Hamp, B., ed., From Perception to Mean-
Lafferty, J. D.; McCallum, A.; and Pereira, F. C. N. 2001. ing: Image Schemas in Cognitive Linguistics. Mouton de
Conditional random fields: Probabilistic models for seg- Gruyter.
menting and labeling sequence data. In Proc. Int’l Conf. Teller, S.; Walter, M. R.; Antone, M.; Correa, A.; Davis, R.;
on Machine Learning (ICML), 282–289. Fletcher, L.; Frazzoli, E.; Glass, J.; How, J.; Huang, A.;
Landau, B., and Jackendoff, R. 1993. “What” and “where” Jeon, J.; Karaman, S.; Luders, B.; Roy, N.; and Sainath,
in spatial language and spatial cognition. Behavioral and T. 2010. A voice-commandable robotic forklift work-
Brain Sciences 16:217–265. ing alongside humans in minimally-prepared outdoor en-
Liang, P.; Jordan, M. I.; and Klein, D. 2011. Learning vironments. In Proc. IEEE Int’l Conf. on Robotics and
dependency-based compositional semantics. In Proc. As- Automation (ICRA), 526–533.
sociation for Computational Linguistics (ACL). Tellex, S. 2010. Natural Language and Spatial Reason-
MacMahon, M.; Stankiewicz, B.; and Kuipers, B. 2006. ing. Ph.D. Dissertation, Massachusetts Institute of Tech-
Walk the talk: Connecting language, knowledge, and ac- nology.
tion in route instructions. In Proc. Nat’l Conf. on Artificial Vogel, A., and Jurafsky, D. 2010. Learning to follow naviga-
Intelligence (AAAI), 1475–1482. tional directions. In Proc. Association for Computational
Matuszek, C.; Fox, D.; and Koscher, K. 2010. Follow- Linguistics (ACL), 806–814.
ing directions using statistical machine translation. In Winograd, T. 1970. Procedures as a representation for
Proc. ACM/IEEE Int’l Conf. on Human-Robot Interaction data in a computer program for understanding natural
(HRI), 251–258. language. Ph.D. Dissertation, Massachusetts Institute of
Technology.