Figure 2: (a) SDC tree for "Put the pallet on the truck." (b) Induced graphical model and factorization.

Figure 3: (a) SDC tree for "Go to the pallet on the truck." (b) A different induced factor graph from Figure 2. Structural differences between the two models are highlighted in gray.

Figure 4: A sequence of the actions that the forklift takes in response to the command, "Put the tire pallet on the truck." (a) Object groundings: the search grounds objects and places in the world based on their initial positions. (b) Pick up the pallet: the forklift executes the first action, picking up the pallet. (c) Put it on the truck: the forklift puts the pallet on the trailer.
of low-scoring examples were due to words that did not appear many times in the corpus.

For PLACE SDCs, the system often correctly classifies examples involving the relation "on," such as "on the trailer." However, the model often misclassifies PLACE SDCs that involve frame-of-reference. For example, "just to the right of the furthest skid of tires" requires the model to have features for "furthest" and for the principal orientation of the "skid of tires" in order to reason about which location should be grounded to the language "to the right." Similarly, "between the pallets on the ground and the other trailer" requires reasoning about multiple objects and a PLACE SDC that has two arguments.

For EVENT SDCs, the model generally performs well on "pick up," "move," and "take" commands. The model correctly predicts commands such as "Lift pallet box," "Pick up the pallets of tires," and "Take the pallet of tires on the left side of the trailer." We incorrectly predict plans for commands like "move back to your original spot" or "pull parallel to the skid next to it." The word "parallel" appeared in the corpus only twice, which was probably insufficient to learn a good model. "Move" had few good negative examples, since the training set did not contain, to use as contrast, paths in which the forklift did not move.

4.3 End-to-end Evaluation

The fact that the model performs well at predicting the correspondence variable from annotated SDCs and groundings is promising but does not necessarily translate to good end-to-end performance when inferring groundings associated with a natural language command (as in Equation 1).

To evaluate end-to-end performance, we inferred plans given only commands from the test set and a starting location for the robot. We segmented commands containing multiple top-level SDCs into separate clauses and used the system to infer a plan and a set of groundings for each clause. Plans were then simulated on a realistic, high-fidelity robot simulator, from which we created a video of the robot's actions. We uploaded these videos to AMT, where subjects viewed the video paired with a command and reported their agreement with the statement, "The forklift in the video is executing the above spoken command," on a five-point Likert scale. We report command-video pairs as correct if the subjects agreed or strongly agreed with the statement, and incorrect if they were neutral, disagreed, or strongly disagreed. We collected five annotator judgments for each command-video pair.

To validate our evaluation strategy, we conducted the evaluation using known correct and incorrect command-video pairs. In the first condition, subjects saw a command paired with the original video that a different subject watched when creating the command. In the second condition, the subject saw the command paired with a random video that was not used to generate the original command. As expected, there was a large difference in performance between the two conditions, as shown in Table 2. Despite the diverse and challenging language in our corpus, new annotators agree that commands in the corpus are consistent with the original video. These results show that language in the corpus is understandable by a different annotator.

                                    Precision
  Command with original video       0.91 (±0.01)
  Command with random video         0.11 (±0.02)

Table 2: The fraction of end-to-end commands considered correct by our annotators for known correct and incorrect videos. We show the 95% confidence intervals in parentheses.

We then evaluated our system by considering three different configurations. Serving as a baseline, the first consisted of ground truth SDCs and a random probability distribution, resulting in a constrained search over a random cost function. The second configuration involved ground truth SDCs and our learned distribution, and the third consisted of automatically extracted SDCs with our learned distribution.
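For concreteness, the scoring rule above (a judgment counts as correct only when the annotator agreed or strongly agreed) and a proportion-style confidence interval like those reported in Tables 2 and 3 can be sketched as follows. This is our own illustration, not code from the paper: the function names are invented, and the normal approximation is only one plausible way to compute the intervals, since the paper does not state how they were obtained.

```python
import math

# Hypothetical sketch of the scoring rule described above (not code from
# the paper). Likert ratings: 1 = strongly disagree ... 5 = strongly agree.
# A judgment counts as correct only if the annotator agreed (4) or
# strongly agreed (5); neutral and below count as incorrect.

def fraction_correct(likert_ratings):
    """Fraction of annotator judgments that count as correct."""
    correct = sum(1 for rating in likert_ratings if rating >= 4)
    return correct / len(likert_ratings)

def ci95_half_width(p, n):
    """Half-width of a 95% confidence interval for a proportion, using the
    normal approximation (one plausible choice; the paper does not say how
    its intervals were computed)."""
    return 1.96 * math.sqrt(p * (1.0 - p) / n)

# Made-up example: 100 judgments, 80 of them rated 4 or 5.
ratings = [5] * 70 + [4] * 10 + [3] * 12 + [2] * 5 + [1] * 3
p = fraction_correct(ratings)
print(f"{p:.2f} (±{ci95_half_width(p, len(ratings)):.2f})")  # 0.80 (±0.08)
```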
                                              Precision
  Constrained search, random cost             0.28 (±0.05)
  Ground truth SDCs (top 30), learned cost    0.63 (±0.08)
  Automatic SDCs (top 30), learned cost       0.54 (±0.08)
  Ground truth SDCs (all), learned cost       0.47 (±0.04)

Table 3: The fraction of commands considered correct by our annotators for different configurations of our system. We show the 95% confidence intervals in parentheses.

Due to the overhead of the end-to-end evaluation, we consider results for the top 30 commands with the highest posterior probability of the final plan correctly corresponding to the command text for each configuration. In order to evaluate the relevance of the probability assessment, we also evaluate the entire test set for ground truth SDCs and our learned distribution. Table 3 reports the performance of each configuration along with their 95% confidence intervals. The relatively high performance of the random cost function configuration relative to the random baseline for the corpus is due to the fact that the robot is not acting completely randomly, on account of the constrained search space. In all conditions, the system performs statistically significantly better than a random cost function.

The system performs noticeably better on the 30 most probable commands than on the entire test set. This result indicates the validity of our probability measure, suggesting that the system has some knowledge of when it is correct and incorrect. The system could use this information to decide when to ask for confirmation before acting.

The system qualitatively produces compelling end-to-end performance. Even when the system makes a mistake, it is often partially correct. For example, it might pick up the left tire pallet instead of the right one. Other problems stem from ambiguous or unusual language in the corpus commands, such as "remove the goods" or "then swing to the right," which makes the inference particularly challenging. Despite these limitations, however, our system successfully follows commands such as "Put the tire pallet on the truck," "Pick up the tire pallet," "put down the tire pallet," and "go to the truck," using only data from the corpus to learn the model.

Although we conducted our evaluation with single SDCs, the framework supports multiple SDCs by performing beam search to find groundings for all components in both SDCs. Using this algorithm, the system successfully followed the commands listed in Figure 1. These commands are more challenging than those with single SDCs because the search space is larger, because there are often dependencies between commands, and because these commands often contain unresolved pronouns like "it."

5 Conclusion

In this paper, we present an approach for automatically generating a probabilistic graphical model according to the structure of natural language navigation or mobile manipulation commands. Our system automatically learns the meanings of complex manipulation verbs such as "put" or "take" from a corpus of natural language commands paired with correct robot actions. We demonstrate promising performance at following natural language commands from a challenging corpus collected from untrained users.

Our work constitutes a step toward robust language understanding systems, but many challenges remain. One limitation of our approach is the need for annotated training data. Unsupervised or semi-supervised modeling frameworks in which the object groundings are latent variables have the potential to exploit much larger corpora without the expense of annotation. Another limitation is the size of the search space; more complicated task domains require deeper search and more sophisticated algorithms. In particular, we plan to extend our approach to perform inference over possible parses as well as groundings in the world.

Our model provides a starting point for incorporating dialog, because it not only returns a plan corresponding to the command, but also groundings (with confidence scores) for each component in the command. This information can enable the system to identify confusing parts of the command in order to ask clarifying questions.

There are many complex linguistic phenomena that our framework does not yet support, such as abstract objects, negation, anaphora, conditionals, and quantifiers. Many of these could be addressed with a richer model, as in Liang, Jordan, and Klein (2011). For example, our framework does not currently handle negation, such as "Don't pick up the pallet," but it might be possible to do so by fixing some correspondence variables to false (rather than true) during inference. The system could represent anaphora such as "it" in "Pick up the pallet and put it on the truck" by adding a factor linking "it" with its referent, "the pallet." The system could handle abstract objects such as "the row of pallets" if all possible objects were added to the space of candidate groundings. Since each of these modifications would substantially increase the size of the search space, solving these problems will require efficient approximate inference techniques combined with heuristic functions to make the search problem tractable.
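To make the negation proposal above concrete, the following is a minimal sketch of how fixing correspondence variables to false could enter the grounding search. It is our own illustration rather than the paper's implementation: `score_groundings` and the `factor_prob` callback are hypothetical names, and the product-of-factors form simply mirrors the per-SDC factorization sketched in Figures 2 and 3.

```python
# Minimal sketch (our own illustration, not the paper's code) of handling
# negation by fixing correspondence variables. For a negated clause such as
# "Don't pick up the pallet," the search should prefer groundings for which
# that clause's correspondence variable is False rather than True.
#
# `factor_prob(sdc, grounding, value)` is an assumed callback that returns
# the learned factor's probability that the correspondence variable for
# `sdc`, given `grounding`, takes the value `value`.

def score_groundings(sdcs, groundings, negated_sdcs, factor_prob):
    """Product of per-SDC factors; negated SDCs are required not to hold."""
    score = 1.0
    for sdc, grounding in zip(sdcs, groundings):
        desired = sdc not in negated_sdcs  # False for negated clauses
        score *= factor_prob(sdc, grounding, desired)
    return score
```

The grounding search would then maximize this score over candidate groundings exactly as in the positive case; only the target value of the fixed correspondence variables changes.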
6 Acknowledgments

We would like to thank Alejandro Perez, as well as the annotators on Amazon Mechanical Turk and the members of the Turker Nation forum. This work was sponsored by the Robotics Consortium of the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program, Cooperative Agreement W911NF-10-2-0016, and by the Office of Naval Research under MURI N00014-07-1-0749.

References

Andrew, G., and Gao, J. 2007. Scalable training of L1-regularized log-linear models. In Proc. Int'l Conf. on Machine Learning (ICML).
de Marneffe, M.; MacCartney, B.; and Manning, C. 2006. Generating typed dependency parses from phrase structure parses. In Proc. Int'l Conf. on Language Resources and Evaluation (LREC), 449–454.
Dzifcak, J.; Scheutz, M.; Baral, C.; and Schermerhorn, P. 2009. What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In Proc. IEEE Int'l Conf. on Robotics and Automation (ICRA), 4163–4168.
Hsiao, K.; Tellex, S.; Vosoughi, S.; Kubat, R.; and Roy, D. 2008. Object schemas for grounding language in a responsive robot. Connection Science 20(4):253–276.
Jackendoff, R. S. 1983. Semantics and Cognition. MIT Press. 161–187.
Katz, B. 1988. Using English for indexing and retrieving. In Proc. Conf. on Adaptivity, Personalization and Fusion of Heterogeneous Information (RIAO). MIT Press.
Kollar, T.; Tellex, S.; Roy, D.; and Roy, N. 2010. Toward understanding natural language directions. In Proc. ACM/IEEE Int'l Conf. on Human-Robot Interaction (HRI), 259–266.
Lafferty, J. D.; McCallum, A.; and Pereira, F. C. N. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. Int'l Conf. on Machine Learning (ICML), 282–289.
Landau, B., and Jackendoff, R. 1993. "What" and "where" in spatial language and spatial cognition. Behavioral and Brain Sciences 16:217–265.
Liang, P.; Jordan, M. I.; and Klein, D. 2011. Learning dependency-based compositional semantics. In Proc. Association for Computational Linguistics (ACL).
MacMahon, M.; Stankiewicz, B.; and Kuipers, B. 2006. Walk the talk: Connecting language, knowledge, and action in route instructions. In Proc. Nat'l Conf. on Artificial Intelligence (AAAI), 1475–1482.
Matuszek, C.; Fox, D.; and Koscher, K. 2010. Following directions using statistical machine translation. In Proc. ACM/IEEE Int'l Conf. on Human-Robot Interaction (HRI), 251–258.
McCallum, A. K. 2002. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu.
Regier, T., and Carlson, L. A. 2001. Grounding spatial language in perception: An empirical and computational investigation. J. of Experimental Psychology: General 130(2):273–298.
Shimizu, N., and Haas, A. 2009. Learning to follow navigational route instructions. In Proc. Int'l Joint Conf. on Artificial Intelligence (IJCAI), 1488–1493.
Skubic, M.; Perzanowski, D.; Blisard, S.; Schultz, A.; Adams, W.; Bugajska, M.; and Brock, D. 2004. Spatial language for human-robot dialogs. IEEE Trans. on Systems, Man, and Cybernetics, Part C: Applications and Reviews 34(2):154–167.
Talmy, L. 2005. The fundamental system of spatial schemas in language. In Hamp, B., ed., From Perception to Meaning: Image Schemas in Cognitive Linguistics. Mouton de Gruyter.
Teller, S.; Walter, M. R.; Antone, M.; Correa, A.; Davis, R.; Fletcher, L.; Frazzoli, E.; Glass, J.; How, J.; Huang, A.; Jeon, J.; Karaman, S.; Luders, B.; Roy, N.; and Sainath, T. 2010. A voice-commandable robotic forklift working alongside humans in minimally-prepared outdoor environments. In Proc. IEEE Int'l Conf. on Robotics and Automation (ICRA), 526–533.
Tellex, S. 2010. Natural Language and Spatial Reasoning. Ph.D. Dissertation, Massachusetts Institute of Technology.
Vogel, A., and Jurafsky, D. 2010. Learning to follow navigational directions. In Proc. Association for Computational Linguistics (ACL), 806–814.
Winograd, T. 1970. Procedures as a representation for data in a computer program for understanding natural language. Ph.D. Dissertation, Massachusetts Institute of Technology.