Learning Modular Language-Conditioned Robot Policies Through Attention
https://doi.org/10.1007/s10514-023-10129-1
Received: 2 May 2023 / Accepted: 26 July 2023 / Published online: 30 August 2023
© The Author(s) 2023
Abstract
Training language-conditioned policies is typically time-consuming and resource-intensive. Additionally, the resulting controllers are tailored to the specific robot they were trained on, making it difficult to transfer them to other robots with different dynamics. To address these challenges, we propose a new approach called Hierarchical Modularity, which enables more efficient training and subsequent transfer of such policies across different types of robots. The approach incorporates Supervised Attention, which bridges the gap between modular and end-to-end learning by enabling the re-use of functional building blocks. In this contribution, we build upon our previous work, showcasing the extended utility and improved performance obtained by expanding the hierarchy to include new tasks and by introducing an automated pipeline for synthesizing a large quantity of novel objects. We demonstrate the effectiveness of this approach through extensive simulated and real-world robot manipulation experiments.
Yifan Zhou (corresponding author): [email protected]
Shubham Sonawani: [email protected]
Mariano Phielipp: [email protected]
Heni Ben Amor: [email protected]
Simon Stepputtis: [email protected]

1 School of Computing and Augmented Intelligence, Arizona State University, Tempe, AZ, USA
2 Intel AI, Phoenix, AZ, USA
3 The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA

1 Introduction

The word robot was introduced and popularized in the Czech play "Rossum's Universal Robots", also known as R.U.R. In this seminal piece of theatre, robots understand and carry out a variety of verbal human instructions. Roboticists and AI researchers have long been striving to create machines with such an ability to turn natural language instructions into physical actions in the real world (Jang et al., 2022; Stepputtis et al., 2020; Lynch and Sermanet, 2021; Ahn et al., 2022; Shridhar et al., 2021). However, this task requires robots to interpret instructions in the current situational and behavioral context in order to accurately reflect the intentions of the human partner. Achieving such inference and decision-making capabilities demands a deep integration of multiple data modalities—specifically, the intersection of vision, language, and motion. Language-conditioned imitation learning (Lynch and Sermanet, 2021; Stepputtis et al., 2020) is a technique that can help address these challenges by jointly learning perception, language understanding, and control in an end-to-end fashion.

However, a significant drawback of this approach is that, once trained, these language-conditioned policies are only applicable to the specific robot they were trained on. This is because end-to-end policies are monolithic in nature, which means that robot-specific aspects of the task, such as kinematic structure or visual appearance, cannot be individually targeted and adjusted. While it is possible to retrain the policy on a new robot, this comes with the risk of catastrophic forgetting and substantial computational overhead. Similarly, adding new aspects, behaviors, or elements to the task may also require a complete retraining.
Fig. 1 Our proposed method demonstrates high performance on a variety of tasks. It is able to transfer to new robots in a data-efficient manner, while still keeping a high execution performance. It also supports adding new behaviors to an existing trained policy. Beyond this, we also demonstrate the ability to learn relational tasks, where two objects are involved in the same sentence
This paper tackles the problem of creating modular language-conditioned robot policies that can be re-structured, extended and selectively retrained. Figure 1 depicts a set of scenarios that we want to address in this paper. For example, we envision an approach which allows for the efficient repurposing and transfer of a policy to a new robot. We also envision situations in which a new behavior may be added to an existing policy, e.g., incorporating obstacle avoidance into an existing motion primitive. Similarly, we envision situations in which the type of behavior is changed by incorporating additional modules into a policy, e.g., following human instructions that define a relationship between multiple objects, such as, "Put the apple left of the orange!".

However, the considered modularity is at odds with the monolithic nature of end-to-end deep learning. To overcome this challenge, the paper proposes an attention-based methodology for learning reusable building blocks, or modules, that realize specialized sub-tasks. In particular, we discuss supervised attention, which allows the user to guide the training process by focusing the attention of a sub-network (or module) on certain input-output variables. By imposing a specific locus of attention, individual sub-modules can be guided to realize an intended target functionality. Another contribution, called hierarchical modularity, is a training regime inspired by curriculum learning that aims to decompose the overall learning process into individual subtasks. This approach enables neural networks to be trained in a structured fashion, maintaining a degree of modularity and compositionality.

Our contributions summarize and extend our prior work in Zhou et al. (2022) as follows: (1) we propose a sample-efficient approach for training language-conditioned manipulation policies that allows for rapid transfer across different types of robots; (2) we introduce a novel method, based on two components called hierarchical modularity and supervised attention, that bridges the divide between modular and end-to-end learning and enables the reuse of functional building blocks; (3) we demonstrate that our method outperforms the current state-of-the-art methods [BC-Z (Jang et al., 2022) and LP (Stepputtis et al., 2020)]; (4) we extend the methodology by creating more complex tasks that incorporate obstacle avoidance and relational instruction following. Finally, we also perform an extensive number of experiments that shed light on the generalization properties of our methodology from different angles, e.g., dealing with occlusions, synonyms, variable objects, etc. (Fig. 1).

2 Preamble: how generative AI helped write the paper

This paper largely centers around the training of generative models at the intersection of vision, language and robot control. Besides being the topic of the paper, generative models have also been instrumental in writing this paper. In particular, we incorporated such techniques into both (a) the text editing process when writing the manuscript, as well as (b) the process of generating 3D models and textures of manipulated objects.

For text editing, we utilized GPT-4 (OpenAI, 2023) to iteratively revise and refine our initial drafts, ensuring improved readability and clarity of the concepts discussed. We achieved this by conducting prompt engineering and formulating a specific prompt as follows:

"Now you are a professor at a top university, studying computer science, robotics and artificial intelligence. Could you please help me rewrite the following text so that it is of high quality, clear, easy to read, well written and can be published in a top level journal? Some of the paragraphs might lack critical information. If you notice that, could you please let me know? Let's do back and forth discussions on the writing and refine the writing."

We initiate each conversation with this impersonation prompt, followed by our draft text. GPT-4 then returns a revised version of the text, ensuring the semantics remain unaltered while updating the literary style to incorporate professional terminology and wording, as well as a clear logical flow. This prompt also encourages GPT-4 to solicit feedback on the revised text, thus facilitating back-and-forth conversations. We manually determine when a piece of writing has been fine-tuned to a satisfactory degree and bring the conversation to a close.
With regard to the generation of 3D models and assets, we created a new pipeline for the automated synthesis of complete polygonal meshes. Figure 2 (top row) depicts the individual steps of this process. First, we synthesize an image of the intended asset using latent diffusion models (Rombach et al., 2022). We provide as input to the model a textual description of the asset, e.g., "A front image of an apple and a white background.". In turn, the resulting image is fed into a monocular depth-estimation algorithm (Ranftl et al., 2022) to generate the corresponding depth map. At this stage, each pixel in the image has both (1) a corresponding depth value and (2) an associated RGB texture value. To generate a 3D object, we take a flat mesh grid of the same resolution as the synthesized RGB image. We then perform displacement mapping (Zirr and Ritschel, 2019) based on the values present in the depth image. Within this process, each point of the originally flat grid is elevated or depressed according to its depth value. The result is a 3D model representing the front half of the target object. For the sake of this paper, we assume plane symmetry—a feature that is common among a large number of household objects. Accordingly, we can mirror the displacement map in order to yield the occluded part of the object. Finally, we also apply a Laplacian smoothing operation (Sorkine et al., 2004) to the final object. Texturing information is retrieved from the source image. This automated 3D synthesis process allows us to rapidly generate a potentially infinite number of variants of an object. This is particularly useful when studying the generalization capabilities of a model. It also completely removes any 3D modeling or texturing burden. At the moment, the pipeline is limited to symmetric objects.
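The geometric core of this pipeline can be summarized in a short script. The following is a minimal sketch, not the implementation used in the paper, that assumes the synthesized RGB image and its estimated depth map are already available as arrays; grid vertices are displaced by depth, mirrored to form the back half, and smoothed with a simple uniform Laplacian filter.

import numpy as np

def depth_to_symmetric_mesh(depth, scale=1.0, smooth_iters=5, lam=0.5):
    """Build a plane-symmetric vertex grid from an HxW depth map.

    Front half: a flat HxW grid displaced along +z by normalized depth.
    Back half: the mirrored displacement along -z (plane-symmetry assumption).
    A uniform Laplacian filter smooths the displaced height field.
    """
    H, W = depth.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")

    # Displacement mapping: elevate each grid point by its normalized depth.
    z = scale * (depth - depth.min()) / max(float(np.ptp(depth)), 1e-8)

    front = np.stack([xs, ys, +z], axis=-1).astype(np.float32)
    back = np.stack([xs, ys, -z], axis=-1).astype(np.float32)  # mirrored half

    # Simple uniform Laplacian smoothing on the front-half height field.
    for _ in range(smooth_iters):
        padded = np.pad(front[..., 2], 1, mode="edge")
        neighbor_avg = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                        padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
        front[..., 2] += lam * (neighbor_avg - front[..., 2])
        back[..., 2] = -front[..., 2]

    # Vertices of both halves; RGB texture is looked up per (y, x) pixel.
    return np.concatenate([front.reshape(-1, 3), back.reshape(-1, 3)], axis=0)

In the full pipeline, the depth map would come from a monocular depth estimator such as the one of Ranftl et al. (2022) applied to the diffusion-generated image, and the vertex grid would additionally be triangulated and textured with the source RGB values before export.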
3 Related work

Imitation learning offers a straightforward and efficient method for learning agent actions based on expert demonstrations (Dillmann and Friedrich, 1996; Schaal, 1999; Argall et al., 2009). This approach has proven effective in diverse tasks including helicopter flight (Coates et al., 2009), robot control (Maeda et al., 2014), and collaborative assembly. Recent advancements in deep learning have enabled the acquisition of high-dimensional inputs, such as vision and language data (Duan et al., 2017a; Zhang et al., 2018a; Xie et al., 2020)—partially stemming from improvements in image and video understanding domains (Lu et al., 2019; Kamath et al., 2021; Chen et al., 2020; Tan and Bansal, 2019; Radford et al., 2021; Dosovitskiy et al., 2021), but also in language comprehension (Wang et al., 2022; Ouyang et al., 2022). Specifically, the work presented in Radford et al. (2021) paved the way for multimodal language and vision alignment. The generalizability of such large multimodal models (Singh et al., 2022; Alayrac et al., 2022; Ouyang et al., 2022; Zhu et al., 2023) enables a variety of downstream tasks, including image captioning (Laina et al., 2019; Vinyals et al., 2015; Xu et al., 2015), visual question answering systems (VQA) (Antol et al., 2015; Johnson et al., 2017), and multimodal dialog systems (Kottur et al., 2018; Das et al., 2017).
Fig. 3 Overview: different input modalities, i.e., vision, joint angles and language, are fed into a language-conditioned neural network to produce robot control values. The network is set up and trained in a modular fashion—individual modules address sub-aspects of the task. The neural network can efficiently be trained and transferred onto other robots and environments (e.g. Sim2Real)
4.2 Training modular language-conditioned policies

Our overall method is illustrated in Fig. 3. First, the camera image I is processed together with a natural language instruction s and the robot's proprioceptive data (i.e., joint angles) through modality-specific encoders to generate their respective embeddings. The resulting embeddings are subsequently supplied as input tokens to a transformer-style (Vaswani et al., 2017) neural network consisting of multiple attention layers. This neural network is responsible for implementing the overall policy π_θ and produces the final robot control signals.

The encoding process ensures that distinct input modalities, e.g., language, vision and motion, can effectively be integrated within a single model. To that end, Vision Encodings e_I = f_V(I) are generated using an input image I ∈ R^(H×W×3). Taking inspiration from (Carion et al., 2020; Locatello et al., 2020), we maintain the original spatial structure while encoding the image into a sequence of lower-resolution image tokens. The resolution is reduced via a convolutional neural network while increasing the number of channels, yielding e′_I ∈ R^((H/s)×(W/s)×d), with s representing a scaling factor and d denoting the embedding size. Consequently, the low-resolution pixel tokens are transformed into a sequence of tokens e_I ∈ R^(Z×d), where Z = (H × W)/s^2, through a flattening operation. By contrast, Language Encodings e_s = f_L(s) ∈ R^(1×d) are produced via a pre-trained and fine-tuned CLIP (Radford et al., 2021) model. Particularly, each instruction s is represented as a sequence of words [w_0, w_1, ..., w_n] in which each word w_i is a member of vocabulary W. During training, we employ automatically generated, well-formed sentences; however, after training, we allow any free-form verbal instruction that is presented to the model, including sentences affected by typos or bad grammar. Finally, Joint Encodings e_j = f_J(a) ∈ R^(1×d) are created by transforming the current robot state a into a latent representation using a simple multi-layer perceptron. The main purpose of this step is to transform the joint representation into a compatible shape that aligns with the other input embeddings.
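As a rough illustration of how the three per-modality encoders can be combined into a single token sequence, the sketch below is our own simplification with illustrative module names and dimensions: a small CNN downsamples the image by the factor s and flattens the result into Z visual tokens, to which one language token and one joint-state token of the same width d are appended. The CLIP sentence embedding is assumed to be computed elsewhere.

import torch
import torch.nn as nn

class MultimodalTokenizer(nn.Module):
    """Sketch: encode image, language, and joint state into one token sequence."""

    def __init__(self, d=128, lang_dim=512, joint_dim=7):
        super().__init__()
        # Vision encoder f_V: reduce resolution by s=8 while raising channels to d.
        self.f_v = nn.Sequential(
            nn.Conv2d(3, d // 2, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv2d(d // 2, d, kernel_size=2, stride=2),
        )
        # Language encoder f_L: projection of a precomputed CLIP sentence embedding.
        self.f_l = nn.Linear(lang_dim, d)
        # Joint encoder f_J: simple MLP over the robot state.
        self.f_j = nn.Sequential(nn.Linear(joint_dim, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, image, clip_sentence_emb, joints):
        # image: (B, 3, H, W) -> (B, d, H/8, W/8) -> (B, Z, d) with Z = HW/64
        e_i = self.f_v(image).flatten(2).transpose(1, 2)
        e_s = self.f_l(clip_sentence_emb).unsqueeze(1)   # (B, 1, d)
        e_j = self.f_j(joints).unsqueeze(1)              # (B, 1, d)
        return torch.cat([e_i, e_s, e_j], dim=1)         # full token sequence

tokens = MultimodalTokenizer()(torch.rand(2, 3, 224, 224),
                               torch.rand(2, 512), torch.rand(2, 7))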
4.2.1 Supervised attention

After encoding, the inputs are processed within a single neural network in order to produce robot control actions. However, a unique element of our approach is the formation of semantically meaningful sub-modules during the learning process. These modules may solve a specific sub-task, e.g., detecting the robot end-effector or calculating the distance between the robot and the target object. To achieve this effect, we build upon modern attention mechanisms (Vaswani et al., 2017) in order to manage the flow of information within attention layers, thereby explicitly guiding the network to concentrate on essential inputs.

More specifically, we adopt a supervised attention mechanism in order to enable user-defined information routing and the formation of modules within an end-to-end neural network. The main idea underlying this mechanism is that information about optimal token pairings may be available to the user. In other words, if we know which key tokens are important for the queries to look at, we can treat their similarity score as a maximization goal. In Fig. 4, we see the information routing for three modules. The first module LANG is supposed to identify the target object within the sentence. Hence, the corresponding attention layer is trained to focus only on the language input; the attention for the robot joint values and vision input is trained to be zero. In order to provide the output of this module to the attention layer in the next level, we use so-called register slots. Register slots are used to store the output of a module so that it can be accessed in subsequent modules in the hierarchy. Accordingly, each module within our method has corresponding register slot tokens. The role of the register slots is to provide access to the output of previously executed modules within the hierarchy. Coming back to Fig. 4, the second module EE2D locates the robot end-effector within the camera image.
Fig. 4 Different sub-aspects of the tasks are implemented as modules (via supervised attention). LANG identifies the target object. EE2D locates
the robot end-effector
Within each attention layer, queries Q, keys K, and values V are combined via scaled dot-product attention (Vaswani et al., 2017):

Attention(Q, K, V) = softmax(Q K^T / √d_k) V,    (1)

where d_k is the dimensionality of the keys. In our use case, the queries are initialized with either learnable and previously unused register slots, or with registers that have been set by modules operating in prior layers, thus encoding their respective results. Our keys are equivalent to the values and are initialized with all formal inputs (language, vision, and joint embeddings) as well as all previously set registers from prior layers. In contrast to common practice, we control the information flow when learning each module via our proposed supervised attention, which is a specific optimization target for attention layers.

As an illustrative example, consider a query identifying the location of the end-effector, as demonstrated in Fig. 5 (first key and query combination in the top left), or finding the target object (key and query combination near the center). For simplicity, we omit the other formal inputs and only focus on the visual input. However, the Tar Reg. would also depend on the language register from the prior LANG module. Following common practice, the keys and values derive from the input image, with each image embedding vector corresponding to an image patch (Fig. 5 left). In this particular example, the EE uses a trainable, previously unused register as query, while the Tar register utilizes the output register of the language module to find the correct object (Fig. 5 top). The EE register is supervised to focus on the robot's gripper image patch, thereby creating a sub-module for detecting the robot end-effector. Similarly, the target register attends to the target object's image patch, forming a sub-module responsible for identifying the target object. When these queries accurately attend to their respective patches, these patches will primarily contribute to the output register's embedding vector, which can then be used as subsequent module inputs. More formally, we maximize the similarity between query q_i and key k_j if a connection should exist, thus optimizing Eq. 1 with keys being the set of original sensor modalities, as well as registers.
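One possible realization of this supervision signal, sketched here with illustrative names rather than the exact loss formulation of the paper, is to compute the standard attention of Eq. 1 and add a routing loss that maximizes the attention mass each query places on its user-specified set of allowed keys:

import torch

def supervised_attention(q, k, v, allowed):
    """Scaled dot-product attention with a supervised routing loss.

    q: (B, Q, d) queries (register slots); k, v: (B, K, d) keys/values;
    allowed: (Q, K) binary mask, allowed[i, j] = 1 if query i should attend to key j.
    """
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5      # logits of Eq. (1)
    attn = scores.softmax(dim=-1)                      # (B, Q, K)
    out = attn @ v                                     # module output embeddings

    # Supervision: maximize the attention mass placed on the allowed keys.
    mass_on_allowed = (attn * allowed).sum(dim=-1)     # (B, Q)
    routing_loss = -torch.log(mass_on_allowed + 1e-8).mean()
    return out, routing_loss

During training, such a routing loss would be added to the module's task-specific loss; at convergence, each register's output is dominated by its intended inputs (e.g., the EE register by the gripper's image patch).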
In the first layer of Fig. 6, the LANG module identifies the target object, as referred to in the verbal command, and stores the result in the r_LANG register. Subsequently, in the second layer, the f_TAR2D module utilizes the r_LANG register as a query, while the f_EE2D module utilizes a new, previously unused register as a query. This chain continues until the final control output of the robot is generated in the CTRL module.

Recall that sub-modules address intermediate tasks in the overarching control problem, making the output register r suitable for human interpretation and allowing for supervised training of the resulting embedding. To achieve this, we employ small multi-layer perceptron (MLP) decoders to convert the module outputs into their respective numeric outputs. For example, we train a small MLP on top of the r_EE2D register that predicts the end-effector location (ee_x, ee_y) via a single linear transformation. This approach enables our policy to predict intermediate module outputs, enhancing training accuracy and allowing monitoring and debugging during inference, which is particularly valuable when transferring the policy to different robots or scenarios.
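Such a decoder head can be as small as a single linear layer; the snippet below is a sketch with hypothetical class and parameter names, not the exact architecture used in the paper.

import torch.nn as nn

class RegisterDecoder(nn.Module):
    """Decode a module's register embedding into an interpretable quantity,
    e.g., the 2D end-effector location (ee_x, ee_y) from the r_EE2D register."""

    def __init__(self, d=128, out_dim=2):
        super().__init__()
        self.head = nn.Linear(d, out_dim)   # single linear transformation

    def forward(self, register_embedding):  # (B, d) -> (B, out_dim)
        return self.head(register_embedding)

Analogous heads can decode the TAR2D/TAR3D positions and the DISP displacement vector; their prediction errors correspond to the intermediate metrics reported in the evaluation.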
Training Cascaded Modules
Intuitively, the cascaded modules can be trained in a manner inspired by curriculum learning, wherein each component is trained before further layers of the hierarchy are added to the training objective. This ensures that each module is trained until convergence before being employed for more sophisticated decision-making processes, ultimately leading to the prediction of robot control parameters. Algorithm 1 outlines the training procedure for our hierarchical approach in further detail. The algorithm trains each module of the hierarchy one after another, until the currently trained module has converged according to its respective loss function. After that, we progressively incorporate additional modules in a manner reminiscent of curriculum learning. Each module k is trained with an attention loss L_k given the supervision signal S_k of our proposed supervised attention approach, as well as a task-specific loss function ℓ_k which trains the MLP decoder for every module. Thus, each module is optimized with regard to two targets. Note that the policy loss for the robot controller CTRL is also implemented via an MLP decoder, which also represents the overall prediction target of our training process. Notably, in our scenario, this decoder predicts the next ten goal positions at each timestep instead of predicting only the next action. This choice is inspired by Jang et al. (2022), which also allows for a fair comparison in the subsequent evaluation sections.

Algorithm 1 Hierarchical Modularity: training algorithm returns network weights θ.
  Input: D, {S_k}_{k=1..K}, {L_k}_{k=1..K}, {ℓ_k}_{k=1..K}
  Output: Weights θ
  for subtask k ← 1 to K do
    while not converged do
      E_k ← Σ_{t=0..k} L_t(S_t) + ℓ_t
      θ ← Train(D, {S_1, ..., S_k}, E_k)
    end while
  end for
  return θ

While the modular approach requires manually defining loss terms for each module, it is essential to note that all modules form a single overarching neural network implementing the robot policy, inherently learning necessary features in an end-to-end manner. Modularization arises solely from training the network with various supervised attention targets and a cost function that successively integrates more sub-tasks.
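Algorithm 1 can be restated as a short training loop. The sketch below is schematic: attention_loss and task_loss are placeholders for the per-module losses L_k and ℓ_k, and the convergence check is replaced by a fixed epoch budget.

def train_hierarchy(policy, modules, data_loader, optimizer, max_epochs=100):
    """Train modules one after another; module k is optimized jointly with
    all previously trained modules 1..k (curriculum-style cost E_k)."""
    for k, _ in enumerate(modules, start=1):
        for _ in range(max_epochs):            # stand-in for "until converged"
            for batch in data_loader:
                outputs = policy(batch)        # one forward pass through all modules
                # E_k: cumulative supervised-attention and task losses up to module k
                loss = sum(m.attention_loss(outputs, batch) + m.task_loss(outputs, batch)
                           for m in modules[:k])
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return policy.state_dict()                 # the trained weights θ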
4.3 Use-cases and extensions of hierarchy

We present our model as a cascade of sub-modules, trained hierarchically, enabling seamless integration of additional modules. In this section, we discuss the incorporation of obstacle avoidance, tracking a predefined obstacle, and describe the generalization of this approach to arbitrary "referential objects" that let users specify commands that reference any other object. These enhancements are implemented by introducing new modules, as depicted in Fig. 8.

4.3.1 Runtime introspection

All sub-modules retain their functionality, even after training. Consequently, they can be used at runtime to query individual outputs (e.g., LANG, TAR2D, EE3D). This feature allows users to monitor the intermediate computations of the end-to-end network to identify potential deviations and misclassifications. Figure 7 visually depicts the outputs of each module during the execution on a real robot. A textual description (upper left corner) shows the currently identified object name, as well as the displacement (in cm) between the end-effector and the target object. The current attention map is visualized in yellow, whereas the end-effector position and the target position are highlighted by red and blue points. Computing these intermediate outputs of the network generates negligible to no computation overhead. In our specific system, we implemented a real-time visualization tool that can be used at all times to monitor the above features. Such tools for introspection can help in debugging and troubleshooting of the language-conditioned policy. For example, they can be used to detect when individual modules need to be retrained, or where in the hierarchy a problem is manifesting. In addition, such outputs can be used with formal runtime monitoring and verification systems, e.g., Yamaguchi and Fainekos (2021) and Pettersson (2005), to improve the safety of the neural network policy.
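In practice, querying these intermediate outputs can be as simple as running the decoder heads during each control step and logging their values. The following sketch uses hypothetical output keys rather than the exact interface of any particular implementation.

def introspect(policy, image, instruction, joints):
    """Query intermediate module outputs alongside the control prediction."""
    outputs = policy(image, instruction, joints)       # single forward pass
    report = {
        "target_object": outputs["LANG"],              # decoded object name
        "ee_2d": outputs["EE2D"],                      # end-effector pixel location
        "target_3d": outputs["TAR3D"],                 # target position estimate
        "displacement": outputs["DISP"],               # end-effector-to-target vector
        "controls": outputs["CTRL"],                   # next goal positions
    }
    print({k: v for k, v in report.items() if k != "controls"})
    return report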
Fig. 7 Sequence of real-time outputs of the network modules: the object name (white) and visual attention (yellow region), the length of the displacement (white text), the object pos (blue), and end-effector pos (red). All values are generated from a single network that also produces robot controls (Color figure online)
5.2 Model performance and baseline comparison

In this section, we evaluate our model on the three basic actions across the six Robosuite objects, utilizing the D_UR5 dataset. We also compare our method to two state-of-the-art baselines, specifically BC-Z (Jang et al., 2022) and LP (Stepputtis et al., 2020). As our third baseline, we investigate vanilla, unsupervised attention. In this scenario, the same network as before is trained, but without supervision of the attention process as introduced in this paper.

Table 3 summarizes these results, in which each training and testing procedure was executed three times to provide a better understanding of the stability of the compared methods. We evaluate not only the overall success rates but also the performance of each individual module within our language-conditioned policy. Specifically, we employ the following metrics: (1) Success Rate describes the percentage of successfully executed trials among the test set, (2) Target Object Position Error (TAR3D) measures the Euclidean 3D distance between the predicted target object position and the ground truth, (3) End Effector Position Error (EE3D) quantifies the Euclidean 3D distance between the predicted end effector position and the ground truth, and (4) Displacement Error (DISP) calculates the 3D distance between the predicted 3D displacement vector and the corresponding ground truth vector.
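These metrics reduce to simple Euclidean distances and a success ratio; a compact sketch with placeholder field names is given below.

import numpy as np

def evaluate(trials):
    """trials: list of dicts holding predicted/ground-truth vectors and a success flag."""
    dist = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
    success_rate = np.mean([t["success"] for t in trials])
    tar3d = np.mean([dist(t["tar3d_pred"], t["tar3d_gt"]) for t in trials])
    ee3d = np.mean([dist(t["ee3d_pred"], t["ee3d_gt"]) for t in trials])
    disp = np.mean([dist(t["disp_pred"], t["disp_gt"]) for t in trials])
    return {"success_rate": success_rate, "TAR3D": tar3d, "EE3D": ee3d, "DISP": disp}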
Our method (line 4) outperformed BC-Z (line 3) on all basic tasks with an average success rate of 82.4%, as compared to 73.1% for BC-Z. Furthermore, we separately assessed the prediction error of the proposed network's components, namely EE3D, TAR3D, and DISP. We note that the end-effector pose prediction accuracy (approximately 0.5 cm) surpasses the target object's accuracy, which could be attributed to the presence of the robot's joint state information. The target object's position estimation deviates by around 2-3 cm, possibly due to the absence of depth information in our input dataset (solely consisting of RGB).

By contrast, the LP model (line 1) is not able to successfully complete any of the tasks. We hypothesize that this low performance is due to the training dataset's significantly smaller size compared to the LP's usual training data size.
5.2.1 Ablations

In order to evaluate the impact of our two main contributions—supervised attention and hierarchical modularity—we conduct an ablation study to investigate the impact of each contribution on training performance. In addition, we also ablate the structure of the hierarchy itself in order to investigate its resiliency to structural changes.

Results of the ablation experiments can be found in Table 3. Our model (line 3) has an overall success rate of 82.4% across three seeds. When ablating the usage of hierarchical modularity, performance drops to 36.4% (line 5). Utilizing our runtime introspection approach to investigate potential issues in the modules (Sect. 4.3.1), we find that the target and displacement errors increased to over 20 cm, which is likely the cause for the reduced performance. When removing the supervision signal (line 4) for the attention inside our modules (and instead relying on end-to-end training), we see a drop of ≈ 2.5% in performance to about 80%.

When ablating the hierarchy itself, we merged the TAR2D and TAR3D modules (line 7) into a single module instead of maintaining two. The underlying rationale is that the separation of the target detection between 2D and 3D detection is not strictly necessary and thus a single target module may be sufficient. The resulting success rate in this case is 80.9%, which is only slightly below the original rate of 82.4%. Next, we removed the displacement module DISP (line 9) altogether, which results in a performance of about 61.3% (a loss of around 20%). Finally, we added spurious modules that are not necessary for the policy's success in these tasks (line 10). In particular, we added a specific module that only detects the "Coke" can. In this case, we achieved a success rate of 86.3%, which is slightly higher than the original result.

As a general observation, the approach appears resilient to superfluous modules, combined modules, or variations of the hierarchy. However, the absence of certain critical modules, e.g., the DISP or TAR modules (lines 9 and 8, respectively), may have a more drastic effect on performance. In the above case of removing the DISP module (line 9), the performance reduces to about 61.3%, which is below the corresponding value for BC-Z (73.1%).

5.2.2 Occlusion

Next, we evaluate the robustness of our approach to partial occlusions of the target objects during task execution. To this end, occlusions are introduced by removing image patches in the camera feed of the simulated experiments. This step is performed by covering approximately 20%, 42%, 68% and 80% of the target object's total area, calculated via a pixel-based segmentation approach of the input image provided by the simulator. All experiments are conducted on all six Robosuite objects across all three basic tasks. The results are shown in Fig. 10. We observe that our method is robust to occlusions of up to 20% of the target object, while our baseline model, BC-Z, already experiences a significant drop in accuracy. While our model only loses about 1.1% in performance, BC-Z drops by 9.35%. However, for occlusions greater than 40%, our method performs on-par with BC-Z. We argue that our robustness to 20% occlusions is significant since small, partial occlusions are more likely to occur during tabletop manipulation tasks.
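The occlusion procedure can be sketched as follows; this is our own approximation of the described step, in which a patch is grown over the object's segmentation mask until the requested fraction of its pixels is covered.

import numpy as np

def occlude_target(image, target_mask, fraction=0.2, fill=128):
    """Cover approximately `fraction` of the target object's pixels with a square patch."""
    target_mask = target_mask.astype(bool)
    ys, xs = np.nonzero(target_mask)                  # pixels belonging to the object
    cy, cx = int(ys.mean()), int(xs.mean())           # grow the patch from the centroid
    occluded = image.copy()
    for half in range(1, max(image.shape[:2])):
        patch = np.zeros_like(target_mask)
        patch[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half] = True
        covered = np.logical_and(patch, target_mask).sum() / target_mask.sum()
        if covered >= fraction:
            occluded[patch] = fill
            return occluded, covered
    occluded[target_mask] = fill                      # fallback: cover the whole object
    return occluded, 1.0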
5.2.3 Synonyms

Our final robustness experiment is concerned with the variability of free-form spoken language. While our system is trained with sentences from a template-based generator, we evaluate its performance when exposed to a set of additional synonyms, as well as free-form spoken language from a small set of human subjects. When replacing synonyms, as shown in Table 8, in the single-word and short-phrase case, we observe that our model achieves an 82.5% success rate on the pushing task. When using BC-Z on the same task with the same synonyms, performance drops to 28.57%, indicating the robustness of our method to variations in the language inputs. Finally, we also evaluate the performance on 30 examples of free-form natural language instructions that were collected from human annotators and report a success rate of 73.3%. The sentences used by the annotators can be found in Table 11 and show that our model can work with unconstrained language commands.
5.3 Transfer to different robots and real-world

In this section, we evaluate the ability of our approach to efficiently transfer policies between different robots that may have different morphologies. Rather than retraining our model from scratch to accommodate the altered dynamics between different robots, we posit that our modular approach enables the transfer of substantial portions of the prior policy. This necessitates only minimal fine-tuning, consequently resulting in a reduced demand for data collection on the different robots. In particular, we evaluate fine-tuning of the entire policy, and fine-tuning of only the modules affected by a change in visual appearance or robot morphology.

5.3.1 Transfer in simulation

Our initial policy is trained from scratch on the D_Kinova dataset, while the transfer of the trained policy to the Franka and UR5 robots is realized with the D_TF^Franka and D_TF^UR5 datasets, respectively.

As noted earlier, the D_TF datasets are intentionally over-provisioned to allow an evaluation regarding how much data is required in order to match the performance of the transferred policy to a policy that is trained from scratch on the same robot. In order to shed some light on this, we sub-sampled the transfer datasets to a total size of 80, 160, 240 and 320 demonstrations and conducted the training. Figure 11 shows the results of this analysis (reported as "Ours") given the varying dataset sizes when fine-tuning the entire policy initialized with the Kinova weights. With 160 demonstrations, our model achieves a success rate of 80%, which is only slightly below the policy's performance when trained on the full 1600 demonstrations from scratch. Further, given the full 320 demonstrations of the transfer dataset, the policy reaches a performance that is on-par with one trained from scratch. When fine-tuning BC-Z with the same dataset splits, we observe that our model consistently outperforms BC-Z. Interestingly, we also observe that our model performs similarly when transferring to the Franka and UR5 robots across the dataset splits, while BC-Z seems to initially perform worse when transferring to the Franka robot. Note here that the Franka is a 7 degree of freedom (DoF) robot while the source policy, which operates over the Kinova robot, only has six. This discrepancy likely affects robot dynamics, thereby affecting the transfer process.

Further, we conducted experiments in which we froze parts of our model during transfer of a pre-trained policy from the Kinova to the UR5 and Franka robots. In particular, the TAR3D, EE3D, and DISP prediction modules are unaffected by the change in visual appearance and morphology of the new robot and, thus, do not need to be retrained. Note, however, that we retrain TAR2D since partial occlusions by the new robot could lead to false positives for target objects. We have conducted further experiments with the same fine-tuning datasets and report their results in Fig. 11 (reported as "Ours-f"). In this setting, with a dataset of only 80 demonstrations, the partially frozen model produces a result of 60% and 72.5% when transferring to the Franka and UR5, respectively. This poses a substantial performance improvement of up to 18% in the case of transfer to the UR5 robot, while utilizing less data than fine-tuning the entire model. This result further underlines the gains in data-efficiency that can be achieved through the hierarchical modularity.
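The partially frozen variant ("Ours-f") can be realized by disabling gradients for the modules that are agnostic to the new robot's appearance and morphology. The snippet below is schematic and uses a hypothetical accessor for looking up modules by name.

def prepare_transfer(policy, frozen=("TAR3D", "EE3D", "DISP")):
    """Freeze robot-agnostic modules; fine-tune the rest (incl. TAR2D, EE2D, CTRL)."""
    for name, module in policy.named_modules_by_tag():   # hypothetical accessor
        if name in frozen:
            for p in module.parameters():
                p.requires_grad = False                   # exclude from fine-tuning
    trainable = [p for p in policy.parameters() if p.requires_grad]
    return trainable   # pass these parameters to the optimizer for fine-tuning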
5.3.2 Real-world transfer

Having demonstrated the ability of our approach to efficiently transfer policies between robots in simulation, we demonstrate that a policy can also be transferred to the real world (Sim2Real transfer) in a sample-efficient way. To this end, we first trained a policy for the UR5 robot in simulation utilizing the D_UR5 dataset and subsequently transferred it with a substantially smaller real-world dataset D_UR5^RW. More specifically, 260 demonstrations on the real robot are collected for transfer—this corresponds to about 1/6-th of the size of the original training set. The overall robot setup can be seen in Fig. 12. The scene is observed via an external RGB camera and robot actions are calculated in a closed-loop fashion by providing the current camera image and language instruction to the policy.

To investigate the contributions of our proposed methods, we conduct experiments under three different baseline settings. These include directly applying the simulated policy on the real robot, fine-tuning the simulated policy using the real-world dataset D_UR5^RW, and transferring the simulated policy.
Fig. 15 Robot trained to avoid all obstacles in the scene. On the way to the Coke can, the robot first avoids a basketball and then the green bottle.
We move the bottle in front of the robot to generate an instantaneous response
The policy for this task has been trained from a UR5 policy by utilizing the above dataset that introduces the novel task.

For this task, we have made modifications to the original hierarchies.
Fig. 17 Robot performing relational tasks with two objects involved in one command. In the top row the robot places an avocado to the left of a hamburger, while in the second row the robot places a donut to the right of a hamburger
Firstly, we introduce a LANG2 module to determine the referential object based on the language input. Besides that, we add TAR2D2 and TAR3D2 modules to identify the image patch corresponding to the second object and generate its 3D world coordinate, respectively. We also include a DISP2 module to calculate the displacement between the end-effector and the second object.
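Conceptually, the extension amounts to appending new entries to the module cascade and retraining from the point of modification. The sketch below is illustrative only; the module specifications and the add_module_spec interface are hypothetical, not the exact configuration used.

relational_modules = [
    # name,     query register source,   supervised attention target
    ("LANG2",  "new register",           "referential noun phrase tokens"),
    ("TAR2D2", "r_LANG2",                "image patch of the second object"),
    ("TAR3D2", "r_TAR2D2",               "3D position of the second object"),
    ("DISP2",  "r_EE3D and r_TAR3D2",    "displacement to the second object"),
]

def extend_hierarchy(policy, new_modules):
    """Append new modules to the cascade; earlier modules keep their weights."""
    for name, query_src, target in new_modules:
        policy.add_module_spec(name, query=query_src, attention_target=target)  # hypothetical API
    return policy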
In this scenario, the robot is directed by a verbal sentence to identify the first object, pick it up, recognize the second object, and then place the first object either to the left or right of the second object according to the command given.

The entire process is carried out in the MuJoCo environment, evaluated on 100 test trials. For comparison, we also train and evaluate the BC-Z model. Our model achieves a success rate of 76%, which is a 7% improvement over the BC-Z performance. Considering the increased complexity of this task compared to previous ones—due to the need to identify two objects from both the sentence and image, and the more extended manipulation steps required—a 76% success rate and a 7% improvement over the baseline are commendable results.

6 Discussion and limitations

The above experiments show a variety of benefits of the introduced modular approach. On one hand, it allows for new components and behaviors to be incorporated into an existing policy. This property is particularly appealing in robotics, since many popular robot control architectures are based on the concept of modular building-blocks, e.g., behavior-based robotics (Arkin, 1998) and the subsumption architecture (Brooks, 1986). Modularity also enables the user to employ modern verification and runtime monitoring tools to better understand and debug the decision-making of the system. At the same time, the overall system is still end-to-end differentiable and was shown in the above experiments to yield practical improvements in sample-efficiency, robustness and extensibility.

However, a major assumption made in our approach is that a human expert correctly identifies the logical flow of components and subtasks into which a task can be divided. This process requires organizing these subtasks into a hierarchical cascade. Early results indicate that an inadequate decomposition can hamper, rather than improve, learning. Furthermore, the approach does not incorporate memory and therefore cannot perform sequential actions. In a few cases we observed a failure to stop after finishing a manipulation: the robot continues with random actions. Another open question is the scalability of the approach. In our investigations, we looked at behaviors with a small number of sub-tasks. Is it possible to scale the approach to hierarchies with hundreds or thousands of nodes? The prospect is appealing, since this would bridge the divide between the expressiveness and plasticity of neural networks and the ability to create larger robot control systems which require the interplay of many subsystems.

For future work, we are particularly interested in using unsupervised and supervised attention side-by-side, i.e., several modules may be supervised by the human expert whereas other modules are adjusted in an unsupervised fashion. This would combine the best of both worlds, namely the ability to provide human structure and knowledge while at the same time maximally profiting from the network's plasticity. This is a particularly promising direction, since the ablation experiments indicate that having superfluous modules does not drastically alter the network performance. Further, we would like to investigate the potential of inferring a suitable hierarchy in a data-driven manner.

7 Conclusions

In this paper, we present a data-efficient approach for learning language-conditioned policies for robot manipulation tasks. We introduce a novel method called Hierarchical Modularity, and adopt supervised attention, to train a set of reusable sub-modules. This approach maintains the advantages of end-to-end learning while promoting the reusability of the learned sub-modules. As a result, we are able to customize the hierarchy according to specific task demands, or integrate new modules into an existing hierarchy for new tasks.
Table 9 The noun phrase template (columns: Object, Adj, Noun)
Table 11 Sentences collected from annotators for evaluation purposes (columns: Annotator labeled sentences, Success)
References

Abolghasemi, P., Mazaheri, A., Shah, M., et al. (2019). Pay attention!—Robustifying a deep visuomotor policy through task-focused visual attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4254–4262).
Ahn, M., Brohan, A., Brown, N., et al. (2022). Do as I can, not as I say: Grounding language in robotic affordances. arXiv:2204.01691
Alayrac, J. B., Donahue, J., Luc, P., et al. (2022). Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35, 23716–23736.
Anderson, P., Shrivastava, A., Parikh, D., et al. (2019). Chasing ghosts: Instruction following as Bayesian state tracking. In Advances in neural information processing systems (Vol. 32).
Antol, S., Agrawal, A., Lu, J., et al. (2015). VQA: Visual question answering. In Proceedings of the IEEE international conference on computer vision (pp. 2425–2433).
Argall, B. D., Chernova, S., Veloso, M., et al. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5), 469–483.
Arkin, R. (1998). Behavior-based robotics. The MIT Press.
Bengio, Y., Louradour, J., Collobert, R., et al. (2009). Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, ICML '09 (pp. 41–48). Association for Computing Machinery. https://doi.org/10.1145/1553374.1553380
Brooks, R. (1986). A robust layered control system for a mobile robot. IEEE Journal on Robotics and Automation, 2(1), 14–23. https://doi.org/10.1109/JRA.1986.1087032
Carion, N., Massa, F., Synnaeve, G., et al. (2020). End-to-end object detection with transformers. In European conference on computer vision (pp. 213–229). Springer.
Chen, Y. C., Li, L., Yu, L., et al. (2020). UNITER: Universal image-text representation learning. In Computer vision—ECCV 2020, Part XXX (pp. 104–120). Springer. https://doi.org/10.1007/978-3-030-58577-8_7
Coates, A., Abbeel, P., & Ng, A. Y. (2009). Apprenticeship learning for helicopter control. Communications of the ACM, 52(7), 97–105.
Csordás, R., van Steenkiste, S., & Schmidhuber, J. (2021). Are neural nets modular? Inspecting functional modularity through differentiable weight masks. In International conference on learning representations.
Das, A., Kottur, S., Gupta, K., et al. (2017). Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 326–335).
de Boer, P. T., Kroese, D. P., Mannor, S., et al. (2004). A tutorial on the cross-entropy method. Annals of Operations Research, 134, 19–67.
Dillmann, R., & Friedrich, H. (1996). Programming by demonstration: A machine learning approach to support skill acquisition for robots. In International conference on artificial intelligence and symbolic mathematical computing (pp. 87–108). Springer.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929
Duan, Y., Andrychowicz, M., Stadie, B., et al. (2017a). One-shot imitation learning. In Advances in neural information processing systems (Vol. 30). Curran Associates, Inc.
Duan, Y., Andrychowicz, M., Stadie, B., et al. (2017b). One-shot imitation learning. In Advances in neural information processing systems (Vol. 30).
Filan, D., Hod, S., Wild, C., et al. (2020). Neural networks are surprisingly modular. arXiv:2003.04881
Huang, W., Xia, F., Xiao, T., et al. (2022). Inner monologue: Embodied reasoning through planning with language models. arXiv:2207.05608
Jang, E., Irpan, A., Khansari, M., et al. (2022). BC-Z: Zero-shot task generalization with robotic imitation learning. In Conference on robot learning, PMLR (pp. 991–1002).
Johnson, J., Hariharan, B., Van Der Maaten, L., et al. (2017). CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2901–2910).
Kamath, A., Singh, M., LeCun, Y., et al. (2021). MDETR—Modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 1780–1790).
Khatib, O. (1986). The potential field approach and operational space formulation in robot control. In Adaptive and learning systems (pp. 367–377). Springer.
Kottur, S., Moura, J. M., Parikh, D., et al. (2018). Visual coreference resolution in visual dialog using neural module networks. In Proceedings of the European conference on computer vision (ECCV) (pp. 153–169).
Kuo, Y. L., Katz, B., & Barbu, A. (2020). Deep compositional robotic planners that follow natural language commands. In 2020 IEEE international conference on robotics and automation (ICRA) (pp. 4906–4912). IEEE.
Laina, I., Rupprecht, C., & Navab, N. (2019). Towards unsupervised image captioning with shared multimodal embeddings. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV).
Liu, L., Utiyama, M., Finch, A., et al. (2016). Neural machine translation with supervised attention. In Proceedings of COLING 2016 (pp. 3093–3102). The COLING 2016 Organizing Committee. https://aclanthology.org/C16-1291
Locatello, F., Weissenborn, D., Unterthiner, T., et al. (2020). Object-centric learning with slot attention. Advances in Neural Information Processing Systems, 33, 11525–11538.
Lu, J., Batra, D., Parikh, D., et al. (2019). ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in neural information processing systems (Vol. 32).
Lynch, C., & Sermanet, P. (2021). Language conditioned imitation learning over unstructured data. In Proceedings of robotics: Science and systems.
Maeda, G., Ewerton, M., Lioutikov, R., et al. (2014). Learning interaction for collaborative tasks with probabilistic movement primitives. In 2014 IEEE-RAS international conference on humanoid robots (pp. 527–534). IEEE.
Mees, O., Hermann, L., Rosete-Beas, E., et al. (2022). CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3), 7327–7334.
Nair, S., Rajeswaran, A., Kumar, V., et al. (2022). R3M: A universal visual representation for robot manipulation. arXiv:2203.12601
OpenAI. (2023). GPT-4 technical report. arXiv:2303.08774
Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155
Pettersson, O. (2005). Execution monitoring in robotics: A survey. Robotics and Autonomous Systems, 53(2), 73–88.
Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language supervision. arXiv:2103.00020
Rahmatizadeh, R., Abolghasemi, P., Bölöni, L., et al. (2018). Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. In 2018 IEEE international conference on robotics and automation (ICRA) (pp. 3758–3765). IEEE.
Ranftl, R., Lasinger, K., Hafner, D., et al. (2022). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), 1623–1637.
Reed, S., Zolna, K., Parisotto, E., et al. (2022). A generalist agent. arXiv:2205.06175
Rombach, R., Blattmann, A., Lorenz, D., et al. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684–10695).
Schaal, S. (1999). Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6), 233–242.
Schaal, S. (2006). Dynamic movement primitives—A framework for motor control in humans and humanoid robotics. In Adaptive motion of animals and machines (pp. 261–280). Springer.
Shridhar, M., Manuelli, L., & Fox, D. (2021). CLIPort: What and where pathways for robotic manipulation. arXiv:2109.12098
Singh, A., Hu, R., Goswami, V., et al. (2022). FLAVA: A foundational language and vision alignment model. arXiv:2112.04482
Sorkine, O., Cohen-Or, D., Lipman, Y., et al. (2004). Laplacian surface editing. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH symposium on geometry processing, SGP '04 (pp. 175–184). Association for Computing Machinery. https://doi.org/10.1145/1057432.1057456
Stepputtis, S., Campbell, J., Phielipp, M., et al. (2020). Language-conditioned imitation learning for robot manipulation tasks. arXiv:2010.12083
Tan, H., & Bansal, M. (2019). LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of EMNLP-IJCNLP 2019 (pp. 5099–5110). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1514
Todorov, E., Erez, T., & Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems (pp. 5026–5033). IEEE.
Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008).
Vemprala, S., Bonatti, R., Bucker, A., et al. (2023). ChatGPT for robotics: Design principles and model abilities.
Vinyals, O., Toshev, A., Bengio, S., et al. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Wang, Y., Mishra, S., Alipoormolabashi, P., et al. (2022). Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 conference on empirical methods in natural language processing (pp. 5085–5109).
Xie, F., Chowdhury, A., De Paolis Kaluza, M. C., et al. (2020). Deep imitation learning for bimanual robotic manipulation. In Advances in neural information processing systems (Vol. 33, pp. 2327–2337). Curran Associates, Inc. https://proceedings.neurips.cc/paper/2020/file/18a010d2a9813e91907ce88cd9143fdf-Paper.pdf
Xu, K., Ba, J., Kiros, R., et al. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, PMLR (pp. 2048–2057).
Yamaguchi, T., & Fainekos, G. (2021). PerceMon: Online monitoring for perception systems. In Runtime verification: 21st international conference, RV 2021 (p. 297). Springer.
Zhang, T., McCarthy, Z., Jow, O., et al. (2018a). Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE international conference on robotics and automation (ICRA) (pp. 5628–5635). https://doi.org/10.1109/ICRA.2018.8461249
Zhang, T., McCarthy, Z., Jow, O., et al. (2018b). Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE international conference on robotics and automation (ICRA) (pp. 5628–5635). IEEE.
Zhou, Y., Sonawani, S., Phielipp, M., et al. (2022). Modularity through attention: Efficient training and transfer of language-conditioned policies for robot manipulation. arXiv:2212.04573
Zhu, D., Chen, J., Shen, X., et al. (2023). MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592
Zhu, Y., Wong, J., Mandlekar, A., et al. (2020). robosuite: A modular simulation framework and benchmark for robot learning. arXiv:2009.12293
Zirr, T., & Ritschel, T. (2019). Distortion-free displacement mapping. Computer Graphics Forum. https://doi.org/10.1111/cgf.13760

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Yifan Zhou is a PhD student at Arizona State University, working in the Interactive Robotics Lab with Prof. Heni Ben Amor. His main focus is language-conditioned imitation learning for robot manipulation. Prior to that, he received his master's degree from Carnegie Mellon University and his bachelor's degree from Southwest Jiaotong University.

Shubham Sonawani is a PhD student in Electrical Engineering at Arizona State University. He embarked on his academic journey at the Interactive Robotics Lab in Fall 2018. Prior to his endeavors at ASU, Shubham earned his Bachelor of Technology in Electrical Engineering from VJTI, India. His PhD research focuses on the confluence of Human-Robot Interaction, Grasping, Manipulation, and Mixed Reality. The primary objective of his research is to enhance collaboration between humans and robots by leveraging a range of information modalities, including but not limited to projection mapping, computer vision, and natural language processing.