Learning Modular Language-Conditioned Robot Policies Through Attention
https://doi.org/10.1007/s10514-023-10129-1
Received: 2 May 2023 / Accepted: 26 July 2023 / Published online: 30 August 2023
© The Author(s) 2023
Abstract
Training language-conditioned policies is typically time-consuming and resource-intensive. Additionally, the resulting controllers are tailored to the specific robot they were trained on, making it difficult to transfer them to other robots with different dynamics. To address these challenges, we propose a new approach called Hierarchical Modularity, which enables more efficient training and subsequent transfer of such policies across different types of robots. The approach incorporates Supervised Attention, which bridges the gap between modular and end-to-end learning by enabling the re-use of functional building blocks. In this contribution, we build upon our previous work, showcasing the extended utility and improved performance obtained by expanding the hierarchy to include new tasks and by introducing an automated pipeline for synthesizing a large quantity of novel objects. We demonstrate the effectiveness of this approach through extensive simulated and real-world robot manipulation experiments.
Yifan Zhou (corresponding author): [email protected]
Shubham Sonawani: [email protected]
Mariano Phielipp: [email protected]
Heni Ben Amor: [email protected]
Simon Stepputtis: [email protected]

1 School of Computing and Augmented Intelligence, Arizona State University, Tempe, AZ, USA
2 Intel AI, Phoenix, AZ, USA
3 The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA

1 Introduction

The word robot was introduced and popularized in the Czech play "Rossum's Universal Robots", also known as R.U.R. In this seminal piece of theatre, robots understand and carry out a variety of verbal human instructions. Roboticists and AI researchers have long been striving to create machines with such an ability to turn natural language instructions into physical actions in the real world (Jang et al., 2022; Stepputtis et al., 2020; Lynch and Sermanet, 2021; Ahn et al., 2022; Shridhar et al., 2021). However, this task requires robots to interpret instructions in the current situational and behavioral context in order to accurately reflect the intentions of the human partner. Achieving such inference and decision-making capabilities demands a deep integration of multiple data modalities—specifically, the intersection of vision, language, and motion. Language-conditioned imitation learning (Lynch and Sermanet, 2021; Stepputtis et al., 2020) is a technique that can help address these challenges by jointly learning perception, language understanding, and control in an end-to-end fashion.

However, a significant drawback of this approach is that, once trained, these language-conditioned policies are only applicable to the specific robot they were trained on. This is because end-to-end policies are monolithic in nature, which means that robot-specific aspects of the task, such as kinematic structure or visual appearance, cannot be individually targeted and adjusted. While it is possible to retrain the policy on a new robot, this comes with the risk of catastrophic forgetting and substantial computational overhead. Similarly, adding new aspects, behaviors, or elements to the task may also require a complete retraining.
Fig. 1 Our proposed method demonstrates high performance on a variety of tasks. It is able to transfer to new robots in a data-efficient manner, while still keeping a high execution performance. It also supports adding new behaviors to an existing trained policy. Beyond this, we also demonstrate the ability to learn relational tasks, where two objects are involved in the same sentence
This paper tackles the problem of creating modular language-conditioned robot policies that can be re-structured, extended and selectively retrained. Figure 1 depicts a set of scenarios that we want to address in this paper. For example, we envision an approach which allows for the efficient repurposing and transfer of a policy to a new robot. We also envision situations in which a new behavior may be added to an existing policy, e.g., incorporating obstacle avoidance into an existing motion primitive. Similarly, we envision situations in which the type of behavior is changed by incorporating additional modules into a policy, e.g., following human instructions that define a relationship between multiple objects, such as, "Put the apple left of the orange!".

However, the considered modularity is at odds with the monolithic nature of end-to-end deep learning. To overcome this challenge, the paper proposes an attention-based methodology for learning reusable building blocks, or modules, that realize specialized sub-tasks. In particular, we discuss supervised attention, which allows the user to guide the training process by focusing the attention of a sub-network (or module) on certain input-output variables. By imposing a specific locus of attention, individual sub-modules can be guided to realize an intended target functionality. Another contribution, called hierarchical modularity, is a training regime inspired by curriculum learning that aims to decompose the overall learning process into individual subtasks. This approach enables neural networks to be trained in a structured fashion, maintaining a degree of modularity and compositionality.

Our contributions summarize and extend our prior work in Zhou et al. (2022) as follows: (1) we propose a sample-efficient approach for training language-conditioned manipulation policies that allows for rapid transfer across different types of robots; (2) we introduce a novel method, based on two components called hierarchical modularity and supervised attention, that bridges the divide between modular and end-to-end learning and enables the reuse of functional building blocks; (3) we demonstrate that our method outperforms the current state-of-the-art methods [BC-Z (Jang et al., 2022) and LP (Stepputtis et al., 2020)]; (4) we extend the methodology by creating more complex tasks that incorporate obstacle avoidance and relational instruction following. Finally, we also perform an extensive number of experiments that shed light on the generalization properties of our methodology from different angles, e.g., dealing with occlusions, synonyms, variable objects, etc. (Fig. 1).

2 Preamble: how generative AI helped write the paper

This paper largely centers around the training of generative models at the intersection of vision, language and robot control. Besides being the topic of the paper, generative models have also been instrumental in writing this paper. In particular, we incorporated such techniques into both (a) the text editing process when writing the manuscript, as well as (b) the process of generating 3D models and textures of manipulated objects.

For text editing, we utilized GPT-4 (OpenAI, 2023) to iteratively revise and refine our initial drafts, ensuring improved readability and clarity of the concepts discussed. We achieved this by conducting prompt engineering and formulating a specific prompt as follows:

"Now you are a professor at a top university, studying computer science, robotics and artificial intelligence. Could you please help me rewrite the following text so that it is of high quality, clear, easy to read, well written and can be published in a top level journal? Some of the paragraphs might lack critical information. If you notice that, could you please let me know? Let's do back and forth discussions on the writing and refine the writing."

We initiate each conversation with this impersonation prompt, followed by our draft text. GPT-4 then returns a revised version of the text, ensuring the semantics remain unaltered while updating the literary style to incorporate professional terminology and wording, as well as a clear logical flow. This prompt also encourages GPT-4 to solicit feedback on the revised text, thus facilitating back-and-forth conversations. We manually determine when a piece of writing has been fine-tuned to a satisfactory degree and bring the conversation to a close.
With regard to the generation of 3D models and assets, we created a new pipeline for the automated synthesis of complete polygonal meshes. Figure 2 (top row) depicts the individual steps of this process. First, we synthesize an image of the intended asset using latent diffusion models (Rombach et al., 2022). We provide as input to the model a textual description of the asset, e.g., "A front image of an apple and a white background.". In turn, the resulting image is fed into a monocular depth-estimation algorithm (Ranftl et al., 2022) to generate the corresponding depth map. At this stage, each pixel in the image has both (1) a corresponding depth value and (2) an associated RGB texture value. To generate a 3D object, we take a flat mesh grid of the same resolution as the synthesized RGB image. We then perform displacement mapping (Zirr and Ritschel, 2019) based on the values present in the depth image. Within this process, each point of the originally flat grid is elevated or depressed according to its depth value. The result is a 3D model representing the front half of the target object. For the sake of this paper, we assume plane symmetry—a feature that is common among a large number of household objects. Accordingly, we can mirror the displacement map in order to yield the occluded part of the object. Finally, we also apply a Laplacian smoothing operation (Sorkine et al., 2004) to the final object. Texturing information is retrieved from the source image. This automated 3D synthesis process allows us to rapidly generate a potentially infinite number of variants of an object. This is particularly useful when studying the generalization capabilities of a model. It also completely removes any 3D modeling or texturing burden. At the moment, the pipeline is limited to symmetric objects.
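The geometric core of this pipeline can be summarized in a short script. The following is a minimal sketch, not the implementation used in the paper, that assumes the synthesized RGB image and its estimated depth map are already available as arrays; grid vertices are displaced by depth, mirrored to form the back half, and smoothed with a simple uniform Laplacian filter.

import numpy as np

def depth_to_symmetric_mesh(depth, scale=1.0, smooth_iters=5, lam=0.5):
    """Build a plane-symmetric vertex grid from an HxW depth map.

    Front half: a flat HxW grid displaced along +z by normalized depth.
    Back half: the mirrored displacement along -z (plane-symmetry assumption).
    A uniform Laplacian filter smooths the displaced height field.
    """
    H, W = depth.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")

    # Displacement mapping: elevate each grid point by its normalized depth.
    z = scale * (depth - depth.min()) / max(float(np.ptp(depth)), 1e-8)

    front = np.stack([xs, ys, +z], axis=-1).astype(np.float32)
    back = np.stack([xs, ys, -z], axis=-1).astype(np.float32)  # mirrored half

    # Simple uniform Laplacian smoothing on the front-half height field.
    for _ in range(smooth_iters):
        padded = np.pad(front[..., 2], 1, mode="edge")
        neighbor_avg = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                        padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
        front[..., 2] += lam * (neighbor_avg - front[..., 2])
        back[..., 2] = -front[..., 2]

    # Vertices of both halves; RGB texture is looked up per (y, x) pixel.
    return np.concatenate([front.reshape(-1, 3), back.reshape(-1, 3)], axis=0)

In the full pipeline, the depth map would come from a monocular depth estimator such as the one of Ranftl et al. (2022) applied to the diffusion-generated image, and the vertex grid would additionally be triangulated and textured with the source RGB values before export.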
3 Related work

Imitation learning offers a straightforward and efficient method for learning agent actions based on expert demonstrations (Dillmann and Friedrich, 1996; Schaal, 1999; Argall et al., 2009). This approach has proven effective in diverse tasks including helicopter flight (Coates et al., 2009), robot control (Maeda et al., 2014), and collaborative assembly. Recent advancements in deep learning have enabled the acquisition of high-dimensional inputs, such as vision and language data (Duan et al., 2017a; Zhang et al., 2018a; Xie et al., 2020)—partially stemming from improvements in image and video understanding domains (Lu et al., 2019; Kamath et al., 2021; Chen et al., 2020; Tan and Bansal, 2019; Radford et al., 2021; Dosovitskiy et al., 2021), but also in language comprehension (Wang et al., 2022; Ouyang et al., 2022). Specifically, the work presented in Radford et al. (2021) paved the way for multimodal language and vision alignment. The generalizability of such large multimodal models (Singh et al., 2022; Alayrac et al., 2022; Ouyang et al., 2022; Zhu et al., 2023) enables a variety of downstream tasks, including image captioning (Laina et al., 2019; Vinyals et al., 2015; Xu et al., 2015), visual question answering systems (VQA) (Antol et al., 2015; Johnson et al., 2017), and multimodal dialog systems (Kottur et al., 2018; Das et al., 2017).
Fig. 3 Overview: different input modalities, i.e., vision, joint angles and language, are fed into a language-conditioned neural network to produce robot control values. The network is set up and trained in a modular fashion—individual modules address sub-aspects of the task. The neural network can efficiently be trained and transferred onto other robots and environments (e.g. Sim2Real)
4.2 Training modular language-conditioned policies

Our overall method is illustrated in Fig. 3. First, the camera image I is processed together with a natural language instruction s and the robot's proprioceptive data (i.e., joint angles) through modality-specific encoders to generate their respective embeddings. The resulting embeddings are subsequently supplied as input tokens to a transformer-style (Vaswani et al., 2017) neural network consisting of multiple attention layers. This neural network is responsible for implementing the overall policy π_θ and produces the final robot control signals.

The encoding process ensures that distinct input modalities, e.g., language, vision and motion, can effectively be integrated within a single model. To that end, Vision Encodings e_I = f_V(I) are generated using an input image I ∈ R^(H×W×3). Taking inspiration from (Carion et al., 2020; Locatello et al., 2020), we maintain the original spatial structure while encoding the image into a sequence of lower-resolution image tokens. The resolution is reduced via a convolutional neural network while increasing the number of channels, yielding e′_I ∈ R^((H/s)×(W/s)×d), with s representing a scaling factor and d denoting the embedding size. Consequently, the low-resolution pixel tokens are transformed into a sequence of tokens e_I ∈ R^(Z×d), where Z = (H × W)/s^2, through a flattening operation. By contrast, Language Encodings e_s = f_L(s) ∈ R^(1×d) are produced via a pre-trained and fine-tuned CLIP (Radford et al., 2021) model. Particularly, each instruction s is represented as a sequence of words [w_0, w_1, ..., w_n] in which each word w_i is a member of vocabulary W. During training, we employ automatically generated, well-formed sentences; however, after training, we allow any free-form verbal instruction that is presented to the model, including sentences affected by typos or bad grammar. Finally, Joint Encodings e_j = f_J(a) ∈ R^(1×d) are created by transforming the current robot state a into a latent representation using a simple multi-layer perceptron. The main purpose of this step is to transform the joint representation into a compatible shape that aligns with the other input embeddings.
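As a rough illustration of how the three per-modality encoders can be combined into a single token sequence, the sketch below is our own simplification with illustrative module names and dimensions: a small CNN downsamples the image by the factor s and flattens the result into Z visual tokens, to which one language token and one joint-state token of the same width d are appended. The CLIP sentence embedding is assumed to be computed elsewhere.

import torch
import torch.nn as nn

class MultimodalTokenizer(nn.Module):
    """Sketch: encode image, language, and joint state into one token sequence."""

    def __init__(self, d=128, lang_dim=512, joint_dim=7):
        super().__init__()
        # Vision encoder f_V: reduce resolution by s=8 while raising channels to d.
        self.f_v = nn.Sequential(
            nn.Conv2d(3, d // 2, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv2d(d // 2, d, kernel_size=2, stride=2),
        )
        # Language encoder f_L: projection of a precomputed CLIP sentence embedding.
        self.f_l = nn.Linear(lang_dim, d)
        # Joint encoder f_J: simple MLP over the robot state.
        self.f_j = nn.Sequential(nn.Linear(joint_dim, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, image, clip_sentence_emb, joints):
        # image: (B, 3, H, W) -> (B, d, H/8, W/8) -> (B, Z, d) with Z = HW/64
        e_i = self.f_v(image).flatten(2).transpose(1, 2)
        e_s = self.f_l(clip_sentence_emb).unsqueeze(1)   # (B, 1, d)
        e_j = self.f_j(joints).unsqueeze(1)              # (B, 1, d)
        return torch.cat([e_i, e_s, e_j], dim=1)         # full token sequence

tokens = MultimodalTokenizer()(torch.rand(2, 3, 224, 224),
                               torch.rand(2, 512), torch.rand(2, 7))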
4.2.1 Supervised attention

After encoding, the inputs are processed within a single neural network in order to produce robot control actions. However, a unique element of our approach is the formation of semantically meaningful sub-modules during the learning process. These modules may solve a specific sub-task, e.g., detecting the robot end-effector or calculating the distance between the robot and the target object. To achieve this effect, we build upon modern attention mechanisms (Vaswani et al., 2017) in order to manage the flow of information within attention layers, thereby explicitly guiding the network to concentrate on essential inputs.

More specifically, we adopt a supervised attention mechanism in order to enable user-defined information routing and the formation of modules within an end-to-end neural network. The main idea underlying this mechanism is that information about optimal token pairings may be available to the user. In other words, if we know which key tokens are important for the queries to look at, we can treat their similarity score as a maximization goal. In Fig. 4, we see the information routing for three modules. The first module LANG is supposed to identify the target object within the sentence. Hence, the corresponding attention layer is trained to focus only on the language input; the attention for the robot joint values and vision input is trained to be zero. In order to provide the output of this module to the attention layer in the next level, we use so-called register slots. Register slots are used to store the output of a module so that it can be accessed in subsequent modules in the hierarchy. Accordingly, each module within our method has corresponding register slot tokens. The role of the register slots is to provide access to the output of previously executed modules within the hierarchy. Coming back to Fig. 4, the second module EE2D locates the robot end-effector within the camera image.
Fig. 4 Different sub-aspects of the tasks are implemented as modules (via supervised attention). LANG identifies the target object. EE2D locates
the robot end-effector
Within each attention layer, queries Q, keys K, and values V are combined via scaled dot-product attention (Vaswani et al., 2017):

Attention(Q, K, V) = softmax(Q K^T / √d_k) V,    (1)

where d_k is the dimensionality of the keys. In our use case, the queries are initialized with either learnable and previously unused register slots, or with registers that have been set by modules operating in prior layers, thus encoding their respective results. Our keys are equivalent to the values and are initialized with all formal inputs (language, vision, and joint embeddings) as well as all previously set registers from prior layers. In contrast to common practice, we control the information flow when learning each module via our proposed supervised attention, which is a specific optimization target for attention layers.

As an illustrative example, consider a query identifying the location of the end-effector, as demonstrated in Fig. 5 (first key and query combination in the top left), or finding the target object (key and query combination near the center). For simplicity, we omit the other formal inputs and only focus on the visual input. However, the Tar Reg. would also depend on the language register from the prior LANG module. Following common practice, the keys and values derive from the input image, with each image embedding vector corresponding to an image patch (Fig. 5 left). In this particular example, the EE uses a trainable, previously unused register as query, while the Tar register utilizes the output register of the language module to find the correct object (Fig. 5 top). The EE register is supervised to focus on the robot's gripper image patch, thereby creating a sub-module for detecting the robot end-effector. Similarly, the target register attends to the target object's image patch, forming a sub-module responsible for identifying the target object. When these queries accurately attend to their respective patches, these patches will primarily contribute to the output register's embedding vector, which can then be used as subsequent module inputs. More formally, we maximize the similarity between query q_i and key k_j if a connection should exist, thus optimizing Eq. 1 with keys being the set of original sensor modalities, as well as registers.
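One possible realization of this supervision signal, sketched here with illustrative names rather than the exact loss formulation of the paper, is to compute the standard attention of Eq. 1 and add a routing loss that maximizes the attention mass each query places on its user-specified set of allowed keys:

import torch

def supervised_attention(q, k, v, allowed):
    """Scaled dot-product attention with a supervised routing loss.

    q: (B, Q, d) queries (register slots); k, v: (B, K, d) keys/values;
    allowed: (Q, K) binary mask, allowed[i, j] = 1 if query i should attend to key j.
    """
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5      # logits of Eq. (1)
    attn = scores.softmax(dim=-1)                      # (B, Q, K)
    out = attn @ v                                     # module output embeddings

    # Supervision: maximize the attention mass placed on the allowed keys.
    mass_on_allowed = (attn * allowed).sum(dim=-1)     # (B, Q)
    routing_loss = -torch.log(mass_on_allowed + 1e-8).mean()
    return out, routing_loss

During training, such a routing loss would be added to the module's task-specific loss; at convergence, each register's output is dominated by its intended inputs (e.g., the EE register by the gripper's image patch).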
In the first layer of Fig. 6, the LANG module identifies the target object, as referred to in the verbal command, and stores the result in the r_LANG register. Subsequently, in the second layer, the f_TAR2D module utilizes the r_LANG register as a query, while the f_EE2D module utilizes a new, previously unused register as a query. This chain continues until the final control output of the robot is generated in the CTRL module.

Recall that sub-modules address intermediate tasks in the overarching control problem, making the output register r suitable for human interpretation and allowing for supervised training of the resulting embedding. To achieve this, we employ small multi-layer perceptron (MLP) decoders to convert the module outputs into their respective numeric outputs. For example, we train a small MLP on top of the r_EE2D register that predicts the end-effector location (ee_x, ee_y) via a single linear transformation. This approach enables our policy to predict intermediate module outputs, enhancing training accuracy and allowing monitoring and debugging during inference, which is particularly valuable when transferring the policy to different robots or scenarios.
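Such a decoder head can be as small as a single linear layer; the snippet below is a sketch with hypothetical class and parameter names, not the exact architecture used in the paper.

import torch.nn as nn

class RegisterDecoder(nn.Module):
    """Decode a module's register embedding into an interpretable quantity,
    e.g., the 2D end-effector location (ee_x, ee_y) from the r_EE2D register."""

    def __init__(self, d=128, out_dim=2):
        super().__init__()
        self.head = nn.Linear(d, out_dim)   # single linear transformation

    def forward(self, register_embedding):  # (B, d) -> (B, out_dim)
        return self.head(register_embedding)

Analogous heads can decode the TAR2D/TAR3D positions and the DISP displacement vector; their prediction errors correspond to the intermediate metrics reported in the evaluation.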
Training Cascaded Modules
Intuitively, the cascaded modules can be trained in a manner inspired by curriculum learning, wherein each component is trained before further layers of the hierarchy are added to the training objective. This ensures that each module is trained until convergence before being employed for more sophisticated decision-making processes, ultimately leading to the prediction of robot control parameters. Algorithm 1 outlines the training procedure for our hierarchical approach in further detail. The algorithm trains each module of the hierarchy one after another, until the currently trained module has converged according to its respective loss function. After that, we progressively incorporate additional modules in a manner reminiscent of curriculum learning. Each module k is trained with an attention loss L_k given the supervision signal S_k of our proposed supervised attention approach, as well as a task-specific loss function ℓ_k which trains the MLP decoder for every module. Thus, each module is optimized with regard to two targets. Note that the policy loss for the robot controller CTRL is also implemented via an MLP decoder, which also represents the overall prediction target of our training process. Notably, in our scenario, this decoder predicts the next ten goal positions at each timestep instead of predicting only the next action. This choice is inspired by Jang et al. (2022), which also allows for a fair comparison in the subsequent evaluation sections.

Algorithm 1 Hierarchical Modularity: training algorithm returns network weights θ.
  Input: D, {S_k}_{k=1..K}, {L_k}_{k=1..K}, {ℓ_k}_{k=1..K}
  Output: Weights θ
  for subtask k ← 1 to K do
    while not converged do
      E_k ← Σ_{t=0..k} L_t(S_t) + ℓ_t
      θ ← Train(D, {S_1, ..., S_k}, E_k)
    end while
  end for
  return θ

While the modular approach requires manually defining loss terms for each module, it is essential to note that all modules form a single overarching neural network implementing the robot policy, inherently learning necessary features in an end-to-end manner. Modularization arises solely from training the network with various supervised attention targets and a cost function that successively integrates more sub-tasks.
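Algorithm 1 can be restated as a short training loop. The sketch below is schematic: attention_loss and task_loss are placeholders for the per-module losses L_k and ℓ_k, and the convergence check is replaced by a fixed epoch budget.

def train_hierarchy(policy, modules, data_loader, optimizer, max_epochs=100):
    """Train modules one after another; module k is optimized jointly with
    all previously trained modules 1..k (curriculum-style cost E_k)."""
    for k, _ in enumerate(modules, start=1):
        for _ in range(max_epochs):            # stand-in for "until converged"
            for batch in data_loader:
                outputs = policy(batch)        # one forward pass through all modules
                # E_k: cumulative supervised-attention and task losses up to module k
                loss = sum(m.attention_loss(outputs, batch) + m.task_loss(outputs, batch)
                           for m in modules[:k])
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return policy.state_dict()                 # the trained weights θ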
4.3 Use-cases and extensions of hierarchy

We present our model as a cascade of sub-modules, trained hierarchically, enabling seamless integration of additional modules. In this section, we discuss the incorporation of obstacle avoidance, tracking a predefined obstacle, and describe the generalization of this approach to arbitrary "referential objects" that let users specify commands that reference any other object. These enhancements are implemented by introducing new modules, as depicted in Fig. 8.

4.3.1 Runtime introspection

All sub-modules retain their functionality, even after training. Consequently, they can be used at runtime to query individual outputs (e.g., LANG, TAR2D, EE3D). This feature allows users to monitor the intermediate computations of the end-to-end network to identify potential deviations and misclassifications. Figure 7 visually depicts the outputs of each module during the execution on a real robot. A textual description (upper left corner) shows the currently identified object name, as well as the displacement (in cm) between the end-effector and the target object. The current attention map is visualized in yellow, whereas the end-effector position and the target position are highlighted by red and blue points. Computing these intermediate outputs of the network generates negligible to no computation overhead. In our specific system, we implemented a real-time visualization tool that can be used at all times to monitor the above features. Such tools for introspection can help in debugging and troubleshooting of the language-conditioned policy. For example, they can be used to detect when individual modules need to be retrained, or where in the hierarchy a problem is manifesting. In addition, such outputs can be used with formal runtime monitoring and verification systems, e.g., Yamaguchi and Fainekos (2021) and Pettersson (2005), to improve the safety of the neural network policy.
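In practice, querying these intermediate outputs can be as simple as running the decoder heads during each control step and logging their values. The following sketch uses hypothetical output keys rather than the exact interface of any particular implementation.

def introspect(policy, image, instruction, joints):
    """Query intermediate module outputs alongside the control prediction."""
    outputs = policy(image, instruction, joints)       # single forward pass
    report = {
        "target_object": outputs["LANG"],              # decoded object name
        "ee_2d": outputs["EE2D"],                      # end-effector pixel location
        "target_3d": outputs["TAR3D"],                 # target position estimate
        "displacement": outputs["DISP"],               # end-effector-to-target vector
        "controls": outputs["CTRL"],                   # next goal positions
    }
    print({k: v for k, v in report.items() if k != "controls"})
    return report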
Fig. 7 Sequence of real-time outputs of the network modules: the object name (white) and visual attention (yellow region), the length of the displacement (white text), the object pos (blue), and end-effector pos (red). All values are generated from a single network that also produces robot controls (Color figure online)
5.2 Model performance and baseline comparison

In this section, we evaluate our model on the three basic actions across the six Robosuite objects, utilizing the D_UR5 dataset. We also compare our method to two state-of-the-art baselines, specifically BC-Z (Jang et al., 2022) and LP (Stepputtis et al., 2020). As our third baseline, we investigate vanilla, unsupervised attention. In this scenario, the same network as before is trained, but without supervision of the attention process as introduced in this paper.

Table 3 summarizes these results, in which each training and testing procedure was executed three times to provide a better understanding of the stability of the compared methods. We evaluate not only the overall success rates but also the performance of each individual module within our language-conditioned policy. Specifically, we employ the following metrics: (1) Success Rate describes the percentage of successfully executed trials among the test set, (2) Target Object Position Error (TAR3D) measures the Euclidean 3D distance between the predicted target object position and the ground truth, (3) End Effector Position Error (EE3D) quantifies the Euclidean 3D distance between the predicted end effector position and the ground truth, and (4) Displacement Error (DISP) calculates the 3D distance between the predicted 3D displacement vector and the corresponding ground truth vector.
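These metrics reduce to simple Euclidean distances and a success ratio; a compact sketch with placeholder field names is given below.

import numpy as np

def evaluate(trials):
    """trials: list of dicts holding predicted/ground-truth vectors and a success flag."""
    dist = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
    success_rate = np.mean([t["success"] for t in trials])
    tar3d = np.mean([dist(t["tar3d_pred"], t["tar3d_gt"]) for t in trials])
    ee3d = np.mean([dist(t["ee3d_pred"], t["ee3d_gt"]) for t in trials])
    disp = np.mean([dist(t["disp_pred"], t["disp_gt"]) for t in trials])
    return {"success_rate": success_rate, "TAR3D": tar3d, "EE3D": ee3d, "DISP": disp}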
Our method (line 4) outperformed BC-Z (line 3) on all basic tasks with an average success rate of 82.4%, as compared to 73.1% for BC-Z. Furthermore, we separately assessed the prediction error of the proposed network's components, namely EE3D, TAR3D, and DISP. We note that the end-effector pose prediction accuracy (approximately 0.5 cm) surpasses the target object's accuracy, which could be attributed to the presence of the robot's joint state information. The target object's position estimation deviates by around 2-3 cm, possibly due to the absence of depth information in our input dataset (solely consisting of RGB).

By contrast, the LP model (line 1) is not able to successfully complete any of the tasks. We hypothesize that this low performance is due to the training dataset's significantly smaller size compared to the LP's usual training data size.
5.2.1 Ablations

In order to evaluate the impact of our two main contributions—supervised attention and hierarchical modularity—we conduct an ablation study to investigate the impact of each contribution on training performance. In addition, we also ablate the structure of the hierarchy itself in order to investigate its resiliency to structural changes.

Results of the ablation experiments can be found in Table 3. Our model (line 3) has an overall success rate of 82.4% across three seeds. When ablating the usage of hierarchical modularity, performance drops to 36.4% (line 5). Utilizing our runtime introspection approach to investigate potential issues in the modules (Sect. 4.3.1), we find that the target and displacement errors increased to over 20 cm, which is likely the cause for the reduced performance. When removing the supervision signal (line 4) for the attention inside our modules (and instead relying on end-to-end training), we see a drop of ≈ 2.5% in performance to about 80%.

When ablating the hierarchy itself, we merged the TAR2D and TAR3D modules (line 7) into a single module instead of maintaining two. The underlying rationale is that the separation of the target detection between 2D and 3D detection is not strictly necessary and thus a single target module may be sufficient. The resulting success rate in this case is 80.9%, which is only slightly below the original rate of 82.4%. Next, we removed the displacement module DISP (line 9) altogether, which results in a performance of about 61.3% (a loss of around 20%). Finally, we added spurious modules that are not necessary for the policy's success in these tasks (line 10). In particular, we added a specific module that only detects the "Coke" can. In this case, we achieved a success rate of 86.3%, which is slightly higher than the original result.

As a general observation, the approach appears resilient to superfluous modules, combined modules, or variations of the hierarchy. However, the absence of certain critical modules, e.g., the DISP or TAR modules (lines 9 and 8, respectively), may have a more drastic effect on performance. In the above case of removing the DISP module (line 9), the performance reduces to about 61.3%, which is below the corresponding value for BC-Z (73.1%).

5.2.2 Occlusion

Next, we evaluate the robustness of our approach to partial occlusions of the target objects during task execution. To this end, occlusions are introduced by removing image patches in the camera feed of the simulated experiments. This step is performed by covering approximately 20%, 42%, 68% and 80% of the target object's total area, calculated via a pixel-based segmentation approach of the input image provided by the simulator. All experiments are conducted on all six Robosuite objects across all three basic tasks. The results are shown in Fig. 10. We observe that our method is robust to occlusions of up to 20% of the target object, while our baseline model, BC-Z, already experiences a significant drop in accuracy. While our model only loses about 1.1% in performance, BC-Z drops by 9.35%. However, for occlusions greater than 40%, our method performs on-par with BC-Z. We argue that our robustness to 20% occlusions is significant since small, partial occlusions are more likely to occur during tabletop manipulation tasks.
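The occlusion procedure can be sketched as follows; this is our own approximation of the described step, in which a patch is grown over the object's segmentation mask until the requested fraction of its pixels is covered.

import numpy as np

def occlude_target(image, target_mask, fraction=0.2, fill=128):
    """Cover approximately `fraction` of the target object's pixels with a square patch."""
    target_mask = target_mask.astype(bool)
    ys, xs = np.nonzero(target_mask)                  # pixels belonging to the object
    cy, cx = int(ys.mean()), int(xs.mean())           # grow the patch from the centroid
    occluded = image.copy()
    for half in range(1, max(image.shape[:2])):
        patch = np.zeros_like(target_mask)
        patch[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half] = True
        covered = np.logical_and(patch, target_mask).sum() / target_mask.sum()
        if covered >= fraction:
            occluded[patch] = fill
            return occluded, covered
    occluded[target_mask] = fill                      # fallback: cover the whole object
    return occluded, 1.0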
5.2.3 Synonyms

Our final robustness experiment is concerned with the variability of free-form spoken language. While our system is trained with sentences from a template-based generator, we evaluate its performance when exposed to a set of additional synonyms, as well as free-form spoken language from a small set of human subjects. When replacing synonyms, as shown in Table 8, in the single-word and short-phrase case, we observe that our model achieves an 82.5% success rate on the pushing task. When using BC-Z on the same task with the same synonyms, performance drops to 28.57%, indicating the robustness of our method to variations in the language inputs. Finally, we also evaluate the performance on 30 examples of free-form natural language instructions that were collected from human annotators and report a success rate of 73.3%. The sentences used by the annotators can be found in Table 11 and show that our model can work with unconstrained language commands.
5.3 Transfer to different robots and real-world

In this section, we evaluate the ability of our approach to efficiently transfer policies between different robots that may have different morphologies. Rather than retraining our model from scratch to accommodate the altered dynamics between different robots, we posit that our modular approach enables the transfer of substantial portions of the prior policy. This necessitates only minimal fine-tuning, consequently resulting in a reduced demand for data collection on the different robots. In particular, we evaluate fine-tuning of the entire policy, and fine-tuning of only the modules affected by a change in visual appearance or robot morphology.

5.3.1 Transfer in simulation

Our initial policy is trained from scratch on the D_Kinova dataset, while the transfer of the trained policy to the Franka and UR5 robots is realized with the D_TF^Franka and D_TF^UR5 datasets, respectively.

As noted earlier, the D_TF datasets are intentionally over-provisioned to allow an evaluation regarding how much data is required in order to match the performance of the transferred policy to a policy that is trained from scratch on the same robot. In order to shed some light on this, we sub-sampled the transfer datasets to a total size of 80, 160, 240 and 320 demonstrations and conducted the training. Figure 11 shows the results of this analysis (reported as "Ours") given the varying dataset sizes when fine-tuning the entire policy initialized with the Kinova weights. With 160 demonstrations, our model achieves a success rate of 80%, which is only slightly below the policy's performance when trained on the full 1600 demonstrations from scratch. Further, given the full 320 demonstrations of the transfer dataset, the policy reaches a performance that is on-par with one trained from scratch. When fine-tuning BC-Z with the same dataset splits, we observe that our model consistently outperforms BC-Z. Interestingly, we also observe that our model performs similarly when transferring to the Franka and UR5 robots across the dataset splits, while BC-Z seems to initially perform worse when transferring to the Franka robot. Note here that the Franka is a 7 degree of freedom (DoF) robot while the source policy, which operates over the Kinova robot, only has six. This discrepancy likely affects robot dynamics, thereby affecting the transfer process.

Further, we conducted experiments in which we froze parts of our model during transfer of a pre-trained policy from the Kinova to the UR5 and Franka robots. In particular, the TAR3D, EE3D, and DISP prediction modules are unaffected by the change in visual appearance and morphology of the new robot and, thus, do not need to be retrained. Note, however, that we retrain TAR2D since partial occlusions by the new robot could lead to false positives for target objects. We have conducted further experiments with the same fine-tuning datasets and report their results in Fig. 11 (reported as "Ours-f"). In this setting, with a dataset of only 80 demonstrations, the partially frozen model produces a result of 60% and 72.5% when transferring to the Franka and UR5, respectively. This poses a substantial performance improvement of up to 18% in the case of transfer to the UR5 robot, while utilizing less data than fine-tuning the entire model. This result further underlines the gains in data-efficiency that can be achieved through the hierarchical modularity.
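The partially frozen variant ("Ours-f") can be realized by disabling gradients for the modules that are agnostic to the new robot's appearance and morphology. The snippet below is schematic and uses a hypothetical accessor for looking up modules by name.

def prepare_transfer(policy, frozen=("TAR3D", "EE3D", "DISP")):
    """Freeze robot-agnostic modules; fine-tune the rest (incl. TAR2D, EE2D, CTRL)."""
    for name, module in policy.named_modules_by_tag():   # hypothetical accessor
        if name in frozen:
            for p in module.parameters():
                p.requires_grad = False                   # exclude from fine-tuning
    trainable = [p for p in policy.parameters() if p.requires_grad]
    return trainable   # pass these parameters to the optimizer for fine-tuning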
5.3.2 Real-world transfer

Having demonstrated the ability of our approach to efficiently transfer policies between robots in simulation, we demonstrate that a policy can also be transferred to the real world (Sim2Real transfer) in a sample-efficient way. To this end, we first trained a policy for the UR5 robot in simulation utilizing the D_UR5 dataset and subsequently transferred it with a substantially smaller real-world dataset D_UR5^RW. More specifically, 260 demonstrations on the real robot are collected for transfer—this corresponds to about 1/6-th of the size of the original training set. The overall robot setup can be seen in Fig. 12. The scene is observed via an external RGB camera and robot actions are calculated in a closed-loop fashion by providing the current camera image and language instruction to the policy.

To investigate the contributions of our proposed methods, we conduct experiments under three different baseline settings. These include directly applying the simulated policy on the real robot, fine-tuning the simulated policy using the real-world dataset D_UR5^RW, and transferring the simulated policy.
Fig. 15 Robot trained to avoid all obstacles in the scene. On the way to the Coke can, the robot first avoids a basketball and then the green bottle.
We move the bottle in front of the robot to generate an instantaneous response
The policy for this task has been trained from a UR5 policy by utilizing the above dataset that introduces the novel task.

For this task, we have made modifications to the original hierarchies.
Fig. 17 Robot performing relational tasks with two objects involved in one command. In the top row the robot places an avocado to the left of a hamburger, while in the second row the robot places a donut to the right of a hamburger
Firstly, we introduce a LANG2 module to determine the referential object based on the language input. Besides that, we add TAR2D2 and TAR3D2 modules to identify the image patch corresponding to the second object and generate its 3D world coordinate, respectively. We also include a DISP2 module to calculate the displacement between the end-effector and the second object.
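Conceptually, the extension amounts to appending new entries to the module cascade and retraining from the point of modification. The sketch below is illustrative only; the module specifications and the add_module_spec interface are hypothetical, not the exact configuration used.

relational_modules = [
    # name,     query register source,   supervised attention target
    ("LANG2",  "new register",           "referential noun phrase tokens"),
    ("TAR2D2", "r_LANG2",                "image patch of the second object"),
    ("TAR3D2", "r_TAR2D2",               "3D position of the second object"),
    ("DISP2",  "r_EE3D and r_TAR3D2",    "displacement to the second object"),
]

def extend_hierarchy(policy, new_modules):
    """Append new modules to the cascade; earlier modules keep their weights."""
    for name, query_src, target in new_modules:
        policy.add_module_spec(name, query=query_src, attention_target=target)  # hypothetical API
    return policy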
In this scenario, the robot is directed by a verbal sentence to identify the first object, pick it up, recognize the second object, and then place the first object either to the left or right of the second object according to the command given.

The entire process is carried out in the MuJoCo environment, evaluated on 100 test trials. For comparison, we also train and evaluate the BC-Z model. Our model achieves a success rate of 76%, which is a 7% improvement over the BC-Z performance. Considering the increased complexity of this task compared to previous ones—due to the need to identify two objects from both the sentence and image, and the more extended manipulation steps required—a 76% success rate and a 7% improvement over the baseline are commendable results.

6 Discussion and limitations

The above experiments show a variety of benefits of the introduced modular approach. On one hand, it allows for new components and behaviors to be incorporated into an existing policy. This property is particularly appealing in robotics, since many popular robot control architectures are based on the concept of modular building-blocks, e.g., behavior-based robotics (Arkin, 1998) and the subsumption architecture (Brooks, 1986). Modularity also enables the user to employ modern verification and runtime monitoring tools to better understand and debug the decision-making of the system. At the same time, the overall system is still end-to-end differentiable and was shown in the above experiments to yield practical improvements in sample-efficiency, robustness and extensibility.

However, a major assumption made in our approach is that a human expert correctly identifies the logical flow of components and subtasks into which a task can be divided. This process requires organizing these subtasks into a hierarchical cascade. Early results indicate that an inadequate decomposition can hamper, rather than improve, learning. Furthermore, the approach does not incorporate memory and therefore cannot perform sequential actions. In a few cases we observed a failure to stop after finishing a manipulation: the robot continues with random actions. Another open question is the scalability of the approach. In our investigations, we looked at behaviors with a small number of sub-tasks. Is it possible to scale the approach to hierarchies with hundreds or thousands of nodes? The prospect is appealing, since this would bridge the divide between the expressiveness and plasticity of neural networks and the ability to create larger robot control systems which require the interplay of many subsystems.

For future work, we are particularly interested in using unsupervised and supervised attention side-by-side, i.e., several modules may be supervised by the human expert whereas other modules are adjusted in an unsupervised fashion. This would combine the best of both worlds, namely the ability to provide human structure and knowledge while at the same time maximally profiting from the network's plasticity. This is a particularly promising direction, since the ablation experiments indicate that having superfluous modules does not drastically alter the network performance. Further, we would like to investigate the potential of inferring a suitable hierarchy in a data-driven manner.

7 Conclusions

In this paper, we present a data-efficient approach for learning language-conditioned policies for robot manipulation tasks. We introduce a novel method called Hierarchical Modularity, and adopt supervised attention, to train a set of reusable sub-modules. This approach maintains the advantages of end-to-end learning while promoting the reusability of the learned sub-modules. As a result, we are able to customize the hierarchy according to specific task demands, or integrate new modules into an existing hierarchy for new tasks.
Table 9 The noun phrase template (columns: Object, Adj, Noun)
Table 11 Sentences collected from annotators for evaluation purposes (columns: Annotator labeled sentences, Success)
References

Abolghasemi, P., Mazaheri, A., Shah, M., et al. (2019). Pay attention!—Robustifying a deep visuomotor policy through task-focused visual attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4254–4262).
Ahn, M., Brohan, A., Brown, N., et al. (2022). Do as I can, not as I say: Grounding language in robotic affordances. arXiv:2204.01691
Alayrac, J. B., Donahue, J., Luc, P., et al. (2022). Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35, 23716–23736.
Anderson, P., Shrivastava, A., Parikh, D., et al. (2019). Chasing ghosts: Instruction following as Bayesian state tracking. In Advances in neural information processing systems (Vol. 32).
Antol, S., Agrawal, A., Lu, J., et al. (2015). VQA: Visual question answering. In Proceedings of the IEEE international conference on computer vision (pp. 2425–2433).
Argall, B. D., Chernova, S., Veloso, M., et al. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5), 469–483.
Arkin, R. (1998). Behavior-based robotics. The MIT Press.
Bengio, Y., Louradour, J., Collobert, R., et al. (2009). Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, ICML '09 (pp. 41–48). Association for Computing Machinery. https://doi.org/10.1145/1553374.1553380
Brooks, R. (1986). A robust layered control system for a mobile robot. IEEE Journal on Robotics and Automation, 2(1), 14–23. https://doi.org/10.1109/JRA.1986.1087032
Carion, N., Massa, F., Synnaeve, G., et al. (2020). End-to-end object detection with transformers. In European conference on computer vision (pp. 213–229). Springer.
Chen, Y. C., Li, L., Yu, L., et al. (2020). UNITER: Universal image-text representation learning. In Computer vision—ECCV 2020, Part XXX (pp. 104–120). Springer. https://doi.org/10.1007/978-3-030-58577-8_7
Coates, A., Abbeel, P., & Ng, A. Y. (2009). Apprenticeship learning for helicopter control. Communications of the ACM, 52(7), 97–105.
Csordás, R., van Steenkiste, S., & Schmidhuber, J. (2021). Are neural nets modular? Inspecting functional modularity through differentiable weight masks. In International conference on learning representations.
Das, A., Kottur, S., Gupta, K., et al. (2017). Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 326–335).
de Boer, P. T., Kroese, D. P., Mannor, S., et al. (2004). A tutorial on the cross-entropy method. Annals of Operations Research, 134, 19–67.
Dillmann, R., & Friedrich, H. (1996). Programming by demonstration: A machine learning approach to support skill acquisition for robots. In International conference on artificial intelligence and symbolic mathematical computing (pp. 87–108). Springer.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929
Duan, Y., Andrychowicz, M., Stadie, B., et al. (2017a). One-shot imitation learning. In Advances in neural information processing systems (Vol. 30). Curran Associates, Inc.
Duan, Y., Andrychowicz, M., Stadie, B., et al. (2017b). One-shot imitation learning. In Advances in neural information processing systems (Vol. 30).
Filan, D., Hod, S., Wild, C., et al. (2020). Neural networks are surprisingly modular. arXiv:2003.04881
Huang, W., Xia, F., Xiao, T., et al. (2022). Inner monologue: Embodied reasoning through planning with language models. arXiv:2207.05608
Jang, E., Irpan, A., Khansari, M., et al. (2022). BC-Z: Zero-shot task generalization with robotic imitation learning. In Conference on robot learning, PMLR (pp. 991–1002).
Johnson, J., Hariharan, B., Van Der Maaten, L., et al. (2017). CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2901–2910).
Kamath, A., Singh, M., LeCun, Y., et al. (2021). MDETR—Modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 1780–1790).
Khatib, O. (1986). The potential field approach and operational space formulation in robot control. In Adaptive and learning systems (pp. 367–377). Springer.
Kottur, S., Moura, J. M., Parikh, D., et al. (2018). Visual coreference resolution in visual dialog using neural module networks. In Proceedings of the European conference on computer vision (ECCV) (pp. 153–169).
Kuo, Y. L., Katz, B., & Barbu, A. (2020). Deep compositional robotic planners that follow natural language commands. In 2020 IEEE international conference on robotics and automation (ICRA) (pp. 4906–4912). IEEE.
Laina, I., Rupprecht, C., & Navab, N. (2019). Towards unsupervised image captioning with shared multimodal embeddings. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV).
Liu, L., Utiyama, M., Finch, A., et al. (2016). Neural machine translation with supervised attention. In Proceedings of COLING 2016 (pp. 3093–3102). The COLING 2016 Organizing Committee. https://aclanthology.org/C16-1291
Locatello, F., Weissenborn, D., Unterthiner, T., et al. (2020). Object-centric learning with slot attention. Advances in Neural Information Processing Systems, 33, 11525–11538.
Lu, J., Batra, D., Parikh, D., et al. (2019). ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in neural information processing systems (Vol. 32).
Lynch, C., & Sermanet, P. (2021). Language conditioned imitation learning over unstructured data. In Proceedings of robotics: Science and systems.
Maeda, G., Ewerton, M., Lioutikov, R., et al. (2014). Learning interaction for collaborative tasks with probabilistic movement primitives. In 2014 IEEE-RAS international conference on humanoid robots (pp. 527–534). IEEE.
Mees, O., Hermann, L., Rosete-Beas, E., et al. (2022). CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3), 7327–7334.
Nair, S., Rajeswaran, A., Kumar, V., et al. (2022). R3M: A universal visual representation for robot manipulation. arXiv:2203.12601
OpenAI. (2023). GPT-4 technical report. arXiv:2303.08774
Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155
Pettersson, O. (2005). Execution monitoring in robotics: A survey. Robotics and Autonomous Systems, 53(2), 73–88.
Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language supervision. arXiv:2103.00020
Rahmatizadeh, R., Abolghasemi, P., Bölöni, L., et al. (2018). Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. In 2018 IEEE international conference on robotics and automation (ICRA) (pp. 3758–3765). IEEE.
Ranftl, R., Lasinger, K., Hafner, D., et al. (2022). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), 1623–1637.
Reed, S., Zolna, K., Parisotto, E., et al. (2022). A generalist agent. arXiv:2205.06175
Rombach, R., Blattmann, A., Lorenz, D., et al. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684–10695).
Schaal, S. (1999). Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6), 233–242.
Schaal, S. (2006). Dynamic movement primitives—A framework for motor control in humans and humanoid robotics. In Adaptive motion of animals and machines (pp. 261–280). Springer.
Shridhar, M., Manuelli, L., & Fox, D. (2021). CLIPort: What and where pathways for robotic manipulation. arXiv:2109.12098
Singh, A., Hu, R., Goswami, V., et al. (2022). FLAVA: A foundational language and vision alignment model. arXiv:2112.04482
Sorkine, O., Cohen-Or, D., Lipman, Y., et al. (2004). Laplacian surface editing. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH symposium on geometry processing, SGP '04 (pp. 175–184). Association for Computing Machinery. https://doi.org/10.1145/1057432.1057456
Stepputtis, S., Campbell, J., Phielipp, M., et al. (2020). Language-conditioned imitation learning for robot manipulation tasks. arXiv:2010.12083
Tan, H., & Bansal, M. (2019). LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of EMNLP-IJCNLP 2019 (pp. 5099–5110). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1514
Todorov, E., Erez, T., & Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems (pp. 5026–5033). IEEE.
Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008).
Vemprala, S., Bonatti, R., Bucker, A., et al. (2023). ChatGPT for robotics: Design principles and model abilities.
Vinyals, O., Toshev, A., Bengio, S., et al. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Wang, Y., Mishra, S., Alipoormolabashi, P., et al. (2022). Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 conference on empirical methods in natural language processing (pp. 5085–5109).
Xie, F., Chowdhury, A., De Paolis Kaluza, M. C., et al. (2020). Deep imitation learning for bimanual robotic manipulation. In Advances in neural information processing systems (Vol. 33, pp. 2327–2337). Curran Associates, Inc. https://proceedings.neurips.cc/paper/2020/file/18a010d2a9813e91907ce88cd9143fdf-Paper.pdf
Xu, K., Ba, J., Kiros, R., et al. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, PMLR (pp. 2048–2057).
Yamaguchi, T., & Fainekos, G. (2021). PerceMon: Online monitoring for perception systems. In Runtime verification: 21st international conference, RV 2021 (p. 297). Springer.
Zhang, T., McCarthy, Z., Jow, O., et al. (2018a). Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE international conference on robotics and automation (ICRA) (pp. 5628–5635). https://doi.org/10.1109/ICRA.2018.8461249
Zhang, T., McCarthy, Z., Jow, O., et al. (2018b). Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE international conference on robotics and automation (ICRA) (pp. 5628–5635). IEEE.
Zhou, Y., Sonawani, S., Phielipp, M., et al. (2022). Modularity through attention: Efficient training and transfer of language-conditioned policies for robot manipulation. arXiv:2212.04573
Zhu, D., Chen, J., Shen, X., et al. (2023). MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592
Zhu, Y., Wong, J., Mandlekar, A., et al. (2020). robosuite: A modular simulation framework and benchmark for robot learning. arXiv:2009.12293
Zirr, T., & Ritschel, T. (2019). Distortion-free displacement mapping. Computer Graphics Forum. https://doi.org/10.1111/cgf.13760

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Yifan Zhou is a PhD student at Arizona State University, working in the Interactive Robotics Lab with Prof. Heni Ben Amor. His main focus is language-conditioned imitation learning for robot manipulation. Prior to that, he received his master's degree from Carnegie Mellon University and his bachelor's degree from Southwest Jiaotong University.

Shubham Sonawani is a PhD student in Electrical Engineering at Arizona State University. He embarked on his academic journey at the Interactive Robotics Lab in Fall 2018. Prior to his endeavors at ASU, Shubham earned his Bachelor of Technology in Electrical Engineering from VJTI, India. His PhD research focuses on the confluence of Human-Robot Interaction, Grasping, Manipulation, and Mixed Reality. The primary objective of his research is to enhance collaboration between humans and robots by leveraging a range of information modalities, including but not limited to projection mapping, computer vision, and natural language processing.