
IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 7, NO. 4, OCTOBER 2022

What Matters in Language Conditioned Robotic Imitation Learning Over Unstructured Data

Oier Mees, Graduate Student Member, IEEE, Lukas Hermann, and Wolfram Burgard, Fellow, IEEE

Abstract—A long-standing goal in robotics is to build robots that can perform a wide range of daily tasks from perceptions obtained with their onboard sensors and specified only via natural language. While recently substantial advances have been achieved in language-driven robotics by leveraging end-to-end learning from pixels, there is no clear and well-understood process for making various design choices due to the underlying variation in setups. In this letter, we conduct an extensive study of the most critical challenges in learning language conditioned policies from offline free-form imitation datasets. We further identify architectural and algorithmic techniques that improve performance, such as a hierarchical decomposition of the robot control learning, a multimodal transformer encoder, discrete latent plans and a self-supervised contrastive loss that aligns video and language representations. By combining the results of our investigation with our improved model components, we are able to present a novel approach that significantly outperforms the state of the art on the challenging language conditioned long-horizon robot manipulation CALVIN benchmark. We have open-sourced our implementation to facilitate future research in learning to perform many complex manipulation skills in a row specified with natural language.

Index Terms—Imitation learning, learning categories and concepts, machine learning for robot control.

Fig. 1. HULC learns a single 7-DoF language conditioned visuomotor policy from offline, unstructured data that can solve multi-stage, long-horizon robot manipulation tasks. We divide instruction following into learning global plans representing high-level behavior and a local policy conditioned on the plan and the instruction.
Manuscript received 24 February 2022; accepted 11 July 2022. Date of publication 3 August 2022; date of current version 29 August 2022. This letter was recommended for publication by Associate Editor Berthold Bauml and Editor Dana Kulic upon evaluation of the reviewers' comments. This work was supported by the German Federal Ministry of Education and Research under Contract 01IS18040B-OML. (Oier Mees and Lukas Hermann contributed equally to this work.) (Corresponding author: Oier Mees.)
Oier Mees and Lukas Hermann are with the University of Freiburg, 79085 Freiburg im Breisgau, Germany (e-mail: [email protected]; [email protected]).
Wolfram Burgard is with the Technical University of Nuremberg, 90489 Nürnberg, Germany (e-mail: [email protected]).
Codebase and trained models available at http://hulc.cs.uni-freiburg.de.
Digital Object Identifier 10.1109/LRA.2022.3196123

I. INTRODUCTION

ONE of the grand challenges in robotics is to create a generalist robot: a single agent capable of performing a wide variety of tasks in everyday settings based on arbitrary user commands. Doing so requires the robot to acquire a diverse repertoire of general-purpose skills and non-expert users to be able to effectively specify tasks for the robot to solve. This stands in contrast to most current end-to-end models, which typically learn individual tasks one at a time from manually-specified rewards and assume tasks are specified via goal images [1] or one-hot skill selectors [2], which are not practical for untrained users to instruct robots. Not only is this inefficient, but it also limits the versatility and adaptivity of the systems that can be built. How can we design learning systems that can efficiently acquire a diverse repertoire of useful skills that allows them to solve many different tasks based on arbitrary user commands?

To address this problem, we must resolve two questions. 1) How can untrained users direct the robot to perform specific tasks? Natural language presents a promising alternative form of specification, providing an intuitive and flexible way for humans to communicate tasks and refer to abstract concepts. However, learning to follow language instructions involves addressing a difficult symbol grounding problem [3], relating a language instruction to a robot's onboard perception and actions. 2) How can the robot efficiently learn general-purpose skills from offline data, without hand-specified rewards? A simple and versatile choice is to define skills as being continuous instead of discrete, endowing the agent with task-agnostic control: the ability to reach any reachable goal state from any current state [4]. These forms of task specification can in principle enable a robot to solve multi-stage tasks by following several language instructions in a row.

Recent advances have been made at learning language conditioned policies for continuous visuomotor control in 3D environments via imitation learning [5]–[8] or reinforcement learning [9], [10]. These approaches typically require offline data sources of robotic interaction together with post-hoc crowd-sourced natural language labels. Although all methods share the basic idea of leveraging instructions that are grounded in
the agent's high-dimensional observation space, their details vary greatly. Moreover, evaluating published methods and their components in language conditioned policy learning is difficult due to incomparable setups or subjective task definitions. In this work we systematically compare, improve, and integrate key components by leveraging the recently proposed CALVIN benchmark [11] to further our understanding and provide a unified framework for long-horizon language conditioned policy learning. We build upon relabeled imitation learning [12] to distill many reusable behaviors into a goal-directed policy, as visualized in Fig. 1. Our approach consists of only standard supervised learning subroutines, and learns perceptual and linguistic understanding, together with task-agnostic control, end-to-end as a single neural network. Our contributions are:
• We systematically compare key components of language conditioned imitation learning over unstructured data, such as observation and action spaces, losses for aligning visuo-lingual representations, language models and latent plan representations, and we analyze the effect of other choices, such as data augmentation and optimization.
• We propose four improvements to these key components: a multimodal transformer encoder to learn to recognize and organize behaviors during robotic interaction into a global categorical latent plan, a hierarchical division of the robot control learning that learns local policies in the gripper camera frame conditioned on the global plan, balancing terms within the KL loss, and a self-supervised contrastive visual-language alignment loss.
• We integrate the best performing improved components in a unified framework, Hierarchical Universal Language Conditioned Policies (HULC). Our model sets a new state of the art on the challenging CALVIN benchmark [11], learning a single 7-DoF policy that can perform long-horizon manipulation tasks in a 3D environment, directly from images, and specified only with natural language.
II. RELATED WORK

Natural language processing has recently received much attention in the field of robotics [13], following the advances made towards learning groundings between vision and language [14], [15] and grounding behaviors in language [16]. Early works have approached instruction following by designing interactive fetching systems to localize objects mentioned in referring expressions [17], [18] or by grounding not only objects, but also spatial relations to follow language expressions characterizing pick-and-place commands [19]–[21]. Unlike these approaches, we directly learn robotic control from images and natural language instructions, and do not assume any predefined motion primitives.

More recently, end-to-end deep learning has been used to condition agents on natural language instructions [5]–[10], which are then trained under an imitation or reinforcement learning objective. These works have pushed the state of the art and generated a range of ideas for language conditioned policy learning, such as losses for aligning visual observations and language instructions. However, each work evaluates a different combination of ideas and uses different setups or task definitions, making it unclear how individual ideas compare to each other and which ideas combine well together. For example, BC-Z and MIA [7], [8] both use behavior cloning, but different action spaces and multi-modal alignment losses, such as regressing the language embedding from visual observations [7] or cross-modality matching [8]. Moreover, BC-Z leverages expert trajectories and task labels, and MIA includes mobile navigation, making them difficult to implement directly in CALVIN, which contains unlabeled play data on different tabletop environments. Nair et al. [9] learn a reward classifier which predicts if a change in state completes a language instruction and leverage it for offline multi-task RL given four camera views. Similar to BC-Z, they rely on discrete task labels and do not focus on solving long-horizon language-specified tasks. Most related to our approach is multi-context imitation learning (MCIL) [5], which also uses relabeled imitation learning to distill reusable behaviors into a goal-reaching policy. Besides different action and observation spaces, these works leverage different language models to encode the raw text instructions into a semantic pre-trained vector space, making it difficult to analyze which language models are best suited for language conditioned policy learning. The ablation studies presented in these papers show that each novel contribution of each work does indeed improve the performance of their model, but due to incomparable setups and evaluation protocols, it is difficult to assess what matters for language conditioned policy learning. Our work addresses this problem by systematically comparing and combining different observation and action spaces, auxiliary losses and latent representations, and integrating the best performing components in a unified framework.

III. PROBLEM FORMULATION AND METHOD OVERVIEW

We consider the problem of learning a goal-conditioned policy π_θ(a_t | s_t, l) that outputs actions a_t ∈ A, conditioned on the current state s_t ∈ S and a free-form language instruction l ∈ L, under environment dynamics T : S × A → S. We note that the agent does not have access to the true state of the environment, but only to visual observations. In CALVIN [11] the action space A consists of the 7-DoF control of a Franka Emika Panda robot arm with a parallel gripper.

We model the interactive agent with a general-purpose goal-reaching policy based on multi-context imitation learning (MCIL) from play data [5]. To learn from unstructured "play" we assume access to an unsegmented teleoperated play dataset D of semantically meaningful behaviors provided by users, without a set of predefined tasks in mind. To learn control, this long temporal state-action stream D = {(s_t, a_t)}_{t=0}^{∞} is relabeled [12], treating each visited state in the dataset as a "reached goal state," with the preceding states and actions treated as optimal behavior for reaching that goal. Relabeling yields a dataset D_play = {(τ, s_g)_i}_{i=0}^{|D_play|}, where each goal state s_g has a trajectory demonstration τ = {(s_0, a_0), . . .} solving for the goal. These short horizon, goal image conditioned demonstrations can be fed to a simple maximum likelihood goal conditioned imitation
objective:

\mathcal{L}_{LfP} = \mathbb{E}_{(\tau, s_g) \sim D_{play}} \Big[ \sum_{t=0}^{|\tau|} \log \pi_\theta(a_t \mid s_t, s_g) \Big] \quad (1)

to learn a goal-reaching policy π_θ(a_t | s_t, s_g).
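To make the relabeling step and the objective in (1) concrete, the following is a minimal PyTorch-style sketch, assuming the play stream is stored as tensors and that the policy returns a torch.distributions object; the helper names (relabel_window, lfp_loss) are illustrative and not taken from the released codebase.

```python
import torch

def relabel_window(states, actions, start, end):
    """Hindsight relabeling of an unsegmented play stream (illustrative).

    states:  (N, ...) tensor of visited observations
    actions: (N, action_dim) tensor of executed actions
    start, end: indices of a randomly sampled window

    The last observation of the window is treated as the "reached goal state"
    s_g, and the preceding state-action pairs as an optimal demonstration tau.
    """
    tau_states = states[start:end]
    tau_actions = actions[start:end]
    goal = states[end - 1]
    return tau_states, tau_actions, goal

def lfp_loss(policy, tau_states, tau_actions, goal):
    """Maximum-likelihood goal-conditioned imitation objective, Eq. (1)."""
    log_probs = [policy(s_t, goal).log_prob(a_t).sum()
                 for s_t, a_t in zip(tau_states, tau_actions)]
    return -torch.stack(log_probs).mean()  # minimize the negative log-likelihood
```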
We address the inherent multi-modality in free-form imitation datasets by auto-encoding contextual demonstrations through a latent "plan" space with a sequence-to-sequence conditional variational auto-encoder (seq2seq CVAE) [1]. Conditioning the policy on the latent plan frees up the policy to use the entirety of its capacity for learning uni-modal behavior. To generate latent plans z we make use of the variational inference framework [22]. The objective of the latent plan sampler is to model the full distribution over all high-level behaviors that might connect the current and goal state, to provide multi-modal plans at inference time. This distribution is learned with a CVAE by maximizing the marginal log likelihood of the observed behaviors in the dataset, log p(x | s), where x are sampled state-action trajectories from τ. The Evidence Lower Bound (ELBO) [22] for the CVAE can be written as:

\log p(x \mid s) \ge -\mathrm{KL}\big(q(z \mid x, s) \,\|\, p(z \mid s)\big) + \mathbb{E}_{q(z \mid x, s)}\big[\log p(x \mid z, s)\big] \quad (2)

The decoder is a policy trained to reconstruct input actions, conditioned on state s_t, goal s_g, and an inferred plan z for how to get from s_t to s_g. At test time, it takes a goal as input, and infers and follows plan z in closed-loop.
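As an illustration of the ELBO in (2), here is a schematic seq2seq CVAE over latent plans with a Gaussian plan space, as used by MCIL; the module interfaces (posterior_enc, prior_net, policy_decoder) are hypothetical placeholders rather than the actual HULC implementation.

```python
import torch
import torch.nn as nn

class PlanCVAE(nn.Module):
    """Schematic CVAE over latent plans (Gaussian variant, Eq. 2)."""

    def __init__(self, posterior_enc, prior_net, policy_decoder):
        super().__init__()
        self.posterior = posterior_enc   # q(z | x, s): sees the whole state-action sequence
        self.prior = prior_net           # p(z | s):    sees only current state and goal
        self.decoder = policy_decoder    # p(x | z, s): the plan-conditioned policy

    def elbo(self, seq_states, seq_actions, s0, goal, beta=0.01):
        q = self.posterior(seq_states, seq_actions)   # torch.distributions.Normal
        p = self.prior(s0, goal)                      # torch.distributions.Normal
        z = q.rsample()                               # reparameterized plan sample
        recon = self.decoder.log_prob(seq_actions, z, seq_states).mean()
        kl = torch.distributions.kl_divergence(q, p).sum(-1).mean()
        return recon - beta * kl                      # maximize the lower bound
```

At test time only the prior and decoder are used: a plan is sampled from p(z | s) given the current observation and the goal, and the decoder follows it in closed loop.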
However, when learning language conditioned policies π_θ(a_t | s_t, l) it is not possible to relabel any visited state s to a natural language goal, as the goal space is no longer equivalent to the observation space. Lynch et al. [5] showed that pairing a small number of random windows with after-the-fact language instructions enables learning a single language conditioned visuomotor policy that can perform a wide variety of robotic manipulation tasks. The key insight here is that training a single imitation learning policy on either goal image or language goals allows for learning control mostly from unlabeled play data and reduces the burden of language annotation to less than 1% of the total data. Concretely, given multiple contextual imitation datasets D = {D_0, D_1, . . . , D_K}, each with a different way of describing tasks, MCIL trains a single latent goal conditioned policy π_θ(a_t | s_t, z) over all datasets simultaneously, as well as one parameterized encoder per dataset.

IV. KEY COMPONENTS OF LANGUAGE CONDITIONED IMITATION LEARNING OVER UNSTRUCTURED DATA

This section compares and improves key components of language conditioned imitation learning over unstructured data. We base our model on MCIL [5] and improve it by decomposing control into a hierarchical approach of generating global plans with a static camera and learning local policies with a gripper camera conditioned on the plan. Then we go through different components that have a large impact on performance: architectures to encode sequences in relabeled imitation learning, the representation of the latent distributions, how to best align language and visual representations, data augmentation and optimization. We visualize the full architecture in Fig. 2.

A. Observation and Action Spaces

How to best represent motion skills is an age-old question in robotics. From a learning perspective, generating the action sequences to solve diverse manipulation tasks with a single network from high-dimensional observations is challenging, because the distribution is multi-modal, discontinuous and imbalanced. For these reasons, finding an efficient representation is crucial to perform this non-trivial reasoning using learning-based methods. MCIL [5] uses global actions learned from a single static RGB camera. We observe that predicting 7-DoF global actions leads to the network primarily solving static element tasks, such as pushing a button, but failing to generalize to dynamic tasks, such as manipulating colored blocks. To alleviate this problem, we propose generating global plans that correspond to reusable common behavior b seen in the play data, but learning local policies conditioned on the plan. This results in a hierarchical approach that frees the network from having to memorize all locations in the scene where the behaviors were performed. Concretely, we encode RGB images from both the static and a gripper camera to learn a compact representation of all the different high-level plans that take an agent from a current state to a goal state, learning p(b | s_t, s_g). Inspired by a recent line of work that aims to learn hierarchies of controllers based on static and gripper cameras [23], we use the encoded gripper camera representations in the policy network, together with the global contextualized latent plan, and perform control in the gripper frame with relative actions for efficient robot control learning. The action space consists of delta XYZ position, delta Euler angles and the gripper action. Our proposed formulation has several advantages: a) local policies based on the gripper camera generalize better to different locations of the objects to be manipulated; b) the policy has a prior in the form of a global contextualized latent plan, but is free to discover the exact strategy on how to interact with the objects.
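The sketch below shows one plausible way to turn consecutive end-effector poses from the play stream into relative 7-DoF actions (delta XYZ, delta Euler angles, gripper); it relies on SciPy's Rotation class and does not claim to reproduce the exact frame, normalization, or clipping conventions of CALVIN or the authors' code.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def relative_action(tcp_pos, tcp_orn, next_pos, next_orn, gripper):
    """Illustrative relative action between two consecutive gripper poses.

    tcp_pos, next_pos: (3,) end-effector positions at t and t+1
    tcp_orn, next_orn: (3,) end-effector Euler angles (xyz) at t and t+1
    gripper:           +1 open / -1 close
    """
    rot_t = R.from_euler("xyz", tcp_orn)
    delta_pos = rot_t.inv().apply(next_pos - tcp_pos)          # expressed in the gripper frame
    delta_orn = (rot_t.inv() * R.from_euler("xyz", next_orn)).as_euler("xyz")
    return np.concatenate([delta_pos, delta_orn, [gripper]])   # 7-DoF relative action
```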
B. Latent Plan Encoding

A challenge in self-supervising control on top of free-form imitation data is that, in general, there are many valid high-level behaviors that might connect the same (s_t, s_g) pairs. By auto-encoding contextual demonstrations through a latent "plan" space with a sequence-to-sequence conditional variational auto-encoder (seq2seq CVAE) [1], we can learn to recognize which region of the latent plan space an observation-action sequence belongs to. Critically, conditioning the policy on the latent plan frees up the policy to use the entirety of its capacity for learning uni-modal behavior. Thus, learning to generate and represent high-quality latent plans is a key component in the seq2seq CVAE framework. MCIL [5] uses bidirectional recurrent neural networks (RNN) to encode a randomly sampled play sequence and map it into a latent Gaussian distribution. In contrast, we leverage a multimodal transformer encoder [24] to build a contextualized representation of abstract behavior expressed in language instructions and map it into a vector of several latent categorical variables [25].


Fig. 2. Overview of our architecture to learn language conditioned policies from unstructured data. First the language instructions and the visual observations are
encoded. During training a multimodal transformer encodes sequences of observations to learn to recognize and organize high-level behaviors through a posterior.
Its temporally contextualized features are provided as input to a contrastive visuo-lingual alignment loss. The plan sampler network receives the initial state and
the latent language goal and predicts the distribution over plans for achieving the goal. Both prior and posterior distributions are predicted as a vector of multiple
categorical variables and are trained by minimizing their KL divergence. The local policy network receives the latent language instruction, the gripper camera
observation and the global latent plan to generate a sequence of relative actions in the gripper camera frame to achieve the goal.

The foundation of the Transformer architecture is the scaled dot-product attention function, which enables elements in a sequence to attend to other elements. The attention function receives as input a sequence {x_1, . . ., x_n} and outputs a sequence {y_1, . . ., y_n}. Each input x_i is linearly projected to a query q_i, key k_i, and value v_i. To compute the output y_i, the values are summed with weights that take into account the similarity of the query with its corresponding key. The attention function is defined as Attention(Q, K, V) = softmax(QK^T / √d_k) V, where d_k is the dimension of the keys and queries. The queries, keys, and values are stacked together into matrices Q ∈ R^{n×d_model}, K ∈ R^{n×d_model}, and V ∈ R^{n×d_model}. We encode the sequence of visual observations of both modalities X_{static,gripper} ∈ R^{T×H×W×3} with separate perceptual encoders, and concatenate them to form the fused perceptual representation V ∈ R^{T×d} of the sampled demonstration, where T represents the sequence length and d the feature dimension. To enable the sequences to carry temporal information, we add positional embeddings [24] and feed the result into the multimodal transformer to learn temporally contextualized global video representations. Finally, inspired by the recent line of work that looks into learning discrete instead of continuous latent codes [25], [26], we represent the latent plans as a vector of multiple categorical latent variables and optimize them using straight-through gradients [27]. Learning discrete representations in the context of language conditioned policies is a natural fit, as language is inherently discrete and images can often be described concisely by language [19]. Furthermore, discrete representations are a natural fit for complex reasoning, planning and predictive learning (e.g., if it is sunny, I will go to the beach).
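A compact sketch of such a posterior follows: a standard PyTorch transformer encoder over the fused static and gripper camera features, followed by 32 categorical variables with 32 classes each, sampled with a straight-through estimator. The pooling choice and module names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CategoricalPlanEncoder(nn.Module):
    """Sketch of a transformer posterior producing a discrete latent plan."""

    def __init__(self, feat_dim=256, n_vars=32, n_classes=32, n_layers=2, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           dim_feedforward=2048, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_logits = nn.Linear(feat_dim, n_vars * n_classes)
        self.n_vars, self.n_classes = n_vars, n_classes

    def forward(self, fused_seq, pos_emb):
        # fused_seq: (B, T, feat_dim) concatenated static + gripper camera features
        h = self.transformer(fused_seq + pos_emb)        # temporally contextualized
        logits = self.to_logits(h[:, -1])                # pool the sequence (one choice)
        logits = logits.view(-1, self.n_vars, self.n_classes)
        probs = F.softmax(logits, dim=-1)
        # Straight-through estimator: hard one-hot sample in the forward pass,
        # softmax gradients in the backward pass.
        index = torch.distributions.Categorical(probs=probs).sample()
        one_hot = F.one_hot(index, self.n_classes).float()
        plan = one_hot + probs - probs.detach()
        return plan.flatten(1), logits                   # (B, n_vars * n_classes)
```

The expression one_hot + probs - probs.detach() keeps the discrete sample in the forward pass while letting gradients flow through the softmax probabilities, which is the straight-through trick of [27].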
C. Semantic Alignment of Video and Language

Learning to follow language instructions involves addressing a difficult symbol grounding problem [3], relating a language instruction to a robot's onboard perception and actions. Although instructions and visual observations are aligned in CALVIN, learning to manipulate the colored blocks is a challenging problem. This is due to the fact that the robot needs to learn a wide variety of diverse behaviors to manipulate the blocks, but also needs to understand which colored block the user is referring to. Thus, the block related instructions are very similar, with the exception of a word that might disambiguate the instruction by indicating a color. Therefore, most pre-trained language models struggle to learn such semantics from text only, and the policy needs to learn referring expression comprehension via the imitation loss. There have been a number of multi-modal alignment losses proposed, such as regressing the language embedding from the visual observation [7] or cross-modality matching [8]. We maximize the cosine similarity between the visual features of sequence i and the corresponding language features while, at the same time, minimizing the cosine similarity between the current visual features and other language instructions in the same batch. We define our L_contrast loss the same way as the contrastive loss for pairing images and captions in CLIP [15]. However, ideally our model should use the time-dependent representation of the sequence's visual observations in order to capture the meaning of a language instruction, which can be appreciated only after the sequence of actions has been executed for several timesteps. The usage of in-batch negatives enables re-use of computation both in the forward and the backward pass, making training highly efficient.
The logits for one batch form an M × M matrix, where each entry is given by logit(x_i, y_j) = cos_sim(x_i, y_j) · exp(τ), ∀(i, j), i, j ∈ {1, 2, . . . , M}, where τ is a trainable temperature parameter. Only entries on the diagonal of the matrix are considered positive examples. The final loss is the sum of the cross entropy losses along the row and the column directions.
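A minimal sketch of this symmetric, CLIP-style loss is given below, assuming precomputed video and language embeddings of shape (M, d) and a learnable log-temperature; the function name and reduction details are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb, lang_emb, log_temp):
    """CLIP-style alignment loss over a batch of M (video, instruction) pairs."""
    video_emb = F.normalize(video_emb, dim=-1)
    lang_emb = F.normalize(lang_emb, dim=-1)
    logits = video_emb @ lang_emb.t() * log_temp.exp()   # M x M scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Diagonal entries are the positives; rows and columns give the two
    # cross-entropy terms, summed as described above.
    return F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)
```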
D. Action Decoder

A challenge in learning control from free-form imitation data, in which different ways of executing the same skill are shown, is that a standard unimodal predictor, such as a Gaussian distribution, will average out dissimilar motions. To address this multimodality, we follow the solution proposed by Lynch et al. [1] of discretizing the action space and then parameterizing the policy as a discretized logistic mixture distribution [28], [37]. Each of the predicted k logistic distributions has a separate mean and scale, and they are weighted with α to form the mixture distribution. The imitation loss is the negative log-likelihood of this distribution:

\mathcal{L}_{act}(D_{play}, V) = -\ln \Big( \sum_{i=1}^{k} \alpha_i(V_t) \, P\big(a_t \mid \mu_i(V_t), \sigma_i(V_t)\big) \Big)

where P(a_t | μ_i(V_t), σ_i(V_t)) = F((a_t + 0.5 − μ_i(V_t)) / σ_i(V_t)) − F((a_t − 0.5 − μ_i(V_t)) / σ_i(V_t)) and F(·) is the logistic CDF. Additionally, we use a cross-entropy loss to model the binary gripper open/close action.
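The sketch below implements a discretized logistic mixture negative log-likelihood consistent with the formula above, assuming actions rescaled to [-1, 1] and 256 bins; handling of the edge bins is omitted for brevity and the helper name is hypothetical.

```python
import torch
import torch.nn.functional as F

def dlml_nll(actions, logit_weights, means, log_scales, num_bins=256):
    """Negative log-likelihood of a discretized logistic mixture (sketch).

    actions:       (B, D) continuous actions rescaled to [-1, 1]
    logit_weights: (B, D, K) unnormalized mixture weights alpha
    means:         (B, D, K) per-component means mu
    log_scales:    (B, D, K) per-component log scales log(sigma)
    """
    centered = actions.unsqueeze(-1) - means
    inv_scale = torch.exp(-log_scales)
    half_bin = 1.0 / (num_bins - 1)                               # half width of one bin
    cdf_plus = torch.sigmoid(inv_scale * (centered + half_bin))   # F((a + h - mu) / sigma)
    cdf_minus = torch.sigmoid(inv_scale * (centered - half_bin))  # F((a - h - mu) / sigma)
    log_prob = torch.log((cdf_plus - cdf_minus).clamp(min=1e-12))
    log_prob = log_prob + F.log_softmax(logit_weights, dim=-1)    # weight each component
    return -torch.logsumexp(log_prob, dim=-1).mean()
```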
E. Optimization and Implementation Details

Our full training objective for the 1% of the total data that is annotated with after-the-fact language instructions is given by L = L_act + β L_KL + λ L_contrast. The windows without annotations are trained with the same imitation learning objective, but the language goals are replaced by the last visual frame of the sampled window, to learn control in a fully self-supervised manner. A common problem in training VAEs is finding the right balance in the weight of the KL loss. A high β value can result in an over-regularized model in which the decoder ignores the latent plans from the prior, also known as "posterior collapse" [29]. On the other hand, setting β too low results in the plan sampler network being unable to catch up to the latent plan space created by the posterior, and as a result, at test time the plans generated by the plan sampler network will be unfamiliar inputs for the decoder. Orthogonal to this, as the KL loss is bidirectional, we want to avoid regularizing the plans generated by the posterior toward a poorly trained prior. To solve this problem, we minimize the KL loss faster with respect to the prior than the posterior by using different learning rates, α = 0.8 for the prior and 1 − α for the posterior, similar to Hafner et al. [25].
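A common way to implement this asymmetry is KL balancing with stop-gradients, sketched below for the categorical plan distributions; whether the released code uses exactly this two-term form is an assumption, but it matches the α-weighted scheme of Hafner et al. [25] referenced above.

```python
import torch
from torch.distributions import Categorical, kl_divergence

def balanced_kl(posterior_logits, prior_logits, alpha=0.8):
    """KL balancing between categorical posterior q and prior p (sketch).

    Logits have shape (B, n_vars, n_classes), e.g. 32 categoricals with 32 classes.
    The term weighted by alpha only updates the prior (posterior detached);
    the term weighted by (1 - alpha) only updates the posterior.
    """
    q, p = Categorical(logits=posterior_logits), Categorical(logits=prior_logits)
    q_sg = Categorical(logits=posterior_logits.detach())
    p_sg = Categorical(logits=prior_logits.detach())
    kl = alpha * kl_divergence(q_sg, p) + (1 - alpha) * kl_divergence(q, p_sg)
    return kl.sum(dim=-1).mean()   # sum over latent variables, average over the batch
```

This KL term is then weighted by β in the full training objective stated above.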
We set β = 0.01 and λ = 3 for all experiments and train with the Adam optimizer with a learning rate of 2e-4. During training, we randomly sample windows of length 20 to 32 and pad them to the maximum length of 32. For the latent plan representation we use 32 categoricals with 32 classes each. To better compare the differences between approaches, we use the same convolutional encoders as the MCIL baseline available in CALVIN for processing the images of the static and gripper camera. Our multimodal transformer encoder has 2 blocks, 8 self-attention heads, and a hidden size of 2048. In order to encode raw text into a semantic pre-trained vector space, we leverage the paraphrase-MiniLM-L3-v2 model [30], which distills a large Transformer based language model and is trained on paraphrase language corpora mainly derived from Wikipedia. It has a vocabulary size of 30,522 words and maps a sentence of any length into a vector of size 384.

F. Data Augmentation

To aid learning, we apply data augmentation to image observations, both in our method and across all baselines. During training, we apply stochastic image shifts of 0-4 pixels to the gripper camera images and of 0-10 pixels to the static camera images, as in Yarats et al. [31]. Additionally, a bilinear interpolation is applied on top of the shifted image, replacing each pixel with the average of the nearest pixels.
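A sketch in the spirit of the random-shift augmentation of Yarats et al. [31] follows; it pads with replicated borders and resamples the image with a bilinear grid, assumes square images, and is not copied from the authors' training code.

```python
import torch
import torch.nn.functional as F

def random_shift(imgs, pad):
    """Stochastic image shift with bilinear resampling (sketch, square images).

    imgs: (B, C, H, W) float tensor; pad: maximum shift in pixels (e.g. 4 or 10).
    """
    b, c, h, w = imgs.shape
    padded = F.pad(imgs, [pad] * 4, mode="replicate")
    eps = 1.0 / (h + 2 * pad)
    coords = torch.linspace(-1.0 + eps, 1.0 - eps, h + 2 * pad, device=imgs.device)[:h]
    coords = coords.unsqueeze(0).repeat(h, 1).unsqueeze(2)
    base_grid = torch.cat([coords, coords.transpose(1, 0)], dim=2)      # (H, W, 2)
    base_grid = base_grid.unsqueeze(0).repeat(b, 1, 1, 1)               # (B, H, W, 2)
    shift = torch.randint(0, 2 * pad + 1, (b, 1, 1, 2), device=imgs.device).float()
    shift *= 2.0 / (h + 2 * pad)
    return F.grid_sample(padded, base_grid + shift, padding_mode="zeros",
                         align_corners=False)
```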
V. EXPERIMENTS

We evaluate our model in an extensive comparison and ablation study to determine which components matter for language conditioned imitation learning over unstructured data. We ablate single components of our full approach to study the influence of each component. We then compare our resulting model to the best published methods on the CALVIN benchmark, and show that it outperforms all previous methods.
rates, α = 0.8 for the prior and 1 − α for the posterior, similar to
might contain suboptimal behavior. To simulate a real-world
Hafner et al. [25]. We set β = 0.01 and λ = 3 for all experiments
scenario, only 1% of that data contains crow-sourced language
and train with the Adam optimizer with a learning rate of 2−4 .
annotations.
During training, we randomly sample windows between length
20 and 32 and pad them until the max length of 32. For the latent
plan representation we use 32 categoricals with 32 classes each. B. Results and Ablations of Key Components
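Schematically, the protocol can be pictured as the loop below; the environment and policy interfaces (env.step returning an info dict with a task-success flag, policy.step) are hypothetical stand-ins for the CALVIN API, and the per-subtask step budget is only an example.

```python
def evaluate_chains(env, policy, instruction_chains, max_steps=360):
    """Sketch of the long-horizon evaluation: advance to the next instruction
    only if the current subtask succeeds; report the average chain length."""
    solved_lengths = []
    for chain in instruction_chains:          # e.g. 1000 chains of 5 instructions
        obs = env.reset()                     # robot starts from the neutral pose
        solved = 0
        for instruction in chain:
            success = False
            for _ in range(max_steps):
                action = policy.step(obs, instruction)
                obs, _, _, info = env.step(action)
                if info.get("task_success", False):
                    success = True
                    break
            if not success:
                break                         # the chain ends at the first failure
            solved += 1
        solved_lengths.append(solved)
    return sum(solved_lengths) / len(solved_lengths)   # average sequence length
```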
B. Results and Ablations of Key Components

Observation and Action Spaces: We compare our approach of dividing the robot control learning into generating global contextualized plans and conditioning a local policy that receives only the observations of the gripper camera on the global plan, against a "No Local Policy" baseline.


Fig. 3. Performance of our model on the D environment of the CALVIN Challenge and ablation of the key components, across three seeded runs. All models receive RGB images from both a static and a gripper camera as input.

Unlike our approach, which performs control in the gripper camera frame, the baseline's policy receives both cameras' images and performs control in the robot's base frame, as is usual in most published approaches. We observe in Fig. 3 that, despite the baseline's decoder having more perceptual information, the performance for completing five chains of language instructions sequentially drops from 28.3% to 20.1%. In order to analyze the big performance difference with respect to the original MCIL baseline, we train an MCIL baseline with relative actions and observe that its performance improves significantly over the original MCIL baseline with absolute actions, but remains worse than our models. We speculate that relative actions with a local policy are easier for the agent to learn than memorizing all the locations where interactions have been performed with global actions and a global observation space. By decoupling the control into a hierarchical structure, we show that performance increases significantly. Additionally, we analyze the influence of using the 7-DoF proprioceptive information as input for both the plan encodings and conditioning the policy, as many works report improved performance from it [1], [2], [5]. We observe that the performance drops significantly and the agent relies too much on the robot's initial position, rather than learning to disentangle initial states and tasks. We hypothesize this might be due to a causal confusion between the proprioceptive information and the target actions [35]. We also analyze the effect of modeling the full action space, including the binary gripper action dimension, with the mixture of logistics distribution instead of using the log loss for the open/close gripper action, and observe that the average sequence length drops from 2.64 to 2.45. Finally, we note that applying stochastic image shifts to the input images increases the performance significantly.

Latent Plan Encoding: In our CVAE framework, the latent plan represents valid ways of connecting the current state and the goal state and thus frees up the policy to use the entirety of its capacity for learning uni-modal behavior. As language is inherently discrete and discrete representations are a natural fit for complex reasoning and planning, we represent latent plans as a vector of multiple categorical latent variables and optimize them using straight-through gradients [27]. We observe that the performance for the 5-chain evaluation drops from 28.3% to 23.6% when we train our model with a diagonal Gaussian distribution as in MCIL. While it is difficult to judge why categorical latents work better than continuous latent variables, we hypothesize that categorical latents could be a better inductive bias for non-smooth aspects of the CALVIN benchmark, such as when a block is hidden behind the sliding door. Besides, the sparsity level enforced by a categorical distribution could be beneficial for generalization. Additionally, we compare against a goal-conditioned behavior cloning (GCBC) baseline [1] which does not condition the policy on a latent plan, and observe that it performs worse than MCIL with relative actions, highlighting the importance of modeling latent behaviors in free-form imitation datasets. We also observe that balancing the KL loss is beneficial in the CVAE training. By scaling up the prior cross entropy relative to the posterior entropy, the agent is encouraged to minimize the KL loss by improving its prior toward the more informed posterior, as opposed to reducing the KL by increasing the posterior entropy. We visualize a t-SNE plot of our learned discrete latent space in Fig. 5 and see that even for unseen language instructions it appears to organize the latent space functionally.


Fig. 4. Performance of our model on the multi environment splits of the CALVIN Challenge across three seeded runs.

Additionally, we report degraded performance for an over-regularized model which learns to ignore the latent plans, in which we weight the KL divergence with β = 0.1. Finally, we evaluate replacing the transformer encoder in the posterior with a bidirectional GRU recurrent network of the same hidden dimension of 2048, similar to MCIL. The results suggest that besides improved performance, the multimodal transformer encoder is significantly more efficient in terms of both memory and model size (5.9 M vs 106 M parameters for the posterior network) and overall training wall clock time. For comparison, with the transformer encoder our full approach contains 47.1 M trainable parameters.

Fig. 5. t-SNE visualization of the discrete latent plans generated by embedding randomly selected unseen language annotations. Surprisingly, we find that despite not being trained explicitly with task labels, HULC appears to organize its latent plan space functionally. We visualize functionally similar skills with the same color, but use different shapes to distinguish sub-skills.

Semantic Alignment of Video and Language: One of the main challenges for language conditioned continuous visuomotor control is solving a difficult symbol grounding problem [3], relating a language instruction to a robot's onboard perception and actions. An agent in CALVIN needs to learn a wide variety of diverse behaviors to manipulate blocks with different shapes, but also needs to understand which colored block the user is instructing it to manipulate. We compare commonly used auxiliary losses for aligning visual and language representations. Concretely, we compare our contrastive loss against predicting the language embedding from the sequence's visual observations with a cosine loss [7], cross-modality matching [8], and not having an auxiliary visuo-lingual alignment loss. We observe that using an auxiliary loss to semantically align the sampled video sequences and the language instructions helps, but both baselines perform similarly. We hypothesize that our contrastive loss works best because it leverages a larger number of in-batch negatives than the cross-modality matching loss. Concretely, we maximize the cosine similarity for the N real pairs in the batch while minimizing the cosine similarity of the multimodal embeddings of the N² − N incorrect pairings. The cross-modality matching loss implements a discriminator that produces a binary prediction of whether the embeddings match or not; the batch is shuffled only once to produce the negative samples, contrasting only N negative samples.

Language Models: Despite steady progress in language conditioned policy learning, a fundamental but less considered aspect is the choice of the pre-trained language model used to encode raw text into a semantic pre-trained vector space. We compare the lightweight paraphrase-MiniLM-L3-v2 language embeddings from our full model against several popular alternatives, such as the larger BERT [34], DistilRoBERTa [33] and MPNet [32], which double the embedding size from 384 to 768. Besides the architecture of the language model, we analyze the impact of the loss functions the language models are trained with, by comparing the original embeddings of MPNet and DistilRoBERTa against versions that have been finetuned with contrastive losses at the sentence level to map semantically similar sentences into the same latent space [30]. We observe that the SBERT models that have been finetuned on sentence semantic similarity achieve significantly better results than the original language models trained on masked language modeling. Concretely, the original DistilRoBERTa model achieves an average sequence length of 2.21, while the SBERT DistilRoBERTa model achieves an average sequence length of 2.50. Finally, we also compare against a model conditioned on visual (ResNet-50) and language-goal features from a pre-trained CLIP model [15], which has been trained to align visual and language features from millions of image-caption pairs from the internet. Surprisingly, we find that its performance is slightly worse than our best performing model. We hypothesize that this might be due to a domain gap between the natural images that CLIP has been trained on and the simulated images from CALVIN. The results suggest that for complex semantics, the choice of the pre-trained language model has a large impact, and models finetuned on sentence level semantic similarity should be preferred. While in this paper we do not finetune the language models with the action loss, we anticipate this might lead to better performance, especially in order to ground instructions referring to the colored blocks.
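For reference, producing such instruction embeddings is a one-liner with the sentence-transformers library; the example sentences below are made up for illustration and need not match CALVIN's crowd-sourced annotations.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

lang_encoder = SentenceTransformer("paraphrase-MiniLM-L3-v2")
instructions = ["push the button to turn on the led",
                "lift the red block from the slider"]
goal_embeddings = lang_encoder.encode(instructions)   # numpy array of shape (2, 384)
```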
Multi Environment and Zero-Shot Generalization: Finally, we investigate the performance of our approach on the larger multi environment splits of CALVIN in Fig. 4. On the zero-shot split, which consists of training on three environments and testing on an unseen environment with unseen instructions, we observe that despite modest improvements over the MCIL baseline, the policy achieves only an average sequence length of 0.67. We hypothesize that in order to achieve better zero-shot performance, additional techniques from the domain adaptation literature, such as adversarial skill-transfer losses, might be helpful [36]. On the split that trains on all four environments and evaluates on one of them, we observe that HULC benefits from the larger dataset size and sets a new state of the art with an average sequence length of 3.06, which is higher than our best performing model trained and tested on environment D (2.64). The results suggest that increasing the number of collected language pairs aids in addressing the complicated perceptual grounding problem.
VI. CONCLUSION

We have presented a study into what matters in language conditioned robotic imitation learning over unstructured data that systematically analyzes, compares, and improves a set of key components. This study results in a range of novel observations about these components and their interactions, from which we integrate the best components and improvements into a state-of-the-art approach. Our resulting hierarchical HULC model learns a single policy from unstructured imitation data that substantially surpasses the state of the art on the challenging language conditioned long-horizon robot manipulation CALVIN benchmark. We hope it will be useful as a starting point for further research and will bring us closer towards general-purpose robots that can relate human language to their perception and actions.
REFERENCES

[1] C. Lynch et al., "Learning latent plans from play," in Proc. Conf. Robot Learn., 2020, pp. 1113–1132.
[2] D. Kalashnikov et al., "Scaling up multi-task robotic reinforcement learning," in Proc. 5th Conf. Robot Learn., 2022, pp. 557–575, arXiv:2104.08212.
[3] S. Harnad, "The symbol grounding problem," Physica D: Nonlinear Phenomena, vol. 42, no. 1/3, pp. 335–346, 1990.
[4] L. P. Kaelbling, "Learning to achieve goals," in Proc. Int. Joint Conf. Artif. Intell., 1993, pp. 1094–1099.
[5] C. Lynch and P. Sermanet, "Language conditioned imitation learning over unstructured data," in Proc. Robot.: Sci. Syst., 2021.
[6] S. Stepputtis, J. Campbell, M. Phielipp, S. Lee, C. Baral, and H. B. Amor, "Language-conditioned imitation learning for robot manipulation tasks," in Proc. 34th Int. Conf. Neural Inf. Process. Syst., 2020, Art. no. 1102.
[7] E. Jang et al., "BC-Z: Zero-shot task generalization with robotic imitation learning," in Proc. 5th Conf. Robot Learn., 2021, pp. 991–1002.
[8] D. I. A. Team et al., "Creating multimodal interactive agents with imitation and self-supervised learning," 2021, arXiv:2112.03763.
[9] S. Nair, E. Mitchell, K. Chen, B. Ichter, S. Savarese, and C. Finn, "Learning language-conditioned robot behavior from offline data and crowd-sourced annotation," in Proc. 5th Conf. Robot Learn., 2021, pp. 1303–1315.
[10] L. Shao, T. Migimatsu, Q. Zhang, K. Yang, and J. Bohg, "Concept2Robot: Learning manipulation concepts from instructions and human demonstrations," in Proc. Robot.: Sci. Syst., 2020.
[11] O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard, "CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks," IEEE Robot. Autom. Lett., vol. 7, no. 3, pp. 7327–7334, Jul. 2022.
[12] M. Andrychowicz et al., "Hindsight experience replay," in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 5055–5065.
[13] S. Tellex, N. Gopalan, H. Kress-Gazit, and C. Matuszek, "Robots that use language," Annu. Rev. Control, Robot. Auton. Syst., vol. 3, pp. 25–55, 2020.
[14] J. Lu, D. Batra, D. Parikh, and S. Lee, "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks," in Proc. 33rd Int. Conf. Neural Inf. Process. Syst., 2019, Art. no. 2.
[15] A. Radford et al., "Learning transferable visual models from natural language supervision," in Proc. Int. Conf. Mach. Learn., 2021, pp. 8748–8763.
[16] T. Winograd, "Understanding natural language," Cogn. Psychol., vol. 3, no. 1, pp. 1–191, 1972.
[17] M. Shridhar and D. Hsu, "Interactive visual grounding of referring expressions for human-robot interaction," in Proc. Robot.: Sci. Syst., 2018.
[18] J. Hatori et al., "Interactively picking real-world objects with unconstrained spoken language instructions," in Proc. IEEE Int. Conf. Robot. Autom., 2018, pp. 3774–3781.
[19] O. Mees and W. Burgard, "Composing pick-and-place tasks by grounding language," in Proc. Int. Symp. Exp. Robot., 2021, pp. 491–501.
[20] W. Liu, C. Paxton, T. Hermans, and D. Fox, "StructFormer: Learning spatial structure for language-guided semantic rearrangement of novel objects," in Proc. IEEE Int. Conf. Robot. Autom., 2022, pp. 6322–6329.
[21] M. Shridhar, L. Manuelli, and D. Fox, "CLIPort: What and where pathways for robotic manipulation," in Proc. 5th Conf. Robot Learn., 2021, pp. 894–906.
[22] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," 2013, arXiv:1312.6114.
[23] J. Borja-Diaz, O. Mees, G. Kalweit, L. Hermann, J. Boedecker, and W. Burgard, "Affordance learning from play for sample-efficient policy learning," in Proc. IEEE Int. Conf. Robot. Autom., 2022, pp. 6372–6378.
[24] A. Vaswani et al., "Attention is all you need," in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6000–6010.
[25] D. Hafner, T. P. Lillicrap, M. Norouzi, and J. Ba, "Mastering Atari with discrete world models," in Proc. Int. Conf. Learn. Representations, 2020.
[26] A. Van Den Oord et al., "Neural discrete representation learning," in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6309–6318.
[27] Y. Bengio, N. Léonard, and A. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," 2013, arXiv:1308.3432.
[28] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, "PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications," 2017, arXiv:1701.05517.
[29] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio, "Generating sentences from a continuous space," 2015, arXiv:1511.06349.
[30] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," in Proc. Conf. Empir. Methods Natural Lang. Process., 2019, pp. 3982–3992.
[31] D. Yarats, R. Fergus, A. Lazaric, and L. Pinto, "Mastering visual continuous control: Improved data-augmented reinforcement learning," 2021, arXiv:2107.09645.
[32] K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu, "MPNet: Masked and permuted pre-training for language understanding," in Proc. 34th Int. Conf. Neural Inf. Process. Syst., 2020, Art. no. 1414.
[33] Y. Liu et al., "RoBERTa: A robustly optimized BERT pretraining approach," 2019, arXiv:1907.11692.
[34] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2018, arXiv:1810.04805.
[35] P. de Haan, D. Jayaraman, and S. Levine, "Causal confusion in imitation learning," in Proc. 33rd Int. Conf. Neural Inf. Process. Syst., 2019, Art. no. 1049.
[36] O. Mees, M. Merklinger, G. Kalweit, and W. Burgard, "Adversarial skill networks: Unsupervised robot skill learning from videos," in Proc. IEEE Int. Conf. Robot. Autom., 2020, pp. 4188–4194.
[37] S. Dasari and A. Gupta, "Transformers for one-shot visual imitation," in Proc. Conf. Robot Learn., 2021, pp. 2071–2084.
