
IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 7, NO. 4, OCTOBER 2022

What Matters in Language Conditioned Robotic Imitation Learning Over Unstructured Data

Oier Mees, Graduate Student Member, IEEE, Lukas Hermann, and Wolfram Burgard, Fellow, IEEE

Abstract—A long-standing goal in robotics is to build robots that can perform a wide range of daily tasks from perceptions obtained with their onboard sensors and specified only via natural language. While recently substantial advances have been achieved in language-driven robotics by leveraging end-to-end learning from pixels, there is no clear and well-understood process for making various design choices due to the underlying variation in setups. In this letter, we conduct an extensive study of the most critical challenges in learning language conditioned policies from offline free-form imitation datasets. We further identify architectural and algorithmic techniques that improve performance, such as a hierarchical decomposition of the robot control learning, a multimodal transformer encoder, discrete latent plans and a self-supervised contrastive loss that aligns video and language representations. By combining the results of our investigation with our improved model components, we are able to present a novel approach that significantly outperforms the state of the art on the challenging language conditioned long-horizon robot manipulation CALVIN benchmark. We have open-sourced our implementation to facilitate future research in learning to perform many complex manipulation skills in a row specified with natural language.

Index Terms—Imitation learning, learning categories and concepts, machine learning for robot control.

Fig. 1. HULC learns a single 7-DoF language conditioned visuomotor policy from offline, unstructured data that can solve multi-stage, long-horizon robot manipulation tasks. We divide instruction following into learning global plans representing high-level behavior and a local policy conditioned on the plan and the instruction.
Manuscript received 24 February 2022; accepted 11 July 2022. Date of publication 3 August 2022; date of current version 29 August 2022. This letter was recommended for publication by Associate Editor Berthold Bauml and Editor Dana Kulic upon evaluation of the reviewers' comments. This work was supported by the German Federal Ministry of Education and Research under Contract 01IS18040B-OML. (Oier Mees and Lukas Hermann contributed equally to this work.) (Corresponding author: Oier Mees.)
Oier Mees and Lukas Hermann are with the University of Freiburg, 79085 Freiburg im Breisgau, Germany (e-mail: [email protected]; [email protected]).
Wolfram Burgard is with the Technical University of Nuremberg, 90489 Nürnberg, Germany (e-mail: [email protected]).
Codebase and trained models available at http://hulc.cs.uni-freiburg.de.
Digital Object Identifier 10.1109/LRA.2022.3196123

I. INTRODUCTION

ONE of the grand challenges in robotics is to create a generalist robot: a single agent capable of performing a wide variety of tasks in everyday settings based on arbitrary user commands. Doing so requires the robot to acquire a diverse repertoire of general-purpose skills and non-expert users to be able to effectively specify tasks for the robot to solve. This stands in contrast to most current end-to-end models, which typically learn individual tasks one at a time from manually-specified rewards and assume tasks are specified via goal images [1] or one-hot skill selectors [2], which are not practical for untrained users to instruct robots. Not only is this inefficient, but it also limits the versatility and adaptivity of the systems that can be built. How can we design learning systems that can efficiently acquire a diverse repertoire of useful skills that allows them to solve many different tasks based on arbitrary user commands?

To address this problem, we must resolve two questions. 1) How can untrained users direct the robot to perform specific tasks? Natural language presents a promising alternative form of specification, providing an intuitive and flexible way for humans to communicate tasks and refer to abstract concepts. However, learning to follow language instructions involves addressing a difficult symbol grounding problem [3], relating a language instruction to a robot's onboard perception and actions. 2) How can the robot efficiently learn general-purpose skills from offline data, without hand-specified rewards? A simple and versatile choice is to define skills as being continuous instead of discrete, endowing the agent with task-agnostic control: the ability to reach any reachable goal state from any current state [4]. These forms of task specification can in principle enable a robot to solve multi-stage tasks by following several language instructions in a row.

Recent advances have been made at learning language conditioned policies for continuous visuomotor control in 3D environments via imitation learning [5]–[8] or reinforcement learning [9], [10]. These approaches typically require offline data sources of robotic interaction together with post-hoc crowd-sourced natural language labels. Although all methods share the basic idea of leveraging instructions that are grounded in
the agent's high-dimensional observation space, their details vary greatly. Moreover, evaluating published methods and their components in language conditioned policy learning is difficult due to incomparable setups or subjective task definitions. In this work we systematically compare, improve, and integrate key components by leveraging the recently proposed CALVIN benchmark [11] to further our understanding and provide a unified framework for long-horizon language conditioned policy learning. We build upon relabeled imitation learning [12] to distill many reusable behaviors into a goal-directed policy, as visualized in Fig. 1. Our approach consists of only standard supervised learning subroutines, and learns perceptual and linguistic understanding, together with task-agnostic control, end-to-end as a single neural network. Our contributions are:
• We systematically compare key components of language conditioned imitation learning over unstructured data, such as observation and action spaces, losses for aligning visuo-lingual representations, language models and latent plan representations, and we analyze the effect of other choices, such as data augmentation and optimization.
• We propose four improvements to these key components: a multimodal transformer encoder to learn to recognize and organize behaviors during robotic interaction into a global categorical latent plan, a hierarchical division of the robot control learning that learns local policies in the gripper camera frame conditioned on the global plan, balancing terms within the KL loss, and a self-supervised contrastive visual-language alignment loss.
• We integrate the best performing improved components in a unified framework, Hierarchical Universal Language Conditioned Policies (HULC). Our model sets a new state of the art on the challenging CALVIN benchmark [11], learning a single 7-DoF policy that can perform long-horizon manipulation tasks in a 3D environment, directly from images, and specified only with natural language.
II. RELATED WORK

Natural language processing has recently received much attention in the field of robotics [13], following the advances made towards learning groundings between vision and language [14], [15] and grounding behaviors in language [16]. Early works have approached instruction following by designing interactive fetching systems to localize objects mentioned in referring expressions [17], [18] or by grounding not only objects, but also spatial relations to follow language expressions characterizing pick-and-place commands [19]–[21]. Unlike these approaches, we directly learn robotic control from images and natural language instructions, and do not assume any predefined motion primitives.

More recently, end-to-end deep learning has been used to condition agents on natural language instructions [5]–[10], which are then trained under an imitation or reinforcement learning objective. These works have pushed the state of the art and generated a range of ideas for language conditioned policy learning, such as losses for aligning visual observations and language instructions. However, each work evaluates a different combination of ideas and uses different setups or task definitions, making it unclear how individual ideas compare to each other and which ideas combine well together. For example, BC-Z and MIA [7], [8] both use behavior cloning, but different action spaces and multi-modal alignment losses, such as regressing the language embedding from visual observations [7] or cross-modality matching [8]. Moreover, BC-Z leverages expert trajectories and task labels, and MIA includes mobile navigation, making them difficult to implement directly in CALVIN, which contains unlabeled play data on different tabletop environments. Nair et al. [9] learn a reward classifier which predicts if a change in state completes a language instruction and leverage it for offline multi-task RL given four camera views. Similar to BC-Z, they rely on discrete task labels and do not focus on solving long-horizon language-specified tasks. Most related to our approach is multi-context imitation learning (MCIL) [5], which also uses relabeled imitation learning to distill reusable behaviors into a goal-reaching policy. Besides different action and observation spaces, these works leverage different language models to encode the raw text instructions into a semantic pre-trained vector space, making it difficult to analyze which language models are best suited for language conditioned policy learning. The ablation studies presented in these papers show that each novel contribution of each work does indeed improve the performance of their model, but due to incomparable setups and evaluation protocols, it is difficult to assess what matters for language conditioned policy learning. Our work addresses this problem by systematically comparing and combining different observation and action spaces, auxiliary losses and latent representations, and integrating the best performing components in a unified framework.

III. PROBLEM FORMULATION AND METHOD OVERVIEW

We consider the problem of learning a goal-conditioned policy π_θ(a_t | s_t, l) that outputs actions a_t ∈ A, conditioned on the current state s_t ∈ S and a free-form language instruction l ∈ L, under environment dynamics T : S × A → S. We note that the agent does not have access to the true state of the environment, but only to visual observations. In CALVIN [11] the action space A consists of the 7-DoF control of a Franka Emika Panda robot arm with a parallel gripper.

We model the interactive agent with a general-purpose goal-reaching policy based on multi-context imitation learning (MCIL) from play data [5]. To learn from unstructured "play" we assume access to an unsegmented teleoperated play dataset D of semantically meaningful behaviors provided by users, without a set of predefined tasks in mind. To learn control, this long temporal state-action stream D = {(s_t, a_t)}_{t=0}^{∞} is relabeled [12], treating each visited state in the dataset as a "reached goal state," with the preceding states and actions treated as optimal behavior for reaching that goal. Relabeling yields a dataset D_play = {(τ, s_g)_i}_{i=0}^{|D_play|}, where each goal state s_g has a trajectory demonstration τ = {(s_0, a_0), . . .} solving for the goal. These short horizon, goal image conditioned demonstrations can be fed to a simple maximum likelihood goal conditioned imitation
objective:

\mathcal{L}_{LfP} = \mathbb{E}_{(\tau, s_g) \sim D_{play}} \Big[ \sum_{t=0}^{|\tau|} \log \pi_\theta(a_t \mid s_t, s_g) \Big] \quad (1)

to learn a goal-reaching policy π_θ(a_t | s_t, s_g).
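To make the relabeling step and the objective in (1) concrete, the following is a minimal PyTorch-style sketch, assuming the play stream is stored as tensors and that the policy returns a torch.distributions object; the helper names (relabel_window, lfp_loss) are illustrative and not taken from the released codebase.

```python
import torch

def relabel_window(states, actions, start, end):
    """Hindsight relabeling of an unsegmented play stream (illustrative).

    states:  (N, ...) tensor of visited observations
    actions: (N, action_dim) tensor of executed actions
    start, end: indices of a randomly sampled window

    The last observation of the window is treated as the "reached goal state"
    s_g, and the preceding state-action pairs as an optimal demonstration tau.
    """
    tau_states = states[start:end]
    tau_actions = actions[start:end]
    goal = states[end - 1]
    return tau_states, tau_actions, goal

def lfp_loss(policy, tau_states, tau_actions, goal):
    """Maximum-likelihood goal-conditioned imitation objective, Eq. (1)."""
    log_probs = [policy(s_t, goal).log_prob(a_t).sum()
                 for s_t, a_t in zip(tau_states, tau_actions)]
    return -torch.stack(log_probs).mean()  # minimize the negative log-likelihood
```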
We address the inherent multi-modality in free-form imitation datasets by auto-encoding contextual demonstrations through a latent "plan" space with a sequence-to-sequence conditional variational auto-encoder (seq2seq CVAE) [1]. Conditioning the policy on the latent plan frees up the policy to use the entirety of its capacity for learning uni-modal behavior. To generate latent plans z we make use of the variational inference framework [22]. The objective of the latent plan sampler is to model the full distribution over all high-level behaviors that might connect the current and goal state, to provide multi-modal plans at inference time. This distribution is learned with a CVAE by maximizing the marginal log likelihood of the observed behaviors in the dataset, log p(x | s), where x are sampled state-action trajectories from τ. The Evidence Lower Bound (ELBO) [22] for the CVAE can be written as:

\log p(x \mid s) \ge -\mathrm{KL}\big(q(z \mid x, s) \,\|\, p(z \mid s)\big) + \mathbb{E}_{q(z \mid x, s)}\big[\log p(x \mid z, s)\big] \quad (2)

The decoder is a policy trained to reconstruct input actions, conditioned on state s_t, goal s_g, and an inferred plan z for how to get from s_t to s_g. At test time, it takes a goal as input, and infers and follows plan z in closed-loop.
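As an illustration of the ELBO in (2), here is a schematic seq2seq CVAE over latent plans with a Gaussian plan space, as used by MCIL; the module interfaces (posterior_enc, prior_net, policy_decoder) are hypothetical placeholders rather than the actual HULC implementation.

```python
import torch
import torch.nn as nn

class PlanCVAE(nn.Module):
    """Schematic CVAE over latent plans (Gaussian variant, Eq. 2)."""

    def __init__(self, posterior_enc, prior_net, policy_decoder):
        super().__init__()
        self.posterior = posterior_enc   # q(z | x, s): sees the whole state-action sequence
        self.prior = prior_net           # p(z | s):    sees only current state and goal
        self.decoder = policy_decoder    # p(x | z, s): the plan-conditioned policy

    def elbo(self, seq_states, seq_actions, s0, goal, beta=0.01):
        q = self.posterior(seq_states, seq_actions)   # torch.distributions.Normal
        p = self.prior(s0, goal)                      # torch.distributions.Normal
        z = q.rsample()                               # reparameterized plan sample
        recon = self.decoder.log_prob(seq_actions, z, seq_states).mean()
        kl = torch.distributions.kl_divergence(q, p).sum(-1).mean()
        return recon - beta * kl                      # maximize the lower bound
```

At test time only the prior and decoder are used: a plan is sampled from p(z | s) given the current observation and the goal, and the decoder follows it in closed loop.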
However, when learning language conditioned policies π_θ(a_t | s_t, l) it is not possible to relabel any visited state s to a natural language goal, as the goal space is no longer equivalent to the observation space. Lynch et al. [5] showed that pairing a small number of random windows with after-the-fact language instructions enables learning a single language conditioned visuomotor policy that can perform a wide variety of robotic manipulation tasks. The key insight here is that training a single imitation learning policy on either goal image or language goals allows for learning control mostly from unlabeled play data and reduces the burden of language annotation to less than 1% of the total data. Concretely, given multiple contextual imitation datasets D = {D_0, D_1, . . . , D_K}, each with a different way of describing tasks, MCIL trains a single latent goal conditioned policy π_θ(a_t | s_t, z) over all datasets simultaneously, as well as one parameterized encoder per dataset.

IV. KEY COMPONENTS OF LANGUAGE CONDITIONED IMITATION LEARNING OVER UNSTRUCTURED DATA

This section compares and improves key components of language conditioned imitation learning over unstructured data. We base our model on MCIL [5] and improve it by decomposing control into a hierarchical approach of generating global plans with a static camera and learning local policies with a gripper camera conditioned on the plan. Then we go through different components that have a large impact on performance: architectures to encode sequences in relabeled imitation learning, the representation of the latent distributions, how to best align language and visual representations, data augmentation and optimization. We visualize the full architecture in Fig. 2.

A. Observation and Action Spaces

How to best represent motion skills is an age-old question in robotics. From a learning perspective, generating the action sequences to solve diverse manipulation tasks with a single network from high-dimensional observations is challenging, because the distribution is multi-modal, discontinuous and imbalanced. For these reasons, finding an efficient representation is crucial to perform this non-trivial reasoning using learning-based methods. MCIL [5] uses global actions learned from a single static RGB camera. We observe that predicting 7-DoF global actions leads to the network primarily solving static element tasks, such as pushing a button, but failing to generalize to dynamic tasks, such as manipulating colored blocks. To alleviate this problem, we propose generating global plans that correspond to reusable common behavior b seen in the play data, but learning local policies conditioned on the plan. This results in a hierarchical approach that frees the network from having to memorize all locations in the scene where the behaviors were performed. Concretely, we encode RGB images from both the static and a gripper camera to learn a compact representation of all the different high-level plans that take an agent from a current state to a goal state, learning p(b | s_t, s_g). Inspired by a recent line of work that aims to learn hierarchies of controllers based on static and gripper cameras [23], we use the encoded gripper camera representations in the policy network, together with the global contextualized latent plan, and perform control in the gripper frame with relative actions for efficient robot control learning. The action space consists of delta XYZ position, delta Euler angles and the gripper action. Our proposed formulation has several advantages: a) local policies based on the gripper camera generalize better to different locations of the objects to be manipulated; b) the policy has a prior in the form of a global contextualized latent plan, but is free to discover the exact strategy on how to interact with the objects.
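The sketch below shows one plausible way to turn consecutive end-effector poses from the play stream into relative 7-DoF actions (delta XYZ, delta Euler angles, gripper); it relies on SciPy's Rotation class and does not claim to reproduce the exact frame, normalization, or clipping conventions of CALVIN or the authors' code.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def relative_action(tcp_pos, tcp_orn, next_pos, next_orn, gripper):
    """Illustrative relative action between two consecutive gripper poses.

    tcp_pos, next_pos: (3,) end-effector positions at t and t+1
    tcp_orn, next_orn: (3,) end-effector Euler angles (xyz) at t and t+1
    gripper:           +1 open / -1 close
    """
    rot_t = R.from_euler("xyz", tcp_orn)
    delta_pos = rot_t.inv().apply(next_pos - tcp_pos)          # expressed in the gripper frame
    delta_orn = (rot_t.inv() * R.from_euler("xyz", next_orn)).as_euler("xyz")
    return np.concatenate([delta_pos, delta_orn, [gripper]])   # 7-DoF relative action
```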
B. Latent Plan Encoding

A challenge in self-supervising control on top of free-form imitation data is that, in general, there are many valid high-level behaviors that might connect the same (s_t, s_g) pairs. By auto-encoding contextual demonstrations through a latent "plan" space with a sequence-to-sequence conditional variational auto-encoder (seq2seq CVAE) [1], we can learn to recognize which region of the latent plan space an observation-action sequence belongs to. Critically, conditioning the policy on the latent plan frees up the policy to use the entirety of its capacity for learning uni-modal behavior. Thus, learning to generate and represent high-quality latent plans is a key component in the seq2seq CVAE framework. MCIL [5] uses bidirectional recurrent neural networks (RNN) to encode a randomly sampled play sequence and map it into a latent Gaussian distribution. In contrast, we leverage a multimodal transformer encoder [24] to build a contextualized representation of abstract behavior expressed in language instructions and map it into a vector of several latent categorical variables [25].


Fig. 2. Overview of our architecture to learn language conditioned policies from unstructured data. First the language instructions and the visual observations are
encoded. During training a multimodal transformer encodes sequences of observations to learn to recognize and organize high-level behaviors through a posterior.
Its temporally contextualized features are provided as input to a contrastive visuo-lingual alignment loss. The plan sampler network receives the initial state and
the latent language goal and predicts the distribution over plans for achieving the goal. Both prior and posterior distributions are predicted as a vector of multiple
categorical variables and are trained by minimizing their KL divergence. The local policy network receives the latent language instruction, the gripper camera
observation and the global latent plan to generate a sequence of relative actions in the gripper camera frame to achieve the goal.

The foundation of the Transformer architecture is the scaled dot-product attention function, which enables elements in a sequence to attend to other elements. The attention function receives as input a sequence {x_1, . . ., x_n} and outputs a sequence {y_1, . . ., y_n}. Each input x_i is linearly projected to a query q_i, key k_i, and value v_i. To compute the output y_i, the values are summed with weights that take into account the similarity of the query with its corresponding key. The attention function is defined as Attention(Q, K, V) = softmax(QK^T / √d_k) V, where d_k is the dimension of the keys and queries. The queries, keys, and values are stacked together into matrices Q ∈ R^{n×d_model}, K ∈ R^{n×d_model}, and V ∈ R^{n×d_model}. We encode the sequence of visual observations of both modalities X_{static,gripper} ∈ R^{T×H×W×3} with separate perceptual encoders, and concatenate them to form the fused perceptual representation V ∈ R^{T×d} of the sampled demonstration, where T represents the sequence length and d the feature dimension. To enable the sequences to carry temporal information, we add positional embeddings [24] and feed the result into the multimodal transformer to learn temporally contextualized global video representations. Finally, inspired by the recent line of work that looks into learning discrete instead of continuous latent codes [25], [26], we represent the latent plans as a vector of multiple categorical latent variables and optimize them using straight-through gradients [27]. Learning discrete representations in the context of language conditioned policies is a natural fit, as language is inherently discrete and images can often be described concisely by language [19]. Furthermore, discrete representations are a natural fit for complex reasoning, planning and predictive learning (e.g., if it is sunny, I will go to the beach).
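A compact sketch of such a posterior follows: a standard PyTorch transformer encoder over the fused static and gripper camera features, followed by 32 categorical variables with 32 classes each, sampled with a straight-through estimator. The pooling choice and module names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CategoricalPlanEncoder(nn.Module):
    """Sketch of a transformer posterior producing a discrete latent plan."""

    def __init__(self, feat_dim=256, n_vars=32, n_classes=32, n_layers=2, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           dim_feedforward=2048, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_logits = nn.Linear(feat_dim, n_vars * n_classes)
        self.n_vars, self.n_classes = n_vars, n_classes

    def forward(self, fused_seq, pos_emb):
        # fused_seq: (B, T, feat_dim) concatenated static + gripper camera features
        h = self.transformer(fused_seq + pos_emb)        # temporally contextualized
        logits = self.to_logits(h[:, -1])                # pool the sequence (one choice)
        logits = logits.view(-1, self.n_vars, self.n_classes)
        probs = F.softmax(logits, dim=-1)
        # Straight-through estimator: hard one-hot sample in the forward pass,
        # softmax gradients in the backward pass.
        index = torch.distributions.Categorical(probs=probs).sample()
        one_hot = F.one_hot(index, self.n_classes).float()
        plan = one_hot + probs - probs.detach()
        return plan.flatten(1), logits                   # (B, n_vars * n_classes)
```

The expression one_hot + probs - probs.detach() keeps the discrete sample in the forward pass while letting gradients flow through the softmax probabilities, which is the straight-through trick of [27].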
C. Semantic Alignment of Video and Language

Learning to follow language instructions involves addressing a difficult symbol grounding problem [3], relating a language instruction to a robot's onboard perception and actions. Although instructions and visual observations are aligned in CALVIN, learning to manipulate the colored blocks is a challenging problem. This is due to the fact that the robot needs to learn a wide variety of diverse behaviors to manipulate the blocks, but also needs to understand which colored block the user is referring to. Thus, the block related instructions are very similar, with the exception of a word that might disambiguate the instruction by indicating a color. Therefore, most pre-trained language models struggle to learn such semantics from text only, and the policy needs to learn referring expression comprehension via the imitation loss. There have been a number of multi-modal alignment losses proposed, such as regressing the language embedding from the visual observation [7] or cross-modality matching [8]. We maximize the cosine similarity between the visual features of sequence i and the corresponding language features while, at the same time, minimizing the cosine similarity between the current visual features and other language instructions in the same batch. We define our L_contrast loss the same way as the contrastive loss for pairing images and captions in CLIP [15]. However, ideally our model should use the time-dependent representation of the sequence's visual observations in order to capture the meaning of a language instruction, which can be appreciated only after the sequence of actions has been executed for several timesteps. The usage of in-batch negatives enables re-use of computation both in the forward and the backward pass, making training highly efficient.
The logits for one batch form an M × M matrix, where each entry is given by logit(x_i, y_j) = cos_sim(x_i, y_j) · exp(τ), ∀(i, j), i, j ∈ {1, 2, . . . , M}, where τ is a trainable temperature parameter. Only entries on the diagonal of the matrix are considered positive examples. The final loss is the sum of the cross entropy losses along the row and the column directions.
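A minimal sketch of this symmetric, CLIP-style loss is given below, assuming precomputed video and language embeddings of shape (M, d) and a learnable log-temperature; the function name and reduction details are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb, lang_emb, log_temp):
    """CLIP-style alignment loss over a batch of M (video, instruction) pairs."""
    video_emb = F.normalize(video_emb, dim=-1)
    lang_emb = F.normalize(lang_emb, dim=-1)
    logits = video_emb @ lang_emb.t() * log_temp.exp()   # M x M scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Diagonal entries are the positives; rows and columns give the two
    # cross-entropy terms, summed as described above.
    return F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)
```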
D. Action Decoder

A challenge in learning control from free-form imitation data, in which different ways of executing the same skill are shown, is that a standard unimodal predictor, such as a Gaussian distribution, will average out dissimilar motions. To address this multimodality, we follow the solution proposed by Lynch et al. [1] of discretizing the action space and then parameterizing the policy as a discretized logistic mixture distribution [28], [37]. Each of the predicted k logistic distributions has a separate mean and scale, and they are weighted with α to form the mixture distribution. The imitation loss is the negative log-likelihood of this distribution:

\mathcal{L}_{act}(D_{play}, V) = -\ln \Big( \sum_{i=1}^{k} \alpha_i(V_t) \, P\big(a_t \mid \mu_i(V_t), \sigma_i(V_t)\big) \Big)

where P(a_t | μ_i(V_t), σ_i(V_t)) = F((a_t + 0.5 − μ_i(V_t)) / σ_i(V_t)) − F((a_t − 0.5 − μ_i(V_t)) / σ_i(V_t)) and F(·) is the logistic CDF. Additionally, we use a cross-entropy loss to model the binary gripper open/close action.
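The sketch below implements a discretized logistic mixture negative log-likelihood consistent with the formula above, assuming actions rescaled to [-1, 1] and 256 bins; handling of the edge bins is omitted for brevity and the helper name is hypothetical.

```python
import torch
import torch.nn.functional as F

def dlml_nll(actions, logit_weights, means, log_scales, num_bins=256):
    """Negative log-likelihood of a discretized logistic mixture (sketch).

    actions:       (B, D) continuous actions rescaled to [-1, 1]
    logit_weights: (B, D, K) unnormalized mixture weights alpha
    means:         (B, D, K) per-component means mu
    log_scales:    (B, D, K) per-component log scales log(sigma)
    """
    centered = actions.unsqueeze(-1) - means
    inv_scale = torch.exp(-log_scales)
    half_bin = 1.0 / (num_bins - 1)                               # half width of one bin
    cdf_plus = torch.sigmoid(inv_scale * (centered + half_bin))   # F((a + h - mu) / sigma)
    cdf_minus = torch.sigmoid(inv_scale * (centered - half_bin))  # F((a - h - mu) / sigma)
    log_prob = torch.log((cdf_plus - cdf_minus).clamp(min=1e-12))
    log_prob = log_prob + F.log_softmax(logit_weights, dim=-1)    # weight each component
    return -torch.logsumexp(log_prob, dim=-1).mean()
```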
E. Optimization and Implementation Details

Our full training objective for the 1% of the total data that is annotated with after-the-fact language instructions is given by L = L_act + β L_KL + λ L_contrast. The windows without annotations are trained with the same imitation learning objective, but the language goals are replaced by the last visual frame of the sampled window, to learn control in a fully self-supervised manner. A common problem in training VAEs is finding the right balance in the weight of the KL loss. A high β value can result in an over-regularized model in which the decoder ignores the latent plans from the prior, also known as "posterior collapse" [29]. On the other hand, setting β too low results in the plan sampler network being unable to catch up to the latent plan space created by the posterior, and as a result, at test time the plans generated by the plan sampler network will be unfamiliar inputs for the decoder. Orthogonal to this, as the KL loss is bidirectional, we want to avoid regularizing the plans generated by the posterior toward a poorly trained prior. To solve this problem, we minimize the KL loss faster with respect to the prior than the posterior by using different learning rates, α = 0.8 for the prior and 1 − α for the posterior, similar to Hafner et al. [25].
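A common way to implement this asymmetry is KL balancing with stop-gradients, sketched below for the categorical plan distributions; whether the released code uses exactly this two-term form is an assumption, but it matches the α-weighted scheme of Hafner et al. [25] referenced above.

```python
import torch
from torch.distributions import Categorical, kl_divergence

def balanced_kl(posterior_logits, prior_logits, alpha=0.8):
    """KL balancing between categorical posterior q and prior p (sketch).

    Logits have shape (B, n_vars, n_classes), e.g. 32 categoricals with 32 classes.
    The term weighted by alpha only updates the prior (posterior detached);
    the term weighted by (1 - alpha) only updates the posterior.
    """
    q, p = Categorical(logits=posterior_logits), Categorical(logits=prior_logits)
    q_sg = Categorical(logits=posterior_logits.detach())
    p_sg = Categorical(logits=prior_logits.detach())
    kl = alpha * kl_divergence(q_sg, p) + (1 - alpha) * kl_divergence(q, p_sg)
    return kl.sum(dim=-1).mean()   # sum over latent variables, average over the batch
```

This KL term is then weighted by β in the full training objective stated above.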
We set β = 0.01 and λ = 3 for all experiments and train with the Adam optimizer with a learning rate of 2e-4. During training, we randomly sample windows of length 20 to 32 and pad them to the maximum length of 32. For the latent plan representation we use 32 categoricals with 32 classes each. To better compare the differences between approaches, we use the same convolutional encoders as the MCIL baseline available in CALVIN for processing the images of the static and gripper camera. Our multimodal transformer encoder has 2 blocks, 8 self-attention heads, and a hidden size of 2048. In order to encode raw text into a semantic pre-trained vector space, we leverage the paraphrase-MiniLM-L3-v2 model [30], which distills a large Transformer based language model and is trained on paraphrase language corpora mainly derived from Wikipedia. It has a vocabulary size of 30,522 words and maps a sentence of any length into a vector of size 384.

F. Data Augmentation

To aid learning, we apply data augmentation to image observations, both in our method and across all baselines. During training, we apply stochastic image shifts of 0-4 pixels to the gripper camera images and of 0-10 pixels to the static camera images, as in Yarats et al. [31]. Additionally, a bilinear interpolation is applied on top of the shifted image, replacing each pixel with the average of the nearest pixels.
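A sketch in the spirit of the random-shift augmentation of Yarats et al. [31] follows; it pads with replicated borders and resamples the image with a bilinear grid, assumes square images, and is not copied from the authors' training code.

```python
import torch
import torch.nn.functional as F

def random_shift(imgs, pad):
    """Stochastic image shift with bilinear resampling (sketch, square images).

    imgs: (B, C, H, W) float tensor; pad: maximum shift in pixels (e.g. 4 or 10).
    """
    b, c, h, w = imgs.shape
    padded = F.pad(imgs, [pad] * 4, mode="replicate")
    eps = 1.0 / (h + 2 * pad)
    coords = torch.linspace(-1.0 + eps, 1.0 - eps, h + 2 * pad, device=imgs.device)[:h]
    coords = coords.unsqueeze(0).repeat(h, 1).unsqueeze(2)
    base_grid = torch.cat([coords, coords.transpose(1, 0)], dim=2)      # (H, W, 2)
    base_grid = base_grid.unsqueeze(0).repeat(b, 1, 1, 1)               # (B, H, W, 2)
    shift = torch.randint(0, 2 * pad + 1, (b, 1, 1, 2), device=imgs.device).float()
    shift *= 2.0 / (h + 2 * pad)
    return F.grid_sample(padded, base_grid + shift, padding_mode="zeros",
                         align_corners=False)
```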
V. EXPERIMENTS

We evaluate our model in an extensive comparison and ablation study to determine which components matter for language conditioned imitation learning over unstructured data. We ablate single components of our full approach to study the influence of each component. We then compare our resulting model to the best published methods on the CALVIN benchmark, and show that it outperforms all previous methods.
rates, α = 0.8 for the prior and 1 − α for the posterior, similar to
might contain suboptimal behavior. To simulate a real-world
Hafner et al. [25]. We set β = 0.01 and λ = 3 for all experiments
scenario, only 1% of that data contains crow-sourced language
and train with the Adam optimizer with a learning rate of 2−4 .
annotations.
During training, we randomly sample windows between length
20 and 32 and pad them until the max length of 32. For the latent
plan representation we use 32 categoricals with 32 classes each. B. Results and Ablations of Key Components
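Schematically, the protocol can be pictured as the loop below; the environment and policy interfaces (env.step returning an info dict with a task-success flag, policy.step) are hypothetical stand-ins for the CALVIN API, and the per-subtask step budget is only an example.

```python
def evaluate_chains(env, policy, instruction_chains, max_steps=360):
    """Sketch of the long-horizon evaluation: advance to the next instruction
    only if the current subtask succeeds; report the average chain length."""
    solved_lengths = []
    for chain in instruction_chains:          # e.g. 1000 chains of 5 instructions
        obs = env.reset()                     # robot starts from the neutral pose
        solved = 0
        for instruction in chain:
            success = False
            for _ in range(max_steps):
                action = policy.step(obs, instruction)
                obs, _, _, info = env.step(action)
                if info.get("task_success", False):
                    success = True
                    break
            if not success:
                break                         # the chain ends at the first failure
            solved += 1
        solved_lengths.append(solved)
    return sum(solved_lengths) / len(solved_lengths)   # average sequence length
```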
B. Results and Ablations of Key Components

Observation and Action Spaces: We compare our approach of dividing the robot control learning into generating global contextualized plans and conditioning a local policy that receives only the observations of the gripper camera on the global plan, against a "No Local Policy" baseline.


Fig. 3. Performance of our model on the D environment of the CALVIN Challenge and ablation of the key components, across three seeded runs. All models receive RGB images from both a static and a gripper camera as input.

Unlike our approach, which performs control in the gripper camera frame, the baseline's policy receives both cameras' images and performs control in the robot's base frame, as is usual in most published approaches. We observe in Fig. 3 that, despite the baseline's decoder having more perceptual information, the performance for completing five chains of language instructions sequentially drops from 28.3% to 20.1%. In order to analyze the big performance difference with respect to the original MCIL baseline, we train an MCIL baseline with relative actions and observe that its performance improves significantly over the original MCIL baseline with absolute actions, but remains worse than our models. We speculate that relative actions with a local policy are easier for the agent to learn than memorizing all the locations where interactions have been performed with global actions and a global observation space. By decoupling the control into a hierarchical structure, we show that performance increases significantly. Additionally, we analyze the influence of using the 7-DoF proprioceptive information as input for both the plan encodings and conditioning the policy, as many works report improved performance from it [1], [2], [5]. We observe that the performance drops significantly and the agent relies too much on the robot's initial position, rather than learning to disentangle initial states and tasks. We hypothesize this might be due to a causal confusion between the proprioceptive information and the target actions [35]. We also analyze the effect of modeling the full action space, including the binary gripper action dimension, with the mixture of logistics distribution instead of using the log loss for the open/close gripper action, and observe that the average sequence length drops from 2.64 to 2.45. Finally, we note that applying stochastic image shifts to the input images increases the performance significantly.

Latent Plan Encoding: In our CVAE framework, the latent plan represents valid ways of connecting the current state and the goal state and thus frees up the policy to use the entirety of its capacity for learning uni-modal behavior. As language is inherently discrete and discrete representations are a natural fit for complex reasoning and planning, we represent latent plans as a vector of multiple categorical latent variables and optimize them using straight-through gradients [27]. We observe that the performance for the 5-chain evaluation drops from 28.3% to 23.6% when we train our model with a diagonal Gaussian distribution as in MCIL. While it is difficult to judge why categorical latents work better than continuous latent variables, we hypothesize that categorical latents could be a better inductive bias for non-smooth aspects of the CALVIN benchmark, such as when a block is hidden behind the sliding door. Besides, the sparsity level enforced by a categorical distribution could be beneficial for generalization. Additionally, we compare against a goal-conditioned behavior cloning (GCBC) baseline [1] which does not condition the policy on a latent plan, and observe that it performs worse than MCIL with relative actions, highlighting the importance of modeling latent behaviors in free-form imitation datasets. We also observe that balancing the KL loss is beneficial in the CVAE training. By scaling up the prior cross entropy relative to the posterior entropy, the agent is encouraged to minimize the KL loss by improving its prior toward the more informed posterior, as opposed to reducing the KL by increasing the posterior entropy. We visualize a t-SNE plot of our learned discrete latent space in Fig. 5 and see that even for unseen language instructions it appears to organize the latent space functionally.


Fig. 4. Performance of our model on the multi environment splits of the CALVIN Challenge across three seeded runs.

Additionally, we report degraded performance for an over-regularized model which learns to ignore the latent plans, in which we weight the KL divergence with β = 0.1. Finally, we evaluate replacing the transformer encoder in the posterior with a bidirectional GRU recurrent network of the same hidden dimension of 2048, similar to MCIL. The results suggest that besides improved performance, the multimodal transformer encoder is significantly more efficient in terms of both memory and model size (5.9 M vs 106 M parameters for the posterior network) and overall training wall clock time. For comparison, with the transformer encoder our full approach contains 47.1 M trainable parameters.

Fig. 5. t-SNE visualization of the discrete latent plans generated by embedding randomly selected unseen language annotations. Surprisingly, we find that despite not being trained explicitly with task labels, HULC appears to organize its latent plan space functionally. We visualize functionally similar skills with the same color, but use different shapes to distinguish sub-skills.

Semantic Alignment of Video and Language: One of the main challenges for language conditioned continuous visuomotor control is solving a difficult symbol grounding problem [3], relating a language instruction to a robot's onboard perception and actions. An agent in CALVIN needs to learn a wide variety of diverse behaviors to manipulate blocks with different shapes, but also needs to understand which colored block the user is instructing it to manipulate. We compare commonly used auxiliary losses for aligning visual and language representations. Concretely, we compare our contrastive loss against predicting the language embedding from the sequence's visual observations with a cosine loss [7], cross-modality matching [8], and not having an auxiliary visuo-lingual alignment loss. We observe that using an auxiliary loss to semantically align the sampled video sequences and the language instructions helps, but both baselines perform similarly. We hypothesize that our contrastive loss works best because it leverages a larger number of in-batch negatives than the cross-modality matching loss. Concretely, we maximize the cosine similarity for the N real pairs in the batch while minimizing the cosine similarity of the multimodal embeddings of the N² − N incorrect pairings. The cross-modality matching loss implements a discriminator that produces a binary prediction of whether the embeddings match or not; the batch is shuffled only once to produce the negative samples, contrasting only N negative samples.

Language Models: Despite steady progress in language conditioned policy learning, a fundamental but less considered aspect is the choice of the pre-trained language model used to encode raw text into a semantic pre-trained vector space. We compare the lightweight paraphrase-MiniLM-L3-v2 language embeddings from our full model against several popular alternatives, such as the larger BERT [34], DistilRoBERTa [33] and MPNet [32], which double the embedding size from 384 to 768. Besides the architecture of the language model, we analyze the impact of the loss functions the language models are trained with, by comparing the original embeddings of MPNet and DistilRoBERTa against versions that have been finetuned with contrastive losses at the sentence level to map semantically similar sentences into the same latent space [30]. We observe that the SBERT models that have been finetuned on sentence semantic similarity achieve significantly better results than the original language models trained on masked language modeling. Concretely, the original DistilRoBERTa model achieves an average sequence length of 2.21, while the SBERT DistilRoBERTa model achieves an average sequence length of 2.50. Finally, we also compare against a model conditioned on visual (ResNet-50) and language-goal features from a pre-trained CLIP model [15], which has been trained to align visual and language features from millions of image-caption pairs from the internet. Surprisingly, we find that its performance is slightly worse than our best performing model. We hypothesize that this might be due to a domain gap between the natural images that CLIP has been trained on and the simulated images from CALVIN. The results suggest that for complex semantics, the choice of the pre-trained language model has a large impact, and models finetuned on sentence level semantic similarity should be preferred. While in this paper we do not finetune the language models with the action loss, we anticipate this might lead to better performance, especially in order to ground instructions referring to the colored blocks.
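For reference, producing such instruction embeddings is a one-liner with the sentence-transformers library; the example sentences below are made up for illustration and need not match CALVIN's crowd-sourced annotations.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

lang_encoder = SentenceTransformer("paraphrase-MiniLM-L3-v2")
instructions = ["push the button to turn on the led",
                "lift the red block from the slider"]
goal_embeddings = lang_encoder.encode(instructions)   # numpy array of shape (2, 384)
```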
Multi Environment and Zero-Shot Generalization: Finally, we investigate the performance of our approach on the larger multi environment splits of CALVIN in Fig. 4. On the zero-shot split, which consists of training on three environments and testing on an unseen environment with unseen instructions, we observe that despite modest improvements over the MCIL baseline, the policy achieves only an average sequence length of 0.67. We hypothesize that in order to achieve better zero-shot performance, additional techniques from the domain adaptation literature, such as adversarial skill-transfer losses, might be helpful [36]. On the split that trains on all four environments and evaluates on one of them, we observe that HULC benefits from the larger dataset size and sets a new state of the art with an average sequence length of 3.06, which is higher than our best performing model trained and tested on environment D (2.64). The results suggest that increasing the number of collected language pairs aids in addressing the complicated perceptual grounding problem.
VI. CONCLUSION

We have presented a study into what matters in language conditioned robotic imitation learning over unstructured data that systematically analyzes, compares, and improves a set of key components. This study results in a range of novel observations about these components and their interactions, from which we integrate the best components and improvements into a state-of-the-art approach. Our resulting hierarchical HULC model learns a single policy from unstructured imitation data that substantially surpasses the state of the art on the challenging language conditioned long-horizon robot manipulation CALVIN benchmark. We hope it will be useful as a starting point for further research and will bring us closer towards general-purpose robots that can relate human language to their perception and actions.
REFERENCES

[1] C. Lynch et al., "Learning latent plans from play," in Proc. Conf. Robot Learn., 2020, pp. 1113–1132.
[2] D. Kalashnikov et al., "Scaling up multi-task robotic reinforcement learning," in Proc. 5th Conf. Robot Learn., 2022, pp. 557–575, arXiv:2104.08212.
[3] S. Harnad, "The symbol grounding problem," Physica D: Nonlinear Phenomena, vol. 42, no. 1/3, pp. 335–346, 1990.
[4] L. P. Kaelbling, "Learning to achieve goals," in Proc. Int. Joint Conf. Artif. Intell., 1993, pp. 1094–1099.
[5] C. Lynch and P. Sermanet, "Language conditioned imitation learning over unstructured data," in Proc. Robot.: Sci. Syst., 2021.
[6] S. Stepputtis, J. Campbell, M. Phielipp, S. Lee, C. Baral, and H. B. Amor, "Language-conditioned imitation learning for robot manipulation tasks," in Proc. 34th Int. Conf. Neural Inf. Process. Syst., 2020, Art. no. 1102.
[7] E. Jang et al., "BC-Z: Zero-shot task generalization with robotic imitation learning," in Proc. 5th Conf. Robot Learn., 2021, pp. 991–1002.
[8] D. I. A. Team et al., "Creating multimodal interactive agents with imitation and self-supervised learning," 2021, arXiv:2112.03763.
[9] S. Nair, E. Mitchell, K. Chen, B. Ichter, S. Savarese, and C. Finn, "Learning language-conditioned robot behavior from offline data and crowd-sourced annotation," in Proc. 5th Conf. Robot Learn., 2021, pp. 1303–1315.
[10] L. Shao, T. Migimatsu, Q. Zhang, K. Yang, and J. Bohg, "Concept2Robot: Learning manipulation concepts from instructions and human demonstrations," in Proc. Robot.: Sci. Syst., 2020.
[11] O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard, "CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks," IEEE Robot. Autom. Lett., vol. 7, no. 3, pp. 7327–7334, Jul. 2022.
[12] M. Andrychowicz et al., "Hindsight experience replay," in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 5055–5065.
[13] S. Tellex, N. Gopalan, H. Kress-Gazit, and C. Matuszek, "Robots that use language," Annu. Rev. Control, Robot. Auton. Syst., vol. 3, pp. 25–55, 2020.
[14] J. Lu, D. Batra, D. Parikh, and S. Lee, "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks," in Proc. 33rd Int. Conf. Neural Inf. Process. Syst., 2019, Art. no. 2.
[15] A. Radford et al., "Learning transferable visual models from natural language supervision," in Proc. Int. Conf. Mach. Learn., 2021, pp. 8748–8763.
[16] T. Winograd, "Understanding natural language," Cogn. Psychol., vol. 3, no. 1, pp. 1–191, 1972.
[17] M. Shridhar and D. Hsu, "Interactive visual grounding of referring expressions for human-robot interaction," in Proc. Robot.: Sci. Syst., 2018.
[18] J. Hatori et al., "Interactively picking real-world objects with unconstrained spoken language instructions," in Proc. IEEE Int. Conf. Robot. Autom., 2018, pp. 3774–3781.
[19] O. Mees and W. Burgard, "Composing pick-and-place tasks by grounding language," in Proc. Int. Symp. Exp. Robot., 2021, pp. 491–501.
[20] W. Liu, C. Paxton, T. Hermans, and D. Fox, "StructFormer: Learning spatial structure for language-guided semantic rearrangement of novel objects," in Proc. IEEE Int. Conf. Robot. Autom., 2022, pp. 6322–6329.
[21] M. Shridhar, L. Manuelli, and D. Fox, "CLIPort: What and where pathways for robotic manipulation," in Proc. 5th Conf. Robot Learn., 2021, pp. 894–906.
[22] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," 2013, arXiv:1312.6114.
[23] J. Borja-Diaz, O. Mees, G. Kalweit, L. Hermann, J. Boedecker, and W. Burgard, "Affordance learning from play for sample-efficient policy learning," in Proc. IEEE Int. Conf. Robot. Autom., 2022, pp. 6372–6378.
[24] A. Vaswani et al., "Attention is all you need," in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6000–6010.
[25] D. Hafner, T. P. Lillicrap, M. Norouzi, and J. Ba, "Mastering Atari with discrete world models," in Proc. Int. Conf. Learn. Representations, 2020.
[26] A. Van Den Oord et al., "Neural discrete representation learning," in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6309–6318.
[27] Y. Bengio, N. Léonard, and A. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," 2013, arXiv:1308.3432.
[28] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, "PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications," 2017, arXiv:1701.05517.
[29] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio, "Generating sentences from a continuous space," 2015, arXiv:1511.06349.
[30] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," in Proc. Conf. Empir. Methods Natural Lang. Process., 2019, pp. 3982–3992.
[31] D. Yarats, R. Fergus, A. Lazaric, and L. Pinto, "Mastering visual continuous control: Improved data-augmented reinforcement learning," 2021, arXiv:2107.09645.
[32] K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu, "MPNet: Masked and permuted pre-training for language understanding," in Proc. 34th Int. Conf. Neural Inf. Process. Syst., 2020, Art. no. 1414.
[33] Y. Liu et al., "RoBERTa: A robustly optimized BERT pretraining approach," 2019, arXiv:1907.11692.
[34] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2018, arXiv:1810.04805.
[35] P. de Haan, D. Jayaraman, and S. Levine, "Causal confusion in imitation learning," in Proc. 33rd Int. Conf. Neural Inf. Process. Syst., 2019, Art. no. 1049.
[36] O. Mees, M. Merklinger, G. Kalweit, and W. Burgard, "Adversarial skill networks: Unsupervised robot skill learning from videos," in Proc. IEEE Int. Conf. Robot. Autom., 2020, pp. 4188–4194.
[37] S. Dasari and A. Gupta, "Transformers for one-shot visual imitation," in Proc. Conf. Robot Learn., 2021, pp. 2071–2084.
