the agent's high-dimensional observation space, their details vary greatly. Moreover, evaluating published methods and their components in language conditioned policy learning is difficult due to incomparable setups or subjective task definitions. In this work we systematically compare, improve, and integrate key components by leveraging the recently proposed CALVIN benchmark [11] to further our understanding and provide a unified framework for long-horizon language conditioned policy learning. We build upon relabeled imitation learning [12] to distill many reusable behaviors into a goal-directed policy, as visualized in Fig. 1. Our approach consists of only standard supervised learning subroutines, and learns perceptual and linguistic understanding, together with task-agnostic control, end-to-end as a single neural network. Our contributions are:
• We systematically compare key components of language conditioned imitation learning over unstructured data, such as observation and action spaces, losses for aligning visuo-lingual representations, language models and latent plan representations, and we analyze the effect of other choices, such as data augmentation and optimization.
• We propose four improvements to these key components: a multimodal transformer encoder that learns to recognize and organize behaviors during robotic interaction into a global categorical latent plan, a hierarchical division of the robot control that learns local policies in the gripper camera frame conditioned on the global plan, balancing of the terms within the KL loss, and a self-supervised contrastive visual-language alignment loss.
• We integrate the best performing improved components in a unified framework, Hierarchical Universal Language Conditioned Policies (HULC). Our model sets a new state of the art on the challenging CALVIN benchmark [11], learning a single 7-DoF policy that can perform long-horizon manipulation tasks in a 3D environment, directly from images, and specified only with natural language.

II. RELATED WORK

Natural language processing has recently received much attention in the field of robotics [13], following the advances made towards learning groundings between vision and language [14], [15] and grounding behaviors in language [16]. Early works have approached instruction following by designing interactive fetching systems that localize objects mentioned in referring expressions [17], [18] or by grounding not only objects, but also spatial relations, to follow language expressions characterizing pick-and-place commands [19]–[21]. Unlike these approaches, we directly learn robotic control from images and natural language instructions, and do not assume any predefined motion primitives.

More recently, end-to-end deep learning has been used to condition agents on natural language instructions [5]–[10], which are then trained under an imitation or reinforcement learning objective. These works have pushed the state of the art and generated a range of ideas for language conditioned policy learning, such as losses for aligning visual observations and language instructions. However, each work evaluates a different combination of ideas and uses different setups or task definitions, making it unclear how individual ideas compare to each other and which ideas combine well together. For example, BC-Z and MIA [7], [8] both use behavior cloning, but with different action spaces and multi-modal alignment losses, such as regressing the language embedding from visual observations [7] or cross-modality matching [8]. Moreover, BC-Z leverages expert trajectories and task labels, and MIA includes mobile navigation, making them difficult to implement directly in CALVIN, which contains unlabeled play data on different tabletop environments. Nair et al. [9] learn a reward classifier that predicts whether a change in state completes a language instruction and leverage it for offline multi-task RL given four camera views. Similar to BC-Z, they rely on discrete task labels and do not focus on solving long-horizon language-specified tasks. Most related to our approach is multi-context imitation learning (MCIL) [5], which also uses relabeled imitation learning to distill reusable behaviors into a goal-reaching policy. Besides different action and observation spaces, these works leverage different language models to encode the raw text instructions into a semantic pre-trained vector space, making it difficult to analyze which language models are best suited for language conditioned policy learning. The ablation studies presented in these papers show that each novel contribution of each work does indeed improve the performance of their model, but due to incomparable setups and evaluation protocols, it is difficult to assess what matters for language conditioned policy learning. Our work addresses this problem by systematically comparing and combining different observation and action spaces, auxiliary losses and latent representations, and by integrating the best performing components in a unified framework.

III. PROBLEM FORMULATION AND METHOD OVERVIEW

We consider the problem of learning a goal-conditioned policy π_θ(a_t | s_t, l) that outputs an action a_t ∈ A, conditioned on the current state s_t ∈ S and a free-form language instruction l ∈ L, under environment dynamics T : S × A → S. We note that the agent does not have access to the true state of the environment, but only to visual observations. In CALVIN [11] the action space A consists of the 7-DoF control of a Franka Emika Panda robot arm with a parallel gripper.

We model the interactive agent with a general-purpose goal-reaching policy based on multi-context imitation learning (MCIL) from play data [5]. To learn from unstructured "play" we assume access to an unsegmented teleoperated play dataset D of semantically meaningful behaviors provided by users, without a set of predefined tasks in mind. To learn control, this long temporal state-action stream D = {(s_t, a_t)}_{t=0}^{∞} is relabeled [12], treating each visited state in the dataset as a "reached goal state," with the preceding states and actions treated as optimal behavior for reaching that goal. Relabeling yields a dataset D_play = {(τ, s_g)_i}_{i=0}^{|D_play|}, where each goal state s_g has a trajectory demonstration τ = {(s_0, a_0), . . .} solving for the goal. These short-horizon, goal image conditioned demonstrations can be fed to a simple maximum likelihood goal conditioned imitation learning objective.
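To make the relabeling step concrete, the following sketch (with illustrative names and a fixed window length; in practice windows of varying length are sampled) shows how an unsegmented play stream could be turned into goal-conditioned demonstrations:

```python
# Minimal sketch of hindsight relabeling of an unsegmented play stream into
# goal-conditioned demonstrations. Names and the fixed window length are
# illustrative only.
def relabel_play(stream, window_len=32):
    """stream: list of (state, action) pairs from teleoperated play.
    Returns (trajectory, goal_state) pairs, treating the last visited state
    of each window as the goal that the preceding states and actions reach."""
    dataset = []
    for start in range(len(stream) - window_len + 1):
        trajectory = stream[start:start + window_len]
        goal_state = trajectory[-1][0]   # last visited state acts as the "reached goal"
        dataset.append((trajectory, goal_state))
    return dataset
```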
Fig. 2. Overview of our architecture to learn language conditioned policies from unstructured data. First the language instructions and the visual observations are
encoded. During training a multimodal transformer encodes sequences of observations to learn to recognize and organize high-level behaviors through a posterior.
Its temporally contextualized features are provided as input to a contrastive visuo-lingual alignment loss. The plan sampler network receives the initial state and
the latent language goal and predicts the distribution over plans for achieving the goal. Both prior and posterior distributions are predicted as a vector of multiple
categorical variables and are trained by minimizing their KL divergence. The local policy network receives the latent language instruction, the gripper camera
observation and the global latent plan to generate a sequence of relative actions in the gripper camera frame to achieve the goal.
several latent categorical variables [25]. The foundation of the Transformer architecture is the scaled dot-product attention function, which enables elements in a sequence to attend to other elements. The attention function receives as input a sequence {x_1, . . ., x_n} and outputs a sequence {y_1, . . ., y_n}. Each input x_i is linearly projected to a query q_i, key k_i, and value v_i. To compute the output y_i, the values are summed with weights that take into account the similarity of the query with its corresponding key. The attention function is defined as Attention(Q, K, V) = softmax(QK^T / √d_k) V, where d_k is the dimension of the keys and queries. The queries, keys, and values are stacked together into matrices Q ∈ R^{n×d_model}, K ∈ R^{n×d_model}, and V ∈ R^{n×d_model}. We encode the sequence of visual observations from both modalities, X_{static,gripper} ∈ R^{T×H×W×3}, with separate perceptual encoders, and concatenate them to form the fused perceptual representation V ∈ R^{T×d} of the sampled demonstration, where T represents the sequence length and d the feature dimension. To enable the sequences to carry temporal information, we add positional embeddings [24] and feed the result into the Multimodal Transformer to learn temporally contextualized global video representations. Finally, inspired by the recent line of work that looks into learning discrete instead of continuous latent codes [25], [26], we represent the latent plans as a vector of multiple categorical latent variables and optimize them using straight-through gradients [27]. Learning discrete representations in the context of language conditioned policies is a natural fit, as language is inherently discrete and images can often be described concisely by language [19]. Furthermore, discrete representations are a natural fit for complex reasoning, planning and predictive learning (e.g., if it is sunny, I will go to the beach).
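As a rough PyTorch sketch of this idea (layer sizes, the pooling choice, and the greedy sampling are illustrative assumptions, not the exact architecture), the fused perceptual sequence could be contextualized by a transformer and summarized as a vector of categorical latents trained with straight-through gradients:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CategoricalPlanEncoder(nn.Module):
    """Sketch: contextualize a fused perceptual sequence V (B, T, d) with a
    transformer encoder and summarize it as 32 categoricals with 32 classes,
    optimized with straight-through gradients. Sizes are illustrative."""
    def __init__(self, d=64, max_len=32, n_cats=32, n_classes=32):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(max_len, d))   # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8,
                                           dim_feedforward=2048, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.to_logits = nn.Linear(d, n_cats * n_classes)
        self.n_cats, self.n_classes = n_cats, n_classes

    def forward(self, v):                                  # v: (B, T, d)
        h = self.transformer(v + self.pos[: v.shape[1]])   # temporally contextualized features
        logits = self.to_logits(h.mean(dim=1))              # pool over time
        logits = logits.view(-1, self.n_cats, self.n_classes)
        probs = F.softmax(logits, dim=-1)
        one_hot = F.one_hot(probs.argmax(-1), self.n_classes).float()
        plan = one_hot + probs - probs.detach()              # straight-through gradient
        return plan, logits
```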
C. Semantic Alignment of Video and Language

Learning to follow language instructions involves addressing a difficult symbol grounding problem [3], i.e., relating a language instruction to a robot's onboard perception and actions. Although instructions and visual observations are aligned in CALVIN, learning to manipulate the colored blocks is a challenging problem, because the robot not only needs to learn a wide variety of diverse behaviors to manipulate the blocks, but also needs to understand which colored block the user is referring to. Thus, the block-related instructions are very similar, with the exception of a word that might disambiguate the instruction by indicating a color. Therefore, most pre-trained language models struggle to learn such semantics from text only, and the policy needs to learn referring expression comprehension via the imitation loss. A number of multi-modal alignment losses have been proposed, such as regressing the language embedding from the visual observation [7] or cross-modality matching [8]. We maximize the cosine similarity between the visual features of sequence i and the corresponding language features while, at the same time, minimizing the cosine similarity between the current visual features and the other language instructions in the same batch. We define our L_contrast loss the same way as the contrastive loss for pairing images and captions in CLIP [15]. However, ideally our model should use the time-dependent representation of the sequence of visual observations in order to capture the meaning of a language instruction, which can be appreciated only after the sequence of actions has been executed for several timesteps. The usage of in-batch negatives enables re-use of computation in both the forward and the backward pass, making training highly efficient. The logits for one batch form an M × M matrix, where each entry is given by logit(x_i, y_j) = cos_sim(x_i, y_j) · exp(τ), ∀(i, j), i, j ∈ {1, 2, . . ., M}, where τ is a trainable temperature parameter. Only entries on the diagonal of the matrix are considered positive examples. The final loss is the sum of the cross-entropy losses along the row and the column directions.
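A minimal sketch of such a symmetric contrastive objective over a batch of M paired visual and language embeddings (variable and function names are illustrative; this mirrors the CLIP-style loss described above):

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(visual_emb, lang_emb, tau):
    """visual_emb, lang_emb: (M, d) paired embeddings; tau: learnable temperature.
    Diagonal entries of the M x M similarity matrix are the positive pairs."""
    v = F.normalize(visual_emb, dim=-1)
    l = F.normalize(lang_emb, dim=-1)
    logits = v @ l.t() * tau.exp()                     # cos_sim(x_i, y_j) * exp(tau)
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_rows = F.cross_entropy(logits, targets)       # video -> language direction
    loss_cols = F.cross_entropy(logits.t(), targets)   # language -> video direction
    return loss_rows + loss_cols
```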
D. Action Decoder

A challenge in learning control from free-form imitation data, in which different ways of executing the same skill are shown, is that a standard unimodal predictor, such as a Gaussian distribution, will average out dissimilar motions. To address this multimodality, we follow the solution proposed by Lynch et al. [1] of discretizing the action space and then parameterizing the policy as a discretized logistic mixture distribution [28], [37]. Each of the predicted k logistic distributions has a separate mean and scale, and they are weighted with α to form the mixture distribution. The imitation loss is the negative log-likelihood under this distribution:

L_act(D_play, V) = −ln( Σ_{i=0}^{k} α_i(V_t) P(a_t; μ_i(V_t), σ_i(V_t)) ),

where P(a_t; μ_i(V_t), σ_i(V_t)) = F((a_t + 0.5 − μ_i(V_t)) / σ_i(V_t)) − F((a_t − 0.5 − μ_i(V_t)) / σ_i(V_t)), and F(·) is the logistic CDF. Additionally, we use a cross-entropy loss to model the binary gripper open/close action.
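For illustration, a simplified per-dimension version of this loss could be written as follows (a sketch that omits the edge-bin handling of [28] and assumes precomputed mixture parameters):

```python
import torch
import torch.nn.functional as F

def discretized_logistic_mixture_nll(a_t, weight_logits, means, log_scales):
    """Negative log-likelihood of a scalar action a_t (B,) under a mixture of k
    discretized logistics with parameters of shape (B, k). Edge bins omitted."""
    a = a_t.unsqueeze(-1)                                      # broadcast against the k components
    inv_scale = torch.exp(-log_scales)
    cdf_plus = torch.sigmoid((a + 0.5 - means) * inv_scale)    # F((a_t + 0.5 - mu_i) / sigma_i)
    cdf_minus = torch.sigmoid((a - 0.5 - means) * inv_scale)   # F((a_t - 0.5 - mu_i) / sigma_i)
    bin_prob = (cdf_plus - cdf_minus).clamp(min=1e-12)
    log_alpha = F.log_softmax(weight_logits, dim=-1)           # mixture weights alpha_i
    return -torch.logsumexp(log_alpha + bin_prob.log(), dim=-1).mean()
```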
E. Optimization and Implementation Details

Our full training objective for the 1% of the total data that is annotated with after-the-fact language instructions is given by L = L_act + β L_KL + λ L_contrast. The windows without annotations are trained with the same imitation learning objective, but the language goals are replaced by the last visual frame of the sampled window, to learn control in a fully self-supervised manner. A common problem in training VAEs is finding the right balance in the weight of the KL loss. A high β value can result in an over-regularized model in which the decoder ignores the latent plans from the prior, also known as "posterior collapse" [29]. On the other hand, setting β too low results in the plan sampler network being unable to catch up and plan over the latent space created by the posterior, and as a result, at test time the plans generated by the plan sampler network will be unfamiliar inputs for the decoder. Orthogonal to this, as the KL loss is bidirectional, we want to avoid regularizing the plans generated by the posterior toward a poorly trained prior. To solve this problem, we minimize the KL loss faster with respect to the prior than the posterior by using different learning rates, α = 0.8 for the prior and 1 − α for the posterior, similar to Hafner et al. [25].
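One way to realize this asymmetric treatment of the KL term, in the spirit of the KL balancing used by Hafner et al. [25], is to mix two KL losses with stopped gradients; the sketch below is illustrative and not necessarily the exact implementation:

```python
import torch.distributions as D

def balanced_kl(posterior_logits, prior_logits, alpha=0.8):
    """KL between categorical posterior and prior, weighted so that the prior is
    pulled toward the (more informed) posterior faster than the posterior is
    regularized toward the prior."""
    post = D.Categorical(logits=posterior_logits)
    prior = D.Categorical(logits=prior_logits)
    post_sg = D.Categorical(logits=posterior_logits.detach())   # stop-gradient posterior
    prior_sg = D.Categorical(logits=prior_logits.detach())      # stop-gradient prior
    kl_train_prior = D.kl_divergence(post_sg, prior).mean()     # only updates the prior
    kl_train_post = D.kl_divergence(post, prior_sg).mean()      # only updates the posterior
    return alpha * kl_train_prior + (1 - alpha) * kl_train_post
```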
We set β = 0.01 and λ = 3 for all experiments and train with the Adam optimizer with a learning rate of 2^{-4}. During training, we randomly sample windows of length between 20 and 32 and pad them to the maximum length of 32. For the latent plan representation we use 32 categoricals with 32 classes each. To better compare the differences between approaches, we use the same convolutional encoders as the MCIL baseline available in CALVIN for processing the images of the static and gripper cameras. Our multimodal transformer encoder has 2 blocks, 8 self-attention heads, and a hidden size of 2048. In order to encode raw text into a semantic pre-trained vector space, we leverage the paraphrase-MiniLM-L3-v2 model [30], which distills a large Transformer based language model and is trained on paraphrase language corpora mainly derived from Wikipedia. It has a vocabulary size of 30,522 words and maps a sentence of any length into a vector of size 384.
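For reference, embeddings of this kind can be obtained with the sentence-transformers library (assuming it is installed; shown here only to illustrate the encoder, not the exact integration into our pipeline):

```python
from sentence_transformers import SentenceTransformer

lang_encoder = SentenceTransformer("paraphrase-MiniLM-L3-v2")
instructions = ["lift the red block", "open the drawer"]   # example instructions
lang_emb = lang_encoder.encode(instructions)               # array of shape (2, 384)
```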
F. Data Augmentation

To aid learning, we apply data augmentation to image observations, both in our method and across all baselines. During training, we apply stochastic image shifts of 0-4 pixels to the gripper camera images and of 0-10 pixels to the static camera images, as in Yarats et al. [31]. Additionally, a bilinear interpolation is applied on top of the shifted image by replacing each pixel with the average of the nearest pixels.
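A sketch of such a random-shift augmentation, in the spirit of the implementation of [31] (padding mode and the square-image assumption are simplifications; the shift magnitude would be set per camera):

```python
import torch
import torch.nn.functional as F

def random_shift(imgs, pad=4):
    """imgs: (B, C, H, W) float tensor with H == W. Pads by `pad` pixels and
    bilinearly resamples at a randomly shifted grid, i.e., a stochastic shift
    of up to `pad` pixels in each direction with bilinear interpolation."""
    b, c, h, w = imgs.shape
    assert h == w, "sketch assumes square images"
    padded = F.pad(imgs, (pad, pad, pad, pad), mode="replicate")
    eps = 1.0 / (h + 2 * pad)
    coords = torch.linspace(-1.0 + eps, 1.0 - eps, h + 2 * pad, device=imgs.device)[:h]
    x, y = torch.meshgrid(coords, coords, indexing="xy")
    base_grid = torch.stack((x, y), dim=-1).unsqueeze(0)           # (1, H, W, 2)
    shift = torch.randint(0, 2 * pad + 1, (b, 1, 1, 2), device=imgs.device).float()
    grid = base_grid + shift * 2.0 * eps                           # random per-image shift
    return F.grid_sample(padded, grid, mode="bilinear", align_corners=False)
```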
V. EXPERIMENTS

We evaluate our model in an extensive comparison and ablation study to determine which components matter for language conditioned imitation learning over unstructured data. We ablate single components of our full approach to study the influence of each component. We then compare our resulting model to the best published methods on the CALVIN benchmark, and show that it outperforms all previous methods.

A. Evaluation Protocol

The goal of the agent in CALVIN is to solve sequences of up to 5 language instructions in a row using only onboard sensors. This setting is very challenging, as it requires agents to be able to transition between different subgoals. CALVIN has a total of 34 different subtasks and evaluates 1000 unique instruction chains. The robot is reset to a neutral position after every sequence to avoid biasing the policies through the robot's initial pose. This neutral initialization breaks the correlation between initial state and task, forcing the agent to rely entirely on language to infer and solve the task. For each subtask in a row, the policy is conditioned on the current subgoal instruction and transitions to the next subgoal only if the agent successfully completes the current task. We perform the ablation studies on the environment D of CALVIN and additionally report numbers of our approach for the other two CALVIN splits, the multi environment and zero-shot multi environment splits. We emphasize that the CALVIN dataset for each of the four environments consists of 6 hours of teleoperated undirected play data that might contain suboptimal behavior. To simulate a real-world scenario, only 1% of that data contains crowd-sourced language annotations.
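Schematically, this chain evaluation can be thought of as the following loop (a pseudocode-level sketch with hypothetical environment and policy interfaces, not the actual CALVIN evaluation code; the step budget is illustrative):

```python
def evaluate_chain(env, policy, instruction_chain, max_steps_per_task=360):
    """Roll out one chain of up to five subgoal instructions; the policy only
    advances to the next instruction once the current subtask is solved.
    Returns the number of consecutively completed subtasks."""
    obs = env.reset()                        # robot starts from the neutral position
    completed = 0
    for instruction in instruction_chain:
        solved = False
        for _ in range(max_steps_per_task):
            action = policy.act(obs, instruction)          # hypothetical interface
            obs, info = env.step(action)                   # hypothetical interface
            if info.get("current_task_solved", False):
                solved = True
                break
        if not solved:
            break                            # the chain stops at the first failure
        completed += 1
    return completed
```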
Fig. 3. Performance of our model on the D environment of the CALVIN Challenge and ablation of the key components, across three seeded runs. All models receive RGB images from both a static and a gripper camera as input.
B. Results and Ablations of Key Components

Observation and Action Spaces: We compare our approach of dividing the robot control learning into generating global contextualized plans and conditioning a local policy, which receives only the observations of the gripper camera, on the global plan against a "No Local Policy" baseline. Unlike our approach, which performs control in the gripper camera frame, the baseline's policy receives both cameras' images and performs control in the robot's base frame, as is usual in most published approaches. We observe in Fig. 3 that, despite the baseline's decoder having more perceptual information, the performance for completing five chains of language instructions sequentially drops from 28.3% to 20.1%. In order to analyze the big performance difference with respect to the original MCIL baseline, we train an MCIL baseline with relative actions and observe that its performance improves significantly over the original MCIL baseline with absolute actions, but remains worse than our models. We speculate that using relative actions with a local policy is easier for the agent to learn than memorizing all the locations where interactions have been performed with global actions and a global observation space. By decoupling the control into a hierarchical structure, we show that performance increases significantly. Additionally, we analyze the influence of using the 7-DoF proprioceptive information as input for both the plan encodings and the conditioning of the policy, as many works report improved performance from it [1], [2], [5]. We observe that the performance drops significantly and the agent relies too much on the robot's initial position, rather than learning to disentangle initial states and tasks. We hypothesize this might be due to a causal confusion between the proprioceptive information and the target actions [35]. We also analyze the effect of modeling the full action space, including the binary gripper action dimension, with the mixture of logistics distribution instead of using the log loss for the open/close gripper action, and observe that the average sequence length drops from 2.64 to 2.45. Finally, we note that applying stochastic image shifts to the input images increases the performance significantly.

Latent Plan Encoding: In our CVAE framework the latent plan represents valid ways of connecting the current state and the goal state and thus frees up the policy to use the entirety of its capacity for learning uni-modal behavior. As language is inherently discrete and discrete representations are a natural fit for complex reasoning and planning, we represent latent plans as a vector of multiple categorical latent variables and optimize them using straight-through gradients [27]. We observe that the performance for the 5-chain evaluation drops from 28.3% to 23.6% when we train our model with a diagonal Gaussian distribution as in MCIL. While it is difficult to judge why categorical latents work better than continuous latent variables, we hypothesize that categorical latents could be a better inductive bias for non-smooth aspects of the CALVIN benchmark, such as when a block is hidden behind the sliding door. Besides, the sparsity level enforced by a categorical distribution could be beneficial for generalization. Additionally, we compare against a goal-conditioned Behavior Cloning (GCBC) baseline [1], which does not condition the policy on a latent plan, and observe that it performs worse than MCIL with relative actions, highlighting the importance of modeling latent behaviors in free-form imitation datasets. We also observe that balancing the KL loss is beneficial in the CVAE training. By scaling up the prior cross entropy relative to the posterior entropy, the agent is encouraged to minimize the KL loss by improving its prior toward the more informed posterior, as opposed to reducing the KL by increasing the posterior entropy. We visualize a t-SNE plot of our learned discrete latent space in Fig. 5 and see that even for unseen
Fig. 4. Performance of our model on the multi environment splits of the CALVIN Challenge across three seeded runs.
anticipate this might lead to better performance, especially in order to ground instructions referring to the colored blocks.

Multi Environment and Zero-Shot Generalization: Finally, we investigate the performance of our approach on the larger multi environment splits of CALVIN in Fig. 4. On the zero-shot split, which consists of training on three environments and testing on an unseen environment with unseen instructions, we observe that despite modest improvements over the MCIL baseline, the policy achieves an average sequence length of just 0.67. We hypothesize that in order to achieve better zero-shot performance, additional techniques from the domain adaptation literature, such as adversarial skill-transfer losses, might be helpful [36]. On the split that trains on all four environments and evaluates on one of them, we observe that HULC benefits from the larger dataset size and sets a new state of the art with an average sequence length of 3.06, which is higher than our best performing model trained and tested on environment D (2.64). The results suggest that increasing the number of collected language pairs aids in addressing the complicated perceptual grounding problem.

VI. CONCLUSION

We have presented a study into what matters in language conditioned robotic imitation learning over unstructured data that systematically analyzes, compares, and improves a set of key components. This study results in a range of novel observations about these components and their interactions, from which we integrate the best components and improvements into a state-of-the-art approach. Our resulting hierarchical HULC model learns a single policy from unstructured imitation data that substantially surpasses the state of the art on the challenging language conditioned long-horizon robot manipulation CALVIN benchmark. We hope it will be useful as a starting point for further research and will bring us closer towards general-purpose robots that can relate human language to their perception and actions.

REFERENCES

[1] C. Lynch et al., "Learning latent plans from play," in Proc. Conf. Robot Learn., 2020, pp. 1113–1132.
[2] D. Kalashnikov et al., "Scaling up multi-task robotic reinforcement learning," in Proc. 5th Conf. Robot Learn., 2022, pp. 557–575, arXiv:2104.08212.
[3] S. Harnad, "The symbol grounding problem," Physica D: Nonlinear Phenomena, vol. 42, no. 1/3, pp. 335–346, 1990.
[4] L. P. Kaelbling, "Learning to achieve goals," in Proc. Int. Joint Conf. Artif. Intell., 1993, pp. 1094–1099.
[5] C. Lynch and P. Sermanet, "Language conditioned imitation learning over unstructured data," in Proc. Robot.: Sci. Syst., 2021.
[6] S. Stepputtis, J. Campbell, M. Phielipp, S. Lee, C. Baral, and H. B. Amor, "Language-conditioned imitation learning for robot manipulation tasks," in Proc. 34th Int. Conf. Neural Inf. Process. Syst., 2020, Art. no. 1102.
[7] E. Jang et al., "BC-Z: Zero-shot task generalization with robotic imitation learning," in Proc. 5th Conf. Robot Learn., 2021, pp. 991–1002.
[8] D. I. A. Team et al., "Creating multimodal interactive agents with imitation and self-supervised learning," 2021, arXiv:2112.03763.
[9] S. Nair, E. Mitchell, K. Chen, B. Ichter, S. Savarese, and C. Finn, "Learning language-conditioned robot behavior from offline data and crowd-sourced annotation," in Proc. 5th Conf. Robot Learn., 2021, pp. 1303–1315.
[10] L. Shao, T. Migimatsu, Q. Zhang, K. Yang, and J. Bohg, "Concept2Robot: Learning manipulation concepts from instructions and human demonstrations," in Proc. Robot.: Sci. Syst., 2020.
[11] O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard, "CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks," IEEE Robot. Autom. Lett., vol. 7, no. 3, pp. 7327–7334, Jul. 2022.
[12] M. Andrychowicz et al., "Hindsight experience replay," in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 5055–5065.
[13] S. Tellex, N. Gopalan, H. Kress-Gazit, and C. Matuszek, "Robots that use language," Annu. Rev. Control, Robot. Auton. Syst., vol. 3, pp. 25–55, 2020.
[14] J. Lu, D. Batra, D. Parikh, and S. Lee, "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks," in Proc. 33rd Int. Conf. Neural Inf. Process. Syst., 2019, Art. no. 2.
[15] A. Radford et al., "Learning transferable visual models from natural language supervision," in Proc. Int. Conf. Mach. Learn., 2021, pp. 8748–8763.
[16] T. Winograd, "Understanding natural language," Cogn. Psychol., vol. 3, no. 1, pp. 1–191, 1972.
[17] M. Shridhar and D. Hsu, "Interactive visual grounding of referring expressions for human-robot interaction," in Proc. Robot.: Sci. Syst., 2018.
[18] J. Hatori et al., "Interactively picking real-world objects with unconstrained spoken language instructions," in Proc. IEEE Int. Conf. Robot. Autom., 2018, pp. 3774–3781.
[19] O. Mees and W. Burgard, "Composing pick-and-place tasks by grounding language," in Proc. Int. Symp. Exp. Robot., 2021, pp. 491–501.
[20] W. Liu, C. Paxton, T. Hermans, and D. Fox, "StructFormer: Learning spatial structure for language-guided semantic rearrangement of novel objects," in Proc. Int. Conf. Robot. Automat., 2022, pp. 6322–6329.
[21] M. Shridhar, L. Manuelli, and D. Fox, "CLIPort: What and where pathways for robotic manipulation," in Proc. 5th Conf. Robot Learn., 2021, pp. 894–906.
[22] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," 2013, arXiv:1312.6114.
[23] J. Borja-Diaz, O. Mees, G. Kalweit, L. Hermann, J. Boedecker, and W. Burgard, "Affordance learning from play for sample-efficient policy learning," in Proc. IEEE Int. Conf. Robot. Autom., 2022, pp. 6372–6378.
[24] A. Vaswani et al., "Attention is all you need," in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6000–6010.
[25] D. Hafner, T. P. Lillicrap, M. Norouzi, and J. Ba, "Mastering Atari with discrete world models," in Proc. Int. Conf. Learn. Representations, 2020.
[26] A. Van Den Oord et al., "Neural discrete representation learning," in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6309–6318.
[27] Y. Bengio, N. Léonard, and A. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," 2013, arXiv:1308.3432.
[28] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, "PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications," 2017, arXiv:1701.05517.
[29] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio, "Generating sentences from a continuous space," 2015, arXiv:1511.06349.
[30] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," in Proc. Conf. Empir. Methods Natural Lang. Process., 2019, pp. 3982–3992.
[31] D. Yarats, R. Fergus, A. Lazaric, and L. Pinto, "Mastering visual continuous control: Improved data-augmented reinforcement learning," 2021, arXiv:2107.09645.
[32] K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu, "MPNet: Masked and permuted pre-training for language understanding," in Proc. 34th Int. Conf. Neural Inf. Process. Syst., 2020, Art. no. 1414.
[33] Y. Liu et al., "RoBERTa: A robustly optimized BERT pretraining approach," 2019, arXiv:1907.11692.
[34] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2018, arXiv:1810.04805.
[35] P. de Haan, D. Jayaraman, and S. Levine, "Causal confusion in imitation learning," in Proc. 33rd Int. Conf. Neural Inf. Process. Syst., 2019, Art. no. 1049.
[36] O. Mees, M. Merklinger, G. Kalweit, and W. Burgard, "Adversarial skill networks: Unsupervised robot skill learning from videos," in Proc. IEEE Int. Conf. Robot. Autom., 2020, pp. 4188–4194.
[37] S. Dasari and A. Gupta, "Transformers for one-shot visual imitation," in Proc. Conf. Robot Learn., 2021, pp. 2071–2084.