
This article has been accepted for publication in IEEE Transactions on Cognitive and Developmental Systems. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/TCDS.2022.3204452. This work is licensed under a Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/).

Language Model-Based Paired Variational Autoencoders for Robotic Language Learning

Ozan Özdemir, Matthias Kerzel, Cornelius Weber, Jae Hee Lee, Stefan Wermter

Abstract—Human infants learn language while interacting with their environment in which their caregivers may describe the objects and actions they perform. Similar to human infants, artificial agents can learn language while interacting with their environment. In this work, first, we present a neural model that bidirectionally binds robot actions and their language descriptions in a simple object manipulation scenario. Building on our previous Paired Variational Autoencoders (PVAE) model, we demonstrate the superiority of the variational autoencoder over standard autoencoders by experimenting with cubes of different colours, and by enabling the production of alternative vocabularies. Additional experiments show that the model's channel-separated visual feature extraction module can cope with objects of different shapes. Next, we introduce PVAE-BERT, which equips the model with a pretrained large-scale language model, i.e., Bidirectional Encoder Representations from Transformers (BERT), enabling the model to go beyond comprehending only the predefined descriptions that the network has been trained on; the recognition of action descriptions generalises to unconstrained natural language as the model becomes capable of understanding unlimited variations of the same descriptions. Our experiments suggest that using a pretrained language model as the language encoder allows our approach to scale up for real-world scenarios with instructions from human users.

Index Terms—language grounding, variational autoencoders, channel separation, pretrained language model, object manipulation

Fig. 1. Our tabletop object manipulation scenario in the simulation environment: the NICO robot is interacting with toy objects. In the left panel, NICO views all the toy objects; on the right, NICO pulls the red house. In both panels, NICO's field of view is given in the top right inset.

O. Özdemir, M. Kerzel, C. Weber, J. H. Lee and S. Wermter are with the Knowledge Technology Group, Department of Informatics, University of Hamburg, Hamburg, Germany (emails: ozan.oezdemir@*, matthias.kerzel@*, cornelius.weber@*, jae.hee.lee@*, stefan.wermter@*; * = uni-hamburg.de).

1 Note that, in the left panel of Fig. 1, we show all the toy objects for visualisation purposes. In all our experiments, there are always only two objects on the table.

I. INTRODUCTION

Humans use language as a means to understand and to be understood by their interlocutors. Although we can communicate effortlessly in our native language, language is a sophisticated form of interaction which requires comprehension and production skills. Understanding language depends also on the context, because words can have multiple meanings and a situation can be explained in many ways. As it is not always possible to describe a situation only in language or understand it only with the medium of language, we benefit from other modalities such as vision and proprioception. Similarly, artificial agents can utilise the concept of embodiment (i.e. acting in the environment) in addition to perception (i.e. using multimodal input like audio and vision) for better comprehension and production of language [5]. Human infants learn language in their environment while their caregivers describe the properties of objects, which they interact with, and actions, which are performed on those objects. In a similar vein, artificial agents can be taught language; different modalities such as audio, touch, proprioception and vision can be employed towards learning language in the environment.


The field of artificial intelligence has recently seen many studies attempting to learn language in an embodied fashion [2], [6], [12], [19], [21]. In this paper, we bidirectionally map language with robot actions by employing three distinct modalities, namely text, proprioception and vision. In our robotic scenario, two objects1 are placed on a table as the NICO (the Neuro-Inspired COmpanion) robot [16] physically interacts with them - see Figure 1. NICO moves objects along the table surface according to given textual descriptions and recognises the actions by translating them to corresponding descriptions. The possibility of bidirectional translation between language and control was realised with a paired recurrent autoencoder (PRAE) architecture by Yamada et al. [30], which aligns the two modalities that are each processed by an autoencoder. We extended this approach (PRAE) with the Paired Variational Autoencoders (PVAE) [32] model, which enriches the language used to describe the actions taken by the robot: instead of mapping a distinct description to each action [30], the PVAE maps multiple descriptions, which are equivalent in meaning, to each action. Hence, we have transcended the strict one-to-one mapping between control and language since our variational autoencoder-based model can associate each robot action with multiple description alternatives. The PVAE is composed of two variational autoencoders (VAEs), one for language, the other for action, and both of them consist of an LSTM (long short-term memory) [14] encoder and decoder, which are suitable for sequential data. The dataset2, which our model is trained with, consists of paired textual descriptions and corresponding joint angle values with egocentric images. The language VAE reconstructs descriptions, whereas the action VAE reconstructs joint angle values that are conditioned on the visual features extracted in advance by the channel-separated convolutional autoencoder (CAE) [32] from egocentric images. The two autoencoders are implicitly bound together with an extra loss term which aligns actions with their corresponding descriptions and separates unrelated actions and descriptions in the hidden vector space.

However, even with multiple descriptions mapped to a robot action as implemented in our previous work [32], replacing each word by its alternative does not lift the grammar restrictions on the language input. In order to process unconstrained language input, we equip the PVAE architecture with the Bidirectional Encoder Representations from Transformers (BERT) language model [8], which has been pretrained on large-scale text corpora, to enable the recognition of unconstrained natural language commands by human users. To this end, we replace the LSTM language encoder with a pretrained BERT model so that the PVAE can recognise different commands that correspond to the same actions as the predefined descriptions given the same object combinations on the table. This new model variant, which we call PVAE-BERT, can handle not only the descriptions it is trained with, but also various descriptions equivalent in meaning with different word order and/or filler words (e.g., 'please', 'could', 'the', etc.), as our analysis shows. We make use of transfer learning by using a pretrained language model, hence benefitting from large unlabelled textual data.

Our contributions can be summarised as follows:

1) In our previous work [32], we showed that variational autoencoders facilitate better one-to-many action-to-language translation and that channel separation in visual feature extraction, i.e., training RGB channels separately, results in more accurate recognition of object colours in our object manipulation scenario. In this follow-up work, we extend our dataset with different shapes and show that our PVAE with the channel separation approach is able to translate from action to language while manipulating different objects.
2) Here, we introduce PVAE-BERT, which, by using pretrained BERT, indicates the potential of our approach to be scaled up for unconstrained instructions from human users.
3) Additional principal component analysis (PCA) shows that language as well as action representation vectors are arranged according to the semantics of the language descriptions.

The remainder of this paper is organised as follows: the next section describes the relevant work, Section 3 presents the architecture of the PVAE and PVAE-BERT models, various experiments and their results are given in Section 4, Section 5 discusses the results and their implications, and the last section concludes the paper with final remarks3.

2 https://www.inf.uni-hamburg.de/en/inst/ab/wtm/research/corpora.html
3 Our code is available at https://github.com/oo222bs/PVAE-BERT.

II. RELATED WORK

The state-of-the-art approaches in embodied language learning mostly rely on tabletop environments [11], [13], [24], [25], [30] or interactive play environments [19] where a robot interacts with various objects according to given instructions. We categorise these approaches into three groups: those that translate from language to action, those that translate from action to language and those that can translate in both directions, i.e., bidirectional approaches. Bidirectional approaches allow greater exploitation of available training data as training in both directions can be interpreted as multitask learning, which ultimately leads to more robust and powerful models independent of the translation direction. By using the maximum amount of shared weights for multiple tasks, such models would be more efficient than independent unidirectional networks in terms of data utilisation and model size.

A. Language-to-Action Translation

Translating from language to action is the most common form in embodied language learning. Hatori et al. [11] introduce a neural network architecture for moving objects given the visual input and language instructions, as their work focuses on the interaction of a human operator with the computational neural system that picks and places miscellaneous items as per verbal commands. In their scenario, many items of different shape and size (e.g. toys, bottles, etc.) are distributed across four bins with many of them being occluded; hence, the scene is very complex and cluttered. Given a pick-and-place instruction from the human operator, the robot first confirms and then executes it if the instruction is clear. Otherwise, the robot asks the human operator to clarify the desired object. The network receives a verbal command from the operator and an RGB image from the environment, and it has separate object recognition and language understanding modules, which are trained jointly to learn the names and attributes of the objects.

Shridhar and Hsu [25] propose a comprehensive system for a robotic arm to pick up objects based on visual and linguistic input. The system consists of multiple modules such as manipulation, perception and a neural network architecture, and is called INGRESS (Interactive Visual Grounding of Referring Expressions). INGRESS is composed of two network streams (self-referential and relational) which are trained on large datasets to generate a definitive expression for each object in the scene based on the input image. The generated expression is compared with the input expression to detect the desired object. INGRESS is therefore responsible for grounding language by learning object names and attributes via manipulation. The approach can resolve ambiguities when it comes to which object to lift by asking confirmation questions to the user.


Shao et al. [24] put forward a robot learning framework, Concept2Robot, for learning manipulation concepts from human video demonstrations in two stages. In the first stage, they use reinforcement learning and, in the second, they utilise imitation learning. The architecture consists of three main parts: a semantic context network, a policy network and action classification. The model receives as input a natural language description for each task alongside an RGB image of the initial scene. In return, it is expected to produce the parameters of a motion trajectory to accomplish the task in the given environment.

Lynch and Sermanet [19] introduce the LangLfP (language learning from play) approach, in which they utilise multicontext imitation to train a single policy based on multiple modalities. Specifically, the policy is trained on both image and language goals, and this enables the approach to follow natural language instructions during evaluation. During training, fewer than 1% of the tasks are labelled with natural language instructions, because it suffices to train the policy for more than 99% of the cases with goal images only. Therefore, only few of the tasks must be labelled with language instructions. Furthermore, they utilise a Transformer-based [27] multilingual language encoder, the Multilingual Universal Sentence Encoder [31], to encode linguistic input so that the system can handle unseen language input like synonyms and instructions in 16 different languages.

The language-to-action translation methods are designed to act upon a given language input as in textual or verbal commands. They can recognise commands and execute the desired actions. However, they cannot describe the actions that they perform.

B. Action-to-Language Translation

Another class of approaches in embodied language learning translates action into language. Heinrich et al. [13] introduce an embodied crossmodal neurocognitive architecture, the adaptive multiple timescale recurrent neural network (adaptive MTRNN), which enables the robot to acquire language by listening to commands while interacting with objects in a playground environment. The approach has auditory, sensorimotor and visual perception capabilities. Since neurons at multiple timescales facilitate the emergence of hierarchical representations, the results indicate good generalisation and hierarchical concept decomposition within the network.

Eisermann et al. [9] study the problem of compositional generalisation, in which they conduct numerous experiments on a tabletop scenario where a robotic arm manipulates various objects. They utilise a simple LSTM-based network to describe the actions performed on the objects in hindsight - the model accepts visual and proprioceptive input and produces textual descriptions. Their results show that with the inclusion of proprioception as input and using more data in training, the network's performance on compositional generalisation improves significantly.

Similar to the language-to-action translation methods, the action-to-language translation methods work only in one direction: they describe the actions they perform in the environment. However, they are unable to execute a desired action given by the human user. Nevertheless, from the robotics perspective, it is desirable to have models that can also translate from action to language and not just execute verbal commands; such robots can explain their actions by verbalising an ongoing action, which also paves the way for more interpretable systems.

C. Bidirectional Translation

Very few embodied language learning approaches are capable of flexibly translating in both directions, hence, bidirectionally. While unidirectional approaches are feasible for smaller datasets, we aim to research architectures that can serve as large-scale multimodal foundation models and solve multiple tasks in different modalities. By generating a discrete set of words, bidirectional models can also provide feedback to a user about the information contained within their continuous variables. By providing rich language descriptions, rather than only performing actions, such models can contribute to explainable AI (XAI) for non-experts. For a comprehensive overview of the field of XAI, readers can refer to the survey paper by Adadi and Berrada [1].

In one of the early examples of bidirectional translation, Ogata et al. [22] present a model that is aimed at articulation and allocation of arm movements by using a parametric bias to bind motion and language. The method enables the robot to move its arms according to given sentences and to generate sentences according to given arm motions. The model shows generalisation towards motions and sentences that it has not been trained with. However, it fails to handle complex sentences.

Antunes et al. [3] introduce the multiple timescale long short-term memory (MT-LSTM) model in which the slowest layer establishes a bidirectional connection between action and language. The MT-LSTM consists of two components, namely the language and action streams, each of which is divided into three layers with varying timescales. The two components are bound by a slower meaning layer that allows translation from action to language and vice versa. The approach shows limited generalisation capabilities.


Yamada et al. [30] propose the paired recurrent autoencoder (PRAE) architecture, which consists of two autoencoders, namely action and description. The action autoencoder takes as input joint angle trajectories with visual features and is expected to reconstruct the original joint angle trajectories. The description autoencoder, on the other hand, reads and then reconstructs the action descriptions. The dataset that the model is trained on consists of pairs of simple robot actions and their textual descriptions, e.g., 'pushing away the blue cube'. The model is trained end-to-end, with both autoencoders reconstructing language and action, whilst there is no explicit neural connection between the two. The crossmodal pairing between action and description autoencoders is supplied with a loss term that aligns the hidden representations of paired actions and descriptions. The binding loss allows the PRAE to execute actions given instructions as well as translate actions to descriptions. As a bidirectional approach, the PRAE is biologically plausible to some extent, since humans can easily execute given commands and also describe these actions linguistically. To imitate human-like language recognition and production, bidirectionality is essential. However, due to its use of standard autoencoders, the PRAE can only bind a robot action with a particular description in a one-to-one way, although actions can be expressed in different ways. In order to map each robot action to multiple description alternatives, we have proposed the PVAE (paired variational autoencoders) approach [32], which utilises variational autoencoders (VAEs) to randomise the latent representation space and thereby allows one-to-many translation between action and language. A recent review by Marino [20] highlights similarities between VAEs and predictive coding from neuroscience in terms of model formulations and inference approaches.

This work is an extension of the ICDL article "Embodied Language Learning with Paired Variational Autoencoders" [32]. Inspired by the TransferLangLfP paradigm by Lynch and Sermanet [19], we propose to use the PVAE with a pretrained BERT language model [8] in order to enable the model to comprehend unconstrained language instructions from human users. Furthermore, we conduct experiments using PVAE-BERT on our dataset for various use cases and analyse the internal representations for the first time.

III. PROPOSED METHODS: PVAE & PVAE-BERT

As can be seen in Figure 2, the PVAE model consists of two variational autoencoders: a language VAE and an action VAE. The former learns to generate descriptions matching original descriptions, whilst the latter learns to reconstruct joint angle values conditioned on the visual input. The two autoencoders do not have any explicit neural connection between them, but instead they are implicitly aligned by the binding loss, which brings the two autoencoders closer to each other in the latent space over the course of learning by reducing the distance between the two latent variables. First, the action and language encoders map the input to the latent code, i.e., the language encoder accepts one-hot encoded descriptions word by word as input and produces the encoded descriptions, whereas the action encoder accepts corresponding arm trajectories and visual features as input and produces the encoded actions. Next, the encoded representations are used to extract latent representations by randomly sampling from a Gaussian distribution separately for the language and action modalities. Finally, from the latent representations, the language and action decoders reconstruct the descriptions and joint angle values, respectively.

Our model is a bidirectional approach, i.e., after training, translation is possible in both directions, action-to-language and language-to-action. The PVAE model transforms robot actions to descriptions in a one-to-many fashion by appropriately randomising the latent space. PVAE-BERT additionally handles variety in language input by using pretrained BERT as the language encoder module. As part of the action encoder, the visual input features are extracted in advance using a channel-separated CAE (short for convolutional autoencoder), which improves the ability of the approach to distinguish the colours of cubes. The details of each model component are given in the following subsections.

A. Language Variational Autoencoder

The language VAE accepts as input a one-hot encoded matrix of a description word by word in the case of the PVAE, or the complete description altogether for PVAE-BERT, and, for both the PVAE and PVAE-BERT, it is responsible for reproducing the original description. It consists of an encoder, a decoder and latent layers (in the bottleneck) where latent representations are extracted via sampling. For the PVAE, the language encoder embeds a description of length N, (x_1, x_2, ..., x_N), into two fixed-dimensional vectors z_{mean} and z_{var} as follows:

h_t^{enc}, c_t^{enc} = \mathrm{LSTM}(x_t, h_{t-1}^{enc}, c_{t-1}^{enc}) \quad (1 \le t \le N),
z_{mean} = W_{mean} \cdot h_N^{enc} + b_{mean}^{enc},
z_{var} = W_{var} \cdot h_N^{enc} + b_{var}^{enc},
z_{lang} = z_{mean} + z_{var} \cdot \mathcal{N}(\mu, \sigma^2),

where h_t and c_t are the hidden and cell state of the LSTM at time step t, respectively, and \mathcal{N} is a Gaussian distribution. h_0 and c_0 are set as zero vectors, while \mu and \sigma are 0 and 0.1, respectively. z_{lang} is the latent representation of a description. LSTM here, and in the following, is a peephole LSTM [23] following the implementation of Yamada et al. [30]. The language input is represented in one-hot encoded matrices, whose rows represent the sequence of input words and whose columns represent every word that is in the vocabulary. In each row, only one cell is 1 and the rest are 0, which determines the word that is given to the model at that time step.

For PVAE-BERT, we replace the LSTM language encoder with the pretrained BERT-base model and, following the implementation by Devlin et al. [8], tokenise the descriptions accordingly with the subword-based tokeniser WordPiece [29].

The language decoder generates a sequence by recursively expanding z_{lang}:

h_0^{dec}, c_0^{dec} = W^{dec} \cdot z_{lang} + b^{dec},
h_t^{dec}, c_t^{dec} = \mathrm{LSTM}(y_{t-1}, h_{t-1}^{dec}, c_{t-1}^{dec}) \quad (1 \le t \le N-1),
y_t = \mathrm{soft}(W^{out} \cdot h_t^{dec} + b^{out}) \quad (1 \le t \le N-1),

where soft denotes the softmax activation function. y_0 is the first symbol indicating the beginning of the sentence, hence the <BOS> tag.
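As an illustration, a minimal PyTorch-style sketch of the language encoder and its sampling step is given below. It simplifies the peephole LSTM to a standard LSTM and assumes a latent dimensionality of 10, so it should be read as a sketch of the idea under these assumptions rather than the exact implementation.

```python
# Minimal sketch of the PVAE language encoder with the sampling step.
# Assumptions: standard nn.LSTM instead of the peephole LSTM, latent size of 10.
import torch
import torch.nn as nn

class LanguageEncoder(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int = 100, latent_size: int = 10):
        super().__init__()
        self.lstm = nn.LSTM(vocab_size, hidden_size, batch_first=True)
        self.to_mean = nn.Linear(hidden_size, latent_size)  # W_mean, b_mean
        self.to_var = nn.Linear(hidden_size, latent_size)   # W_var, b_var

    def forward(self, one_hot_words: torch.Tensor) -> torch.Tensor:
        # one_hot_words: (batch, N, vocab_size) float tensor, one row per word
        _, (h_n, _) = self.lstm(one_hot_words)   # h_n: (1, batch, hidden_size)
        h_last = h_n[-1]                         # hidden state after the last word
        z_mean = self.to_mean(h_last)
        z_var = self.to_var(h_last)
        eps = torch.normal(mean=0.0, std=0.1, size=z_mean.shape)  # N(0, 0.1^2)
        return z_mean + z_var * eps              # latent description code z_lang
```

The action encoder described next follows the same pattern, with joint angles and visual features as input instead of one-hot word vectors.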


B. Action Variational Autoencoder

The action VAE accepts a sequence of joint angle values and visual features as input, and it is responsible for reconstructing the joint angle values. Similar to the language VAE, it is composed of an encoder, a decoder and latent layers (in the bottleneck) where latent representations are extracted via sampling. The action encoder encodes a sequence of length M, ((j_1, v_1), (j_2, v_2), ..., (j_M, v_M)), which includes the concatenation of joint angles j and visual features v. Note that the visual features are extracted by the channel-separated convolutional autoencoder beforehand. The equations that define the action encoder are as follows4:

h_t^{enc}, c_t^{enc} = \mathrm{LSTM}(v_t, j_t, h_{t-1}^{enc}, c_{t-1}^{enc}) \quad (1 \le t \le M),
z_{mean} = W_{mean} \cdot h_M^{enc} + b_{mean}^{enc},
z_{var} = W_{var} \cdot h_M^{enc} + b_{var}^{enc},
z_{act} = z_{mean} + z_{var} \cdot \mathcal{N}(\mu, \sigma^2),

where h_t and c_t are the hidden and cell state of the LSTM at time step t, respectively, and \mathcal{N} is a Gaussian distribution. h_0 and c_0 are set as zero vectors, while \mu and \sigma are set as 0 and 0.1, respectively. z_{act} is the latent representation of a robot action. The action decoder reconstructs the joint angles:

h_0^{dec}, c_0^{dec} = W^{dec} \cdot z_{act} + b^{dec},
h_t^{dec}, c_t^{dec} = \mathrm{LSTM}(v_t, \hat{\jmath}_t, h_{t-1}^{dec}, c_{t-1}^{dec}) \quad (1 \le t \le M-1),
\hat{\jmath}_{t+1} = \tanh(W^{out} \cdot h_t^{dec} + b^{out}) \quad (1 \le t \le M-1),

where tanh denotes the hyperbolic tangent activation function and \hat{\jmath}_1 is equal to j_1, i.e. the joint angle values at the initial time step.

4 For the sake of clarity, we use mostly the same symbols in the equations as in the equations of the language VAE.

Fig. 2. The architecture of the proposed PVAE and PVAE-BERT models: the language VAE (blue rectangles) processes descriptions, whilst the action VAE (orange rectangles) processes joint angles and images at each time step. The input to the language VAE is the given description x, whereas the action VAE takes as input joint angle values j and visual features v. The two VAEs are implicitly bound via a binding loss in the latent representation space. The image from which v_1 is extracted is magnified for visualisation purposes. <BOS> and <EOS> stand for the beginning of sentence and end of sentence tags, respectively. The two models differ only by the language encoder employed: the PVAE uses an LSTM, whereas PVAE-BERT uses a pretrained BERT model.

C. Visual Feature Extraction

We utilise a convolutional autoencoder architecture, following Yamada et al. [30], to extract the visual features of the images. Different from the approach used in [30], we change the number of input channels the model accepts from three to one and train an instance of the CAE for each colour channel (red, green and blue) to recognise different colours more accurately: channel separation. Therefore, we call our visual feature extractor the channel-separated CAE. The idea behind the channel-separated CAE is similar to depthwise separable convolutions [7], where completely separating cross-channel convolutions from spatial convolutions leads to better results in image classification as the network parameters are used more efficiently. The channel-separated CAE accepts one colour channel of the 120 × 160 RGB images captured by the cameras in the eyes of NICO - also referred to as the egocentric view of the robot - at a time. As can be seen in detail in Table I, it consists of a convolutional encoder, a fully-connected bottleneck (incorporating the hidden representations) and a deconvolutional decoder. After training for each colour channel, we extract the visual features of each image for every channel from the middle layer in the bottleneck (FC 3). The visual features extracted from each channel are then concatenated to make up the ultimate visual features v.

TABLE I
DETAILED ARCHITECTURE OF THE CHANNEL-SEPARATED CAE

Block      | Layer    | Out Chan. | Kernel Size | Stride | Padding | Activation
Encoder    | Conv 1   | 8         | 4x4         | 2      | 1       | ReLU
           | Conv 2   | 16        | 4x4         | 2      | 1       | ReLU
           | Conv 3   | 32        | 4x4         | 2      | 1       | ReLU
           | Conv 4   | 64        | 8x8         | 5      | 2       | ReLU
Bottleneck | FC 1     | 384       | -           | -      | -       | -
           | FC 2     | 192       | -           | -      | -       | -
           | FC 3     | 10        | -           | -      | -       | -
           | FC 4     | 192       | -           | -      | -       | -
           | FC 5     | 384       | -           | -      | -       | -
Decoder    | Deconv 1 | 32        | 8x8         | 5      | 2       | ReLU
           | Deconv 2 | 16        | 4x4         | 2      | 1       | ReLU
           | Deconv 3 | 8         | 4x4         | 2      | 1       | ReLU
           | Deconv 4 | 1         | 4x4         | 2      | 1       | Sigmoid

Channel separation increases the use of computational resources compared to the standard convolution approach, because it essentially uses three separate models: even though they are identical, they do not share weights. The number of model parameters is about three times that of the standard approach. Therefore, it requires roughly three times more computational power than the standard approach. Nonetheless, channel separation excels at distinguishing the object colours.


D. Sampling and Binding

Stochastic Gradient Variational Bayes-based sampling (SGVB) [18] enables one-to-many mapping between action and language. The two VAEs have identical random sampling procedures. After producing the latent variables z_{mean} and z_{var} via the fully connected layers, we utilise a normal distribution \mathcal{N}(\mu, \sigma^2) to derive random values \epsilon, which are, in turn, used with z_{mean} and z_{var} to arrive at the latent representation z, which is also known as the reparameterisation trick [18]:

z = z_{mean} + z_{var} \cdot \epsilon,

where \epsilon is the approximation of \mathcal{N}(0, 0.01).

As in the case of [30], to align the latent representations of robot actions and their descriptions, we use an extra loss term that brings the mean hidden features, z_{mean}, of the two VAEs closer to each other. This enables bidirectional translation between action and language, i.e., the network can transform actions to descriptions as well as descriptions to actions after training, without an explicit fusion of the two modalities. This loss term (binding loss) can be calculated as follows:

L_{binding} = \sum_{i}^{B} \psi(z_{mean_i}^{lang}, z_{mean_i}^{act}) + \sum_{i}^{B} \sum_{j \neq i} \max\{0,\ \Delta + \psi(z_{mean_i}^{lang}, z_{mean_i}^{act}) - \psi(z_{mean_j}^{lang}, z_{mean_i}^{act})\},

where B stands for the batch size and \psi is the Euclidean distance. The first term in the equation binds the paired instructions and actions, whereas the second term separates unpaired actions and descriptions. The hyperparameter \Delta is used to adjust the separation margin for the second term - the higher it is, the further apart the unpaired actions and descriptions are pushed in the latent space.

Different multimodal fusion techniques like the Gated Multimodal Unit (GMU) [4], which uses gating and multiplicative mechanisms to fuse different modalities, and CentralNet [28], which fuses information by having a separate network for each modality as well as central joint representations at each layer, were also considered during our work. However, since our model is bidirectional (it must work in both the action-to-language and language-to-action directions) and must work with either language or action input during inference (both GMU and CentralNet require all of the modalities to be available), we opted for the binding loss for multimodal integration.

E. Loss Function

The overall loss is calculated as the sum of the reconstruction, regularisation and binding losses. The binding loss is calculated for both VAEs jointly. In contrast, the reconstruction and regularisation losses are calculated independently for each VAE. Following [30], the reconstruction losses for the language VAE (cross entropy between input and output words) and the action VAE (Euclidean distance between original and generated joint values) are L_{lang} and L_{act}, respectively:

L_{lang} = -\frac{1}{N-1} \sum_{t=1}^{N-1} \sum_{i=0}^{V-1} x_{t+1}^{[i]} \log y_t^{[i]},
L_{act} = \frac{1}{M-1} \sum_{t=1}^{M-1} \| j_{t+1} - \hat{\jmath}_{t+1} \|_2^2,

where V is the vocabulary size, N is the number of words per description and M is the sequence length of an action trajectory. The regularisation loss is specific to variational autoencoders; it is defined as the Kullback–Leibler divergence for language, D_{KL}^{lang}, and action, D_{KL}^{act}. Therefore, the overall loss function is as follows:

L_{all} = \alpha L_{lang} + \beta L_{act} + \gamma L_{binding} + \alpha D_{KL}^{lang} + \beta D_{KL}^{act},

where \alpha, \beta and \gamma are weighting factors for the different terms in the loss function. In our experiments, \alpha and \beta are set to 1, whilst \gamma is set to 2 in order to sufficiently bind the two modalities.

F. Transformer-Based Language Encoder

In order for the model to understand unconstrained language input from non-expert human users, we replace the LSTM language encoder with a pretrained BERT-base language model [8] - see Figure 2. According to [8], BERT is pretrained on the BooksCorpus, which involves 800 million words, and English Wikipedia, which involves 2.5 billion words. With the introduction of BERT as the language encoder, we assume that BERT can interpret action descriptions correctly in our scenario. However, since language models like BERT are pretrained exclusively on textual data from the internet, they are not specialised for object manipulation environments like ours. Therefore, the embedding of an instruction like 'push the blue object' may not differ significantly from the embedding of another such as 'push the red object'. For this reason, we finetune the pretrained BERT-base, i.e. all of BERT's parameters are updated during the end-to-end training of PVAE-BERT, so that it can separate similar instructions from each other, which is critical for our scenario.
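A minimal sketch of such a BERT-based language encoder is given below, using the Hugging Face transformers library. The use of the uncased BERT-base checkpoint and of the pooled [CLS] output, as well as the latent size, are assumptions for illustration and may differ from the released implementation.

```python
# Sketch of the PVAE-BERT language encoder: BERT-base embeds the whole description,
# and two linear layers map the sentence embedding to z_mean and z_var.
# Assumptions: 'bert-base-uncased' checkpoint, pooled [CLS] output, latent size of 10.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertLanguageEncoder(nn.Module):
    def __init__(self, latent_size: int = 10):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # finetuned end-to-end
        self.to_mean = nn.Linear(self.bert.config.hidden_size, latent_size)
        self.to_var = nn.Linear(self.bert.config.hidden_size, latent_size)

    def forward(self, descriptions):
        # descriptions: a list of strings, e.g. ['please push the blue cube slowly']
        batch = self.tokenizer(descriptions, padding=True, return_tensors="pt")
        sentence_emb = self.bert(**batch).pooler_output      # (batch, 768)
        z_mean = self.to_mean(sentence_emb)
        z_var = self.to_var(sentence_emb)
        eps = torch.normal(mean=0.0, std=0.1, size=z_mean.shape)
        return z_mean + z_var * eps                          # latent code bound to z_act
```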


G. Training Details

To train the PVAE and PVAE-BERT, we first extract visual features using our channel-separated CAE. The visual features are used to condition the actions depending on the cube arrangement, i.e., the execution of a description depends also on the position of the target cube. For both the PVAE and PVAE-BERT, the action encoder and action decoder are each a two-layer LSTM with a hidden size of 100, whilst the language decoder is a single-layer LSTM with the same hidden size. In contrast, the language encoder of PVAE-BERT is the pretrained BERT-base model with 12 layers, each with 12 self-attention heads and a hidden size of 768, whereas the language encoder of the PVAE is a one-layer LSTM with a hidden size of 100. Both the PVAE and PVAE-BERT are trained end-to-end with both the language and action VAEs together. The PVAE and PVAE-BERT are trained for 20,000 and 40,000 iterations, respectively, with the gradient descent algorithm and the Adam optimiser [17]. We take the learning rate as 10^{-4} with a batch size of 100 pairs of language and action sequences, after a few trials with different learning rates and batch sizes. Due to having approximately 110M parameters, compared with the PVAE's approximately 465K parameters, an iteration of PVAE-BERT training takes about 1.4 times longer than an iteration of PVAE training. Therefore, it takes about 2.8 times longer to train PVAE-BERT in total.

IV. EVALUATION AND RESULTS

We evaluate the performance of our PVAE and its variant using BERT, namely PVAE-BERT, with multiple experiments. First, we compare the original PVAE with the PRAE [30] in terms of action-to-language translation by conducting experiments with varying object colour options to display the superiority of variational autoencoders over regular autoencoders and the advantage of using the channel separation technique in visual feature extraction. Different object colour possibilities correspond to a different corpus and overall dataset size; the more object colour options there are, the larger both the vocabulary and the overall dataset become. Therefore, with these experiments, we also test the scalability of both approaches. In order to show the impact of channel separation on the action-to-language translation performance, we train our architecture with visual features provided by a regular CAE (no channel separation) as implemented in [30]. These are Experiment 1a (with 3 cube colour alternatives: red, green, blue) and Experiment 1b (with 6 cube colour alternatives: red, green, blue, yellow, cyan, violet) - see Table III.

Moreover, in Experiment 2, we train PVAE-BERT on the dataset with 6 colour alternatives (red, green, blue, yellow, cyan, violet) to compare it with the standard PVAE by conducting action-to-language, language-to-language and language-to-action evaluation experiments. This experiment uses the pretrained BERT as the language encoder, which is then finetuned with the rest of the model during training.

In Experiments 1a, 1b and 2, two cubes of different colours are placed on a table at which the robot is seated to interact with them. The words (vocabulary) that constitute the descriptions are given in Table II. We introduce a more diverse vocabulary by adding an alternative word for each word in the original vocabulary. As descriptions are composed of 3 words with two alternatives per word, we arrive at 8 variations for each description of a given meaning. Table II does not include nouns, because we use a predefined grammar, which does not involve a noun, and the same size cubes for these experiments.

TABLE II
VOCABULARY

       | Original | Alternative
Verb   | push     | move-up
       | pull     | move-down
       | slide    | move-sideways
Colour | red      | scarlet
       | green    | harlequin
       | blue     | azure
       | yellow   | blonde
       | cyan     | greenish-blue
       | violet   | purple
Speed  | slowly   | unhurriedly
       | fast     | quickly

For each cube arrangement, the colours of the two cubes always differ to avoid ambiguities in the language description. Actions, which are transcribed in capitals, are composed of any of the three action types PUSH, PULL, SLIDE, two positions LEFT, RIGHT and two speed settings SLOWLY, FAST, resulting in 12 possible actions (3 action types × 2 positions × 2 speeds); e.g., PUSH-LEFT-SLOWLY means pushing the left object slowly. Every sentence is composed of three words (excluding the <BOS/EOS> tags, which denote the beginning or end of a sentence), with the first word indicating the action, the second the cube colour and the last the speed at which the action is performed (e.g., 'push green slowly'). Therefore, without the alternative words, there are 18 possible sentences (3 action verbs × 3 colours × 2 adverbs) for Experiment 1a, whereas, for Experiments 1b and 2, the number of sentences is 36, as 6 cube colours are used in both experiments. As a result, our dataset consists of 6 cube arrangements for Experiment 1a (3 colour alternatives, and the colours of the two cubes on the table never match), 12 cube arrangements for Experiments 1b and 2 (3 secondary colours are used in addition to the 3 primary colours, and secondary and primary colours are mutually exclusive), 18 × 8 = 144 possible sentences for Experiment 1a and 36 × 8 = 288 possible sentences for Experiments 1b and 2 with the alternative vocabulary (consult Table II) - the factor of 8 arises because of the eight alternatives per sentence. We have 72 patterns (action-description-arrangement combinations) for Experiment 1a (12 actions with six cube arrangements each) and 144 patterns for Experiments 1b and 2. Following Yamada et al. [30], we choose the patterns rigorously to ensure that combinations of action, description and cube arrangement used in the test set are excluded from the training set, although the training set includes all possible combinations of action, description and cube arrangement that are not in the test set. For Experiment 1a, 54 patterns are used for training while the remaining 18 are used for testing (for Experiments 1b and 2: 108 for training, 36 for testing). Each pattern is collected six times in the simulation with random variations on the action execution, resulting in different joint trajectories. We also use 4-fold cross-validation to provide more reliable results (consult Table III) for Experiment 1.

Experiment 1c tests for different shapes, other than cubes: we perform the same actions on toy objects, which are a car, duck, cup, glass, house and lego brick. For testing the shape processing capability of the model, all objects are of the same colour, namely yellow. Analogous to the other experiments, two objects of different shapes are placed on the table. We keep the actions as they are but replace the colours with object names in the descriptions. Before we extract the visual features from the new images, we train both the regular CAE and the channel-separated CAE with them. Similar to Experiments 1a and 1b, we experiment with three methods: PRAE with standard CAE, PVAE with standard CAE and PVAE with channel-separated CAE.
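Returning to the description variants used across these experiments, the following sketch shows how the eight meaning-equivalent descriptions of one example action can be enumerated from the original/alternative word pairs of Table II; the word lists are taken from the table, while the enumeration itself is only an illustrative assumption about how the corpus could be constructed.

```python
# Sketch: enumerate the eight meaning-equivalent descriptions for one example action,
# using one original/alternative word pair per slot from Table II.
from itertools import product

verb = ["push", "move-up"]           # original word and its alternative
colour = ["red", "scarlet"]
speed = ["slowly", "unhurriedly"]

descriptions = [" ".join(words) for words in product(verb, colour, speed)]
# 2 x 2 x 2 = 8 variants, e.g. 'push red slowly', 'move-up scarlet unhurriedly', ...
print(len(descriptions))  # 8
```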


We use NICO (Neuro-Inspired COmpanion) [15], [16] in a virtual environment created with Blender5 for our experiments - see Figure 1. NICO is a humanoid robot with a height of approximately one metre and a weight of approximately 20 kg. The left arm of NICO, which utilises 5 joints, is used to interact with the objects. Actions are realised using the inverse kinematics solver provided by the simulation environment: for each action, first, the starting point and endpoint are adjusted manually; then, Gaussian deviation is applied around the starting point and endpoint to generate the variations of the action, ensuring that there is a slight difference in the overall trajectory. NICO has a camera in each of its eyes, which is used to extract egocentric visual images.

5 https://www.blender.org/

A. Experiment 1

We use the same actions as in [30], such as PUSH-RIGHT-SLOWLY. We use three colour options for the cubes as in [30] for Experiment 1a, but six colours for Experiment 1b. However, we extend the descriptions in [30] by adding an alternative for each word in the original vocabulary. Hence, the vocabulary size of 9 is extended to 17 for Experiment 1a and the vocabulary size of 11 is extended to 23 for Experiment 1b - note that we do not add an alternative for the <BOS/EOS> tags. Since every sentence consists of three words, we extend the number of sentences by a factor of eight (2^3 = 8).

After training the PVAE and the PRAE on the same training set, we test them for action-to-language translation. We consider only those produced descriptions in which all three words and the <EOS> tag are correctly predicted as correct. The produced descriptions that have one or more incorrect words are considered as false translations. As each description has seven more alternatives, predicting any of the eight description alternatives is considered correct.

TABLE III
ACTION-TO-LANGUAGE TRANSLATION ACCURACIES AT SENTENCE LEVEL

Method                       | Experiment 1a (3 colours)        | Experiment 1b (6 colours)        | Experiment 1c (6 shapes)
                             | Training        | Test           | Training        | Test           | Training        | Test
PRAE + regular CAE           | 33.33 ± 1.31%   | 33.56 ± 3.03%  | 33.64 ± 1.13%   | 33.3 ± 0.98%   | 68.36 ± 2.12%   | 65.28 ± 2.45%
PVAE + regular CAE           | 66.6 ± 1.31%    | 65.28 ± 6.05%  | 69.60 ± 0.46%   | 61.57 ± 2.01%  | 80.71 ± 1.41%   | 73.15 ± 1.87%
PVAE + channel-separated CAE | 100.00 ± 0.00%  | 90.28 ± 4.61%  | 100.00 ± 0.00%  | 100.00 ± 0.00% | 95.99 ± 3.74%   | 92.13 ± 2.83%

For Experiment 1a, our model is able to translate approximately 90% of the patterns in the test set, whilst the PRAE could translate only one third of the patterns, as can be seen in Table III. We can, thus, say that our model outperforms the PRAE in one-to-many mapping. We also test the impact of channel separation on the translation accuracy by training our model with visual features extracted with the regular CAE as described in Yamada et al.'s approach [30]. It is clearly indicated in Table III that using variational autoencoders instead of standard ones increases the accuracy significantly. Using the PVAE with the channel-separated CAE improves the results further, indicating the superiority of channel separation in our tabletop scenario. Therefore, our approach with variational autoencoders and a channel-separated CAE is superior to both the PRAE and the PVAE with regular visual feature extraction.

In Experiment 1b, in order to test the limits of our PVAE and the impact of more data with a larger corpus, we add three more colour options for the cubes: yellow, cyan and violet. These secondary colours are combined amongst themselves for the arrangements, in addition to the colour combinations used in the first experiment, i.e., a cube of a primary colour and a cube of a secondary colour do not co-occur. Therefore, this experiment has 12 arrangements. Moreover, the vocabulary size is extended from 17 to 23 in Experiment 1b (two alternative words for each colour - see Table II). As in Experiment 1a, each sentence has eight alternative ways to be described.

We train both the PVAE and the PRAE [30] on the extended dataset from scratch and test both architectures. As shown in Table III (Experiment 1b), the PVAE reaches 100% accuracy by translating every pattern from action to description correctly, even for the test set. In contrast, the PRAE performs poorly in this setting and manages to translate only one third of the descriptions correctly in the test set. Compared with the accuracy values reached in the first experiment with less data and a smaller corpus, the extension of the dataset helps the PVAE to perform better in translation, whereas the PRAE is not able to take advantage of more data. Similar to Experiment 1a, we also test the influence of channel separation on the translation accuracy by training the PVAE with visual features provided by a regular CAE. In this setting, the PVAE only achieves around 61% accuracy on the test set. This highlights once again the importance of channel separation in visual feature extraction for our setup. Whilst the improvement by using our PVAE over the PRAE is significant, further improvement is made by utilising the channel-separated CAE.

In addition, as the results in the last column of Table III (Experiment 1c) show, our PVAE with channel separation in visual feature extraction outperforms the other methods even when the manipulated objects have different shapes. Although there is a slight drop in action-to-language translation performance, it is clear that the PVAE with the channel-separated CAE is able to handle different-shaped objects. The PRAE model performs slightly better than it does in the experiments with cubes of different colours. However, our variational autoencoder approach without channel separation improves the translation accuracy by approximately 8%. The channel separation in visual feature extraction improves the results even more, similar to Experiments 1a and 1b, which shows the robustness of the channel-separated CAE when processing different objects.
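As an illustration of the sentence-level criterion behind the accuracies in Table III, the following sketch counts a produced description as correct only if its words and the <EOS> tag match one of the eight meaning-equivalent references; the function and data layout are illustrative assumptions, not the exact evaluation script.

```python
# Sketch of the sentence-level accuracy criterion: a prediction is correct only if all
# of its words (including <EOS>) match one of the eight acceptable reference variants.
def sentence_accuracy(predicted, references) -> float:
    """predicted: list of word lists, e.g. [['push', 'red', 'slowly', '<EOS>'], ...]
    references: list of sets of acceptable word tuples, one set per test pattern."""
    correct = 0
    for pred, refs in zip(predicted, references):
        if tuple(pred) in refs:       # any of the eight alternatives counts as correct
            correct += 1
    return correct / len(predicted)
```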


separation in visual feature extraction improves the results TABLE V


even more similar to Experiment 1a and Experiment 1b, which VARIATIONS OF D ESCRIPTIONS FOR ONE E XAMPLE AND PVAE-BERT
L ANGUAGE - TO -L ANGUAGE S ENTENCE T RANSLATION ACCURACIES
shows the robustness of the channel-separated CAE when
processing different objects.
Var. Type Example Accuracy
1 Standard ‘push blue slowly’ 80.56%
B. Experiment 2 2 Changed Word Order ‘slowly push blue’ 80.56%
In this experiment, we test the performance of PVAE-BERT 3 Full Command ‘push the blue cube slowly’ 81.02%
4 ‘please’+Full Command ‘please push the blue cube 81.94%
on action-to-language, language-to-action and language-to- slowly’
language translation. We use the same dataset as in Experiment 5 Full Command+‘please’ ‘push the blue cube slowly 81.48%
please’
1b for a fair comparison with the original PVAE (LSTM 6 Ch.W.Order+F. Com.+‘pls.’ ‘slowly push the blue cube 81.48%
language encoder). We thus use the same descriptions, which please’
7 Polite Request ‘could you please push the 79.63%
are constructed by using a verb, colour and speed from the blue cube slowly?’
vocabulary given in Table II as well as the <BOS/EOS>
tags in the same order. Both PVAE and PVAE-BERT utilise
channel-separated CAE-extracted visual features. encoder and an LSTM decoder. The BERT-base language
encoder constitutes the overwhelming majority of parameters
TABLE IV in the PVAE-BERT model, which renders the language VAE
S ENTENCE T RANSLATION ACCURACIES FOR PVAE-BERT AND PVAE heavily skewed to the encoder half. This may affect the
performance of the language decoder when translating back
PVAE PVAE-BERT to the description from the hidden code produced mainly by
Translation Direction Test Accuracy (T - F) Test Accuracy (T - F)
Action→Language 100.00% (216 - 0) 97.22% (210 - 6)
BERT as the decoder’s parameters constitute less than 1% of
TABLE IV (last row). Language→Language: PVAE 100.00% (216 - 0); PVAE-BERT 80.56% (174 - 42).

As shown in Table IV, when translating from action to language, PVAE-BERT achieves approximately 97% accuracy, failing to translate only six of the descriptions, which is comparable with the original architecture: the original PVAE correctly translates all 216 descriptions. The false translations are all due to incorrect translation of cube colours, e.g., the predicted description is ‘slide blue slowly’ instead of the ground truth ‘slide red slowly’. We hypothesise that the slight drop in performance is due to the relatively small size of the dataset compared with the almost 110 million parameters trained in the case of BERT. Nevertheless, these results show that finetuning BERT during training leads to almost perfect action-to-language translation in our scenario.

As can be seen in Figure 3, both the PVAE and PVAE-BERT perform decently in language-to-action translation and produce joint angle values that are in line with and very similar to the original descriptions. In the bottom left plot, we can see that the joint trajectories output by PVAE-BERT are more accurate than those produced by the PVAE. We hypothesise that the error margins are negligible and that both PVAE-BERT and the PVAE succeed in language-to-action translation. Since we did not realise the actions with the generated joint values in the simulation, we do not report language-to-action translation accuracies in Table IV. However, we calculated the mean squared errors (MSE) for both the PVAE and PVAE-BERT, and both were very close to zero. Therefore, it is fair to say that both architectures recognise language and translate it to action successfully.

Language-to-language translation, however, suffers a bigger performance drop when BERT is used as the language encoder; PVAE-BERT reconstructs around 80% of the descriptions correctly (see Table IV). We hypothesise that this is partly due to having an asymmetric language autoencoder with a BERT encoder and an LSTM decoder, which complicates the training of the parameters of the language VAE. This hypothesis is further supported by the original architecture, which has a symmetric language VAE and achieves 100% accuracy in the same task.

Nevertheless, our findings show that the PVAE-BERT model achieves stable language-to-language translation performance even when the given descriptions do not comply with the fixed grammar, i.e., when they are full commands such as ‘push the blue cube slowly’ or have a different word order such as ‘quickly push blue’. To turn the predefined descriptions into full commands, we add the words ‘the’ and ‘cube’ to the descriptions; we also experiment with adding the word ‘please’ and with changing the word order, as can be seen from the examples given in Table V. Although it is not explicitly stated in the table for space reasons, we alternate between the main elements of the descriptions, as in the other experiments, following the vocabulary; for example, ‘push’ can be replaced by ‘move-up’ and ‘quickly’ can be replaced by ‘fast’. Moreover, we achieve consistent language-to-action translation performance with PVAE-BERT when we test it with the different description types shown in the table (see the bottom right plot of Figure 3). As PVAE-BERT performs consistently even with descriptions that do not follow the predefined grammar, the adoption of a language model into the architecture is promising towards acquiring natural language understanding skills.
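For illustration, the following minimal Python sketch shows how such command variants could be generated from the predefined three-word descriptions. The helper name, the variant templates and any synonym pairs beyond the ‘push’/‘move-up’ and ‘quickly’/‘fast’ examples mentioned above are our own assumptions rather than the original implementation.

import itertools

# Synonym pairs taken from the examples above ('push' <-> 'move-up',
# 'quickly' <-> 'fast'); further pairs would follow the same pattern.
SYNONYMS = {
    "push": ["push", "move-up"],
    "quickly": ["quickly", "fast"],
}

def command_variants(description):
    """Generate full-command variants of a predefined 'verb colour adverb' description."""
    verb, colour, adverb = description.split()
    verbs = SYNONYMS.get(verb, [verb])
    adverbs = SYNONYMS.get(adverb, [adverb])
    for v, a in itertools.product(verbs, adverbs):
        yield f"{v} the {colour} cube {a}"           # full command
        yield f"{v} the {colour} cube {a} please"    # with politeness marker
        yield f"{a} {v} {colour}"                    # different word order

for variant in command_variants("push blue slowly"):
    print(variant)

Variants of this kind mirror the description types listed in Table V.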


Fig. 3. Examples of language-to-action translation by PVAE-BERT and its comparison with PVAE: in the top row, the two plots represent the ground truth
and predicted joint trajectories by PVAE-BERT for PUSH-LEFT-SLOWLY and PULL-LEFT-SLOWLY actions. Solid lines show the ground truth, while the
dashed lines, which are often covered by the solid lines, show the predicted joint angle values. In the bottom row, the left plot shows the total error margin
of the five joint values produced by PVAE and PVAE-BERT per time step for the PUSH-LEFT-SLOWLY action, while the right plot shows the joint values
produced by PVAE-BERT given three variations (see Table V) of the same command for PULL-LEFT-SLOWLY - notice how the joint trajectories overlap
most of the time. In all of the plots, the X axis represents the time steps.
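The error margins and MSE values referred to above can be computed directly from the predicted and ground-truth joint trajectories. The following sketch assumes the trajectories are available as NumPy arrays of shape (time steps, 5 joints); the array names, shapes and random placeholder data are ours, not the original evaluation code.

import numpy as np

# Placeholder trajectories for one action: (time_steps, 5) joint angles.
ground_truth = np.random.rand(100, 5)
pred_pvae = ground_truth + 0.01 * np.random.randn(100, 5)
pred_pvae_bert = ground_truth + 0.01 * np.random.randn(100, 5)

def total_error_per_step(gt, pred):
    """Summed absolute joint-angle deviation at each time step (five joints)."""
    return np.abs(gt - pred).sum(axis=1)

def mse(gt, pred):
    """Mean squared error over all joints and time steps."""
    return float(np.mean((gt - pred) ** 2))

print("PVAE error margin per step:", total_error_per_step(ground_truth, pred_pvae)[:5])
print("PVAE MSE:", mse(ground_truth, pred_pvae))
print("PVAE-BERT MSE:", mse(ground_truth, pred_pvae_bert))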

C. Principal Component Analysis on Hidden Representations

We have also conducted principal component analysis (PCA) on the hidden features extracted from PVAE-BERT. Figure 4 shows the latent representations of language in Plot (a) and of action in Plot (b). The PCA on the representations of language shows that the model learns the compositionality of language: the X-axis (principal component PC 1) distinguishes the descriptions in the speed component (adverb), the Y-axis (PC 3) distinguishes the colour, and the Z-axis (PC 6) distinguishes the action type (verb). (The percentages of variance explained were very similar from PC 2 to PC 6; we therefore selected PC 3 and PC 6 for display, as they resolved colour and action type optimally.) Plot (b) shows that the PCA representations of actions are semantically similar, since their arrangement coincides with those in Plot (a).

Our method learns actions according to their paired descriptions: it learns the colour of the object (an element of the descriptions) interacted with. However, it does not learn the position of the object (an element of the actions). We inspected the representations along all major principal components, but we could not find any direction along which the position was meaningfully distinguished. For example, in Plot (b), some of the filled red circles (corresponding to the description ‘push red slowly’) are paired with the action PUSH-LEFT-SLOWLY, while the others are paired with PUSH-RIGHT-SLOWLY. As actions are learned according to their paired descriptions, hence semantically, the filled red circles are grouped together even though the red cube may be on the right or on the left. In contrast, an action can be represented far from another identical action: e.g., the representations of ‘pull red slowly’ (filled red circles in Figure 4) are separated from those of ‘pull yellow slowly’ (filled yellow circles) along PC 3, even though they both denote the action PULL-LEFT-SLOWLY. These results indicate that the binding loss has transferred semantically driven ordering from the language to the action representations.

When our agent receives a language instruction, which contains the colour but not the position, the agent is still able to perform the action according to the position of the object (cf. Figure 3). The retrieval of the position information must therefore be done by the action decoder: it reads the images to obtain the position of the object that has the colour given in the instruction. It is therefore not surprising that the PCA does not reveal any object position encodings in the bottleneck.

V. DISCUSSION

Experiments 1a and 1b show that our variational autoencoder approach with channel-separated CAE visual feature extraction (‘PVAE + channel-separated CAE’) performs better than the standard autoencoder approach, i.e., PRAE [30], in the one-to-many translation of robot actions into language descriptions. Our approach is superior by a large margin both in the case of three colour alternatives per cube and in the case of six colour alternatives per cube. The additional experiment with six different objects highlights the robustness of our approach against the variation in object types.
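The mechanism behind this advantage, elaborated after Fig. 4, is the stochastic sampling step of the variational bottleneck. A generic PyTorch sketch of this reparameterisation step is given below; the layer sizes and names are illustrative and do not reproduce the exact PVAE configuration.

import torch
import torch.nn as nn

class VariationalBottleneck(nn.Module):
    """Generic reparameterisation step mapping encoder features to a sampled latent code."""
    def __init__(self, in_dim=256, latent_dim=64):  # illustrative sizes
        super().__init__()
        self.to_mu = nn.Linear(in_dim, latent_dim)
        self.to_logvar = nn.Linear(in_dim, latent_dim)

    def forward(self, h):
        mu = self.to_mu(h)
        logvar = self.to_logvar(h)
        # z = mu + sigma * eps with eps ~ N(0, I): the injected noise lets the latent
        # code of the same action vary slightly across passes, which is what allows
        # one action to be bound to several equivalent descriptions.
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps, mu, logvar

bottleneck = VariationalBottleneck()
z, mu, logvar = bottleneck(torch.randn(8, 256))                  # batch of encoder outputs
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())    # KL regulariser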


Fig. 4. Hidden features of language (a) and hidden features of action (b): PCA was performed jointly on the hidden features of 36 descriptions and the
hidden features of 144 actions. For (b), each unique action (12 in total) occurs 12 times as there are 12 possible cube arrangements; therefore, 144 points are
shown. For both (a) and (b), we label the points according to descriptions, i.e., for (b), actions are also labelled according to their paired descriptions. As can
be seen from the legend, different shapes, colours and fillings indicate the verb (action type), object colour and adverb (speed), respectively.
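A minimal sketch of the analysis behind Fig. 4 is given below, assuming the language and action latent vectors have already been extracted into NumPy arrays; the array names and the latent dimensionality are placeholders, not the original analysis code.

import numpy as np
from sklearn.decomposition import PCA

# Placeholder latent codes: 36 description encodings and 144 action encodings
# with an assumed latent dimensionality of 64.
language_latents = np.random.randn(36, 64)
action_latents = np.random.randn(144, 64)

# As described in the caption, PCA is fitted jointly on both modalities.
pca = PCA(n_components=10)
pca.fit(np.vstack([language_latents, action_latents]))

lang_proj = pca.transform(language_latents)
act_proj = pca.transform(action_latents)

# PC 1, PC 3 and PC 6 (0-indexed 0, 2 and 5) are the components that resolved
# adverb, colour and verb in the analysis above.
selected = [0, 2, 5]
print("explained variance ratios:", pca.explained_variance_ratio_[selected])
print("language projections:", lang_proj[:, selected].shape)
print("action projections:", act_proj[:, selected].shape)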

We demonstrate that a Bayesian inference-based method like the variational autoencoder can scale up with more data for generalisation, whereas standard autoencoders cannot capitalise on a larger dataset, since the proposed PVAE model achieves better accuracy when the dataset and the corpus are extended with three extra colours or six different objects. Additionally, standard autoencoders are fairly limited in coping with the diversification of language, as they do not have the capacity to learn the mapping between an action and many descriptions. In contrast, variational autoencoders yield remarkably better results in one-to-many translation between actions and descriptions, because the stochastic generation (sampling from a random normal distribution) within the latent feature extraction allows the latent representations to vary slightly, which leads to VAEs learning multiple descriptions rather than a particular description for each action.

A closer look into the action-to-language translation accuracies achieved by the PRAE for Experiments 1a and 1b shows that having more variety in the data (i.e., more colour options for the cubes) does not help the standard autoencoder approach to learn the one-to-many binding between action and language. Both in the first case with three colour alternatives and in the second case with six colour alternatives, the PRAE manages to translate only around one third of the samples from actions to descriptions correctly. In contrast, the accuracies achieved by our proposed PVAE for both datasets show that the variational autoencoder approach can benefit from more data, as the test accuracy for the ‘PVAE + channel-separated CAE’ goes up by approximately 10% to 100% when three more colour options are added to the dataset.

Furthermore, training the PVAE with the visual features extracted by the standard CAE demonstrates that training and extracting features from each RGB channel separately mitigates the colour distinction issue for cubes when the visual input, as in our setup, includes objects covering a relatively small portion of the visual field. The ‘PVAE + regular CAE’ variant performs significantly worse than our ‘PVAE + channel-separated CAE’ approach. This also demonstrates the importance of the visual modality for the overall performance of the approach. Our analysis of the incorrectly translated descriptions shows that a large share of the errors committed by the ‘PVAE + regular CAE’ were caused by cube colour distinction failures, such as translating ‘slide red fast’ as ‘slide green fast’, which underlines the channel-separated CAE's superiority over the standard CAE in visual feature extraction in our scenario. Moreover, using the channel-separated CAE for visual feature extraction rather than the standard CAE results in better action-to-language translation accuracy even when the objects are of various shapes. This indicates that the channel-separated CAE not only works well with cubes of different colours but also with objects of different shapes.
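To make the channel-separation idea concrete, a simplified PyTorch sketch of a channel-separated convolutional encoder is shown below. The layer configuration, feature sizes and the omission of the reconstruction decoder are simplifications of ours, not the exact CAE used in the experiments.

import torch
import torch.nn as nn

class ChannelSeparatedEncoder(nn.Module):
    """Encodes the R, G and B channels with separate convolutional streams
    and concatenates the per-channel features (simplified illustration)."""
    def __init__(self, feat_dim=10):  # illustrative per-channel feature size
        super().__init__()
        def stream():
            return nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(8, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                nn.Linear(16 * 4 * 4, feat_dim),
            )
        self.streams = nn.ModuleList([stream() for _ in range(3)])

    def forward(self, rgb):                       # rgb: (batch, 3, H, W)
        feats = [s(rgb[:, c:c + 1]) for c, s in enumerate(self.streams)]
        return torch.cat(feats, dim=1)            # (batch, 3 * feat_dim)

encoder = ChannelSeparatedEncoder()
visual_features = encoder(torch.rand(2, 3, 64, 64))
print(visual_features.shape)                      # torch.Size([2, 30])

Keeping one stream per colour channel is what prevents the colours of small objects from being averaged away in a shared feature map, which matches the behaviour discussed above.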


We emphasise the superiority of channel separation in our scenario, which is tested and proven in a simulation environment. For real-world scenarios with different lighting conditions, it is advisable to also take the channel interaction into account [26] to achieve more robust visual feature extraction.

Experiment 2 indicates the potential of utilising a pretrained language model like BERT for the interpretation of language descriptions. This extension produces results comparable to the original PVAE with the LSTM language encoder in language-to-action and action-to-language translations. The drop in language-to-language performance to 80% is most probably caused by the asymmetric language VAE of the PVAE-BERT model, which consists of a feedforward BERT encoder with attention mechanisms, which reads the entire input sequence in parallel, and of a recurrent LSTM decoder, which produces the output sequentially. A previous study on a text classification task also shows that LSTM models outperform BERT on a relatively small corpus because, with its large number of parameters, BERT tends to overfit when the dataset size is small [10]. Furthermore, we have also tested the PVAE-BERT, which was trained on predefined descriptions, with full sentence descriptions, e.g. ‘push the blue cube slowly’ for ‘push blue slowly’, and with variations of the descriptions that have a different word order. We have confirmed that PVAE-BERT achieves the same performance in language-to-action and language-to-language translations. This is promising for the future because the pretrained BERT allows the model to understand unconstrained natural language commands that do not conform to the defined grammar.

The PCA conducted on the hidden features of PVAE-BERT shows that our method can learn language and robot actions compositionally and semantically. Although it is not explicitly given, we have also confirmed that both the PVAE and PVAE-BERT are able to reconstruct joint values almost perfectly when we analysed the action-to-action translation results. Together with the language-to-language performance, the action-to-action capability of both variants of our architecture demonstrates that the two variational autoencoders (language and action) in our approach retain their reconstructive nature.

VI. CONCLUSION

In this study, we have reported the findings of previous work and its extension with several experiments. We have shown that variational autoencoders outperform standard autoencoders in terms of one-to-many translation of robot actions to descriptions. Furthermore, the superiority of our channel-separated visual feature extraction has been demonstrated with an extra experiment that involves different types of objects. In addition, using the PVAE with a BERT model pretrained on large text corpora, instead of the LSTM encoder trained on our small predefined grammar, unveils promising scaling-up opportunities for the proposed approach, and it offers the possibility to map unconstrained natural language descriptions to actions.

In the future, we will collect descriptions via crowdsourcing in order to investigate the viability of using a pretrained language model as an encoder to relate language to motor control. We will also seek ways to bind the two modalities in a more biologically plausible way. Moreover, increasing the complexity of the scenario, with more objects in general and more objects on the table simultaneously, may shed light on the scalability of our approach. Lastly, we will transfer our simulation scenario to the real world and conduct experiments on the real robot.

ACKNOWLEDGMENT

The authors gratefully acknowledge support from the German Research Foundation DFG, project CML (TRR 169).

REFERENCES

[1] Amina Adadi and Mohammed Berrada. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access, 6:52138–52160, 2018.
[2] Ahmed Akakzia, Cédric Colas, Pierre-Yves Oudeyer, Mohamed Chetouani, and Olivier Sigaud. Grounding language to autonomously-acquired skills via goal generation. In International Conference on Learning Representations, Virtual (formerly Vienna, Austria), 2021.
[3] Alexandre Antunes, Alban Laflaquiere, Tetsuya Ogata, and Angelo Cangelosi. A bi-directional multiple timescales LSTM model for grounding of actions and verbs. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2614–2621, 2019.
[4] John Arevalo, Thamar Solorio, Manuel Montes-y Gomez, and Fabio A. González. Gated multimodal networks. Neural Computing and Applications, 32(14):10209–10228, 2020.
[5] Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicolas Pinto, and Joseph Turian. Experience grounds language. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 8718–8735. Association for Computational Linguistics, November 2020.
[6] Joyce Y. Chai, Qiaozi Gao, Lanbo She, Shaohua Yang, Sari Saba-Sadiya, and Guangyue Xu. Language to action: Towards interactive task learning with physical agents. In IJCAI, pages 2–9, 2018.
[7] François Chollet. Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1800–1807, 2017.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1), 2019.
[9] Aaron Eisermann, Jae Hee Lee, Cornelius Weber, and Stefan Wermter. Generalization in multimodal language learning from simulation. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2021), Jul 2021.
[10] Aysu Ezen-Can. A comparison of LSTM and BERT for small corpus. arXiv preprint arXiv:2009.05451, 2020.
[11] Jun Hatori, Yuta Kikuchi, Sosuke Kobayashi, Kuniyuki Takahashi, Yuta Tsuboi, Yuya Unno, Wilson Ko, and Jethro Tan. Interactively picking real-world objects with unconstrained spoken language instructions. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3774–3781. IEEE, 2018.
[12] Stefan Heinrich and Stefan Wermter. Interactive natural language acquisition in a multi-modal recurrent neural architecture. Connection Science, 30(1):99–133, 2018.
[13] Stefan Heinrich, Yuan Yao, Tobias Hinz, Zhiyuan Liu, Thomas Hummel, Matthias Kerzel, Cornelius Weber, and Stefan Wermter. Crossmodal language grounding in an embodied neurocognitive model. Frontiers in Neurorobotics, 14:52, 2020.
[14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[15] Matthias Kerzel, Theresa Pekarek-Rosin, Erik Strahl, Stefan Heinrich, and Stefan Wermter. Teaching NICO how to grasp: An empirical study on crossmodal social interaction as a key factor for robots learning from humans. Frontiers in Neurorobotics, 14:28, 2020.


[16] Matthias Kerzel, Erik Strahl, Sven Magg, Nicolás Navarro-Guerrero, Stefan Heinrich, and Stefan Wermter. NICO—Neuro-Inspired COmpanion: A developmental humanoid robot platform for multimodal interaction. In 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pages 113–120, 2017.
[17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR, San Diego, CA, USA, May 7-9, 2015.
[18] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of International Conference on Learning Representations (ICLR), Banff, AB, Canada, April 14-16, 2014.
[19] Corey Lynch and Pierre Sermanet. Language conditioned imitation learning over unstructured data. Robotics: Science and Systems, 2021.
[20] Joseph Marino. Predictive coding, variational autoencoders, and biological connections. Neural Computation, 34(1):1–44, 2021.
[21] Hwei Geok Ng, Paul Anton, Marc Brügger, Nikhil Churamani, Erik Fließwasser, Thomas Hummel, Julius Mayer, Waleed Mustafa, Thi Linh Chi Nguyen, Quan Nguyen, et al. Hey robot, why don't you talk to me? In 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pages 728–731, 2017.
[22] Tetsuya Ogata, Masamitsu Murase, Jun Tani, Kazunori Komatani, and Hiroshi G. Okuno. Two-way translation of compound sentences and arm motions by recurrent neural networks. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1858–1863, 2007.
[23] Haşim Sak, Andrew Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of Interspeech 2014, pages 338–342, 2014.
[24] Lin Shao, Toki Migimatsu, Qiang Zhang, Karen Yang, and Jeannette Bohg. Concept2robot: Learning manipulation concepts from instructions and human demonstrations. In Proceedings of Robotics: Science and Systems (RSS), 2020.
[25] Mohit Shridhar, Dixant Mittal, and David Hsu. INGRESS: Interactive visual grounding of referring expressions. The International Journal of Robotics Research, 39(2-3):217–232, 2020.
[26] Du Tran, Heng Wang, Matt Feiszli, and Lorenzo Torresani. Video classification with channel-separated convolutional networks. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 5551–5560, 2019.
[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[28] Valentin Vielzeuf, Alexis Lechervy, Stéphane Pateux, and Frédéric Jurie. CentralNet: A multilayer approach for multimodal fusion. In Laura Leal-Taixé and Stefan Roth, editors, Computer Vision – ECCV 2018 Workshops, pages 575–589, Cham, 2019. Springer International Publishing.
[29] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[30] Tatsuro Yamada, Hiroyuki Matsunaga, and Tetsuya Ogata. Paired recurrent autoencoders for bidirectional translation between robot actions and linguistic descriptions. IEEE Robotics and Automation Letters, 3(4):3441–3448, 2018.
[31] Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-hsuan Sung, Brian Strope, and Ray Kurzweil. Multilingual universal sentence encoder for semantic retrieval. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 87–94, Online, July 2020. Association for Computational Linguistics.
[32] Ozan Özdemir, Matthias Kerzel, and Stefan Wermter. Embodied language learning with paired variational autoencoders. In 2021 IEEE International Conference on Development and Learning (ICDL), pages 1–6. IEEE, Aug 2021.

Ozan Özdemir is a doctoral candidate and working as a research associate in the Knowledge Technology group, University of Hamburg, Germany. He has a BSc degree in computer engineering from Yildiz Technical University. He has received his MSc degree in Intelligent Adaptive Systems at the University of Hamburg. His research interests are embodied and crossmodal language learning, autoencoders, recurrent neural networks and large language models.

Matthias Kerzel received his MSc and PhD in computer science from the Universität Hamburg, Germany. He is currently a postdoctoral research and teaching associate at the Knowledge Technology Group of Prof. Stefan Wermter at the University of Hamburg. He has given lectures on Knowledge Processing in Intelligent Systems, Neural Networks and Bio-inspired Artificial Intelligence. He is currently the Secretary of the European Neural Network Society and worked in the organising committee of the International Conference on Artificial Neural Networks conferences. His research interests are in developmental neurorobotics, hybrid neurosymbolic architectures, explainable AI and human-robot interaction. He is currently involved in the international SFB/TRR-169 large-scale project on crossmodal learning.

Cornelius Weber graduated in physics at Universität Bielefeld, Germany and received his PhD in computer science at Technische Universität Berlin. Following positions were a Postdoctoral Fellow in Brain and Cognitive Sciences, University of Rochester, USA; Research Scientist in Hybrid Intelligent Systems, University of Sunderland, UK; Junior Fellow at the Frankfurt Institute for Advanced Studies, Germany. Currently he is Lab Manager at Knowledge Technology, Universität Hamburg. His interests are in computational neuroscience, development of visual feature detectors, neural models of representations and transformations, reinforcement learning and robot control, grounded language learning, human-robot interaction and related applications in social assistive robotics.

Jae Hee Lee is a postdoctoral research associate in the Knowledge Technology Group, University of Hamburg, Germany. He has worked on topics in multimodal learning, grounded language understanding, and spatial and temporal reasoning. Jae Hee Lee received his Diplom degree in mathematics and doctoral degree in computer science from the University of Bremen, Germany. He was a postdoctoral researcher at the Australian National University, University of Technology Sydney (Australia) and Cardiff University (UK).

Stefan Wermter (Member, IEEE) is currently a Full Professor with the University of Hamburg, Hamburg, Germany, where he is also the Director of the Department of Informatics, Knowledge Technology Institute. Currently, he is a co-coordinator of the International Collaborative Research Centre on Crossmodal Learning (TRR-169) and a coordinator of the European Training Network TRAIL on transparent interpretable robots. His main research interests are in the fields of neural networks, hybrid knowledge technology, cognitive robotics and human–robot interaction. He is an Associate Editor of Connection Science and International Journal for Hybrid Intelligent Systems. He is on the Editorial Board of the journals Cognitive Systems Research, Cognitive Computation and Journal of Computational Intelligence. He is serving as the President for the European Neural Network Society.

