Ozdemir et al. - 2022 - Language Model-Based Paired Variational Autoencoders
This article has been accepted for publication in IEEE Transactions on Cognitive and Developmental Systems. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/TCDS.2022.3204452.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
for language, the other for action, and both of them consist of an LSTM (long short-term memory) [14] encoder and decoder, which are suitable for sequential data. The dataset², which our model is trained with, consists of paired textual descriptions and corresponding joint angle values with egocentric images. The language VAE reconstructs descriptions, whereas the action VAE reconstructs joint angle values that are conditioned on the visual features extracted in advance by the channel-separated convolutional autoencoder (CAE) [32] from egocentric images. The two autoencoders are implicitly bound together with an extra loss term which aligns actions with their corresponding descriptions and separates unrelated actions and descriptions in the hidden vector space.

However, even with multiple descriptions mapped to a robot action as implemented in our previous work [32], replacing each word by its alternative does not lift the grammar restrictions on the language input. In order to process unconstrained language input, we equip the PVAE architecture with the Bidirectional Encoder Representations from Transformers (BERT) language model [8], which has been pretrained on large-scale text corpora, to enable the recognition of unconstrained natural language commands by human users. To this end, we replace the LSTM language encoder with a pretrained BERT model so that the PVAE can recognise different commands that correspond to the same actions as the predefined descriptions, given the same object combinations on the table. This new model variant, which we call PVAE-BERT, can handle not only the descriptions it is trained with, but also various descriptions equivalent in meaning with different word order and/or filler words (e.g., 'please', 'could', 'the', etc.), as our analysis shows. We make use of transfer learning by using a pretrained language model, hence benefitting from large unlabelled textual data.

Our contributions can be summarised as follows:
1) In our previous work [32], we showed that variational autoencoders facilitate better one-to-many action-to-language translation and that channel separation in visual feature extraction, i.e., training RGB channels separately, results in more accurate recognition of object colours in our object manipulation scenario. In this follow-up work, we extend our dataset with different shapes and show that our PVAE with the channel separation approach is able to translate from action to language while manipulating different objects.
2) Here, we introduce PVAE-BERT, which, by using pretrained BERT, indicates the potential of our approach to be scaled up for unconstrained instructions from human users.
3) Additional principal component analysis (PCA) shows that language as well as action representation vectors arrange according to the semantics of the language descriptions.

The remainder of this paper is organised as follows: the next section describes the relevant work, Section 3 presents the architecture of the PVAE and PVAE-BERT models, various experiments and their results are given in Section 4, Section 5 discusses the results and their implications, and the last section concludes the paper with final remarks³.

² https://www.inf.uni-hamburg.de/en/inst/ab/wtm/research/corpora.html
³ Our code is available at https://github.com/oo222bs/PVAE-BERT.

II. RELATED WORK

The state-of-the-art approaches in embodied language learning mostly rely on tabletop environments [11], [13], [24], [25], [30] or interactive play environments [19] where a robot interacts with various objects according to given instructions. We categorise these approaches into three groups: those that translate from language to action, those that translate from action to language and those that can translate in both directions, i.e., bidirectional approaches. Bidirectional approaches allow greater exploitation of available training data as training in both directions can be interpreted as multitask learning, which ultimately leads to more robust and powerful models independent of the translation direction. By using the maximum amount of shared weights for multiple tasks, such models would be more efficient than independent unidirectional networks in terms of data utilisation and model size.

A. Language-to-Action Translation

Translating from language to action is the most common form in embodied language learning. Hatori et al. [11] introduce a neural network architecture for moving objects given the visual input and language instructions, as their work focuses on the interaction of a human operator with the computational neural system that picks and places miscellaneous items as per verbal commands. In their scenario, many items with different shapes and sizes (e.g., toys, bottles) are distributed across four bins, with many of them being occluded; hence, the scene is very complex and cluttered. Given a pick-and-place instruction from the human operator, the robot first confirms and then executes it if the instruction is clear. Otherwise, the robot asks the human operator to clarify the desired object. The network receives a verbal command from the operator and an RGB image from the environment, and it has separate object recognition and language understanding modules, which are trained jointly to learn the names and attributes of the objects.

Shridhar and Hsu [25] propose a comprehensive system for a robotic arm to pick up objects based on visual and linguistic input. The system consists of multiple modules such as manipulation, perception and a neural network architecture, and is called INGRESS (Interactive Visual Grounding of Referring Expressions). INGRESS is composed of two network streams (self-referential and relational) which are trained on large datasets to generate a definitive expression for each object in the scene based on the input image. The generated expression is compared with the input expression to detect the desired object. INGRESS is therefore responsible for grounding language by learning object names and attributes via manipulation. The approach can resolve ambiguities when it comes to which object to lift by asking confirmation questions to the user.
Shao et al. [24] put forward a robot learning framework, Concept2Robot, for learning manipulation concepts from human video demonstrations in two stages. In the first stage, they use reinforcement learning and, in the second, they utilise imitation learning. The architecture consists of three main parts: a semantic context network, a policy network and action classification. The model receives as input a natural language description for each task alongside an RGB image of the initial scene. In return, it is expected to produce the parameters of a motion trajectory to accomplish the task in the given environment.

Lynch and Sermanet [19] introduce the LangLfP (language learning from play) approach, in which they utilise multicontext imitation to train a single policy based on multiple modalities. Specifically, the policy is trained on both image and language goals, and this enables the approach to follow natural language instructions during evaluation. During training, fewer than 1% of the tasks are labelled with natural language instructions, because it suffices to train the policy for more than 99% of the cases with goal images only. Therefore, only a few of the tasks must be labelled with language instructions. Furthermore, they utilise a Transformer-based [27] multilingual language encoder, the Multilingual Universal Sentence Encoder [31], to encode linguistic input so that the system can handle unseen language input like synonyms and instructions in 16 different languages.

The language-to-action translation methods are designed to act upon a given language input such as textual or verbal commands. They can recognise commands and execute the desired actions. However, they cannot describe the actions that they perform.

B. Action-to-Language Translation

Another class of approaches in embodied language learning translates action into language. Heinrich et al. [13] introduce an embodied crossmodal neurocognitive architecture, the adaptive multiple timescale recurrent neural network (adaptive MTRNN), which enables the robot to acquire language by listening to commands while interacting with objects in a playground environment. The approach has auditory, sensorimotor and visual perception capabilities. Since neurons at multiple timescales facilitate the emergence of hierarchical representations, the results indicate good generalisation and hierarchical concept decomposition within the network.

Eisermann et al. [9] study the problem of compositional generalisation, for which they conduct numerous experiments on a tabletop scenario where a robotic arm manipulates various objects. They utilise a simple LSTM-based network to describe the actions performed on the objects in hindsight - the model accepts visual and proprioceptive input and produces textual descriptions. Their results show that with the inclusion of proprioception as input and using more data in training, the network's performance on compositional generalisation improves significantly.

Similar to the language-to-action translation methods, the action-to-language translation methods work only in one direction: they describe the actions they perform in the environment. However, they are unable to execute a desired action given by the human user. Nevertheless, from the robotics perspective, it is desirable to have models that can also translate from action to language and not just execute verbal commands; such robots can explain their actions by verbalising an ongoing action, which also paves the way for more interpretable systems.

C. Bidirectional Translation

Very few embodied language learning approaches are capable of flexibly translating in both directions, hence, bidirectionally. While unidirectional approaches are feasible for smaller datasets, we aim to research architectures that can serve as large-scale multimodal foundation models and solve multiple tasks in different modalities. By generating a discrete set of words, bidirectional models can also provide feedback to a user about the information contained within their continuous variables. By providing rich language descriptions, rather than only performing actions, such models can contribute to explainable AI (XAI) for non-experts. For a comprehensive overview of the field of XAI, readers can refer to the survey paper by Adadi and Berrada [1].

In one of the early examples of bidirectional translation, Ogata et al. [22] present a model that is aimed at the articulation and allocation of arm movements by using a parametric bias to bind motion and language. The method enables the robot to move its arms according to given sentences and to generate sentences according to given arm motions. The model shows generalisation towards motions and sentences that it has not been trained with. However, it fails to handle complex sentences.

Antunes et al. [3] introduce the multiple timescale long short-term memory (MT-LSTM) model, in which the slowest layer establishes a bidirectional connection between action and language. The MT-LSTM consists of two components, namely the language and action streams, each of which is divided into three layers with varying timescales. The two components are bound by a slower meaning layer that allows translation from action to language and vice versa. The approach shows limited generalisation capabilities.

Yamada et al. [30] propose the paired recurrent autoencoder (PRAE) architecture, which consists of two autoencoders, namely action and description. The action autoencoder takes as input joint angle trajectories with visual features and is expected to reconstruct the original joint angle trajectories. The description autoencoder, on the other hand, reads and then reconstructs the action descriptions. The dataset that the model is trained on consists of pairs of simple robot actions and their textual descriptions, e.g., 'pushing away the blue cube'. The model is trained end-to-end, with both autoencoders reconstructing language and action, whilst there is no explicit neural connection between the two. The crossmodal pairing between the action and description autoencoders is supplied by a loss term that aligns the hidden representations of paired actions and descriptions. The binding loss allows the PRAE to execute actions given instructions as well as translate actions to descriptions.
As a bidirectional approach, the PRAE is biologically plausible to some extent, since humans can easily execute given commands and also describe these actions linguistically. To imitate human-like language recognition and production, bidirectionality is essential. However, due to its use of standard autoencoders, the PRAE can only bind a robot action with a particular description in a one-to-one way, although actions can be expressed in different ways. In order to map each robot action to multiple description alternatives, we have proposed the PVAE (paired variational autoencoders) approach [32], which utilises variational autoencoders (VAEs) to randomise the latent representation space and thereby allows one-to-many translation between action and language. A recent review by Marino [20] highlights similarities between VAEs and predictive coding from neuroscience in terms of model formulations and inference approaches.

This work is an extension of the ICDL article "Embodied Language Learning with Paired Variational Autoencoders" [32]. Inspired by the TransferLangLfP paradigm by Lynch and Sermanet [19], we propose to use the PVAE with a pretrained BERT language model [8] in order to enable the model to comprehend unconstrained language instructions from human users. Furthermore, we conduct experiments using PVAE-BERT on our dataset for various use cases and analyse the internal representations for the first time.

III. PROPOSED METHODS: PVAE & PVAE-BERT

As can be seen in Figure 2, the PVAE model consists of two variational autoencoders: a language VAE and an action VAE. The former learns to generate descriptions matching the original descriptions, whilst the latter learns to reconstruct joint angle values conditioned on the visual input. The two autoencoders do not have any explicit neural connection between them; instead, they are implicitly aligned by the binding loss, which brings the two autoencoders closer to each other in the latent space over the course of learning by reducing the distance between the two latent variables. First, the action and language encoders map the input to the latent code, i.e., the language encoder accepts one-hot encoded descriptions word by word as input and produces the encoded descriptions, whereas the action encoder accepts the corresponding arm trajectories and visual features as input and produces the encoded actions. Next, the encoded representations are used to extract latent representations by randomly sampling from a Gaussian distribution separately for the language and action modalities. Finally, from the latent representations, the language and action decoders reconstruct the descriptions and joint angle values, respectively.

Our model is a bidirectional approach, i.e., after training, translation is possible in both directions, action-to-language and language-to-action. The PVAE model transforms robot actions to descriptions in a one-to-many fashion by appropriately randomising the latent space. PVAE-BERT additionally handles variety in the language input by using pretrained BERT as the language encoder module. As part of the action encoder, the visual input features are extracted in advance using a channel-separated CAE (short for convolutional autoencoder), which improves the ability of the approach to distinguish the colours of cubes. The details of each model component are given in the following subsections.

A. Language Variational Autoencoder

The language VAE accepts as input the one-hot encoded matrix of a description, word by word in the case of the PVAE or the complete description altogether for PVAE-BERT, and, for both the PVAE and PVAE-BERT, it is responsible for reproducing the original description. It consists of an encoder, a decoder and latent layers (in the bottleneck) where latent representations are extracted via sampling. For the PVAE, the language encoder embeds a description of length N, (x_1, x_2, ..., x_N), into two fixed-dimensional vectors z_mean and z_var as follows:

  h_t^{enc}, c_t^{enc} = \mathrm{LSTM}(x_t, h_{t-1}^{enc}, c_{t-1}^{enc}), \quad 1 \le t \le N,
  z_{mean} = W_{mean}^{enc} \cdot h_N^{enc} + b_{mean}^{enc},
  z_{var} = W_{var}^{enc} \cdot h_N^{enc} + b_{var}^{enc},
  z_{lang} = z_{mean} + z_{var} \cdot \mathcal{N}(\mu, \sigma^2),

where h_t and c_t are the hidden and cell state of the LSTM at time step t, respectively, and \mathcal{N} is a Gaussian distribution. h_0 and c_0 are set as zero vectors, while \mu and \sigma are 0 and 0.1, respectively. z_lang is the latent representation of a description. The LSTM here, and in the following, is a peephole LSTM [23] following the implementation of Yamada et al. [30]. The language input is represented in one-hot encoded matrices, whose rows represent the sequence of input words and whose columns represent every word in the vocabulary. In each row, only one cell is 1 and the rest are 0, which determines the word that is given to the model at that time step.
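To make the encoder concrete, the following PyTorch-style sketch mirrors the equations above. It is an illustration rather than the authors' implementation: the vocabulary, hidden and latent sizes are placeholder values, and a vanilla nn.LSTM stands in for the peephole LSTM used in the paper.

```python
import torch
import torch.nn as nn

class LanguageEncoder(nn.Module):
    """Sketch of the PVAE language encoder: an LSTM over one-hot words,
    then two linear maps to z_mean and z_var, followed by sampling."""
    def __init__(self, vocab_size=17, hidden_size=100, latent_size=10):
        super().__init__()
        self.lstm = nn.LSTM(vocab_size, hidden_size, batch_first=True)  # stand-in for a peephole LSTM
        self.to_mean = nn.Linear(hidden_size, latent_size)
        self.to_var = nn.Linear(hidden_size, latent_size)

    def forward(self, one_hot_words):           # (batch, N, vocab_size)
        _, (h_n, _) = self.lstm(one_hot_words)  # final hidden state h_N
        h_last = h_n[-1]
        z_mean = self.to_mean(h_last)
        z_var = self.to_var(h_last)
        eps = torch.normal(0.0, 0.1, size=z_mean.shape)  # N(mu = 0, sigma = 0.1)
        z_lang = z_mean + z_var * eps
        return z_lang, z_mean, z_var

# toy usage: a batch of 2 descriptions of length 4 over a 17-word vocabulary
enc = LanguageEncoder()
x = nn.functional.one_hot(torch.randint(0, 17, (2, 4)), num_classes=17).float()
z_lang, z_mean, z_var = enc(x)
print(z_lang.shape)  # torch.Size([2, 10])
```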
For PVAE-BERT, we replace the LSTM language encoder with the pretrained BERT-base model and, following the implementation by Devlin et al. [8], tokenise the descriptions accordingly with the subword-based tokeniser WordPiece [29].
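As an illustration of this substitution, the sketch below uses the HuggingFace transformers library, which is an assumption: the text does not state which BERT implementation, checkpoint ("bert-base-uncased" here) or sentence-pooling strategy (the [CLS] vector here) the authors used. Only the overall idea, a WordPiece-tokenised description encoded by BERT-base and projected to z_mean and z_var, follows the text.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertLanguageEncoder(nn.Module):
    """Sketch: pretrained BERT-base replaces the LSTM encoder; its sentence
    representation is projected to z_mean / z_var and sampled as before."""
    def __init__(self, latent_size=10):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # assumed checkpoint
        self.to_mean = nn.Linear(self.bert.config.hidden_size, latent_size)
        self.to_var = nn.Linear(self.bert.config.hidden_size, latent_size)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        sent = out.last_hidden_state[:, 0]       # [CLS] representation (assumed pooling)
        z_mean, z_var = self.to_mean(sent), self.to_var(sent)
        return z_mean + z_var * torch.normal(0.0, 0.1, size=z_mean.shape)

tok = BertTokenizer.from_pretrained("bert-base-uncased")  # WordPiece tokenisation
batch = tok(["push the blue cube slowly", "pull red fast"],
            padding=True, return_tensors="pt")
z_lang = BertLanguageEncoder()(batch["input_ids"], batch["attention_mask"])
print(z_lang.shape)  # torch.Size([2, 10])
```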
The language decoder generates a sequence by recursively expanding z_lang:

  h_0^{dec}, c_0^{dec} = W^{dec} \cdot z_{lang} + b^{dec},
  h_t^{dec}, c_t^{dec} = \mathrm{LSTM}(y_{t-1}, h_{t-1}^{dec}, c_{t-1}^{dec}), \quad 1 \le t \le N-1,
  y_t = \mathrm{soft}(W^{out} \cdot h_t^{dec} + b^{out}), \quad 1 \le t \le N-1,

where soft denotes the softmax activation function. y_0 is the first symbol indicating the beginning of the sentence, hence the <BOS> tag.
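The recursive expansion can be sketched as a simple generation loop; the greedy word choice and the layer sizes below are illustrative assumptions, not the authors' exact decoding procedure.

```python
import torch
import torch.nn as nn

class LanguageDecoder(nn.Module):
    """Sketch of the language decoder: z_lang initialises the LSTM state,
    then words are generated recursively starting from <BOS>."""
    def __init__(self, vocab_size=17, hidden_size=100, latent_size=10, bos_id=0):
        super().__init__()
        self.init_state = nn.Linear(latent_size, 2 * hidden_size)  # W_dec . z_lang + b_dec
        self.cell = nn.LSTMCell(vocab_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)
        self.vocab_size, self.bos_id = vocab_size, bos_id

    def forward(self, z_lang, max_len):
        h, c = self.init_state(z_lang).chunk(2, dim=-1)
        y = nn.functional.one_hot(
            torch.full((z_lang.size(0),), self.bos_id), self.vocab_size).float()
        words = []
        for _ in range(max_len):                  # recursive expansion of z_lang
            h, c = self.cell(y, (h, c))
            probs = torch.softmax(self.out(h), dim=-1)
            idx = probs.argmax(dim=-1)            # greedy choice of the next word
            y = nn.functional.one_hot(idx, self.vocab_size).float()
            words.append(idx)
        return torch.stack(words, dim=1)          # (batch, max_len) word indices

dec = LanguageDecoder()
print(dec(torch.randn(2, 10), max_len=4).shape)   # torch.Size([2, 4])
```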
B. Action Variational Autoencoder

The action VAE accepts a sequence of joint angle values and visual features as input and is responsible for reconstructing the joint angle values. Similar to the language VAE, it is composed of an encoder, a decoder and latent layers (in the bottleneck) where latent representations are extracted via sampling. The action encoder encodes a sequence of length M, ((j_1, v_1), (j_2, v_2), ..., (j_M, v_M)), which includes the concatenation of joint angles j and visual features v. Note that the visual features are extracted by the channel-separated convolutional autoencoder beforehand.
Fig. 2. The architecture of the proposed PVAE and PVAE-BERT models: the language VAE (blue rectangles) processes descriptions, whilst the action VAE (orange rectangles) processes joint angles and images at each time step. The input to the language VAE is the given description x, whereas the action VAE takes as input joint angle values j and visual features v. The two VAEs are implicitly bound via a binding loss in the latent representation space. The image from which v_1 is extracted is magnified for visualisation purposes. <BOS> and <EOS> stand for beginning-of-sentence and end-of-sentence tags, respectively. The two models differ only by the language encoder employed: the PVAE uses an LSTM, whereas PVAE-BERT uses a pretrained BERT model.

The equations that define the action encoder are as follows:

  h_t^{enc}, c_t^{enc} = \mathrm{LSTM}(v_t, j_t, h_{t-1}^{enc}, c_{t-1}^{enc}), \quad 1 \le t \le M,
  z_{mean} = W_{mean}^{enc} \cdot h_M^{enc} + b_{mean}^{enc},
  z_{var} = W_{var}^{enc} \cdot h_M^{enc} + b_{var}^{enc},
  z_{act} = z_{mean} + z_{var} \cdot \mathcal{N}(\mu, \sigma^2),

where h_t and c_t are the hidden and cell state of the LSTM at time step t, respectively, and \mathcal{N} is a Gaussian distribution. h_0 and c_0 are set as zero vectors, while \mu and \sigma are set as 0 and 0.1, respectively. z_act is the latent representation of a robot action. The action decoder reconstructs the joint angles:

  h_0^{dec}, c_0^{dec} = W^{dec} \cdot z_{act} + b^{dec}
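The following sketch illustrates the action encoder under assumed dimensions: 5 joint angles (NICO's left arm uses 5 joints, as stated later in the paper) and a placeholder visual feature size. The remaining decoder equations are not reproduced above; the decoder is assumed to mirror the language decoder and is therefore omitted here.

```python
import torch
import torch.nn as nn

class ActionEncoder(nn.Module):
    """Sketch: LSTM over concatenated visual features v_t and joint angles j_t,
    then linear maps to z_mean / z_var and sampling of z_act."""
    def __init__(self, joint_dim=5, visual_dim=30, hidden_size=100, latent_size=10):
        super().__init__()
        self.lstm = nn.LSTM(joint_dim + visual_dim, hidden_size, batch_first=True)
        self.to_mean = nn.Linear(hidden_size, latent_size)
        self.to_var = nn.Linear(hidden_size, latent_size)

    def forward(self, joints, visual):               # (batch, M, 5), (batch, M, 30)
        seq = torch.cat([visual, joints], dim=-1)    # per-step (v_t, j_t) concatenation
        _, (h_m, _) = self.lstm(seq)                 # final hidden state h_M
        z_mean, z_var = self.to_mean(h_m[-1]), self.to_var(h_m[-1])
        z_act = z_mean + z_var * torch.normal(0.0, 0.1, size=z_mean.shape)
        return z_act, z_mean, z_var

enc = ActionEncoder()
z_act, _, _ = enc(torch.randn(2, 50, 5), torch.randn(2, 50, 30))
print(z_act.shape)   # torch.Size([2, 10])
```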
C. Visual Feature Extraction

Training a separate CAE for each colour channel yields better results in image classification, as the network parameters are used more efficiently. The channel-separated CAE accepts one colour channel of the 120 × 160 RGB images captured by the cameras in the eyes of NICO - also referred to as the egocentric view of the robot - at a time. As can be seen in detail in Table I, it consists of a convolutional encoder, a fully-connected bottleneck (incorporating the hidden representations) and a deconvolutional decoder. After training for each colour channel, we extract the visual features of each image for every channel from the middle layer in the bottleneck (FC 3). The visual features extracted from each channel are then concatenated to make up the ultimate visual features v.

TABLE I
DETAILED ARCHITECTURE OF CHANNEL-SEPARATED CAE
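A minimal sketch of the channel-separation idea follows: one small convolutional autoencoder per RGB channel, with the three bottleneck codes concatenated into v. The layer sizes and code size are placeholders and do not reproduce the configuration in Table I.

```python
import torch
import torch.nn as nn

class SingleChannelCAE(nn.Module):
    """Toy convolutional autoencoder for one colour channel of a 120x160 image."""
    def __init__(self, code_size=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, 4, stride=2, padding=1), nn.ReLU(),    # 60 x 80
            nn.Conv2d(8, 16, 4, stride=2, padding=1), nn.ReLU(),   # 30 x 40
            nn.Flatten(), nn.Linear(16 * 30 * 40, code_size))
        self.decoder = nn.Sequential(
            nn.Linear(code_size, 16 * 30 * 40), nn.ReLU(),
            nn.Unflatten(1, (16, 30, 40)),
            nn.ConvTranspose2d(16, 8, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(8, 1, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, x):                  # x: (batch, 1, 120, 160)
        code = self.encoder(x)
        return self.decoder(code), code

def extract_visual_features(rgb, caes):
    """Concatenate the bottleneck codes of the per-channel CAEs into v."""
    codes = [caes[c](rgb[:, c:c + 1])[1] for c in range(3)]
    return torch.cat(codes, dim=-1)

caes = [SingleChannelCAE() for _ in range(3)]     # one CAE per colour channel
v = extract_visual_features(torch.rand(2, 3, 120, 160), caes)
print(v.shape)                                    # torch.Size([2, 30])
```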
D. Sampling and Binding

Stochastic Gradient Variational Bayes-based sampling (SGVB) [18] enables one-to-many mapping between action and language. The two VAEs have identical random sampling procedures. After producing the latent variables z_mean and z_var via the fully connected layers, we utilise a normal distribution \mathcal{N}(\mu, \sigma^2) to derive random values \epsilon, which are, in turn, used with z_mean and z_var to arrive at the latent representation z; this is also known as the reparameterisation trick [18]:

  z = z_{mean} + z_{var} \cdot \epsilon,

where \epsilon is sampled from \mathcal{N}(0, 0.01).
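A short sketch of the reparameterisation trick, as described above: the randomness is moved into \epsilon so that gradients can flow through z_mean and z_var during training. The tensor sizes are placeholders.

```python
import torch

def reparameterise(z_mean, z_var, sigma=0.1):
    """z = z_mean + z_var * eps with eps ~ N(0, sigma^2); sampling lives in eps,
    so z_mean and z_var remain differentiable."""
    eps = torch.normal(0.0, sigma, size=z_mean.shape)
    return z_mean + z_var * eps

z_mean = torch.zeros(2, 10, requires_grad=True)
z_var = torch.ones(2, 10, requires_grad=True)
z = reparameterise(z_mean, z_var)
z.sum().backward()                     # gradients reach z_mean and z_var
print(z_mean.grad.shape, z_var.grad.shape)
```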
As in the case of [30], to align the latent representations of robot actions and their descriptions, we use an extra loss term that brings the mean hidden features, z_mean, of the two VAEs closer to each other. This enables bidirectional translation between action and language, i.e., the network can transform actions to descriptions as well as descriptions to actions after training, without an explicit fusion of the two modalities. This loss term (binding loss) can be calculated as follows:

  L_{binding} = \sum_{i}^{B} \psi\big(z_{mean_i}^{lang}, z_{mean_i}^{act}\big) + \sum_{i}^{B} \sum_{j \neq i} \max\Big\{0,\; \Delta + \psi\big(z_{mean_i}^{lang}, z_{mean_i}^{act}\big) - \psi\big(z_{mean_j}^{lang}, z_{mean_i}^{act}\big)\Big\},

where B stands for the batch size and \psi is the Euclidean distance. The first term in the equation binds the paired instructions and actions, whereas the second term separates unpaired actions and descriptions. Hyperparameter \Delta is used to adjust the separation margin for the second term - the higher it is, the further apart the unpaired actions and descriptions are pushed in the latent space.
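The binding loss can be sketched directly from the formula above; the margin value and latent sizes below are placeholders, and the batch dimension is assumed to index paired description-action samples.

```python
import torch

def binding_loss(z_lang, z_act, delta=1.0):
    """Sketch of the binding loss: pull each paired (z_lang_i, z_act_i) together and
    push unpaired (z_lang_j, z_act_i) at least `delta` further apart than the pair."""
    dist = torch.cdist(z_lang, z_act)        # (B, B) Euclidean distances psi
    paired = dist.diag()                     # psi(z_lang_i, z_act_i)
    loss = paired.sum()
    B = dist.size(0)
    for i in range(B):
        for j in range(B):
            if j != i:
                margin = delta + paired[i] - dist[j, i]   # psi(z_lang_j, z_act_i)
                loss = loss + torch.clamp(margin, min=0.0)
    return loss

print(binding_loss(torch.randn(4, 10), torch.randn(4, 10)))
```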
Different multi-modal fusion techniques, such as the Gated Multimodal Unit (GMU) [4], which uses gating and multiplicative mechanisms to fuse different modalities, and CentralNet [28], which fuses information by having a separate network for each modality as well as central joint representations at each layer, were also considered during our work. However, since our model is bidirectional (it must work in both the action-to-language and language-to-action directions) and must work with either language or action input during inference (both GMU and CentralNet require all of the modalities to be available), we opted for the binding loss for multi-modal integration.

E. Loss Function

The overall loss is calculated as the sum of the reconstruction, regularisation and binding losses. The binding loss is calculated for both VAEs jointly. In contrast, the reconstruction and regularisation losses are calculated independently for each VAE. Following [30], the reconstruction losses for the language VAE (cross entropy between input and output words) and the action VAE (Euclidean distance between original and generated joint values) are L_lang and L_act, respectively:

  L_{lang} = -\frac{1}{N-1} \sum_{t=1}^{N-1} \sum_{i=0}^{V-1} x_{t+1}^{[i]} \log y_t^{[i]},
  L_{act} = \frac{1}{M-1} \sum_{t=1}^{M-1} \lVert j_{t+1} - \hat{j}_{t+1} \rVert_2^2,

where V is the vocabulary size, N is the number of words per description and M is the sequence length of an action trajectory. The regularisation loss is specific to variational autoencoders; it is defined as the Kullback-Leibler divergence for language, D_{KL_{lang}}, and action, D_{KL_{act}}. Therefore, the overall loss function is as follows:

  L_{all} = \alpha L_{lang} + \beta L_{act} + \gamma L_{binding} + \alpha D_{KL_{lang}} + \beta D_{KL_{act}},

where \alpha, \beta and \gamma are weighting factors for the different terms in the loss function. In our experiments, \alpha and \beta are set to 1, whilst \gamma is set to 2 in order to sufficiently bind the two modalities.
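The loss terms can be combined as sketched below. The reconstruction losses follow the formulas above (with an additional average over the batch for convenience); the tensor shapes are toy values and the KL terms are passed in as stand-in scalars.

```python
import torch
import torch.nn.functional as F

def reconstruction_losses(x_onehot, y_probs, joints, joints_hat):
    """L_lang: cross entropy between target words x_{t+1} and predictions y_t;
    L_act: mean squared Euclidean distance between target and predicted joints."""
    l_lang = -(x_onehot[:, 1:] * torch.log(y_probs + 1e-8)).sum(-1).mean()
    l_act = ((joints[:, 1:] - joints_hat) ** 2).sum(-1).mean()
    return l_lang, l_act

def total_loss(l_lang, l_act, l_binding, kl_lang, kl_act,
               alpha=1.0, beta=1.0, gamma=2.0):
    """L_all with the weighting reported in the text (alpha = beta = 1, gamma = 2)."""
    return alpha * l_lang + beta * l_act + gamma * l_binding + alpha * kl_lang + beta * kl_act

# toy shapes: batch 2, N = 4 words, V = 17 vocabulary entries; M = 50 steps, 5 joints
x = F.one_hot(torch.randint(0, 17, (2, 4)), 17).float()
y = torch.softmax(torch.randn(2, 3, 17), -1)        # predictions for t = 1..N-1
j, j_hat = torch.randn(2, 50, 5), torch.randn(2, 49, 5)
l_lang, l_act = reconstruction_losses(x, y, j, j_hat)
print(total_loss(l_lang, l_act, torch.tensor(0.1), torch.tensor(0.0), torch.tensor(0.0)))
```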
F. Transformer-Based Language Encoder

In order for the model to understand unconstrained language input from non-expert human users, we replace the LSTM language encoder with a pretrained BERT-base language model [8] - see Figure 2. According to [8], BERT is pretrained with the BooksCorpus, which involves 800 million words, and English Wikipedia, which involves 2.5 billion words. With the introduction of BERT as the language encoder, we assume that BERT can interpret action descriptions correctly in our scenario. However, since language models like BERT are pretrained exclusively on textual data from the internet, they are not specialised for object manipulation environments like ours. Therefore, the embedding of an instruction like 'push the blue object' may not differ significantly from the embedding of another such as 'push the red object'. For this reason, we finetune the pretrained BERT-base, i.e., all of BERT's parameters are updated during the end-to-end training of PVAE-BERT, so that it can separate similar instructions from each other, which is critical for our scenario.

G. Training Details

To train the PVAE and PVAE-BERT, we first extract visual features using our channel-separated CAE. The visual features are used to condition the actions depending on the cube arrangement, i.e., the execution of a description depends also on the position of the target cube. For both the PVAE and PVAE-BERT, the action encoder and action decoder are each a two-layer LSTM with a hidden size of 100, whilst the language decoder is a single-layer LSTM with the same hidden size. In contrast, the language encoder of PVAE-BERT is the pretrained BERT-base model with 12 layers, each with 12 self-attention heads and a hidden size of 768, whereas the language encoder of the PVAE is a one-layer LSTM with a hidden size of 100. Both the PVAE and PVAE-BERT are trained end-to-end with both the language and action VAEs together.
The PVAE and PVAE-BERT are trained for 20,000 and 40,000 iterations, respectively, with the gradient descent algorithm and the Adam optimiser [17]. We take the learning rate as 10^{-4} with a batch size of 100 pairs of language and action sequences, after a few trials with different learning rates and batch sizes. Due to having approximately 110M parameters, compared with the PVAE's approximately 465K parameters, an iteration of PVAE-BERT training takes about 1.4 times longer than an iteration of PVAE training. Therefore, it takes about 2.8 times longer to train PVAE-BERT in total.
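The reported optimisation setup (Adam, learning rate 10^{-4}, batches of 100 paired sequences, joint end-to-end update of both VAEs) can be sketched as follows; `model` is a placeholder standing in for the full PVAE or PVAE-BERT, and the dummy loss only illustrates the update mechanics.

```python
import torch

model = torch.nn.Linear(10, 10)            # placeholder for the full model
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(batch_loss):
    """One iteration: both VAEs are updated jointly from the combined loss."""
    optimiser.zero_grad()
    batch_loss.backward()
    optimiser.step()

# dummy batch of 100 paired sequences, reduced to a scalar stand-in loss
train_step(model(torch.randn(100, 10)).pow(2).mean())
```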
IV. EVALUATION AND RESULTS

We evaluate the performance of our PVAE and its variant using BERT, namely PVAE-BERT, with multiple experiments. First, we compare the original PVAE with PRAE [30] in terms of action-to-language translation by conducting experiments with varying object colour options to display the superiority of variational autoencoders over regular autoencoders and the advantage of using the channel separation technique in visual feature extraction. Different object colour possibilities correspond to a different corpus and overall dataset size; the more object colour options there are, the larger both the vocabulary and the overall dataset become. Therefore, with these experiments, we also test the scalability of both approaches. In order to show the impact of channel separation on the action-to-language translation performance, we train our architecture with visual features provided by a regular CAE (no channel separation) as implemented in [30]. These are Experiment 1a (with 3 cube colour alternatives: red, green, blue) and Experiment 1b (with 6 cube colour alternatives: red, green, blue, yellow, cyan, violet) - see Table III.

Moreover, in Experiment 2, we train PVAE-BERT on the dataset with 6 colour alternatives (red, green, blue, yellow, cyan, violet) to compare it with the standard PVAE by conducting action-to-language, language-to-language and language-to-action evaluation experiments. This experiment uses the pretrained BERT as the language encoder, which is then finetuned with the rest of the model during training.

In Experiments 1a, 1b and 2, two cubes of different colours are placed on a table at which the robot is seated to interact with them. The words (vocabulary) that constitute the descriptions are given in Table II. We introduce a more diverse vocabulary by adding an alternative word for each word in the original vocabulary. As descriptions are composed of 3 words with two alternatives per word, we arrive at 8 variations for each description of a given meaning. Table II does not include nouns, because we use a predefined grammar, which does not involve a noun, and cubes of the same size for these experiments. For each cube arrangement, the colours of the two cubes always differ to avoid ambiguities in the language description. Actions, which are transcribed in capitals, are composed of any of the three action types PUSH, PULL, SLIDE, two positions LEFT, RIGHT and two speed settings SLOWLY, FAST, resulting in 12 possible actions (3 action types × 2 positions × 2 speeds); e.g., PUSH-LEFT-SLOWLY means pushing the left object slowly. Every sentence is composed of three words (excluding the <BOS/EOS> tags, which denote beginning of sentence and end of sentence) with the first word indicating the action, the second the cube colour and the last the speed at which the action is performed (e.g., 'push green slowly'). Therefore, without the alternative words, there are 18 possible sentences (3 action verbs × 3 colours × 2 adverbs) for Experiment 1a, whereas, for Experiments 1b and 2, the number of sentences is 36, as 6 cube colours are used in both experiments. As a result, our dataset consists of 6 cube arrangements for Experiment 1a (3 colour alternatives, and the colours of the two cubes on the table never match), 12 cube arrangements for Experiments 1b and 2 (3 secondary colours are used in addition to 3 primary colours, and secondary and primary colours are mutually exclusive), 18 × 8 = 144 possible sentences for Experiment 1a, and 36 × 8 = 288 possible sentences for Experiments 1b and 2 with the alternative vocabulary (consult Table II) - the factor of 8 arises from the eight alternatives per sentence. We have 72 patterns (action-description-arrangement combinations) for Experiment 1a (12 actions with six cube arrangements each) and 144 patterns for Experiments 1b and 2. Following Yamada et al. [30], we choose the patterns rigorously to ensure that combinations of action, description and cube arrangement used in the test set are excluded from the training set, although the training set includes all possible combinations of action, description and cube arrangement that are not in the test set. For Experiment 1a, 54 patterns are used for training while the remaining 18 are used for testing (for Experiments 1b and 2: 108 for training, 36 for testing). Each pattern is collected six times in the simulation with random variations on the action execution, resulting in different joint trajectories. We also use 4-fold cross-validation to provide more reliable results (consult Table III) for Experiment 1.

TABLE II
VOCABULARY

         Original   Alternative
Verb     push       move-up
         pull       move-down
         slide      move-sideways
Colour   red        scarlet
         green      harlequin
         blue       azure
         yellow     blonde
         cyan       greenish-blue
         violet     purple
Speed    slowly     unhurriedly
         fast       quickly
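The combinatorics just described can be verified with a few lines of Python; the snippet below is purely illustrative and enumerates only the Experiment 1a vocabulary without the alternative words.

```python
from itertools import product

# 3 action types x 2 positions x 2 speeds = 12 actions;
# 3 verbs x 3 colours x 2 adverbs = 18 sentences, x 8 wording variants = 144.
actions = [f"{a}-{p}-{s}" for a, p, s in product(
    ["PUSH", "PULL", "SLIDE"], ["LEFT", "RIGHT"], ["SLOWLY", "FAST"])]
sentences = [" ".join(w) for w in product(
    ["push", "pull", "slide"], ["red", "green", "blue"], ["slowly", "fast"])]
assert len(actions) == 12 and len(sentences) == 18
print(len(sentences) * 8)   # 144 possible sentences with the alternative vocabulary
```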
Experiment 1c tests for different shapes, other than cubes: we perform the same actions on toy objects, which are a car, duck, cup, glass, house and lego brick. For testing the shape processing capability of the model, all objects are of the same colour, namely yellow. Analogous to the other experiments, two objects of different shapes are placed on the table. We keep the actions as they are but replace the colours with object names in the descriptions. Before we extract the visual features from the new images, we train both the regular CAE and the channel-separated CAE with them. Similar to Experiments 1a and 1b, we experiment with three methods: PRAE with standard CAE, PVAE with standard CAE and PVAE with channel-separated CAE.
TABLE III
ACTION-TO-LANGUAGE TRANSLATION ACCURACIES AT SENTENCE LEVEL

We use NICO (Neuro-Inspired COmpanion) [15], [16] in a virtual environment created with Blender⁵ for our experiments - see Figure 1. NICO is a humanoid robot with a height of approximately one metre and a weight of approximately 20 kg. The left arm of NICO is used to interact with the objects while utilising 5 joints. Actions are realised using the inverse kinematics solver provided by the simulation environment: for each action, first, the starting point and endpoint are adjusted manually; then, Gaussian deviation is applied around the starting point and endpoint to generate variations of the action, ensuring that there is a slight difference in the overall trajectory. NICO has a camera in each of its eyes, which is used to extract egocentric visual images.

⁵ https://www.blender.org/

A. Experiment 1

We use the same actions as in [30], such as PUSH-RIGHT-SLOWLY. We use three colour options for the cubes as in [30] for Experiment 1a, but six colours for Experiment 1b. However, we extend the descriptions in [30] by adding an alternative for each word in the original vocabulary. Hence, the vocabulary size of 9 is extended to 17 for Experiment 1a and the vocabulary size of 11 is extended to 23 for Experiment 1b - note that we do not add an alternative for the <BOS/EOS> tags. Since every sentence consists of three words, we extend the number of sentences by a factor of eight (2³ = 8).

After training the PVAE and PRAE on the same training set, we test them for action-to-language translation. A produced description is considered correct only if all three words and the <EOS> tag are correctly predicted; produced descriptions that have one or more incorrect words are considered false translations. As each description has seven more alternatives, predicting any of the eight description alternatives is considered correct.
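The evaluation criterion can be expressed as a small helper function; the alternative sets below are abbreviated (only two of the eight accepted variants per pattern are shown) and serve purely as an illustration.

```python
def sentence_accuracy(predictions, references):
    """Sentence-level accuracy as described above: a prediction counts as correct
    only if every word and the <EOS> tag match one of the accepted alternatives.
    `references` maps each pattern to its set of equivalent descriptions."""
    correct = sum(pred in refs for pred, refs in zip(predictions, references))
    return correct / len(predictions)

# toy example with two patterns and abbreviated alternative sets
refs = [{"push red slowly <EOS>", "move-up scarlet unhurriedly <EOS>"},
        {"pull blue fast <EOS>", "move-down azure quickly <EOS>"}]
preds = ["push red slowly <EOS>", "pull blue slowly <EOS>"]
print(sentence_accuracy(preds, refs))   # 0.5 - the second prediction has a wrong word
```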
For Experiment 1a, our model is able to translate approximately 90% of the patterns in the test set, whilst PRAE could translate only one third of the patterns, as can be seen in Table III. We can thus say that our model outperforms PRAE in one-to-many mapping. We also test the impact of channel separation on the translation accuracy by training our model with visual features extracted with the regular CAE as described in Yamada et al.'s approach [30]. Table III clearly indicates that using variational autoencoders instead of standard ones increases the accuracy significantly. Using the PVAE with the channel-separated CAE improves the results further, indicating the superiority of channel separation in our tabletop scenario. Therefore, our approach with variational autoencoders and a channel-separated CAE is superior to both PRAE and PVAE with regular visual feature extraction.

In Experiment 1b, in order to test the limits of our PVAE and the impact of more data with a larger corpus, we add three more colour options for the cubes: yellow, cyan and violet. These secondary colours are combined amongst themselves for the arrangements, in addition to the colour combinations used in the first experiment, i.e., a cube of a primary colour and a cube of a secondary colour do not co-occur. Therefore, this experiment has 12 arrangements. Moreover, the vocabulary size is extended from 17 to 23 in Experiment 1b (two alternative words for each colour - see Table II). As in Experiment 1a, each sentence has eight alternative ways to be described.

We train both PVAE and PRAE [30] on the extended dataset from scratch and test both architectures. As shown in Table III (Experiment 1b), PVAE succeeds in performing at 100% by translating every pattern from action to description correctly, even for the test set. In contrast, PRAE performs poorly in this setting and manages to translate only one third of the descriptions correctly in the test set. Compared with the accuracy values reached in the first experiment with less data and a smaller corpus, the extension of the dataset helps PVAE to perform better in translation, whereas PRAE is not able to take advantage of more data. Similar to Experiment 1a, we also test the influence of channel separation on the translation accuracy by training PVAE with visual features provided by a regular CAE. In this setting, PVAE only achieves around 61% accuracy in the test set. This highlights once again the importance of channel separation in visual feature extraction for our setup. Whilst the improvement by using our PVAE over PRAE is significant, further improvement is made by utilising the channel-separated CAE.

In addition, as the results in the last column of Table III (Experiment 1c) show, our PVAE with channel separation in visual feature extraction outperforms the other methods even when the manipulated objects have different shapes. Although there is a slight drop in action-to-language translation performance, it is clear that the PVAE with the channel-separated CAE is able to handle different-shaped objects. The PRAE model performs slightly better than it does in the experiments with cubes of different colours. However, our variational autoencoder approach without channel separation improves the translation accuracy by approximately 8%. The channel
Fig. 3. Examples of language-to-action translation by PVAE-BERT and its comparison with PVAE: in the top row, the two plots represent the ground truth
and predicted joint trajectories by PVAE-BERT for PUSH-LEFT-SLOWLY and PULL-LEFT-SLOWLY actions. Solid lines show the ground truth, while the
dashed lines, which are often covered by the solid lines, show the predicted joint angle values. In the bottom row, the left plot shows the total error margin
of the five joint values produced by PVAE and PVAE-BERT per time step for the PUSH-LEFT-SLOWLY action, while the right plot shows the joint values
produced by PVAE-BERT given three variations (see Table V) of the same command for PULL-LEFT-SLOWLY - notice how the joint trajectories overlap
most of the time. In all of the plots, the X axis represents the time steps.
distinguishes the action type (verb)⁶. Plot (b) shows that the PCA representations of actions are semantically similar, since their arrangement coincides with those in Plot (a).

Our method learns actions according to their paired descriptions: it learns the colour of the object (an element of descriptions) interacted with. However, it does not learn the position of the object (an element of actions). We inspected the representations along all major principal components, but we could not find any direction along which the position was meaningfully distinguished. For example, in (b), some of the filled red circles (corresponding to the description 'push red slowly') are paired with the action PUSH-LEFT-SLOWLY while the others are paired with PUSH-RIGHT-SLOWLY. As actions are learned according to their paired descriptions, hence semantically, the filled red circles are grouped together even though the red cube may be on the right or left. In contrast, an action can be represented far from another identical action: e.g., the representations of 'pull red slowly' (filled red circles in Figure 4) are separated from those of 'pull yellow slowly' (filled yellow circles) along PC 3, even though they both denote the action PULL-LEFT-SLOWLY. These results indicate that the binding loss has transferred semantically driven ordering from the language to the action representations.

⁶ The percentages of variance explained were very similar from PC 2 to PC 6; therefore, we selected PC 3 and PC 6 for display as they resolved colour and action type optimally.

When our agent receives a language instruction, which contains the colour but not the position, the agent is still able to perform the action according to the position of the object (cf. Figure 3). The retrieval of the position information must therefore be done by the action decoder: it reads the images to obtain the position of the object that has the colour given in the instruction. It is therefore not surprising that the PCA does not reveal any object position encodings in the bottleneck.
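The representation analysis described above can be reproduced in outline with scikit-learn; random arrays stand in for the actual 36 description and 144 action hidden features, and the choice of PC 3 and PC 6 follows the footnote above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Sketch: PCA fitted jointly on language and action hidden features,
# then selected components inspected (placeholder data, placeholder latent size).
z_lang = np.random.randn(36, 10)    # stand-in for description representations
z_act = np.random.randn(144, 10)    # stand-in for action representations
pca = PCA(n_components=6).fit(np.vstack([z_lang, z_act]))
proj_lang, proj_act = pca.transform(z_lang), pca.transform(z_act)
print(pca.explained_variance_ratio_)    # used to decide which PCs to plot
print(proj_act[:, [2, 5]].shape)        # PC 3 and PC 6 (0-indexed columns 2 and 5)
```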
V. DISCUSSION

Experiments 1a and 1b show that our variational autoencoder approach with a channel-separated CAE visual feature extraction ('PVAE + channel-separated CAE') performs better than the standard autoencoder approach, i.e., PRAE [30], in the one-to-many translation of robot actions into language descriptions. Our approach is superior both in the case of three colour alternatives per cube and in the case of six colour alternatives per cube by a large margin. The additional experiment with six different objects highlights the robustness of our approach against the variation in object types. We demonstrate that a Bayesian inference-based method like variational autoencoders can scale up with more data for generalisation, whereas standard autoencoders cannot capitalise on a larger dataset, since the proposed PVAE model achieves better accuracy when the dataset and the corpus are extended with three extra colours or six different objects.
Fig. 4. Hidden features of language (a) and hidden features of action (b): PCA was performed jointly on the hidden features of 36 descriptions and the
hidden features of 144 actions. For (b), each unique action (12 in total) occurs 12 times as there are 12 possible cube arrangements; therefore, 144 points are
shown. For both (a) and (b), we label the points according to descriptions, i.e., for (b), actions are also labelled according to their paired descriptions. As can
be seen from the legend, different shapes, colours and fillings indicate the verb (action type), object colour and adverb (speed), respectively.
Additionally, standard autoencoders are fairly limited in coping with the diversification of language, as they do not have the capacity to learn the mapping between an action and many descriptions. In contrast, variational autoencoders yield remarkably better results in one-to-many translation between actions and descriptions, because the stochastic generation (random normal distribution) within the latent feature extraction allows latent representations to vary slightly, which leads to VAEs learning multiple descriptions rather than a particular description for each action.

A closer look into the action-to-language translation accuracies achieved by the PRAE for Experiments 1a and 1b shows that having more variety in the data (i.e., more colour options for cubes) does not help the standard autoencoder approach to learn one-to-many binding between action and language. Both in the first case with three colour alternatives and in the second case with six colour alternatives, the PRAE manages to translate only around one third of the samples from actions to descriptions correctly. In contrast, the accuracies achieved by our proposed PVAE for both datasets prove that the variational autoencoder approach can benefit from more data, as the test accuracy for the 'PVAE + channel-separated CAE' goes up by approximately 10% to 100% when three more colour options are added to the dataset.

Furthermore, training the PVAE with the visual features extracted by the standard CAE demonstrates that training and extracting features from each RGB channel separately mitigates the colour distinction issue for cubes when the visual input, as in our setup, includes objects covering a relatively small portion of the visual field. The 'PVAE + regular CAE' variant performs significantly worse than our 'PVAE + channel-separated CAE' approach. This also demonstrates the importance of the visual modality for the overall performance of the approach. Our analysis of the incorrectly translated descriptions shows that a large share of the errors committed by the 'PVAE + regular CAE' were caused by cube colour distinction failures, such as translating 'slide red fast' as 'slide green fast', which proves the channel-separated CAE's superiority over the standard CAE in visual feature extraction in our scenario. Moreover, using the channel-separated CAE for visual feature extraction rather than the standard CAE results in better action-to-language translation accuracy even when the objects are of various shapes. This indicates that the channel-separated CAE not only works well with cubes of different colours but also with objects of different shapes.
We emphasise the superiority of channel separation in our scenario, which is tested and proven in a simulation environment. For real-world scenarios with different lighting conditions, it is advisable to also take the channel interaction [26] into account to achieve more robust visual feature extraction.

Experiment 2 indicates the potential of utilising a pretrained language model like BERT for the interpretation of language descriptions. This extension produces comparable results to the original PVAE with the LSTM language encoder in language-to-action and action-to-language translations. The drop in language-to-language performance to 80% is most probably caused by the asymmetric language VAE of the PVAE-BERT model, which consists of a feedforward BERT encoder with attention mechanisms, which reads the entire input sequence in parallel, and of a recurrent LSTM decoder, which produces the output sequentially. A previous study on a text classification task also shows that LSTM models outperform BERT on a relatively small corpus because, with its large number of parameters, BERT tends to overfit when the dataset size is small [10]. Furthermore, we have also tested the PVAE-BERT, which was trained on predefined descriptions, with full sentence descriptions - e.g., 'push the blue cube slowly' for 'push blue slowly' - and with variations of the descriptions that have a different word order. We have confirmed that PVAE-BERT achieves the same performance in language-to-action and language-to-language translations. This is promising for the future because the pretrained BERT allows the model to understand unconstrained natural language commands that do not conform to the defined grammar.

The PCA conducted on the hidden features of PVAE-BERT shows that our method can learn language and robot actions compositionally and semantically. Although it is not explicitly reported, we have also confirmed that both the PVAE and PVAE-BERT are able to reconstruct joint values almost perfectly when we analysed the action-to-action translation results. Together with the language-to-language performance, the action-to-action capability of both variants of our architecture demonstrates that the two variational autoencoders (language and action) in our approach retain their reconstructive nature.

VI. CONCLUSION

In this study, we have reported the findings of previous work and its extension with several experiments. We have shown that variational autoencoders outperform standard autoencoders in terms of one-to-many translation of robot actions to descriptions. Furthermore, the superiority of our channel-separated visual feature extraction has been proven with an extra experiment that involves different types of objects. In addition, using the PVAE with a BERT model pretrained on large text corpora, instead of the LSTM encoder trained on our small predefined grammar, unveils promising scaling-up opportunities for the proposed approach, and it offers the possibility to map unconstrained natural language descriptions to actions.

In the future, we will collect descriptions via crowdsourcing in order to investigate the viability of using a pretrained language model as an encoder to relate language to motor control. We will also seek ways to bind the two modalities in a more biologically plausible way. Moreover, increasing the complexity of the scenario with more objects in general and on the table simultaneously may shed light on the scalability of our approach. Lastly, we will transfer our simulation scenario to the real world and conduct experiments on the real robot.

ACKNOWLEDGMENT

The authors gratefully acknowledge support from the German Research Foundation DFG, project CML (TRR 169).

REFERENCES

[1] Amina Adadi and Mohammed Berrada. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access, 6:52138-52160, 2018.
[2] Ahmed Akakzia, Cédric Colas, Pierre-Yves Oudeyer, Mohamed Chetouani, and Olivier Sigaud. Grounding Language to Autonomously-Acquired Skills via Goal Generation. In International Conference on Learning Representations, Virtual (formerly Vienna, Austria), 2021.
[3] Alexandre Antunes, Alban Laflaquiere, Tetsuya Ogata, and Angelo Cangelosi. A bi-directional multiple timescales LSTM model for grounding of actions and verbs. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2614-2621, 2019.
[4] John Arevalo, Thamar Solorio, Manuel Montes-y Gomez, and Fabio A. González. Gated multimodal networks. Neural Computing and Applications, 32(14):10209-10228, 2020.
[5] Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicolas Pinto, and Joseph Turian. Experience grounds language. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 8718-8735. Association for Computational Linguistics, November 2020.
[6] Joyce Y. Chai, Qiaozi Gao, Lanbo She, Shaohua Yang, Sari Saba-Sadiya, and Guangyue Xu. Language to action: Towards interactive task learning with physical agents. In IJCAI, pages 2-9, 2018.
[7] François Chollet. Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1800-1807, 2017.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1), 2019.
[9] Aaron Eisermann, Jae Hee Lee, Cornelius Weber, and Stefan Wermter. Generalization in multimodal language learning from simulation. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2021), Jul 2021.
[10] Aysu Ezen-Can. A comparison of LSTM and BERT for small corpus. arXiv preprint arXiv:2009.05451, 2020.
[11] Jun Hatori, Yuta Kikuchi, Sosuke Kobayashi, Kuniyuki Takahashi, Yuta Tsuboi, Yuya Unno, Wilson Ko, and Jethro Tan. Interactively picking real-world objects with unconstrained spoken language instructions. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3774-3781. IEEE, 2018.
[12] Stefan Heinrich and Stefan Wermter. Interactive natural language acquisition in a multi-modal recurrent neural architecture. Connection Science, 30(1):99-133, 2018.
[13] Stefan Heinrich, Yuan Yao, Tobias Hinz, Zhiyuan Liu, Thomas Hummel, Matthias Kerzel, Cornelius Weber, and Stefan Wermter. Crossmodal language grounding in an embodied neurocognitive model. Frontiers in Neurorobotics, 14:52, 2020.
[14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[15] Matthias Kerzel, Theresa Pekarek-Rosin, Erik Strahl, Stefan Heinrich, and Stefan Wermter. Teaching NICO how to grasp: an empirical study on crossmodal social interaction as a key factor for robots learning from humans. Frontiers in Neurorobotics, 14:28, 2020.
[16] Matthias Kerzel, Erik Strahl, Sven Magg, Nicolás Navarro-Guerrero, Stefan Heinrich, and Stefan Wermter. NICO—Neuro-Inspired COmpanion: A developmental humanoid robot platform for multimodal interaction. In 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pages 113–120, 2017.
[17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR, San Diego, CA, USA, May 7-9, 2015.
[18] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of International Conference on Learning Representations (ICLR), Banff, AB, Canada, April 14-16, 2014.
[19] Corey Lynch and Pierre Sermanet. Language conditioned imitation learning over unstructured data. Robotics: Science and Systems, 2021.
[20] Joseph Marino. Predictive coding, variational autoencoders, and biological connections. Neural Computation, 34(1):1–44, 2021.
[21] Hwei Geok Ng, Paul Anton, Marc Brügger, Nikhil Churamani, Erik Fließwasser, Thomas Hummel, Julius Mayer, Waleed Mustafa, Thi Linh Chi Nguyen, Quan Nguyen, et al. Hey robot, why don’t you talk to me? In 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pages 728–731, 2017.
[22] Tetsuya Ogata, Masamitsu Murase, Jun Tani, Kazunori Komatani, and Hiroshi G. Okuno. Two-way translation of compound sentences and arm motions by recurrent neural networks. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1858–1863, 2007.
[23] Haşim Sak, Andrew Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of Interspeech 2014, pages 338–342, 2014.
[24] Lin Shao, Toki Migimatsu, Qiang Zhang, Karen Yang, and Jeannette Bohg. Concept2robot: Learning manipulation concepts from instructions and human demonstrations. In Proceedings of Robotics: Science and Systems (RSS), 2020.
[25] Mohit Shridhar, Dixant Mittal, and David Hsu. INGRESS: Interactive visual grounding of referring expressions. The International Journal of Robotics Research, 39(2-3):217–232, 2020.
[26] Du Tran, Heng Wang, Matt Feiszli, and Lorenzo Torresani. Video classification with channel-separated convolutional networks. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 5551–5560, 2019.
[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[28] Valentin Vielzeuf, Alexis Lechervy, Stéphane Pateux, and Frédéric Jurie. Centralnet: A multilayer approach for multimodal fusion. In Laura Leal-Taixé and Stefan Roth, editors, Computer Vision – ECCV 2018 Workshops, pages 575–589, Cham, 2019. Springer International Publishing.
[29] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[30] Tatsuro Yamada, Hiroyuki Matsunaga, and Tetsuya Ogata. Paired recurrent autoencoders for bidirectional translation between robot actions and linguistic descriptions. IEEE Robotics and Automation Letters, 3(4):3441–3448, 2018.
[31] Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-hsuan Sung, Brian Strope, and Ray Kurzweil. Multilingual universal sentence encoder for semantic retrieval. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 87–94, Online, July 2020. Association for Computational Linguistics.
[32] Ozan Özdemir, Matthias Kerzel, and Stefan Wermter. Embodied language learning with paired variational autoencoders. In 2021 IEEE International Conference on Development and Learning (ICDL), pages 1–6. IEEE, Aug 2021.

Ozan Özdemir is a doctoral candidate working as a research associate in the Knowledge Technology Group, University of Hamburg, Germany. He holds a BSc degree in computer engineering from Yildiz Technical University and received his MSc degree in Intelligent Adaptive Systems from the University of Hamburg. His research interests are embodied and crossmodal language learning, autoencoders, recurrent neural networks and large language models.

Matthias Kerzel received his MSc and PhD in computer science from the Universität Hamburg, Germany. He is currently a postdoctoral research and teaching associate at the Knowledge Technology Group of Prof. Stefan Wermter at the University of Hamburg. He has given lectures on Knowledge Processing in Intelligent Systems, Neural Networks and Bio-inspired Artificial Intelligence. He is currently the Secretary of the European Neural Network Society and has served on the organising committee of the International Conference on Artificial Neural Networks. His research interests are in developmental neurorobotics, hybrid neurosymbolic architectures, explainable AI and human-robot interaction. He is currently involved in the international SFB/TRR-169 large-scale project on crossmodal learning.

Cornelius Weber graduated in physics at Universität Bielefeld, Germany, and received his PhD in computer science at Technische Universität Berlin. He subsequently held positions as a Postdoctoral Fellow in Brain and Cognitive Sciences, University of Rochester, USA; Research Scientist in Hybrid Intelligent Systems, University of Sunderland, UK; and Junior Fellow at the Frankfurt Institute for Advanced Studies, Germany. Currently, he is Lab Manager at Knowledge Technology, Universität Hamburg. His interests are in computational neuroscience, development of visual feature detectors, neural models of representations and transformations, reinforcement learning and robot control, grounded language learning, human-robot interaction and related applications in social assistive robotics.

Jae Hee Lee is a postdoctoral research associate in the Knowledge Technology Group, University of Hamburg, Germany. He has worked on topics in multimodal learning, grounded language understanding, and spatial and temporal reasoning. He received his Diplom degree in mathematics and his doctoral degree in computer science from the University of Bremen, Germany. He was a postdoctoral researcher at the Australian National University, the University of Technology Sydney (Australia) and Cardiff University (UK).

Stefan Wermter (Member, IEEE) is currently a Full Professor with the University of Hamburg, Hamburg, Germany, where he is also the Director of the Department of Informatics, Knowledge Technology Institute. He is a co-coordinator of the International Collaborative Research Centre on Crossmodal Learning (TRR-169) and a coordinator of the European Training Network TRAIL on transparent interpretable robots. His main research interests are in the fields of neural networks, hybrid knowledge technology, cognitive robotics and human–robot interaction. He is an Associate Editor of Connection Science and the International Journal for Hybrid Intelligent Systems, and is on the Editorial Board of the journals Cognitive Systems Research, Cognitive Computation and Journal of Computational Intelligence. He is serving as the President of the European Neural Network Society.