robotics

Article
Developing Emotion-Aware Human–Robot Dialogues
for Domain-Specific and Goal-Oriented Tasks †
Jhih-Yuan Huang, Wei-Po Lee *, Chen-Chia Chen and Bu-Wei Dong
Department of Information Management, National Sun Yat-sen University, Kaohsiung 80424, Taiwan;
[email protected] (J.-Y.H.); [email protected] (C.-C.C.); [email protected] (B.-W.D.)
* Correspondence: [email protected]
† This paper is an extended version of our paper Huang, J.-Y.; Lee, W.-P.; Dong, B.-W. Learning Emotion
Recognition and Response Generation for a Service Robot. In Proceedings of the 6th IFToMM International
Symposium on Robotics and Mechatronics, Taipei, Taiwan, 28–30 October 2019; pp. 286–297.

Received: 30 March 2020; Accepted: 3 May 2020; Published: 7 May 2020

Abstract: Developing dialogue services for robots has been widely promoted as a way of providing natural
human–robot interactions to enhance user experiences. In this study, we adopted a service-oriented
framework to develop emotion-aware dialogues for service robots. Considering the importance of
the contexts and contents of dialogues in delivering robot services, our framework employed deep
learning methods to develop emotion classifiers and two types of dialogue models for the dialogue services.
In the first type of dialogue service, the robot works as a consultant, able to provide domain-specific
knowledge to users. We trained different neural models for mapping questions and answer
sentences, tracking the human emotion during the human–robot dialogue, and using the emotion
information to decide the responses. In the second type of dialogue service, the robot continuously
asks the user questions related to a task with a specific goal, tracks the user’s intention through
the interactions and provides suggestions accordingly. A series of experiments and performance
comparisons were conducted to evaluate the major components of the presented framework and the
results showed the promise of our approach.

Keywords: human–machine interaction; service robot; emotion recognition; dialogue modeling; deep learning

1. Introduction
Researchers and engineers have been building service robots that can interact with people and
achieve given tasks. To deploy practical service robots, two major concerns need to be seriously
considered, including the system architecture for launching the services and the creation of the service
functions. At present, the services are mostly laboring services, in which robots take actions in the
physical environment to assist people. However, robots are now expected to play more important roles
in providing domain-specific knowledge services and task-oriented services. To deliver these services,
robots communicate with users naturally through spoken language, because conversation is a
key instrument for developing and maintaining mutual relationships. Following our previous studies
that adopted a service-oriented architecture to develop action-oriented robot services, in this work we
presented a trainable framework for modeling emotion-aware human–robot dialogues to provide the
aforementioned services.
Regarding the many choices of supportive software architecture, some researchers have proposed
to adopt cloud-based service-oriented architecture (SOA). SOA is an architectural style based on
interacting software components, providing services as fundamental units to design, build and compose
the service-oriented software systems [1]. A service is a function made available by a service provider

in order to deliver results to a consumer. Moreover, services are autonomous platform-independent
entities that can be described, published, discovered and loosely coupled. To effectively and efficiently
deploy different kinds of services, researchers have proposed to link SOA to a cloud computing
environment. In this way, the robots are no longer limited by onboard computation, memory
and programming, leading to a more intelligent robotic network. Our former work implemented a
cloud-based system to support a variety of user-created services [2,3]. To ensure its expandability and
shareability, we constructed a service configuration mechanism and deployed the system on the ROS
(robot operating system, [4]) computing nodes in practice.
The most common way for achieving natural language-based human–robot interaction is to build
a dialogue system to be a vocal interactive interface. Essentially, the dialogue system includes a
knowledge base (i.e., dataset) with organized domain questions and their corresponding answers and
the dialogue service is to design an accurate mapping mechanism that can correctly retrieve answers in
response to the users’ questions. The system operates in a question-answering manner, and most
traditional approaches are based on hand-crafted rules or templates. Recently, the deep learning-based
methods have been successfully employed to infer neural models for question and answer sentences.
These neural systems mainly use a sequence to sequence (seq2seq) model as a backbone to perform
mappings from entire sequences of words or characters to other sequences, for example [5,6]. In addition
to the dialoguing content, emotion plays a significant role in determining the relevance of the answer
to a specific question. By integrating emotion information into the applications, a service system
can enable its services to automatically adapt to changes in the operational environment, leading to
enhanced user experience.
To enhance the service performance and equip the robot with social competences, in this work,
we developed an emotion-aware human–robot dialogue framework extended from our previous
research presented in [7], with a series of additional experiments and newly developed dialogue services.
To this end, this extended framework included two types of dialogue services. One was to enable the
robot to work as a consultant to provide domain-specific knowledge services. The main focus was on
constructing a deep learning model for mapping questions and answer sentences, tracking the human
emotion during the process of the human–robot dialoguing and using this additional information to
determine the relevance of the sentences obtained by the model. The other was to provide task-oriented
dialogue services, which have raised considerable interest due to their broad applicability for assisting users in
achieving specific goals (e.g., for booking flight tickets or scheduling meetings). To verify the presented
approach, we conducted a series of experiments as described below to evaluate the major system
components. The results showed the effectiveness and efficiency of the presented approach.
The remainder of this paper is arranged as follows. Section 2 provides the research background
and reviews the dialogue-related research work. Section 3 describes the framework, including the
functional modules of emotion classification and dialogue response selection, and the deep learning
techniques used for modeling. Section 4 presents the experimental outcomes and the performance
comparisons of the different methods. Finally, Section 5 concludes the paper.

2. Related Works
As mentioned previously, at present most of the service robot frameworks have been connected to
various cloud-computing environments to exploit their large amounts of resources. Among others, the
most representative work is RoboEarth [8], driven by an open-source cloud robotics platform [9]. With
this platform, the robots can distribute highly loaded computation to the cloud and access the RoboEarth
knowledge repository to download required resources. There are also other platforms developed for
cloud robotic systems. For example, Pereira et al. proposed the ROSRemote framework [10], which
enabled users to work with ROS remotely to create several applications. More extensive surveys
can be found in [11,12]. More recently, due to the rapid advances of the Internet of Things (IoT),
researchers proposed the concept of the Internet of Robot Things (IoRT) to describe a new approach
to robotics [13,14]. In this way, smart devices can monitor events, fuse sensor data from a variety of
sources and use local and distributed intelligence to determine a best course of action. This expands the
ability of service robots, improves a robot’s understanding during the human–machine interaction and
leads to a more intelligent robotic network. Moreover, to deal with the scalability problem, researchers
have started to extend the cloud computing concept for service robots to edge or fog computing to
utilize the resources in a more efficient way [15,16].
Instead of investigating issues related to resource allocation and utilization, this work aimed to
develop emotion-aware dialogues for a service robot, in which the most important issues were to
recognize the emotions from the user utterances and to generate appropriate machine responses. Many
methods have been proposed to solve these problems from different perspectives. Because this work
adopted deep learning models to address the above two issues, in the following we discuss the most
relevant studies with similar computational methods.
In general, using a deep learning-based approach to develop dialogues, responses are generated
based on sequence-to-sequence (seq2seq) neural network models, with an objective function of the
maximum-likelihood estimation [17]. This model treats dialogue modeling as learning a mapping
between human utterances and machine responses. The focus is on how to generate a suitable response
from a corpus to a human utterance. For the training of dialogue models, generative and retrieval-based
methods are often used. Although generative methods have the potential to generate sentences of rich
content, current generative models often have the disadvantages of lacking coherence and producing
unnatural responses. In contrast, though retrieval-based methods are more restricted, they have the
advantage of producing informative and fluent responses. Thus, the retrieval-based methods are
more practical. As can be observed, retrieval-based methods rely on the exploitation of a large and
varied corpus (human–human or human–machine interactions) [18] and deep learning models have
been employed to derive mappings (that is, a selection mechanism) between questions and answers
(e.g., [5,19]).
The basic seq2seq model consists of two recurrent neural networks (RNNs): one works as an
encoder to process the input; the other, a decoder to generate the output. With the characteristic of
making predictions based on running texts of varying lengths, the long short-term memory networks
(LSTMs) are often adopted to train the answer selection mechanism. This model has now been widely
applied to conversation generation and most existing works have mainly focused on developing more
advanced techniques (such as decoding strategies or network models) to improve the content quality
of the responses. Many neural dialogue systems have been constructed based on this design principle.
For example, Serban et al. used a hierarchical LSTM network for a conversation application [20], and
Wen et al. proposed a task-oriented model to generate the correct answers in response to the needs
of the given dialogue [21]. To overcome the problem of overly general (i.e., safe) responses, Wu et al.
proposed a hybrid-level encoder–decoder model, which utilized both word-level and character-level
features [22]. Although these models, in theory, are better at maintaining the dialogue state using
memory components, they require longer training time and excessive searching for hyper-parameters.
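For reference, a minimal Keras sketch of such an encoder–decoder (seq2seq) pair of LSTMs is given below. All sizes, names and hyper-parameters are illustrative assumptions, not values from any of the cited systems.

```python
from tensorflow.keras import layers, models

VOCAB, EMB, HIDDEN, MAX_LEN = 8000, 128, 256, 30  # illustrative sizes

# Encoder: reads the input utterance and keeps its final LSTM states.
enc_in = layers.Input(shape=(MAX_LEN,))
enc_emb = layers.Embedding(VOCAB, EMB)(enc_in)
_, state_h, state_c = layers.LSTM(HIDDEN, return_state=True)(enc_emb)

# Decoder: generates the response conditioned on the encoder states.
dec_in = layers.Input(shape=(MAX_LEN,))
dec_emb = layers.Embedding(VOCAB, EMB)(dec_in)
dec_out, _, _ = layers.LSTM(HIDDEN, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
logits = layers.Dense(VOCAB, activation="softmax")(dec_out)

seq2seq = models.Model([enc_in, dec_in], logits)
seq2seq.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```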
In contrast to the above domain-specific dialogue systems that aim to generate fluent and
engaging responses, the other type of neural dialogue systems that has attracted a lot of attention
is task-oriented [23,24]. Task-oriented dialogue systems need to complete a specific task (to achieve
a goal), for example, restaurant reservation, by interacting with users (i.e., a response generation
process). Existing task-oriented systems can be divided into two categories: the modularized pipeline
and the end-to-end single-module systems. The former decomposes the task-oriented dialogue task
into modularized pipelines to be solved separately, while the latter proposes to use an end-to-end
model to produce a sequence of output tokens directly to solve the overall task. End-to-end systems
are often superior to pipeline systems, due to their unique characteristics, such as global
optimization and easier adaptation to new domains. In the task-oriented dialogue systems, the most
critical component is the goal tracker [25]. The system must update the state of the dialogue according
to each user’s query and their intent. Given the current dialogue state, the system can then decide how
to respond best to the user to accomplish the desired task.
In addition to employing more sophisticated models and advanced tuning mechanisms towards
proper response generation, some recent works attempted to augment the emotional information of
the neural dialoguing models to generate more meaningful and humanized machine responses. For
example, Zhou et al. presented a model that assumed the emotion category of human utterance was
known and taken as an additional input to train a model of responses [26]. Sun et al. adopted a LSTM
neural network for conversation modeling [27] in which an emotional category label was added to the
encoder, which regarded emotional information as an additional source to the conversational model.
Moreover, Asghar et al. discussed the feasibility of employing emotion information to help generate
diverse responses [28]. They proposed a model of affective response generation to generate sentences
conditioned on emotional word embeddings, affective objective functions and diverse beam search.
However, these methods only focused on emotional factors while ignoring content relevance, possibly
resulting in a decline in the quality and diversity of a response. The integration of emotion and content
is still a challenging task for several reasons. The first is that high-quality emotion-labeled data are
difficult to obtain in a large-scale corpus because emotions are subjective and difficult to annotate.
Moreover, it is difficult to deal with emotions coherently because balancing grammaticality and the
expressions of emotions is needed [29].

3. Developing Human–Robot Dialogues

3.1. The Framework


In this work, we adopted a service-oriented robotic framework that could provide various services
and resources and develop emotion-aware dialoguing services. This computing platform included
two parts: the on-board processors mounted on the robot side (to handle robot functions requiring fast
responses, such as those related to perception and actuation) and the computing nodes located on the
cloud side to perform highly loaded computing services (such as service planning and deep learning).
To realize the proposed design in practice, we configured the framework with ROS to deliver different
types of services. Figure 1 illustrates our robotic system architecture and its ROS configuration. As
shown, the Graphics Processing Unit (GPU) acceleration virtual machine (VM) and the cloud parallel
computing virtual machine are used to support computation. To provide different services on the
cloud, we defined different types of computing nodes in the framework. Through the ROS frame
protocol, where the management of data interchange is between nodes, the framework could easily
combine different services to launch new functions. The module of service planning was described in
our previous work [2,3]. Here, we focused on the dialogue module, in which the major functional
components were indicated.

Figure 1. Overview of the proposed framework for the human–robot dialogues.

Our framework included two types of dialogue services, one for domain-specific dialogues and
the other for task-specific (task-oriented) dialogues. In contrast to the open-domain conversation
performed by the general purpose chatbots, the domain-specific dialogue presented here aims to provide
knowledge services of a certain domain (e.g., finance or insurance) through a question-answering
manner between the user and the robot. In contrast, the task-oriented dialogue service was to achieve
the specific goal for a certain task (e.g., restaurant recommendation) by conducting the iterative
human–robot dialogue to adapt to the user’s intention or preference related to the task goal. This type
of service is especially important in the coming conversational commerce.
At present, the functions of user identification and emotion recognition are constructed
independently from the dialogue model, mainly because of the lack of a dataset containing complete
information of a human face, utterance emotion and dialoguing content. The current strategy was that
the identified user was assigned to a certain type of user group and the corresponding model was
retrieved to perform dialoguing. Then, the candidate sentences produced by the model were re-ranked
(based on the recognized emotion) following a set of hand-crafted rules and the sentence with the
highest rank was selected as the robot’s response. In this work, we only applied the emotion mechanism
to the first type of dialogue (i.e., domain-specific) as a representative example of emotion-aware services.
The same approach can also be applied to the task-oriented dialogue service. The major components of
our framework are described in the following subsections.

3.2. Learning Emotion Recognition

3.2.1. Text Processing


In addition to the traditional text processing steps to clean and purify texts, we apply semantic
rules to perform sentence segmentation. For example, if there is a disjunctive such as “but” or
“although” in the sentence, the emotion of the entire sentence is usually biased toward the former or
the latter clause. To tackle such a problem, this study adopted a set of five semantic rules (selected from
those proposed in [30]) to perform more precise sentence segmentation. For example, using one of the
rules: “If a sentence contains but, disregard all previous sentiment and only take the sentiment of the
part after but,” the sentence “I really, really, really wanna go, but I can’t.” is simplified to be “I can’t”.
The details of the rules can be found in [30].
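For illustration, the "but" rule quoted above can be applied as in the following minimal sketch (the function name is ours and this is not the authors' actual implementation):

```python
# Minimal sketch of the "but" rule described above (illustrative only).
def apply_but_rule(sentence: str) -> str:
    """If a sentence contains 'but', keep only the clause after 'but'."""
    lowered = sentence.lower()
    if " but " in lowered:
        idx = lowered.rindex(" but ")          # keep the text after the last "but"
        return sentence[idx + len(" but "):].strip(" ,.")
    return sentence

print(apply_but_rule("I really, really, really wanna go, but I can't."))
# -> "I can't"
```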
After the above sentence segmentation, we employed the Natural Language Processing Toolkit
(NLTK, [31]) to build a dictionary, consisting of more than 7000 words, of which the most frequent stop
words were removed. However, because the dialogue dataset used for building classifiers contained
some short responses (such as “He?” and “You?”), the list of stop words was thus not fully applied
to filter them out. In addition, adverbs such as “more”, “most” and “very” act as tone intensifiers in
conversation; therefore, they were retained. Then, a procedure of stemming was performed to strip off
word endings, reducing them to a common core or stem.
As indicated above, our framework adopted deep learning for model training and an encoding
(embedding) scheme was needed to transfer the natural language sentences into vector representations.
Therefore, once the word processing procedure was completed, the GloVe method (Global Vectors
for Word Representation) was employed to map the words into vectors, due to its high training
efficiency [32]. The training process was performed on aggregated global word–word co-occurrence
statistics from a corpus. Through the mapping, the words were represented by real numbers and
words with similar meanings could have similar representations. In this study, the words were
mapped into vectors of 300 dimensions. GloVe provides pre-trained word vectors, covering a vocabulary of
400 k words trained from a corpus of 6 billion tokens. The vectors were used as the input
of the training algorithm to build the model.
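The embedding step described above can be sketched as follows; the GloVe file name follows the standard public 6B/300d release and the helper names are illustrative assumptions, not details from the original implementation.

```python
import numpy as np

# Load pre-trained GloVe vectors (6B tokens, 300 dimensions) into a dictionary.
def load_glove(path="glove.6B.300d.txt"):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove()

def embed_sentence(tokens, dim=300):
    """Map a list of preprocessed tokens to a (len, 300) matrix;
    unknown words fall back to a zero vector."""
    return np.stack([glove.get(t, np.zeros(dim, dtype=np.float32)) for t in tokens])
```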
3.2.2. Learning Emotion Classifiers

In this work, we trained a deep learning network to recognize emotions from the utterances in dialogues. Figure 2 illustrates the model that includes a convolutional neural network (CNN) followed by a long short-term memory network (LSTM). As shown, the inputs are the dialoguing sentences processed and converted to the vectors by the procedure described above. In this network, three convolutional layers with lengths of three, four and five were arranged to extract the local features of the sentences. Then, the features were combined and served as the input of the next learning layer (i.e., LSTM).

Figure 2. The deep learning model used for the emotion recognition.

It has been well known that LSTM can overcome the vanishing gradient problem in gradient-based machine learning methods. However, this situation still occurs when the sentence length is too long and the network needs to be deepened. In this work, we adopted LSTM with ReLU (Rectified Linear Unit [33]) to train a better model, as ReLU was proved to be effective in overcoming the vanishing gradient problem. Moreover, ReLU has the property of sparse activation, making the neural network sparse to alleviate the problem of over-fitting. In the above learning process, the widely adopted gradient descent optimization algorithm Adam [34] was used as an optimizer.

As shown in Figure 2, we used the activation function widely used in deep learning models, “Softmax”, to map the outputs of the neurons into the interval of (0–1). In this way, a probability distribution over the possible classes could be obtained and the node with the highest probability was selected as our predicted emotion class. To calculate the error between the predicted class and the actual class, a loss function was used and the weight update of the deep neural network was performed accordingly. Here, the function “LabelEncoder” of the machine learning tool scikit-learn (https://scikit-learn.org/) and the loss function “categorical_crossentropy” of the deep learning framework Keras (https://keras.io/) were employed to normalize the class label and convert it into a one-hot code of the binary matrix to perform the numerical calculation.
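A minimal Keras sketch of a CNN-LSTM classifier of the kind outlined above is given below; the layer widths, vocabulary size, sequence length and number of filters are illustrative assumptions rather than the values used by the authors.

```python
from tensorflow.keras import layers, models

MAX_LEN, VOCAB, EMB_DIM, NUM_CLASSES = 50, 7000, 300, 6  # illustrative values

inputs = layers.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB, EMB_DIM)(inputs)

# Three parallel convolutions (kernel lengths 3, 4 and 5) extract local features.
convs = [layers.Conv1D(64, k, padding="same", activation="relu")(x) for k in (3, 4, 5)]
merged = layers.Concatenate()(convs)

# The combined features feed an LSTM layer; ReLU is used as described in the text.
h = layers.LSTM(128, activation="relu")(merged)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(h)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Labels can be normalized with scikit-learn's LabelEncoder and one-hot encoded, e.g.:
# from sklearn.preprocessing import LabelEncoder
# from tensorflow.keras.utils import to_categorical
# y = to_categorical(LabelEncoder().fit_transform(raw_labels))
```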

3.3. Domain-Specific Dialogue Modeling

3.3.1. Learning Dialogue Models

To develop dialogues for the robot, we adopted the neural language model from our previous works [3,35] for training the answer selection mechanism. Figure 3 illustrates our model that included a LSTM network with a CNN network. The LSTM contained memory blocks in the recurrent hidden layer that could store the temporal state of the network. With this characteristic, this model could better capture information over longer time steps to meet our goal.

For training the deep learning model, the sentences were organized as the question-answering pairs. The question sentence Q was the input question encoded into an internal vector form QV by the word-embedding procedure described above. To enhance the performance, we established the word2vec [36] weights for the entire corpus and used them as the pretrained model of the embedding layer. The output then flows to the LSTM and CNN layers. In this procedure, for each question Q there was a corresponding positive answer A+ with a very high probability to be the correct answer among all the answers in the dataset (i.e., the confirmed correct answer). As shown in Figure 3, after the embedding layer, an output vector E was obtained and then calculated through the LSTM function to derive a hidden vector L as the following:

E = EMBED(x1, ..., xn; We)  (1)

L = LSTM(E; WL)  (2)

In the above equations, E can be represented as {e1, e2, ..., en}, E ∈ R^(n×d), in which n is the maximal sentence length and d is the dimension of embedding. We is the weight matrix W ∈ R^(v×d) (v is the number of words in the dictionary), e is the vector embedded for word x and WL is the LSTM weight matrix.

Figure 3. The deep learning model used for the domain-specific human–robot dialogues.

For performance enhancement, we used the gensim package [37] to establish the weights for the entire corpus and used them as the pretrained model of the embedding layer. Through the LSTM layer described above, one can extract the features of word sequences in the sentences of our network. Furthermore, we also connected the tensor L (Equation (2)) to a convolutional layer to extract more complicated features for performance enhancement. As indicated in Figure 3, the “MaxPooling” function was performed and the “tanh” function was used to transfer and output the decoding result.
The above two functions have been widely used in deep learning models for language processing [38].
In the model training procedure, the question Q, the correct answer A+ and the wrong answer
A−(sampled from the answer space) are encoded into vector representations VQ , VA+ and VA− ,
respectively, and the similarities between the question and the two answers are calculated separately.
Here, the similarity of the two vectors is defined as

Similarity(VQ, VA) = [1 / (1 + ‖VQ − VA‖)] × [1 / (1 + exp(−γ(dot(VQ, VA) + c)))]  (3)

This equation was adopted from [38], and it has been shown to offer good performance. In the
above equation, the parameter γ is 1.0 and c is 1. VA is a positive or negative answer (i.e., VA+ or
VA− ). Then, the distance between the two similarities is compared (meaning the difference between an
answer and the ground truth) to a pre-defined margin m (a maximum number of steps often used to
reduce the running time). If the distance is less than m, the network parameters are updated; otherwise
another negative example is sampled until the distance is less than m. The above operations were to
ensure that the similarity distance (to be minimized) could reach a certain level. As defined in [38],
the loss function corresponding to the above similarity is:

Loss = max{0, m − Similarity(VQ, VA+) + Similarity(VQ, VA−)}  (4)
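A small numerical sketch of Equations (3) and (4) is given below, assuming the question and the two answers have already been encoded into fixed-length vectors; the helper names and the margin value are illustrative assumptions.

```python
import numpy as np

GAMMA, C, MARGIN = 1.0, 1.0, 0.2  # gamma and c follow the text; the margin is illustrative

def similarity(vq, va):
    """Equation (3): a Euclidean-distance term combined with a sigmoid-like dot-product term."""
    return (1.0 / (1.0 + np.linalg.norm(vq - va))) * \
           (1.0 / (1.0 + np.exp(-GAMMA * (np.dot(vq, va) + C))))

def hinge_loss(vq, va_pos, va_neg):
    """Equation (4): max{0, m - Similarity(Q, A+) + Similarity(Q, A-)}."""
    return max(0.0, MARGIN - similarity(vq, va_pos) + similarity(vq, va_neg))
```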



During the human–robot dialoguing period (i.e., the test phase), this dialogue service calculates
the similarity between a question sentence (asked by the user) and each answer sentence (in the
knowledge base). A set of answers with the highest similarity scores is selected and they are re-ranked
by the pre-defined rules. The first-ranking sentence is then used as the robot’s response.

3.3.2. Knowledge Enrichment


In addition to the learning model and method, the dataset with the domain questions and the
corresponding answers also played a critical role in dialogue modeling, because a rich dataset represents
abundant knowledge for a system to interact with human users. It was thus important to include
more knowledge resources to enrich the dataset (meaning better conversation comprehension) for a
dialogue system equipped with a service robot. Many strategies can be developed to include more
knowledge resources (e.g., external knowledge resources) for dialogue modeling. In this work, we used
a language translation system to translate a dataset to achieve knowledge sharing between different
languages. This method was especially important for developing human–machine dialogues with a
resource-restricted language (i.e., one for which very few data are available for model training). Here, we translated a
dataset from English to Chinese as an example to investigate the corresponding effect.
Word segmentation was a very important sentence preprocessing step in the dialogue modeling
with the dataset in Chinese. This step was to determine word boundaries for a Chinese sentence. That
is, a sentence can be segmented into different combinations of words and therefore the ambiguity exists
for Chinese word segmentation. Several segmentation systems have been proposed for Chinese text.
Among others, the most often used segmentation systems are the CKIP (http://ckipsvr.iis.sinica.edu.tw/),
Stanford Parser (http://nlp.stanford.edu/software/lex-parser.shtml) and JIEBA (https://github.com/ldkrsi/jieba-zh_TW)
systems. Following a preliminary evaluation, we chose to use the JIEBA system to
perform the word segmentation. Then, the word embedding procedure was performed in which the
Wiki Chinese text documents were used to pre-train the corpus for performance enhancement, and the
same type of dialogue model can be trained by the deep learning method as in the above section.
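A minimal example of the segmentation step with the JIEBA package is shown below; the sample sentence is ours, not taken from the translated dataset.

```python
import jieba

# Segment a Chinese sentence into words before the word embedding step.
sentence = "請問旅遊平安險的理賠範圍是什麼"  # illustrative insurance-domain question
tokens = jieba.lcut(sentence)
print(tokens)
```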

3.4. Developing Task-Oriented Dialogues


In addition to the above domain-specific dialogue modeling, this section presents the task-oriented
subsystem we developed for the service robot to achieve practical applications with specific goals. As
mentioned previously, existing task-oriented dialogue methods can be divided into two categories:
modularized pipeline and end-to-end single-module systems. Among others, the hybrid code network
(HCN, [24]) is a popular and useful end-to-end framework for developing practical task-oriented
dialogue applications. It allows a developer to combine the data-driven learning method with
knowledge-based hand-coded rules. This approach can learn an RNN with considerably less training
data and express domain knowledge via software and action templates. Therefore, in this work,
we adopted a simplified HCN framework with some enhanced functions to develop task-oriented
dialogue services for the robot. Figure 4 presents our revised framework for the task-oriented dialogues.
We also implemented a restaurant recommendation application as an illustrative example. The goal
was to request the robot to make a restaurant reservation for a user, given all of the user’s constraints on the
location, cuisine, price range, atmosphere and party size, which were derived iteratively from the
human–robot dialogue.
The operational flow of our revised HCN included four major phases as illustrated in Figure 4.
The first phase, which was mainly for text processing, included three steps to extract different types of
features from a user utterance. The first step was to extract the context features (entities to be traced).
Since we used the DSTC dataset (Dialog State Tracking Challenge dataset [39]) for network training,
the context features here were the same as the original dataset: atmosphere, cuisine, location, party
size and price, each with a value of 0 or 1 (as a placeholder in the entity tracking slot). The second step
was to extract the words (bag of words) to be representatives and the one-hot encoding scheme was
used to form a word vector. The third step was to perform word embedding and here the word2vec
was employed. As shown in the figure, in the second phase the text and entities mentioned were then passed to a module of dialogue state tracking, which grounds and maintains entities. In contrast to the original HCN work, we adopted a deep CNN network and defined a label ontology to further improve the state tracking performance (described in Section 3.4.1).

In the third phase, the results obtained from the above phases were then concatenated to be a feature vector and a traditional LSTM was adopted for training. As shown in Figure 4, the output of the LSTM model was passed to a dense layer with a Softmax activation, in which the output dimension was equal to the number of distinct action templates. The output was a distribution over the action templates. In the fourth phase, the action mask was applied and an action was selected accordingly. Then, the selected action was used to produce a fully formed action. Following the above phases, a recommendation module was developed to revise some entities according to the user’s preferences. The details are described in Section 3.4.2.

Figure 4. The framework used for the task-oriented dialogues.

3.4.1. Belief Tracker

The belief tracker (i.e., the dialogue state tracking) is an important component in a dialogue system in which a dialogue state is a full and temporal representation of each participant’s intention. A belief tracker can track what has happened with the system outputs, user utterances and context from previous turns. It provides a direct way to validate the system’s understanding of the user’s goal at each dialogue step through the intention estimation.

Traditionally, the rule-based systems were built for state tracking, but they hardly model uncertainty. Recently, researchers have turned to develop neural models to overcome the uncertainty in tracking dialogue states. In task-oriented dialogue systems, the end-to-end neural networks have been successfully employed for state tracking via interacting with an external knowledge base. However, in task-oriented dialogues, a state tracker is usually trained from a large amount of manually annotated corpora. Considering the huge efforts required for human annotation, we used the available dataset for model training and focused on the model performance.

As indicated above, we adopted a simplified HCN model with a dialogue state tracker. However, in some situations the original state tracker could misjudge ambiguous user utterances or wrongly spelled words and produce incorrect answers. For example, using the original HCN tracker to analyze the user utterance “I’m asking my friend if she wants to do Rome”, the word “Rome” (entity value) is wrongly taken as the final location, but in fact the decision has not yet been made. This was because the original tracker uses a string-matching method for entity identification so the mismatches cannot be corrected. As the neural belief tracker was able to deliver a better performance [40], we thus adopted this method and used a deep CNN model to solve this problem. In addition, a small ontology was established to ensure the semantic correctness of the mentioned entities (i.e., slot values).
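The third and fourth phases described above (feature concatenation, an LSTM over the dialogue turns, a softmax over action templates and an action mask) can be sketched as follows; all dimensions and names are illustrative assumptions, not the actual configuration.

```python
import numpy as np
from tensorflow.keras import layers, models

N_TURNS = 10                       # dialogue turns per training sequence (illustrative)
FEAT_DIM = 300 + 500 + 5           # word embedding + bag-of-words + entity slots (illustrative)
N_ACTIONS = 16                     # number of distinct action templates (illustrative)

inputs = layers.Input(shape=(N_TURNS, FEAT_DIM))          # one feature vector per dialogue turn
h = layers.LSTM(128, return_sequences=True)(inputs)       # traditional LSTM over the dialogue
probs = layers.Dense(N_ACTIONS, activation="softmax")(h)  # distribution over action templates
policy = models.Model(inputs, probs)
policy.compile(optimizer="adam", loss="categorical_crossentropy")

def select_action(turn_probs, action_mask):
    """Fourth phase: apply the action mask, then pick the most probable allowed template."""
    masked = turn_probs * action_mask
    return int(np.argmax(masked))
```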

3.4.2. Autoencoder
Following the above dialogue flow, we developed a recommender (as shown in Figure 4) to enhance the system performance and user experience. This module was to refine some entities from the selected response according to the user preferences. In this work, we used a deep learning-based method and adopted the deep neural network and autoencoder [41,42] to realize collaborative recommendation.
Autoencoder is a superior tool for dimensionality reduction and it can be regarded as a strict generalization of principal component analysis. It is a network with implementations of two transformations (encoder and decoder), aiming to reconstruct inputs in the output layer via a low-dimensional latent space to predict the missing ratings. Then, the learning goal is to minimize the error between the original vector (input) and the transformed vector (output). One of the popular autoencoder-based recommendation models is AutoRec [42]. In this model, denoising techniques are used to discover more robust representations and to avoid learning an identity function. These techniques aim to learn the latent representations of the corrupted user-item preferences and they can be used to reconstruct the users’ full preferences and reduce the overfitting situations. In this work, our recommender was developed based on AutoRec. The overall architecture is illustrated in Figure 5, in which the encoder, code-layer and decoder are the major parts of the model (included in the dotted line rectangle). Both the encoder and the decoder consist of feed-forward neural networks with fully connected layers and the depth of the model was increased (marked as the deep stack) to enhance the corresponding performance.

Figure 5. The neural model used for the recommendation.
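A minimal Keras sketch of a deep AutoRec-style autoencoder as described above (encoder, code layer and decoder built from fully connected layers, with dropout as a simple form of input corruption) is shown below; the layer widths and the rating-vector length are illustrative assumptions.

```python
from tensorflow.keras import layers, models

N_ITEMS = 500  # length of a user's preference/rating vector (illustrative)

inputs = layers.Input(shape=(N_ITEMS,))
corrupted = layers.Dropout(0.2)(inputs)          # denoising: corrupt the observed preferences
# Encoder (deep stack of fully connected layers)
h = layers.Dense(256, activation="relu")(corrupted)
h = layers.Dense(128, activation="relu")(h)
code = layers.Dense(64, activation="relu")(h)    # low-dimensional code layer
# Decoder reconstructs the full preference vector
h = layers.Dense(128, activation="relu")(code)
h = layers.Dense(256, activation="relu")(h)
outputs = layers.Dense(N_ITEMS, activation="sigmoid")(h)

autorec = models.Model(inputs, outputs)
autorec.compile(optimizer="adam", loss="mse")    # minimize the reconstruction error
```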

4. Experiments and Results

To evaluate the presented emotion-aware dialoguing service for human–robot interaction, several sets of experimental trials were conducted. As mentioned previously, due to the lack of a dataset with full information on the human face, utterance emotion and dialoguing content, in the experiments we used four datasets to evaluate these modules separately. The evaluations are described in the following subsections.

4.1. Performance Metrics

In the experiments, we employed the criteria often used in data classification for performance evaluation and a five-fold cross-validation strategy was also used. We first measured the numbers of
true positives (TP), false positives (FP), true negatives (TN), false negatives (FN) and then used them to
calculate the metrics of accuracy (proportion of correctly predicted instances relative to all predicted
instances), precision (proportion of retrieved instances that were relevant), recall (proportion of relevant
instances that were retrieved) and F-measure (the combined effect of precision and recall that often
conflict in nature) [43]. The metrics are defined as follows:

accuracy = (TP + TN) / (TP + FP + TN + FN)  (5)

precision = TP / (TP + FP)  (6)

recall = TP / (TP + FN)  (7)

F-measure = (2 × precision × recall) / (precision + recall)  (8)
In addition to accuracy, to evaluate the performance of the answer selection in the dialogue
modeling, we adopted a statistical measure MRR (mean reciprocal rank, the average of the reciprocal
ranks of results for a sample of n queries). It is defined as

MRR = (1/n) · Σ_{i=1}^{n} (1 / rank_i)  (9)

where rank_i refers to the rank position of the first relevant document for the i-th query.
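These metrics can be computed directly from the prediction counts and the result ranks; a small sketch with illustrative inputs follows.

```python
def classification_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

def mean_reciprocal_rank(ranks):
    """Equation (9): ranks[i] is the rank of the first relevant answer for query i."""
    return sum(1.0 / r for r in ranks) / len(ranks)

print(classification_metrics(tp=80, fp=10, tn=95, fn=15))  # illustrative counts
print(mean_reciprocal_rank([1, 2, 1, 4]))                  # illustrative ranks
```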

4.2. User Identification


In this work, a cloud-based system was built for a service robot and we configured a ROS
framework on top of a Linux OS to connect the sensing camera nodes. Often, a system built with ROS
consists of a number of processes on a set of hosts, which are connected at runtime in a peer-to-peer
topology. Here, the ROS master was a PC running the roscore and serving as the resource center for all
the other ROS nodes connected to the network. The cloud parallel computing virtual machine had
eight CPUs and eight GB memory, and the GPU acceleration virtual machine had eight CPUs, 32 GB
memory and a NVIDIA Tesla K80 GPU.
For user identification, the experiments were conducted to evaluate the performance of face
recognition. The goal was to train the robot to recognize human faces in a static manner and we
adopted OpenCV (https://opencv.org, an open source computer vision library) to train the classifiers.
An online face dataset [44] was used. It included 90 image sets of different persons, in which each
set included face images taken from different viewpoints, from 90 to −90 degrees (stepping by 5).
The results showed that the trained classifiers performed the best in the recognition of the front face
images. The faces in the images could be detected correctly with a reasonable rate of accuracy when
the variation of the rotating angle was less than 30 degrees, and the faces could be recognized with a
good accuracy if the view angle was within the range of 10 to −10 degrees.
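For reference, a minimal OpenCV face-detection snippet of the kind used in this static recognition pipeline is shown below; the Haar cascade file ships with OpenCV, while the image file name is an illustrative placeholder.

```python
import cv2

# Load OpenCV's bundled frontal-face Haar cascade.
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("subject_00_angle_000.png")   # illustrative file name from a face image set
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
print(f"{len(faces)} face(s) detected")
```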

4.3. Performance of Emotion Recognition

4.3.1. Performance Evaluation


To assess the performance of the emotion recognition module, we adopted the dataset used in [45],
which was derived from the Movie Dialog Corpus. The sentences in this dataset were categorized into
six classes of emotions: fear, disgust, joy, sadness, anticipation and none (neutral). The deep learning
approach described in Section 3.2.2 was employed to train a model for multi-class emotion recognition.
In addition, two popular learning methods, the random forest (RF) and the support vector machine
(SVM) methods, were used for performance comparison.
For RF and SVM, we used the n-gram method to extract more text features from the original data for building classifiers to enhance their performance, in addition to the word features extracted from the text-processing procedure. N-gram can express the sequence relationships between the words, and the unigram, bigram and trigram (n is 1, 2 and 3, respectively) models are often used. After a preliminary test, in this work we used the above three models to extract more text features, and the combined feature vectors were used as the input of the above two machine learning methods (RF and SVM) to enhance their performance.

Figure 6a illustrates the accuracy, precision, recall and F-score for each of the three methods. As can be seen, RF performed the best in all the metrics. The main reason could be that RF is a type of ensemble machine learning algorithm and the way it handled (sampled) data for the grouped multiple classifiers made it perform better than the others for the imbalanced dataset here.

After comparing the three aforementioned methods, we applied two data processing techniques to the dataset, including semantic rules and data balance, with the above learning methods to investigate their effects in performance. For the semantic rules, the five rules mentioned in Section 3.2.1 were used to perform more precise sentence segmentation; for data balance, we adopted the scikit-learn tool to produce a set of specific class weights for different types of emotions. The results for accuracy, precision, recall and F-score are illustrated in Figure 6b. As can be seen, in general our CNN-LSTM method obtained the best results on all performance metrics. In addition to the data balance effect, the reason for the performance improvement could be that the semantic rules removed the irrelevant words and filtered out their effects on the sentence emotions. Thus, the learning methods were able to focus on the emotions delivered by the most related parts of the sentences to be predicted.

Figure 6. Results of the three machine learning methods; (a) without and (b) with the enhanced techniques of the semantic rules and data balance.

4.3.2. Comparisons with IBM Tone Analyzer

In addition to comparing the different machine learning methods, we evaluated a well known emotion detection system, the IBM Watson Tone Analyzer (https://natural-language-understanding-demo.ng.bluemix.net/), for further comparison. Interestingly, the emotions the Tone Analyzer considered were slightly different from what we defined in this work, and it gave degrees (values) of multiple emotions for an input sentence (also different from our work). To conduct the performance comparison, we projected the two sets of emotions (one for our work and one for the Tone Analyzer) into the well known emotional valence and arousal space (i.e., V-A space, [46]). In this space, valence indicates the hedonic value, ranging from unpleasant to pleasant; and arousal indicates the emotional intensity, ranging from inactive to active. The valence and arousal dimensions can be projected onto Euclidean space, where emotions are represented as point-vectors. In this way, the user’s emotion can be located in this space as a tuple of valence and arousal values.

In the experiments, we first projected the five classes (annotated in the dataset) into the V-A space
to retrieve the corresponding valence and arousal values (based on the emotion positions defined
in [46]). For each data instance (sentence), we took the positions of the actual (correct) class and the
predicted classes in the space and obtained the set of V-A values. Then, the emotion values produced
by the trained model were taken as class weights and the weighted sum was derived for these specific
data. Consequently, our method and the Tone Analyzer were compared.
For each data instance, we chose the two closest classes (with the largest weights) and calculated
their weighted distance to represent the distance between the predicted and the actual classes. To
compare the performances, we divided the distance into eight intervals and counted the numbers of
data within each interval. Table 1 presents the results, in which x is the weighted distance. As can be
seen, in general the results obtained by the presented method were better than those obtained by the
Tone Analyzer for the dataset used.
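To make the projection and distance computation concrete, the following minimal sketch (in Python) shows one way to realize it; the V-A coordinates assigned to the emotion classes and the class names are illustrative placeholders rather than the exact positions taken from [46]:

import numpy as np

# Illustrative valence-arousal coordinates for the emotion classes; the actual
# positions used in the experiments follow the circumplex model in [46].
VA_COORDS = {
    "anger":   (-0.5,  0.8),
    "joy":     ( 0.8,  0.5),
    "sadness": (-0.7, -0.4),
    "fear":    (-0.6,  0.6),
    "neutral": ( 0.0,  0.0),
}

def weighted_va_distance(class_weights, true_label):
    """Weighted V-A distance between the predicted and the actual emotion class.

    class_weights maps each class name to the score produced by the classifier;
    only the two classes with the largest weights are kept, and their weighted
    V-A position is compared with the position of the true class.
    """
    top_two = sorted(class_weights, key=class_weights.get, reverse=True)[:2]
    total = sum(class_weights[c] for c in top_two)
    predicted = sum(
        (class_weights[c] / total) * np.array(VA_COORDS[c]) for c in top_two
    )
    return float(np.linalg.norm(predicted - np.array(VA_COORDS[true_label])))

# Example: a sentence scored mostly as joy but annotated as neutral.
print(weighted_va_distance({"joy": 0.7, "neutral": 0.2, "anger": 0.1}, "neutral"))

The resulting distances were then grouped into the intervals reported in Table 1.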
Table 1. Performance comparison for the two methods.

Distance          CNN-LSTM    Tone Analyzer
x = 0             861         692
0 < x <= 0.1      160         156
0.1 < x <= 0.3    68          101
0.3 < x <= 0.5    73          95
0.5 < x <= 0.7    61          107
0.7 < x <= 0.9    125         169
0.9 < x <= 1.0    51          68
1.0 < x           104         115

4.4. Performance of Training a Dialogue Model


The next set of experiments was to examine the system performance of model training in retrieving
(selecting) answers. In this series of experiments, a large dataset was adopted [38]. It was collected from the Insurance Library website and, after a data preprocessing procedure was performed, included 12,889 questions and 21,325 answers. This procedure was to remove unsuitable data that could not
form the proper input question−answer pairs, to clean the irrelevant terms (such as html tags) and to
transfer the text content into internal identifiers (to form the vectors). In the experiments, the above dataset was divided into training and test parts, with 2000 questions and 3308 answers held out for testing. The complete experiments of dialoguing were described in our previous work [35], and
here we focused on reporting the results most related to the model training for human–robot interaction.
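As an illustration of this preprocessing, the sketch below shows one way to clean a raw sentence and map it to internal identifiers; the regular expressions, helper names and the fixed maximum length are assumptions made only for the example:

import re

def clean_text(raw):
    """Remove HTML tags and other irrelevant terms from a raw sentence."""
    text = re.sub(r"<[^>]+>", " ", raw)          # strip html tags
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)  # drop stray symbols
    return text.lower().split()

def build_vocabulary(sentences):
    """Assign an internal integer identifier to every token."""
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for sentence in sentences:
        for token in clean_text(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab

def to_identifiers(sentence, vocab, max_len=100):
    """Transfer a cleaned text into a fixed-length vector of identifiers."""
    ids = [vocab.get(tok, vocab["<UNK>"]) for tok in clean_text(sentence)]
    return (ids + [vocab["<PAD>"]] * max_len)[:max_len]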
As described in Section 3.3, in the model training phase, for each question sentence a positive and
a negative answer were needed to constitute a training instance. However, in a real-world application,
the correct answer A+ for a question Q can be determined easily (by the confirmation of the person
asking the question), while the wrong answers are often not explicitly specified. Therefore, in the
experiments here, all other answers in the dataset were considered candidates of wrong answers to Q.
To find the most suitable wrong answer A− for each question in the dataset, we used the above model
training procedure to perform the preprocessing procedure of the wrong answer selection. Due to
the large amount of answers, in this work we randomly chose ten (instead of all) answers for each
question to perform training to reduce computational time.
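The sampling step can be summarized by the following sketch; the data structures (a list of question−answer pairs and a global answer pool) are illustrative assumptions rather than the exact implementation:

import random

def build_training_triples(qa_pairs, all_answers, num_negatives=10, seed=7):
    """Create (question, positive answer, negative answer) training instances.

    qa_pairs: list of (question, correct_answer) tuples.
    all_answers: the full answer pool; every answer other than the correct one
    is treated as a candidate wrong answer, and ten are sampled per question.
    """
    rng = random.Random(seed)
    triples = []
    for question, positive in qa_pairs:
        candidates = [a for a in all_answers if a != positive]
        for negative in rng.sample(candidates, min(num_negatives, len(candidates))):
            triples.append((question, positive, negative))
    rng.shuffle(triples)  # random shuffling before training
    return triples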
In the learning process, the random shuffling strategy was used to combine the correct and
wrong answers for each question to work as the training data. The model and method presented in
Section 3.3 were used for training. Figure 7a illustrates the results of the two performance metrics
often used in retrieval-based dialogue modeling, accuracy and MRR. Here, the accuracy was in fact the
top-one precision mentioned in the other relevant studies. It means that the model’s predictive result
(i.e., the top score answer) must be exactly the expected one as recorded in the dataset. As shown,
the LSTM-CNN model could achieve the best performance with a correct prediction rate of 0.61 and
the MRR was 0.70. The results were similar to those presented in the related study [38], whereas the
presented method involved a smaller set of parameters and was more efficient in learning. In addition
to the LSTM-CNN model, a traditional embedding model (using only word embedding technique)
was also implemented for performance comparison. The results are shown in Figure 7b. As presented,
the accuracy of the embedding model was 0.12 and the MRR was 0.21. These results indicated that the LSTM-CNN model was more efficient; it obtained a better result within fewer iterations.


Figure 7. Performance comparison of the two methods for the original dataset: (a) LSTM-CNN
model; (b) embedding model.
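For clarity, the two metrics can be computed from the ranked candidate answers as in the following sketch; the result format (a ranked candidate list per test question together with the annotated answer) is an assumption made for illustration:

def top_one_accuracy(results):
    """results: list of (ranked_answers, expected_answer) per test question."""
    hits = sum(1 for ranked, expected in results if ranked[0] == expected)
    return hits / len(results)

def mean_reciprocal_rank(results):
    """MRR: average of 1/rank of the expected answer in the ranked list."""
    total = 0.0
    for ranked, expected in results:
        if expected in ranked:
            total += 1.0 / (ranked.index(expected) + 1)
    return total / len(results)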

In addition to the performance evaluation of the dialogue modeling, we performed another set of trials to examine the performance of the shared knowledge obtained by translating a dataset from a different language. In the experiments, the dataset used in the above set of experiments was translated according to the steps described in Section 3.3, and the same deep learning model and method were used for the training. As mentioned previously, in contrast to English sentences, a Chinese sentence could be segmented into various combinations of words by different segmentation methods, and this often led to different modeling results. Therefore, before evaluating the performance of the model training, we conducted a set of trials to investigate the effect of two popular segmentation methods, Jieba and HanLP; the results showed that Jieba performed better than HanLP. We thus chose Jieba segmentation to continue the experiments of model training.
The results (i.e., accuracy and MRR) are presented in Figure 8a. As shown in the figure, the LSTM-CNN model can achieve a best performance (accuracy) of 0.54 and an MRR of 0.64. Moreover, the traditional embedding model was implemented for comparison and the results are shown in Figure 8b. Similar to the experiments conducted for the original (untranslated) dataset, the results here indicated that the LSTM-CNN model was more efficient than the traditional embedding method.
Figure 8. Performance comparison of the two methods for the translated dataset: (a) LSTM-CNN
model; (b) embedding model.

Compared to the results obtained from the original dataset, the accuracy declined from 0.61 to
0.54. This indicated that the model built from the translated knowledge (i.e., dataset) could not keep
the modeling performance at the same level; nevertheless, the results showed that the translated
knowledge was learnable with an acceptable performance and was thus useful in building models for
the resource-restricted language. The performance could be further improved when more advanced
text translation techniques are applied.
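As a brief illustration of the word segmentation step discussed above, the jieba package can be applied as in the sketch below; the example sentence and the shown output are purely illustrative:

import jieba

def segment(sentence):
    """Segment a Chinese sentence into words before building the training vectors."""
    return jieba.lcut(sentence)

# Different segmenters can split the same sentence into different word sequences,
# which is why the segmentation method was evaluated before model training.
print(segment("今天天气很好"))   # e.g., ['今天', '天气', '很', '好']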

4.5. Evaluation of Training Task-Oriented Dialogues

4.5.1. Performance Evaluation of Neural Belief Tracker

As mentioned previously, the belief tracker plays an important role in a goal-oriented dialogue system; it can be used to track each participant's intention from the continuous dialoguing utterances between the participant and the robot. In this work, we implemented a deep CNN model to work as a neural belief tracker. The application task was to perform restaurant recommendation through the user−robot dialogue. A set of entities were pre-defined and the robot iteratively interacted with the user to derive all the missing entity values. The DSTC6 dataset was used for training the tracker. In this task (dataset), five entities were tracked, namely cuisine, location, price range, atmosphere and party size, and the system had to infer their corresponding slot values for making an appropriate recommendation. Table 2 lists the values defined for the entities.

Table 2. Values for all the tracked entities.

Task Entity     Values
cuisine         Italian, British, Indian, French, Spanish
location        Rome, London, Bombay, Paris, Madrid
price range     cheap, moderate, expensive
atmosphere      casual, business, romantic
seat number     two, four, six, eight

To achieve the task, a state tracker was trained for each entity. In the experiments, the performance metrics were the accuracy (the number of correct responses divided by the number of turns) and the loss (here, root mean square error). The training performance for all the entities is presented in Figure 9, in which (a) shows how the accuracy was improved during the training process, and (b) illustrates the reduced loss. As is shown in Figure 9, an accuracy of 0.9 can be obtained after 200 epochs, and the loss converged to a small value after 75 epochs and approximated toward zero at the end of the training (200 epochs).

Figure 9. Performance of learning the neural belief tracker: (a) accuracy; (b) loss.

4.5.2. Performance Evaluation of Autoencoder

The DSTC6 dataset used in the above section for training the neural belief tracker contained only dialogue information which cannot be used for making a recommendation. Therefore, in this section we adopted another public dataset (i.e., Yelp [47]) to evaluate the recommendation performance of the presented model. The original dataset contained a large amount of users and their ratings of a set of shops. This dataset had a very high sparsity. To achieve our task of restaurant recommendation and to connect the recommendation module to the dialogue system, we chose the relevant data (69,634 users, 41,019 restaurants and 1,817,955 ratings) to evaluate our approach.
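A minimal sketch of how such rating triples could be packed into the sparse user−item matrix consumed by the autoencoder is given below; the triple format and index ranges are assumptions made for illustration:

import numpy as np
from scipy.sparse import csr_matrix

def build_rating_matrix(ratings, num_users, num_items):
    """Arrange (user_index, item_index, stars) triples into a sparse user-item matrix.

    Unobserved entries stay at zero, which reflects the high sparsity of the
    Yelp data; the autoencoder later reconstructs only the observed ratings.
    """
    users, items, stars = zip(*ratings)
    return csr_matrix(
        (np.asarray(stars, dtype=np.float32), (users, items)),
        shape=(num_users, num_items),
    )

# Example with the figures reported above (69,634 users and 41,019 restaurants):
# matrix = build_rating_matrix(filtered_ratings, 69634, 41019)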
As mentioned above, we revised the autoencoder model through a set of experimental investigations to enhance the corresponding performance. The first phase was to investigate the effect of the code size (the number of nodes in the code layer). A set of code sizes (32, 64, 128 and 256) were evaluated and the results showed that, with a size of 32, the model could obtain its best performance. After the preliminary test for code size, in the second phase we evaluated the performance of the different activation functions, including ELU, SeLU, ReLU, Sigmoid and tanh, which are often used in deep learning models. The loss (root mean squared error) was employed to measure the prediction performance and the results are shown in Figure 10, in which (a) is the training process and (b) is the corresponding test process. Figure 10 indicates that the overfitting situation occurred in all cases and that the case of ELU obtained the best result. We then performed an additional set of trials on the dropout (the dropping out of units in a neural network) and chose a dropout value of 0.8 to alleviate the overfitting. The third phase was to investigate the effect of the number of hidden layers arranged in the deep network. In this set of experiments, we evaluated five different numbers of layers: 2, 4, 6, 8 and 10, and the results are shown in Figure 11. As shown in Figure 11, though with more hidden layers the model can obtain better training performance, it caused overfitting. We thus chose to use six hidden layers in the final experiments for performance comparison.

Figure 10. Comparison of the different activation functions in: (a) training phase; (b) test phase.
After conducting the above evaluation steps for the determination of the network parameters,
we then compared our enhanced approach to other popular collaborative filtering methods, including
the well known autoencoder AutoRec, and the latent factor model NNMF (non-negative matrix
factorization [48]) which is one of the best models in the relevant studies. In the experiments, for all
three methods, the number of epochs was 100, the code size (for our model and AutoRec) and the
latent factor (for NNMF) was 32, and the learning rate (for our model and AutoRec) was 0.005. As a
result, the loss (error) for the proposed model, the AutoRec model and the NNMF method were 1.0868,
1.4758 and 1.1293, respectively. Such results showed that the proposed method outperformed other
methods and can provide better recommendation performance.


Figure 11. Comparison of the different numbers of hidden layers in: (a) training phase; (b) test phase.

4.6. Discussion
The above experiments evaluated our approach for a service robot to provide knowledge services.
As presented, in our current design, the emotion recognition was constructed separately from the
dialogue modeling. The model was trained by a data-driven process with a static dataset. The
emotion classifier was then used to re-rank the sentences selected by the model. The separation of
emotion recognition and dialogue modeling has several advantages. The first is that the modules of
emotion recognition and response generation can be constructed by any effective methods if available;
the system thus operates more flexibly. Meanwhile, the reasons why the system generated particular responses can be interpreted by users for further analysis. The two subsystems can be integrated
into one model to optimize the corresponding structure and performance, for example, to adopt a
monolithic model with an attention mechanism to capture emotion as a special context. However, the
integrated system may thus become relatively difficult to understand and computationally expensive.
Considering the dialogue modeling, this work trained models by a data-driven process with a
static dataset. Therefore, in addition to the learning method, the quality and quantity of the dataset also
had influences on the overall performance. It was thus important to strengthen the role of knowledge
(i.e., dataset) to infer an enriched domain-specific model. Different strategies can be developed to
exploit more knowledge resources, ranging from directly linking the dataset to up-to-date external knowledge bases and reorganizing the dataset to obtain an optimized data use, to a complicated procedure of transferring knowledge between different domains. For a resource-restricted language,
a straightforward way is to take the translated datasets as shared knowledge for modeling. We showed
the effect of using translated knowledge. Our application case revealed that the translated knowledge
was learnable and the modeling performance could be kept at a similar level as when using the
original data. More advanced language translation techniques can be developed to further improve
the performance.
In contrast to the non-task-oriented dialogues, a task-oriented dialogue has a specific goal to
achieve. The dataset is more focused and thus relatively smaller. In such a system, the most critical
component is the goal tracker that is used to track the user’s intent during the dialogue to infer
the dialogue state. The system can then decide the best response accordingly to achieve the goal
(e.g., the recommendations in our experiments). Through the application presented in this work,
we have demonstrated that task-oriented dialogues can be practically launched for a manageable task
with a clear goal and a constrained dataset. During such dialogues, a fine-tuning procedure for the
model parameters needs to be carefully performed to find the best results. When the task becomes
complicated or has a high-level (or abstract) goal to achieve, more advanced state tracking and inferring mechanisms are needed to better understand the users' intentions.

5. Conclusions
In this work, we presented an emotion-aware dialogue framework for a service robot to achieve
natural human–robot communication. To deploy this framework, we adopted a cloud-based
service-oriented architecture and developed the emotion recognition and two types of dialogue
modeling modules on it. In the first type of service, the robot worked as a consultant to deliver
domain-specific knowledge to users. We employed a deep learning method to train different neural
models for mapping questions and answer sentences, tracking the human emotion during the process
of the human–robot dialoguing and using this additional information to determine the relevance of the
sentences obtained by the model. In the second type of dialogue service, task-oriented dialogues were
provided for assisting users to achieve specific goals. The robot continuously asked the user questions
related to the task, tracked the user’s intention through the interactions and provided suggestions
accordingly. To verify our framework, we conducted a series of experiments to evaluate the major
system components. The results confirmed the effectiveness and efficiency of the presented approach.
Currently, we are developing techniques of knowledge transfer that can extract pairs of questions and
answers from the text documents of different domains, in order to automatically enrich the dataset
for retrieval-based model training. Moreover, we plan to investigate the use of Kansei engineering
with hedge algebras to improve the granularity of the semantic and linguistic analysis in the dialogue
sentences. We also plan to integrate the characteristics and preferences of the users into the learning
model to achieve personalized dialogues.

Author Contributions: All authors discussed and commented on the manuscript at all stages. More specifically:
investigation, methodology and software, J.-Y.H., W.-P.L. and B.-W.D.; supervision, W.-P.L., data analysis and
processing, J.-Y.H., C.-C.C. and B.-W.D.; writing—original draft preparation, J.-Y.H. and W.-P.L.; writing—review
and editing, J.-Y.H., W.-P.L., C.-C.C.; funding acquisition, W.-P.L. All authors have read and agreed to the published
version of the manuscript.
Funding: This research was supported in part by the Ministry of Science and Technology of Taiwan, under
Contract MOST-108-2221-E-110-054.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Erl, T. Service-Oriented Architecture; Prentice Hall: New York, NY, USA, 2005; Volume 8.
2. Yang, T.-H.; Lee, W.-P. A service-oriented framework for developing home robots. Int. J. Adv. Robot. Syst.
2013, 10, 122. [CrossRef]
3. Huang, J.-Y.; Lee, W.-P.; Lin, T.-A. Developing context-aware dialogue services for a cloud-based robotic
system. IEEE Access 2019, 7, 44293–44306. [CrossRef]
4. Quigley, M.; Conley, K.; Gerkey, B.; Faust, J.; Foote, T.; Leibs, J.; Ng, A.Y. ROS: An open-source robot operating
system. In Proceedings of the IEEE International Conference on Robotics and Automation, Workshop on
Open-Source Robotics, Kobe, Japan, 12–17 May 2009.
5. Gao, J.; Galley, M.; Li, L. Neural approaches to conversational AI. In Proceedings of the 41st ACM SIGIR
International Conference on Research and Development in Information Retrieval, Ann Arbor, MI, USA, 8–12
July 2018; pp. 1371–1374.
6. Shang, L.; Lu, Z.; Li, H. Neural responding machine for short-text conversation. In Proceedings of the 53rd
Annual Meeting of the Association for Computational Linguistics, Beijing, China, 26–31 July 2015; Volume 1,
pp. 1577–1586.
7. Huang, J.-Y.; Lee, W.-P.; Dong, B.-W. Learning emotion recognition and response generation for a service
robot. In Proceedings of the 6th IFToMM International Symposium on Robotics and Mechatronics, Taipei,
Taiwan, 28–31 October 2019; pp. 286–297.
8. Waibel, M.; Beetz, M.; Civera, J.; D’Andrea, R.; Elfring, J.; Galvez-Lopez, D.; van de Molengraft, M.J.;
Schiesle, B. RoboEarth-A world wide web for robots. IEEE Robot. Autom. Mag. 2011, 18, 69–82. [CrossRef]
9. Mohanarajah, G.; Hunziker, D.; D’Andrea, R.; Waibel, M. Rapyuta: A cloud robotics plat-form. IEEE Trans.
Autom. Sci. Eng. 2015, 12, 481–493. [CrossRef]

10. Pereira, A.B.M.; Bastos, G.S. ROSRemote, using ROS on cloud to access robots remotely. In Proceedings of the
18th IEEE International Conference on Advanced Robotics, Hong Kong, China, 10–12 July 2017; pp. 284–289.
11. Kehoe, B.; Patil, S.; Abbeel, P.; Goldberg, K. A survey of research on cloud robotics and automation. IEEE Trans.
Autom. Sci. Eng. 2015, 12, 398–409. [CrossRef]
12. Saha, O.; Dasgupta, P. A comprehensive survey of recent trends in cloud robotics architectures and
applications. Robotics 2018, 7, 47. [CrossRef]
13. Simoens, P.; Dragone, M.; Saffiotti, A. The internet of robotic things: A review of concept, added value and
applications. J. Adv. Robot. Syst. 2018, 15. [CrossRef]
14. Ray, P.P. Internet of robotic things: Concept, technologies, and challenges. IEEE Access 2016, 4, 9489–9500.
[CrossRef]
15. Tian, N.; Chen, J.; Zhang, R.; Huang, B.; Goldberg, K.; Sojoudi, S. A fog robotic system for dynamic visual
servoing. In Proceedings of the IEEE International Conference on Robotics and Automation, Montreal, QC,
Canada, 20–24 May 2019; pp. 1982–1988.
16. Galambos, P. Cloud, fog, and mist computing: Advanced robot applications. IEEE Syst. Man Cybern. Mag.
2020, 6, 41–45. [CrossRef]
17. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Advances in Neural
Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014; Volume 27, pp. 3104–3112.
18. Serban, I.V.; Lowe, R.; Charlin, L.; Pineau, J. A survey of available corpora for building data-driven dialogue
systems. arXiv 2017, arXiv:1512.05742v3.
19. Hu, B.; Lu, Z.; Li, H.; Chen, Q. Convolutional neural network architectures for matching natural language
sentences. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014;
Volume 27, pp. 2042–2050.
20. Serban, I.V.; Sordoni, A.; Bengio, Y.; Courville, A.; Pineau, J. Building end-to-end dialogue systems using
generative hierarchical neural network models. In Proceedings of the 30th AAAI Conference on Artificial
Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 3776–3783.
21. Wen, T.H.; Gasic, M.; Mrksic, N.; Su, P.H.; Vandyke, D.; Young, S. Semantically conditioned LSTM-based
natural language generation for spoken dialogue systems. In Proceedings of the International Conference on
Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1711–1721.
22. Wu, S.; Zhang, D.; Li, Y.; Xie, X.; Wu, Z. HL-EncDec: A hybrid-level encoder-decoder for neural response
generation. In Proceedings of the International Conference on Computational Linguistics, Santa Fe, NW,
USA, 20–26 August 2018; pp. 845–856.
23. Li, X.; Chen, Y.-N.; Li, L.; Gao, J.; Celikyilmaz, A. End-to-end task-completion neural dialogue systems.
In Proceedings of the 8th International Joint Conference on Natural Language Processing, Taipei, Taiwan,
3 March 2017; pp. 733–743.
24. Williams, J.D.; Asadi, K.; Zweig, G. Hybrid code networks: Practical and efficient end-to-end dialog control
with supervised and reinforcement learning. arXiv 2017, arXiv:1702.03274v2.
25. Kuchaiev, O.; Ginsburg, B. Training deep autoencoders for collaborative filtering. arXiv 2017, arXiv:1708.
01715v3.
26. Zhou, H.; Huang, M.; Zhang, T.; Zhu, X.; Liu, B. Emotional chatting machine: Emotional conversation
generation with internal and external memory. In Proceedings of the 32nd AAAI Conference on Artificial
Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 730–738.
27. Sun, X.; Peng, X.; Ding, S. Emotional human-machine conversation generation based on long short-term
memory. Cogn. Comput. 2018, 10, 389–397. [CrossRef]
28. Asghar, N.; Poupart, P.; Hoey, J.; Jiang, X.; Mou, L. Affective neural response generation. In 40th European
Conference on Information Retrieval Research; Springer: Cham, Switzerland, 2018; pp. 154–166.
29. Ghosh, S.; Vinyals, O.; Strope, B.; Roy, S.; Dean, T.; Heck, L. Contextual LSTM (CLSTM) models for large
scale NLP tasks. arXiv 2016, arXiv:1602.06291.
30. Appel, O.; Chiclana, F.; Carter, J.; Fujita, H. A hybrid approach to the sentiment analysis problem at the
sentence level. Knowl. Based Syst. 2016, 108, 110–124. [CrossRef]
31. Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python; O’reilly Media: Reading, MA, USA, 2009.
32. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of
the International Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29
October 2014; pp. 1532–1543.

33. Ramachandran, P.; Barret, Z.; Le Quoc, V. Searching for activation functions. In Proceedings of the Sixth
International Conference on Learning Representations, Workshop Track, Vancouver, BC, Canada, 30 April–3
May 2018.
34. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the Third International
Conference for Learning Representations, San Diego, CA, USA, 22 December 2015.
35. Huang, J.-Y.; Lin, T.-A.; Lee, W.-P. Using deep learning and an external knowledge base to develop
human-robot dialogues. In Proceedings of the IEEE International Conference on Systems, Man, and
Cybernetics, Miyazaki, Japan, 7–10 October 2018; pp. 3699–3704.
36. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases
and their compositionality. In Advances in Neural Information Processing Systems 26; Curran Associates, Inc.:
New York, NY, USA, 2013; pp. 3111–3119.
37. Řehůřek, R.; Sojka, P. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the
LREC 2010 Workshop New Challenges for NLP Frameworks, Valletta, Malta, 22 May 2010; pp. 46–50.
38. Feng, M.; Xiang, B.; Glass, M.R.; Wang, L.; Zhou, B. Applying deep learning to answer selection: A study and
an open task. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding,
Scottsdale, AZ, USA, 13–17 December 2015; pp. 813–820.
39. Hori, C.; Perez, J.; Higashinaka, R.; Hori, T.; Boureau, Y.L.; Inaba, M.; Tsunomori, Y.; Takahashi, T.; Yoshino, K.;
Kim, S. Overview of the sixth dialog system technology challenge: DSTC6. Comput. Speech Lang. 2019, 55,
1–25. [CrossRef]
40. Mrkšić, N.; Séaghdha, D.O.; Wen, T.H.; Thomson, B.; Young, S. Neural belief tracker: Data-driven dialogue
state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics,
Vancouver, BC, Canada, 30 July–4 August 2017; Volume 1, pp. 1777–1788.
41. Baldi, P. Autoencoders, unsupervised learning, and deep architectures. In Proceedings of the International
Conference on Machine Learning, Workshop on Unsupervised and Transfer Learning, Edinburgh, Scotland,
26 June–1 July 2012; pp. 37–49.
42. Sedhain, S.; Menon, A.K.; Sanner, S.; Xie, S. AutoRec: Autoencoders meet collaborative filtering.
In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18–22 May
2015; pp. 111–112.
43. Tan, P.; Steinbach, M.; Kumar, V. Introduction to Data Mining; Addison-Wesley: Reading, MA, USA, 2005.
44. The Face Dataset. Available online: http://robotics.csie.ncku.edu.tw/Databases/FaceDetect_Pose_Estimate.
htm#Our_Database (accessed on 10 January 2018).
45. Phan, D.A.; Shindo, H.; Matsumoto, Y. Multiple emotions detection in conversation transcripts. In Proceedings
of the 30th Pacific Asia Conference on Language, Information and Computation, Seoul, Korea, 28–30 October
2016; pp. 85–94.
46. Russell, J.A. A circumplex model of affect. J. Personal. Soc. Psychol. 1980, 39, 1161. [CrossRef]
47. The Yelp Dataset. Available online: https://www.yelp.com/dataset/ (accessed on 20 May 2019).
48. Févotte, C.; Idier, J. Algorithms for nonnegative matrix factorization with the β-divergence. Neural Comput.
2011, 23, 2421–2456. [CrossRef]

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
