Article
Developing Emotion-Aware Human–Robot Dialogues
for Domain-Specific and Goal-Oriented Tasks †
Jhih-Yuan Huang, Wei-Po Lee *, Chen-Chia Chen and Bu-Wei Dong
Department of Information Management, National Sun Yat-sen University, Kaohsiung 80424, Taiwan;
[email protected] (J.-Y.H.); [email protected] (C.-C.C.); [email protected] (B.-W.D.)
* Correspondence: [email protected]
† This paper is an extended version of our paper Huang, J.-Y.; Lee, W.-P.; Dong, B.-W. Learning Emotion
Recognition and Response Generation for a Service Robot. In Proceedings of the 6th IFToMM International
Symposium on Robotics and Mechatronics, Taipei, Taiwan, 28–30 October 2019; pp. 286–297.
Received: 30 March 2020; Accepted: 3 May 2020; Published: 7 May 2020
Abstract: The development of dialogue services for robots has been actively promoted in recent years to provide natural human–robot interactions that enhance user experiences. In this study, we adopted a service-oriented
framework to develop emotion-aware dialogues for service robots. Considering the importance of
the contexts and contents of dialogues in delivering robot services, our framework employed deep
learning methods to develop emotion classifiers and two types of dialogue models for its dialogue services.
In the first type of dialogue service, the robot works as a consultant, able to provide domain-specific
knowledge to users. We trained different neural models for mapping questions and answer
sentences, tracking the human emotion during the human–robot dialogue, and using the emotion
information to decide the responses. In the second type of dialogue service, the robot continuously
asks the user questions related to a task with a specific goal, tracks the user’s intention through
the interactions and provides suggestions accordingly. A series of experiments and performance
comparisons were conducted to evaluate the major components of the presented framework and the
results showed the promise of our approach.
1. Introduction
Researchers and engineers have been building service robots that can interact with people and
achieve given tasks. To deploy practical service robots, two major concerns need to be seriously
considered, including the system architecture for launching the services and the creation of the service
functions. At present, the services are mostly labor services, in which robots take actions in the
physical environment to assist people. However, robots are now expected to play more important roles
in providing domain-specific knowledge services and task-oriented services. To deliver these services,
robots communicate with users naturally through spoken language because conversation is a
key instrument for developing and maintaining mutual relationships. Following our previous studies
that adopted a service-oriented architecture to develop action-oriented robot services, in this work we
presented a trainable framework for modeling emotion-aware human–robot dialogues to provide the
aforementioned services.
Regarding the many choices of supportive software architecture, some researchers have proposed
to adopt cloud-based service-oriented architecture (SOA). SOA is an architectural style based on
interacting software components, providing services as fundamental units to design, build and compose
the service-oriented software systems [1]. A service is a function made available by a service provider.
2. Related Works
As mentioned previously, at present most of the service robot frameworks have been connected to
various cloud-computing environments to exploit their large amounts of resources. Among others, the
most representative work is RoboEarth [8], driven by an open-source cloud robotics platform [9]. With
this platform, the robots can distribute highly loaded computation to the cloud and access the RoboEarth
knowledge repository to download required resources. There are also other platforms developed for
cloud robotic systems. For example, Pereira et al. proposed the ROSRemote framework [10], which
enabled users to work with ROS remotely to create several applications. More extensive surveys
can be found in [11,12]. More recently, due to the rapid advances of the Internet of Things (IoT),
researchers proposed the concept of the Internet of Robot Things (IoRT) to describe a new approach
to robotics [13,14]. In this way, smart devices can monitor events, fuse sensor data from a variety of
sources and use local and distributed intelligence to determine a best course of action. This expands the
ability of service robots, improves a robot’s understanding during the human–machine interaction and
leads to a more intelligent robotic network. Moreover, to deal with the scalability problem, researchers
have started to extend the cloud computing concept for service robots to edge or fog computing to
utilize the resources in a more efficient way [15,16].
Instead of investigating issues related to resource allocation and utilization, this work aimed to
develop emotion-aware dialogues for a service robot, in which the most important issues were to
recognize the emotions from the user utterances and to generate appropriate machine responses. Many
methods have been proposed to solve these problems from different perspectives. Because this work
adopted deep learning models to address the above two issues, in the following we discuss the most
relevant studies with similar computational methods.
In general, using a deep learning-based approach to develop dialogues, responses are generated
based on sequence-to-sequence (seq2seq) neural network models, with an objective function of the
maximum-likelihood estimation [17]. This model treats dialogue modeling as learning a mapping
between human utterances and machine responses. The focus is on how to generate a suitable response
from a corpus to a human utterance. For the training of dialogue models, generative and retrieval-based
methods are often used. Although generative methods have the potential to generate sentences of rich
content, current generative models often have the disadvantages of lacking coherence and producing
unnatural responses. In contrast, though retrieval-based methods are more restricted, they have the
advantage of producing informative and fluent responses. Thus, the retrieval-based methods are
more practical. As can be observed, retrieval-based methods rely on the exploitation of a large and
varied corpus (human–human or human–machine interactions) [18] and deep learning models have
been employed to derive mappings (that is, a selection mechanism) between questions and answers
(e.g., [5,19]).
The basic seq2seq model consists of two recurrent neural networks (RNNs): one works as an
encoder to process the input; the other, a decoder to generate the output. With the characteristic of
making predictions based on running texts of varying lengths, the long short-term memory networks
(LSTMs) are often adopted to train the answer selection mechanism. This model has now been widely
applied to conversation generation and most existing works have mainly focused on developing more
advanced techniques (such as decoding strategies or network models) to improve the content quality
of the responses. Many neural dialogue systems have been constructed based on this design principle.
For example, Serban et al. used a hierarchical LSTM network for a conversation application [20], and
Wen et al. proposed a task-oriented model to generate the correct answers in response to the needs
of the given dialogue [21]. To overcome the problem of overly general (i.e., safe) responses, Wu et al.
proposed a hybrid-level encoder–decoder model, which utilized both word-level and character-level
features [22]. Although these models, in theory, are better at maintaining the dialogue state using
memory components, they require longer training time and excessive searching for hyper-parameters.
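To make the basic encoder–decoder design concrete, the following is a minimal sketch in Keras; it is not the implementation of any of the surveyed systems, and the vocabulary size, hidden dimension and loss settings are illustrative assumptions.

    from tensorflow.keras.layers import Input, LSTM, Dense, Embedding
    from tensorflow.keras.models import Model

    VOCAB, HIDDEN = 10000, 256  # illustrative sizes, not taken from the text

    # Encoder: reads the human utterance and keeps only its final LSTM states.
    enc_in = Input(shape=(None,))
    enc_emb = Embedding(VOCAB, HIDDEN)(enc_in)
    _, state_h, state_c = LSTM(HIDDEN, return_state=True)(enc_emb)

    # Decoder: generates the machine response conditioned on the encoder states.
    dec_in = Input(shape=(None,))
    dec_emb = Embedding(VOCAB, HIDDEN)(dec_in)
    dec_seq, _, _ = LSTM(HIDDEN, return_sequences=True, return_state=True)(
        dec_emb, initial_state=[state_h, state_c])
    dec_probs = Dense(VOCAB, activation="softmax")(dec_seq)

    model = Model([enc_in, dec_in], dec_probs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")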
In contrast to the above domain-specific dialogue systems that aim to generate fluent and
engaging responses, the other type of neural dialogue systems that has attracted a lot of attention
is task-oriented [23,24]. Task-oriented dialogue systems need to complete a specific task (to achieve
a goal), for example, restaurant reservation, by interacting with users (i.e., a response generation
process). Existing task-oriented systems can be divided into two categories: the modularized pipeline
and the end-to-end single-module systems. The former decomposes the task-oriented dialogue task
into modularized pipelines to be solved separately, while the latter proposes to use an end-to-end
model to produce a sequence of output tokens directly to solve the overall task. End-to-end systems
are often superior to pipeline systems, due to their unique characteristics, such as global
optimization and easier adaptation to new domains. In the task-oriented dialogue systems, the most
critical component is the goal tracker [25]. The system must update the state of the dialogue according
to each user’s query and their intent. Given the current dialogue state, the system can then decide how
to respond best to the user to accomplish the desired task.
In addition to employing more sophisticated models and advanced tuning mechanisms towards
proper response generation, some recent works attempted to augment the emotional information of
the neural dialoguing models to generate more meaningful and humanized machine responses. For
example, Zhou et al. presented a model that assumed the emotion category of human utterance was
known and taken as an additional input to train a model of responses [26]. Sun et al. adopted an LSTM
neural network for conversation modeling [27] in which an emotional category label was added to the
encoder, which regarded emotional information as an additional source to the conversational model.
Moreover, Asghar et al. discussed the feasibility of employing emotion information to help generate
diverse responses [28]. They proposed a model of affective response generation to generate sentences
conditioned on emotional word embeddings, affective objective functions and diverse beam search.
However, these methods only focused on emotional factors while ignoring content relevance, possibly
resulting in a decline in the quality and diversity of a response. The integration of emotion and content
is still a challenging task for several reasons. The first is that high-quality emotion-labeled data are
difficult to obtain in a large-scale corpus because emotions are subjective and difficult to annotate.
Moreover, it is difficult to deal with emotions coherently because balancing grammaticality and the
expressions of emotions is needed [29].
Figure 1. Overview of the proposed framework for the human–robot dialogues.
Our framework included two types of dialogue services, one for domain-specific dialogues and
the other for task-specific (task-oriented) dialogues. In contrast to the open-domain conversation
performed by the general purpose chatbots, the domain-specific dialogue presented here aims to provide
knowledge services of a certain domain (e.g., finance or insurance) through a question-answering
manner between the user and the robot. In contrast, the task-oriented dialogue service was to achieve
the specific goal for a certain task (e.g., restaurant recommendation) by conducting the iterative
human–robot dialogue to adapt to the user’s intention or preference related to the task goal. This type
of service is especially important in the coming era of conversational commerce.
At present, the functions of user identification and emotion recognition are constructed
independently from the dialogue model, mainly because of the lack of a dataset containing complete
information of a human face, utterance emotion and dialoguing content. The current strategy was that
the identified user was assigned to a certain type of user group and the corresponding model was
retrieved to perform dialoguing. Then, the candidate sentences produced by the model were re-ranked
(based on the recognized emotion) following a set of hand-crafted rules and the sentence with the
highest rank was selected as the robot’s response. In this work, we only applied the emotion mechanism
to the first type of dialogue (i.e., domain-specific) as a representative example of emotion-aware services.
The same approach can also be applied to the task-oriented dialogue service. The major components of
our framework are described in the following subsections.
Figure 2. The deep learning model used for the emotion recognition.
It has been well known that LSTM can overcome the vanishing gradient problem in gradient-based machine learning methods. However, this situation still occurs when the sentence length is too long and the network needs to be deepened. In this work, we adopted LSTM with ReLU (Rectified Linear Unit [33]) to train a better model, as ReLU was proved to be effective in overcoming the vanishing gradient problem. Moreover, ReLU has the property of sparse activation, making the neural network sparse to alleviate the problem of over-fitting. In the above learning process, the widely adopted gradient descent optimization algorithm Adam [34] was used as the optimizer. As shown in Figure 2, we used the activation function widely used in deep learning models, "Softmax", to map the outputs of the neurons into the interval of (0–1). In this way, a probability distribution over the possible classes could be obtained and the node with the highest probability was selected as our predicted emotion class. To calculate the error between the predicted class and the actual class, a loss function was used and the weight update of the deep neural network was performed accordingly. Here, the function "LabelEncoder" of the machine learning tool scikit-learn (https://scikit-learn.org/) and the loss function "categorical_crossentropy" of the deep learning framework Keras (https://keras.io/) were employed to normalize the class label and convert it into a one-hot code of the binary matrix to perform the numerical calculation.
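The following is a minimal sketch of this training setup, assuming a CNN–LSTM architecture in the spirit of Figure 2; the vocabulary size, layer widths and example labels are illustrative assumptions rather than the exact settings of the paper.

    from sklearn.preprocessing import LabelEncoder
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense
    from tensorflow.keras.utils import to_categorical

    # Normalize string labels and convert them into one-hot codes, as described above.
    labels = ["joy", "anger", "sadness", "fear", "neutral"]   # illustrative classes
    y = to_categorical(LabelEncoder().fit_transform(labels))

    VOCAB, EMB_DIM, MAX_LEN = 20000, 100, 50                  # assumed sizes

    model = Sequential([
        Embedding(VOCAB, EMB_DIM, input_length=MAX_LEN),
        Conv1D(64, 3, activation="relu"),     # ReLU alleviates vanishing gradients
        MaxPooling1D(2),
        LSTM(128),
        Dense(y.shape[1], activation="softmax"),  # probabilities over emotion classes
    ])
    model.compile(optimizer="adam",               # Adam [34]
                  loss="categorical_crossentropy", metrics=["accuracy"])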
3.3. Domain-Specific Dialogue Modeling

3.3.1. Learning Dialogue Models

To develop dialogues for the robot, we adopted the neural language model from our previous works [3,35] for training the answer selection mechanism. Figure 3 illustrates our model that included a LSTM network with a CNN network. The LSTM contained memory blocks in the recurrent hidden layer that could store the temporal state of the network. With this characteristic, this model could better capture information over longer time steps to meet our goal.

For training the deep learning model, the sentences were organized as question-answering pairs. The question sentence Q was the input question encoded into an internal vector form QV by the word-embedding procedure described above. To enhance the performance, we established the word2vec [36] weights for the entire corpus and used them as the pretrained model of the embedding layer. The output then flows to the LSTM and CNN layers. In this procedure, for each question Q there was a corresponding positive answer A+ with a very high probability of being the correct answer among all the answers in the dataset (i.e., the confirmed correct answer). As shown in Figure 3, after the embedding layer, an output vector E was obtained and then calculated through the LSTM function to derive a hidden vector L as the following:
E = EMBED(x1, . . . , xn; We) (1)

L = LSTM(E; WL) (2)
In the above equations, E can be represented as {e1, e2, . . . , en}, E ∈ R^(n×d), in which n is the maximal sentence length and d is the dimension of embedding. We is the weight matrix We ∈ R^(v×d) (v is the number of words in the dictionary), e is the vector embedded for word x and WL is the LSTM weight matrix.
Figure 3. The deep learning model used for the domain-specific human–robot dialogues.
For performance enhancement, we used the gensim package [37] to establish the weights for the entire corpus and used them as the pretrained model of the embedding layer. Though the LSTM layer described above can extract the features of word sequences in the sentences of our network, we furthermore connected the tensor L (Equation (2)) to a convolutional layer to extract more complicated features for performance enhancement. As indicated in Figure 3, the "MaxPooling" function was performed and the "tanh" function was used to transfer and output the decoding result. The above two functions have been widely used in deep learning models for language processing [38].
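As an illustration of this pretraining step, the sketch below builds word-embedding weights with gensim and loads them into a Keras embedding layer; the toy corpus and the vector size of 100 are assumptions for the example.

    import numpy as np
    from gensim.models import Word2Vec
    from tensorflow.keras.layers import Embedding

    # Train word2vec on the tokenized corpus (a list of word lists).
    corpus = [["how", "do", "i", "open", "an", "account"]]    # illustrative data
    w2v = Word2Vec(sentences=corpus, vector_size=100, min_count=1)

    # Copy the learned vectors into a weight matrix indexed by our dictionary.
    vocab = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}  # 0 = padding
    weights = np.zeros((len(vocab) + 1, 100))
    for word, idx in vocab.items():
        weights[idx] = w2v.wv[word]

    # Use the matrix as the pretrained weights of the model's embedding layer.
    embedding = Embedding(input_dim=weights.shape[0], output_dim=100,
                          weights=[weights], trainable=True)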
In the model training procedure, the question Q, the correct answer A+ and the wrong answer
A− (sampled from the answer space) are encoded into vector representations VQ, VA+ and VA−,
respectively, and the similarities between the question and the two answers are calculated separately.
Here, the similarity of the two vectors is defined as
Similarity(VQ, VA) = [1 / (1 + ||VQ − VA||)] × [1 / (1 + exp(−γ(dot(VQ, VA) + c)))] (3)
This equation was adopted from [38], and it has been shown to offer good performance. In the
above equation, the parameter γ is 1.0 and c is 1. VA is a positive or negative answer (i.e., VA+ or
VA− ). Then, the distance between the two similarities is compared (meaning the difference between an
answer and the ground truth) to a pre-defined margin m (a maximum number of steps often used to
reduce the running time). If the distance is less than m, the network parameters are updated; otherwise
another negative example is sampled until the distance is less than m. The above operations were to
ensure that the similarity distance (to be minimized) could reach a certain level. As defined in [38],
the loss function corresponding to the above similarity is:

Loss = max{0, m − Similarity(VQ, VA+) + Similarity(VQ, VA−)} (4)
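A compact sketch of Equations (3) and (4) is given below; the margin value is an assumption for illustration, while γ = 1.0 and c = 1 follow the text.

    import numpy as np

    GAMMA, C = 1.0, 1.0            # parameter values stated for Equation (3)

    def similarity(vq, va):
        # Equation (3): product of a Euclidean-distance term and a
        # sigmoid-dot-product term.
        euclidean = 1.0 / (1.0 + np.linalg.norm(vq - va))
        sigmoid_dot = 1.0 / (1.0 + np.exp(-GAMMA * (np.dot(vq, va) + C)))
        return euclidean * sigmoid_dot

    def hinge_loss(vq, va_pos, va_neg, m=0.2):   # margin m is an assumed value
        # Equation (4): zero once the positive answer outscores the sampled
        # negative answer by at least the margin m.
        return max(0.0, m - similarity(vq, va_pos) + similarity(vq, va_neg))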
During the human–robot dialoguing period (i.e., the test phase), this dialogue service calculates the similarity between a question sentence (asked by the user) and each answer sentence (in the knowledge base). A set of answers with the highest similarity scores is selected and they are re-ranked by the pre-defined rules. The first-ranking sentence is then used as the robot's response.

3.3.2. Knowledge Enrichment

In addition to the above learning model and method, the dataset with the domain questions and the corresponding answers also played a critical role in dialogue modeling, because a rich dataset represents abundant knowledge for a system to interact with human users. It was thus important to include more knowledge resources to enrich the dataset (meaning better conversations).
was employed. As shown in the figure, in the second phase the text and entities mentioned were then passed to a module of dialogue state tracking, which grounds and maintains entities. In contrast to the original HCN work, we adopted a deep CNN network and defined label ontology to further improve the state tracking performance (described in Section 3.4.1).

Figure 4. The framework used for the task-oriented dialogues.

In the third phase, the results obtained from the above phases were then concatenated to be a feature vector and a traditional LSTM was adopted for training. As shown in Figure 4, the output of the LSTM model was passed to a dense layer with a Softmax activation, in which the output dimension was equal to the number of distinct action templates. The output was a distribution over the action templates. In the fourth phase, the action mask was applied and an action was selected accordingly. Then, the selected action was used to produce a fully formed action. Following the above phases, a recommendation module was developed to revise some entities according to the user's preferences. The details are described in Section 3.4.2.
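A minimal sketch of the third and fourth phases is shown below, assuming illustrative sizes for the fused feature vector and the action-template set; it is not the exact network of Figure 4.

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    N_FEATURES, N_ACTIONS = 300, 16   # assumed sizes for illustration

    # Phase 3: an LSTM over the concatenated features, followed by a dense
    # Softmax layer whose size equals the number of action templates.
    policy = Sequential([
        LSTM(128, input_shape=(1, N_FEATURES)),
        Dense(N_ACTIONS, activation="softmax"),
    ])

    # Phase 4: apply the action mask and select an action from the distribution.
    def select_action(features, action_mask):
        probs = policy.predict(features.reshape(1, 1, -1))[0]
        probs = probs * action_mask        # zero out currently invalid templates
        return int(np.argmax(probs))       # index of the chosen action template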
3.4.1. Belief Tracker

The belief tracker (i.e., the dialogue state tracking) is an important component in a dialogue system in which a dialogue state is a full and temporal representation of each participant's intention. A belief tracker can track what has happened with the system outputs, user utterances and context from previous turns. It provides a direct way to validate the system's understanding of the user's goal at each dialogue step through the intention estimation.

Traditionally, the rule-based systems were built for state tracking, but they hardly model uncertainty. Recently, researchers have turned to develop neural models to overcome the uncertainty in tracking dialogue states. In task-oriented dialogue systems, the end-to-end neural networks have been successfully employed for state tracking via interacting with an external knowledge base. However, in task-oriented dialogues, a state tracker is usually trained from a large amount of manually annotated corpora. Considering the huge efforts required for human annotation, we used the available dataset for model training and focused on the model performance.

As indicated above, we adopted a simplified HCN model with a dialogue state tracker. However, in some situations the original state tracker could misjudge the ambiguous user utterances or wrongly spelled words and produce incorrect answers. For example, using the original HCN tracker to analyze the user utterance "I'm asking my friend if she wants to do Rome", the word "Rome" (entity value) is wrongly taken as the final location, but in fact the decision has not yet been made. This was because the original tracker uses a string-matching method for entity identification so the mismatches cannot be corrected. As the neural belief tracker was able to deliver a better performance [40], we thus adopted this method and used a deep CNN model to solve this problem. In addition, a small ontology was established to ensure the semantic correctness of the mentioned entities (i.e., slot values).
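To illustrate the role of the ontology, the hypothetical helper below accepts a slot value only when it both appears in the ontology and is confirmed by the neural tracker's confidence; the slot values and threshold are assumptions for the example.

    # A small slot-value ontology (illustrative values in the spirit of Table 2).
    ONTOLOGY = {
        "location": {"rome", "paris", "tokyo", "london"},
        "cuisine": {"italian", "french", "japanese"},
    }

    def ground_entities(utterance, tracker_confidence, threshold=0.8):
        # Plain string matching alone would wrongly fix "Rome" as the final
        # location in the example above; the confidence check avoids this.
        state = {}
        for word in utterance.lower().split():
            for slot, values in ONTOLOGY.items():
                if word in values and tracker_confidence.get(word, 0.0) >= threshold:
                    state[slot] = word
        return state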
3.4.2. Autoencoder

Following the above dialogue flow, we developed a recommender (as shown in Figure 4) to enhance the system performance and user experience. This module was to refine some entities from the selected response according to the user preferences. In this work, we used a deep learning-based method and adopted the deep neural network and autoencoder [41,42] to realize collaborative recommendation.
Autoencoder is a superior tool for dimensionality reduction and it can be regarded as a strict generalization of principal component analysis. It is a network with implementations of two transformations (encoder and decoder), aiming to reconstruct inputs in the output layer via a low-dimensional latent space to predict the missing ratings. Then, the learning goal is to minimize the error between the original vector (input) and the transformed vector (output). One of the popular autoencoder-based recommendation models is AutoRec [42]. In this model, denoising techniques are used to discover more robust representations and to avoid learning an identity function. These techniques mean to learn the latent representations of the corrupted user-item preferences and they can be used to reconstruct the users' full preferences and reduce the overfitting situations. In this work, our recommender was developed based on AutoRec. The overall architecture is illustrated in Figure 5, in which the encoder, code-layer and decoder are the major parts of the model (included in the dotted line rectangle). Both the encoder and the decoder consist of feed-forward neural networks with fully connected layers and the depth of the model was increased (marked as the deep stack) to enhance the corresponding performance.
Figure 5. The neural model used for the recommendation.
true positives (TP), false positives (FP), true negatives (TN), false negatives (FN) and then used them to
calculate the metrics of accuracy (proportion of correctly predicted instances relative to all predicted
instances), precision (proportion of retrieved instances that were relevant), recall (proportion of relevant
instances that were retrieved) and F-measure (the combined effect of precision and recall that often
conflict in nature) [43]. The metrics are defined as follows:
accuracy = (TP + TN) / (TP + FP + TN + FN) (5)

precision = TP / (TP + FP) (6)

recall = TP / (TP + FN) (7)

F-measure = (2 × precision × recall) / (precision + recall) (8)
In addition to accuracy, to evaluate the performance of the answer selection in the dialogue
modeling, we adopted a statistical measure MRR (mean reciprocal rank, the average of the reciprocal
ranks of results for a sample of n queries). It is defined as
MRR = (1/n) · Σ_{i=1}^{n} (1/rank_i) (9)
where ranki refers to the rank position of the first relevant document for the i-th query.
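For reference, Equations (5)–(9) translate directly into the short helper functions below.

    def classification_metrics(tp, fp, tn, fn):
        # Equations (5)-(8): the four confusion-matrix based metrics.
        accuracy = (tp + tn) / (tp + fp + tn + fn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f_measure = 2 * precision * recall / (precision + recall)
        return accuracy, precision, recall, f_measure

    def mean_reciprocal_rank(first_relevant_ranks):
        # Equation (9): e.g. ranks [1, 3, 2] give (1 + 1/3 + 1/2) / 3.
        return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)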
For RF and SVM, we used the n-gram method to extract more text features from the original data for building classifiers to enhance their performance, in addition to the word features extracted from the text-processing procedure. N-gram can express the sequence relationships between the words, and the unigram, bigram and trigram (n is 1, 2 and 3, respectively) models are often used. After a preliminary test, in this work we used the above three models to extract more text features, and the combined feature vectors were used as the input of the above two machine learning methods (RF and SVM) to enhance their performance.

Figure 6a illustrates the accuracy, precision, recall and F-score for each of the three methods. As can be seen, RF performed the best in all the metrics. The main reason could be that RF is a type of ensemble machine learning algorithm and the way it handled (samples) data for the grouped multiple classifiers made it perform better than the others for the imbalanced dataset here.

After comparing the three aforementioned methods, we applied two data processing techniques to the dataset, including semantic rules and data balance, with the above learning methods to investigate their effects on performance. For the semantic rules, the five rules mentioned in Section 3.2.1 were used to perform more precise sentence segmentation; for data balance, we adopted the scikit-learn tool to produce a set of specific class weights for different types of emotions. The results for accuracy, precision, recall and F-score are illustrated in Figure 6b. As can be seen, in general our CNN-LSTM method obtained the best results on all performance metrics. In addition to the data balance effect, the reason for the performance improvement could be that the semantic rules removed the irrelevant words and filtered out their effects on the sentence emotions. Thus, the learning methods were able to focus on the emotions delivered by the most related parts of the sentences to be predicted.
Figure 6. Results of the three machine learning methods; (a) without and (b) with the enhanced techniques of the semantic rules and data balance.
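As a sketch of this feature-extraction setup, the snippet below combines unigram, bigram and trigram features and applies per-class weighting; the tiny corpus is illustrative, and 'balanced' weights stand in for the specific class weights produced for the emotion dataset.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline

    texts = ["i am so happy today", "this is terrible"]   # illustrative corpus
    labels = ["joy", "anger"]

    # Unigram, bigram and trigram features combined in one vectorizer.
    rf = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)),
                       RandomForestClassifier(class_weight="balanced"))
    svm = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)),
                        SVC(class_weight="balanced"))

    rf.fit(texts, labels)
    print(rf.predict(["i feel great"]))   # class predicted by the toy model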
4.3.2. Comparisons with IBM Tone Analyzer

In addition to comparing the different machine learning methods, we evaluated a well known emotion detection system, the IBM Watson's Tone Analyzer (https://natural-language-understanding-demo.ng.bluemix.net/), for further comparison. Interestingly, the emotions the Tone Analyzer considered were slightly different from what we defined in this work, and it gave degrees (values) of multiple emotions for an input sentence (also different from our work). To conduct the performance comparison, we projected the two sets of emotions (one for our work and one for the Tone Analyzer) into the well known emotional valence and arousal space (i.e., V-A space, [46]). In this space, valence indicates the hedonic value (positive or negative), ranging from unpleasant to pleasant; and arousal indicates the emotional intensity, ranging from inactive to active. The valence and arousal dimensions can be projected onto Euclidean space, where emotions are represented as point-vectors. In this way, the user's emotion can be located in this space to be a tuple of valence and arousal values.
In the experiments, we first projected the five classes (annotated in the dataset) into the V-A space
to retrieve the corresponding valence and arousal values (based on the emotion positions defined
in [46]). For each data item (sentence), we took the positions of the actual (correct) class and the
predicted classes in the space and obtained the set of V-A values. Then, the emotion values produced
by the trained model were taken as class weights and the weighted sum was derived for these specific
data. Consequently, our method and the Tone Analyzer were compared.
For each data item, we chose the two closest classes (with the largest weights) and calculated
their weighted distance to represent the distance between the predicted and the actual classes. To
compare the performances, we divided the distance into eight intervals and counted the numbers of
data within each interval. Table 1 presents the results, in which x is the weighted distance. As can be
seen, in general the results obtained by the presented method were better than those obtained by the
Tone Analyzer for the dataset used.
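The sketch below illustrates this weighted-distance computation; the valence–arousal coordinates are placeholder assumptions, as the actual positions follow the circumplex model of [46].

    import numpy as np

    # Assumed V-A coordinates for the five classes (placeholders, not from [46]).
    VA = {"joy": (0.8, 0.5), "anger": (-0.6, 0.6), "sadness": (-0.7, -0.4),
          "fear": (-0.6, 0.7), "neutral": (0.0, 0.0)}

    def weighted_distance(pred_weights, actual_class):
        # Take the two classes with the largest predicted weights and compute
        # their weighted distance to the actual class in V-A space.
        top2 = sorted(pred_weights, key=pred_weights.get, reverse=True)[:2]
        target = np.array(VA[actual_class])
        return sum(pred_weights[c] * np.linalg.norm(np.array(VA[c]) - target)
                   for c in top2)

    # Example: the trained model's emotion values serve as class weights.
    print(weighted_distance({"joy": 0.6, "neutral": 0.3, "anger": 0.1}, "joy"))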
presented method involved a smaller set of parameters and was more efficient in learning. In addition
to the LSTM-CNN model, a traditional embedding model (using only word embedding technique)
was also implemented for performance comparison. The results are shown in Figure 7b. As presented,
the accuracy of the embedding model is 0.12 and for the MRR is 0.21. These results indicated that the
LSTM-CNN model was more efficient; it obtained a better result within fewer iterations.
Figure 7. Performance comparison of the two methods for the original dataset: (a) LSTM-CNN model; (b) embedding model.
In addition to the performance evaluation of the dialogue modeling, we performed another set of trials to examine the performance of the shared knowledge translated by a dataset from a different language. In the experiments, the dataset used in the above set of experiments was translated according to the steps described in Section 3.3, and the same deep learning model and method were used for the training. As mentioned previously, in contrast to English sentences, a Chinese sentence could be segmented into various combinations of words by different segmentation methods and this often led to different modeling results. Therefore, before evaluating the performance of the model training, we conducted a set of trials to investigate the effect of two popular segmentation methods: the Jieba and the HanLP, and the results showed that the Jieba performed better than the HanLP. We thus chose Jieba segmentation to continue the experiments of model training.

The results (i.e., accuracy and MRR) are presented in Figure 8a. As shown in the figure, the LSTM-CNN model can achieve a best performance (accuracy) of 0.54 and an MRR of 0.64. Moreover, the traditional embedding model was implemented for comparison and the results are shown in Figure 8b. Similar to the experiments conducted for the original (untranslated) dataset, the results here indicated that the LSTM-CNN model was more efficient than the traditional embedding method.

Figure 8. Performance comparison of the two methods for the translated dataset: (a) LSTM-CNN model; (b) embedding model.

Compared to the results obtained from the original dataset, the accuracy declined from 0.61 to 0.54. This indicated that the model built from the translated knowledge (i.e., dataset) could not keep the modeling performance at the same level; nevertheless, the results showed that the translated knowledge was learnable with an acceptable performance and was thus useful in building models for the resource-restricted language. The performance could be further improved when more advanced text translation techniques are applied.
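As a small illustration of why segmentation matters, the snippet below runs the Jieba segmenter in two modes on one sentence; the example sentence is an assumption.

    import jieba

    sentence = "今天天氣很好"   # "The weather is very nice today" (illustrative)

    # Different modes (and different segmenters) can split the same Chinese
    # sentence into different word combinations, affecting the trained model.
    print(jieba.lcut(sentence))                 # accurate mode
    print(jieba.lcut(sentence, cut_all=True))   # full mode: all possible words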
4.5. Evaluation of Training Task-Oriented Dialogues

4.5.1. Performance Evaluation of Neural Belief Tracker

As mentioned previously, the belief tracker plays an important role in a goal-oriented dialogue system; it can be used to track each participant's intention from the continuous dialoguing utterances between the participant and the robot. In this work, we implemented a deep CNN model to work as a neural belief tracker. The application task was to perform restaurant recommendation through the user−robot dialogue. A set of entities were pre-defined and the robot iteratively interacted with the user to derive all the missing entity values. The DSTC6 dataset was used for training the tracker. In this task (dataset), five entities were tracked, namely cuisine, location, price range, atmosphere and party size, and the system had to infer their corresponding slot values for making an appropriate recommendation. Table 2 lists the values defined for the entities.

Table 2. Values for all the tracked entities.
Figure 9. Performance of learning the neural belief tracker: (a) accuracy; (b) loss.
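A simplified sketch of the iterative slot-filling loop for this task is given below; the request/recommend action strings are hypothetical names used only for illustration.

    # The five tracked entities of the DSTC6 restaurant task.
    SLOTS = ["cuisine", "location", "price range", "atmosphere", "party size"]

    def next_system_action(dialogue_state):
        # Ask for the first still-missing slot value; once every slot is
        # filled, the system can make a restaurant recommendation.
        for slot in SLOTS:
            if dialogue_state.get(slot) is None:
                return f"request({slot})"
        return "recommend(restaurant)"

    state = {"cuisine": "italian", "location": None, "price range": None,
             "atmosphere": None, "party size": None}
    print(next_system_action(state))   # -> request(location)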
Figure 10. Comparison of the different activation functions in: (a) training phase; (b) test phase.
After conducting the above evaluation steps for the determination of the network parameters,
we then compared our enhanced approach to other popular collaborative filtering methods, including
the well known autoencoder AutoRec, and the latent factor model NNMF (non-negative matrix
factorization [48]) which is one of the best models in the relevant studies. In the experiments, for all
three methods, the number of epochs was 100, the code size (for our model and AutoRec) and the
latent factor (for NNMF) was 32, and the learning rate (for our model and AutoRec) was 0.005. As a
result, the loss (error) for the proposed model, the AutoRec model and the NNMF method were 1.0868,
1.4758 and 1.1293, respectively. Such results showed that the proposed method outperformed other
methods and can provide better recommendation performance.
Figure 11. Comparison of the different numbers of hidden layers in: (a) training phase; (b) test phase.
4.6. Discussion
The above experiments evaluated our approach for a service robot to provide knowledge services.
As presented, in our current design, the emotion recognition was constructed separately from the
dialogue modeling. The model was trained by a data-driven process with a static dataset. The
emotion classifier was then used to re-rank the sentences selected by the model. The separation of
emotion recognition and dialogue modeling has several advantages. The first is that the modules of
emotion recognition and response generation can be constructed by any effective methods if available;
the system thus operates more flexibly. Meanwhile, the reasons why the system generated these
responses remain interpretable to users for further analysis. The two subsystems can be integrated
into one model to optimize the corresponding structure and performance, for example, to adopt a
monolithic model with an attention mechanism to capture emotion as a special context. However, the
integrated system may thus become relatively difficult to understand and computationally expensive.
Considering the dialogue modeling, this work trained models by a data-driven process with a
static dataset. Therefore, in addition to the learning method, the quality and quantity of the dataset also
had influences on the overall performance. It was thus important to strengthen the role of knowledge
(i.e., dataset) to infer an enriched domain-specific model. Different strategies can be developed to
exploit more knowledge resources, ranging from directly linking the dataset to the up-to-date external
knowledge bases, through reorganizing the dataset to obtain an optimized data use, to a complicated
procedure of transferring knowledge between different domains. For a resource-restricted language,
a straightforward way is to take the translated datasets as shared knowledge for modeling. We showed
the effect of using translated knowledge. Our application case revealed that the translated knowledge
was learnable and the modeling performance could be kept at a similar level as when using the
original data. More advanced language translation techniques can be developed to further improve
the performance.
In contrast to the non-task-oriented dialogues, a task-oriented dialogue has a specific goal to
achieve. The dataset is more focused and thus relatively smaller. In such a system, the most critical
component is the goal tracker that is used to track the user’s intent during the dialogue to infer
the dialogue state. The system can then decide the best response accordingly to achieve the goal
(e.g., the recommendations in our experiments). Through the application presented in this work,
we have demonstrated that task-oriented dialogues can be practically launched for a manageable task
with a clear goal and a constrained dataset. During such dialogues, a fine-tuning procedure for the
model parameters needs to be carefully performed to find the best results. When the task becomes
complicated or has a high-level (or abstract) goal to achieve, more advanced state tracking and an
inferring mechanism is needed to better understand the users’ intentions.
Robotics 2020, 9, 31 18 of 20
5. Conclusions
In this work, we presented an emotion-aware dialogue framework for a service robot to achieve
natural human–robot communication. To deploy this framework, we adopted a cloud-based
service-oriented architecture and developed the emotion recognition and two types of dialogue
modeling modules on it. In the first type of service, the robot worked as a consultant to deliver
domain-specific knowledge to users. We employed a deep learning method to train different neural
models for mapping questions and answer sentences, tracking the human emotion during the process
of the human–robot dialoguing and using this additional information to determine the relevance of the
sentences obtained by the model. In the second type of dialogue service, task-oriented dialogues were
provided for assisting users to achieve specific goals. The robot continuously asked the user questions
related to the task, tracked the user’s intention through the interactions and provided suggestions
accordingly. To verify our framework, we conducted a series of experiments to evaluate the major
system components. The results confirmed the effectiveness and efficiency of the presented approach.
Currently, we are developing techniques of knowledge transfer that can extract pairs of questions and
answers from the text documents of different domains, in order to automatically enrich the dataset
for retrieval-based model training. Moreover, we plan to investigate the use of Kansei engineering
with hedge algebras to improve the granularity of the semantic and linguistic analysis in the dialogue
sentences. We also plan to integrate the characteristics and preferences of the users into the learning
model to achieve personalized dialogues.
Author Contributions: All authors discussed and commented on the manuscript at all stages. More specifically:
investigation, methodology and software, J.-Y.H., W.-P.L. and B.-W.D.; supervision, W.-P.L., data analysis and
processing, J.-Y.H., C.-C.C. and B.-W.D.; writing—original draft preparation, J.-Y.H. and W.-P.L.; writing—review
and editing, J.-Y.H., W.-P.L., C.-C.C.; funding acquisition, W.-P.L. All authors have read and agreed to the published
version of the manuscript.
Funding: This research was supported in part by the Ministry of Science and Technology of Taiwan, under
Contract MOST-108-2221-E-110-054.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Erl, T. Service-Oriented Architecture; Prentice Hall: New York, NY, USA, 2005; Volume 8.
2. Yang, T.-H.; Lee, W.-P. A service-oriented framework for developing home robots. Int. J. Adv. Robot. Syst.
2013, 10, 122. [CrossRef]
3. Huang, J.-Y.; Lee, W.-P.; Lin, T.-A. Developing context-aware dialogue services for a cloud-based robotic
system. IEEE Access 2019, 7, 44293–44306. [CrossRef]
4. Quigley, M.; Conley, K.; Gerkey, B.; Faust, J.; Foote, T.; Leibs, J.; Ng, A.Y. ROS: An open-source robot operating
system. In Proceedings of the IEEE International Conference on Robotics and Automation, Workshop on
Open-Source Robotics, Kobe, Japan, 12–17 May 2009.
5. Gao, J.; Galley, M.; Li, L. Neural approaches to conversational AI. In Proceedings of the 41st ACM SIGIR
International Conference on Research and Development in Information Retrieval, Ann Arbor, MI, USA, 8–12
July 2018; pp. 1371–1374.
6. Shang, L.; Lu, Z.; Li, H. Neural responding machine for short-text conversation. In Proceedings of the 53rd
Annual Meeting of the Association for Computational Linguistics, Beijing, China, 26–31 July 2015; Volume 1,
pp. 1577–1586.
7. Huang, J.-Y.; Lee, W.-P.; Dong, B.-W. Learning emotion recognition and response generation for a service
robot. In Proceedings of the 6th IFToMM International Symposium on Robotics and Mechatronics, Taipei,
Taiwan, 28–30 October 2019; pp. 286–297.
8. Waibel, M.; Beetz, M.; Civera, J.; D’Andrea, R.; Elfring, J.; Galvez-Lopez, D.; van de Molengraft, M.J.;
Schiesle, B. RoboEarth-A world wide web for robots. IEEE Robot. Autom. Mag. 2011, 18, 69–82. [CrossRef]
9. Mohanarajah, G.; Hunziker, D.; D'Andrea, R.; Waibel, M. Rapyuta: A cloud robotics platform. IEEE Trans.
Autom. Sci. Eng. 2015, 12, 481–493. [CrossRef]
10. Pereira, A.B.M.; Bastos, G.S. ROSRemote, using ROS on cloud to access robots remotely. In Proceedings of the
18th IEEE International Conference on Advanced Robotics, Hong Kong, China, 10–12 July 2017; pp. 284–289.
11. Kehoe, B.; Patil, S.; Abbeel, P.; Goldberg, K. A survey of research on cloud robotics and automation. IEEE Trans.
Autom. Sci. Eng. 2015, 12, 398–409. [CrossRef]
12. Saha, O.; Dasgupta, P. A comprehensive survey of recent trends in cloud robotics architectures and
applications. Robotics 2018, 7, 47. [CrossRef]
13. Simoens, P.; Dragone, M.; Saffiotti, A. The internet of robotic things: A review of concept, added value and
applications. J. Adv. Robot. Syst. 2018, 15. [CrossRef]
14. Ray, P.P. Internet of robotic things: Concept, technologies, and challenges. IEEE Access 2016, 4, 9489–9500.
[CrossRef]
15. Tian, N.; Chen, J.; Zhang, R.; Huang, B.; Goldberg, K.; Sojoudi, S. A fog robotic system for dynamic visual
servoing. In Proceedings of the IEEE International Conference on Robotics and Automation, Montreal, QC,
Canada, 20–24 May 2019; pp. 1982–1988.
16. Galambos, P. Cloud, fog, and mist computing: Advanced robot applications. IEEE Syst. Man Cybern. Mag.
2020, 6, 41–45. [CrossRef]
17. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Advances in Neural
Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014; Volume 27, pp. 3104–3112.
18. Serban, I.V.; Lowe, R.; Charlin, L.; Pineau, J. A survey of available corpora for building data-driven dialogue
systems. arXiv 2017, arXiv:1512.05742v3.
19. Hu, B.; Lu, Z.; Li, H.; Chen, Q. Convolutional neural network architectures for matching natural language
sentences. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014;
Volume 27, pp. 2042–2050.
20. Serban, I.V.; Sordoni, A.; Bengio, Y.; Courville, A.; Pineau, J. Building end-to-end dialogue systems using
generative hierarchical neural network models. In Proceedings of the 30th AAAI Conference on Artificial
Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 3776–3783.
21. Wen, T.H.; Gasic, M.; Mrksic, N.; Su, P.H.; Vandyke, D.; Young, S. Semantically conditioned LSTM-based
natural language generation for spoken dialogue systems. In Proceedings of the International Conference on
Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1711–1721.
22. Wu, S.; Zhang, D.; Li, Y.; Xie, X.; Wu, Z. HL-EncDec: A hybrid-level encoder-decoder for neural response
generation. In Proceedings of the International Conference on Computational Linguistics, Santa Fe, NM,
USA, 20–26 August 2018; pp. 845–856.
23. Li, X.; Chen, Y.-N.; Li, L.; Gao, J.; Celikyilmaz, A. End-to-end task-completion neural dialogue systems.
In Proceedings of the 8th International Joint Conference on Natural Language Processing, Taipei, Taiwan,
3 March 2017; pp. 733–743.
24. Williams, J.D.; Asadi, K.; Zweig, G. Hybrid code networks: Practical and efficient end-to-end dialog control
with supervised and reinforcement learning. arXiv 2017, arXiv:1702.03274v2.
25. Kuchaiev, O.; Ginsburg, B. Training deep autoencoders for collaborative filtering. arXiv 2017, arXiv:1708.01715v3.
26. Zhou, H.; Huang, M.; Zhang, T.; Zhu, X.; Liu, B. Emotional chatting machine: Emotional conversation
generation with internal and external memory. In Proceedings of the 32nd AAAI Conference on Artificial
Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 730–738.
27. Sun, X.; Peng, X.; Ding, S. Emotional human-machine conversation generation based on long short-term
memory. Cogn. Comput. 2018, 10, 389–397. [CrossRef]
28. Asghar, N.; Poupart, P.; Hoey, J.; Jiang, X.; Mou, L. Affective neural response generation. In 40th European
Conference on Information Retrieval Research; Springer: Cham, Switzerland, 2018; pp. 154–166.
29. Ghosh, S.; Vinyals, O.; Strope, B.; Roy, S.; Dean, T.; Heck, L. Contextual LSTM (CLSTM) models for large
scale NLP tasks. arXiv 2016, arXiv:1602.06291.
30. Appel, O.; Chiclana, F.; Carter, J.; Fujita, H. A hybrid approach to the sentiment analysis problem at the
sentence level. Knowl. Based Syst. 2016, 108, 110–124. [CrossRef]
31. Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python; O'Reilly Media: Sebastopol, CA, USA, 2009.
32. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of
the International Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29
October 2014; pp. 1532–1543.
33. Ramachandran, P.; Barret, Z.; Le Quoc, V. Searching for activation functions. In Proceedings of the Sixth
International Conference on Learning Representations, Workshop Track, Vancouver, BC, Canada, 30 April–3
May 2018.
34. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the Third International
Conference for Learning Representations, San Diego, CA, USA, 22 December 2015.
35. Huang, J.-Y.; Lin, T.-A.; Lee, W.-P. Using deep learning and an external knowledge base to develop
human-robot dialogues. In Proceedings of the IEEE International Conference on Systems, Man, and
Cybernetics, Miyazaki, Japan, 7–10 October 2018; pp. 3699–3704.
36. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases
and their compositionality. In Advances in Neural Information Processing Systems 26; Curran Associates, Inc.:
New York, NY, USA, 2013; pp. 3111–3119.
37. Řehůřek, R.; Sojka, P. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the
LREC 2010 Workshop New Challenges for NLP Frameworks, Valletta, Malta, 22 May 2010; pp. 46–50.
38. Feng, M.; Xiang, B.; Glass, M.R.; Wang, L.; Zhou, B. Applying deep learning to answer selection: A study and
an open task. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding,
Scottsdale, AZ, USA, 13–17 December 2015; pp. 813–820.
39. Hori, C.; Perez, J.; Higashinaka, R.; Hori, T.; Boureau, Y.L.; Inaba, M.; Tsunomori, Y.; Takahashi, T.; Yoshino, K.;
Kim, S. Overview of the sixth dialog system technology challenge: DSTC6. Comput. Speech Lang. 2019, 55,
1–25. [CrossRef]
40. Mrkšić, N.; Séaghdha, D.O.; Wen, T.H.; Thomson, B.; Young, S. Neural belief tracker: Data-driven dialogue
state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics,
Vancouver, BC, Canada, 30 July–4 August 2017; Volume 1, pp. 1777–1788.
41. Baldi, P. Autoencoders, unsupervised learning, and deep architectures. In Proceedings of the International
Conference on Machine Learning, Workshop on Unsupervised and Transfer Learning, Edinburgh, Scotland,
26 June–1 July 2012; pp. 37–49.
42. Sedhain, S.; Menon, A.K.; Sanner, S.; Xie, S. AutoRec: Autoencoders meet collaborative filtering.
In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18–22 May
2015; pp. 111–112.
43. Tan, P.; Steinbach, M.; Kumar, V. Introduction to Data Mining; Addison-Wesley: Reading, MA, USA, 2005.
44. The Face Dataset. Available online: http://robotics.csie.ncku.edu.tw/Databases/FaceDetect_Pose_Estimate.htm#Our_Database (accessed on 10 January 2018).
45. Phan, D.A.; Shindo, H.; Matsumoto, Y. Multiple emotions detection in conversation transcripts. In Proceedings
of the 30th Pacific Asia Conference on Language, Information and Computation, Seoul, Korea, 28–30 October
2016; pp. 85–94.
46. Russell, J.A. A circumplex model of affect. J. Personal. Soc. Psychol. 1980, 39, 1161. [CrossRef]
47. The Yelp Dataset. Available online: https://www.yelp.com/dataset/ (accessed on 20 May 2019).
48. Févotte, C.; Idier, J. Algorithms for nonnegative matrix factorization with the β-divergence. Neural Comput.
2011, 23, 2421–2456. [CrossRef]
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).