Memory-augmented Dialogue Management for Task-oriented Dialogue Systems

ZHENG ZHANG and MINLIE HUANG∗, Tsinghua University
ZHONGZHOU ZHAO, FENG JI, and HAIQING CHEN, Alibaba Group
XIAOYAN ZHU, Tsinghua University

arXiv:1805.00150v1 [cs.CL] 1 May 2018

Dialogue management (DM) decides the next action of a dialogue system according to the current dialogue
state, and thus plays a central role in task-oriented dialogue systems. Since dialogue management requires
access to not only local utterances but also the global semantics of the entire dialogue session,
modeling long-range history information is a critical issue. To this end, we propose a novel
Memory-Augmented Dialogue management model (MAD) which employs a memory controller and two additional
memory structures, i.e., a slot-value memory and an external memory. The slot-value memory tracks the
dialogue state by memorizing and updating the values of semantic slots (for instance, cuisine, price, and
location), and the external memory augments the representation of hidden states of traditional recurrent
neural networks by storing more context information. To update the dialogue state efficiently, we also propose
slot-level attention on user utterances to extract specific semantic information for each slot. Experiments
show that our model can obtain state-of-the-art performance and outperforms existing baselines.
CCS Concepts: • Computing methodologies → Discourse, dialogue and pragmatics; Neural networks;
• Software and its engineering → Semantics;
Additional Key Words and Phrases: Dialogue Management, Attention, Dialogue State, Memory Network,
Neural Network
ACM Reference Format:
Zheng Zhang, Minlie Huang, Zhongzhou Zhao, Feng Ji, Haiqing Chen, and Xiaoyan Zhu. 2018. Memory-
augmented Dialogue Management for Task-oriented Dialogue Systems. ACM Transactions on Information
Systems 1, 1, Article 11 (April 2018), 25 pages. https://doi.org/0000001.0000001

1 INTRODUCTION
Task-oriented dialogue systems offer a natural and effective interface for users to seek information
and complete complex tasks in an interactive manner. Such systems often collect users’ preferences
in the course of dialogue before issuing the final query to the knowledge base (such as booking
a flight ticket). There are also some works [12, 25] viewing the task-oriented dialogue task as a
context-aware, multi-turn question answering (QA) task in which a user can interact with the
system in multi-turn contexts and the system also has access to the knowledge base.

∗ Corresponding author

This work was partly supported by the National Basic Research Program (973 Program) under grant No. 2013CB329403, and
the National Science Foundation of China under grant No. 61272227/61332007.
Authors’ addresses: Zheng Zhang; Minlie Huang, Tsinghua University, Department of Computer Science and Technology,
Beijing, 100084, [email protected], [email protected]; Zhongzhou Zhao; Feng Ji; Haiqing Chen, Alibaba
Group, Hangzhou, Zhejiang, 311121, [email protected], [email protected], haiqing.chenhq@alibaba-inc.com;
Xiaoyan Zhu, Tsinghua University, Department of Computer Science and Technology, Beijing, 100084,
[email protected].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the
full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM.
1046-8188/2018/4-ART11 $15.00
https://doi.org/0000001.0000001

ACM Transactions on Information Systems, Vol. 1, No. 1, Article 11. Publication date: April 2018.

Fig. 1. The processing flow of a task-oriented dialogue system. Natural language understanding (NLU) parses
the user utterance and extracts structured semantic information from the utterance, dialogue management
receives the semantic information and decides the next dialogue act that the system should take, and natural
language generation (NLG) translates the dialogue act to a natural language response. In some cases, NLU
and DM can be coupled together as a single module, and the semantic information produced by NLU is often
unstructured in this situation, such as the output of a neural network.

Unlike open-domain conversational systems, which are often modeled in an end-to-end
manner, task-oriented dialogue systems are generally composed of several cascaded processes, as
shown in Figure 1, including natural language understanding (NLU), dialogue management (DM),
and natural language generation (NLG). Dialogue management, which is in charge of selecting
actions in response to user inputs, plays a central role in task-oriented dialogue systems [6, 39].
It takes as input the user intent analyzed by NLU, interacts with the knowledge base, and
decides the next system action. Sometimes NLU and DM can be coupled together as a single module
which can be trained end-to-end to read directly from the user utterance and produce the system action.
The system action produced by DM is then translated into a natural language utterance by NLG
[35] to interact with users.
In order to decide the next action a dialogue system should take, dialogue management, particu-
larly in task-oriented dialogue systems, should deal with the dialogue context information. It needs
to access not only local utterances, but also the global information about what has been addressed
several turns ago. The global history information, which is often referred to as dialogue state, is a
key factor in dialogue systems. Based on the dialogue state, the dialogue manager then produces
system action according to its policy. The task of dialogue management is sometimes divided into
two subtasks, namely dialogue state tracking which maintains dialogue history information, and
dialogue policy which selects the next system action based on the dialogue state.
Early methods of modeling dialogue management are mostly rule-based, in which the state update
and dialogue policy process are manually defined, but these methods did not take into account
the uncertainty in dialogue. Bayesian network methods [22, 39] formulated dialogue
management as a probabilistic graphical model which models the conditional dependency between
different states, and each specific state is bound with an action to be taken, but the definition of
dialogue state still needs manually crafted rules. Recently, many neural network methods have been
proposed for dialogue management due to their capability of semantic representation and automatic
feature extraction, and they have obtained state-of-the-art performance on many dialogue tasks [7, 28]. More
specifically, most neural dialogue models are based on an RNN (Recurrent Neural Network) which takes
as input the user utterance and system response at each dialogue turn, and the hidden state of the RNN is
utilized as the representation of the dialogue state [11, 38].


However, despite the success of RNNs on various text modeling tasks, a simple RNN has been shown
to perform poorly on dialogue tasks [38] because it relies on a single hidden state vector,
which limits its ability to model long-range contexts. Hierarchical RNN structures [29] and
memory networks [3, 5, 37] are feasible solutions to this issue, but existing neural models still lack
an explicit memorization of the history semantics of the entire dialogue session: the dialogue act
types, semantic slots, and the values of the slots are not explicitly processed during the interaction.
Another important issue is to extract semantic information from the user utterance when combining
NLU and DM together, which is the case in most end-to-end dialogue systems. Such semantic
information is critical for dialogue state update. Existing methods either extract information from
predefined features (such as POS and NER tags) by heuristic rules [11], or from pretrained word
embeddings by a neural network encoder [21]. However, words in a user utterance have different
importance for updating dialogue states and predicting the next action, which is not taken into
consideration by previous methods. For example, in the user utterance "I want to book a table in
Beijing Hotel", the word "book" apparently contributes more than the word "want" to the user intent.
Furthermore, each word contributes differently to different slots, e.g., the word British is more related
to the slot Cuisine while north is more related to Location, as shown in Figure 2.
To address the above issues, we propose a novel Memory-Augmented Dialogue management
model (MAD) which attentively receives user utterances as input and predicts the next dialogue
act¹. A dialogue act is composed of two parts in our model: dialogue act type and slot-value pairs, as
shown in Table 1. Dialogue act type indicates the intent type such as Query or Recommendation,
which is a high-level representation of dialogue act. Slot-value pairs denote key elements of a task,
and represent the key semantic information supplied by the user during the interaction, which also
indicate the state of the dialogue.
We design two memory modules, namely a slot-value memory and an external memory, which
can be written (or updated) and read, to enhance the ability of modeling history semantics of
dialogues. A memory controller is introduced to control the write and read operations to the two
memories. The slot-value memory explicitly memorizes and updates the values of the semantic
slots during interaction. The write to the slot-value memory units, each corresponding to a slot, is
implemented by a slot-level attention mechanism. In this manner, the slot-value memory provides
an observable and interpretable representation of the dialogue state. The external memory serves
as a supplement to the single hidden state of an RNN structure and provides a better capacity to
store more historical dialogue information. A complete dialogue act (consisting of dialogue act
type and slot-value pairs) for the next interaction is predicted based on the slot-value memory and
external memory.

Utterance            How about a British restaurant in north part of town.
Dialogue act type    Query
Slot-value pairs     Cuisine=British, Location=north
(auxiliary) Mask     Rating   Cuisine   Price   Service   Location
                     0        1         0       0         1

Table 1. An example of dialogue act for a given utterance. Dialogue act type is a high-level representation of
an utterance. Slot-value pairs are the task-specific semantic elements that are mentioned in an utterance.

Our contributions are summarized as follows:

• We propose a novel memory-augmented dialogue management model by introducing two
memory networks. The slot-value memory network maintains the values of semantic slots
during interaction, and the external memory augments the single state representation of
the recurrent networks. Both memory modules enable the model to access not only local
utterances, but also the global semantics of the entire dialogue session.
• We propose an attention mechanism for updating the dialogue state. In particular, the model
first computes a weight distribution over all words in a user utterance for each slot. Then,
the weighted representation of the utterance is used to update the memory unit for each slot.
• The model can offer more observable and interpretable results in that the slot-value memory
can track the change of dialogue states explicitly.

¹ The dialogue act can be translated into a natural language utterance by a language generator, as shown in [35].

2 RELATED WORK
The role of dialogue management (DM) is to launch the next interaction through predicting the
next action the system should take, or by generating an utterance directly in response to user’s
query. The previous studies on DM can be broadly classified into three types: rule-based models,
Bayesian network models, and neural models.
Rule-based approaches date back to very early dialogue systems [34]. Several architectures have been
proposed to formulate the process of dialogue management. The flow diagram approach [19] used
a finite-state machine to model state transition in dialogue, where the state represents a certain
dialogue status, and the transition between states is triggered by the corresponding type of a user
utterance. Slot-filling approaches [8] expanded the definition of dialogue state to an aggregation of
slots and values. In such models, a user can talk about each slot by issuing constraints and requesting
the values of slots, and the dialogue state is updated whenever a user provides new values
for the slots during interaction. Though rule-based DM models work well in some applications,
these approaches have apparent difficulties in task and domain adaptation [41] because the rules
are usually tailored to a specific scenario. Due to the nature of hand-crafted rules, the variety and
diversity of language are not well addressed. The need for hand-crafted rules also makes it expensive
to build a rule-based system.
Bayesian network approaches were proposed to address the issues of rule-based methods. Dialogue
management was first formalized as a Markov decision process (MDP) [16] under the Markov
assumption [22], in which the new state $s_t$ at turn $t$ is only conditioned on the previous state $s_{t-1}$
and system action $a_{t-1}$. The MDP models the uncertainty in dialogue and is more robust to the
errors induced by speech recognition and NLU. The partially observable Markov decision process
(POMDP) [39] provides a more principled approach in that it takes the environment observation $o_t$ into
consideration. On top of this framework, state transition and dialogue policy are trained using
reinforcement learning. However, the POMDP model becomes difficult to train for domains
with a large state space. An improved version of POMDP, the Hidden Information State (HIS) model [40],
was proposed to address this problem by grouping dialogue states into partitions. Another key
problem in building Bayesian dialogue models is the lack of training corpora, thus user simulation
[27] is employed to enhance the training procedure, where dialogue data can be collected through
interactions between a user simulator and a target system. In spite of the success of Bayesian
network methods, designing an appropriate reward function and manually crafting features limit
the applicability of such approaches. As a noticeable defect, the state in these approaches is still
manually defined, requiring a large amount of human labor.
A variety of neural models have recently been applied to the dialogue management task. Since
a dialogue session can naturally be cast as a sequence-to-sequence learning problem at
the turn level, recurrent neural networks (RNNs) have been proposed to model the process [11, 21, 36]. At
each turn, the RNN takes as input the structured semantic representation produced by NLU (or the raw
user utterance when combining NLU and DM together) and predicts the system action, where the
hidden state of the RNN is utilized as the representation of a dialogue state. There are also some neural
end-to-end models which directly take dialogue context as input and generate natural language
responses [17, 28, 30, 31] in open-domain conversational systems. However, due to the vanishing
gradient problem and the limited capacity of the state representation, it is difficult for an RNN to capture
the long-range context in dialogue. Hybrid Code Networks [38] handle the state representation
problem by combining rule-based and RNN-based models, but their performance is still
highly dependent on the hand-crafted rules.

Fig. 2. Slot-level attention: word mentions in user utterance are mapped to semantic slots such as rating,
cuisine, price, service, and location.

Memory networks provide a principled approach for modeling long-range dependency and
performing multi-hop reasoning, which has advanced many NLP tasks such as machine translation [33]
and question answering [32]. Neural Turing machines [9] were proposed to augment existing neural
models with additional memory units to solve complicated tasks. The model is analogous to a Turing machine
but is differentiable end-to-end. [37] proposed fully supervised memory networks which employ
supervision signals not only from answer labels but also from pre-specified supporting facts. [32]
proposed end-to-end memory networks (MEMN2N) which can be trained end-to-end without any
intervention on which supporting fact should be used during training. The dynamic memory network
proposed by [15] uses a sentence-level attention mechanism to update its internal memory during
multi-hop inference. The key-value memory network [20] encodes prior knowledge by introducing
a key memory structure which stores facts used to address the relevant memory values. There are
already some works which introduced memory networks into the task of dialogue management
[24], where memory networks are straightforwardly applied in a machine reading manner. In
comparison, our model is better able to model the long-range history semantics of the dialogue session
by memorizing and updating the dialogue act types and the values of semantic slots explicitly,
which is implemented through a slot-value memory and an external memory.
Extracting semantic information from user utterance is a key issue in task-oriented dialogue
systems when combining NLU and DM together. Early methods used hand-crafted rules and
semantic features, including NER and POS tags, to construct semantic features for user utterance.
[11] proposed to use the speech recognition confidence score as an additional feature. [28, 30] used
hierarchical RNN models, where the user utterance is processed by a word-level RNN, and utterances
are sequentially connected through an utterance-level RNN. [21] proposed to use convolutional
neural network (CNN) model for semantic feature extraction. However, existing approaches did
not consider the fact that words in an utterance contribute differently to different slots, which is
important for updating the dialogue state.

3 MEMORY-AUGMENTED DIALOGUE MANAGEMENT WITH SLOT-ATTENTION


3.1 Task Definition
This paper deals with task-oriented dialogue management. We start by defining the input and
output of our model. At the current turn ($t$) of a dialogue, given a user utterance along with the
system response of the previous turn ($t-1$), the task of the dialogue management module is to predict
the next system dialogue act that will be utilized to generate a natural language utterance. This
procedure can be formalized as follows:
$$P_\theta(DA_t \mid x_1, y_1, \ldots, x_{t-1}, y_{t-1}, x_t)$$
where $x_t$ and $y_{t-1}$ are the user utterance at the current turn and system response at the previous
turn, respectively, and $DA_t$ is the next dialogue act which can be used to generate the system response.
$\theta$ represents the parameters of the model. The next system response $y_t$ will be generated from $DA_t$
by a natural language generator, which is beyond the scope of this paper.

Fig. 3. Memory-augmented Dialogue Management (MAD): At each dialogue turn $t$, the model takes as input
the current user utterance and the previous system response, and predicts the next dialogue act. The slot-value
memory is updated with an attentive read of the user utterance by a slot-level attention mechanism while the
external memory is read and updated by the controller. The memory controller along with the two memory
modules will predict the next dialogue act of the system by a classifier.

To exemplify the concept of dialogue act in our model, we take the task of restaurant reservation
as an example, as shown in Table 1. A dialogue act (DA) is composed of two elements: dialogue
act type and slot-value pairs. Dialogue act type is a general description of user intents, such as
Query where the user may search for some information, and Recommend where the user may
ask for some recommendations. A slot-value pair represents a filled value for a slot², such as
Location=north, Price=expensive and Cuisine=British. The slot-value pairs are usually regarded
as the state representation in many dialogue state tracking studies [11]. During the interaction,
the filled value for each slot may be provided or updated by the user, and correspondingly, the
dialogue state changes. For instance, when the user says "How about a British restaurant in north
part of town.", two slot-value pairs, Cuisine=British and Location=north, will be updated. However,
not all slot-value pairs which are mentioned in the context are to be addressed in the dialogue act
of the system response. We thus introduce an auxiliary variable Mask, a binary vector of
dimension $n_s$ (the number of slots), to decide which slot-value pairs are to be included in
the next dialogue act. As shown in Table 1, the slots that appear in the dialogue act are only Cuisine and
Location, and their Mask values are set to 1. In previous dialogue turns, the values of other slots may
have already been mentioned, but these values are not needed for the system response of this turn, and
their Mask values are 0. Generally speaking, a dialogue act can be viewed as the structured semantic
representation of a natural language sentence.

² Generally speaking, a slot in task-oriented dialogue systems is a category of semantic features, which defines some key
attribute or element for accomplishing a task.
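
The structure of a dialogue act described above can be sketched as a small Python data class. This is only an illustration of Table 1, not part of the proposed model: the names `SLOTS`, `DialogueAct`, and `masked_pairs` are hypothetical, and the Price value is assumed to have been tracked from an earlier turn so that the effect of the mask is visible.

```python
from dataclasses import dataclass

# Slot inventory in the order used by Table 1 (assumed fixed for the task).
SLOTS = ["Rating", "Cuisine", "Price", "Service", "Location"]

@dataclass
class DialogueAct:
    act_type: str      # e.g. "Query" or "Recommend"
    slot_values: dict  # tracked state: slot name -> latest value
    mask: list         # one 0/1 entry per slot, aligned with SLOTS

    def masked_pairs(self):
        """Return only the slot-value pairs whose mask entry is 1."""
        return {s: self.slot_values[s]
                for s, m in zip(SLOTS, self.mask)
                if m == 1 and s in self.slot_values}

# Price is assumed to come from an earlier turn; the mask leaves it out.
da = DialogueAct(
    act_type="Query",
    slot_values={"Cuisine": "British", "Location": "north", "Price": "expensive"},
    mask=[0, 1, 0, 0, 1],
)
print(da.masked_pairs())
```

Keeping the tracked state and the mask separate mirrors the distinction made in the text: the state remembers every filled slot, while the mask selects what the next dialogue act actually addresses.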

3.2 Overview
As shown in Figure 3, the memory-augmented dialogue management model has two novel memory
components, namely the slot-value memory ($M^S$, $M^V_t$) and the external memory $M^E_t$. The slot-value
memory consists of a static slot memory ($M^S$) and a dynamic value memory ($M^V_t$), where each memory
unit $M^S(i)$ in $M^S$ is mapped to a unique unit $M^V_t(i)$ in $M^V_t$. $M^S$ remains unchanged during the
interaction, while $M^V_t$ and $M^E_t$ are updated at each turn $t$. We also design an RNN-based memory
controller which controls the read and write operations of the slot-value memory and external memory. The
slot-value memory is updated with an attentive read of the user utterance by a slot-level attention
mechanism while the external memory is read and updated by the controller. The memory
controller along with the two memory modules will predict the next dialogue act of the system by a
set of classifiers.
Let $x_t = (e^x_1, \ldots, e^x_{n_{x,t}})$ and $y_{t-1} = (e^y_1, \ldots, e^y_{n_{y,t-1}})$³ denote the word embedding sequence of the
user utterance at turn $t$ and the preceding system response at turn $t-1$, respectively, where
$e^x_i, e^y_j \in \mathbb{R}^m$ are word embeddings, and $n_{x,t}$ and $n_{y,t-1}$ are the lengths of the two sequences. At each turn $t$,
our model works in the following procedure:

1. Memory Read: The controller reads information from the value memory and external memory.
The read of the value memory is conditioned on the controller state ($S_{t-1}$) and the value memory ($M^V_{t-1}$) at the
previous turn, and the slot memory, formally as follows:
$$r^V_t = \mathrm{read}_v(S_{t-1}, M^S, M^V_{t-1}), \tag{1}$$
and the read of the external memory is conditioned on the controller state and the external memory at
the previous turn:
$$r^E_t = \mathrm{read}_e(S_{t-1}, M^E_{t-1}). \tag{2}$$
Inspired by [9], we introduce content-based addressing for memory read. $r^V_t, r^E_t \in \mathbb{R}^m$ are content
vectors read from the slot-value memory and the external memory, respectively.

2. Controller State Update: The controller state $S_{t-1}$ is then updated by the information read
from the value memory and the external memory, and the content from $x_t$ and $y_{t-1}$:
$$S_t = \mathrm{GRU}(S_{t-1}, [x_t; y_{t-1}; r^V_t; r^E_t]) \tag{3}$$
where GRU stands for gated recurrent units [4], and $[\cdot;\cdot]$ denotes the concatenation of vectors.
For simplicity, an utterance ($x_t$ / $y_{t-1}$) is represented by the averaged word embeddings, but more
elaborate representation models are also applicable.

3. Memory Write: Memory vectors in $M^V_t$ and $M^E_t$ are updated based on $S_t$ and their previous
values:
$$M^V_t = \mathrm{write}_v(S_t, M^S, M^V_{t-1}) \tag{4}$$
$$M^E_t = \mathrm{write}_e(S_t, M^E_{t-1}) \tag{5}$$

The output at turn $t$ is obtained based on $S_t$ and $M^V_t$. The output consists of the elements of a
dialogue act, that is, the dialogue act type, slot-value pairs and a mask. Note that the slot memory
$M^S$ is static and does not need to be updated.

³ Note that $y_{t-1}$ is the system response at turn $t-1$ while $y_t$ is to be generated with a predicted $DA_t$.
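
The controller state update of Eq. 3 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the GRU cell is a standard bias-free variant written out by hand, the sizes and random parameters are toy values, and the names `gru_step` and `params` are hypothetical. The read vectors are random placeholders standing in for the outputs of the memory reads.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(s_prev, inp, p):
    """One bias-free GRU step: new state from previous state and input."""
    z = sigmoid(p["Wz"] @ inp + p["Uz"] @ s_prev)        # update gate
    r = sigmoid(p["Wr"] @ inp + p["Ur"] @ s_prev)        # reset gate
    h = np.tanh(p["Wh"] @ inp + p["Uh"] @ (r * s_prev))  # candidate state
    return (1.0 - z) * s_prev + z * h

rng = np.random.default_rng(0)
m, d = 4, 6  # embedding size m and controller state size d (toy values)

x_words = rng.normal(size=(5, m))  # embeddings of the 5 words in x_t
y_words = rng.normal(size=(3, m))  # embeddings of the 3 words in y_{t-1}
r_v = rng.normal(size=m)           # placeholder for the value-memory read
r_e = rng.normal(size=m)           # placeholder for the external-memory read

# Utterances are represented by their averaged word embeddings.
x_t, y_prev = x_words.mean(axis=0), y_words.mean(axis=0)
inp = np.concatenate([x_t, y_prev, r_v, r_e])  # [x_t; y_{t-1}; r^V_t; r^E_t]

params = {k: rng.normal(scale=0.1, size=(d, 4 * m)) for k in ("Wz", "Wr", "Wh")}
params.update({k: rng.normal(scale=0.1, size=(d, d)) for k in ("Uz", "Ur", "Uh")})

s_prev = np.zeros(d)                 # initial controller state
s_t = gru_step(s_prev, inp, params)  # Eq. 3
print(s_t.shape)
```

In a trained system the parameters would be learned jointly with the memory modules; the sketch only shows how the four inputs are concatenated and fed to the recurrent cell.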

3.3 Slot-Value Memory


The slot-value memory tracks the dialogue state by storing and updating the value of each semantic
slot during interaction. It is composed of two components: slot memory and value memory, and
both of them are composed of the same number ($n_s$) of column vectors. The slot memory is kept
constant during the dialogue, with each column vector $M^S(i)$ corresponding to a semantic slot $i$.
Examples of semantic slots are Location, Price, and Cuisine. Inspired by [20], each slot memory unit $M^S(i)$
in our model acts as an index, which helps to locate the content in $M^V_t$. In our proposed model, we
further apply the slot memory unit to extract slot-relevant information from the user utterance.
Thus we keep $M^S$ unchanged during training and test time, and $M^S(i)$ is initialized by the averaged
embeddings of words in slot $i$.
The value memory stores the value of each slot $i$ in $M^V_t(i)$. During the dialogue, the value of a
slot may be added into the memory when a new slot is mentioned, or the old value of a previously
mentioned slot may be updated to a new value. That is, each memory unit in the value memory
stores the latest value (possibly empty) of a semantic slot.

Read from the slot-value memory In our model, the main function of the slot-value memory
is to track the latest value of each slot, which is critical for predicting the slot-value pairs in the
dialogue act. However, the effect of the slot-value memory on the state update of the controller is
not straightforward. Thus, we employ a simple method for the read from the slot-value memory,
which is the average of the vectors in the value memory:
$$r^V_t = \frac{1}{n_s} \sum_{i} M^V_{t-1}(i), \tag{6}$$
where $n_s$ is the number of slots.

Write to the slot-value memory The write to $M^V_t(i)$ depends on slot addressing, which decides
how much information should be updated for each slot given a user utterance. Ideally, the
value memory is supposed to update its values for all slots that are mentioned in a user utterance.
For example, when the user inputs the utterance "I want a Chinese restaurant", the model updates the slot
Cuisine with a new value Chinese.
Inspired by [9, 20], we apply a slot addressing technique to decide the amount of information that
should be written to the value memory vector of each slot given a user utterance:
$$M^V_t(i) = \beta^i_t c^i_t + (1 - \beta^i_t) M^V_{t-1}(i) \tag{7}$$
The first term is new information obtained from the attentive representation ($c^i_t$) of utterance $x_t$
and the second term is the old information maintained. The attentive representation $c^i_t$ of utterance
$x_t$, described in Section 3.4, essentially decides the relatedness of the user utterance to slot $i$. $\beta^i_t$ is a
gate which controls how much $M^V_t(i)$ should be updated, and it depends on the attentive read $c^i_t$ and
the last system response $y_{t-1}$:
$$\beta^i_t = \mathrm{sigmoid}(W^c_i [y_{t-1}; c^i_t] + b^c_i) \tag{8}$$
If utterance $x_t$ mentions slot $i$, $\beta^i_t$ will be large, and the corresponding value memory unit $M^V_t(i)$
will be updated substantially; otherwise much less information will be updated with a smaller $\beta^i_t$.


Fig. 4. Slot-level attention mechanism for updating the slot-value memory. For each slot $i$, the attention score
$\alpha_{i,j}$ for each word $j$ is calculated based on word embeddings $e_j$ and slot memory $M^S(i)$. Context vector $c^i$ is
the weighted sum of word embeddings of the utterance. Finally, the value memory is updated based on the
previous value vector and the context vector. Note that the attention mechanism is applied on each slot $i$.

In order to better train these $\beta^i_t$, we employ additional supervision on the weight, as defined in
$L_{att}$ (see Eq. 26).
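
The read and write operations of the slot-value memory (Eqs. 6 to 8) can be sketched in NumPy as below. This is an illustrative sketch only: memory units are stored as rows rather than columns, the gate $\beta^i_t$ is assumed to be a scalar per slot (so each $W^c_i$ is a row vector), and the function names and toy parameters are hypothetical. The gate biases are set by hand to show the two extremes of the update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def read_value_memory(M_v):
    """Eq. 6: average the n_s value-memory vectors (rows of M_v)."""
    return M_v.mean(axis=0)

def write_value_memory(M_v_prev, C, y_prev, W, b):
    """Eqs. 7-8: gated interpolation between old values and attentive reads.

    M_v_prev: (n_s, m) previous value memory, one row per slot
    C:        (n_s, m) attentive representations c_t^i, one row per slot
    y_prev:   (m,)     representation of the previous system response
    W, b:     (n_s, 2m) and (n_s,) per-slot gate parameters (assumed shapes)
    """
    n_s = M_v_prev.shape[0]
    feats = np.concatenate([np.tile(y_prev, (n_s, 1)), C], axis=1)  # [y_{t-1}; c_t^i]
    beta = sigmoid(np.sum(W * feats, axis=1) + b)                   # Eq. 8, shape (n_s,)
    M_v = beta[:, None] * C + (1.0 - beta[:, None]) * M_v_prev      # Eq. 7
    return M_v, beta

rng = np.random.default_rng(0)
n_s, m = 5, 4
M_prev = np.zeros((n_s, m))
C = rng.normal(size=(n_s, m))
y_prev = rng.normal(size=m)
W = np.zeros((n_s, 2 * m))
b = np.array([10.0, -10.0, 0.0, 0.0, 0.0])  # slot 0 gate open, slot 1 gate closed

M_new, beta = write_value_memory(M_prev, C, y_prev, W, b)
r_v = read_value_memory(M_new)
print(np.round(beta, 3))
```

With the gate nearly open, a slot's value vector is overwritten by its context vector; with the gate nearly closed, the previous value is kept, which is the behavior the text describes for mentioned and unmentioned slots.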

3.4 Slot-level Attention


The context vector $c^i_t$ in the above section is an attentive representation of utterance $x_t$, conditioned
on the $i$-th slot vector. More formally, for a user utterance $x_t = (e^x_1, \ldots, e^x_{n_{x,t}})$, we compute attention
weights $(\alpha_{i,1}, \ldots, \alpha_{i,j}, \ldots, \alpha_{i,n_{x,t}})$ where each weight indicates the similarity of a word embedding $e^x_j$
to a slot memory unit $M^S(i)$, as follows:
$$c^i_t = \sum_{j=1}^{n_{x,t}} \alpha_{i,j} e^x_j \tag{9}$$
$$\alpha_{i,j} = \frac{\exp(d_{i,j})}{\sum_{k=1}^{n_{x,t}} \exp(d_{i,k})} \tag{10}$$
$$d_{i,j} = \mathrm{MLP}([M^S(i), e^x_j]) \tag{11}$$
For the previous example, the weight between the word Chinese and the slot Cuisine will be large, while
the weights between other words and this slot will be much smaller. The learning of $\alpha_{i,j}$ is also
supervised as shown in $L_{att}$ (see Eq. 26).
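
The slot-level attention of Eqs. 9 to 11 can be sketched in NumPy as follows. The text does not specify the form of the MLP, so a single tanh hidden layer is assumed here; the function name `slot_attention`, the hidden size, and the random parameters are all hypothetical, and slot vectors are stored as rows.

```python
import numpy as np

def slot_attention(M_s, E, W1, b1, w2):
    """Slot-level attention over one utterance.

    M_s: (n_s, m) slot memory, one row per slot
    E:   (n, m)   word embeddings of the utterance
    W1, b1, w2: parameters of a one-hidden-layer MLP scorer (assumed form)
    Returns attention weights alpha (n_s, n) and context vectors C (n_s, m).
    """
    n_s, n = M_s.shape[0], E.shape[0]
    # Build every [M^S(i), e_j] pair: shape (n_s, n, 2m).
    pairs = np.concatenate(
        [np.repeat(M_s[:, None, :], n, axis=1),
         np.repeat(E[None, :, :], n_s, axis=0)], axis=2)
    d = np.tanh(pairs @ W1 + b1) @ w2          # Eq. 11: scores d_{i,j}
    d = d - d.max(axis=1, keepdims=True)       # shift for numerical stability
    e = np.exp(d)
    alpha = e / e.sum(axis=1, keepdims=True)   # Eq. 10: softmax over words
    C = alpha @ E                              # Eq. 9: weighted sums of embeddings
    return alpha, C

rng = np.random.default_rng(0)
n_s, n, m, h = 5, 7, 4, 8  # slots, words, embedding size, hidden size (toy values)
M_s = rng.normal(size=(n_s, m))
E = rng.normal(size=(n, m))
W1 = rng.normal(scale=0.5, size=(2 * m, h))
b1 = np.zeros(h)
w2 = rng.normal(scale=0.5, size=h)

alpha, C = slot_attention(M_s, E, W1, b1, w2)
print(alpha.shape, C.shape)
```

Each row of `alpha` is a separate distribution over the words of the utterance, so every slot obtains its own context vector from the same sentence.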

3.5 External Memory


The external memory is used to augment the representation capacity of the single state of an RNN,
and it is sometimes referred to as memory state [33] in other works. Unlike the slot-value
memory, the external memory is not endowed with explicit semantic meaning in our framework. The
external memory $M^E_t$ consists of $n_e$ columns of $m$-dimensional memory unit vectors, which are read
and written during the dialogue under the control of the memory controller.
Read. The read vector $r_t^E$ at turn $t$ is a weighted sum of the memory units:

$$r_t^E = \sum_{i=1}^{n_e} w_t^r(i) \cdot M_{t-1}^E(i) \tag{12}$$

where $n_e$ is the number of external memory units. The weight $w_t^r \in \mathbb{R}^{n_e}$ is given by

$$w_t^r = g_t^r \cdot w_{t-1}^r + (1 - g_t^r) \cdot \tilde{w}_t^r \tag{13}$$

where $g_t^r \in \mathbb{R}^{n_e}$ is an update gate which controls how much of the previous weight $w_{t-1}^r$
is carried over, and $\tilde{w}_t^r$ is a weight controlling new information to read from $M_{t-1}^E$,
conditioned on the state of the controller $S_{t-1}$:

$$g_t^r = \sigma(W_g^r S_{t-1}) \tag{14}$$

$$\tilde{w}_t^r = \mathrm{softmax}(v^\top [M_{t-1}^E(i); S_{t-1}]) \tag{15}$$
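The read operation in Eqs. 12-15 can be sketched in NumPy as follows. The function name `read_memory`, the helper functions, and all toy shapes are assumptions for illustration; the gate is applied element-wise, which is one plausible reading of the product in Eq. 13.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def read_memory(memory_prev, state_prev, w_prev, W_g, v):
    """memory_prev: (n_e, m); state_prev: (s,); w_prev: (n_e,); W_g: (n_e, s); v: (m + s,)."""
    n_e = memory_prev.shape[0]
    gate = sigmoid(W_g @ state_prev)                        # Eq. 14, one gate per memory unit
    joined = np.concatenate(
        [memory_prev, np.tile(state_prev, (n_e, 1))], axis=1
    )                                                       # rows are [M(i); S]
    w_cand = softmax(joined @ v)                            # Eq. 15, candidate read weight
    w_read = gate * w_prev + (1.0 - gate) * w_cand          # Eq. 13, gated interpolation
    r = w_read @ memory_prev                                # Eq. 12, weighted sum of units
    return r, w_read

# Toy usage with random (untrained) parameters.
rng = np.random.default_rng(0)
n_e, m, s = 8, 6, 6
memory = rng.normal(size=(n_e, m))
state = rng.normal(size=s)
w_prev = np.full(n_e, 1.0 / n_e)
W_g = rng.normal(size=(n_e, s))
v = rng.normal(size=m + s)
r, w_read = read_memory(memory, state, w_prev, W_g, v)
```

Because the gate is applied per memory unit, each entry of the resulting read weight stays in [0, 1], although the weights need not sum exactly to one.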
Write. There are two operations during a write to the external memory: erase and add. The erase operation
controls how much old information is removed from the memory, and the add operation controls the
addition of new information. Formally,

$$M_t^E(i) = M_{t-1}^E(i)\,(1 - \theta(i) \cdot \mu_t^e) + \theta(i) \cdot \mu_t^a \tag{16}$$

where the first term is the information that remains after erasure by the vector $\mu_t^e \in \mathbb{R}^m$, and
the second is the new information added by the vector $\mu_t^a \in \mathbb{R}^m$. The scalar
$\theta(i) = w_t^r(i)$ is the read weight on memory unit $i$, as defined in Eq. 13.
Both the erase vector and the add vector are conditioned on the state of the controller $S_t$, as
follows:

$$\mu_t^e = \sigma(W^e S_t) \tag{17}$$

$$\mu_t^a = \sigma(W^a S_t) \tag{18}$$
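A minimal NumPy sketch of the erase-and-add write in Eqs. 16-18 is given below. The function name `write_memory` and the toy shapes are assumptions; the example writes with a one-hot read weight so that the effect on a single memory unit is easy to see.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def write_memory(memory_prev, state, w_read, W_e, W_a):
    """memory_prev: (n_e, m); state: (s,); w_read: (n_e,); W_e, W_a: (m, s)."""
    erase = sigmoid(W_e @ state)              # Eq. 17, erase vector in (0, 1)^m
    add = sigmoid(W_a @ state)                # Eq. 18, add vector in (0, 1)^m
    theta = w_read[:, None]                   # theta(i) = w_t^r(i), shape (n_e, 1)
    # Eq. 16: erase part of the old content, then add new content, both scaled by theta(i).
    return memory_prev * (1.0 - theta * erase) + theta * add

# Toy usage: only memory unit 2 is addressed, so only that row changes.
rng = np.random.default_rng(0)
n_e, m = 8, 6
memory = rng.normal(size=(n_e, m))
state = rng.normal(size=m)
W_e = rng.normal(size=(m, m))
W_a = rng.normal(size=(m, m))
w_read = np.zeros(n_e)
w_read[2] = 1.0
new_memory = write_memory(memory, state, w_read, W_e, W_a)
```

Units with a read weight of zero are left untouched, which is why reusing the read weight as the write address keeps the update localized.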

3.6 Dialogue Act Prediction


As illustrated in Figure 5, our memory-augmented network predicts a dialogue act as follows: first,
the dialogue act type is predicted via $P_t^{dat}$; second, each slot is associated with a binary classifier
($P_t^{m,i}$) that decides whether the $i$-th slot should be included in the final dialogue act; third, if a
slot $i$ is selected, the value of the slot is predicted by $P_t^i$. The final dialogue act is assembled from
these predictions.

Predicting dialogue act type: this classifier outputs a distribution over dialogue act types such
as Inform, Request, and Recommendation. It is implemented by an MLP conditioned on the controller
state and all memory units:

$$P_t^{dat}(dat \mid S_t, M_t^E, M_t^V) = \mathrm{MLP}([S_t; M_t^E(1); \ldots; M_t^E(n_e); M_t^V(1); \ldots; M_t^V(n_s)]) \tag{19}$$

where $dat$ is one of the dialogue act types.
Predicting a slot: a slot mask controls which slots are included in the final dialogue
act. There is a binary classifier for each slot $i$, conditioned on the controller state $S_t$, the external
memory $M_t^E$, and the corresponding value memory unit $M_t^V(i)$:

$$P_t^{m,i}(z \mid S_t) = \mathrm{MLP}([M_t^E(1); \ldots; M_t^E(n_e)]) \tag{20}$$


Fig. 5. Dialogue act prediction of MAD: $DAT_t$ is the dialogue act type of the system response at turn $t$.
$Mask_t$ is the mask for slot-value pairs at turn $t$, and the color of each mask block indicates its value,
with white indicating 1 and black indicating 0. $v_t^i$ represents the value of slot $i$. The predictions of
$Mask_t^i$ and $v_t^i$ are both based on $M_t^E(i)$.

where $z \in \{0, 1\}$, and $z = 1$ indicates that slot $i$ should be included in the next dialogue act.
Predicting the value of a slot: once we know which slots should be included in the dialogue act,
we need to decide which value of each slot should be mentioned. This is given by a classifier which
estimates a probability distribution over all the values of a slot:

$$P_t^i(v_j^i \mid S_t) = \mathrm{MLP}([M^S(i), M_t^V(i)]) \tag{21}$$

where $v_j^i$ ranges over all the values of slot $i$.
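The three-step assembly described in this section can be sketched as follows, assuming the three classifiers have already produced their distributions. The function name `assemble_dialogue_act`, the dictionary layout, and the example probabilities are hypothetical; choosing a slot when its inclusion probability exceeds 0.5 is equivalent to taking the more likely of $z = 0$ and $z = 1$.

```python
def assemble_dialogue_act(type_probs, mask_probs, value_probs):
    """Combine the three predictions into one dialogue act.

    type_probs:  {act_type: probability}
    mask_probs:  {slot: probability that z = 1}
    value_probs: {slot: {value: probability}}
    """
    # Step 1: most likely dialogue act type.
    act_type = max(type_probs, key=type_probs.get)
    # Steps 2 and 3: keep slots whose mask is on, and pick their most likely value.
    slots = {}
    for slot, p_include in mask_probs.items():
        if p_include > 0.5:
            dist = value_probs[slot]
            slots[slot] = max(dist, key=dist.get)
    return act_type, slots

# Hypothetical classifier outputs for one turn.
type_probs = {"Inform": 0.7, "Request": 0.2, "Recommendation": 0.1}
mask_probs = {"Cuisine": 0.9, "Location": 0.2}
value_probs = {
    "Cuisine": {"Chinese": 0.8, "Italian": 0.2},
    "Location": {"Paris": 0.6, "London": 0.4},
}
act = assemble_dialogue_act(type_probs, mask_probs, value_probs)
# act == ("Inform", {"Cuisine": "Chinese"})
```

The Location slot is dropped even though it has a most likely value, because its mask is off; this is the filtering role the mask plays in the final dialogue act.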

3.7 Loss Function


We adopt cross entropy as our objective function. There are three terms in the function, corresponding
to the prediction of dialogue act types ($\mathcal{L}_{dat}$), slot-value pairs ($\mathcal{L}_v$), and slot
mask ($\mathcal{L}_m$), as presented in the previous section.
The loss function is defined as follows:

$$\mathcal{L} = \mathcal{L}_{dat} + \gamma \sum_i \mathcal{L}_m(i) + \lambda \sum_i \mathcal{L}_v(i) \tag{22}$$

where

$$\mathcal{L}_{dat} = -\sum_t \sum_{k=1}^{n_{dat}} [\hat{P}_t^{dat}(dat_k) \ln P_t^{dat}(dat_k)] \tag{23}$$

$$\mathcal{L}_m(i) = -\sum_t \sum_{z \in \{0,1\}} [\hat{P}_t^{m,i}(z) \ln P_t^{m,i}(z)] \tag{24}$$

$$\mathcal{L}_v(i) = -\sum_t \sum_{k=1}^{n_i} [\hat{P}_t^i(v_k^i) \ln P_t^i(v_k^i)] \tag{25}$$

where $n_{dat}$ is the number of dialogue act types, $n_i$ is the number of values for slot $i$, $\hat{P}_t^*$ are
the gold distributions obtained from the training data, and $P_t^*$ are defined in the preceding subsection.
$\lambda$ and $\gamma$ are hyper-parameters.
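As a worked illustration of Eqs. 22-25, the sketch below computes the loss for a single turn in plain Python; summing it over turns gives the full objective. The function names, the small `eps` guard against `log(0)`, and the example distributions are assumptions.

```python
import math

def cross_entropy(gold, pred, eps=1e-12):
    """-sum_k gold_k * ln(pred_k) for two distributions of equal length."""
    return -sum(g * math.log(p + eps) for g, p in zip(gold, pred))

def turn_loss(dat, masks, values, gamma, lam):
    """Eq. 22 for one turn.

    dat:    (gold, predicted) distributions over dialogue act types
    masks:  list of (gold, predicted) distributions over z in {0, 1}, one per slot
    values: list of (gold, predicted) distributions over slot values, one per slot
    """
    l_dat = cross_entropy(*dat)                          # Eq. 23
    l_m = sum(cross_entropy(g, p) for g, p in masks)     # Eq. 24, summed over slots
    l_v = sum(cross_entropy(g, p) for g, p in values)    # Eq. 25, summed over slots
    return l_dat + gamma * l_m + lam * l_v

# Hypothetical one-slot example with one-hot gold labels.
dat = ([1.0, 0.0, 0.0], [0.5, 0.25, 0.25])
masks = [([0.0, 1.0], [0.1, 0.9])]
values = [([0.0, 1.0, 0.0], [0.2, 0.7, 0.1])]
loss = turn_loss(dat, masks, values, gamma=1.0, lam=1.0)
```

With one-hot gold distributions, each term reduces to the negative log-probability that the model assigns to the correct class.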


Furthermore, we found that performance improves when weak heuristic supervision is applied to the
intermediate variables, and the supervision signal can be easily obtained by simple string matching
rules. This is a common practice for training sophisticated neural networks [13, 18]. More specifically,
we apply extra supervision on the update gate of the value memory (see Eq. 8) and the attention
weights of an utterance (see Eq. 10). This intermediate supervision is applied with a two-stage
training scheme: first, we pretrain our model with only the heuristic loss ($\mathcal{L}_h$, see below) for
several epochs, and then we train the model further with the loss ($\mathcal{L}$) defined by Eq. 22 for the
remaining epochs.
The heuristic supervision loss is defined as follows:

$$\mathcal{L}_h = -\sum_t \sum_i \sum_{j=1}^{n_{x,t}} [\hat{\alpha}_{i,j}^t \ln \alpha_{i,j}^t] - \sum_t \sum_i [\hat{\beta}_t^i \ln \beta_t^i + (1 - \hat{\beta}_t^i) \ln(1 - \beta_t^i)] \tag{26}$$

where $n_{x,t}$ is the number of words in $x_t$ at turn $t$ and $i$ is the slot index.
Note that $\hat{\alpha}_{i,j}^t$ and $\hat{\beta}_t^i$ represent the gold distributions of the attention and update
weights, respectively. For each word $w_j$ of utterance $x_t$, if $w_j$ appears in the values of slot $i$, then
$\hat{\alpha}_{i,j}^t = 1$ and $\hat{\beta}_t^i = 1$; otherwise, $\hat{\alpha}_{i,j}^t = 0$ and $\hat{\beta}_t^i = 0$.
This means that if a value of a slot appears in the utterance, the value (also the word) should be attended
to with respect to that slot, and the update weight should be equal to 1. In this way, the value memory of
the corresponding slot can be updated accordingly.
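The string matching rule for the gold labels can be sketched as below. This is a simplified illustration that assumes single-token, lowercased slot values; the function name `heuristic_labels` and the toy ontology are hypothetical. Following the rule in the text, every matching word gets a label of 1, and no normalization is applied.

```python
def heuristic_labels(words, slot_values):
    """Build gold attention (alpha_hat) and update (beta_hat) labels for one utterance.

    words:       tokens of the user utterance
    slot_values: {slot: set of lowercased single-token values}
    """
    alpha_hat, beta_hat = {}, {}
    for slot, values in slot_values.items():
        # alpha_hat[slot][j] is 1 if word j is a known value of this slot, else 0.
        hits = [1.0 if w.lower() in values else 0.0 for w in words]
        alpha_hat[slot] = hits
        # beta_hat[slot] is 1 if any word of the utterance matched this slot.
        beta_hat[slot] = 1.0 if any(hits) else 0.0
    return alpha_hat, beta_hat

# Hypothetical two-slot ontology.
slot_values = {"Cuisine": {"chinese", "italian"}, "Location": {"paris", "london"}}
alpha_hat, beta_hat = heuristic_labels(["I", "want", "Chinese", "food"], slot_values)
# alpha_hat["Cuisine"] == [0.0, 0.0, 1.0, 0.0]; beta_hat == {"Cuisine": 1.0, "Location": 0.0}
```

Multi-word values (for example asian oriental) would need phrase matching rather than token lookup, which this sketch does not attempt.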

4 EXPERIMENT
4.1 Data Preparation
We first evaluated our memory-augmented dialogue management model on two datasets adapted from
the dialog bAbI dataset [3] and the Second Dialogue State Tracking Challenge dataset [10], which were
originally proposed for end-to-end dialogue systems and the dialogue state tracking task, respectively.
However, both of these datasets are small-scale. To better assess the performance of our proposed
model on large-scale data, we collected a new Chinese dialogue management dataset consisting of
real conversations from the flight booking domain.

4.1.1 DMBD: Dialogue Management bAbI Dataset. The original dialogue bAbI dataset (DBD) is
designed to evaluate the performance of end-to-end dialogue systems on the task of restaurant
reservation. In [3], the task is formulated as a machine comprehension task by applying the
MEMN2N [32] model, treating the dialogue context and the last user utterance as the story and the
question respectively, and the system response is selected from a fixed answer set. The DBD dataset
is composed of five manually constructed subtasks that examine system performance on different
tasks: issuing API calls, updating API calls, displaying options, providing extra information, and
full dialogue, where the full dialogue is a combination of the first four tasks. The data for these tasks
were collected through a simulator based on an underlying knowledge base along with some manually
crafted natural language patterns, and we can use the simulator rules to perform dialogue act
annotation. For more details of DBD, please refer to [3].
Since the dialogue act types and slot-value pairs are not annotated in DBD, we have to annotate them
ourselves to train our model. Fortunately, we can easily annotate the system response utterances
because the original data is generated with an underlying knowledge base and some simple natural
language patterns. We thus reverse-engineered the data, conducting automatic annotation with
manually crafted rules that use the knowledge base of DBD to label the dialogue act type and
slot-value pairs for each utterance. This processed dataset for dialogue management is termed the
Dialogue Management bAbI Dataset (DMBD) in the following sections.

| Informable slot | #Values | Requestable slot |
|-----------------|---------|------------------|
| Cuisine         | 10      | Address          |
| Location        | 10      | Telephone        |
| Price           | 3       |                  |
| Size            | 4       |                  |

Table 2. Ontologies of the DMBD dataset. An informable slot means that the user can provide values for the
slot to constrain a query to the KB, while a requestable slot can only be queried from the KB without any
user-provided value.

In DMBD, the original user and system utterances are retained as the input of each
dialogue turn, while the output is changed from the system utterance to its dialogue act, as detailed
in Table 1. The resulting DMBD dataset has fifteen dialogue act types, four informable slots, and
two requestable slots, as seen in Table 2. An informable slot means that the user can provide values
for the slot to constrain a query to the KB, while a requestable slot can only be queried from the KB
without any user-provided value. Note that DMBD shares the same KB with DBD. As the requestable
slots are only used for issuing API calls, in our implementation we design a special informable slot
called Ask Slot, which tracks the slots that are to be queried. The values of Ask Slot are the names
of the requestable slots.
4.1.2 DM-DSTC: Dialogue Management of the Second Dialogue State Tracking Challenge dataset.
The dialogues in the above DMBD are collected via a simulator which employs hand-crafted
templates, and are thus more or less synthetic. In order to evaluate the performance of our model
on a real-world dialogue corpus, we conducted another experiment based on DSTC2, a real-world
dialogue dataset that is also about the task of restaurant reservation.
The original DSTC2 dataset is for dialogue state tracking, in which the output at each turn is the
filled slots and their values as presented by the user so far. The dialogue act of the system utterance
is also annotated and is thus directly used as the model output. We thus transform the original
DSTC2 dataset to our setting for dialogue management, referred to as DM-DSTC. The ontologies of
dialogue act types and slots in the original dataset are directly reused in DM-DSTC.
The resulting DM-DSTC is composed of four informable and nine requestable slots. The average
number of values per informable slot is 54, which is much higher than that of DMBD; this added
complexity reflects the characteristics of real-world data, which is more stochastic and noisy. We
also created a special slot for requestable slots in this experiment, as we did in the DMBD
experiment. Some statistics of DM-DSTC are shown in Table 3.

| Informable slot | #Values | Requestable slots   |
|-----------------|---------|---------------------|
| Food            | 91      | Addr, Area, Food    |
| Pricerange      | 3       | Phone, Pricerange   |
| Res_name        | 113     | Postcode, Signature |
| Area            | 5       | Res_name            |

Table 3. Ontology of the DM-DSTC dataset. Res_name indicates the restaurant name. The average number of
values per informable slot is 54, which is much higher than that of the DMBD dataset. The added complexity
of DM-DSTC reflects the characteristics of real-world dialogue data.

4.1.3 ALDM: Alibaba Dialogue Management Dataset. Since the sizes of the above two datasets are
limited, we propose ALDM to test our model's performance on a large-scale dataset. ALDM
is a Chinese dataset consisting of real conversations from the flight-booking domain, in which
the system is supposed to acquire the departure city, arrival city, and departure date from
the user to book a flight ticket. To better fit our model, the departure date values in the corpus are
preprocessed into a uniform MM.DD format, e.g., 12.25 for December 25. ALDM is much larger than
the other two datasets, with 15,330 sessions for training, 7,665 for validation, and 3,832
for test. On average, there are 5 turns in a session. The average sentence length is 4, and in particular,
most of the user responses have only one word, as users only provide the departure or arrival city
or the departure date. One difference from the other two datasets is that the departure city slot
and the arrival city slot share the same value list, which adds difficulty because the model must
identify which slot the city name in the user utterance should fill. To handle this
issue, the model should be able to fill slots conditioned on the dialogue context. For example, if
the user responds with Beijing to the last system response Where are you flying from?, the value
Beijing should be filled into the departure city slot. Another difference is that there are no
requestable slots, because ALDM is system-driven.
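The two points above (the MM.DD date format and filling a shared city value according to the dialogue context) can be illustrated with a small hand-written sketch. It is only a rule-based stand-in for behaviour that the model is expected to learn; the function names, the `ASKED_SLOT` mapping, and the year in the usage example are hypothetical.

```python
from datetime import date

def to_mm_dd(d):
    """Format a date as the MM.DD string used for the Date slot, e.g. 12.25."""
    return f"{d.month:02d}.{d.day:02d}"

# Hypothetical mapping from the last system dialogue act to the slot it asked about.
ASKED_SLOT = {"ask_dep_loc": "Dep_city", "ask_arr_loc": "Arr_city"}

def fill_city(state, last_system_act, city):
    """Fill the city into the slot the system just asked for; otherwise leave the state as is."""
    slot = ASKED_SLOT.get(last_system_act)
    if slot is not None:
        state = {**state, slot: city}
    return state

# "Beijing" answers "Where are you flying from?", so it fills the departure city.
state = fill_city({}, "ask_dep_loc", "Beijing")
date_value = to_mm_dd(date(2018, 12, 25))
```

The same city string would fill Arr_city if the preceding system act were ask_arr_loc, which is exactly the ambiguity that a context-free tagger cannot resolve.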

| DA types     | Informable slot | #Values |
|--------------|-----------------|---------|
| ask_dep_loc  | Dep_city        | 174     |
| ask_arr_loc  | Arr_city        | 174     |
| ask_dep_date | Date            | 100     |
| offer, end   |                 |         |

Table 4. Ontology of the ALDM dataset. An ask_ DA type means the system is asking the user for information,
offer means the system is giving a recommendation, and end means the dialogue session is done. Dep_city and
Arr_city represent the departure city and arrival city slots respectively, and they share the same value list.
The value of the Date slot is transformed into a uniform MM.DD format.

As shown in Table 4, ALDM is composed of 3 informable slots, and the average number of values per
slot is 150, which is remarkably larger than in the above two datasets. There are 5 dialogue act
types, also shown in Table 4.

4.2 Experimental Setup


Our model is implemented with TensorFlow [1]. The word embeddings used for each dataset were
pretrained on its own dialogue corpus using the GloVe algorithm [23]; there are 15,000 sessions in
DMBD (3,000 per task), 2,118 sessions in DM-DSTC, and 26,827 sessions in ALDM.
The dimensions of word embeddings, memory column vectors, and state vectors were all set to 128,
and there are 8 columns in the external memory. We first pretrain our model with the heuristic loss
$\mathcal{L}_h$ (see Eq. 26) for 2 epochs and then continue to train it using $\mathcal{L}$ in Eq. 22.
The parameters $\gamma$ and $\lambda$ in $\mathcal{L}$ are not constant during training. More specifically, in
the first 7 epochs, $\lambda$ increases linearly from 0 to 1 while $\gamma$ remains zero, and in the following 7
epochs $\gamma$ also rises from 0 to 1 linearly with $\lambda$ unchanged. The reason for this setting is that the
process of the value update in the slot-value memory has a strong influence on the training of other
components. All the other parameters are initialized with a random uniform distribution N (0, 1).
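The schedule for $\lambda$ and $\gamma$ can be written as a small function. This is a sketch under assumptions: the weights are updated once per epoch, epochs are counted from 0 at the start of the second training stage, and each ramp reaches 1 exactly 7 epochs after it starts. The function name `loss_weights` is hypothetical.

```python
def loss_weights(epoch, ramp=7):
    """Return (gamma, lam) for a 0-based epoch of the second training stage.

    lam rises linearly from 0 to 1 over the first `ramp` epochs while gamma stays 0;
    gamma then rises linearly from 0 to 1 over the next `ramp` epochs while lam stays 1.
    """
    lam = min(1.0, epoch / ramp)
    gamma = min(1.0, max(0.0, (epoch - ramp) / ramp))
    return gamma, lam

# Print the weights at the start, at the hand-over point, and at the end of both ramps.
for epoch in (0, 7, 14):
    print(epoch, loss_weights(epoch))
```

Ramping $\lambda$ first means the slot-value loss is weighted fully before the mask loss starts to contribute, in line with the stated motivation that the value update strongly influences the other components.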


We used the train/valid/test partition of the original DBD for each task, where there are 1,000
sessions in each set; the partition of DM-DSTC is 1,412/353/353. For ALDM, we split the dataset
into 15,330/7,665/3,832.
We trained our model using Adam [14] with a learning rate of 0.002 and momentum parameters
β1 = 0.9 and β2 = 0.999. For each dataset, the model is trained for at most
15 epochs. We use the model parameters with the lowest validation loss for testing.

4.3 Baseline
We included two types of baselines in the evaluation. The first type selects a sentence as the answer
from a predefined candidate answer set in a machine comprehension manner, as described in [3].
The second type predicts a structured dialogue act, the same as our model, where the models
need to make predictions over all combinations of dialogue act type and slot-value pairs.
In the baselines of the first type, each candidate answer sentence is a natural language utterance
which lexicalizes⁴ an underlying dialogue act. However, the candidate answer set is not complete:
not all possible combinations of dialogue act type and slot-value pairs are included. In other
words, the answer space of the first type is smaller than that of the second type, so the
first setting is easier than the second one.
The baselines of the first type, which select an utterance from a predefined candidate answer set
[3], are listed as follows:
• TF-IDF: A TF-IDF matching algorithm [26] which computes a cosine similarity score between
the input (the whole dialogue history) and a candidate sentence; the sentence with the
highest score is selected as the final answer. Both the input and the candidate sentence are
represented by the average of bag-of-word vectors.
• TF-IDF (+ type): An enhanced version of TF-IDF that introduces additional entity type
features.
• Supervised Ebd: An information retrieval model based on trainable word embeddings. The
similarity score between an input and a candidate sentence is the inner product of their
averaged word embeddings. The model is trained with a margin ranking loss [2].
• MEMN2N: Standard end-to-end memory networks [3, 32]. It stores the dialogue history
in a memory network and chooses a response by running multi-hop reasoning
over the history.
• MEMN2N (+ match): A variant of MEMN2N which includes additional features about entity
types.
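To make the retrieval-style setting of the first baseline type concrete, the following is a minimal pure-Python sketch of TF-IDF candidate selection by cosine similarity. It is not the implementation used in [3]; the function names, the `+ 1.0` IDF smoothing, and the toy candidate set are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return sparse TF-IDF vectors for tokenized docs, plus the IDF table."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}   # smoothing choice is an assumption
    vectors = [{w: c * idf[w] for w, c in Counter(d).items()} for d in docs]
    return vectors, idf

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_response(history, candidates):
    """Return the index of the candidate most similar to the dialogue history."""
    cand_vecs, idf = tfidf_vectors(candidates)
    # Words never seen in the candidate set get zero weight.
    query = {w: c * idf.get(w, 0.0) for w, c in Counter(history).items()}
    scores = [cosine(query, vec) for vec in cand_vecs]
    return max(range(len(candidates)), key=scores.__getitem__)

# Toy candidate set; only the first candidate shares a word with the history.
candidates = [
    ["what", "cuisine", "would", "you", "like"],
    ["where", "should", "it", "be"],
    ["here", "it", "is"],
]
best = select_response(["i", "want", "italian", "cuisine"], candidates)
```

Because the answer is chosen from a fixed list, the method can only return responses that already exist in the candidate set, which is the restriction on the answer space discussed above.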
The baselines of the second type, which predict a structured dialogue act in the same way as our
proposed model, are as follows:
• MEM: A memory network model which predicts a dialogue act. For each output structure (DA
type, slot-value, and mask), a separate MEMN2N makes the prediction.
• RNN: A recurrent neural network model with turn-level input and output. The dialogue act
predictions (type and slot-value) are based on the hidden state $S_t$ at each time step $t$.
• MAD - SM: A variant of our proposed model without the slot-value memory. Those predictions
involving the slot-value memory are modified to use only the memory controller
state $S_t$.
• MAD - Attn: A variant of our model without the slot-level attention mechanism. In this
setting, the averaged word embeddings of an utterance are used to update the slot-value
memory.

4 Lexicalizing a dialogue act means converting the act from formal semantic representation to a natural language utterance.


| Metrics            | 1 Issuing API calls | 2 Updating API calls | 3 Displaying options | 4 Providing options | 5 Full dialogs |
|--------------------|---------------------|----------------------|----------------------|---------------------|----------------|
| TF-IDF (no type)   | 5.6 (0)             | 3.4 (0)              | 8.0 (0)              | 9.5 (0)             | 4.6 (0)        |
| TF-IDF (+ type)    | 22.4 (0)            | 16.4 (0)             | 8.0 (0)              | 17.8 (0)            | 8.1 (0)        |
| Nearest Neighbor   | 55.1 (0)            | 68.3 (0)             | 58.8 (0)             | 28.6 (0)            | 57.1 (0)       |
| Supervised Ebd     | 100 (100)           | 68.4 (0)             | 64.9 (0)             | 57.2 (0)            | 75.4 (0)       |
| MEMN2N (no match)  | 99.9 (99.6)         | 100 (100)            | 74.9 (2.0)           | 59.5 (3.0)          | 96.1 (49.4)    |
| MEMN2N (+ match)   | 100 (100)           | 98.3 (83.9)          | 74.9 (0.0)           | 100 (100)           | 93.4 (19.7)    |
| MEM                | 47.4 (0.1)          | 61.1 (0.1)           | 24.6 (0.1)           | 56.7 (0.8)          | 25.2 (0.1)     |
| RNN                | 80.6 (0.1)          | 45.5 (0.0)           | 30.0 (0.0)           | 57.2 (0.0)          | 3.7 (0.0)      |
| MAD                | 99.0 (94.2)         | 100 (100)            | 99.1 (90.6)          | 100 (100)           | 99.9 (97.8)    |

Table 5. The accuracy across all tasks and methods. The numbers in brackets are the accuracy at the session
level, and numbers without brackets are at the turn level. A session is correct only if all the sentences in the
session are predicted correctly.

• MAD - EM: A variant of our model without the external memory. The predictions involving
the external memory are modified to use only the memory controller state $S_t$, just as in
MAD-SM.

It should be noted that the MEMN2N and MEM baselines take a context-question pair as input
at each round, which means they have to run their computation over the accumulated dialogue
context at each turn. Thus, as the dialogue context grows, there is an exponential increase in the
computational complexity. In our model, by contrast, the context information is stored in the memory
network, and the computation time in each turn is basically the same.

4.4 Performance on DMBD


In this section, we evaluated the performance of our model and the baselines on the DMBD dataset.
The prediction accuracy at both the turn level and the session level is reported, similar to [3].
Based on the distributions defined in Section 3.6, our model chooses the output with the maximal
probability for each of DA type, slot-value, and mask. Note that for DA type
and mask, the prediction is judged as correct only if the output matches the target. As mentioned
in Section 3.1, mask is an auxiliary variable that helps filter the undesired slot-value pairs in a
predicted dialogue act. Thus, for the prediction of slot-value, we only need to correctly predict those
slot-value pairs whose mask value is 1. Finally, the overall dialogue act is correct only if its DA
type, slot-value, and mask are all correctly predicted, and a dialogue session is correct only if all
the dialogue acts in the session are correctly predicted. We term this session-level evaluation.
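The two evaluation levels can be stated compactly in code. The sketch below assumes each dialogue act is represented as a (DA type, masked slot-value dictionary) pair that can be compared for equality; the function name `accuracies` and the toy sessions are hypothetical.

```python
def accuracies(sessions):
    """Compute (turn-level accuracy, session-level accuracy).

    sessions: list of sessions; each session is a list of (predicted, gold) dialogue acts.
    """
    # A turn is correct when the whole predicted dialogue act equals the gold one.
    turns = [pred == gold for session in sessions for pred, gold in session]
    turn_acc = sum(turns) / len(turns)
    # A session is correct only if every turn in it is correct.
    correct_sessions = sum(all(pred == gold for pred, gold in s) for s in sessions)
    session_acc = correct_sessions / len(sessions)
    return turn_acc, session_acc

# Two toy sessions of two turns each; the second session has one wrong slot value.
inform_cn = ("Inform", {"Cuisine": "Chinese"})
inform_it = ("Inform", {"Cuisine": "Italian"})
request = ("Request", {})
sessions = [
    [(request, request), (inform_cn, inform_cn)],
    [(request, request), (inform_it, inform_cn)],
]
turn_acc, session_acc = accuracies(sessions)
# turn_acc == 0.75, session_acc == 0.5
```

A single wrong turn makes its whole session count as wrong, which is why session-level numbers in the tables can be far lower than turn-level ones.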
4.4.1 Overall Performance Analysis. We first evaluated our proposed model based on the overall
accuracy of dialogue act prediction, as shown in Table 5. The results of the baselines of the first type
are reproduced from the original paper [3]; because the partitions of training/validation/test data are
the same as ours, the results are directly comparable. Both turn-level and session-level
results on all five tasks are reported. We have the following observations:
• MAD obtains the best performance on most of the tasks, with accuracy close to 100% in both
turn-level and session-level evaluation, which shows the effectiveness of our
proposed model. On Task 1, MAD is in second place: Supervised Ebd and
MEMN2N (+ match) obtain 100% accuracy at both the turn and session level,
which is 1% higher than our turn-level accuracy. MAD's shortfall on Task 1 can be attributed to
an implicit rule in Task 1: if the user does not provide enough values to form a query, the agent
requests the values of slots in a fixed order (Cuisine → Location → Size → Price). However,
this ordering rule is not essential for a practical application, where the agent can request values
in an arbitrary order as long as it obtains all necessary values.

4.4.2 Fine-grained Performance Analysis. To better understand how the slot-value memory and
the external memory influence the performance, we further analyzed the fine-grained prediction
accuracy of MAD and its variants in addition to the overall dialogue act prediction. Evaluation on
the fine-grained predictions is shown in Table 6. We have the following observations:
• The variant MAD-SM, which ablates the slot-value memory module, obtains
degraded overall accuracy compared to MAD. MAD-Attn, which removes
the slot-level attention mechanism, works worse than MAD but at least as well as
MAD-SM on each task. The performance of MAD-EM drops even more than that of MAD-SM on all
tasks except for Task 1. The RNN model, which can be regarded as MAD without the slot-value
memory and the external memory, performs even worse on most of the 5 tasks.
• The fine-grained results demonstrate the effectiveness of our proposed model more specifi-
cally. Here we can see that the accuracy of MAD on both slot-value and mask is 100%, while
the prediction on DA type has very few errors. The high accuracy of slot-value prediction
indicates that the slot addressing and the attentive question representation work well, which
is attributed to the slot-value memory and attention supervision we applied. The contribu-
tion of the external memory is also shown by the high performance of DA type and mask
prediction.
• The slot-value memory leads to significant improvements in slot-value accuracy. In our model,
the role of the slot-value memory is to extract semantic information about slots during the
dialogue, so the ability to track slot-value information should decrease if the slot-value
memory is removed. As shown in Table 6, the prediction accuracy of MAD-SM on slot-value
drops sharply from 100% to around 30%. However, the performance on dialogue act type and
mask prediction is not heavily affected, and the accuracy is still above 90%.
• The slot-level attention mechanism we applied on semantic information extraction influences
the performance remarkably. In MAD-Attn, the slot-level attention mechanism is removed,
and the value update is based on averaged word embeddings of user utterance. Intuitively, the
update of the slot-value memory is not able to concentrate on relevant words without attention
mechanism, thus the performance of slot-value prediction must be heavily influenced. The
experiment results also support our hypothesis, where the accuracy of slot-value prediction
degrades remarkably, but is still better than that of MAD-SM since MAD-Attn retains the
slot-value memory. The attention mechanism affects dialogue act type and mask prediction
very slightly.
• The external memory significantly improves DA type and mask accuracy
by enhancing the representation capacity of the original RNN state. In MAD-EM, the external
memory is removed, and those predictions involving the external memory, that is, the
prediction of DA type and mask, are changed to use the memory controller state, which is
identical to the hidden state in an RNN model. Compared to MAD, the accuracy of MAD-EM
on DA type and slot-value prediction decreases heavily. This gap is attributed to the
representation capacity added by the external memory, which lets the model better capture
longer-term temporal dependencies in the dialogue.


From the above analysis, we can see that the effect of the slot-value memory is mainly on
predicting slot-value, while the effect of the external memory is on predicting dialogue act type
and mask. However, the influence of the modules on the performance is more complex. We can
see from Table 6 that DA type and mask accuracy also decrease if the slot-value memory is
removed, and so does slot-value accuracy when we remove the external memory. This means the two
memory networks in our model are coupled through the memory controller and can affect
each other's performance.

| Metric     | Model    | Task 1      | Task 2      | Task 3      | Task 4     | Task 5      |
|------------|----------|-------------|-------------|-------------|------------|-------------|
| DA type    | MAD-SM   | 93.9 (65.9) | 100 (100)   | 95.6 (58.2) | 100 (100)  | 90.9 (11.9) |
| DA type    | MAD-EM   | 95.8 (80.5) | 65.7 (3.5)  | 56.3 (5.8)  | 100 (100)  | 17.8 (0)    |
| DA type    | MAD-Attn | 99.5 (96.9) | 100 (100)   | 99.0 (90.3) | 100 (100)  | 99.9 (98.6) |
| DA type    | MAD      | 99.0 (94.2) | 100 (100)   | 99.1 (90.6) | 100 (100)  | 99.9 (97.8) |
| slot-value | MAD-SM   | 21.1 (0.3)  | 22.3 (0)    | 18.4 (0)    | 40.3 (0.1) | 20.9 (0)    |
| slot-value | MAD-EM   | 100 (100)   | 95.3 (65.8) | 27.5 (0.1)  | 100 (100)  | 22.6 (0)    |
| slot-value | MAD-Attn | 26.8 (0.5)  | 24.8 (0)    | 27.5 (0)    | 41.3 (0.1) | 31.4 (0)    |
| slot-value | MAD      | 100 (100)   | 100 (100)   | 100 (100)   | 100 (100)  | 100 (100)   |
| mask       | MAD-SM   | 1.0 (1.0)   | 100 (100)   | 99.9 (99.9) | 100 (100)  | 98.8 (6)    |
| mask       | MAD-EM   | 99.1 (88.8) | 87.8 (2.6)  | 87.8 (16.4) | 100 (100)  | 66.8 (0)    |
| mask       | MAD-Attn | 100 (100)   | 100 (100)   | 100 (100)   | 100 (100)  | 100 (100)   |
| mask       | MAD      | 100 (100)   | 100 (100)   | 100 (100)   | 100 (100)  | 100 (100)   |
| Overall    | MAD-SM   | 77.2 (0.2)  | 78.9 (0)    | 70.7 (0)    | 57.3 (0.1) | 59.6 (0)    |
| Overall    | MAD-EM   | 95.2 (78.2) | 57.4 (0.2)  | 40.5 (0.0)  | 1.0 (1.0)  | 3.1 (0.0)   |
| Overall    | MAD-Attn | 82.7 (0.5)  | 79.0 (0)    | 73.9 (0)    | 57.3 (0.1) | 67.7 (0)    |
| Overall    | MAD      | 99.0 (94.2) | 100 (100)   | 99.1 (90.6) | 100 (100)  | 99.9 (97.8) |

Table 6. Fine-grained performance on the DMBD dataset. We tested the performance of our proposed model
and three of its variations on both turn and session level, where for each model the dialogue act type, slot-
value, mask and overall prediction accuracy on each task is reported. The highest accuracy on turn level
which is lower than 100% is in bold font.

4.5 Performance on DM-DSTC


Although our proposed model obtains good results on DMBD, it should be noted that the performance
reflected by the above results is somewhat optimistic for two reasons. First, these dialogues
are generated by rules, which makes them much simpler than real dialogue data. Second, the number
of slots and values in DMBD is quite small, while in real applications the number may become very
large.
To assess the performance of our proposed model on real dialogue data, we conducted another
experiment on DM-DSTC. Unlike DMBD, the DM-DSTC dataset contains only one task.
We only report the results of the methods which predict a dialogue act as output. It should be
pointed out that in this dataset, many values in the dialogue act annotation do not appear exactly
in user utterances (such as asian oriental); for those values we cannot provide precise attention
supervision, which affects the performance of slot-level attention. Moreover, the Res_name slot
in this dataset degrades the accuracy because its value does not appear in the dialogue context
and is instead queried from a knowledge base conditioned on previous search constraints, which is not
consistent with our model setting. We report the fine-grained and overall accuracy at the turn
level and session level, as shown in Table 7.


| Metrics  | DA type     | slot-value | mask       | All        |
|----------|-------------|------------|------------|------------|
| MEM      | 62.5 (9.9)  | 14.2 (0.0) | 71.0 (0.1) | 0 (0.0)    |
| RNN      | 50.9 (0.3)  | 14.3 (0.1) | 61.8 (0.3) | 0.1 (0.0)  |
| MAD-SM   | 64.1 (13.6) | 11.6 (0.1) | 81.6 (0.4) | 17.1 (0.1) |
| MAD-Attn | 64.6 (12.5) | 18.5 (0.1) | 80.8 (1.0) | 16.9 (0.0) |
| MAD-EM   | 44.9 (2.3)  | 17.5 (0.1) | 69.7 (0)   | 5.7 (0.0)  |
| MAD      | 63.8 (11.0) | 27.3 (0.1) | 82.1 (1.3) | 18.8 (0)   |

Table 7. Fine-grained and overall accuracy on the DM-DSTC dataset. The numbers in brackets are the accuracy
at the session level, and the numbers without brackets are at the turn level.

The results in Table 7 demonstrate that our model remains competitive with the vanilla memory network
model. Compared to MEM and RNN, our proposed method obtains higher accuracy on turn-level
overall prediction, as well as on dialogue act type and mask prediction. Although MEM's accuracy
on DA type, slot-value, and mask prediction is only somewhat lower than ours, its overall turn-level
accuracy is far lower than that of our proposed model. This can be attributed to the framework of MEM,
where DA type, mask, and slot-value prediction are trained separately, while in our model these
three tasks are trained jointly. For the variants of MAD, the experimental results are consistent with
what we observed on DMBD. MAD-SM obtains lower accuracy on slot-value prediction compared to
MAD, while maintaining similar accuracy on DA type and mask. For MAD-Attn, the result is similar
to MAD-SM when compared to MAD, but its accuracy on slot-value prediction is clearly higher
than that of MAD-SM since it retains the slot-value memory network. MAD-EM, which removes
the external memory, obtains significantly lower accuracy on the prediction of dialogue act type
and mask, and its accuracy on slot-value prediction is also reduced.
We can see that slot-value prediction is the bottleneck for improving overall
accuracy. That can be attributed to a feature of the DM-DSTC data, where many slot values
do not appear precisely in the user utterance, which makes it hard to acquire accurate attention
supervision; the model's capacity to extract semantic features from the user utterance is thus
negatively affected. For the prediction of DA type and mask, although the result is far better than
that of slot-value, the accuracy is still not as high as on DMBD. This can be attributed to the
characteristics of real-world data, which contains much more uncertainty and noise
than DMBD. More specifically, in different sessions, the DA type of the agent response varies greatly
even when the dialogue context is the same. Moreover, the agent response in the original DSTC2
dataset is conditioned on the knowledge base query result, which is not provided, and this also
restricts our model's ability to predict DA type and mask.

4.6 Performance on ALDM

Metrics     DA type        Slot-value       Mask             All
MEM         64.9 (1.4)     73.5 (0.0)       100.0 (100.0)    0.0 (0.0)
RNN         60.0 (0.0)     80.0 (0.0)       100.0 (100.0)    40.0 (0.0)
MAD-SM      60.3 (0.0)     80.0 (0.0)       100.0 (100.0)    40.3 (0.0)
MAD-Attn    76.4 (15.7)    100.0 (100.0)    100.0 (100.0)    76.4 (17.1)
MAD-EM      76.4 (15.4)    98.6 (92.8)      100.0 (100.0)    74.9 (14.2)
MAD         76.7 (16.3)    100.0 (100.0)    100.0 (100.0)    76.7 (16.3)

Table 8. Fine-grained and overall accuracy on the ALDM dataset. Numbers in brackets are accuracy at
the session level, and numbers without brackets are accuracy at the turn level.

ACM Transactions on Information Systems, Vol. 1, No. 1, Article 11. Publication date: April 2018.

We report the results of the methods that can output a structured dialogue act, as we did in
Section 4.5. Mask prediction is relatively simple for ALDM, in which most of the slot values appear
only in the last system response, and thus all the models reach an accuracy of 100%. Therefore,
the following analysis focuses on the DA type and slot-value.
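The gap between the turn-level and session-level numbers in Table 8 is easier to read with the two metrics written out. The following is a minimal sketch, not the authors' evaluation script. It assumes that a session counts as correct only when every one of its turns is correct, and that the "All" column requires DA type, slot-value, and mask to be correct at the same time. The field names and the toy data are hypothetical.

```python
from typing import Dict, List

# One turn records whether each sub-prediction matched the reference.
Turn = Dict[str, bool]

# Hypothetical field names for the three fine-grained predictions.
FIELDS = ("da_type", "slot_value", "mask")


def turn_correct(turn: Turn, field: str) -> bool:
    """A turn is correct for 'all' only if every sub-prediction is correct."""
    if field == "all":
        return all(turn[f] for f in FIELDS)
    return turn[field]


def turn_level_accuracy(sessions: List[List[Turn]], field: str) -> float:
    """Fraction of correct turns, pooled over all sessions."""
    turns = [t for s in sessions for t in s]
    return sum(turn_correct(t, field) for t in turns) / len(turns)


def session_level_accuracy(sessions: List[List[Turn]], field: str) -> float:
    """Fraction of sessions whose turns are all correct (assumed definition)."""
    correct = sum(all(turn_correct(t, field) for t in s) for s in sessions)
    return correct / len(sessions)


# Toy data for illustration only; not taken from any dataset in the paper.
sessions = [
    [
        {"da_type": True, "slot_value": True, "mask": True},
        {"da_type": False, "slot_value": True, "mask": True},
    ],
    [
        {"da_type": True, "slot_value": True, "mask": True},
    ],
]

print(turn_level_accuracy(sessions, "da_type"))
print(session_level_accuracy(sessions, "da_type"))
```

Under this definition a single wrong turn invalidates its whole session, which is one way a model can score well at the turn level and still score 0.0 at the session level.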
A difference between ALDM and the other two datasets is that ALDM is more system-driven,
which makes it hard for our model to correctly predict the order of the ask_ DA types. For instance,
ask_dep_Loc depends only on the currently filled slots. If the departure location is provided by the
user, the system can ask for either the arrival location or the departure date in the next turn, which
makes the next DA type difficult to predict. Thus the DA type accuracy is not as good as that on
DMBD. However, when N − 1 slots are already filled (N is the total number of slots needed to
complete a booking task), the next slot to be asked is determined. Thus, the dialogue state still has
an impact on DA type prediction, as shown by the results of MAD-SM and RNN, the two models
without the slot-value memory.
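The ambiguity described above can be illustrated with a small sketch. It assumes a hypothetical three-slot booking task and treats every unfilled slot as an equally valid next question, which is a simplification of how a real system orders its questions. The slot names are illustrative and are not taken from the ALDM schema.

```python
from typing import List, Set

# Hypothetical slot inventory for a booking task (N = 3).
SLOTS = ["dep_loc", "arr_loc", "dep_date"]


def candidate_asks(all_slots: List[str], filled: Set[str]) -> List[str]:
    """Return the ask_<slot> acts that are still plausible next moves."""
    return [f"ask_{slot}" for slot in all_slots if slot not in filled]


def is_determined(all_slots: List[str], filled: Set[str]) -> bool:
    """The next ask_ act is fixed only when exactly one slot remains unfilled."""
    return len(candidate_asks(all_slots, filled)) == 1


# One slot filled: two questions are equally valid, so the target is ambiguous.
print(candidate_asks(SLOTS, {"dep_loc"}))
print(is_determined(SLOTS, {"dep_loc"}))

# N - 1 slots filled: only one question remains.
print(candidate_asks(SLOTS, {"dep_loc", "arr_loc"}))
print(is_determined(SLOTS, {"dep_loc", "arr_loc"}))
```

In the ambiguous case no model can recover the order chosen in a particular session from the filled slots alone, whereas in the determined case knowing which slots are filled is sufficient.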
Although the average number of slot values in ALDM is much larger than that in the other
two datasets, we still obtain high slot-value accuracy. This can be attributed to the high data quality
of ALDM, which was carefully cleaned before training. When the slot-value memory is removed (RNN
and MAD-SM), the slot-value accuracy decreases remarkably, which shows the ability of the
slot-value memory to maintain dialogue states. As can be seen from Table 8, the slot-value
accuracy of our full model is the same as that of MAD-Attn. This is because user responses in the
ALDM dataset are mainly one-word sentences, so the attention mechanism makes no difference.

Metrics     Departure-City    Arrive-City
MEM         2.7               4.1
RNN         0.2               0.1
MAD-SM      0.5               0.3
MAD-Attn    100.0             100.0
MAD-EM      96.5              96.2
MAD         100.0             100.0

Table 9. Prediction accuracy on the departure city and arrive city slots. The number in brackets is the
accuracy at the session level, and that without brackets is at the turn level.

To verify the model's ability to combine context information in slot filling, we further analyzed
the prediction accuracy on the Departure_City slot and the Arrive_City slot. As described in Section
4.1.3, they share the same value list. The ability to assign a value to the correct slot is mainly
controlled by the update gate $\beta_t^i$ as defined in Section 3.3. The slot-value memory dominates
the prediction of the next slot values, which can be seen from the results of MAD-SM, RNN, and
MEM in Table 9. The results drop dramatically when the slot-value memory is removed (RNN and
MAD-SM). For MEM, although its accuracy is higher than that of RNN and MAD-SM, it is still
much lower than that of our proposed model. This is because 1) the number of cities is too large for
MEM to predict, and 2) MEM fails to identify which slot a value belongs to.

4.7 Parameter Tuning


Fig. 6. Fine-grained prediction accuracy on DMBD with different $n_e$ (the number of column vectors in the
external memory). The optimal number is 8.

Fig. 7. Accuracy change on DMBD with different dimensions of the column vectors in the external memory.
The optimal number is 128.

Generally speaking, the performance of neural network models is highly correlated with the number
of parameters. There are many important hyper-parameters in our model, including the dimensions
of the slot-value memory and external memory, and the number of column vectors in the external
memory. We evaluated the influence of these hyper-parameters on performance. The following
experiments were performed on the DM-DSTC dataset.
First, we studied how the performance is influenced by the number of column vectors in the
external memory, $n_e$. The number $n_e$ varies from 3 to 9, with a step size of 1. We studied the
accuracy change on dialogue act type, slot-value, and mask, as shown in Figure 6. For predicting
dialogue act type and mask, the optimal $n_e$ is 8, and the accuracy at this setting is significantly
better than at the others. For predicting slot-values, although the optimal $n_e$ is 4 with an accuracy
of 0.331, the accuracy is almost the same (from 0.321 to 0.331) when varying $n_e$ from 4 to 8.
Fig. 8. Attention visualization. For each slot, the attention weights (in a row) are a distribution over the words
of an utterance. For the utterance "can you book a table with british cuisine for six people in madrid in an
expensive price range", the predicted slot-value pairs are <cuisine, british>, <number, six>,
<location, madrid>, and <price, expensive>.

Second, we studied the influence of the dimension of the column vectors, as shown in Figure 7. The
dimension in our experiment ranges from 32 to 256 with a step size of 32. The accuracies of
dialogue act type and mask are highly correlated, and both reach their best value at a dimension
of 128, whereas the best slot-value accuracy is obtained at a dimension of 64.
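The two sweeps described in this section follow the same pattern: vary one hyper-parameter over a grid and record, for each metric separately, which value scores best. The sketch below shows that pattern under stated assumptions. The grids match the ranges given in the text, but `toy_evaluate` is a placeholder scoring function invented for illustration; it does not train a model and its outputs are not experimental results.

```python
from typing import Callable, Dict, Iterable

Metrics = Dict[str, float]

# Grids as described in the text.
N_E_GRID = range(3, 10)        # number of column vectors: 3..9, step 1
DIM_GRID = range(32, 257, 32)  # column-vector dimension: 32..256, step 32


def sweep(values: Iterable[int], evaluate: Callable[[int], Metrics]) -> Dict[str, int]:
    """Return, for each metric, the hyper-parameter value with the highest score."""
    results = {v: evaluate(v) for v in values}
    metric_names = list(next(iter(results.values())).keys())
    return {m: max(results, key=lambda v: results[v][m]) for m in metric_names}


def toy_evaluate(n_e: int) -> Metrics:
    """Placeholder scores with arbitrary peaks; NOT real accuracies."""
    return {
        "da_type": float(-abs(n_e - 5)),
        "slot_value": float(-abs(n_e - 7)),
    }


best = sweep(N_E_GRID, toy_evaluate)
print(best)
```

Reporting the best value per metric, rather than a single global optimum, matches the observation above that different predictions can prefer different settings.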

4.8 Visualization Analysis


Figure 8 illustrates an example of the slot-level attention mechanism. For each slot, the model
generates a distribution over the words of an utterance. Each row is thus a probability distribution
over words, where the largest probability corresponds to the word that should be attended to most.
For the utterance "can you book a table with British cuisine for six people in Madrid in an expensive
price range", the most attended word for slot Cuisine is British, for slot Price it is expensive, and
for slot Number it is six. Note that the weight of <Rating, british> is also large, which is
intuitively wrong because Rating information has not yet been mentioned. However, this kind of
wrong attention weight does not influence model performance, because the inclusion of a slot-value
pair in the predicted dialogue act is decided by two distributions: the value distribution and the
slot mask distribution for a slot, as mentioned in Section 3.6. The effect of faulty attention is
filtered out by the mask when deciding which slots are to be addressed in the final dialogue act.
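The interaction between per-slot attention and the slot mask can be sketched as follows. This is an illustration under simplifying assumptions, not the model's actual computation: the most attended word stands in for the value distribution, the mask is reduced to one probability per slot, and a slot is kept when that probability exceeds 0.5. All scores and mask values are invented for the example.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw attention scores into a probability distribution over words."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def most_attended(words: List[str], scores: List[float]) -> str:
    """Return the word that receives the largest attention weight."""
    weights = softmax(scores)
    return words[max(range(len(words)), key=weights.__getitem__)]


def filter_by_mask(candidates: Dict[str, str], mask: Dict[str, float],
                   threshold: float = 0.5) -> Dict[str, str]:
    """Keep a slot-value pair only if the slot's mask probability is high enough."""
    return {s: v for s, v in candidates.items() if mask.get(s, 0.0) > threshold}


# Invented attention scores: the Rating row attends to "british" by mistake.
words = ["british", "cuisine", "for", "six"]
scores = {
    "cuisine": [3.0, 1.0, 0.0, 0.5],
    "rating": [2.0, 0.5, 0.0, 0.1],
    "number": [0.1, 0.0, 0.2, 3.0],
}
mask = {"cuisine": 0.9, "rating": 0.1, "number": 0.8}  # hypothetical mask output

candidates = {slot: most_attended(words, s) for slot, s in scores.items()}
final = filter_by_mask(candidates, mask)
print(candidates)
print(final)
```

The faulty Rating attention produces a candidate pair, but the low mask probability keeps it out of the final dialogue act.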
Figure 9 illustrates the change of the dialogue state and the predicted next dialogue act in an
exemplar dialogue session. We visualized the values stored in the slot-value memory and showed
the next dialogue act type predicted by the model. At each turn, the model computes an update
gate $\beta_t^i$ (Eq. 8) for each slot $i$. If a certain value of slot $i$ appears in user utterance
$x_t$, $\beta_t^i$ increases, and the color of the corresponding cell becomes darker. The darkness of
a cell represents the value of $\beta_t^i \in [0, 1]$, which is calculated independently for each slot
$i$ at each turn $t$. The value in each cell is computed by Eq. 21, and we only output the value for
slot $i$ if $\beta_\tau^i > 0.5$ for some turn $\tau$. These values compose a search constraint at each
turn. In the exemplar dialogue session, each value in a user utterance is captured by the attention
mechanism, and the values are filled into $M^V$ with large $\beta_t^i$.
For instance, when the user asks "can you book a table in a cheap price range in london?", the price
slot is filled with the value cheap, and the location slot is filled with the value london. The
model predicts the next dialogue act ask_cuisine, which prompts the user for a cuisine preference.
When the user supplies new information with the utterance "with french food", the cuisine slot is
filled with the value french. In this state, the model predicts the next dialogue act ask_people,
which asks the user how many people are involved. As the dialogue proceeds, the slot-value memory
explicitly tracks the dialogue state, and the next dialogue act is predicted according to that
state.
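The rule for composing the search constraint can be sketched in a few lines. This is an illustration rather than the model's implementation. It reads "for some turn" as "for some turn up to and including the current one", takes the per-turn slot values as given instead of computing them with Eq. 21, and uses invented gate values that loosely follow the example dialogue.

```python
from typing import Dict, List


def search_constraints(gates: List[Dict[str, float]],
                       values: List[Dict[str, str]],
                       threshold: float = 0.5) -> List[Dict[str, str]]:
    """For each turn, output the slots whose update gate has exceeded the threshold so far."""
    activated = set()
    constraints = []
    for gate_t, value_t in zip(gates, values):
        # A slot becomes active the first time its gate exceeds the threshold.
        activated |= {slot for slot, beta in gate_t.items() if beta > threshold}
        constraints.append({s: value_t[s] for s in sorted(activated) if s in value_t})
    return constraints


# Invented gate values per turn; the cuisine value at turn 1 is a low-confidence guess.
gates = [
    {"price": 0.9, "location": 0.8, "cuisine": 0.1},  # "... cheap price range in london?"
    {"price": 0.1, "location": 0.1, "cuisine": 0.9},  # "with french food"
]
values = [
    {"price": "cheap", "location": "london", "cuisine": "italian"},
    {"price": "cheap", "location": "london", "cuisine": "french"},
]

for turn, constraint in enumerate(search_constraints(gates, values), start=1):
    print(turn, constraint)
```

Slots that were activated in an earlier turn stay in the constraint even when their gate is low later, which corresponds to the memory retaining a value the user has already provided.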


Fig. 9. An example of DA prediction for a dialogue session. x represents the user utterance and y the system
response. The values of slots at each turn are predicted by Eq. 21. The color darkness of each cell represents
the value of $\beta_t^i$ defined in Eq. 8. Darker colors indicate larger values.

5 CONCLUSION
In this paper, we present a memory-augmented dialogue management model for capturing long-range
dialogue semantics by explicitly memorizing and updating the dialogue act types and slot-value
pairs during interactions in task-oriented dialogue systems. The model employs two memory
modules, namely the slot-value memory and external memory, to address the history semantics
during the entire dialogue session. The slot-value memory tracks the dialogue state by memorizing
and updating the values of semantic slots, and the external memory augments the single state
representation of RNN by storing more context information. We also propose a slot-level attention
mechanism for attentive read of a user utterance to update the slot-value memory. The attention
mechanism helps to extract the slot-related information that is addressed in a user utterance.
Through the attention mechanism and the memory modules, our proposed model can better
interpret the dialogue context in a more observable and explainable way, which also helps to predict
the next dialogue act given the current dialogue state. Results show that our model is better than the
state-of-the-art baselines, and moreover, the model can offer more observable dialogue semantics by
presenting predicted slot-value pairs at each dialogue turn. We believe that research on interactive
IR may benefit from our work, particularly from the idea of enhancing the interpretability of
dialogue management.

REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis,
Jeffrey Dean, Matthieu Devin, et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed
systems. arXiv preprint arXiv:1603.04467 (2016).
[2] Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Olivier Chapelle, and
Kilian Weinberger. 2009. Supervised semantic indexing. In Proceedings of the 18th ACM conference on Information and
knowledge management. ACM, 187–196.
[3] Antoine Bordes and Jason Weston. 2016. Learning End-to-End Goal-Oriented Dialog. arXiv preprint arXiv:1605.07683
(2016).
[4] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation.
arXiv preprint arXiv:1406.1078 (2014).
[5] Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam, and Jason
Weston. 2015. Evaluating prerequisite qualities for learning end-to-end dialog systems. arXiv preprint arXiv:1511.06931
(2015).

ACM Transactions on Information Systems, Vol. 1, No. 1, Article 11. Publication date: April 2018.
11:24 Z. Zhang et al.

[6] Wendong Ge and Bo Xu. 2015. Dialogue Management based on Sentence Clustering. In Proceedings of the 53rd Annual
Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language
Processing (Volume 2: Short Papers). Association for Computational Linguistics, Beijing, China, 800–805.
[7] Wendong Ge and Bo Xu. 2015. Dialogue Management based on Sentence Clustering.. In ACL (2). 800–805.
[8] David Goddeau, Helen Meng, Joseph Polifroni, Stephanie Seneff, and Senis Busayapongchai. 1996. A form-based dia-
logue manager for spoken language applications. In Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International
Conference on, Vol. 2. IEEE, 701–704.
[9] Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural turing machines. arXiv preprint arXiv:1410.5401 (2014).
[10] Matthew Henderson, Blaise Thomson, and Jason Williams. 2014. The second dialog state tracking challenge. In 15th
Annual Meeting of the Special Interest Group on Discourse and Dialogue, Vol. 263.
[11] Matthew Henderson, Blaise Thomson, and Steve Young. 2014. Word-based dialog state tracking with recurrent neural
networks. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL).
292–299.
[12] Ben Hixon, Peter Clark, and Hannaneh Hajishirzi. 2015. Learning knowledge graphs for question answering through
conversational dialog. In Proceedings of the 2015 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies. 851–861.
[13] Chloé Kiddon, Luke Zettlemoyer, and Yejin Choi. 2016. Globally coherent text generation with neural checklist models.
In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 329–339.
[14] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
(2014).
[15] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain
Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In
International Conference on Machine Learning. 1378–1387.
[16] Esther Levin, Roberto Pieraccini, and Wieland Eckert. 1998. Using Markov decision process for learning dialogue
strategies. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on,
Vol. 1. IEEE, 201–204.
[17] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016. Deep reinforcement learning
for dialogue generation. arXiv preprint arXiv:1606.01541 (2016).
[18] Lemao Liu, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. 2016. Neural Machine Translation with Supervised
Attention. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical
Papers. The COLING 2016 Organizing Committee, Osaka, Japan, 3093–3102.
[19] Michael F McTear. 1998. Modelling spoken dialogues with state transition diagrams: experiences with the CSLU toolkit.
development 5, 7 (1998).
[20] Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value
memory networks for directly reading documents. arXiv preprint arXiv:1606.03126 (2016).
[21] Nikola Mrkšić, Diarmuid O Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2016. Neural belief tracker:
Data-driven dialogue state tracking. arXiv preprint arXiv:1606.03777 (2016).
[22] Tim Paek and David Maxwell Chickering. 2005. The markov assumption in spoken dialogue management. In 6th
SIGDIAL Workshop on Discourse and Dialogue.
[23] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.
In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543. http://www.aclweb.org/anthology/D14-1162
[24] Julien Perez and Fei Liu. 2017. Dialog state tracking, a machine reading approach using Memory Network. In Proceedings
of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers.
Association for Computational Linguistics, Valencia, Spain, 305–314.
[25] Amrita Saha, Vardaan Pahuja, Mitesh M Khapra, Karthik Sankaranarayanan, and Sarath Chandar. 2018. Complex
Sequential Question Answering: Towards Learning to Converse Over Linked Question Answer Pairs with a Knowledge
Graph. arXiv preprint arXiv:1801.10314 (2018).
[26] Gerard Salton and Michael J McGill. 1986. Introduction to modern information retrieval. (1986).
[27] Jost Schatzmann, Karl Weilhammer, Matt Stuttle, and Steve Young. 2006. A survey of statistical user simulation
techniques for reinforcement-learning of dialogue management strategies. The knowledge engineering review 21, 02
(2006), 97–126.
[28] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and Joelle Pineau. 2016. Building End-To-
End Dialogue Systems Using Generative Hierarchical Neural Network Models.. In AAAI. 3776–3784.
[29] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua
Bengio. 2016. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. arXiv preprint
arXiv:1605.06069 (2016).

ACM Transactions on Information Systems, Vol. 1, No. 1, Article 11. Publication date: April 2018.
Memory-augmented Dialogue Management for Task-oriented Dialogue Systems 11:25

[30] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C Courville, and Yoshua
Bengio. 2017. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues.. In AAAI. 3295–3301.
[31] Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. arXiv
preprint arXiv:1503.02364 (2015).
[32] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in neural
information processing systems. 2440–2448.
[33] Mingxuan Wang, Zhengdong Lu, Hang Li, and Qun Liu. 2016. Memory-enhanced Decoder for Neural Machine
Translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for
Computational Linguistics, Austin, Texas, 278–286.
[34] Joseph Weizenbaum. 1966. ELIZAâĂŤa computer program for the study of natural language communication between
man and machine. Commun. ACM 9, 1 (1966), 36–45.
[35] Tsung-Hsien Wen, Milica Gasic, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically
Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems. In Proceedings of the 2015
Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon,
Portugal, 1711–1721.
[36] Tsung-Hsien Wen, Yishu Miao, Phil Blunsom, and Steve Young. 2017. Latent Intention Dialogue Models. arXiv preprint
arXiv:1705.10229 (2017).
[37] Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. arXiv preprint arXiv:1410.3916 (2014).
[38] Jason D Williams, Kavosh Asadi, and Geoffrey Zweig. 2017. Hybrid Code Networks: practical and efficient end-to-end
dialog control with supervised and reinforcement learning. arXiv preprint arXiv:1702.03274 (2017).
[39] Jason D Williams and Steve Young. 2007. Partially observable Markov decision processes for spoken dialog systems.
Computer Speech & Language 21, 2 (2007), 393–422.
[40] Steve Young, Jost Schatzmann, Karl Weilhammer, and Hui Ye. 2007. The hidden information state approach to dialog
management. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, Vol. 4.
IEEE, IV–149.
[41] Ingrid Zukerman and David W Albrecht. 2001. Predictive statistical models for user modeling. User Modeling and
User-Adapted Interaction 11, 1-2 (2001), 5–18.

Received February 2018
