
Task-Oriented Dialogue with In-Context Learning

Tom Bocklisch Thomas Werkmeister Daksh Varshneya Alan Nichol∗


Rasa

Abstract

We describe a system for building task-oriented dialogue systems combining the in-context learning abilities of large language models (LLMs) with the deterministic execution of business logic. LLMs are used to translate between the surface form of the conversation and a domain-specific language (DSL) which is used to progress the business logic. We compare our approach to the intent-based NLU approach predominantly used in industry today. Our experiments show that developing chatbots with our system requires significantly less effort than established approaches, that these chatbots can successfully navigate complex dialogues which are extremely challenging for NLU-based systems, and that our system has desirable properties for scaling task-oriented dialogue systems to a large number of tasks. We make our implementation available for use and further study.[1]

1 Introduction

The workhorse of industrial task-oriented dialogue systems and assistants is a modular architecture comprising three components: natural language understanding (NLU), dialogue management (DM),[2] and natural language generation (NLG) (Young et al., 2013; Young, 2007).

Utterances spoken or written by end users are translated into dialogue acts, where a dialogue act comprises an intent and a set of entities. For example, an utterance such as "I need a taxi to the station" might be assigned to the intent book_taxi and the entity destination with value "station". This dialogue act representation acts as the interface between the NLU and DM components of the system. The dialogue manager contains the logic to react to a book_taxi intent by initiating a taxi booking task, prompting the end user for the time, pick-up location, etc. These fields are typically called slots. As the dialogue progresses, subsequent user messages are also represented as dialogue acts, such as inform(time=3pm). The dialogue manager reacts to this sequence of inputs by executing actions and responding to the end user, either via a rule-based or a model-based dialogue policy. We refer to this as the intent-based NLU approach, and it is used by the major industry platforms for building chat- and voice-based dialogue systems like Rasa (Bocklisch et al., 2017), Dialogflow,[3] Microsoft Luis,[4] and IBM Watson.[5]

∗ [email protected]
[1] The approach described in this paper is implemented in Rasa under the name CALM: https://rasa.com/docs/rasa-pro/calm/
[2] While in the research literature the dialogue manager is often separated into dialogue state tracking (DST) and dialogue policy components, this distinction is rarely made in industrial applications.
[3] https://cloud.google.com/dialogflow
[4] https://www.luis.ai/
[5] https://www.ibm.com/docs/en/cloud-private/3.1.0?topic=services-watson-assistant

1.1 Limitations of an intent-based NLU approach

A key feature of the intent-based NLU approach is that it poses natural language understanding as a classification task: messages are "understood" by assigning them to a predefined intent. This is a powerful simplifying assumption. In theory, intents provide an interface that fully abstracts the language understanding component from the dialogue manager.

However, working with a fixed list of intents has limitations which become more pronounced as an application matures and scales:
• The taxonomy of intents becomes difficult to remember and reason about when the number of intents reaches several hundred, complicating annotation & feedback loops as well as application debugging.

• Because the dialogue manager is coded to expect specific sequences of intents, making changes to intent definitions and introducing new intents becomes increasingly error-prone, as shifts in classifier outputs introduce regressions.

• Intents are typically defined to map closely to the tasks the assistant can perform, but user utterances often do not correspond directly to a specific task. A developer may create intents like replace_card and block_card, but end users often describe situations in their own terms (e.g. "I lost my wallet"), which could map to a number of different tasks.

• Messages are assigned to the same intent irrespective of context.[6]

Furthermore, salient information is often discarded when translating a surface form into a dialogue act, and it falls on the dialogue manager to reinterpret the output of the NLU module to account for context. For example, the dialogue manager may prompt the user with a yes/no question in order to fill a boolean slot (e.g. "Would you like to proceed?"). When the end user's response is mapped to an affirm or deny intent, the dialogue manager reinterprets this output and sets the boolean slot to true or false respectively. Similarly, the NLU module might detect a generic entity like person or location, which is then mapped to a task-specific slot like transfer_recipient or taxi_destination.

[6] It is possible to overcome this limitation either by adding heuristics to the NLU module, or by training a supervised model which includes conversation history in the input. In practice neither of these approaches has been widely adopted.

1.2 Desiderata

Given the impressive capabilities of recent LLMs on language understanding benchmarks (Huang et al., 2022) and in-context learning (Brown et al., 2020), we investigate whether a superior approach to task-oriented dialogue can be developed by reconsidering the split of responsibilities between the NLU, DM, and NLG components.

In this work we aim to develop a system for building industrial task-oriented dialogue systems with the following attributes:

Fast iteration: The system should allow for rapid prototyping and testing. The delay between making a change (e.g. modifying task logic) and testing it should ideally be measured in seconds.

Short development time: The system should provide general conversational capabilities out of the box, so that developers can focus on implementing their unique business logic.

Concise representation of business logic: Both developers and subject-matter experts with less technical knowledge should be able to create and modify task logic easily.

Reliable execution of business logic: Business logic of arbitrary complexity should be executed reliably, i.e. we should not rely on a language model to remember and follow a set of steps and branching conditions.

Explainable and debuggable: It should be possible to explain why the system responded in a certain way at any given time.

Scalable to a large number of tasks: It is common for AI assistants in industry to support hundreds of tasks. The system needs to be able to identify the correct task out of hundreds of possibilities. Maintaining the system and adding new tasks should not become more complex as the system increases in size.

Model agnostic: Progress in LLMs is rapid and the approach should allow developers to adopt the latest models without having to re-implement their business logic.

2 Related Work

Here we briefly describe some recent streams of research which have developed alternatives to the intent-based NLU paradigm for building task-oriented dialogue.

2.1 End-to-End Learning

The advent of seq2seq models (Sutskever et al., 2014) gave rise to a number of end-to-end approaches, notably (Bordes et al., 2016), who used a synthetic task to study whether an end-to-end model could learn the business logic for a task-oriented dialogue system purely from conversation transcripts grounded in knowledge base queries.
In this stream of work, models are learned end-to-end, but assistant responses are selected from a candidate list rather than being generated. This framing has also been called next-utterance classification. Continuing this line of work are Hybrid Code Networks (Williams et al., 2017), which combine end-to-end learned models with domain-specific software to run parts of the business logic. In a similar vein, dialogue transformers (Vlasov et al., 2019) have been proposed, which use self-attention over the dialogue history to make learned dialogue policies more robust to digressions. (Mosharrof et al., 2023) also learn a dialogue policy by training a seq2seq model which outputs the dialogue state, knowledge base queries, and system actions at every turn to fulfill the goal of the user. The learned policy is shown to generalize well across unseen domains.

The PolyResponse model (Henderson et al., 2019) is a next-utterance classification approach that leverages transfer learning. The model is trained on a large corpus of unlabeled dialogue data and fine-tuned on domain-specific data to build a task-oriented system for a given domain.

However, it has been previously noted that collecting hand-annotated examples for training task-oriented dialogue systems is challenging (Budzianowski et al., 2018), and hence a lot of recent work aims at making the learning more data-efficient. (Peng et al., 2021) leverages transfer learning to make learning end-to-end dialogue models more data-efficient, but where assistant responses are generated (as in a seq2seq model) rather than retrieved.

(Jang et al., 2022) leverage offline reinforcement learning (RL) to clone the behaviour exhibited by human agents in human-human conversations and use it to critique the actions taken by a fine-tuned LLM-based agent at training time. On the other hand, (Hong et al., 2023) leverage task descriptions to generate relevant and diverse synthetic dialogues and use them to learn a dialogue agent via offline RL. This approach matches the performance of an LLM-based assistant using in-context learning (Ouyang et al., 2022), but with more concise and informative responses. However, this approach still requires fine-tuning an LLM, which can be prohibitively time consuming for rapidly iterating on an implementation.

2.2 Alternative Representations of Dialogue

Other approaches have developed new representations and data structures to change how the task of building a dialogue system is formulated.

(Cheng et al., 2020) introduced a hierarchical graph structure which represents the ontology of a task. Dialogue state tracking is then framed as a semantic parsing task over this structure.

(Andreas et al., 2020) represent the dialogue state as a dataflow graph, where the effect of each turn in a dialogue is to modify this graph. They show that using this representation, a generic seq2seq model can match the performance of neural architectures specifically designed for dialogue state tracking. One similarity between our system and the dataflow approach is that both use the output of a model to generate instructions, and then deterministically execute some logic. However, there are significant differences between the two approaches. Our work also uses a graph representation of computational steps, but only to represent the business logic for a specific task, as designed explicitly by a developer. Additionally, the dataflow approach produces computational steps such as the refer operation, which handles anaphora and entity resolution. In our system, the dialogue manager does not participate in language understanding, and coreference resolution is always handled implicitly, by including the conversation transcript in the LLM prompt and generating commands with the arguments already fully resolved. More generally, our approach uses the conversation transcript as a general-purpose representation of conversation state, while we use an explicit state representation only to track progress within the logic of a given task.

2.3 Language Model "Agents"

Recently, researchers have explored using the in-context learning (Brown et al., 2020) abilities of LLMs to have them act fully independently as dialogue systems (Yao et al., 2023), an approach sometimes referred to as LLM "Agents". This line of work assumes that the business logic required to complete a task is not known a priori and must be inferred on-the-fly as a conversation progresses. This approach explores the possibility of creating fully open-ended assistants which can help with an infinite number of tasks, with the caveat that the developer of the assistant does not control the task logic.
Industrial dialogue systems, on the other hand, typically support a known set of tasks whose logic needs to be followed faithfully.

Another approach is to frame the task of dialogue state tracking as a text-to-SQL problem (Hu et al., 2022). By leveraging in-context learning, this work demonstrates a dialogue agent that outperforms previously built fine-tuned agents (Shin et al., 2022; Lee et al., 2021). This approach leverages example conversations, which are retrieved at run time and inserted into an LLM prompt. Finally, (Heck et al., 2023) leverage in-context learning without inserting any conversation examples in the prompt to explore the capability of ChatGPT as a task-oriented dialogue agent. They find the system to be competitive with previously built zero-shot (Hu et al., 2022) and few-shot (Lee et al., 2021; Shin et al., 2022) dialogue agents.

3 Architecture

Our architecture comprises three core elements: Business Logic, Dialogue Understanding, and Conversation Repair.

When an end user sends a message to an assistant, the following takes place:

1. The dialogue understanding module interprets the conversation so far and translates the latest user message into a set of commands.

2. The generated commands are validated and processed by the dialogue manager to update the conversation state.

3. If the user message requires conversation repair, the corresponding repair patterns are added to the conversation state.

4. The dialogue manager executes the relevant business logic deterministically, including any repair patterns, and continues executing actions until additional user input is required.
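For illustration, the following is a minimal Python sketch of one turn of this loop. The object and method names (du.generate_commands, dm.run_until_listen, etc.) are hypothetical stand-ins for the components described in the rest of this section, not the API of the released implementation:

def handle_user_message(state, message, du, dm):
    """One turn of the loop described above (hypothetical API).

    state: conversation state (transcript plus the dialogue stack, section 3.4)
    du:    dialogue understanding component (LLM-backed, section 3.2)
    dm:    dialogue manager executing flows deterministically (section 3.1)
    """
    state.transcript.append(("user", message))
    # 1. Translate the latest message into commands, given the full context.
    commands = du.generate_commands(state.transcript, dm.flows)
    # 2. Validate the commands and apply them to the conversation state.
    for command in dm.validate(commands):
        command.apply(state)
    # 3. Push any triggered conversation repair patterns (section 3.3).
    for pattern in dm.triggered_repair_patterns(state):
        state.stack.push_flow(pattern)
    # 4. Run business logic deterministically until user input is needed.
    for response in dm.run_until_listen(state):
        state.transcript.append(("bot", response))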
3.1 Business Logic

Business Logic describes the steps required to complete a specific task, such as transferring money. Tasks are defined in a declarative format called flows. A minimal flow comprises a description and a list of steps. The steps describe (i) what information is needed from the user (e.g. the amount of money and the recipient), (ii) what information is needed from APIs (e.g. the user's account balance) and (iii) any branching logic based on the information that was collected. The following is an example definition of a minimal transfer_money flow, which collects the recipient and amount from the user before initiating the transfer:

transfer_money:
  description: send money to another account
  steps:
    - collect: recipient
    - collect: amount
    - action: initiate_transfer

This task specification is created by the developer of the assistant. In addition to this core logic, they have to define the data types of the recipient and amount slots, and provide the templated utterances for the steps in the flow.

slots:
  recipient:
    type: text
  amount:
    type: float
responses:
  utter_ask_recipient:
    - text: Who are you sending money to?
  utter_ask_amount:
    - text: How much do you want to send?

These two code snippets are all that is required to implement a task; there is no training data required for language understanding. The money transfer flow given here is a minimal example, but flows can include branching logic, function calls, calls to other flows, and more. A more complex example can be found in appendix A.

Note that the flow definition does not make any reference to the user side of the conversation. Neither dialogue acts nor commands are represented. Business logic only describes the steps required to complete a task. It does not specify how the end user provides that information. While the flow only specifies the "happy path", an assistant with this flow can already handle a large number of conversations, including repair cases like corrections, digressions, interruptions, and cancellations. This is described in section 3.3.

3.2 Dialogue Understanding

In lieu of an NLU module, our system has a Dialogue Understanding module that leverages the in-context learning abilities of LLMs. Dialogue understanding, framed as a command generation problem, improves upon intent-based NLU in key ways:
• While NLU interprets one message in isolation, DU considers the greater context: the whole running transcript of the conversation as well as the assistant's business logic. Flow definitions and conversation state provide additional, valuable context for understanding users. This is especially useful for extracting slot values, which often requires coreference resolution.

• While NLU systems output intents and entities representing the semantics of a message, DU outputs a sequence of commands representing the pragmatics of how the user wants to progress the conversation.[7]

• DU requires no additional annotated data beyond the specification of flows.

• While NLU systems assign a user message to one of a fixed list of intents, DU instead is generative, and produces a sequence of commands according to a domain-specific language and available business logic. This representation can express what users are asking with more nuance than a simple classification.

The following section illustrates these improvements with examples.

3.2.1 Commands as a Domain-Specific Language

The output of the Dialogue Understanding component is a short sequence of commands (typically 1-3) describing how the end user wants to progress the conversation with the assistant. The allowed commands are shown in table 1.

    StartFlow(flow_name)
    CancelFlow
    SetSlot(slot_name, slot_value)
    ChitChat
    KnowledgeAnswer
    HumanHandoff
    Clarify(flow_name_1, flow_name_2)

Table 1: A list of the commands which the dialogue understanding component can produce. Commands exist for starting and cancelling flows, for setting slots, for handling non-task dialogue (e.g. chitchat or content from a knowledge base), for handing over the conversation to a human agent, and for triggering an additional clarification step to handle disambiguation.

The following are some example user messages along with the corresponding command output. In the simplest case, the user expresses a wish that directly corresponds to one of the defined flows:

    User: I want to transfer money
    → StartFlow(transfer_money)

Alternatively, the user may directly provide some of the required information:

    User: I want to transfer $55 to John
    → StartFlow(transfer_money),
      SetSlot(recipient, John),
      SetSlot(amount, 55)

Note that in the NLU-based approach, the values "John" and 55 would typically be extracted as generic person and number entities, but in our system they are directly mapped to task-specific slots by the dialogue understanding component using the information from the flow definitions.

Because the context of the conversation is taken into account, slots are filled correctly even in cases of pragmatic implicature, something that is very difficult to achieve with intent-based NLU:

    Bot:  Are you traveling in economy class?
    User: sadly
    → SetSlot(economy_class, true)

Complex utterances can be represented faithfully as commands, when this would be difficult to achieve with an intent-based NLU approach. For example, a user correcting a previous input and starting another task:

    User: Actually I meant $45. Also what's my balance?
    → SetSlot(amount, 45),
      StartFlow(check_balance)

Section 3.3 describes how corrections, interruptions, and digressions are subsequently handled by our system.

[7] This is a similar idea to (Andreas et al., 2020), where they also use a model to predict the perlocutionary force of an utterance.
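Because the command set is a small, closed DSL, the raw LLM output can be parsed and validated in a few lines. The following is an illustrative sketch (a hypothetical helper, not the released parser); anything that does not name a known command, flow, or slot is simply discarded rather than executed:

import re

_COMMAND = re.compile(r"(\w+)\((.*?)\)|(\w+)")

def parse_commands(llm_output, known_flows, known_slots):
    """Parse output such as 'SetSlot(amount, 45), StartFlow(check_balance)'
    into (name, args) tuples, dropping anything outside the DSL. Slot
    values containing commas or parentheses are out of scope for this
    simplified sketch."""
    commands = []
    for match in _COMMAND.finditer(llm_output):
        name = match.group(1) or match.group(3)
        args = [a.strip() for a in (match.group(2) or "").split(",") if a.strip()]
        if name == "StartFlow" and args and args[0] in known_flows:
            commands.append((name, args))
        elif name == "SetSlot" and len(args) == 2 and args[0] in known_slots:
            commands.append((name, args))
        elif name in ("CancelFlow", "ChitChat", "KnowledgeAnswer",
                      "HumanHandoff") and not args:
            commands.append((name, []))
        elif name == "Clarify" and args and all(a in known_flows for a in args):
            commands.append((name, args))
    return commands

# Example:
# parse_commands("SetSlot(amount, 45), StartFlow(check_balance)",
#                known_flows={"transfer_money", "check_balance"},
#                known_slots={"amount", "recipient"})
# -> [("SetSlot", ["amount", "45"]), ("StartFlow", ["check_balance"])]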
3.3 Conversation Repair

As described in section 3.1, a flow only specifies the steps required to complete a task. It does not represent a graph of possible conversation paths. Conversation repair defines a set of patterns, which are meta flows that describe how the assistant behaves in conversations that deviate from the "happy path" of a flow.

We define the "happy path" as any conversation in which the end user, every time they are prompted for information via a collect step, successfully provides that information, progressing to the next step in the business logic.

In production systems, end users frequently stray from the happy path, for example when they:

• cancel the current interaction

• interrupt the current process to achieve something else before continuing

• request additional information

• insert an aside, e.g. "just one moment"

• correct something they said earlier

• say something that requires further clarification

For these common situations, conversation repair provides patterns that are triggered through either specialized commands (e.g. CancelFlow) or when specific dialogue states are reached. For example, a specific pattern is triggered when a previously interrupted flow is resumed, because the interrupting flow finished or was cancelled. All the patterns have a default implementation that can be overwritten by the developer of an assistant.

The following example requires a clarification step because the developer has created flows for multiple card-related tasks, and the user's opening message does not provide enough information to infer which one they want:

    User: card
    → Clarify(freeze_card, unfreeze_card, cancel_card)
    Bot:  Would you like to freeze or unfreeze your card, or cancel it?
    User: cancel
    → StartFlow(cancel_card)

Implementing this behaviour using intent-based NLU is extremely challenging. Words like 'card' and 'cancel' are not, on their own, indicative of a specific intent. Similarly, end users often start a conversation with very long messages that also require clarification. These types of utterances tend to fit poorly into an intent classifier's taxonomy. Furthermore, the dialogue manager would have to be programmed to handle many such sequences for various task combinations. Our system handles disambiguation out of the box, without any additional effort from the developer.

3.4 The Dialogue Stack

Our system leverages a dialogue stack to process commands and execute business logic.

The dialogue stack is a high-level representation of the conversation state. It maintains a last-in-first-out (LIFO) stack of active flows, as well as the state of each individual flow.

The dialogue stack is also used to provide additional context to the dialogue understanding module, for example which slots the active flow needs to fill, along with their data types and allowed values. This helps the LLM generate the correct commands.

In turn, the commands generated by the LLM are used to manipulate the dialogue stack and conversation state with a specific set of operations. Commands can set slots and push new flows on to the dialogue stack. Beyond that, the commands the LLM generates do not directly manipulate the dialogue stack. They cannot directly remove or modify existing flows on the stack, eliminating the possibility of a malicious user overriding the business logic via prompt injection (Liu et al., 2023).

After all commands are processed, the dialogue manager takes the dialogue stack and executes the topmost flow deterministically. This execution pauses when it reaches a listen step. This way, executing business logic and complex operations on the dialogue stack alike are the domain of deterministic, developer-defined logic and not left to an LLM.
With this approach, our system leverages the powerful language understanding capabilities of LLMs while limiting their ability to intervene directly in the business logic.
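This containment boundary can be made concrete with a minimal sketch (a hypothetical class; the released implementation differs in detail). LLM-generated commands may only push flows and set slots, while removal and execution remain reserved for the deterministic dialogue manager:

class DialogueStack:
    """LIFO stack of active flows. Only a restricted interface is
    exposed to LLM-generated commands; everything else is reserved
    for the deterministic dialogue manager."""

    def __init__(self):
        self._frames = []  # each frame: {"flow": name, "step": int, "slots": dict}

    # Operations available to LLM-generated commands:
    def push_flow(self, flow_name):
        self._frames.append({"flow": flow_name, "step": 0, "slots": {}})

    def set_slot(self, name, value):
        if self._frames:
            self._frames[-1]["slots"][name] = value

    # Operations reserved for the dialogue manager. There is deliberately
    # no command-accessible way to remove or rewrite existing frames, so a
    # prompt-injected command cannot override business logic already on
    # the stack:
    def top(self):
        return self._frames[-1] if self._frames else None

    def pop_finished_flow(self):
        return self._frames.pop()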
3.5 Optional Components

This section describes optional components which extend the core functionality of our system.

3.6 Contextual Rephrasing

The Contextual Response Rephraser is an optional component which can be used to improve the end user experience. It uses an LLM to rephrase the templated response to better account for the conversation's context, improving fluency. This is especially helpful for generic messages. For example, responding to a request that is out of scope:

    User:      I'd like to add my partner to my credit card.
    Template:  I'm sorry, I can't help you with that.
    Rephrased: Unfortunately, I cannot help you add users to your card.

While this can enhance the fluency and naturalness of conversations, there is a possibility that the rephrased response does not preserve the exact meaning of the templated utterance, depending on the choice of LLM, the prompt, and the sampling parameters. It is left to the developer of the assistant to decide if this is an acceptable trade-off.
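As an illustration, rephrasing can be implemented as a single LLM call conditioned on the transcript and the template. The following is a minimal sketch with a hypothetical llm.complete client; it is not the prompt used in the released system:

def rephrase(template, transcript, llm, temperature=0.3):
    """Ask an LLM to adapt a templated response to the conversation.

    A low sampling temperature reduces the risk that the rephrased
    response drifts from the meaning of the template."""
    prompt = (
        "You are rephrasing a templated chatbot response.\n"
        f"Conversation so far:\n{transcript}\n"
        f"Templated response: {template}\n"
        "Rewrite the templated response so it fits the conversation "
        "naturally, keeping its meaning exactly.")
    return llm.complete(prompt, temperature=temperature)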
3.7 Flow Pre-Selection

As the number of tasks in an assistant increases, eventually the information required by the LLM exceeds the length of its context window. For these cases, the Dialogue Understanding component can be configured to pre-select a list of candidate flows to be included for in-context learning. This is achieved by using the latest user utterance to retrieve the k most similar flows, as measured by embedding both the user utterance and the descriptions of the flows. This can potentially introduce a performance drop, as the correct flow is not guaranteed to be among the top k. We find that for sufficiently large k (e.g. k = 20), this error is negligibly small; in our experiments we were able to achieve a Hit@20 score of 100%.
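This retrieval step is ordinary embedding-based nearest-neighbour search. A minimal sketch, assuming a generic embed function that maps a string to a vector (any off-the-shelf sentence embedding model would do):

import numpy as np

def preselect_flows(user_message, flow_descriptions, embed, k=20):
    """Return the names of the k flows whose descriptions are most
    similar to the latest user message, by cosine similarity.

    flow_descriptions: mapping of flow name -> description string
    embed: assumed function mapping a string to a 1-D numpy array
    """
    query = embed(user_message)
    scores = {}
    for name, description in flow_descriptions.items():
        doc = embed(description)  # in practice precomputed and cached
        scores[name] = float(np.dot(query, doc) /
                             (np.linalg.norm(query) * np.linalg.norm(doc)))
    return sorted(scores, key=scores.get, reverse=True)[:k]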
3.8 Information Retrieval

Industrial dialogue systems often combine task-oriented dialogue, built on business logic, with information retrieval. While flows are ideal for multi-step tasks that rely on real-time data fetched from APIs, end users often have questions which can be answered based on static data. In our architecture, this is achieved by having the Dialogue Understanding component generate a KnowledgeAnswer command. The KnowledgeAnswer command is handled as a pattern (a prebuilt meta flow, see section 3.3). This pattern invokes an information retrieval component, which uses the latest user message as a query to a knowledge base, returning a selection of potentially relevant information. The results are then either presented directly to the user, or used as part of an LLM prompt to formulate an answer to the user. The latter approach is frequently called Retrieval-Augmented Generation (Gao et al., 2024). When the pattern triggered by a KnowledgeAnswer command has completed, it is removed from the top of the stack and the conversation proceeds as before.
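The sketch below illustrates how such a pattern might be realized, with retriever and llm as hypothetical stand-ins for the retrieval component and LLM client; it is not the released implementation:

def handle_knowledge_answer(state, retriever, llm):
    """Sketch of the meta flow triggered by a KnowledgeAnswer command."""
    query = state.transcript[-1][1]               # latest user message
    documents = retriever.search(query, top_k=3)  # knowledge base lookup
    # Retrieval-Augmented Generation: ground the answer in the documents.
    answer = llm.complete(
        "Answer the user's question using only these documents:\n"
        f"{documents}\nQuestion: {query}")
    state.transcript.append(("bot", answer))
    state.stack.pop_finished_flow()               # resume the previous flow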
4 Evaluation

Quantitatively evaluating an approach and system for building conversational AI is difficult (Bohus and Rudnicky, 2009). Desirable qualities like ease of use and a quick learning curve can be studied by recruiting participants for a controlled study. Adoption is another signal of success and can attest to the scalability of the solution. In lieu of an absolute evaluation of the effectiveness of our system, we compare it in relative terms to an implementation in Rasa that follows an intent-based NLU approach. Our evaluation compares the effort required to achieve a similar level of functionality in both systems.

4.1 Example Assistant

As an example system we implemented a virtual assistant in English for a travel rewards bank account. The implementations, tests, and instructions for reproducing these experiments are available online.[8] The assistant supports the following tasks:

Transferring money
Adding, listing, and removing known contacts
Showing recent transactions
Ordering a replacement card
Searching for restaurants and hotels
Setting up recurring payments
Verifying an account

[8] https://github.com/RasaHQ/tod-in-context-learning

4.2 Metrics

We evaluate both our system and the intent-based implementation through a suite of 71 test conversations designed to test a variety of conversational abilities. Both implementations were built using these tests as a guide, adopting a test-driven development approach. The tests cover each of the tasks implemented in the assistant and a combination of happy paths and conversations involving repair. The test conversations vary in length with a minimum of 2 turns, a maximum of 19, and an average of 7.8 turns. Note that each test conversation represents a distinct conversation "path", meaning that our tests are not designed to evaluate understanding of variations in phrasing, but rather variations in user behaviour.

The full set of test conversations is available together with the implementation in the github repository, and some example test conversations are shown in appendix B.

As a proxy for "effort", we measure the lines of code and data in each implementation. The implementation using the system described in this paper comprises 14 flows and 47 slots. The baseline implementation comprises 26 intents, 10 entity types, and 41 slots.

Table 2 compares the pass rate of both implementations on our test conversations as well as the lines of code and data in each implementation. For these experiments, we used GPT-4 as the LLM to power the dialogue understanding component, since it performed best in our experiments. A comparison of the performance of different LLMs across languages and use cases is left to future work.

    Implementation            ours    baseline
    Total test pass rate      95.8%   47.9%
    Lines of Code and Data    1169    1713

Table 2: Fraction of passing tests and number of lines of code and data for two implementations, one using the system described in this paper and a baseline system using intent-based NLU.

    Category          # tests   ours   baseline
    happy path             24     24         22
    cancellations           6      6          2
    corrections            13     13          0
    repetitions             2      2          1
    disambiguation          4      3          0
    input validation        5      3          3
    negations               3      3          0
    chitchat                2      2          2
    digressions             9      9          2
    knowledge               3      3          2

Table 3: Number of passing tests in different categories, for ours and NLU-based implementations of similar effort.

5 Discussion

While informative, evaluating our system by comparing two implementations has a number of limitations and should not be overly relied upon as a quantitative guide. For one, a simple metric like "lines of code and data" does not capture the complexity of the two implementations. In our own subjective experience, the cognitive load of working with our system is much lower than with the intent-based NLU approach, and we hypothesise that this would be reflected in a controlled study involving external participants, manifesting for example as a reduced time to complete a given task. Second, our evaluation does not quantify the ability of either system to handle the lexical variation of real-world user input. This would be better evaluated by deploying both implementations side-by-side in an A/B test.

Nonetheless, we see distinctly that our system allows developers to build dialogue systems which can handle a multitude of conversation patterns with a modest amount of effort. Note also that unlike the baseline, the implementation using our system only addresses the happy paths, while corrections, digressions, and more are handled by out-of-the-box conversation repair. The disambiguation cases are worth noting as well, as these are handled automatically by our system, while presenting an extreme challenge for an intent-based NLU approach.
It is worth commenting on the impressive performance of our system on conversations involving corrections, especially in light of previous evidence that LLMs show poor performance on conversation repair (Balaraman et al., 2023). We believe this is because the repair-QA dataset, on which previous studies were based, is a far more challenging task that requires an LLM to produce free-form answers from the world knowledge implicit in its parameters. Handling corrections in our system only requires the LLM to reason over the conversation transcript and produce the appropriate command, with the correct slot value typically present verbatim within the prompt.

6 Future Work

Important avenues for further work in this area include more comprehensive evaluation of the current system, including case studies of production systems. In addition, it would be valuable to study the performance of different LLMs on the dialogue understanding task in various languages, including code-switching and multilingual applications. Also, given that the dialogue understanding task is well defined and requires an LLM only to produce a short sequence of known commands, it would be valuable to investigate whether smaller models can deliver similar performance at lower cost and latency. Finally, it is crucial for production systems that these can be improved on the basis of interactions with real end users.[9] There is a distinct difference in this regard between working with a system based on supervised learning (like intent-based NLU) versus one based on in-context learning. Finding ways (beyond prompt engineering) to incorporate the signal from real user feedback is an active area of investigation.

7 Conclusion

We have introduced and described a system for developing industrial task-oriented dialogue systems, combining the in-context learning abilities of LLMs with the deterministic execution of business logic. We discussed the handling of both "happy path" and conversation repair-type dialogues, and showed how these can be handled effectively with relatively little effort, compared to the traditional approach of intent-based NLU. We have made our system, as well as the example implementations, available for use and further study, and hope that this will facilitate the development of many successful conversational assistants.

[9] We use the term Conversation-Driven Development to describe this process (Nichol, 2022).

Acknowledgments

We would like to thank Oliver Lemon for providing feedback on a draft of this paper. We thank numerous colleagues at Rasa for support with implementing and testing the system.

References

Jacob Andreas, John Bufe, David Burkett, Charles Chen, Josh Clausman, Jean Crawford, Kate Crim, Jordan DeLoach, Leah Dorner, Jason Eisner, Hao Fang, Alan Guo, David Hall, Kristin Hayes, Kellie Hill, Diana Ho, Wendy Iwaszuk, Smriti Jha, Dan Klein, Jayant Krishnamurthy, Theo Lanman, Percy Liang, Christopher H. Lin, Ilya Lintsbakh, Andy McGovern, Aleksandr Nisnevich, Adam Pauls, Dmitrij Petters, Brent Read, Dan Roth, Subhro Roy, Jesse Rusak, Beth Short, Div Slomin, Ben Snyder, Stephon Striplin, Yu Su, Zachary Tellman, Sam Thomson, Andrei Vorobev, Izabela Witoszko, Jason Wolfe, Abby Wray, Yuchen Zhang, and Alexander Zotov. 2020. Task-oriented dialogue as dataflow synthesis. Transactions of the Association for Computational Linguistics, 8:556–571.

Vevake Balaraman, Arash Eshghi, Ioannis Konstas, and Ioannis Papaioannou. 2023. No that's not what I meant: Handling third position repair in conversational question answering.

Tom Bocklisch, Joey Faulkner, Nick Pawlowski, and Alan Nichol. 2017. Rasa: Open source language understanding and dialogue management. arXiv preprint arXiv:1712.05181.

Dan Bohus and Alexander I. Rudnicky. 2009. The RavenClaw dialog management framework: Architecture and systems. Computer Speech & Language, 23(3):332–361.

Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2016. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026, Brussels, Belgium. Association for Computational Linguistics.

Jianpeng Cheng, Devang Agrawal, Héctor Martínez Alonso, Shruti Bhargava, Joris Driesen, Federico Flego, Dain Kaplan, Dimitri Kartsaklis, Lin Li, Dhivya Piraviperumal, Jason D. Williams, Hong Yu, Diarmuid Ó Séaghdha, and Anders Johannsen. 2020. Conversational semantic parsing for dialog state tracking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8107–8117, Online. Association for Computational Linguistics.

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. 2024. Retrieval-augmented generation for large language models: A survey.

Michael Heck, Nurul Lubis, Benjamin Ruppik, Renato Vukovic, Shutong Feng, Christian Geishauser, Hsien-Chin Lin, Carel van Niekerk, and Milica Gašić. 2023. ChatGPT for zero-shot dialogue state tracking: A solution or an opportunity? arXiv preprint arXiv:2306.01386.

Matthew Henderson, Ivan Vulić, Iñigo Casanueva, Paweł Budzianowski, Daniela Gerz, Sam Coope, Georgios Spithourakis, Tsung-Hsien Wen, Nikola Mrkšić, and Pei-Hao Su. 2019. PolyResponse: A rank-based approach to task-oriented dialogue with application in restaurant search and booking.

Joey Hong, Sergey Levine, and Anca Dragan. 2023. Zero-shot goal-directed dialogue via RL on imagined conversations. arXiv preprint arXiv:2311.05584.

Yushi Hu, Chia-Hsuan Lee, Tianbao Xie, Tao Yu, Noah A. Smith, and Mari Ostendorf. 2022. In-context learning for few-shot dialogue state tracking. arXiv preprint arXiv:2203.08568.

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2022. Large language models can self-improve.

Youngsoo Jang, Jongmin Lee, and Kee-Eung Kim. 2022. GPT-critic: Offline reinforcement learning for end-to-end task-oriented dialogue systems. In International Conference on Learning Representations.

Chia-Hsuan Lee, Hao Cheng, and Mari Ostendorf. 2021. Dialogue state tracking with a language model using schema-driven prompting. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4937–4949, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. 2023. Prompt injection attack against LLM-integrated applications.

Adib Mosharrof, MH Maqbool, and AB Siddique. 2023. Zero-shot generalizable end-to-end task-oriented dialog system using context summarization and domain schema. arXiv preprint arXiv:2303.16252.

Alan Nichol. 2022. Conversation-driven development. https://rasa.com/blog/conversation-driven-development. Accessed: 2024-02-15.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Baolin Peng, Chunyuan Li, Jinchao Li, Shahin Shayandeh, Lars Liden, and Jianfeng Gao. 2021. SOLOIST: Building task bots at scale with transfer learning and machine teaching. Transactions of the Association for Computational Linguistics, 9:807–824.

Jamin Shin, Hangyeol Yu, Hyeongdon Moon, Andrea Madotto, and Juneyoung Park. 2022. Dialogue summaries as dialogue states (DS2), template-guided summarization for few-shot dialogue state tracking. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3824–3846, Dublin, Ireland. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27.

Vladimir Vlasov, Johannes E. M. Mosig, and Alan Nichol. 2019. Dialogue transformers. arXiv preprint arXiv:1910.00486.

Jason D. Williams, Kavosh Asadi, and Geoffrey Zweig. 2017. Hybrid code networks: Practical and efficient end-to-end dialog control with supervised and reinforcement learning.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models.

Steve Young. 2007. CUED standard dialogue acts. Report, Cambridge University Engineering Department, 14th October 2007.

Steve Young, Milica Gašić, Blaise Thomson, and Jason D. Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.
A Complex Flow Example
transfer_money:
  description: This flow lets users send money to other accounts.
  name: transfer money
  steps:
    - collect: transfer_money_recipient
      description: the name of a person
    - id: "ask_amount" # we keep this id, because we use it for a jump
      collect: transfer_money_amount_of_money
      description: the amount of money without any currency designation
    - action: check_transfer_funds
      next:
        - if: not slots.transfer_money_has_sufficient_funds
          then:
            - action: utter_transfer_money_insufficient_funds
            - set_slots:
                - transfer_money_amount_of_money: null
                - transfer_money_has_sufficient_funds: null
              next: "ask_amount"
        - else: transfer_money_final_confirmation
    - id: transfer_money_final_confirmation
      collect: transfer_money_final_confirmation
      description: accepts True or False
      ask_before_filling: true
      next:
        - if: not slots.transfer_money_final_confirmation
          then:
            - action: utter_transfer_cancelled
              next: "END"
        - else: execute_transfer
    - id: execute_transfer
      action: execute_transfer
      next:
        - if: slots.transfer_money_transfer_successful
          then:
            - action: utter_transfer_complete
              next: "END"
        - else:
            - action: utter_transfer_failed
              next: "END"

B Example Test Conversations


Here we show some example test conversations.
B.1 Disambiguation

    User: contact
    Bot:  Would you like to add or remove a contact, or see a list of existing contacts?
    User: add
    Bot:  What's the handle of the user you want to add?

B.2 Correction
    User: I want to transfer money
    Bot:  Who do you want to transfer money to?
    User: John
    Bot:  How much money do you want to transfer?
    User: $100
    Bot:  Please confirm: you want to transfer $100 to John?
    User: Ah, scratch that, I meant $110
    Bot:  Ok, I have updated the amount to $110
    Bot:  Please confirm: you want to transfer $110 to John?
    User: yes.
    Bot:  Ok. Your transfer is complete.

B.3 Context Switch

    User: I want to send $50 to Justin
    Bot:  Just to confirm: you want to transfer $50.00 to Justin?
    User: yes, and btw are there any fees associated with the sending?
    Bot:  We will not charge you for this transfer
    Bot:  Your transfer has been completed
