Horvath Final Documentation WS18
Harshita Agarwala
Robin Becker
Mehnoor Fatima
Lucian Riediger
Advisors:
Andrei Belitski
Olena Schüssler
Laure Vuaille
February 2019
Abstract
Contents
1 Introduction
  1.1 Motivation
  1.2 Problem Statement
2 Research
  2.1 Existing NLP-Frameworks
    2.1.1 Parsers
    2.1.2 Machine Learning Models
  2.2 Databases
3 Data Generation
  3.1 Approach
  3.2 Noisy Data
  3.3 Correlated Data
  3.4 Generating Training and Test Data
5 Recommender System
  5.1 Motivation
  5.2 Approach
    5.2.1 Meta-data as context
    5.2.2 Entities as context
    5.2.3 Past interactions as context
  5.3 System integration
6 Dialog Management
  6.1 Motivation
  6.2 RASA Core
    6.2.1 High-Level Architecture
  6.3 Elements of RASA Core
    6.3.1 Domain
    6.3.2 Stories
    6.3.3 Training Dialogue Model using RASA Core
1 Introduction
1.1 Motivation
Artificial conversation entities (ACE) in the form of chatbots or voice assistants
have developed significantly over the last decade. The main reason behind
this is the progress of technologies in the fields of natural language processing,
speech recognition, parsing (i.e. the analysis of natural language) and machine
learning.
Many businesses have already realized the advantages of ACEs, especially in
the fields of customer service, convenience and streamlining processes. Two
very popular applications of ACEs are voice assistants on mobile phones and
for domestic use, which enable the user to execute tasks like playing music,
taking notes, setting reminders and browsing the internet with voice input.
For single commands, ACEs detect the user's input with high precision.
However, a current limitation of many ACEs is handling the context of a
conversation. Detecting whether given input is independent of previous input or
implicitly related to it is crucial when holding a conversation. Yet automating
this is very challenging, especially if the variety of input to be understood is
very broad.
2 Research
Our implementation was divided into three components. The first part was the
Natural Language Understanding. We wanted to create a model that comprehends
the input. For example, for the input "I would like to drink a coffee", the
application should be able to understand that the user wants to place an order
for a coffee. The second part is the dialog management component, wherein
the application should be able to respond to the above request with an output
like: "Yes, your order has been placed". Along with these two components, the
third includes recommendations and history, which is integrated with the second
component.
Natural language understanding (NLU) is a branch of artificial intelligence that
uses computer software to understand input in the form of sentences in text or
speech format. NLU directly enables human-computer interaction. NLU's ability
to comprehend natural human languages enables computers to understand
commands without the formalized syntax of computer languages, and to
communicate back to humans in natural language. NLU is tasked
with communicating with untrained individuals and understanding their intent,
that is, NLU goes beyond understanding words and interprets meaning. NLU
is even programmed with the ability to understand meaning in spite of common
human errors like mispronunciations or transposed letters or words.[1]
• Then semantic and sentiment analyses follow, where words are looked up
in a database (e.g. WordNet) for meaning and possibly sentiment (e.g.
’fat’ → voluminous, negative)
The most common Natural Language Parsers are Stanford NLP, spaCy and
NLTK. The figure below shows a brief comparison between these parsers.
Figure 1: Comparison between Parsers [4]
It can be seen that although spaCy is much faster compared to the others, Stanford
NLP gives better results. Stanford NLP is a Java implementation package, but
there are open source Python wrappers (packages) available that help in running
the same Core-NLP package on Python. Initially we experimented with Stan-
ford’s Open IE annotator and dependency parsers. The Open IE was initially
thought of as a potential candidate in the pipeline of our Natural Language
Processing which could help in splitting complex sentences and identifying mul-
tiple intents. Open information extraction (Open IE) refers to the extraction of
relation tuples, typically binary relations, from plain text. The main difference
from other information extraction is that the schema for these relations does
not need to be specified in advance. The system first splits each sentence into
a set of entailed clauses. Each clause is then maximally shortened, producing a
set of entailed shorter sentence fragments. These fragments are then segmented
into OpenIE triples and output by the system. An OpenIE triple consists
of three parts: subject - relation - object. For example, "Barack Obama was
born in Hawaii" would create the triple (Barack Obama; was born in; Hawaii)
[5]. This annotator, although useful, did not produce comprehensive results for
the case of an ACE. Being a general-purpose annotator that depends heavily
on the grammatical structure of the sentence, it gave ambiguous results for
sentences that were not grammatically correct or well-formed. Encountering such
sentences is highly likely in an ACE, as input takes the form of natural conversation
or chatting. Another important annotator is the Dependency Parser, which is
also used in the final pipeline. It provides a naive external approach to extracting
the main entity from the sentence.
The annotators or functions included in the parser packages are very powerful
in analyzing sentence structure. However, they have been made in such
a way that they work for all scenarios. Depending solely on these annotators
without using machine learning would give very generalized results. Therefore,
these parser packages have been used along with machine learning models in
our application to produce comprehensive results.
2.1.2 Machine Learning Models
It is important to adopt machine learning models so that the application is trainable
for the given purpose. Almost all components of the above mentioned parsers
are trainable. However, there are other commercial as well as open source models
that are easier to train and combine multiple functionalities to
form robust pipelines. We looked into both open source and commercial models
for the purpose of developing chatbots, more specifically "goal-oriented
chatbots".
A bot development framework is a set of predefined functions and classes which
developers use for faster development. It provides a set of tools that help in
writing code better and faster. They can be used by developers and coders
to build bots from scratch using a programming language. All conversational bots
require some fundamental features. They should be able to handle basic input
and output messages. They must have natural language and conversational
skills. A bot must be responsive and scalable. Most importantly, it must
offer the user a conversation experience that is as human as possible.[6]
A few common commercially available models are:
• API.ai/Dialogflow:
API.ai (Dialogflow) is another web-based bot development framework. It
provides a huge set of domains of intents and entities. Some of the SDKs
and libraries that API.ai provides for bot development are Android, iOS,
Webkit HTML5, JavaScript, Node.js, Python, etc. API.ai is built on the
following concepts:
1. Agents: Agents corresponds to applications. Once trained and tested,
an agent can be integrated with an app or device.
2. Entities: Entities represent concepts/objects, often specific to a do-
main, as a way of mapping NLP (Natural Language Processing) phrases
to approved phrases that catch their meaning.
3. Intents: Intents represent a mapping between what a user says and
what action the software should take.
4. Actions: Actions correspond to the steps the application will take when
specific intents are triggered by user inputs.
5. Contexts: Contexts are strings that represent the current context of
a user expression. This is useful for differentiating phrases which might
be ambiguous and have different meanings depending on what was spoken
previously.
API.ai can be integrated with many popular messaging, IoT and virtual
assistant’s platforms. Some of them are Actions on Google, Slack, Face-
book Messenger, Skype, Kik, Line, Telegram, Amazon Alexa, Twilio SMS,
Twitter, etc.
A basic problem with commercial frameworks is that they are not free of
charge and that they save the data in their cloud. Hence, for business
models this may not prove to be a lucrative solution. Therefore, we used an open
source framework for this purpose that provided a basic skeleton for NLU. It
also provides the flexibility to modify the pipeline and to deploy the entire chatbot
from personal servers. We found Rasa Stack to be the most promising Python
framework; it also provides an additional Dialog Management framework.
Rasa Stack is an open-source bot building framework. It comprises two
modules: Rasa NLU and Rasa Core, both of which can be used separately.
Rasa NLU is responsible for the natural language processing. It recognizes
intents and entities from the user's input based on previous training data and
returns them in a structured format. It combines different annotators from the
spaCy parser to interpret the input data.
• Intent classification: Interpreting meaning based on predefined intents
(Example: "Please send the confirmation to [email protected]" is a "provide email"
intent with 93% confidence)
2.2 Databases
For a collection of drinks and food items, a database with relevant groupings
was needed. We looked into WordNet [9], which is one of the largest lexical
databases for English words. Nouns, verbs, adjectives and adverbs are grouped
into sets of cognitive synonyms (synsets), each expressing a distinct concept.
Synsets are interlinked by means of conceptual-semantic and lexical relations.[9]
A comprehensive list of drinks was not available in the database, hence for
training purposes the list of drinks was adopted from the database of Dialogflow.
However, the powerful relations of these synsets were utilized in a later module
(see section 4.3).
3 Data Generation
As a result of our research, we chose to work with the Rasa framework,
which is based on a machine learning model. In order to train the model, we
need a data set of potential sentences which customers would use in a conversation
with our ACE. Sentences of this data set not only need to be structured
realistically, but they also need to include entities such as the drink, food, amount,
temperature and size of an order. Since the client of our project partner has
no ACE in place yet, there is no real customer data. Hence, we needed to create
our own data set of possible sentences of a conversation. To create this data
efficiently, reduce bias in the formulation of sentences and have the flexibility to
increase or decrease the data set efficiently, we decided to write a data generation
script.
At a later stage, we extended the script in order to incorporate noise. Noise
consists of parts of an ACE's input sentences which do not belong there, and
can for example appear if the ACE misunderstands words or picks up parts
of other, unrelated conversations.
Furthermore, the data generation script takes the likelihood of combinations of
entities into account. All these features aim to make the generated data set as
realistic as possible.
3.1 Approach
The general idea of our data generation approach is to randomly combine gath-
ered raw data to sentences which customers would actually use while communi-
cating with the ACE. This raw data can be split into two groups: Entities and
bare-bone sentences.
Entities are characteristic values of an input sentence of a customer. In order to
generate realistic data we include eight entities: DRINKS, FOOD, AMOUNT,
SIZE, LOCATION, TEMPERATURE, TOPPINGS and SUGAR AMOUNT.
Each of these entities has multiple values, such as over 800 standard soft and
alcoholic drinks in DRINKS, almost 300 international dishes in FOOD, and
roughly 50 different locations of a typical hotel / cruise ship in LOCATIONS.
These values of entities were partially brainstormed and added manually or ex-
tracted from lists of drinks and food online.
Since natural language is more elaborate than only listing values, we also need
to create sentences which can be filled with such values. Hence, we create bare-
bone sentences which have slots to be filled with entity values. Each bare-bone
sentence is assigned to one intent, which is the basic message of the sentence.
In total we generate three major intents: ORDER DRINKS, ORDER FOOD
and CHANGE ORDER. Let us have a look into the ORDER DRINKS intent
to illustrate how the bare-bone sentences are created. First, we gather standard
phrases of how to order a drink, again by brainstorming and online research.
One example is ”I would like to have”. We can now either add the slot which is
to be filled directly and obtain ”I would like to have a ENTITY”. Alternatively,
we can add additional parts to the bare-bone sentence such as ”please” or other
fillers which do not change the intent of the sentence. We then obtain ”I would
like to have a ENTITY please”.
Now that we have created our raw data, we consider the actual data generation. The
script first loads all entity values and bare-bone sentences. For each slot of a
bare-bone sentence which can be filled with entity values we create a list of
values that fit into the slot. Taking the previous example sentence of the OR-
DER DRINKS intent ”I would like to have a ENTITY please”, we can find
two slots which can be filled: ”a” and ”ENTITY”. The first slot ”a” can be
replaced by any number or other entity values which refer to the amount of an
order such as ”a” itself, ”a round of” or ”a dozen”. The second slot ”ENTITY”
can be substituted by a number of entity values. Besides the specific DRINK,
additional entities such as SIZE, TEMPERATURE, LOCATION, TOPPINGS
and SUGAR AMOUNT can be filled in at the second slot. This substitution
happens in an arbitrary and nested way. First, an arbitrary sub-list of the list
of additional entities is generated. Let us assume the arbitrary sub-list contains
SIZE, LOCATION and TOPPINGS. The nested substitution starts with SIZE
and replaces the slot "ENTITY" with "SIZE ENTITY". Afterwards, the word
"ENTITY" is again replaced by "ENTITY to LOCATION". Finally, the
word "ENTITY" is replaced by "ENTITY with TOPPING". In a last step,
we fill in real values for all variables ENTITY (e.g. coke), SIZE (e.g. large),
TOPPING (e.g. lemon) and LOCATION (e.g. my room). Again taking the
bare-bone sentence we used before, we then obtain "I would like to have two
large cokes with lemon to my room please". In this way, we can generate as
many sentences as we need for each of the three major intents.
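The following minimal sketch illustrates this nested substitution; the entity lists, bare-bone sentence and decoration patterns are hypothetical stand-ins for the much larger lists described above.

```python
import random

# Hypothetical excerpts; the real lists contain hundreds of values each.
ENTITY_VALUES = {
    "AMOUNT": ["a", "two", "a round of", "a dozen"],
    "DRINK": ["coke", "espresso", "ice tea"],
    "SIZE": ["small", "large"],
    "TOPPING": ["lemon", "mint"],
    "LOCATION": ["my room", "the pool bar"],
}
BARE_BONES = ["I would like to have AMOUNT ENTITY please"]

# Optional decorations applied to the ENTITY slot in a nested fashion.
DECORATIONS = ["SIZE ENTITY", "ENTITY to LOCATION", "ENTITY with TOPPING"]

def generate_sentence():
    sentence = random.choice(BARE_BONES)
    # pick an arbitrary sub-list of additional entities and nest them
    for pattern in random.sample(DECORATIONS, random.randint(0, len(DECORATIONS))):
        sentence = sentence.replace("ENTITY", pattern, 1)
    # finally, substitute real values for the drink and all other slots
    sentence = sentence.replace("ENTITY", random.choice(ENTITY_VALUES["DRINK"]))
    for name, values in ENTITY_VALUES.items():
        sentence = sentence.replace(name, random.choice(values))
    return sentence

print(generate_sentence())
# one possible output: "I would like to have two large espresso with lemon to my room please"
```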
Besides the three major intents, there are also sentences of minor intents such as
RECOMMEND, GOOD, BAD, HOW ARE YOU, THANKS, HELLO, GOOD-
BYE, CONFIRM POSITIVE, CONFIRM NEGATIVE and CANCEL ORDER.
In contrast to the three major intents, the minor intents usually either have all
slots already filled because they are taken from other ACE databases or the
sentences are of such simple structure that they contain no slots (such as "Cancel
my order please"). Hence, the data generation script simply adds these
sentences to the training data for the machine learning model, which is the
output of the script.
Additionally, Rasa provides sample sentences on different intents which are of-
ten used for ACEs such as BOOK RESTAURANT, PLAY MUSIC or
GET WEATHER. These sentences are also added to our training data. With-
out these additional intents, we could not tell if the intent detection of our
machine-learning model actually works properly, or if the right intent of an in-
put was only detected because there are simply no other intents to choose from.
So far we only considered the case that our ACE fully understands every part
of the input sentence given by the customer. Since this is not always the case in
reality due to multiple reasons, we also add noise to the sentences of the training
data.
3.3 Correlated Data
Certain entities can be combined, such as DRINKS and TOPPINGS, DRINKS
and TEMPERATURE, DRINKS and SIZE, and DRINKS and DRINKS. These interrelations
indicate how likely two values are combined in reality. The two values ”cold”
and ”coke” are more likely to be combined than ”hot” and ”coke” and therefore
obtain a higher value. These interrelations are stored in a table for each possible
combination of entities. If a bare-bone sentence contains such entities, such as
”I would like to order a TEMPERATURE ENTITY please”, the slots are filled
according to the probability of the DRINK-TEMPERATURE table.
These tables do not only result in more realistic training data, they also help
in suggesting recommendations. If, for example, an espresso is ordered, the ACE
knows that an espresso is often ordered with a glass of water and can ask the
customer. We cover the topic of recommendations and the tables of combined
entities in more detail in section 5.
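A minimal sketch of slot filling driven by such an interrelation table; the table entries below are hypothetical values, not the project's actual figures.

```python
import random

# Hypothetical excerpt of the DRINK-TEMPERATURE interrelation table;
# higher values mean the combination is more likely in reality.
DRINK_TEMPERATURE = {
    "coke":     {"cold": 0.95, "hot": 0.05},
    "espresso": {"cold": 0.10, "hot": 0.90},
}

def pick_temperature(drink):
    """Fill a TEMPERATURE slot according to the drink's interrelations."""
    temperatures, weights = zip(*DRINK_TEMPERATURE[drink].items())
    return random.choices(temperatures, weights=weights, k=1)[0]

print(pick_temperature("coke"))  # almost always 'cold'
```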
of ’double-checking’ the output from Rasa’s pipeline. In the third section, we
introduce synonym detection which makes input by customers more machine-
readable. This allows customers to give natural input and enables the ACE to
assign entities of this input to known values.
• ner_synonyms
• intent_classifier_sklearn
Throughout this section we will also discuss how they were customized in the
pipeline. But for now, let us have a look at what each component does.
4.1.7 Intent Classifier Sklearn
The intent classifier uses the outputs of previous components in order to classify the
intent of an input with a support vector machine. Additionally, we get the
confidence not only of the most likely intent but of all trained intents.
This pre-defined pipeline already equips us with well-working intent and entity
extractors, which form the basis of our NLU model. However, for some cases
of input we need to take additional actions. Since Rasa is an open source framework,
we can customize such pipelines by adding components.
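A minimal sketch of training such a customized pipeline, assuming the pre-1.0 rasa_nlu API that was current at the time of this project; the file paths and the custom component name are placeholders.

```python
from rasa_nlu import config
from rasa_nlu.model import Trainer
from rasa_nlu.training_data import load_data

# config.yml lists the pipeline components, e.g. the pre-defined spaCy
# pipeline extended with a custom component:
#   pipeline:
#   - name: "nlp_spacy"
#   - name: "tokenizer_spacy"
#   - name: "intent_featurizer_spacy"
#   - name: "ner_crf"
#   - name: "ner_synonyms"
#   - name: "intent_classifier_sklearn"
#   - name: "my_module.MyCustomComponent"   # hypothetical custom component

training_data = load_data("data/training_data.json")  # placeholder path
trainer = Trainer(config.load("config.yml"))
interpreter = trainer.train(training_data)
trainer.persist("models/")

print(interpreter.parse("I would like to drink a coffee"))
# -> structured output with the intent, its confidence and the entities
```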
4.2.2 Approach
A dependency parser analyzes the grammatical structure of a sentence, estab-
lishing relationships between ”head” words and words which modify those heads.
The figure below shows a dependency parse of a short sentence. The arrow from
the word moving to the word faster indicates that faster modifies moving,
and the label advmod assigned to the arrow describes the exact nature of the
dependency. The tag advmod denotes an adverb. The parser is powered by a
neural network which accepts word embedding inputs. It is trained using adaptive
gradient descent (AdaGrad) with hidden unit dropout [10]. The parser is
used to exploit the grammatical structure of the input sentences and extract
the main entity (like DRINK) with all the words that 'modify' the entity (like
TOPPING, AMOUNT or SIZE). It is based on the assumption that the main
entity will either be the object in a sentence or the subject. For example:
”I want to order a coffee” → (coffee is the object)
”A coffee for drinking will be great” → (coffee is the subject)
The function that was created based on the dependency parser first checks for a
direct object in the sentence and saves it as the main entity. It then searches for
all the words that 'modify' the main entity and saves them as first-tier entities.
It then also looks for the words that modify the first-tier entities. This was
done because it was noticed that most frequently stop words (like 'a' or 'the')
were the ones directly connected to the main entity. If it cannot find a direct
object, it picks the subject of the sentence as the main entity and follows
the same procedure. However, if neither an object nor a subject is found, it simply
returns the input text as output. Moreover, we added a naive approach to
split sentences with multiple intents or multiple main entities: we decided to
split the sentence at 'and' and pass each part to the above function with the
dependency parser. Following are the outputs for a few example sentences (a
sketch of this logic follows the examples):
• ”I want two coffees and new towels please” → [’two’, ’coffees’] [’new’,
’towels’]
• ”Get me a cold lemon ice tea in glass for afternoon” → [’a’, ’cold’, ’lemon’,
’ice’, ’tea’, ’in’, ’glass’]
• ”I want to order bananas and chocolate milk and get them soon” → [’ba-
nanas’], [’chocolate’, ’milk’], [’them’]
• ”A cafe creme will work best for today” → [’A’, ’cafe’, ’creme’]
• ”A cafe creme and fresh orange juice will work best for today” → [’A’,
’cafe’, ’creme’] , [’fresh’, ’orange’, ’juice’]
• ”A cafe creme will work best for today and my wife wants a cappuccino”
→ [’A’, ’cafe’, ’creme’], [’a’, ’cappuccino’]
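A minimal sketch of this extraction logic. The project used Stanford's neural dependency parser [10]; for a self-contained illustration, the sketch below uses spaCy's dependency labels instead, so the details are assumptions rather than the project's exact implementation.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed small English model

def extract_main_entity_phrase(sentence):
    """Object-first, subject-fallback extraction of the main entity
    together with its first- and second-tier modifiers."""
    doc = nlp(sentence)
    main = next((t for t in doc if t.dep_ == "dobj"), None)
    if main is None:
        main = next((t for t in doc if t.dep_ == "nsubj"), None)
    if main is None:
        return sentence  # neither object nor subject found
    tokens = {main}
    for child in main.children:          # first-tier modifiers
        tokens.add(child)
        tokens.update(child.children)    # their modifiers (second tier)
    return [t.text for t in sorted(tokens, key=lambda t: t.i)]

def split_and_extract(sentence):
    # naive multi-intent handling: split at 'and' and parse each part
    return [extract_main_entity_phrase(part.strip())
            for part in sentence.split(" and ")]

print(split_and_extract("I want two coffees and new towels please"))
# expected output along the lines of [['two', 'coffees'], ['new', 'towels']]
```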
4.3 Synonyms
4.3.1 Motivation
As was previously established, our NLU model parses a sentence by classifying
its intent and extracting entities. One of the goals of this project was to provide
the resort with a way to optimize their processes by analyzing the data collected
through this automatic parsing. This requires that the output of the NLU model
be machine-readable to facilitate such analysis.
In practice this means that every intent and entity value should have a real
counterpart in the resort, e.g. for a sentence expressing the ”order drink” intent
containing a ”drink” entity with the value ”cappuccino”, there should be a real
mechanism in place, which fulfills the function of ordering a drink as well as an
item corresponding to ”cappuccino” to be delivered to the customer. Defining
these counterparts for all possible intents is trivial, since their number is finite
and expected to be small. However, there is an unknown and potentially large
number of names referring to the same item. In other words, by identifying a
drink in a sentence, we do not always know what particular item in the resort’s
range of drinks it corresponds to.
In order to avoid the need to manually map every possible entity value to its
counterpart in the resort, the NLU model should ideally only output one possible
entity value for each distinct item. We approach this problem by attempting to
automatically identify sets of synonyms or ”synsets” across all observed entity
values. It is then assumed that each of these synsets only corresponds to a single
item in the resort.
4.3.2 Approach
WordNet [9] is one of the largest resources for synonyms. Each synset contained
in WordNet expresses a distinct concept, however, one word or phrase can be
part of multiple synsets, since it may express different concepts. For example,
the word "second" in the sentence "I will be there in a second" has the meaning
"a short time frame", whereas in the sentence "I will be there in 60 seconds"
it refers to an actual, precise measurement of time. The task of identifying the
correct concept of a word or phrase in a given context is called word sense
disambiguation (WSD).
Multiple algorithms have been proposed for this task including dictionary based,
graph based, supervised and similarity based methods [11]. Popular algorithms
include Lesk algorithm [12], maximizing path similarity [13] and maximum en-
tropy [14]. Since we do not have a ground truth for word sense in our example
sentences, supervised methods do not apply to our problem. We also need the
WSD algorithm to be computationally inexpensive, since typically there are a
lot of example sentences to consider at training time and the prediction needs
to be fast at inference time. This, in our experience, excludes graph based and
similarity based methods.
Adapted Lesk [12] is a version of the dictionary based Lesk algorithm which has
been adapted to the dictionary definitions and synsets of WordNet. We use this
algorithm to predict a synset for each of the entity values in all training
sentences. To minimize erroneously assigned synsets, the final prediction for each
entity value is the synset which was most often predicted across all sentences
containing this value. Since we expect the word senses to be disjoint across
entities, we consider two entity values of different entities to be different entity
values, even if they are the same string.
Entity values are then considered synonyms if they are members of the same
entity and synset. The output of the NLU model for an entity value is the lexi-
cographically smallest member of its synset. This selection is arbitrary, though
it ensures that the output for all entity values of a synset is always the same.
If at inference time, an unseen entity value is detected, the synset can be found
by applying adapted Lesk on the newly observed sentence. We implemented
17
our approach for synonym detection as a custom component of the rasa nlu
framework, so that it may be trained end to end.
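A minimal sketch of the majority-vote synset prediction. Note that NLTK ships the simplified Lesk algorithm rather than the adapted Lesk of [12], so it serves here only as a stand-in; the example entity value and sentence are hypothetical.

```python
from collections import Counter

from nltk.tokenize import word_tokenize
from nltk.wsd import lesk  # simplified Lesk, stand-in for adapted Lesk [12]

def predict_synset(entity_value, sentences):
    """Predict one synset for an entity value by majority vote over all
    training sentences containing that value."""
    votes = Counter()
    for sentence in sentences:
        synset = lesk(word_tokenize(sentence), entity_value)
        if synset is not None:
            votes[synset] += 1
    return votes.most_common(1)[0][0] if votes else None

sentences = ["I would like to order an arrack please"]
print(predict_synset("arrack", sentences))
```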
4.3.3 Results
Across 25584 example sentences containing 14189 distinct entity values our ap-
proach detected 78 synsets, which contain more than one entity value. As a
result, 80 entity values could be replaced by their synonym. This relatively low
number of detected synonyms might be attributed to the fact that this data was
artificially generated, which excludes a lot of grammatical as well as dialect and
cultural variation that would be found in natural language data.
Of the 80 detected synonyms, 2 were found to be falsely detected. The algorithm
equated the genres ”punk” and ”punk rock” as well as the drinks ”arak” and
”arrack”. This can be attributed to mistakes in WordNet’s definitions, since
these two instances are actually common confusions.
Among the correctly identified synonyms, 72 are abbreviations, such as the
initials referring to the individual states of the USA or the numeral representations
of written-out numbers, as well as grammatical variations such as plurals.
This alone justifies the use of our approach, since the need for any additional
corpora is eliminated.
5 Recommender System
5.1 Motivation
The advent of big data has spawned a trend of adapting interactive systems to
users’ preferences. This can partly be attributed to the availability of user spe-
cific data, as well as the need of the customer to filter through an overwhelming
range of products. Examples include Netflix and Amazon attempting to filter
for movies or products, which the customer is most likely to interact with.
This personalization is particularly relevant to a hotel resort, since the goal is
for the customer to feel as comfortable as possible in this environment. Hence, it
is necessary to tailor our ACE system to the customer’s individual preferences.
One approach to this problem is to estimate the customer's preferences for all
relevant items, based on their past interactions and other users' preferences. This
approach corresponds to a recommender system. In our framework, entity values
represent all items which are available to the customer through the resort.
Thus, interactions of the customer with the ACE system provide a data source
for recommendations.
5.2 Approach
A resort is a constantly changing environment. Since the needs of the resort
may differ depending on the situation, a recommender system should be easily
customizable. Additionally, customers might stay only a short while and only
interact very little with the system. Thus, one wants to utilize all available
information about the customer at any time. This includes previous knowledge
about the customer such as gender, age and nationality, as well as situational
information such as the time of the interaction. Recommender systems
which include such information are called context-aware recommender
systems [15]. Since we do not have rating data but only interaction data,
we are interested in the probability that a customer will select a given item
in a given context, $P(item \mid context)$. Viewing $X$ as a random variable over
all items of a given entity, one obtains a probability distribution of $X$, with
$\sum_{item} P(X = item \mid context) = 1$. The context can take the form of any prior
information and can consist of multiple different contexts, such as age,
nationality and gender, with $context = context_1, context_2, context_3, \ldots$ and thus
$P(item \mid context_1, context_2, context_3, \ldots)$. There exist three different paradigms
for context-aware recommender systems [15].
Contextual modeling describes the attempt to model $P(item \mid context)$ directly.
The context, however, can be high-dimensional, resulting in sparse data. Additionally,
when incorporating new contexts into the model, the whole model has
to be retrained.
Contextual pre-filtering is the approach of dividing the data by its context and
modeling each context separately. However, the same sparsity problem as in
contextual modeling arises.
In contextual post-filtering, the probability $P(item)$ is modeled independently
of context and post-modified according to the context. We use this paradigm by
modeling $P(item)$ as well as $P(item \mid context_1), P(item \mid context_2), \ldots$ and then
combining them in a weighted sum:

$$P(item \mid context) = a \cdot P(item) + \sum_{i=1}^{\#contexts} b_i \cdot P(item \mid context_i)$$

with $a, b_i \in \mathbb{R}^+$ and $a + \sum_{i=1}^{\#contexts} b_i = 1$.
$P(item)$ can be post-filtered through methods other than a weighted sum,
though this approach allows us to customize how much each context weighs into
the final recommendation.
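A minimal sketch of this weighted combination, with NumPy arrays standing in for the distributions; the example numbers are hypothetical.

```python
import numpy as np

def post_filter(p_item, p_item_given_context, b, a):
    """Contextual post-filtering as a weighted sum.

    p_item: P(item), array of shape (n_items,)
    p_item_given_context: list of arrays P(item | context_i)
    b: per-context weights b_i; a: weight of the context-free term,
    with a + sum(b) == 1 so the result is again a distribution."""
    assert np.isclose(a + sum(b), 1.0)
    combined = a * np.asarray(p_item, dtype=float)
    for b_i, p_ctx in zip(b, p_item_given_context):
        combined += b_i * np.asarray(p_ctx, dtype=float)
    return combined

# usage: equal weighting of one context against the base distribution
print(post_filter([0.7, 0.3], [[0.2, 0.8]], b=[0.5], a=0.5))  # [0.45 0.55]
```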
Usually $P(item)$ implicitly refers to the probability of a customer interacting
with an item, given their past interactions. We choose to view the past interactions
as yet another context and model $P(item)$ by simply considering the relative
frequency of the item in all interactions:

$$P(item) = \frac{\#\text{ of interactions with the item}}{\#\text{ of interactions with all items}}$$

The reasoning behind viewing a customer's past interactions as a context is
given in section 5.3. The following subsections explain the different contexts
and the approaches to modeling $P(item \mid context_i)$.
5.2.1 Meta-data as context
Continuous context values can be transformed into meta-data by binning the values. For example, in
the context of time, the values get binned into morning, daytime and evening.
One obtains the data matrix $M \in \mathbb{N}^{c \times e}$ for a context $C$ with $c$ values and an
entity $E$ with $e$ values. The entries of the matrix are the numbers of observed
interactions containing the specific entity value and meta-data value. We obtain
the probability distribution for $P(E \mid C)$ by dividing the columns of $M$ by their
sum and then dividing the rows by their sum. Normalizing the columns in
this way has the effect of calculating the relative frequency of all contexts for
all entity values. Normalizing the rows transforms each row into a probability
distribution over the entity conditioned on a given context. Hence:

$$P(E = entity_i \mid C = context_j) = M_{j,i}, \quad i \in [1, e],\; j \in [1, c]$$
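A minimal sketch of this double normalization with NumPy, assuming a small hypothetical count matrix:

```python
import numpy as np

def conditional_distribution(M):
    """Turn a count matrix M (contexts x entity values) into P(E | C):
    first normalize each column (relative frequency of the contexts per
    entity value), then normalize each row into a distribution."""
    M = M.astype(float)
    M = M / M.sum(axis=0, keepdims=True)  # column normalization
    M = M / M.sum(axis=1, keepdims=True)  # row normalization
    return M

# rows: contexts (morning, evening); columns: entity values (coffee, wine)
counts = np.array([[30, 2],
                   [10, 20]])
print(conditional_distribution(counts))  # each row sums to 1
```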
Since we were not provided with any data, we have no means of validating this
approach with a validation data set. However, we regard the method of
using relative frequencies as simple enough not to require further validation.
In order to generate a first rough estimate of recommendations for different
meta-data, we manually curated lists of drinks relevant to different values of
the contexts time, age and nutritional value. We then generated context specific
data with our approach for data generation and used our algorithm to recover
the probability distributions.
Figure 3: Clusters of drinks observed in a T-SNE embedding of the drink-
drink similarity matrix estimated with web search. Figure a) shows a cluster of
alcoholic beverages, while figure b) shows a cluster of non-alcoholic beverages.
5.3 System integration
All calculated recommendations are saved to a database and can be retrieved
at runtime. The final recommendation for a set of contexts is obtained by
calculating their weighted sum. Which contexts are relevant and how they are
weighted depends on the situation and the resort's needs.
Including the customer's past orderings as a context when only few orderings
have been made is to be avoided, due to the cold start problem. Relying on
more general contexts will yield more general, though more robust,
recommendations. As a baseline, all available contexts are used and weighted equally.
In order to update the recommendations, we use two approaches. The first is
to simply recalculate all recommendations, including the newly observed data.
This, however, is too slow to apply after each new data point. Since the system
should respond quickly to new data, the recommendations for an observed entity
are simply increased by a fixed value. Higher increase values result in more
recent orders being weighted more heavily.
Once the recommendations are obtained, they can be used to personalize the
system in different ways. The most straightforward way is to allow the customer
to ask for recommendations for a specific entity, or to automatically make rec-
ommendations once a sufficiently high confidence is reached. Additionally, if
the customer asks e.g. ’What items are available?’, only the most likely recom-
mendations are given instead of an exhaustive list.
The recommendations can also be used to improve the performance of the ACE
itself. If there exists a mandatory entity for a given intent, one can fill this
slot with the most likely recommendation if the confidence is high enough, in-
stead of the default value. Additionally, when the NLU model gives ambiguous
predictions, one can select the most likely recommendation.
6 Dialog Management
6.1 Motivation
To consolidate outputs of all the components (RASA NLU, Customer History,
Recommendations) mentioned earlier, and to curate a conversation between the
user and the bot, a dialog management framework was required.
We required a framework that is open source, keeps the context of
the conversation and can be developed with little or no training data.
Considering the above-mentioned constraints, we decided to go forward with
RASA Core for the development of the dialogue management component.
6.2.1 High-Level Architecture
The following diagram gives an overview of the high-level architecture of Rasa Core
and how it manages input and output messages.
4. The policy decides on the next action. A policy decides what action
to take at every step in a dialogue. In our project, the following four policies
were configured.
5. The chosen action is logged by the tracker.
6. A response is sent to the user.
The entire process is handled by the Agent. The Agent allows the developer to train,
load and use the model. It is a simple API that provides access to most of
Rasa Core's functionality.
6.3.1 Domain
* Unfeaturized: Data you want to store which shouldn't influence the
dialogue flow.
* Main Entity: For each intent, there is a Main Entity defined and stored
in the database. This Main Entity is required for that intent. For example,
the order_drink intent must have DRINK as an entity in the message.
Slot Filling:
The next step after defining the slots is to set how and from where each
of these slots takes its value.
a. Slot Filling via Form Action
It is recommended to create a Form Action for slot filling if multiple
pieces of information need to be collected in a row. It is a single
action which contains the logic to loop over the required slots and ask the
user for this information. To add forms to the domain file, their names
must be referenced under the forms: section of the domain file. In our
project, we used a Form Action to fill the slots that are required to order
a drink. More details will be given in the Actions section later.
b. Slot Filling via SlotSet Event
Every slot has a name and a value. The SlotSet event can be used to set
a value for a slot in a conversation. In our project, the remaining slots
were set in various actions using the SlotSet event. More details will be
given in the Actions section later.
4. Templates: Utterance templates are messages the bot will send back to
the user. There are two ways to use these templates:
If the name of the template starts with utter_, the utterance can directly
be used like an action. It is required to add the utterance template to the
domain. In our project, utter templates for all the utter actions mentioned
above are added to the domain file.
Templates can be used to generate response messages from custom actions
using the dispatcher, e.g. dispatcher.utter_template("utter_greet"). This
allows separating the logic of generating the messages from the actual
copy. Adding templates to the domain file is optional.
For our project, we changed the default fallback action to our custom utter
action (utter_fallback). Respective parameters were set in the Fallback Policy.
Utter Actions: These actions send messages to the user as per the templates
defined in the domain file. Utter actions always start with utter_ and look for
the text of the messages in the templates defined in the domain file. Templates
for utter actions must therefore start with "utter_".
Custom Actions: Custom actions can be defined to add customized responses
from the bot, apart from utter and default actions. A custom action can run
arbitrary code and do anything from turning on lights, adding events to a schedule
or checking a user's bank balance, to anything else.
Rasa Core calls an endpoint specified in endpoints.yml when a custom action
is predicted. This endpoint should be a webserver that reacts to this call,
runs the code and optionally returns information to modify the dialogue state.
Rasa provides the rasa_core_sdk package to define custom actions in Python.
In our project, the following custom actions were defined in the Action.py file.
Let us investigate the custom actions in detail.
• required_slots: a list of slots that need to be filled for the submit method
to work.
• submit: what to do at the end of the form, when all the slots have been
filled.
Every time this form action gets called, it will ask the user for the next slot in
required_slots which is not already set. It does this by looking for a template
called utter_ask_{slot_name}, which needs to be defined in the domain file for each
required slot. Once all the slots are filled, the submit() method is called, where
the collected information can be used to do something. After the submit
method is called, the form is deactivated, and other policies in the Core model
will be used to predict the next action. Some additional methods defined
in the Form Action are:
slot_mappings: defines how to extract slot values from user responses. The
predefined functions work as follows:
• self.from_entity(entity=entity_name, intent=intent_name) will look for an
entity called entity_name to fill a slot slot_name, regardless of the user's intent
if intent_name is None, else only if the user's intent is intent_name.
• self.from_intent(intent=intent_name, value=value) will fill slot slot_name
with value if the user's intent is intent_name.
• self.from_text(intent=intent_name) will use the next user utterance to fill
the text slot slot_name, regardless of the user's intent if intent_name is None,
else only if the user's intent is intent_name.
To allow a combination of these, it is required to provide them as a list.
validate: After extracting a slot value from user input, the form validates the
value of the slot. By default, it only checks whether the requested slot was extracted.
In our project, custom validation is added which checks the value against the
database.
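A minimal sketch of such a form action using the rasa_core_sdk API of that time; the slot, intent and template names below are hypothetical.

```python
from rasa_core_sdk.forms import FormAction

class OrderDrinkForm(FormAction):
    """Hypothetical form collecting the slots needed to order a drink."""

    def name(self):
        return "order_drink_form"

    @staticmethod
    def required_slots(tracker):
        # slots the form loops over until all are filled
        return ["drink", "amount", "location"]

    def slot_mappings(self):
        return {
            "drink": self.from_entity(entity="drink", intent="order_drink"),
            "amount": [self.from_entity(entity="amount"),
                       self.from_intent(intent="confirm_positive", value="one")],
            "location": self.from_text(),
        }

    def submit(self, dispatcher, tracker, domain):
        # called once all required slots are filled
        dispatcher.utter_template("utter_order_placed", tracker)
        return []
```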
2. Save Time:
This action calculates the time difference between the previous and the current
user input. It sets the slots Time and Time_Diff.
3. Coreferencing:
This action uses the context of the conversation to set some slot values, since
these slot values influence the flow of the dialogue. It first extracts the current
intent and entities from the latest message and checks whether the main entity
of the respective intent is present or missing. If the main entity is present, it:
• sets the slot for that entity to its value
• sets the Boolean slot main_entity to true
• sets the slot previous_intent to the current intent, as the main entity of
the intent is present
If the main entity is missing, it:
• takes the value of previous_intent from the slots
• checks if the currently detected entities are in the list of entities for that
intent
• checks the confidence of the value if it is detected as any of the entities in
the list
• if the confidence is above a certain value and the time difference is not
major, treats this sentence as belonging to the previous intent
• sets the slot main_entity to false
• sets the slot previous_intent to the previously detected intent
A sketch of this logic follows below.
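A heavily simplified sketch of this action with rasa_core_sdk; the intent-to-entity mappings, slot names and confidence threshold are hypothetical, and the time-difference check is omitted for brevity.

```python
from rasa_core_sdk import Action
from rasa_core_sdk.events import SlotSet

# hypothetical mappings from intents to their main and allowed entities
MAIN_ENTITY = {"order_drink": "drink", "order_food": "food"}
INTENT_ENTITIES = {"order_drink": ["drink", "amount", "size", "topping"]}
CONFIDENCE_THRESHOLD = 0.7  # assumed cut-off

class ActionCoreference(Action):
    def name(self):
        return "action_coreference"

    def run(self, dispatcher, tracker, domain):
        intent = tracker.latest_message["intent"]["name"]
        entities = tracker.latest_message["entities"]
        main = MAIN_ENTITY.get(intent)
        found = [e for e in entities if e["entity"] == main]
        if found:  # main entity present
            return [SlotSet(main, found[0]["value"]),
                    SlotSet("main_entity", True),
                    SlotSet("previous_intent", intent)]
        # main entity missing: fall back on the previous intent's context
        previous = tracker.get_slot("previous_intent")
        confident = [e for e in entities
                     if e["entity"] in INTENT_ENTITIES.get(previous, [])
                     and e.get("confidence", 0) > CONFIDENCE_THRESHOLD]
        return ([SlotSet(e["entity"], e["value"]) for e in confident]
                + [SlotSet("main_entity", False),
                   SlotSet("previous_intent", previous)])
```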
6.3.2 Stories
A training example for the Rasa Core dialogue system is called a story. RASA
Core learns from these provided example conversations. A story starts with a
name preceded by two hashes, e.g. ## story_03248462. A story can be called
anything, but it can be very useful for debugging to give stories descriptive
names. The end of a story is denoted by a newline, and then a new story starts
again with ##.
Messages sent by the user are shown as lines starting with * in the format
* intent{"entity1": "value", "entity2": "value"}.
Actions executed by the bot are shown as lines starting with - and contain the
name of the action. Events returned by an action are on lines immediately after
that action. For example, if an action returns a SlotSet event, this is shown as
the line - slot{"slot_name": "value"}.
In our project, we have multiple stories for each intent. We have specifically
added stories where the main entity is detected or missing. These stories
can include various scenarios, depending on the kind of conversations expected
between the user and the bot.
6.3.3 Training Dialogue Model using RASA Core
python -m rasa_core.train interactive --core models/dialogue --nlu models/current/nlu --endpoints endpoints.yml
In interactive mode, the bot asks the user to confirm every prediction made
by NLU and Core before proceeding. For example:
In the above scenario, on typing 'n' the bot asks the user to select
the right action from a list of possible pre-defined actions. Continuing
like this, the bot stores the results and creates new stories for Rasa Core
to incorporate. At the end, it appends them to the stories.md file. The
model then has to be retrained using either the 1st or 2nd method.
7.1 Models
We trained eight different models to measure the performance and choose the
model with the most promising results. The models are as follows:
0. Baseline:
The Baseline model, as the name suggests, is the most basic model that
we trained with the sample data that we generated. The pipeline used
was the pre-defined spaCy pipeline. The training-test split was
80%-20%: for every intent, the training set contained 2400 sentences and
the test set 600.
5. Baseline with One Real-Word Noise:
For this model, we adapted the part of the data generation script that introduced noise.
Instead of a random string of characters, we picked an actual but random
word from a large database of words called a corpus, a package from
the nltk library, a popular Python library for natural language processing.
The same noisy sentence from above now looks like the following: "I would
like to house have two large cokes with lemon to my room please". As
before, the word is introduced randomly between any two words in the
sentence and is added to only 10% of the data.
6. Baseline with Two Real-Words Noise:
This model adds two random real words from the nltk library anywhere
in the sentence (see the sketch below).
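A minimal sketch of this real-word noise injection, using NLTK's words corpus as the noise source:

```python
import random

import nltk

nltk.download("words", quiet=True)  # one-time download of the word corpus
from nltk.corpus import words

WORDS = words.words()

def add_real_word_noise(sentence, n_words=1):
    """Insert n random real words between two random words, mirroring the
    One/Two Real-Word(s) Noise models; applied to only 10% of the data."""
    tokens = sentence.split()
    for _ in range(n_words):
        tokens.insert(random.randint(1, len(tokens) - 1), random.choice(WORDS))
    return " ".join(tokens)

print(add_real_word_noise("I would like to have two large cokes please"))
```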
7. Half Amount of Bare-Bone Sentences:
As explained in the data generation script, we used a list of bare-bone
sentences (e.g. "I would like to order a ENTITY with TOPPING") that were
combined with different entities (e.g. juice and ice) in various orders to give
final sentences (e.g. "I would like to order a juice with ice"). The motivation
behind this model was to analyze performance with reduced variability in the
model: a larger number of bare-bone sentences introduces more richness into
the model. While the training-test split so far was 80%-20%, for this model we
used a 40%-60% split, therefore working with only half the amount of
bare-bone sentences as well as of the list of all entities.
7.2 Evaluation
We chose Rasa NLU's extensive evaluate module for the performance evaluation.
The module focuses on the three most important metrics: Precision, Recall and
F1 Score. The correctly predicted observations (True Positives in the binary case)
are the observations that were predicted correctly for a class: they
belonged to the class and the model classified them as such. Let $k_1, k_2, \ldots, k_m$
be the numbers of correct classifications for the $m$ classes and $r_1, r_2, \ldots, r_m$ be
the numbers of predictions made for the $m$ classes. Also, let $n_1, n_2, \ldots, n_m$ be the
total numbers of actual points belonging to the $m$ classes. Then:
• Precision: It is the ratio of correctly predicted observations of a class to
the total predicted observations of the class. The precision for the model
is calculated as the average precision over all classes:

$$Precision = \frac{1}{m} \sum_{i=1}^{m} \frac{k_i}{r_i}$$
• Recall: It is the ratio of correctly predicted observations of a class to
all observations actually belonging to the class, averaged over all classes:

$$Recall = \frac{1}{m} \sum_{i=1}^{m} \frac{k_i}{n_i}$$
• F-Score: It is a harmonic mean of Precision and Recall:

$$F\,Score = \frac{(1 + \beta^2) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}$$

where $\beta$ is commonly 0.5, 1 or 2. The F1 Score, with $\beta = 1$, is therefore:

$$F1\,Score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$
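A small sketch computing these averaged metrics from hypothetical per-class counts, following the definitions above:

```python
import numpy as np

# hypothetical per-class counts for m = 3 classes
k = np.array([90, 45, 80])   # correct classifications per class
r = np.array([100, 50, 90])  # predictions made per class
n = np.array([95, 60, 85])   # actual members per class

precision = np.mean(k / r)
recall = np.mean(k / n)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```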
The module provides Precision, Recall and F1 Score for both intents and entities. The
following two tables summarise the metrics for all models for intent
recognition and entity recognition, respectively.
We can see from the summary table that all models perform extremely well
when it comes to intent recognition. The lookup tables and synonyms only
impact the entity recognition and hence improve its results when
compared to the Baseline models. This follows from the Rasa NLU pipeline:
the intents and entities are extracted separately, independently of each other. An
interesting observation is that the model with noise (BN) performed as well
as the Baseline models. Moreover, the BO and BT models with real-word noise
performed better than all other models. The main reason for this is the
fact that these models are more "careful" about classifying words as entities
or using them to identify intents. Hence, the possibility of wrong classifications
is reduced in these models and the performance improves. The last model, HBB,
which reduces the variability in the model, has to encounter more unseen words
in the testing environment. Hence, the misclassifications increase and the
performance drops.
The Rasa evaluate module also generates a confusion matrix for the intent
classification and a histogram showing the average confidence of extracted intents
on the x-axis and the sample size of the training data on the y-axis. Additionally, a
breakdown of entities for a more comprehensive in-depth analysis of the metrics for
all models is included in the Appendix. Following this extensive analysis of
different models, we decided on using the model with the best performance which
would also be sensitive to unseen data. The BO and BT models fit these criteria
well. Additionally, we wanted to be realistic with our model and avoid
unnecessary complications. Hence, we decided on incorporating the One-Real-Word
Noise model with Rasa Core, the dialog management framework.
References
[1] "Natural language understanding." https://searchenterpriseai.techtarget.com/definition/natural-language-understanding-NLU. Accessed: 2018-12-01.
[2] "Parsing." https://en.wikipedia.org/wiki/Parsing. Accessed: 2018-12-01.
[3] "Parser technopedia." https://www.techopedia.com/definition/3854/parser. Accessed: 2018-12-01.
[4] "Natural language processing made easy – using spacy (in python)." https://www.analyticsvidhya.com/blog/2017/04/natural-language-processing-made-easy-using-spacy-in-python/. Accessed: 2018-12-01.
[5] "Stanford open information extraction." https://nlp.stanford.edu/software/openie.html. Accessed: 2018-12-01.
[6] "Complete guide on bot frameworks." https://www.marutitech.com/complete-guide-bot-frameworks/. Accessed: 2018-12-01.
[7] "Rasa documentation." https://rasa.com/docs/nlu/. Accessed: 2018-12-01.
[8] "Conversational ai chatbot using rasa nlu rasa core." https://medium.com/@BhashkarKunal/conversational-ai-chatbot-using-rasa-nlu-rasa-core-how-dialogue-handling-with-rasa-core-can-use-331e7024f733. Accessed: 2018-12-01.
[9] G. A. Miller, "Wordnet: A lexical database for english," Commun. ACM, vol. 38, pp. 39–41, Nov. 1995.
[10] "Neural network dependency parser." https://nlp.stanford.edu/software/nndep.html. Accessed: 2018-12-01.
[11] R. Navigli, "Word sense disambiguation: a survey," ACM Computing Surveys, vol. 41, no. 2, pp. 1–69, 2009.
[12] S. Banerjee and T. Pedersen, "An adapted lesk algorithm for word sense disambiguation using wordnet," in International Conference on Intelligent Text Processing and Computational Linguistics, pp. 136–145, Springer, 2002.
[13] T. Pedersen, S. Patwardhan, and J. Michelizzi, "Wordnet::similarity," pp. 38–41, 01 2004.
[14] A. L. Berger, V. J. D. Pietra, and S. A. D. Pietra, "A maximum entropy approach to natural language processing," Comput. Linguist., vol. 22, pp. 39–71, Mar. 1996.
[15] F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, Recommender Systems Handbook. Berlin, Heidelberg: Springer-Verlag, 1st ed., 2010.
[16] G. Linden, B. Smith, and J. York, "Amazon.com recommendations: Item-to-item collaborative filtering," Internet Computing, IEEE, vol. 7, pp. 76–80, 01 2003.
[17] "Bing web search engine." https://www.bing.com/. Accessed: 2018-12-01.
[18] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
[19] Y. Hu, Y. Koren, and C. Volinsky, "Collaborative filtering for implicit feedback datasets," in Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM '08, (Washington, DC, USA), pp. 263–272, IEEE Computer Society, 2008.
[20] "Rasa documentation - training data format." https://rasa.com/docs/nlu/dataformat/. Accessed: 2018-12-01.
Appendix: Results from Performance Evaluation
The histograms generated by RASA's evaluate function are shown in the figures
below, followed by the tables summarizing Precision, Recall and F1 score for
each entity in all the different models.
Figure 7: Baseline, Noise, Lookup Table and Synonym Model
Table 1: Precision
spatial_relation 1 0.99 1 1 1 1 1
state 0.98 0.98 0.99 0.99 0.99 0.99 0.99
timeRange 0.99 0.99 0.99 0.99 0.99 0.99 0.99
track 1 0.99 0.99 1 0.99 1 1
year 0.88 0.9 0.91 0.9 0.91 1 1
Table 2: Recall
rating_unit 1 1 1 1 1 1 1
rating_value 0.99 0.99 0.99 0.99 0.99 0.99 0.99
restaurant_name 0.95 0.95 0.94 0.95 0.94 0.94 0.96
restaurant_type 0.99 0.99 0.99 0.99 0.99 0.99 0.99
served_dish 0.96 0.93 0.95 0.94 0.95 0.97 0.97
service 1 1 1 1 1 1 1
sort 1 0.99 0.99 1 0.99 1 1
spatial_relation 0.99 0.99 0.99 0.99 0.99 0.99 0.99
state 0.98 0.99 0.98 0.98 0.98 0.98 0.99
timeRange 0.86 0.87 0.85 0.85 0.85 0.85 0.86
track 1 0.96 0.99 0.99 0.99 0.99 1
year 1 1 1 1 1 1 1
Table 3: F1 Score
object_select 0.97 0.97 0.97 0.98 0.97 0.98 0.98
object_type 0.99 0.99 0.99 0.99 0.99 0.99 0.99
party_size_description 0.99 1 0.99 0.99 0.99 1 1
party_size_number 0.97 0.96 0.97 0.97 0.97 0.97 0.97
playlist 0.99 0.99 1 1 1 1 1
playlist_owner 0.99 0.98 0.98 0.99 0.98 0.99 0.99
poi 0.98 0.99 0.97 0.97 0.97 0.97 0.98
rating_unit 1 1 1 1 1 1 1
rating_value 0.98 0.98 0.99 0.99 0.99 0.99 0.99
restaurant_name 0.97 0.97 0.97 0.97 0.97 0.97 0.98
restaurant_type 0.99 0.99 0.99 0.99 0.99 0.99 0.99
served_dish 0.9 0.9 0.91 0.88 0.91 0.98 0.99
service 1 1 1 1 1 1 1
sort 0.99 0.99 0.99 0.99 0.99 0.99 0.99
spatial_relation 0.99 0.99 0.99 0.99 0.99 0.99 0.99
state 0.98 0.98 0.98 0.98 0.98 0.98 0.99
timeRange 0.92 0.92 0.92 0.92 0.92 0.92 0.92
track 1 0.98 0.99 0.99 0.99 0.99 1
year 0.94 0.95 0.95 0.94 0.95 1 1