Retrieval Augmented Generation (RAG) Based Restaurant Chatbot With AI Testability

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/381461839

Conference Paper · July 2024

All content following this page was uploaded by Jerry Gao on 16 June 2024.

Retrieval Augmented Generation (RAG) based
Restaurant Chatbot with AI Testability
Vani Bhat, Sree Divya Cheerla, Jinu Rose Mathew, Nupur Pathak, Guannan Liu
Department of Applied Data Science
San Jose State University
San Jose, USA
(vani.bhat, sreedivya.cheerla, jinurose.mathew, nupur.pathak, guannan.liu)@sjsu.edu
*Equal Contribution – The First Four Authors

Jerry Gao
Department of Computer Engineering
San Jose State University
San Jose, USA
Corresponding Author: [email protected], [email protected]

Abstract—Post-COVID, the restaurant industry is experiencing a surge in demand, presenting a unique challenge of efficiently managing increased customer flow while ensuring seamless interactions. Chatbots have emerged as an innovative solution to meet the demand increase. The paper addresses the enhancement of AI chatbots through the integration of Retrieval-Augmented Generation (RAG) with the Large Language Model (LLM). This paper focuses on the development of a restaurant chatbot that not only engages in natural-language conversations but also addresses context optimization and LLM optimization for restaurant context learning. The approach uses a Neo4j Knowledge graph built using the restaurant data as an external source of knowledge. The graph is traversed to match the user question with appropriate answer tokens using Term Frequency - Inverse Document Frequency (TF-IDF) embeddings. The relevant tokens along with user questions are used to provide additional context to the T5 language model to provide nuanced responses to the users. This improvement is quantitatively evidenced by a Bilingual Evaluation Understudy (BLEU) score of 0.60, indicating a high level of precision in language understanding and generation. An extensive evaluation of the chatbot includes assessing AI testability on the level of words, sentences, and information. These evaluations include simulated dialogue assessments and performance analyses, with a focus on the chatbot's ability to retrieve and integrate information. Based on the AI testability evaluation, the models consistently produce more knowledgeable, diverse, and relevant answers as compared with state-of-the-art models with an average information score in the range of 0.6-0.8.

Keywords— Natural Language Processing (NLP), Retrieval Augmented Generation (RAG), Large Language Model (LLM), Information Retrieval (IR), Knowledge Graph, AI Testability

I. INTRODUCTION

Artificial Intelligence (AI) has transformed the way we interact with technology, and one such area where AI has shown significant promise is in the development and enhancement of chatbot systems. A chatbot is a human-computer interaction model that simulates a coherent conversation between machines and users. The enormous scope of research in this area and the recent developments in technologies like deep learning, information retrieval and the advent of language models have led to intelligent chatbot systems that continuously learn and improve themselves over time.

The chatbot industry has grown significantly in recent years to meet customer demands by providing customer service 24/7 and to reduce operational costs by assigning jobs to chatbots. According to Grand View Research [1], the size of the global chatbot market was assessed at USD 5,132.8 million in 2022, and from 2023 to 2030, it is anticipated to increase at a CAGR of 23.3%. According to Markets and Markets research [2], the chatbot market size is expected to grow from USD 2.9 billion in 2020 to USD 10.5 billion by 2026, at a Compound Annual Growth Rate (CAGR) of 23.5%. Chatbots have become prevalent in various sectors like e-commerce, healthcare, hospitality, tourism, banking, and customer service. Gartner forecasts that 70% of white-collar workers will interact with conversational platforms daily.

In the post-COVID era, chatbots have gained popularity, particularly in the restaurant industry. With the advancement and increased usage of technology, customers want to converse smoothly and have quick, personalized replies to their questions. Restaurant chatbots serve to meet this need by handling orders, responding to inquiries, providing recommendations, and offering immediate support to clients without having to wait in line to talk to a customer representative. Traditional chatbots are less adept at handling intricate consumer requests because of their limited functionality and pre-programmed responses, and they provide dull responses like 'I don't know', which results in the termination of the conversation. The motivation is to overcome the existing limitations and to provide more relevant and contextual responses to the users.

This research aims to improve the standard restaurant chatbot through the RAG approach, i.e., integrating an external source of knowledge (Neo4j knowledge graphs) to refine the fine-tuned LLM for contextual and smoother response generation, and adding AI testability for evaluating responses. Furthermore, it introduces a multimodal aspect, enabling audio input and output for user queries.

RAG is an artificial intelligence framework designed to enhance the accuracy of responses generated by an LLM. It achieves this by incorporating external sources of knowledge to complement the model's internal understanding. When implemented in a question-answering system based on a large language model, RAG offers the significant advantage of access to the most reliable information. The collected external information is added to the user's input and provided to the language model. The language model utilizes both the extended input and its internal knowledge to provide a personalized response for the user through the chatbot.

The AI testability features involve robust evaluation to ensure the chatbot's performance meets desired standards by verifying that the chatbot can handle various types of customer queries, understand natural language input, and provide relevant and accurate responses. Different AI testability chat metrics like word level, sentence level and information level metrics are
measured to assess and improve the chatbot's capabilities [3]. Unlike the other methods that use data in the form of PDF chunks or knowledge graphs limited to entities and intents, in this paper, we propose a novel approach to knowledge graph creation by establishing question-and-answer clusters, where the shared tokens can be used to aid a cohesive representation. It also introduces the RAG model, employing TF-IDF embeddings to identify the closest matching question and answer clusters and retrieve relevant answer tokens. Furthermore, the study fine-tunes the T5 base model using additional restaurant conversations, incorporating user questions, and passing tokens from the graph to T5 for response generation. Another novel approach is testability, providing a robust evaluation framework for chatbots that aids businesses in selecting an optimal solution from multiple alternatives with similar functionalities. This approach sets the paper apart, showing advancements in knowledge graph creation, retrieval techniques, model fine-tuning, and practical implications for chatbot evaluation in real-world scenarios.

The rest of the paper is structured as follows: Section II reviews the technology and literature in the field; Section III shows the data collection and transformation; Section IV describes the details of the architecture, design and development of the proposed models; Section V summarizes the chatbot development; Section VI talks about the AI testability; and Sections VII and VIII discuss the results and conclude the paper, respectively.

II. RELATED WORK

A. Literature Survey

An exhaustive exploration of recent research papers, journals, and articles has been conducted in the literature survey for the development of the restaurant chatbot. The study begins with a comprehensive exploration of generative AI, covering its concept and diverse application domains. It also delves into the technological components and frameworks relevant to Large Language Models (LLM) with a focus on RAG.

The authors of [5] focus on implementing generative AI services using an LLM application architecture. Their paper addresses the challenge of information scarcity in LLMs and proposes solutions such as fine-tuning techniques and direct document integration. The key contribution of this study is the development of a RAG model, which aims to enhance information storage and retrieval, thereby improving content generation. The RAG model mitigates the data insufficiency issue by enabling better information management, crucial for LLMs' functionality. The authors of [6] provide a way to answer multihop questions with good response time. This is tested on the MultiRC dataset. The process involves preprocessing of the dataset where questions are turned into key-value pairs. Subsequently, entity pairs are extracted using dependency parsing algorithms. These entity pairs are represented as subject, predicate, and object. Then the knowledge graph is constructed, which is used to answer the input questions from the user from the JSON file created where the data entities are loaded. The model is compared with the BERT model and is found to perform better than BERT in multihop question answering.

On the other hand, the paper [11] introduces the Retrieval-Augmented Neural KGC Model. The objective of their paper is to implement a retrieval-augmented response generation model that incorporates the global context of internet conversations and the local context by retrieving relevant documents. Here RAG-tokens are utilized as the backbone, enabling token generation based on multiple documents. This paper improves response accuracy and relevance by using a Groundness Estimation Model to enhance the response generation process through a model that estimates response groundness. The study in [9] introduces a unified architecture combining pre-trained language models with a learned retrieval module, showcasing versatility across tasks. The work aligns with advancements in learned retrieval methods, leveraging neural language models. Additionally, the use of an external memory index for retrieval distinguishes it from memory-based architectures. The approach bears similarities to retrieve-and-edit strategies yet emphasizes aggregating content from multiple sources for robust performance in diverse domains.

Table I shows the survey of different research papers, including the above, showcasing the objective, dataset, models, and evaluation used in each paper.

B. Technology Survey

To get a detailed overview of the current state of the art in chatbot advancements, and to understand what area the research can extend upon, a technology survey is conducted that traces the advancement of chatbots over time. There are differences in the specific technology and methodologies used, but most chatbot research publications until 2022 strive to enhance chatbot performance utilizing deep learning and reinforcement learning techniques. Table II shows the technology survey of various studies conducted on chatbot applications. It shows what application the chatbot was used for, the technology behind it, the input data used to build the chatbot and the metrics that were used for evaluation.

With the AI revolution created by the launch of ChatGPT, the latest research focuses on utilizing Retrieval Augmented Generation (RAG) and Large Language Models (LLMs) for building different kinds of chatbots. Therefore, in this paper, we propose a multi-modal integration of the RAG model and LLM optimization with a Neo4j Knowledge Graph to pull the right answer tokens. Furthermore, AI testability is provided to evaluate the robustness of the chatbot. The AI testability evaluation shows that our model can achieve an average information score of 0.6-0.8.

III. DATA PREPARATION

A. Data Collection

Data collection is an important step in the chatbot-building strategy as it determines the quality and intelligence of the chatbot. The dataset used for this project focuses on restaurant data such as conversation data related to the restaurants, and was assembled by entering data by hand, gathering information from restaurants, collecting questions from user surveys, and using records of real conversations. The collected restaurant-specific data consists of features such as Restaurants, Address, Item categories, Items, Item Description, and Item Price.

The greetings dataset consists of question-answer pairs related to greetings like “Hey there”. The order dataset consists of question-answer pairs related to the order. The item description set contains details about the order. The item categories set contains details about various categories available and dishes available under each category. The item price set contains details about the price of each item. The parameters for the dataset include the multiple cuisine items where the chatbot should be able to recognize the items in
TABLE I
COMPARISON SURVEY OF EXISTING RESEARCH PAPERS

| Ref | Objective | Region | Dataset | Multi-Modal (Y/N) | Knowledge Graph (Y/N) | Testability (Y/N) | Models | Evaluation Metrics |
|---|---|---|---|---|---|---|---|---|
| [4] | Address the complexity and high resource demands of existing retrieval-augmented models in handling long inputs. | Taiwan | HotpotQA, TriviaQA, NaturalQuestion (NQ), T-Rex, FEVER, Zero Shot RE (zsRE), Wizard of Wiki (WoW) | Y | N | N | Fusion-In-Decoder (dense retriever model, T5 encoder, T5 Decoder) | R-Precision, Exact Match (EM), Accuracy, F1 |
| [5] | Alleviate data insufficiency using fine-tuning techniques and direct document integration using RAG. | Korea | NA | N | N | N | RAG | NA |
| [6] | Build question-answering system that could efficiently respond to multiple-hop queries efficiently using knowledge graphs. | India | Multi RC | N | Y | N | Knowledge Graph Question Answering (KGQA) | Accuracy |
| [7] | Develop long-form generation with retrieval augmentation | United States | 2WikiMultihopQA, StrategyQA, ASQA, WikiAsp | N | N | N | Forward Looking Active Retrieval Augmented Generation | Exact Match (EM), F1, Precision, Recall |
| [8] | Fine-tuning RAG models with pre-trained parametric and non-parametric memory for NLP tasks. | Canada | HotpotQA, Natural Questions Open, TriviaQA | N | N | N | RAG-Token, RAG-Sequence, DPR (DenSPI), BART, Seq2seq | Q-BLEU, Factuality, Specificity |
| [9] | Address the problem of hallucination and factual incorrectness in state-of-the-art chatbots using retrieval augmentation technique. | Dominican Republic | Wizard of Wikipedia, CMU Document Grounded Conversations (CMU_DoG) | N | N | N | RAG-Token DPR model with BART-Large, RAG-Sequence, FiD-RAG, BREAD (BART-Retriever-Encoder-And-Decoder), TREAD | Perplexity (PPL), Unigram Overlap (F1), BLEU-4 (B4), ROUGE-L (RL), Knowledge F1 (KF1), Rare F1 (RF1) |
| [10] | Develop, evaluate, and assess Generation-Augmented Retrieval (GAR) for open-domain question answering by augmenting queries with heuristically generated relevant contexts. | United States | Natural Questions (NQ), TriviaQA (Trivia) | N | N | N | Generation-augmented retrieval (GAR) with Sparse Representations (BM25) - Lexical classification | Accuracy, Exact Match (EM) |
| [11] | Implement RAG model that incorporates the global context of internet conversations and the local context by retrieving relevant documents. | South Korea | Reddit KGC | N | Y | N | RAG, Knowledge-Grounded Conversation (KGC) - Local context classification | Top-k retrieval Accuracy, Exact Match (EM) |

multiple cuisines, their ingredients, and their prices. We utilize a custom benchmark dataset in the form of question-answer pairs, stored in CSV format. Table III shows a sample of the restaurant Q&A dataset.

B. Dataset Preprocessing

Data preprocessing is a significant step in Natural Language Processing which makes the data ready for further analysis and modelling. The various steps involved in data preprocessing are:

a) Conversion of data to lowercase: The text is converted to lowercase which helps in standardizing the dataset.

b) Removing punctuation marks: All the punctuation marks in the text are removed as they are usually meaningless and can sometimes introduce noise.

c) Retaining important words: Retaining important words, like restaurant names and prices, is crucial for preserving contextual information and enhancing user query understanding in a restaurant chatbot. This step contributes to improved contextual awareness, facilitating a more effective and personalized interaction between the user and the chatbot.

d) Stop words removal: The most commonly occurring words in a sentence that don't contribute to the meaning of the sentence, such as “a,” “an,” “the,” and “in”, are removed. It
TABLE II
TECHNOLOGY SURVEY OF EXISTING RESEARCH PAPERS

| Ref | Application | Technology | Input Data | Evaluation metrics |
|---|---|---|---|---|
| [12] | FAQ Type Question-Answer bot | Seq-2-Seq model (Two Recurrent Neural Network (RNN) – an encoder and a decoder), Reinforcement Learning | Text | Score model - utterance-response tuples from chatbots are scored based on user comments. |
| [13] | Natural Question Generation - Generate questions for context-answer pair | Generator Evaluator framework in a neural network architecture. The generator is a transformer and the evaluator used Reinforcement Learning (RL) to check the correctness of the question generated. | SQuAD | Bilingual evaluation understudy (BLEU) |
| [14] | Reinforcement learning using long context for pretraining chatbots | SEQ2SEQ and Deep RL Model | Daily Dialog Dataset | BLEU, Coherence Score |
| [15] | Task-oriented chatbot to perform specific tasks such as answering questions or completing a transaction. | Combination of LSTM and RL with attention-based hierarchical LSTM network and Generative Adversarial Networks (GANs) | Emotibot dataset | BLEU, Google-BLEU (GLEU), METEOR |
| [16] | Combining task-oriented dialogue with open-ended dialogue, strategic dialogue, spatially aware dialogue | Gated Recurrent Unit (GRU) | Persona Chat dataset | Dialogue Reward, F1 Score, Recall |
| [17] | Survey on various research papers on AI-based, Rule-based and Hybrid chatbots and the evaluation metrics used | Rule-based chatbots: Expert systems, Decision trees, Inference engines. AI-based chatbots: machine learning algorithms like NLP, NLU, and LSTM models. | Text dataset | Precision, Recall, Mean Average, Mean Reciprocal Rank, BLEU, ROUGE, Average dialogue length, Average user utterance, Number of wins, Perplexity |
| [18] | Task-Oriented chatbot | Sequence-To-Sequence learning, NLP, CNN | Twitter dataset | BLEU, Cosine Similarity |
| [19] | Open Domain | Natural Language Understanding (NLU), Natural Language Generation (NLG), Bidirectional Recurrent neural network (BRNN), RL (Q Learning, Deep Q Neural Network), RL with Quantile Regression | Text | BLEU, Cumulative Reward Plot, Cumulative Regret Plot |

TABLE III
SAMPLE RESTAURANT Q&A DATASET

| Intent | User Questions | Bot Responses |
|---|---|---|
| Find Restaurants | What restaurants are available? | The restaurants available are dish N dash, CHEF CHU'S, Olive Garden, Rock N Grill, Denny's. |
| Find Location | Where is the restaurant “Olive Garden” located? | The address of Olive Garden is 1350 Great Mall Dr, Milpitas, CA 95035. |
| Find Contact | How can I contact Olive Garden restaurant? | You can contact Olive Garden restaurant at +1 (408) 935-8176. |
| Find Category | | |
| Inquire Timings | What time does the restaurant CHEF CHU'S open? | The restaurant CHEF CHU'S opens at 11:30 AM. |
| Find Cuisine | What cuisine is available at Rock N Grill? | Indian cuisine is available at this Rock N Grill restaurant. |
| Find Dish | What are the different dishes available under Naan and Roti in Indian cuisine at Rock N Grill restaurant? | The different dishes under Naan and Roti in Indian cuisine at Rock N Grill restaurant available are Paneer Kulcha, Garlic Naan, Cheese Naan, Butter Roti, Bullet Naan, Aloo Paratha. |
| Check Price | What is the price of the Falafel under Appetizers (Hot) in Mediterranean cuisine at dish N dash restaurant? | The price of the Falafel under Appetizers (Hot) in Mediterranean cuisine at dish N dash restaurant is $8.0. |
| Get Ingredients | What is the Veg Ball Manchurian (Sauce) under Veg Appetizers in Indian cuisine at Rock N Grill made of? | The Veg Ball Manchurian (Sauce) under Veg Appetizers in Indian cuisine at Rock N Grill is made of Mixed Vegetable Blended and Thickened with Potato, Deep Fried Ball Tossed into Manchurian Sauce. |

makes the sentence more manageable and helps in better analysis and better machine language models.

e) Tokenization: The text is converted into individual units called tokens which are analyzed / processed by algorithms. It provides structured representation of textual data for computational processing.
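The preprocessing steps a) to e) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the stop-word list, the `PROTECTED_PHRASES` list of restaurant names, and the function name `preprocess` are all hypothetical, and the way prices and names are retained (keeping `$` and decimal points, joining multi-word names with underscores) is one possible reading of step c).

```python
import re

# Illustrative stop-word list (assumption); the paper names "a", "an", "the", "in".
STOP_WORDS = {"a", "an", "the", "in", "is", "of", "at"}

# Hypothetical list of multi-word names that should survive as single tokens.
PROTECTED_PHRASES = ["olive garden", "rock n grill", "dish n dash"]


def preprocess(text: str) -> list[str]:
    """Apply steps a) to e) to one user question or bot response."""
    text = text.lower()                                  # a) lowercase
    # c) join protected names with underscores so tokenization keeps them whole
    for phrase in PROTECTED_PHRASES:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    # b) remove punctuation, keeping "$" and "." so that prices are retained
    text = re.sub(r"[^\w\s$.]", " ", text)
    text = re.sub(r"\.(?!\d)", " ", text)                # drop periods that are not decimal points
    tokens = text.split()                                # e) tokenization on whitespace
    return [t for t in tokens if t not in STOP_WORDS]    # d) stop-word removal
```

For example, `preprocess("The price is $8.0.")` keeps the price as a single token while dropping the stop words and the sentence-final period.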
IV. MACHINE LEARNING MODELS

Various cutting-edge technologies are used to enable the restaurant chatbot to provide enhanced user experiences, including speech recognition, text recognition, Natural Language Understanding (NLU), Natural Language Generation (NLG), as well as RAG support. Using RAG, additional context can be provided to the chatbot in response to the query it is receiving. Additional context is retrieved from a variety of sources, such as knowledge graphs. Whenever a question is asked, relevant data regarding the question is retrieved from the knowledge base. Next, the question and relevant documents are passed to the NLG, which generates an answer. The chatbot will be able to answer the question correctly if relevant information is provided along with the question. As illustrated in Figure 1, various models were employed to enable the restaurant chatbot’s functionality at each stage. First, a user can either speak to the system, which uses an Audio to Text converter to convert the speech to text, or type text directly. The text data is then processed in the data preprocessing step. Next, the processed text data goes to the NLP tools such as NLU-BERT to understand and interpret the user’s purpose from the text. The Database and Knowledge Graph store information about restaurants. The NLG-T5 generates text responses based on the user’s purpose and the output of the Dialog Manager. These text responses are either sent directly to the user or converted into speech by the Text to Audio converter. Figure 2 and Figure 4 show the architectural flow and pipeline of the system.

Fig. 1. RestoBot Architecture

A. Audio-to-text (ASR) system & Text-to-Audio

The restaurant chatbot accepts input in either text or audio format. When a user chooses to communicate with the chatbot using audio, it is necessary to transform the audio input into text to facilitate subsequent processing by text-based models. OpenAI's Whisper, an Automatic Speech Recognition (ASR) model, is used to perform the conversion from audio to text. The Whisper architecture makes use of an encoder-decoder transformer technology [20]. The encoder uses a tiny stem that consists of two convolution layers with a filter size of 3 and an activation function called GELU to initially handle the input. A decoder is then trained to anticipate the associated text caption using specific tokens that instruct the single model to carry out tasks including language recognition, phrase-level timestamping, multilingual audio transcription, and to-English speech translation.

The chatbot system supports audio output with the help of Google Text-to-Speech (gTTS). gTTS translates text into high-quality audio by utilizing a complex architecture that combines cutting-edge speech synthesis technology and natural language processing (NLP). Fundamentally, gTTS makes use of deep learning models, including transformer models or recurrent neural networks (RNNs), to comprehend and analyze the incoming text while taking context and linguistic subtleties into account [21]. Because of its architecture, which guarantees precise pronunciation, intonation, and emotional expression in the generated audio, gTTS provides a realistic and captivating auditory experience for a variety of applications.

B. Natural Language Understanding (NLU)

After the user input passes through the pre-processing stage, the natural language understanding (NLU) unit, using the BERT model, interprets the conversation by identifying intents and entities. BERT builds on the Transformer, a neural network architecture used in particular for sequence-to-sequence tasks like machine translation and language modelling [22].

For the restaurant chatbot, BERT uses a two-step procedure to identify the intent (the reason for the query) and the entities (certain bits of information) within the text [22]. It begins by processing the input sentence and encoding the context and semantic meaning of the written text. It then uses this contextual knowledge to make predictions. For intent recognition, BERT classifies the input into predetermined intent categories based on the encoded context. For entity recognition, by utilizing its knowledge of the sentence's structure and semantics, it recognizes and labels particular tokens in the input text that relate to pertinent elements, such as dates, names, or numbers.

C. Neo4j Knowledge Repository

The question-and-answer knowledge graph is an advanced knowledge representation system that clusters or groups the information according to the kinds of queries it can answer. The knowledge graph for the restaurant chatbot is implemented in Neo4j AuraDB as a question-and-answer cluster knowledge graph. In the graph, each question cluster is connected to the answer cluster using the edge ANSWER. After preprocessing, the question tokens and answer tokens are connected to the specific question and answer clusters using the edge HAS_TOKEN. Figure 3 shows the sample question and answer cluster. The red nodes represent the questions, and the green nodes depict the answer. The blue nodes represent the question tokens and answer tokens. Figure 3 shows the ANSWER edge connected between question and answer as well as the HAS_TOKEN edge connected between the multiple tokens of each question and answer. Once the user asks the question, the retrieval of answer tokens for that specific question from the knowledge graph takes place through the following steps.

Fig. 2. Architecture Flow

Fig. 3. Sample Question-Answer Cluster
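The cluster structure of Figure 3 and the token-based retrieval over it can be sketched in plain Python. This is a simplified stand-in under stated assumptions, not the Neo4j implementation: the `GRAPH` list replaces the AuraDB graph (each entry holds a question cluster's HAS_TOKEN tokens and the ANSWER-linked answer clusters), the sample tokens are illustrative, candidate clusters are selected by shared tokens before TF-IDF cosine ranking, and the answer "relevancy score" is assumed here to be the number of tokens shared with the query, since the exact score is not specified.

```python
import math
from collections import Counter

# Toy stand-in for the Neo4j graph (hypothetical data): question tokens model
# HAS_TOKEN edges, and each "answers" entry models one ANSWER-linked cluster.
GRAPH = [
    {"question_tokens": ["restaurant", "olive_garden", "located"],
     "answers": [["address", "olive_garden", "1350", "great", "mall", "dr", "milpitas"]]},
    {"question_tokens": ["cuisine", "available", "rock_n_grill"],
     "answers": [["indian", "cuisine", "rock_n_grill"]]},
    {"question_tokens": ["contact", "olive_garden", "restaurant"],
     "answers": [["contact", "olive_garden", "408", "935", "8176"]]},
]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per token list, plus the IDF table."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in df}  # smoothed IDF
    vectors = [{t: c / len(doc) * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def retrieve_answer_tokens(query_tokens, graph=GRAPH):
    """Return the answer tokens of the best-matching cluster, or [] if none match."""
    vectors, idf = tfidf_vectors([c["question_tokens"] for c in graph])
    query = set(query_tokens)
    # Step 1: keep question clusters that share at least one token with the query.
    candidates = [(c, vec) for c, vec in zip(graph, vectors) if query & set(c["question_tokens"])]
    if not candidates:
        return []
    # Step 2: rank the candidates by cosine similarity of TF-IDF vectors.
    counts = Counter(t for t in query_tokens if t in idf)
    q_vec = {t: k / len(query_tokens) * idf[t] for t, k in counts.items()}
    best, _ = max(candidates, key=lambda cv: cosine(q_vec, cv[1]))
    # Step 3: choose the connected answer cluster with the highest (assumed) relevancy.
    return max(best["answers"], key=lambda ans: len(query & set(ans)))
```

With the toy data, a query such as `["where", "olive_garden", "located"]` matches two Olive Garden clusters, and the cosine ranking prefers the location question because it also shares the rarer token `located`.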

As illustrated in Figure 4, the proposed system processes user inputs through several models. First, pre-processing is done on the user question, which gives the question tokens. Next, the question tokens are used to find the matching question clusters that share similar tokens using TF-IDF embeddings. If there is no matching cluster, an empty list is returned. Then cosine similarity is used to find the most similar question from the matching question clusters, and the question cluster with the highest similarity score is retrieved. Subsequently, the answer clusters connected to the most similar question are retrieved. Then, by calculating the relevancy score, the most relevant answer cluster is retrieved. Finally, the answer tokens connected to the most relevant answer cluster are retrieved as the response from the knowledge graph.

Fig. 4. Pipeline of the System

D. Natural Language Generation (NLG)

The Generation stage of a chatbot generates the relevant responses for users based on user input and the system's knowledge base. This necessitates the use of natural language generation (NLG) techniques, attention to user input, and familiarity with the system's pre-processing and processing phases. The goal of NLG systems is to provide human-understandable, contextually relevant, grammatically sound natural language output. Because of its ability to produce well-reasoned and contextually relevant answers in response to input prompts, T5 is an effective model for NLG tasks. Training is done using masked language modelling. By transforming NLU and NLG jobs into sequence-to-sequence tasks in the encoder-decoder variation, the T5 model unifies both types of tasks. This means that in a text classification problem, the text is used as the encoder input, and the decoder must generate the label as regular text rather than as a class [23].

For restaurants, the T5 model is a great option for powering chatbots that generate natural language. With its text-to-text framework, T5 can understand and generate human-like text responses, making it well-suited for various aspects of restaurant-related conversations. Because it can be tailored to specific restaurant chatbot applications, it may be made to offer extremely precise and tailored responses, improving customer satisfaction, and expediting interactions in the restaurant business.

V. SYSTEM ANALYSIS AND DESIGN

The architecture of the restaurant chatbot system is designed to provide an enhanced and seamless user experience as depicted in Figure 5. Users interact with the system through a front-end application built using Flask and HTML. The application includes a chat container supporting both text and audio input. The user’s textual query is pre-processed and serves as an input for knowledge retrieval from a Neo4j knowledge graph. This graph, acting as ground truth, is pre-built to store relevant answer tokens based on user queries. The dialogue manager utilizes this to grasp the context of user inputs, ensuring appropriate responses. To enhance the system's understanding of user queries, a fine-tuned BERT model is integrated into the Natural Language Understanding (NLU) unit which extracts intent and entities from user questions. The final response generated by the
system is passed through the Natural Language Generation d) Debugging and Maintenance: A detailed record of
past interactions simplifies issue identification and
(NLG) unit to produce coherent and contextually relevant text
resolution, streamlining system debugging and maintenance
responses. Users receive these responses through the front-
processes.
end application, available in both text and audio formats via
text-to-audio conversion. The system leverages a combination of technologies to
achieve its objectives. Flask and HTML are employed for the
front-end application, Google Collab facilitates data
processing, and Neo4j serves as the graph database for
effective knowledge storage and retrieval.
The design and integration of various components in the
system architecture contribute to the chatbot's ability to
comprehend user queries, retrieve relevant information from
the knowledge graph, and generate coherent responses. The
evaluation metrics ensure a thorough assessment of the
system's performance, validating its efficacy in providing
enriched user experience.
Fig. 5. System Design
VI. AI TESTABILITY

The performance of the chatbot is evaluated using a set of The restaurant chatbot architecture as shown in Figure 6
metrics at various levels. At the word level, metrics such as utilizes a conversation dataset for training and testing. User
BLEU, ROUGE, METEOR, and F1 SCORE are employed to queries trigger responses generated by the Natural Language
assess the accuracy of individual responses. Sentence-level Generation (NLG) module, which leverages data
evaluation includes language-based similarity evaluation and augmentation techniques for enhanced robustness.
keyword-based weighted text similarity evaluation.
The quality of these generated responses is evaluated by a
Additionally, domain-level metrics are applied to assess the
Test Script at three levels: word-level accuracy, sentence-
overall effectiveness of the chatbot.
In a proactive measure, the system systematically logs level coherence, and information completeness. These are
user questions and corresponding chatbot responses in a used to conduct both semantic as well as syntactic evaluation.
dedicated database. This integration offers several Methods for evaluating models are important for chatbots
advantages: because they provide a methodical manner to evaluate and
a) Historical Analysis: A comprehensive history of user improve the chatbot's functionality, accuracy, and capacity to
interactions enables insightful trend analysis and facilitates meet user needs [17]. By assessing a chatbot's performance
continual system improvements over time. using multiple metrics, its ability to understand user intent,
and generate responses accurately can be improved. Model
b) Performance Metrics: Regularly evaluating the
assessment techniques can be used to evaluate the
chatbot’s responses aids in identifying strengths and areas for
performance of the chatbot and optimize it so that it meets the
enhancement, contributing to an evolving and adaptable
system. needs of the intended use case and target audience. Table IV
shows the different evaluation measures that were employed
c) User Personalization: The database supports tailoring in this project at each step.
responses based on historical user interactions, fostering a
more personalized and engaging user experience.
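The logging of user questions and chatbot responses described in this paper can be illustrated with a minimal sketch. The paper states only that a dedicated database is used, so the choice of SQLite, the table name `interactions`, its columns, and the helper names `open_log`, `log_interaction`, and `session_history` are all assumptions made for illustration, not the authors' implementation.

```python
import sqlite3
from datetime import datetime, timezone


def open_log(path=":memory:"):
    """Open (or create) a hypothetical interaction log; schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS interactions (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               session_id TEXT NOT NULL,
               asked_at TEXT NOT NULL,
               question TEXT NOT NULL,
               response TEXT NOT NULL
           )"""
    )
    return conn


def log_interaction(conn, session_id, question, response):
    """Store one user question together with the chatbot's response."""
    conn.execute(
        "INSERT INTO interactions (session_id, asked_at, question, response) "
        "VALUES (?, ?, ?, ?)",
        (session_id, datetime.now(timezone.utc).isoformat(), question, response),
    )
    conn.commit()


def session_history(conn, session_id):
    """Return (question, response) pairs for a session, oldest first."""
    rows = conn.execute(
        "SELECT question, response FROM interactions "
        "WHERE session_id = ? ORDER BY id",
        (session_id,),
    )
    return rows.fetchall()
```

A stored history of this kind is what would allow a test script to re-score a whole session after the fact, and it is also the raw material for the trend analysis, personalization, and debugging uses listed above.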

Fig. 6. AI Testability Architecture

TABLE IV
EVALUATION METRICS

| Metric | Description | Formula | Advantages | Disadvantages | Weight |
|---|---|---|---|---|---|
| BLEU | Geometric mean of all four n-gram precisions (unigram, bigram, trigram, and 4-gram precision scores) | $\mathrm{BLEU} = (p_1 \cdot p_2 \cdot p_3 \cdot p_4)^{1/4}$ | Fast computation; easy to calculate | Doesn't incorporate semantics; doesn't incorporate sentence structure | 10% |
| ROUGE | Compares n-grams of the generation with n-grams of the references | $\text{ROUGE-1 recall} = \frac{\text{count of word matches}}{\text{count of words in reference}}$; $\text{ROUGE-1 precision} = \frac{\text{count of word matches}}{\text{count of words in summary}}$; $\text{ROUGE-1 F1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$ | Ability to capture and identify all the relevant instances | | 40% |
| Language-Based Similarity | Measures how closely content in question corresponds | $\mathrm{similarity}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$ | Addresses meaning, context, and structure | Higher time to compute | 25% |
| Keyword-Based Weighted Text Similarity | Similarity based on the specific keywords and assigning higher weight | $\mathrm{weighted\ similarity}(A, B) = \frac{wA \cdot wB}{\lVert wA \rVert \, \lVert wB \rVert}$ | Ability to retain important keywords | Define set of entities to assign weights; higher time to compute | 25% |
| Information Score | Computed overall similarity score | $\mathrm{Output} = 10\% \cdot sim_1 + 40\% \cdot sim_2 + 25\% \cdot sim_3 + 25\% \cdot sim_4$ | | | 100% |
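The formulas in Table IV can be illustrated with a short sketch. It computes ROUGE-1 precision, recall, and F1 from word overlap, a cosine similarity between two texts, and the weighted Information Score using the 10% / 40% / 25% / 25% weights from the table. The function names, whitespace tokenization, and the bag-of-words vectors used for the cosine similarity are assumptions for illustration; the paper does not specify how its text vectors are built.

```python
import math
from collections import Counter

# Weights taken from Table IV (BLEU, ROUGE, similarity, weighted similarity).
WEIGHTS = {
    "bleu": 0.10,
    "rouge": 0.40,
    "similarity": 0.25,
    "weighted_similarity": 0.25,
}


def rouge1(candidate, reference):
    """ROUGE-1 precision, recall and F1 from clipped unigram overlap."""
    cand = candidate.lower().split()   # assumption: whitespace tokenization
    ref = reference.lower().split()
    matches = sum((Counter(cand) & Counter(ref)).values())
    precision = matches / len(cand) if cand else 0.0
    recall = matches / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def cosine_similarity(text_a, text_b):
    """Cosine similarity over bag-of-words count vectors (an assumption)."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def information_score(scores):
    """Weighted sum of the four component scores, as in Table IV."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

As a check against the reported results, the first row of Table VI (BLEU 0.22, ROUGE-1 0.6, similarity 0.52, weighted similarity 0.34) gives 0.022 + 0.24 + 0.13 + 0.085 = 0.477, which matches the reported information score of 0.48 after rounding.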

A. Word Level Metrics

Word-level metrics assess the quality of the chatbot's responses at the word level, providing a syntactic evaluation. They gauge the correctness, completeness, and relevance of the chatbot's replies to user input. Word-level metrics used for evaluating chatbot performance include BLEU and ROUGE (ROUGE-1, ROUGE-2, ROUGE-L), among others.

B. Sentence Level Metrics

Sentence-level metrics gauge sentence quality, thereby providing a semantic evaluation. Sentence-level testing is measured using language-based similarity evaluation and keyword-based weighted text similarity evaluation.

C. Information Level Metrics

Information-level metrics gauge the quantity and quality of information a chatbot offers a user. Dialogue length and a confusion indicator are the two measures used at the information level. The length of a conversation can be used to gauge how well a chatbot maintains engagement and provides informative responses during the exchange.

VII. RESULT AND DISCUSSION

To ensure that the model is trained on a wide variety of instances and generalizes well to new data, data preparation is crucial. Using a train-test split, the data is divided into a training and validation set (80%) and a testing set (20%).

a) Intent Recognition: The Intent Recognition model, powered by the BERT architecture, was trained and validated over three epochs. The progressive increase in accuracy, from 85.32% to 92.55%, indicates the model's capacity to understand user intents, as seen in Figure 7.a. The decreasing trend in loss, from 0.412 to 0.2565, suggests that the model effectively learned to discriminate between various user intents. This performance underscores the significance of BERT in capturing nuanced intents and enhancing user-query understanding.

b) Entity Extraction: The Entity Extraction model, also employing the BERT architecture, exhibited noteworthy improvement in token-level accuracy, progressing from 76.32% to 84.55%, as seen in Figure 7.b. The reduction in loss, from 0.512 to 0.2965, highlights the model's effectiveness in precisely identifying entities within user input. These outcomes showcase the model's ability to enhance the chatbot's contextual understanding by accurately extracting relevant information from user queries.

c) Natural Language Generation: The NLG unit, leveraging the T5 model, exhibited remarkable proficiency in generating coherent and informative sentences. The increasing trend in both training and validation accuracy, reaching 94.98%, reflects the model's capability to dynamically generate content based on structured information. The comparable training and validation losses indicate that the model generalizes well without overfitting. The BLEU score of ~0.60 further validates the high quality and alignment of the generated output with reference texts. The iterative improvement in model accuracy across epochs demonstrates the effectiveness of the chosen architectures and training methodologies.

TABLE V
EVALUATION METRICS

| Evaluation Metrics for | Epoch | Accuracy | Loss |
|---|---|---|---|
| Intent Recognition | 1 | 0.8532 | 0.412 |
| Intent Recognition | 2 | 0.9043 | 0.356 |
| Intent Recognition | 3 | 0.9255 | 0.2565 |
| Entity Extraction | 1 | 0.7632 | 0.512 |
| Entity Extraction | 2 | 0.8143 | 0.386 |
| Entity Extraction | 3 | 0.8455 | 0.2965 |
| NLG model | 1 | 0.95 | 0.0033 |
| NLG model | 2 | 0.98 | 0.0011 |
| NLG model | 3 | 0.98 | 0.007 |

Fig. 7. Model Training Results
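The 80/20 division of the data described at the start of this section can be sketched as follows. This is a minimal illustration only: the function name `train_test_split`, the fixed random seed, and the use of a simple shuffled split are assumptions, since the paper does not describe how the split was drawn.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=42):
    """Shuffle the examples and hold out a fraction of them for testing.

    Returns (train_and_validation, test). The seed is an assumption added
    so that the split is reproducible across runs.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Holding the test portion out before any training means that the accuracy and loss reported per epoch reflect only the training and validation portion, while the held-out 20% remains available for a final check on unseen conversations.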

The implications of these results extend to the enhanced user experience of the restaurant chatbot. The robustness observed in the training and validation metrics suggests that the models generalize well and are poised for effective deployment in real-world scenarios.

d) AI Testability: The RestoBot is tested on various scenarios, namely General Testing, Domain Testing, and Limit Testing, to evaluate the functionality and performance of the chatbot. It was observed that the chatbot could hold general conversations, such as exchanging greetings with the customer. In General Testing the RestoBot is tested on general facts, while in Domain Testing it is tested on restaurant-based questions. The testing was performed for various intents such as find restaurant, find dish, find location, find ingredients, and others. Finally, Limit Testing shows how well the RestoBot replies to extreme or boundary conditions. In this case, if there is a non-contextual conversation, the conversation is routed to the customer call center service to speak with a human agent. The performance of the chatbot is measured at the word level, sentence level, and information level, as discussed in Section VI. Table VI provides the backend UI of the chat conversations and the evaluation results for the conversation. It was observed that the average information score was in the range of 0.6 to 0.8 for the session conversations.

TABLE VI
AI TESTABILITY RESULTS OF EVALUATION AFTER SESSION (SESSION CURRENT TIME: 12/11/2023, 4:21:26 PM)

| Restaurant | User Question | RestoBot Response | Reference Response | BLEU Score | ROUGE1 | Similarity | Weighted Similarity Score | Average Information Score |
|---|---|---|---|---|---|---|---|---|
| Olive Garden | What time does the restaurant Olive Garden open? | The restaurant Olive Garden is open from 11:00 AM to 10:00 PM. | The restaurant Olive Garden opens at 11:00 AM | 0.22 | 0.6 | 0.52 | 0.34 | 0.48 |
| Olive Garden | What is the opening time of Olive Garden? | Olive Garden is open from 11:00 AM to 10:00 PM | The opening time of Olive Garden is from 11:00 AM | 0.11 | 0.53 | 0.43 | 0.27 | 0.4 |
| Olive Garden | What time does the restaurant Olive Garden open and close? | Olive Garden is open from 11:00 AM to 10:00 PM | The restaurant Olive Garden opens at 11:00 AM and closes at 10:00 PM. | 0.06 | 0.55 | 0.45 | 0.67 | 0.5 |
| Olive Garden | What is the closing time of Olive Garden? | Olive Garden is closing from 11:00 AM to 10:00 PM | The closing time of Olive Garden is 10:00 PM | 0.14 | 0.63 | 0.43 | 0.76 | 0.59 |

VIII. CONCLUSION

In conclusion, the main goal of developing the restaurant chatbot, to improve the dining industry, has been effectively attained. By utilizing a knowledge graph of questions and answers and applying AI testability criteria at several levels, the chatbot ensures a high standard of performance and accuracy. The use of a Neo4j Knowledge Graph as an external source of knowledge has proven instrumental in enhancing information retrieval capabilities. By traversing the graph and utilizing TF-IDF embeddings, the chatbot efficiently matches user questions with relevant answer tokens, ensuring tailored responses. The implementation of robust AI testability ensures that the chatbot's performance meets desired standards. It also facilitates cross-chatbot comparisons within a domain to find the most effective one. Furthermore, the strategic integration of Retrieval-Augmented Generation (RAG) with the Large Language Model (LLM) has significantly elevated the chatbot's conversational precision, enabling contextually relevant responses to user queries in the restaurant setting. This all-encompassing strategy represents a major advancement in restaurant chatbot capabilities for enhanced customer experiences and increased operational effectiveness.

ACKNOWLEDGMENT

We would like to express sincere thanks to our research advisor Dr. Jerry Gao and supervisor Dr. Lee C. Chang from the Department of Applied Data Science, San Jose State University for giving us a wonderful opportunity to work on this project. Their unwavering support and mentorship have been invaluable throughout our research journey, providing us with a remarkable opportunity to contribute to this project. Their guidance has been instrumental in the successful completion of this paper, and we are truly grateful for the enriching experience they have facilitated. Dr. Gao deserves special acknowledgement for generously sharing his wealth of expertise, dedicating time for insightful discussions, and guiding us in the right direction.

We would also like to express our profound gratitude to our committed team members, whose joint efforts were essential to this project's success in addition to our academic advisors. Their dedication, diligence, and creative ideas have greatly enhanced our research and added to the range and depth of our conclusions. Our team's cohesion, which was cultivated via open communication and common objectives, was essential to overcoming obstacles and reaching important milestones.

REFERENCES

[1] Chatbot market size, share, Trends & Growth Report, 2030. Chatbot Market Size, Share, Trends & Growth Report, 2030. (n.d.). Retrieved March 23, 2023, from https://www.grandviewresearch.com/industry-analysis/chatbot-market
[2] “Marketresearch.com.” Market Research, MarketsandMarkets, 15 Nov. 2019, https://www.marketresearch.com/MarketsandMarkets-v3719/Chatbot-Component-Solutions-Services-Usage-12771978/
[3] Hsueh, Yu-Ling, and Tai-Liang Chou. “A Task-Oriented Chatbot Based on LSTM and Reinforcement Learning.” ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 22, no. 1, 2022, pp. 1–27, https://doi.org/10.1145/3529649
[4] Hofstätter, Sebastian, et al. “FID-light: Efficient and effective retrieval-augmented text generation.” Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2023, https://doi.org/10.1145/3539618.3591687.
[5] Jeong, Cheonsu. A Study on the Implementation of Generative AI Services Using an Enterprise Data-Based LLM Application Architecture, Sept. 2023, https://doi.org/10.48550/arXiv.2309.01105.
[6] Skandan, Spurthy, et al. “Question answering system using knowledge graphs.” 2023 International Conference on Inventive Computation Technologies (ICICT), 2023, https://doi.org/10.1109/icict57646.2023.10134047.
[7] Jiang, Zhengbao, et al. “Active Retrieval Augmented Generation.” arXiv, 22 Oct. 2023, https://arxiv.org/abs/2305.06983.
[8] P. Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP tasks,” arXiv (Cornell University), May 2020, Available: https://arxiv.org/pdf/2005.11401
[9] K. Shuster, S. Poff, M. Chen, D. Kiela, and J. Weston, “Retrieval augmentation reduces hallucination in conversation,” Empirical Methods in Natural Language Processing, pp. 3784–3803, Apr. 2021, Available: https://aclanthology.org/2021.findings-emnlp.320/
[10] Y. Mao et al., “Generation-Augmented Retrieval for Open-domain Question Answering,” Sep. 2020, doi: https://doi.org/10.48550/arxiv.2009.08553.
[11] Y. Ahn, S.-G. Lee, J. Shim, and J. Park, “Retrieval-Augmented Response Generation for Knowledge-Grounded Conversation in the Wild,” IEEE Access, vol. 10, pp. 131374–131385, Jan. 2022, doi: https://doi.org/10.1109/access.2022.3228964.
[12] Lone, M. B., Nazir, N., Kaur, N., Pradeep, D., Ashraf, A. U., Asrar Ul Haq, P., Dar, N. B., Sarwar, A., Rakhra, M., & Dahiya, O. (2022). Self-learning chatbots using reinforcement learning. 2022 3rd International Conference on Intelligent Engineering and Management (ICIEM). https://doi.org/10.1109/iciem54221.2022.9853156
[13] Biswas, D., Nadipalli, S., Sneha, B., Gupta, D., & J, A. (2022). Natural question generation using transformers and reinforcement learning. 2022 OITS International Conference on Information Technology (OCIT). https://doi.org/10.1109/ocit56763.2022.00061
[14] Tran, Q.-D. L., & Le, A.-C. (2021). A deep reinforcement learning model using long contexts for Chatbots. 2021 International Conference on System Science and Engineering (ICSSE). https://doi.org/10.1109/icsse52999.2021.9538427
[15] Hsueh, Yu-Ling, and Tai-Liang Chou. “A Task-Oriented Chatbot Based on LSTM and Reinforcement Learning.” ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 22, no. 1, 2022, pp. 1–27, https://doi.org/10.1145/3529649
[16] Liu, C.-W., Lowe, R., Serban, I., Noseworthy, M., Charlin, L., & Pineau, J. (2016). How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. https://doi.org/10.18653/v1/d16-1230
[17] Maroengsit, W., Piyakulpinyo, T., Phonyiam, K., Pongnumkul, S., Chaovalit, P., & Theeramunkong, T. (2019). A survey on evaluation methods for Chatbots. Proceedings of the 2019 7th International Conference on Information and Education Technology. https://doi.org/10.1145/3323771.3323824.
[18] Aleedy, M., Shaiba, H., & Bezbradica, M. (2019). Generating and analyzing chatbot responses using Natural Language Processing. International Journal of Advanced Computer Science and Applications, 10(9). https://doi.org/10.14569/ijacsa.2019.0100910
[19] Rajamalli Keerthana, R., Fathima, G., & Florence, L. (2021). Evaluating the performance of various deep reinforcement learning algorithms for a conversational chatbot. 2021 2nd International Conference for Emerging Technology (INCET). https://doi.org/10.1109/incet51464.2021.9456321
[20] Introducing whisper. Introducing Whisper. (n.d.). Retrieved April 19, 2023, from https://openai.com/research/whisper.
[21] K, Bharath. “How to Get Started with Google Text-to-Speech Using Python.” Medium, Towards Data Science, 30 Aug. 2020, towardsdatascience.com/how-to-get-started-with-google-text-to-speech-using-python-485e43d1d544.
[22] Silva Barbon, R., & Akabane, A. T. (2022, October 26). Towards transfer learning techniques-BERT, DistilBERT, BERTimbau, and DistilBERTimbau for automatic text classification from different languages: A case study. MDPI. https://www.mdpi.com/1424-8220/22/21/8184
[23] Alexander Mathew. “Data to Text Generation with T5; Building a Simple yet Advanced NLG Model.” Medium, Towards Data Science, 10 Apr. 2021, towardsdatascience.com/data-to-text-generation-with-t5-building-a-simple-yet-advanced-nlg-model-b5cce5a6df45.
[24] Flaticon, the Largest Database of Free Icons. Flaticon, www.flaticon.com/icons.