AI in Automotive Repair:
Building a Data Driven Chatbot for
Enhanced Vehicle Diagnostics
(In Collaboration with Volvo Group)
Dipanwita Dash
Generative Artificial Intelligence (AI) and Large Language Models (LLMs) present a promis-
ing avenue to augment service delivery and customer satisfaction in many sectors, including
automotive repair. Traditional diagnostic systems in this sector support the "Triple C" (CCC) approach of capturing Complaints, identifying Causes and providing Corrections, but often suffer from inefficiencies such as the under-utilization of insights from historic cases stored in massive databases containing structured and unstructured data. This results in increased costs and extended vehicle downtime due to repetitive or misdiagnosed issues.
The primary objective of this research is to enhance the efficiency and accuracy of automotive repair services by developing a chatbot system that can retrieve relevant CCC information from a dataset of technician service and repair entries. The dataset is sourced from workshops in several countries, and includes technical codes and free-form text with
vehicle and service descriptions. In order to explore and overcome the infrastructure challenges of implementing this system within the organizational setup, this thesis aims to develop and analyze two different chatbot systems, both featuring a Retrieval Augmented Generation (RAG) framework to augment Large Language Model (LLM) outputs. The first system, implemented on-premises, integrates the Instructor XL embedding
model, Milvus vector database, and Mixtral 8x7B LLM. The second system operates within the
Azure cloud environment, employing the text-embedding-ada-002 model for embedding, Azure
AI Search for vector retrieval and GPT-3.5 Turbo as the LLM. Both systems are evaluated
based on performance, accuracy, scalability, and cost-effectiveness.
The on-premises system performs better in terms of speed and cost-effectiveness, whereas the cloud-based system, which I implemented, offers better scalability, availability, and support for searchable metadata.
The main impact of this research is demonstrated through its contribution to the integration of
AI in automotive services, addressing critical aspects such as data privacy, system scalability,
and practical implementation of state-of-the-art AI technologies in an industry-specific con-
text. Recommendations for future research include language support, enhanced interactions,
improved evaluation, exploration of hybrid architectural frameworks to combine the strengths
of both RAG as well as fine-tuning of LLM, and extended integration with real-time vehicle
data systems for a comprehensive service experience.
Acknowledgments
I am deeply grateful to a number of individuals whose support and guidance were indispensable
throughout the journey of this thesis.
First, a special thanks goes to my thesis partner, Erfan Kanani (Stockholm University), whose
collaboration and insights were crucial in shaping many of the concepts and analyses presented
in this work. Working with him has been an enriching experience.
I would like to express my appreciation to my thesis supervisor Karl Löwenmark from Luleå Uni-
versity of Technology, whose expertise and insightful guidance have been invaluable throughout
this research.
I am also grateful to my Volvo supervisor, Leila Jamshidian Sales, for her continuous support
and insights that enhanced the quality of this thesis. I extend my gratitude to my colleagues at
Volvo Group, for their understanding and support, providing us with the necessary resources
and flexibility to pursue our research objectives.
Lastly, a big thanks to my family and friends who encouraged and supported me, directly or indirectly, throughout my academic journey and this thesis.
This thesis would not have been possible without the collective support and encouragement of everyone mentioned, and for that I am truly grateful.
Contents
Acronyms 5
List of Symbols 6
List of Figures 7
List of Tables 8
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.5 Usage of AI tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background 4
2.1 Introduction to Chatbot Technologies . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Foundational Concepts in Conversational AI . . . . . . . . . . . . . . . . . . . . 4
2.2.1 Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.2 Traditional Chatbots Architecture . . . . . . . . . . . . . . . . . . . . . . 5
2.2.3 Core Deep Learning Technologies . . . . . . . . . . . . . . . . . . . . . . 5
2.2.4 Evolution of LLMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.5 Retrieval Augmented Generation . . . . . . . . . . . . . . . . . . . . . . 7
2.2.5.1 Ingestion Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.5.2 Retrieval Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.5.3 Generation Phase . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Limitations of LLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.1 Limited Knowledge Scope . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.2 Difficulty with Up-to-Date Information . . . . . . . . . . . . . . . . . . . 9
2.3.3 Prone to Hallucination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.4 Lack of Grounding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Gap and Novelty of This Research . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.1 Identification of the Gap . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.2 Novelty of This Research . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4.3 Impact of the Novelty . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Methodology 11
3.1 Data Collection and Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.2 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.3 Data Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 General System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.1 Chatbot Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Chatbot Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3.1 On-premise Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3.2 Cloud-based Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 15
4 Results 20
4.1 Graphical User Interface (GUI) for the Chatbot . . . . . . . . . . . . . . . . . . 20
4.2 Comparative Analysis - LLM vs RAG(Cloud-based) . . . . . . . . . . . . . . . . 22
4.3 Performance Analysis of RAG pipelines . . . . . . . . . . . . . . . . . . . . . . . 25
4.4 Evaluation Results for RAG pipelines . . . . . . . . . . . . . . . . . . . . . . . . 26
5 Discussion 28
5.1 Interpretation of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.2 Pros and Cons - On-premise and Cloud-based . . . . . . . . . . . . . . . . . . . 29
5.3 Challenges Encountered . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.4 Ethical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6 Conclusion 31
6.1 Summary of Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.2 Key Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.3 Recommendations for Future Research . . . . . . . . . . . . . . . . . . . . . . . 31
6.4 Final Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
References 33
Acronyms
LLM Large Language Model. 1, 3, 4, 7, 8, 13–15, 17, 18, 22–24, 28, 29, 31
LLMs Large Language Models. 1–3, 6, 8–10, 26
LSTM Long Short-Term Memory. 5, 6
List of Symbols
List of Figures
List of Tables
Introduction
1.1 Overview
AI has become a cornerstone of the modern technology landscape, capable of transforming and enhancing productivity across diverse sectors. The quest for mastering linguistic intelligence in machines began with the inception of the Turing Test in the 1950s [37]. Language is a complex system of human expression regulated by grammatical rules, and it poses a significant challenge in developing AI algorithms capable of grasping and leveraging it. Machines can begin to comprehend natural language only with sophisticated AI advancements. Achieving the ability for machines to read, write and communicate in a human-like manner has been a long-standing research challenge [29].
Significant strides in language modeling have been crucial in overcoming these hurdles. The
introduction of Embeddings from Language Models (ELMo) marked a revolutionary shift in
Natural Language Processing (NLP) by emphasizing word context rather than viewing words
as fixed entities [24]. Subsequent progress was achieved with the Transformer architecture [30],
exemplified by models like BERT [6], which uses self-attention to analyze large volumes of
unlabeled text. This approach has enabled models to capture the linguistic subtleties more
effectively, boosting their performance across various NLP tasks significantly.
Continuous improvements in the field suggest that increasing the scale of Pre-trained Language Models enhances their effectiveness on diverse linguistic tasks, a principle known as the scaling law. This phenomenon has been demonstrated by the introduction of larger models such as the 175 billion-parameter GPT-3, which exhibits more complex behaviors and superior problem-solving capabilities compared to its predecessors such as the 330 million-parameter BERT or the 1.5 billion-parameter GPT-2 [17] [34].
LLMs such as GPT-3 [3], PaLM [4], Galactica [27], LLaMA [28] and Mistral [13] represent the leading edge of NLP technology. These models, containing hundreds of billions of parameters trained on extensive text data collections [25], have been able to grasp the subtleties and
complexities of natural language. Their ability to generate coherent and contextually relevant
text showcases their expertise in tasks such as writing, summarizing, answering questions and
generating creative content. The effectiveness of these models highlights not only their vast
scale but also the innovative Transformer architecture [30] that supports their complex data
processing needs.
The automotive industry is one such sector, where AI has been pivotal in redefining customer interactions and service diagnostics [18]. Using vehicle-diagnostics data collected from various sources in an AI setup can create specialized services that offer quick responses to queries about vehicle maintenance and even complex automotive troubleshooting.
is a need for a systematic approach that captures customer complaints accurately, identifies the
root causes, and provides effective corrections in a timely manner.
The specific problem that we aim to resolve is related to the "Triple C" or CCC approach.
Firstly, many customers may not accurately describe technical problems, which can lead to
incorrect or incomplete diagnoses. Secondly, the cause of automotive problems often involves
complex interactions of mechanical and electronic systems, which can be difficult to analyze
without an expert. Lastly, providing effective corrections requires an understanding of the
problem and its causes and also needs to be communicated in a way that is clear and actionable
for customers.
In the setup concerning Volvo Group, the application of AI introduces distinct challenges and
opportunities, especially within the context of vehicle maintenance. Volvo’s vehicle repair
workshops require precise, timely, and contextually relevant information to maintain operational
efficiency. Technicians must have access to a wide range of specialized knowledge, which includes
diagnostics, repair protocols, and vehicle maintenance histories. This need for domain-specific information, coupled with the need for immediate data updates, highlights the necessity for advancements in language modeling that extend beyond the capabilities of traditional LLMs.
By integrating a solution equipped with AI and NLP capabilities, the aim is to streamline the
interaction process, enhance the accuracy of diagnostics and improve the overall efficiency of
vehicle service repairs.
On the other hand, LLMs face substantial challenges when tasked with handling queries that
require specialized domain knowledge or expertise [16], despite rapid advancements. One per-
sistent issue is their tendency to generate inaccurate or entirely fabricated "hallucinated" re-
sponses, a problem that is exacerbated when the query extends beyond their training data or
requires up-to-date information [11]. These limitations show the inherent risks in relying solely
on LLMs for practical applications in real-world scenarios, emphasizing the critical need for
additional safeguards and enhancements to ensure their reliability.
1.3 Objectives
The primary goal of this thesis is to develop and evaluate an AI-driven chatbot system that
leverages a RAG framework combined with LLMs to improve the efficiency and precision of vehicle service diagnostics. The system aims to interpret customer complaints, accurately diagnose underlying causes, and provide effective solutions based on the CCC approach. Additionally, the research seeks to improve customer satisfaction by providing contextually relevant advice, while continuously evaluating the performance, scalability, and cost-effectiveness of the
system compared to traditional methods. The proposed chatbot is envisioned as a crucial step
in narrowing the gap between extensive technical data and the immediate information needs
of technicians, enabling more knowledgeable and fast maintenance activities. Through this
exploration, we intend to contribute valuable insights into the application of advanced AI tech-
nologies within the automotive service industry, laying a foundation for future research and
development.
statement consisting of the need for enhancements in vehicle service diagnostics and a discussion of the specific challenges faced in applying the technologies within this specialized domain.
2. Background: Explores the evolution and types of chatbot technologies, emphasizing
the transition from rule-based systems to AI-powered chatbots, foundational concepts in
conversational AI and the significant advancements brought by deep learning technologies
in NLP.
3. Methodology: Outlines the research strategies and methods used for developing the
chatbot, detailing the research strategy employed to guide the chatbot’s systematic de-
velopment, along with the steps involved in data collection, preprocessing and system
architecture planning.
4. Results: Describes the technical architecture of the chatbot, including detailed descrip-
tions of the integration and functionality of the components and how these are tailored
to meet the operational needs of Volvo’s repair workshops and also presents the results
of the implemented chatbot system, assessing its performance through both qualitative
feedback and quantitative metrics. This chapter evaluates the chatbot’s impact on the
efficiency and accuracy of maintenance operations, user satisfaction and its alignment
with the intended operational goals.
5. Discussion: Presents the interpretation of the results, challenges encountered and observations related to the whole implementation.
6. Conclusion and Future Work: Concludes the thesis by summarizing the findings,
discussing the implications of the research, and suggesting areas for future research to
further enhance and expand the capabilities of AI-driven chatbot solutions in specialized
industrial applications.
Background
2.2 Foundational Concepts in Conversational AI
methodologies that facilitate the interaction between humans and machines. For example,
tokenization breaks down text into individual words or phrases, allowing the chatbots to analyze
languages at a granular level. Semantic analysis investigates the meaning behind words and
sentences, enabling chatbots to grasp the intent and sentiment of the user's input. This level of understanding is essential for the chatbot to respond in a manner that is contextually appropriate and emotionally attuned to the user's needs [15].
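Tokenization at this granular level can be illustrated with a minimal sketch. The trivial regular-expression tokenizer below is purely illustrative; production NLP pipelines use trained (often sub-word) tokenizers:

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens, the granular units a
    chatbot's NLP pipeline analyzes."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("The engine makes a rattling noise at start-up.")
# ['the', 'engine', 'makes', 'a', 'rattling', 'noise', 'at', 'start', 'up']
```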
Beyond these fundamental techniques, NLP also involves pragmatic analysis, which helps un-
derstand the language in context and allows chatbots to maintain the coherence of conversations
over multiple turns of dialogue. These advanced NLP capabilities enable chatbots to handle
complex conversations, track the conversation’s context over time, and provide logical, relevant
and conversational responses. The importance of NLP in chatbots cannot be overstated. It transforms chatbots from simple keyword-matching programs into systems that engage in natural, fluid, and meaningful interactions. NLP allows chatbots to answer queries and understand
the intent behind them, manage conversation flow and adapt responses based on the conversa-
tion’s context. This leads to a more natural and satisfying user experience, bridging the gap
between human-human and human-computer communication [20].
(RNN), were among the first to address the need for context and memory in chatbot con-
versations. They could retain information over long periods, allowing for more coherent and
context-aware dialogue [10].
Introduction to attention mechanisms and transformers marked a further evolution [30]. These
technologies allowed for even more complex processing of conversational context, enabling the
chatbot to focus on relevant parts of the conversation and generate more appropriate and
natural responses.
Transformers have set new standards for chatbot performance, fostering the development of
LLMs like OpenAI’s GPT series. These models can process vast amounts of data, learn from
diverse linguistic patterns, and generate remarkably human-like responses in their complexity,
relevance and coherence [6]. The integration of these core deep learning technologies has revolutionized the field of conversational AI, leading to the creation of chatbots that are not only more responsive and intelligent but also capable of continuous learning and improvement. This represents a significant leap forward from the rule-based systems of the past, indicating a new era of AI-driven conversational agents [3].
2.3 Limitations of LLM
When vectors are normalized to unit length, the dot product provides the cosine of the
angle between them, serving as an effective measure of similarity.
• Cosine Similarity: Cosine similarity measures the cosine of the angle between two
vectors. This similarity index helps to understand the orientation of vectors in a multi-
dimensional space, indicating how similar their directions are irrespective of their magni-
tude.
Definition and Formula: For two vectors A and B, cosine similarity is calculated using the dot product and the magnitudes of the vectors:

S(A, B) = (A · B) / (∥A∥ ∥B∥)

Here, ∥A∥ and ∥B∥ represent the magnitudes (or norms) of vectors A and B, computed as √(Σ_{i=1}^{n} A_i²) and √(Σ_{i=1}^{n} B_i²) respectively. This ratio of the dot product to the product of the magnitudes measures the cosine of the angle between the vectors, effectively quantifying their directional similarity.
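As a concrete illustration of this formula, cosine similarity can be computed directly from the dot product and the norms; this short NumPy sketch is illustrative only and is not part of the thesis implementation:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b: (a . b) / (||a|| ||b||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Parallel vectors have similarity ~1.0 regardless of magnitude;
# orthogonal vectors have similarity 0.0.
a = np.array([1.0, 2.0, 3.0])
sim_parallel = cosine_similarity(a, 2 * a)                              # ~1.0
sim_orthogonal = cosine_similarity(np.array([1.0, 0.0]),
                                   np.array([0.0, 5.0]))                # 0.0
```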
2.4 Gap and Novelty of This Research
biases and limitations within their responses. This highlights the need for continuous updating of the training data and model retraining to keep pace with new knowledge and societal changes. However, such retraining is resource-intensive and not always feasible in practice.
This gap leads to inefficiencies and delays in accessing essential repair information, consequently
extending vehicle downtime and potentially escalating maintenance costs. Technicians at Volvo
are often confronted with complex repair scenarios that require immediate access to an extensive
repository of technical data and historical repair records. While the current systems are robust,
they fail to fully leverage AI capabilities to provide context-sensitive, real-time assistance. This becomes even more important when technicians face rare repair issues or issues involving interactions between multiple vehicle systems.
Moreover, there exists an opportunity to augment system intelligence through the deployment
of AI-driven tools capable of adapting and learning from historical data—qualities inherent to
LLMs. The adoption of such technologies could effectively close the existing gap by offering
dynamic, actionable guidance specifically tailored to the unique challenges of automotive repair.
In response to these needs articulated by the Volvo Group, the objective for this thesis has been
formulated to explore how LLMs can be employed to develop domain-specific chatbots. These
chatbots are envisioned to significantly enhance the efficiency and precision of technical support,
seamlessly integrating into the workflows of automotive technicians and thereby substantially
improving the operational efficacy of the organization.
Methodology
This chapter describes the methodology followed for the research conducted as part of this thesis.
Vehicle Information: Details of the vehicle, such as vehicle identification number, brand, model and product type.

Repair Information: Details of the vehicle's workshop visit, such as place of repair, date, defect description and causal part. This also includes claims related to the repair.

Technician Comments: Comments logged in the workshop containing the complaint from the customer, the cause of the problem, and the correction or action taken to fix the problem. These are the free-text fields which are of main interest for this research.
This dataset combines structured tabular data, such as vehicle information and repair details,
with unstructured data predominantly consisting of textual comments from both customers
and technicians. The structured data provides a framework for the chatbot to understand
and organize the technical aspects of vehicle maintenance, while the unstructured comments
are crucial for deriving practical insights into the real-world execution of maintenance work,
including troubleshooting and customer service interactions.
3.3 Chatbot Implementation
This is the second step as shown in Figure 2. The user submits a query through a chatbot
web application, which acts as the interface between the user and the system. This query is
captured by the web application and processed into a vector by the same embedding model
used for document processing. This vectorized query is matched against document embeddings
stored in the Vector DB to identify the most relevant documents. These documents are used by
the LLM, which generates a contextually appropriate response. This response is then streamed
back to the user through the chatbot, providing an answer that leverages the most relevant
information from the database to address the user’s query.
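The query flow described above can be sketched end to end. The toy bag-of-words "embedding" below stands in for a real embedding model and vector database purely to make the retrieve-then-generate structure concrete; in the actual systems these roles are played by Instructor XL/Milvus or text-embedding-ada-002/Azure AI Search together with an LLM:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for the embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def answer_query(query: str, documents: list[str], top_k: int = 2) -> str:
    # 1. Embed the user query with the same model used for the documents.
    q_vec = embed(query)
    # 2. Rank stored documents by similarity to the query embedding.
    ranked = sorted(documents, key=lambda d: cosine(q_vec, embed(d)), reverse=True)
    context = ranked[:top_k]
    # 3. Assemble the prompt; a real system would now send this to the LLM
    #    and stream the generated answer back to the user.
    return "Context:\n" + "\n".join(context) + f"\nQuestion: {query}"

docs = [
    "Complaint: engine noise. Cause: worn belt. Correction: belt replaced.",
    "Complaint: brake squeal. Cause: glazed pads. Correction: pads replaced.",
]
prompt = answer_query("engine noise when starting", docs, top_k=1)
```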
All of these products are made available in a secure way through an internal framework in the on-premise setup, which is part of Volvo's own datacenter. The architecture is the same as shown in Figure 2. My thesis partner Erfan Kanani worked more on building this pipeline, whereas I worked more on the cloud-based implementation.
These products are used for the project since they are part of the white-listed technology stack available in the Volvo cloud environment, chosen with consideration of important security assessments as well as pricing. This will be discussed in more detail in the Discussion. The following sub-sections discuss how these are used and the technical details associated with them, along with the implementation steps.
the complaint, cause and correction fields to create one vector, considering that the focus of the use case is CCC. Other fields such as causal part and defect description were merged to create the second vector, and the rest of the fields were stored as-is to be used as metadata. The model text-embedding-ada-002 leverages the capabilities of the GPT architecture, particularly focusing on understanding and encoding textual information into dense vectors that can be used in various natural language processing tasks [22]. When we input some text into the text-embedding-ada-002 model, the following happens:
• Tokenization: The text is broken down into tokens using a tokenizer. These tokens are often words or sub-words that the model can understand. There is a token limit (8192 tokens), so the input text should not be too long; otherwise it is truncated in our implementation.
• Embedding: Each token is then converted into a numerical representation (embed-
ding). These initial embeddings are learnt from the data during the training process.
• Transformer Processing: The embeddings pass through several layers of the trans-
former model. Each layer consists of self-attention mechanisms and feed-forward
neural networks, allowing the model to analyze the text from various perspectives.
• Final Output: The output from the transformer layer is a series of vectors, each
corresponding to a token in the input text. In order to derive a single embedding
vector for the entire input text, the vectors are aggregated and then normalized.
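The aggregation in the final step can be illustrated with a toy pooling function. The exact pooling used inside text-embedding-ada-002 is internal to the model, so this sketch only demonstrates the general idea of averaging per-token vectors and normalizing the result to unit length:

```python
import math

def pool_and_normalize(token_vectors: list[list[float]]) -> list[float]:
    """Average the per-token vectors into a single embedding for the
    whole input, then scale it to unit length (L2 normalization)."""
    dim = len(token_vectors[0])
    mean = [sum(v[i] for v in token_vectors) / len(token_vectors)
            for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]

# Three fake 2-d "token embeddings"; the pooled vector has unit length.
pooled = pool_and_normalize([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```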
3. Ingestion into Vector Storage (Azure AI Search) - This step refers to the process of transferring and storing the embeddings created from text data into Azure's AI Search service, which is designed to manage and facilitate efficient search operations over large datasets [23]. It enables the subsequent retrieval of the information most relevant to user queries by using the embeddings to enhance the precision of search results. The embeddings generated by the text-embedding-ada-002 model are vectors that represent the semantic content of text inputs. These vectors need to be indexed in Azure AI Search to allow for efficient querying. The process involves:
• The embeddings along with their associated metadata were formatted according to the requirements of AI Search.
• In Azure AI Search, defining the schema was a critical step where we could specify the fields of data that would be indexed and how they should be processed. For embeddings, this included setting the correct data types and marking fields as searchable or filterable.
• The prepared data was then uploaded to AI Search, where an index is created. The structure of this index is designed to optimize the performance of vector searches, ensuring that search queries return the most semantically relevant results quickly.
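The ingestion steps above can be sketched as follows. The field names (id, ccc_vector) and metadata keys are illustrative assumptions rather than the actual index schema, and the upload call shown in the comment is only the hedged shape of typical azure-search-documents SDK usage, not the project's exact code:

```python
def to_search_document(doc_id: str, embedding: list[float],
                       metadata: dict) -> dict:
    """Shape one record (embedding vector plus searchable metadata)
    for upload to the search index."""
    record = {"id": doc_id, "ccc_vector": embedding}
    record.update(metadata)
    return record

batch = [to_search_document("case-001", [0.12, -0.08, 0.33],
                            {"brand": "Volvo", "country": "SE"})]

# The upload itself would then be a single SDK call, roughly:
# SearchClient(endpoint, index_name, credential).upload_documents(documents=batch)
```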
Azure AI Search supports different types of search methods in the context of information retrieval. Table 4 explains each of them with pros and cons:
Keyword Search. Functionality: matches query terms directly against a set of documents. Pros: simple to implement and effective for precise matches. Cons: may return irrelevant results if query words are ambiguous; does not account for context.

Vector Search. Functionality: uses embeddings to perform search based on semantic similarity. Pros: captures semantic meaning and returns more contextually relevant results. Cons: requires computational resources for embeddings.

Hybrid Search. Functionality: combines keyword and vector search methodologies. Pros: balances speed and accuracy; leverages both exact match and semantic understanding. Cons: more complex to implement and manage.

Semantic Reranking. Functionality: reranks initial search results using AI to better match the query intent. Pros: improves result quality by considering deeper linguistic factors. Cons: computationally intensive; requires advanced AI models.
The flexibility and scalability of Azure AI Search make it a powerful tool for building advanced search solutions for chatbots. We used semantic reranking as the search method while creating the index for our vehicle service repair embeddings and metadata.
4. Creation of User Query Embedding - This step involved transforming the text of the user's query into a high-dimensional vector representation using the same embedding model, text-embedding-ada-002. This embedding captures the semantic meaning and can then be compared to the embeddings stored in the vector database in order to get contextually relevant search results.
5. Prompt Creation - The prompt plays an important role when we provide context to the LLM. It is the combination of the user query and the context, i.e. the relevant documents retrieved from the vector database based on the similarity search. The number of documents to be retrieved based on similarity is configurable. The prompt also includes some general instructions to the LLM on how to respond to the queries.
6. Interaction with LLM - This step involved sending the constructed prompt to the model GPT-3.5 Turbo (0301) [21] via an Application Programming Interface (API) and processing the response to provide the user with a context-rich answer. The maximum context length of this model is 8192 tokens, so the length of the prompt passed to the LLM affects the length of the response that can be generated.
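Steps 5 and 6 can be sketched together as below. The instruction wording, deployment name and client setup are illustrative assumptions, not the exact prompt or configuration used in the thesis:

```python
def build_prompt(query: str, documents: list[str]) -> str:
    """Step 5: combine general instructions, the retrieved context and
    the user query into a single prompt string."""
    instructions = ("Answer using only the context below. "
                    "If the answer is not in the context, say so.")
    context = "\n\n".join(f"Document {i + 1}:\n{d}"
                          for i, d in enumerate(documents))
    return f"{instructions}\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"

prompt = build_prompt(
    "Why is the engine noisy at start?",
    ["Complaint: engine noise. Cause: worn belt. Correction: belt replaced."],
)

# Step 6: send the prompt to GPT-3.5 Turbo. Shown as a comment because it
# requires a configured Azure OpenAI client and deployment:
# response = client.chat.completions.create(
#     model="gpt-35-turbo",  # illustrative deployment name
#     messages=[{"role": "user", "content": prompt}],
# )
# answer = response.choices[0].message.content
```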
7. Creation of a GUI - A simple GUI was created using Python Streamlit in order to provide an interactive interface where the user can ask questions and get responses. The same interface also displays the documents retrieved to generate the answer. The GUI was then deployed to the Azure environment as an application.
3.4 Evaluation
To ensure the chatbot effectively addresses the specific needs of Volvo Group’s truck repair
operations, a thorough evaluation framework was implemented as part of the research method-
ology. As with any machine learning model, the performance of individual components within
the LLM and RAG pipeline significantly influences the overall user experience. For this eval-
uation, we employed the RAGAs library [7], which provides specialized metrics designed to
evaluate each component of the RAG pipeline.
Figure 3 outlines a framework for evaluating the performance of a RAG system, with a focus on the alignment of generated answers with the relevant context and ground truth data. The main components of the evaluation framework are: Question, Answer, Context and Ground truth (optional). The following are the evaluation paths.
Faithfulness
Faithfulness measures the factual consistency of the generated answer against the given context.
It is calculated from the answer and the retrieved context, scaled to a range of (0,1), with higher
values indicating better performance. A generated answer is considered faithful if all the claims
made in the answer can be inferred from the provided context. To determine this, a set of
claims from the generated answer is first identified, and each claim is then cross-checked with
the given context to see if it can be inferred from it.
The formula for faithfulness is:

Faithfulness = (Number of claims in the generated answer that can be inferred from the given context) / (Total number of claims in the generated answer)
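As a toy illustration of this formula (in RAGAs, claim extraction and verification are themselves performed by an LLM; here that step is abstracted into a list of booleans):

```python
def faithfulness(claims_supported: list[bool]) -> float:
    """Fraction of claims in the generated answer that can be inferred
    from the retrieved context."""
    if not claims_supported:
        return 0.0
    return sum(claims_supported) / len(claims_supported)

# Three claims checked against the context, two of them supported:
score = faithfulness([True, True, False])  # 2/3
```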
Answer Relevancy
Answer Relevancy focuses on assessing how pertinent the generated answer is to the given
prompt. Lower scores are assigned to answers that are incomplete or contain redundant in-
formation, while higher scores indicate better relevancy. This metric is computed using the
question, the context, and the answer.
Answer Relevancy is defined as the mean cosine similarity between the embedding of the original
question and the embeddings of a number of artificial questions, which are reverse-engineered
from the answer:

\[
\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos(E_{g_i}, E_o) = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\|E_{g_i}\| \, \|E_o\|}
\]
Where:
• Egi is the embedding of the generated question i
• Eo is the embedding of the original question.
• N is the number of generated questions (3 by default)
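The mean-cosine-similarity computation itself can be sketched as follows; the toy two-dimensional vectors are assumptions for illustration, whereas in the thesis's systems the embeddings would come from Instructor XL (on-premises) or text-embedding-ada-002 (Azure):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def answer_relevancy(e_o: list[float], e_g: list[list[float]]) -> float:
    """Mean cosine similarity of the original question embedding E_o
    to the embeddings E_gi of the N questions generated from the answer."""
    return sum(cosine(g, e_o) for g in e_g) / len(e_g)

# Toy 2-D "embeddings" standing in for real embedding-model output.
e_o = [1.0, 0.0]
e_g = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(round(answer_relevancy(e_o, e_g), 3))  # (1 + 0 + 1/sqrt(2)) / 3 -> 0.569
```

An answer that drifts from the question produces generated questions whose embeddings lie far from the original, pulling the mean similarity down.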
Context Precision
Context Precision is a metric that evaluates whether the ground-truth-relevant items present
in the retrieved contexts are ranked near the top. Ideally, all the relevant records should appear
at the top ranks. This metric is computed using the question, the ground truth and the contexts,
with values ranging between 0 and 1, where higher scores indicate better precision. The formula
for Context Precision is given by:
\[
\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left( \text{Precision@}k \times v_k \right)}{\text{Total number of relevant items in the top } K \text{ results}}
\]

Where:

\[
\text{Precision@}k = \frac{\text{true positives@}k}{\text{true positives@}k + \text{false positives@}k}
\]
where:
• K is the total number of rows in contexts.
• vk ∈ {0, 1} is the relevance indicator at rank k.
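Given a list of per-rank relevance indicators $v_k$, the metric reduces to averaging Precision@k over the ranks that hold a relevant item. A minimal sketch (the function name is an assumption; RAGAs derives the indicators from the question and ground truth):

```python
def context_precision_at_k(relevance: list[int]) -> float:
    """Context Precision@K from per-rank relevance indicators v_k
    (1 = relevant, 0 = not relevant), ordered by retrieval rank.

    Precision@k is accumulated only at ranks where v_k = 1, then
    normalized by the total number of relevant items in the top K.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for k, v_k in enumerate(relevance, start=1):
        if v_k:
            hits += 1
            score += hits / k  # Precision@k at this relevant rank
    return score / total_relevant

# Relevant documents retrieved at ranks 1 and 3 out of K = 4.
print(context_precision_at_k([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 = 5/6
```

Moving the second relevant document from rank 3 to rank 2 would raise the score to (1/1 + 2/2) / 2 = 1.0, reflecting the metric's preference for relevant records at the top ranks.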
Context Recall
Context Recall measures the extent to which the retrieved context aligns with the annotated
answer, treated as the ground truth. It is computed based on the ground truth and the retrieved
context, with values ranging between 0 and 1, where higher values indicate better performance.
To estimate context recall from the ground truth answer, each sentence in the ground truth
answer is analyzed to determine whether it can be attributed to the retrieved context. In an
ideal scenario, all sentences in the ground truth answer should be attributable to the retrieved
context.
The formula for calculating context recall is:
\[
\text{Context Recall} = \frac{\text{Number of sentences in the ground truth answer that can be attributed to the context}}{\text{Total number of sentences in the ground truth answer}}
\]
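As with faithfulness, RAGAs delegates the attribution judgement to an LLM; the ratio itself can be sketched with a substring check as an assumed stand-in for that judgement:

```python
def context_recall(gt_sentences: list[str], context: str) -> float:
    """Fraction of ground-truth sentences attributable to the context.

    A case-insensitive substring check is an illustrative stand-in for
    the LLM-based attribution judgement used by RAGAs.
    """
    if not gt_sentences:
        return 0.0
    attributed = sum(1 for s in gt_sentences if s.lower() in context.lower())
    return attributed / len(gt_sentences)

# Hypothetical example: one ground-truth sentence is covered by the
# retrieved context, the other is not.
context = "Replace the rear signal lamp. Check the wiring harness."
ground_truth = ["replace the rear signal lamp", "reset the parameters"]
print(context_recall(ground_truth, context))  # 1 of 2 sentences -> 0.5
```

A low context recall signals a retrieval problem: the vector search failed to surface the records needed to reproduce the annotated answer, regardless of how well the LLM generates.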
Results
This section describes the results obtained from the different experiments and implementations,
along with the qualitative and quantitative comparisons that were performed.
4.1 GUI for the Chatbot
[Screenshot of the chatbot GUI. The interface lets the user select K, the number of the most
similar documents that the generated answer will be based on, and ask any question related to
the problem with the vehicle. The retrieved documents (Document 1 to Document 9) are listed
alongside the generated answer.

Example question: "What to do when signal lamp is malfunctioned in a vehicle?"

Example answer: "If the signal lamp in a vehicle is malfunctioning, the cause could be a
burned-out bulb. In such a case, the solution would be to replace the lamp of the rear signal
on the right side. If the issue persists, it could be due to a faulty wiring harness or a defective
control unit. In such cases, it is recommended to perform diagnostics and check the electrical
connections of the relevant components. If necessary, replace the faulty parts and perform any
required software updates or parameter resets."]
4.2 Comparative Analysis - LLM vs RAG (Cloud-based)
4.3 Performance Analysis of RAG pipelines
4.4 Evaluation Results for RAG pipelines
It is worth noting that the execution/response times observed for the RAG components, Retrieval
and Generation, vary with the number of documents configured to be retrieved.
Discussion
5.2 Pros and Cons - On-premise and Cloud-based
retrieval part, consequently enhancing the context and the generation from the LLM. Given that
the difference in response time is between 1 and 5 seconds, it does not make a big difference in practice.
Table 9 shows the average values of the evaluation parameters for both the on-premise and cloud-
based systems, calculated from the values obtained by asking the same 10 questions to both
systems, as illustrated in Table 7 and Table 8. Based on this evaluation, it is observed that the
Generation metrics, Faithfulness and Answer Relevancy, are better for the on-premise chatbot
system, whereas the Retrieval metrics, Context Precision and Context Recall, are better for the
cloud-based chatbot system. However, further experiments are needed to support this statement,
possibly with more questions, a varying number of retrieved documents and real Ground Truth
validated qualitatively by domain experts.
5.4 Ethical Considerations
Conclusion
6.4 Final Remarks
References
[1] Emily M Bender et al. “On the dangers of stochastic parrots: Can language models be
too big?” In: Proceedings of the 2021 ACM conference on fairness, accountability, and
transparency. 2021, pp. 610–623.
[2] Yonatan Bisk et al. “Experience grounds language”. In: arXiv preprint arXiv:2004.10151
(2020).
[3] Tom Brown et al. “Language models are few-shot learners”. In: Advances in neural infor-
mation processing systems 33 (2020), pp. 1877–1901.
[4] Aakanksha Chowdhery et al. “Palm: Scaling language modeling with pathways”. In: Jour-
nal of Machine Learning Research 24.240 (2023), pp. 1–113.
[5] Richard Csaky. “Deep learning based chatbot models”. In: arXiv preprint arXiv:1908.08835
(2019).
[6] Jacob Devlin et al. “Bert: Pre-training of deep bidirectional transformers for language
understanding”. In: arXiv preprint arXiv:1810.04805 (2018).
[7] Shahul Es et al. “Ragas: Automated evaluation of retrieval augmented generation”. In:
arXiv preprint arXiv:2309.15217 (2023).
[8] David Ferrucci et al. “Building Watson: An overview of the DeepQA project”. In: AI
magazine 31.3 (2010), pp. 59–79.
[9] Zhuoyan Han. “The applications of chatbot”. In: Highlights in Science, Engineering and
Technology 57 (2023), pp. 258–266.
[10] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural com-
putation 9.8 (1997), pp. 1735–1780.
[11] Lei Huang et al. “A survey on hallucination in large language models: Principles, taxon-
omy, challenges, and open questions”. In: arXiv preprint arXiv:2311.05232 (2023).
[12] Gautier Izacard and Edouard Grave. Leveraging Passage Retrieval with Generative Models
for Open Domain Question Answering. 2021. arXiv: 2007.01282 [cs.CL].
[13] Albert Q Jiang et al. “Mistral 7B”. In: arXiv preprint arXiv:2310.06825 (2023).
[14] Albert Q. Jiang et al. Mixtral of Experts. 2024. arXiv: 2401.04088 [cs.LG].
[15] Daniel Jurafsky and James H Martin. Speech and Language Processing: An Introduction
to Natural Language Processing, Computational Linguistics, and Speech Recognition.
[16] Nikhil Kandpal et al. “Large language models struggle to learn long-tail knowledge”. In:
International Conference on Machine Learning. PMLR. 2023, pp. 15696–15707.
[17] Jared Kaplan et al. “Scaling laws for neural language models”. In: arXiv preprint arXiv:2001.08361
(2020).
[18] Ali Khodadadi, Soroush Ghandiparsi, and Chen-Nee Chuah. “A Natural Language Pro-
cessing and deep learning based model for automated vehicle diagnostics using free-text
customer service reports”. In: Machine Learning with Applications 10 (2022), p. 100424.
issn: 2666-8270. doi: https://doi.org/10.1016/j.mlwa.2022.100424. url: https:
//www.sciencedirect.com/science/article/pii/S2666827022000998.
[19] Patrick S. H. Lewis et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP
Tasks”. In: CoRR abs/2005.11401 (2020). arXiv: 2005.11401. url: https://arxiv.org/
abs/2005.11401.
[20] Michael McTear et al. “Conversational interfaces: devices, wearables, virtual agents, and
robots”. In: The Conversational Interface: Talking to Smart Devices (2016), pp. 283–308.
[21] Microsoft. OpenAI Models on Azure - Concepts and Models. Last updated on 2024-05-13,
accessed on 2024-06-07. 2024. url: https://learn.microsoft.com/en-us/azure/ai-
services/openai/concepts/models#gpt-35.
[22] Microsoft. Tutorial: Generate and use embeddings with OpenAI service in Azure. Last
updated on 2024-02-22, accessed on 2024-06-07. 2024. url: https://learn.microsoft.
com / en - us / azure / ai - services / openai / tutorials / embeddings ? tabs = python -
new%2Ccommand-line&pivots=programming-language-python.
[23] Microsoft. What is Azure Search? Last updated on 2024-05-21, accessed on 2024-06-07.
2024. url: https://learn.microsoft.com/en-us/azure/search/search-what-is-
azure-search.
[24] Matthew E. Peters et al. “Deep contextualized word representations”. In: CoRR abs/1802.05365
(2018). arXiv: 1802.05365. url: http://arxiv.org/abs/1802.05365.
[25] Murray Shanahan. “Talking about large language models”. In: Communications of the
ACM 67.2 (2024), pp. 68–79.
[26] Hongjin Su et al. One Embedder, Any Task: Instruction-Finetuned Text Embeddings. 2023.
arXiv: 2212.09741 [cs.CL].
[27] Ross Taylor et al. “Galactica: A large language model for science”. In: arXiv preprint
arXiv:2211.09085 (2022).
[28] Hugo Touvron et al. “Llama: Open and efficient foundation language models”. In: arXiv
preprint arXiv:2302.13971 (2023).
[29] Alan M Turing. “Computing Machinery and Intelligence.” In: Creative Computing 6.1
(1980), pp. 44–53.
[30] Ashish Vaswani et al. “Attention is all you need”. In: Advances in neural information
processing systems 30 (2017).
[31] Paul Voigt and Axel Von dem Bussche. “The eu general data protection regulation (gdpr)”.
In: A Practical Guide, 1st Ed., Cham: Springer International Publishing 10.3152676
(2017), pp. 10–5555.
[32] Richard Wallace. “The elements of AIML style”. In: Alice AI Foundation 139 (2003).
[33] Jianguo Wang et al. “Milvus: A Purpose-Built Vector Data Management System”. In:
Proceedings of the 2021 International Conference on Management of Data. SIGMOD ’21.
Virtual Event, China: Association for Computing Machinery, 2021, pp. 2614–2627. isbn:
9781450383431. doi: 10.1145/3448016.3457550. url: https://doi.org/10.1145/
3448016.3457550.
[34] Jason Wei et al. “Emergent abilities of large language models”. In: arXiv preprint arXiv:2206.07682
(2022).
[35] Joseph Weizenbaum. “ELIZA—a computer program for the study of natural language
communication between man and machine”. In: Communications of the ACM 9.1 (1966),
pp. 36–45.
[36] Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. “Hallucination is inevitable: An innate
limitation of large language models”. In: arXiv preprint arXiv:2401.11817 (2024).
[37] Wayne Xin Zhao et al. “A survey of large language models”. In: arXiv preprint arXiv:2303.18223
(2023).
Appendix A: Sample Data Record
Index: 1
Vehicle ID: ABC2745
Brand: Trucks
Product Type: BEV
Defect: voltage drop
Causal Part: battery
Complaint: voltage issue during start-up.....
Cause: Voltage drop in 2 batteries noticed during a test
Correction: Replace the batteries with the issue
Appendix B: GUI for the Chatbot
[Screenshot of the chatbot GUI. The interface lets the user select K, the number of the most
similar documents that the generated answer will be based on, and ask any question related to
the problem with the vehicle. The retrieved documents (Document 1 to Document 10) are listed
alongside the generated answer.

Example question: "Why do we need to replace seal in a vehicle, explain?"

Example answer: "Seals are an important component in a vehicle as they prevent oil leakage
and ensure that the engine runs smoothly. The need to replace a seal in a vehicle arises when it
starts to leak oil. The leakage can be caused by various factors such as worn-out seals, material
defects, or poor sealing due to faulty workmanship. In such cases, the seal needs to be replaced
with a new one. The replacement process involves removing the faulty seal, cleaning the surface,
and installing a new seal. The new seal is then tested to ensure that there are no leaks. It is
important to note that regular maintenance of the vehicle can help prevent such issues from
occurring."]