Article
Web Application for Retrieval-Augmented Generation:
Implementation and Testing
Irina Radeva 1, * , Ivan Popchev 2 , Lyubka Doukovska 1 and Miroslava Dimitrova 1
Abstract: The purpose of this paper is to explore the implementation of retrieval-augmented genera-
tion (RAG) technology with open-source large language models (LLMs). A dedicated web-based
application, PaSSER, was developed, integrating RAG with Mistral:7b, Llama2:7b, and Orca2:7b mod-
els. Various software instruments were used in the application’s development. PaSSER employs a set
of evaluation metrics, including METEOR, ROUGE, BLEU, perplexity, cosine similarity, Pearson cor-
relation, and F1 score, to assess LLMs’ performance, particularly within the smart agriculture domain.
The paper presents the results and analyses of two tests. One test assessed the performance of LLMs
across different hardware configurations, while the other determined which model delivered the
most accurate and contextually relevant responses within RAG. The paper discusses the integration
of blockchain with LLMs to manage and store assessment results within a blockchain environment.
The tests revealed that GPUs are essential for fast text generation, even for 7b models. Orca2:7b
on Mac M1 was the fastest, and Mistral:7b had superior performance on the 446 question–answer
dataset. The discussion addresses technical and hardware considerations affecting LLMs’ performance.
The conclusion outlines future developments in leveraging other LLMs, fine-tuning approaches, and
further integration with blockchain and IPFS.
In [3], the prompt engineering method is presented. This technique does not involve training network weights; instead, the input to the model is crafted so as to influence the desired output. This approach includes zero-shot prompting, few-shot prompting, and chain-of-thought prompting, each offering a way to guide the model’s response without direct modification of its parameters. This method leverages the flexibility and capability of LLMs and provides a tool to adapt the model without the computational cost of retraining.
RAG, introduced in [4], enhances language models by combining prompt engineering
and database querying to provide context-rich answers, reducing errors and adapting to
new data efficiently. The main concepts involve a combination of pre-trained language
models with external knowledge retrieval, enabling dynamic, informed content generation.
It is cost-effective and allows for traceable responses, making it interpretable. The develop-
ment of retrieval-augmented generation (RAG) represents a significant advancement in
the field of natural language processing (NLP). However, for deeper task-specific adapta-
tions, like analysing financial or medical records, fine-tuning may be preferable. RAG’s
integration of retrieval and generation techniques addresses LLM issues like inaccuracies
and opaque logic, yet incorporating varied knowledge and ensuring information relevance
and accuracy remain challenges [5].
Each method offers a specific approach to improving LLM performance. Choosing
between them depends on the desired balance between the required results, the available
resources, and the nature of the tasks set.
Other methods in this field are founded on these basic approaches or applied in parallel with them. For example, dense passage retrieval (DPR) [6] and the
retrieval-augmented language model (REALM) [7] refine retrieval mechanisms similar
to RAG. Fusion-in-decoder (FiD) [8] integrates information from multiple sources into
the decoding process. There are various knowledge-based modelling and meta-learning
approaches. Each of these models reflects efforts to extend the capabilities of pre-trained
language models and offer solutions for a wide range of NLP tasks.
The purpose of this paper is to explore the implementation of retrieval-augmented generation (RAG) technology with open-source large language models (LLMs). To support this research, a web-based application, PaSSER, that allows the integration, testing, and evaluation of such models in a structured environment has been developed.
The paper discusses the architecture of the web application, the technological tools
used, the models selected for integration, and the set of functionalities developed to operate
and evaluate these models. The evaluation of the models has two aspects: operation on
different computational infrastructures and performance in text generation and summa-
rization tasks.
The domain of smart agriculture is chosen as the empirical domain for testing the
models. Furthermore, the web application is open-source, which promotes transparency
and collaborative improvement. A detailed guide on installing and configuring the ap-
plication, the datasets generated for testing purposes, and the results of the experimental
evaluations are provided and available on GitHub [9].
The application allows adaptive testing of different scenarios. It integrates three of
the leading LLMs, Mistral:7b, Llama2:7b, and Orca2:7b, which do not require significant
computational resources. The selection of the Mistral:7b, Llama2:7b, and Orca2:7b models is driven by the aim of balancing performance and affordability: their parameter counts allow installation and operation on mid-range hardware configurations. Given appropriate computational resources, and without further modification, the PaSSER application allows the use of arbitrary open-source LLMs with more parameters.
A set of standard NLP metrics—METEOR, ROUGE, BLEU, Laplace and Lidstone’s
perplexity, cosine similarity, Pearson correlation coefficient, and F1 score—was selected for
a thorough evaluation of the models’ performance.
In this paper, RAG is viewed as a technology rather than a mere method. This distinction
is due to the paper’s emphasis on the applied, practical, and integrative aspects of RAG in
the field of NLP.
The paper contributes to the field of RAG research in several areas:
1. By implementing the PaSSER application, the study provides a practical framework
that can be used and expanded upon in future RAG research.
2. The paper illustrates the integration of RAG technology with blockchain, enhancing
data security and verifiability, which could inspire further exploration into the secure
and transparent application of RAG systems.
3. By comparing different LLMs within the same RAG framework, the paper provides
insights into the relative strengths and capabilities of the models, contributing knowl-
edge on model selection in RAG contexts.
4. The focus on applying and testing within the domain of smart agriculture adds to the
understanding of how RAG technology can be tailored and utilized in specific fields,
expanding the scope of its application and relevance.
5. The use of open-source technologies in PaSSER development allows the users to
review and trust the application’s underlying mechanisms. Moreover, it enables col-
laboration, provides flexibility to adapt to specific needs or research goals, reduces
development costs, facilitates scientific accuracy by enabling exact replication of re-
search setups, and serves as a resource for learning about RAG technology and LLMs
in practical scenarios.
The paper is organized as follows: Section 2 provides an overview of the development,
implementation, and functionalities of the PaSSER Web App; Section 3 discusses selected
standard NLP metrics used to measure RAG performance; Section 4 presents the results of
tests on the models; in Section 5, the limitations and influencing factors highlighted during
the testing are discussed; and Section 6 summarizes the results and future directions for
development.
Figure 2. PaSSER site map.
The ‘Create vectorstore’ feature, as depicted in Figure 3, outlines the process of converting raw textual data into a structured, queryable vector space using LangChain. This combination of NLP and vector embedding techniques makes it possible to convert text into a format convenient for vector operations. Users can source textual data from text files, PDFs, and websites. The outlined procedure for vectorstore creation is standardized across these data types, ensuring consistency in processing and storage. At the current phase, automatic retrieval of information from websites (scraping) is considered impractical due to the necessity for in-depth analysis of website structures and the requirement for extensive manual intervention to adequately structure the retrieved text. This process involves understanding varied and complex web layouts and imposing a tailored approach to effectively extract and organize data.
Figure 3. Vectorstore construction workflow.
1. Cleaning and standardizing text data. This is achieved by removing unnecessary characters (punctuation and special characters), converting the text to a uniform case (usually lower case), and separating the text into individual words or tokens. In the implementation considered here, the text is divided into chunks with different overlaps.
2. Vector embedding. The goal is to convert tokens (text tokens) into numeric vectors. This
is achieved by using pre-trained word embedding models from selected LLMs (in
this case, Mistral:7b, Llama2:7b, and Orca2:7b). These models map words or phrases
to high-dimensional vectors. Each word or phrase in the text is transformed into a
vector that represents its semantic meaning based on the context in which it appears.
3. Aggregating embeddings for larger text units to represent whole sentences or documents
as vectors. It can be achieved by simple aggregation methods (averaging the vectors of
all words in a sentence or document) or by using sentence transformers or document
embedding techniques that take into account the sequential and contextual
nature of words. Here, transformers are used, which are taken from the selected LLMs.
4. Create a vectorstore to store the vector representations in a structured format. The data structures used are optimized for operations with high-dimensional vectors. ChromaDB is used for the vectorstore (a minimal sketch of the whole pipeline is given after this list).
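The following is a minimal Python sketch of such a pipeline, assuming a recent LangChain release, a locally running Ollama server, and ChromaDB; the file name, collection name, and exact module paths (which vary between LangChain versions) are illustrative assumptions and do not reproduce the PaSSER source.

# Sketch of the vectorstore construction workflow (assumed names and paths).
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# 1. Load the source text (here: a PDF document).
documents = PyPDFLoader("knowledge_base.pdf").load()

# 2. Split the text into overlapping chunks (parameters used in the paper's tests).
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_documents(documents)

# 3. Embed each chunk with the transformer of the selected LLM served by Ollama.
embeddings = OllamaEmbeddings(model="mistral:7b", base_url="http://localhost:11434")

# 4. Store the vectors in a ChromaDB collection for later retrieval.
vectorstore = Chroma.from_documents(
    chunks,
    embedding=embeddings,
    collection_name="smart_agriculture",
    persist_directory="./chroma_db",
)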
Figure 4 defines the PaSSER App’s mechanisms for processing user queries and
generating responses. Figure 4a represents a general Q&A chat workflow with direct
input without the augmented context provided by a vectorstore. The corresponding LLM
processes the query, formulates a response, and concurrently provides system performance
data, including metrics such as total load and evaluation timeframes. Additionally, a
numerical array captures the contextual backdrop of the query and the response, drawn
from previous dialogue or related data, which the LLM utilizes similar to short-term
memory to ensure response relevance and coherence. While the capacity of this memory is
limited and not the focus of the current study, it is pivotal in refining responses based on
specific contextual elements such as names and dates. The App enables saving this context for continued dialogue and offers features for initiating new conversations by purging the existing context.
Figure 4b represents the ‘RAG Q&A chat’ workflow, in which the user query is first used to retrieve relevant records from the vectorstore. This data informs the subsequent query to the LLM, integrating the original
question, any prompts, and a context enriched by the vectorstore’s information. The LLM
then generates a response. Within the app, a dedicated memory buffer recalls history, which
the LLM utilizes as a transient context to ensure consistent and logical responses. The
limited capacity of this memory buffer and its impact on response quality is acknowledged,
though not extensively explored in this study. In the ‘RAG Q&A chat’, context-specific
details like names and dates are crucial for enhancing the relevance of responses.
The ‘Tests’ feature is designed to streamline the testing of various LLMs within a
specific knowledge domain. It involves the following steps:
1. Selection of a specific knowledge base in a specific domain.
With ‘Create vectorstore’, the knowledge base is processed and saved in the vector
database. In order to evaluate the performance of different LLMs for generating RAG
answers on a specific domain, it is necessary to prepare a sufficiently large list of questions
and reference answers. Such a list can be prepared entirely manually by experts in a
specific domain. However, this is a slow and time-consuming process. Another widely
used approach is to generate relevant questions based on reference answers given by a
selected LLM (i.e., creating respective datasets). PaSSER allows the implementation of the
second approach.
2. To create a reference dataset for a specific domain, a collection of answers related to
the selected domain is gathered. Each response contains key information related to
potential queries in that area. These answers are then saved in a text file format.
3. A selected LLM is deployed to systematically generate a series of questions corresponding to each predefined reference answer. This operation facilitates the creation of a structured dataset comprising pairs of questions and their corresponding answers. The finalized dataset is saved in the JSON file format.
4. The finalized dataset is uploaded to the PaSSER App, initiating an automated sequence of response generation for each query within the target domain. Following that, each generated response is forwarded to a dedicated Python backend script. This script is tasked with assessing the responses based on predefined metrics and comparing them to the established reference answers. The outcomes of this evaluation are then stored on the blockchain, ensuring a transparent and immutable ledger of the model’s performance metrics.
To facilitate this process, a smart contract ‘llmtest’ has been created, managing the interaction with the blockchain and providing a structured and secure method for storing and managing the assessment results derived from the LLM performance tests.
The provided pseudocode outlines the structure ‘tests’ and its methods within a blockchain environment, which were chosen to store test-related entries. It includes identifiers (id, userid, and testid), a timestamp (created_at), numerical results (results array), and descriptive text (description). It establishes id as the primary key for indexing, with additional indices based on created_at, userid, and testid to facilitate data retrieval and sorting by these attributes. This structure organizes and accesses test records within the blockchain.
The pseudocode below defines an eosio::multi_index table ‘tests_table’ for a blockchain, which facilitates the storage and indexing of data. It specifies four indices: a primary index based on id and secondary indices using the created_at, userid, and testid attributes for enhanced query capabilities. These indices optimize data retrieval operations, allowing for efficient access based on different key attributes like timestamp, user, and test identifiers, significantly enhancing the database’s functionality within the blockchain environment.
The provided pseudocode defines an EOSIO smart contract action named add_test, which allows adding a new record to the tests_table. It accepts the creator’s name, test ID, description, and an array of results as parameters. The action assigns a unique ID to the record, stores the current timestamp, and then inserts a new entry into the table using these details. This action helps in dynamically updating the blockchain state with new test information, ensuring that each entry is time-stamped and linked to its creator.
The pseudocodes provided above and in Section 3 are generated with GitHub Copilot upon the actual source code, available at “https://github.com/features/copilot (accessed on 1 April 2024)”.
5. The results from the blockchain are retrieved for further processing and analysis.
To facilitate the execution of these procedures, the interface is structured into three specific features: ‘Q&A dataset’ for managing question and answer datasets, ‘RAG Q&A score test’ for evaluating the performance of RAG utilizing datasets, and ‘Show test results’ for displaying the results of the tests. Each submenu is designed to streamline the respective aspect of the workflow, ensuring a coherent and efficient user experience throughout the process of dataset management, performance evaluation, and result visualization.
Within the ‘Q&A dataset’, the user is guided to employ a specific prompt, aiming to instruct the LLM to generate questions that align closely with the provided reference answers, as described in step 2. This operation initiates the creation of a comprehensive dataset, subsequently organizing and storing this information within a JSON file for future accessibility and analysis. This approach ensures the generation of relevant and accurate questions, thereby enhancing the dataset’s utility for follow-up evaluation processes.
The ‘RAG Q&A score test’ is designed to streamline the evaluation of different LLMs’ performances using the RAG, as indicated in Figure 5. This evaluation process involves importing a JSON-formatted dataset and linking it with an established vectorstore relevant to the selected domain. The automation embedded within this menu facilitates a methodical assessment of the LLMs, leveraging domain-specific knowledge embedded within the vectorstore.
Figure 5. Workflow diagram for RAG LLM query processing and score storage.
Vectorstores, once created using a specific LLM’s transformers, require the consistent
application of the same LLM model during the RAG process. Within this automated frame-
work, each question from the dataset is processed by the LLM to produce a corresponding
answer. Then, both the generated answers and their associated reference answers are
evaluated by a backend Python script. This script calculates performance metrics, records
these metrics on the blockchain under a specified test series, and iterates this procedure for
every item within the dataset.
The ‘Show test results’ feature is designed to access and display the evaluation
outcomes from various tests as recorded on the blockchain, presenting them in an organized
tabular format. This feature facilitates the visualization of score results for individual
answers across different test series and also provides the functionality to export this data
into an xlsx file format. The export feature simplifies further examination of the data, supporting more thorough evaluation and analysis.
The ‘Q&A Time LLM Test’ feature evaluates model performance across various hard-
ware setups using JSON-formatted question–answer pairs. Upon submission, the PaSSER
App prompts the selected model for responses, generating detailed performance metrics
like evaluation and load times, among others. These metrics are packed in a query to
a backend Python script, which records the data on the blockchain via the ‘addtimetest’
action, interacting with the ‘llmtest’ smart contract to ensure performance tracking and
data integrity.
The ‘Show time test results’ makes it easy to access and view LLM performance data,
organized by test series, from the blockchain. When displayed in a structured table, these
metrics can be examined for comprehensive performance assessment. There is an option
to export this data into an xlsx file, thereby improving the process for further in-depth
examination and analysis.
Authentication within the system (‘Login’) is provided through the Anchor wallet,
which is compliant with the security protocols of the SCPDx platform. This process,
described in detail in [34], provides user authentication by ensuring that testing activities
are securely associated with the correct user credentials. This strengthens the integrity and
accountability of the testing process within the platform ecosystem.
The ‘Configuration’ feature is divided into ‘Settings’ and ‘Add Model’.
The ‘Settings’ is designed for configuring connectivity to the Ollama API and Chro-
maDB API, using IP addresses specified in the application’s configuration file. It also
allows users to select an LLM that is currently installed in Ollama. A key feature here is
the ability to adjust the ‘temperature’ parameter, which ranges from 0 to 1, to fine-tune the
balance between creativity and predictability in the output generated by the LLM. Setting a
higher temperature value (>0.8) increases randomness, whereas a lower value enhances
determinism, with the default set at 0.2.
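For illustration, a generation request with an adjusted temperature can be sent to the Ollama REST API as in the sketch below; the host, port, model name, and prompt are assumptions rather than the PaSSER configuration.

import requests

# Query the Ollama API with a lowered temperature for more deterministic output.
# The default value used in PaSSER is 0.2.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral:7b",
        "prompt": "What does Regulation (EU) 2018/848 cover?",
        "stream": False,
        "options": {"temperature": 0.2},
    },
    timeout=300,
)
print(response.json()["response"])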
The ‘Add Model’ enables adding and removing LLMs in the Ollama API, allowing
dynamic model management. This feature is useful when testing different models, ensuring
optimal use of computational resources.
The ‘Manage DB’ feature displays a comprehensive list of vectorstores available in
ChromaDB, offering functionalities to inspect or interact with specific dataset records. This
feature enables users to view details within a record’s JSON response. It provides the option
to delete any vectorstore that is no longer needed, enabling efficient database management
by removing obsolete or redundant data, thereby optimizing storage utilization.
A block diagram representation of the PaSSER App’s operational logic that illustrates the interactions between the various components is provided in Figure 6.
The UI (web application) interacts with users for configuration, authentication, and operation initiation. It utilizes JavaScript and the PrimeReact library for UI components. It enables user interactions for authentication (Login), configuration, and operations with the LLMs and blockchain.
The web server (Apache) hosts the web application, facilitating communication between the user interface and backend components.
The LLM API and vector database utilize the Ollama API for the management of different LLMs. ChromaDB is incorporated for storage and retrieval of vectorized data.
Data pre-processing and vectorization standardize and convert data from various sources (e.g., PDFs, websites) into numerical vectors for LLM processing using pre-trained models of the selected LLMs.
The RAG Q&A chat facilitates query responses by integrating external data retrieval with LLM processing. It enables querying the LLMs with augmented information retrieved from the vector database for generating responses.
The built-in testing modules assess LLM performance across metrics, with results recorded on the blockchain.
The Python evaluation API calculates NLP performance metrics and interacts with the blockchain for recording the testing results via smart contracts.
Smart contracts manage the recording of test results on the blockchain.
3. Evaluation Metrics
The evaluation of RAG models within the PaSSER App was performed using a set of 13 standard NLP metrics. These metrics evaluated various dimensions of model performance, including the quality of text generation and summarization, semantic similarity, predictive accuracy, and consistency of generated content compared to reference or expected results. The metrics included METEOR, ROUGE (with ROUGE-1 and ROUGE-L variants), BLEU, perplexity (using Laplace and Lidstone smoothing techniques), cosine similarity, Pearson correlation coefficient, and F1 score.
The PaSSER App ran two main tests to assess the LLMs: the “LLM Q&A Time Test” and the “RAG Q&A Assessment Test”. The latter specifically applied the selected metrics to a created dataset of question–answer pairs for the smart agriculture domain. The test aimed to determine which model provides the most accurate and contextually relevant answers within the RAG framework and the capabilities of each model in the context of text generation and summarization tasks.
The ‘RAG Q&A chat’ was assessed using a set of selected metrics: METEOR, ROUGE,
PPL (perplexity), cosine similarity, Pearson correlation coefficient, and F1 score [35].
An automated evaluation process was developed to apply these metrics to the answers
generated using RAG. The process compared generated answers against the reference
answers in the dataset, calculating scores for each metric.
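Schematically, this loop can be thought of as in the following sketch; the function and field names are illustrative assumptions and do not reproduce the actual backEnd.py interface.

import json

def evaluate_dataset(dataset_path, generate_answer, score_answer, store_on_chain):
    """Illustrative evaluation loop: generate a RAG answer for every question,
    score it against the reference answer, and persist the metric values."""
    with open(dataset_path, encoding="utf-8") as f:
        qa_pairs = json.load(f)          # list of {"question": ..., "answer": ...}

    for item in qa_pairs:
        generated = generate_answer(item["question"])        # RAG pipeline call
        metrics = score_answer(generated, item["answer"])    # METEOR, ROUGE, BLEU, ...
        store_on_chain(item["question"], metrics)            # e.g., via the 'llmtest' contract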
All calculations were implemented in backEnd.py script in Python, available at:
“https://github.com/scpdxtest/PaSSER/blob/main/scripts/backEnd.py (accessed on
1 April 2024)”.
The following is a brief explanation of the purpose of the metrics used, the simplified
calculation formulas, and the application in the context of RAG.
3.1. METEOR (Metric for Evaluation of Translation with Explicit ORdering)
METEOR combines unigram precision P and recall R into a weighted harmonic mean, where F_mean = (10 · P · R) / (R + 9 · P).
The implementation of the calculations of Equations (1)–(3) is conducted with the
nltk library, single_meteor_score function, line 58 in Python script.
This pseudocode describes the process of splitting two texts into words and calculating the METEOR score between them.
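A minimal version of this computation with the nltk library is sketched below; the example sentences are illustrative, and the tokenization details in backEnd.py may differ.

import nltk
from nltk.translate.meteor_score import single_meteor_score

nltk.download("wordnet", quiet=True)  # resource required by the METEOR scorer

reference = "Organic production must respect natural systems and cycles."
candidate = "Organic production should respect natural cycles and systems."

# Both texts are split into words; recent nltk versions expect pre-tokenized input.
score = single_meteor_score(reference.split(), candidate.split())
print(round(score, 3))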
In the context of RAG models, the METEOR score can be used to evaluate the quality of the generated responses. A high METEOR score indicates that the generated response closely matches the reference text, suggesting that the model is accurately retrieving and generating responses. Conversely, a low METEOR score could indicate areas for improvement in the model’s performance.
3.2. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
ROUGE [37] is a set of metrics used for evaluating automatic summarization and machine translation. It works by comparing an automatically produced summary or translation against one or more reference summaries (usually human-generated).
ROUGE has several variants: ROUGE-N, ROUGE-L, and ROUGE-W.
ROUGE-N focuses on the overlap of n-grams (sequences of n words) between the
system-generated summary and the reference summaries. It is computed in terms of recall,
precision, and F1 score:
– Recall ROUGE-N is the ratio of the number of overlapping n-grams between the system
summary and the reference summaries to the total number of n-grams in the reference
summaries:
– Precision ROUGE-N is the ratio of the number of overlapping n-grams in the system
summary to the total number of n-grams in the system summary itself:
ROUGE-L focuses on the longest common subsequence (LCS) between the generated
summary and the reference summaries. The LCS is the longest sequence of words that ap-
pears in both texts in the same order, though not necessarily consecutively. The parameters
for ROUGE-L include:
– Recall ROUGE-L is the length of the LCS divided by the total number of words in
the reference summary. This measures the extent to which the generated summary
captures the content of the reference summaries:
– Precision ROUGE-L is the length of the LCS divided by the total number of words in the generated summary. This assesses the extent to which the words in the generated summary appear in the reference summaries:

Precision ROUGE-L = LCS(X, Y) / (total number of words in the generated summary)   (8)

– F1 ROUGE-L is a harmonic mean of the LCS-based precision and recall:

F1 ROUGE-L = 2 × (Precision ROUGE-L × Recall ROUGE-L) / (Precision ROUGE-L + Recall ROUGE-L)   (9)
ROUGE-W is an extension of ROUGE-L with a weighting scheme that assigns more importance to longer sequences of matching words. In this application, ROUGE-W is not applied.
The implementation of the calculations of Equations (4)–(9) is conducted with the rouge library, rouge.get_scores function, line 65 in the Python script.
This pseudocode describes the process of initializing a ROUGE object and calculating the ROUGE scores between two texts.
1. Set ‘hypothesis’ to the reference text and ‘ref’ to the candidate text
2. Initialize a Rouge object
3. Calculate the ROUGE scores between ‘hypothesis’ and ‘ref’ using the
‘get_scores’ method of the Rouge object
The choice between a preference for precision, recall, or F1 scoring depends on the specific goals of the summarization task, such as whether it is more important to capture as much information as possible (recall) or to ensure that what is captured is highly relevant (precision).
In the context of RAG models, the ROUGE metric serves as a tool for assessing the quality of the generated text, especially in summarization, question answering, and content-generation tasks.
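A minimal sketch of the get_scores call described in the pseudocode above is given below; the example sentences are illustrative assumptions.

from rouge import Rouge

hypothesis = "The regulation defines rules for organic production and labelling."
ref = "The regulation lays down rules on organic production and labelling of organic products."

rouge = Rouge()
# Returns ROUGE-1, ROUGE-2, and ROUGE-L scores with recall (r), precision (p), and F1 (f).
scores = rouge.get_scores(hypothesis, ref)
print(scores[0]["rouge-l"]["f"])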
3.3. BLEU (Bilingual Evaluation Understudy)
BLEU evaluates generated text by comparing its n-grams with those of one or more reference texts, using a modified (clipped) n-gram precision:

Pn = Σ Countclip(n-gram) / Σ Count(n-gram), with both sums taken over the n-grams of the candidate translation   (10)

where, Countclip is a count of each n-gram in the candidate translation clipped by its maximum count in any single reference translation.
The brevity penalty (BP) is a component of the BLEU score that ensures translations
are not only accurate but also of appropriate length. The BP is defined as:
BP = 1, if c > r; BP = e^(1−r/c), if c ≤ r   (11)
where, c is the total length of the candidate translation, and r is the effective reference
corpus length, which is the sum of the lengths of the closest matching reference translations
for each candidate sentence.
BP = 1 if the candidate translation length c is greater than the reference length r,
indicating no penalty.
BP = e^(1−r/c) if c is less than or equal to r, indicating a penalty that increases as the
candidate translation becomes shorter relative to the reference.
The overall BLEU score is calculated using the following formula:

BLEU = BP · exp(Σn=1..N wn · log Pn)   (12)
where, N is the maximum n-gram length (typically 4), and wn is the weight for each n-gram’s precision score, often set equally such that their sum is 1 (e.g., wn = 0.25 for N = 4).
This formula aggregates the individual modified precision scores Pn for n-grams of length 1 to N, geometrically averaged and weighted by wn, then multiplied by the brevity penalty BP to yield the final BLEU score.
The implementation of the calculations of Equations (10)–(12) is conducted with the nltk library, sentence_bleu and SmoothingFunction functions, lines 74–79 in the Python script.
This pseudocode describes the process of splitting two texts into words, creating a smoothing function, and calculating the BLEU score between them.
In the context of RAG models, the BLEU score can be used to evaluate the quality of the generated responses. A high BLEU score would indicate that the generated response closely matches the reference text, suggesting that the model is accurately retrieving and generating responses. A low BLEU score could indicate areas for improvement in the model’s performance.
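A minimal nltk sketch of the same steps, with illustrative sentences and the equal weights discussed above:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Organic farming relies on crop rotation and green manure.".split()
candidate = "Organic farming is based on crop rotation and green manure.".split()

# Smoothing avoids zero scores when some higher-order n-grams have no overlap.
smoothie = SmoothingFunction().method1
score = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smoothie)
print(round(score, 3))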
3.4. Perplexity (PPL)
Perplexity (PPL) [39] is a measure used to evaluate the performance of probabilistic
language models. The introduction of smoothing techniques, such as Laplace (add-one)
smoothing and Lidstone smoothing [40], aims to address the issue of zero probabilities for
unseen events, thereby enhancing the model’s ability to deal with sparse data. Below are
the formulas for calculating perplexity.
– PPL with Laplace Smoothing adjusts the probability estimation for each word by adding
one to the count of each word in the training corpus, including unseen words. This
method ensures that no word has a zero probability. The adjusted probability estimate
with Laplace smoothing is calculated using the following formula:
PLaplace(wi | h) = (C(wi, h) + 1) / (C(h) + V)   (13)
where, PLaplace(wi | h) is the probability of word wi given its history h (the words that precede it), C(wi, h) is the count of wi following h, C(h) is the count of history h, and V is the vocabulary size (the number of unique words in the training set plus one for unseen words).
The PPL of a sequence of words W = w1, ..., wN is given by:

PPL(W) = e^(−(1/N) Σi=1..N ln(PLaplace(wi | h)))   (14)

The implementation of the calculations of Equations (13) and (14) is conducted with the nltk library, lines 84–102, in the Python script.
This pseudocode describes the process of tokenizing an input text paragraph, training a Laplace model (bigram model), and calculating the perplexity of a candidate text using the model.
1. Tokenize the input text paragraph into sentences and words, convert all
words to lowercase
2. Split the tokenized text into training data and vocabulary using a bigram
model
3. Train a Laplace model (bigram model) using the training data and vocabulary
4. Define a function ‘calculate_perplexity’ that:
a. Tokenizes the input text into words, converts all words to lowercase
b. Calculates the perplexity of the text using the Laplace model
5. Set ‘test_text’ to the candidate text
6. Calculate the Laplace perplexity of ‘test_text’ using the
‘calculate_perplexity’ function
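A hedged sketch of these steps with nltk’s language-modelling module is given below; the example texts are illustrative, and the exact tokenization and data preparation in backEnd.py may differ.

import nltk
from nltk import sent_tokenize, word_tokenize
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

nltk.download("punkt", quiet=True)

def laplace_perplexity(reference_text, candidate_text, order=2):
    # Tokenize the reference text into lowercase words per sentence (training data).
    train = [[w.lower() for w in word_tokenize(s)] for s in sent_tokenize(reference_text)]
    train_data, vocab = padded_everygram_pipeline(order, train)

    # Train a bigram model with Laplace (add-one) smoothing.
    model = Laplace(order)
    model.fit(train_data, vocab)

    # Score the candidate text as padded n-grams with the trained model.
    test = [[w.lower() for w in word_tokenize(s)] for s in sent_tokenize(candidate_text)]
    test_data, _ = padded_everygram_pipeline(order, test)
    ngrams = [ng for sent in test_data for ng in sent]
    return model.perplexity(ngrams)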
– PPL with Lidstone smoothing is a generalization of Laplace smoothing where, instead of adding one to each count, a fraction λ (where 0 < λ < 1) is added. This allows for more flexibility compared to the fixed increment in Laplace smoothing. The adjusted probability estimate with Lidstone smoothing is:

PLidstone(wi | h) = (C(wi, h) + λ) / (C(h) + λV)   (15)

The PPL of a sequence of words W = w1, ..., wN is given by:

PPL(W) = e^(−(1/N) Σi=1..N ln(PLidstone(wi | h)))   (16)

The implementation of the calculations of Equations (15) and (16) is conducted with the nltk library, lines 108–129, in the Python script.
This pseudocode describes the process of tokenizing an input text paragraph, training a Lidstone model (trigram model), and calculating the perplexity of a candidate text using the model.
1. Set the training text to the reference text
2. Tokenize the training text into sentences and then into words, convert all words to lowercase
3. Prepare the training data for a trigram model
4. Create and train a Lidstone model with Lidstone smoothing, where gamma is the Lidstone smoothing parameter
5. Set the test text to the candidate text
6. Tokenize the test text into sentences and then into words, convert all words to lowercase
7. Prepare the test data
8. Calculate the Lidstone perplexity of the test text
In both formulas, the goal is to compute how well the model predicts the test set W. The lower perplexity indicates that the model predicts the sequence more accurately. The choice between Laplace and Lidstone smoothing depends on the specific requirements of the model and dataset, as well as empirical validation.
In the context of RAG models, both metrics are useful for assessing the quality and ability of models to deal with a variety of language and information. These metrics indicate how well they can generate contextually informed, linguistically coherent, and versatile text.
3.5. Cosine Similarity
Cosine similarity [41] is a measure of vector similarity and can be used to determine the distance of embeddings between the chunk and the query. It is a distance metric that approaches 1 when the question and chunk are similar and becomes 0 when they are different. The mathematical formulation of the metric is:

Cosine Similarity = (A · B) / (‖A‖ ‖B‖) = Σi=1..n Ai Bi / (√(Σi=1..n Ai²) · √(Σi=1..n Bi²))   (17)

where, A · B is the dot product of vectors A and B, ‖A‖ and ‖B‖ are the Euclidean norms (magnitudes) of vectors A and B, calculated with √(Σi=1..n Ai²) and √(Σi=1..n Bi²), respectively, and n is the dimensionality of the vectors, assuming A and B have the same dimension.
Cosine Similarity = 1 means the vectors are identical in orientation.
Cosine Similarity = 0 means the vectors are orthogonal (independent) to each other.
Cosine Similarity = −1 means the vectors are diametrically opposed.
The implementation of the calculation of Equation (17) is conducted with the transformers library, lines 133–164, in the Python script.
This pseudocode describes the process of tokenizing two texts, generating BERT embeddings for them, and calculating the cosine similarity between the embeddings. The [CLS] token is used as the aggregate representation for classification tasks.
In RAG models, cosine similarity ensures that retrieved documents align closely with user queries, capturing relationships between the meaning of a user query and the stored chunks. This is particularly important in RAG models, as they leverage a retriever to find context documents. The use of cosine similarity between embeddings ensures that these retrieved documents align closely with user queries.

3.6. Pearson Correlation Coefficient
The Pearson correlation coefficient measures the strength and direction of the linear relationship between two variables X and Y:

r = Σi=1..n (Xi − X̄)(Yi − Ȳ) / (√(Σi=1..n (Xi − X̄)²) · √(Σi=1..n (Yi − Ȳ)²))   (18)

where, n is the number of data points, Xi and Yi are the individual data points, and X̄ and Ȳ are the means of the X and Y data sets, respectively.
The implementation of the calculation of Equation (18) is conducted with the transformers and scipy libraries, lines 167–178, in the Python script.
This pseudocode describes the process of tokenizing two texts, generating BERT embeddings for them, and calculating the Pearson correlation coefficient between the embeddings. The mean of the last hidden state of the embeddings is used as the aggregate representation.
In the context of evaluating RAG models, the Pearson correlation coefficient can be
used to measure how well the model’s predictions align with actual outcomes. A coeffi-
cient close to +1 indicates a strong positive linear relationship, meaning as one variable
increases, the other also increases. A coefficient close to -1 indicates a strong negative linear
relationship, meaning as one variable increases, the other decreases. A coefficient near 0
suggests no linear correlation between variables. In the evaluation of RAG models, a high
Pearson correlation coefficient could indicate that the model is accurately retrieving and
generating responses, while a low coefficient could suggest areas for improvement.
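A compact sketch of how both embedding-based scores can be obtained with the transformers library is given below; the bert-base-uncased checkpoint, the pooling choices, and the example texts are assumptions and may differ from the actual backEnd.py implementation.

import torch
from scipy.stats import pearsonr
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    """Return the [CLS] vector and the mean of the last hidden state for a text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_size)
    return hidden[0, 0], hidden[0].mean(dim=0)

cls_a, mean_a = embed("Generated answer about organic certification.")
cls_b, mean_b = embed("Reference answer about organic certification.")

# Equation (17): cosine similarity between the [CLS] representations.
cosine = torch.nn.functional.cosine_similarity(cls_a, cls_b, dim=0).item()

# Equation (18): Pearson correlation between the mean-pooled embeddings.
pearson, _ = pearsonr(mean_a.numpy(), mean_b.numpy())
print(round(cosine, 3), round(pearson, 3))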
3.7. F1 Score
In the context of evaluating the performance of RAG models, the F1 score [42] is used
for quantitatively assessing how well the models perform in tasks for generating or retriev-
ing textual information (question answering, document summarization, or conversational
AI). The evaluation often hinges on their ability to accurately and relevantly generate text
that aligns with reference or ground truth data.
The F1 score is the harmonic mean of precision and recall. Precision assesses the portion
of relevant information in the responses generated by the RAG model. High precision
indicates that most of the content generated by the model is relevant to the query or task at
hand, minimizing irrelevant or incorrect information. Recall (or sensitivity) evaluates the
model’s ability to capture all relevant information from the knowledge base that should be
included in the response. High recall signifies that the model successfully retrieves and
incorporates a significant portion of the pertinent information available in the context.
The formula for calculating it is:

F1 = 2 × (Precision × Recall) / (Precision + Recall)   (19)

Precision and Recall are defined as:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN)   (20)
where, TP (True Positives) is the count of correctly retrieved relevant documents, FP (False Positives) is the count of incorrectly retrieved documents (i.e., the documents that were retrieved but are not relevant), and FN (False Negatives) is the count of relevant documents that were not retrieved.
The implementation of the calculations of Equations (19) and (20) occurs on lines 185–204 in the Python script.
This pseudocode describes the process of tokenizing two texts, counting the common tokens between them, and calculating the F1 score.
For tasks of question answering, the F1 score can be used to measure how well the generated answers match the expected answers, considering both the presence of correct information (high precision) and the completeness of the answer (high recall).
For tasks of document summarization, the F1 score might evaluate the overlap between
the key phrases or sentences in the model-generated summaries and those in the reference
summaries, reflecting the model’s efficiency in capturing essential information (recall) and
avoiding extraneous content (precision).
For‚ conversational AI applications, the F1 score could assess the relevance and com-
pleteness of the model’s responses in dialogue, ensuring that responses are both pertinent
to the conversation context and comprehensive in addressing users’ intents or questions.
4. Testing
The aim of the tests presented in this section is to evaluate the performance of the Mistral:7b, Llama2:7b, and Orca2:7b models installed on two different hardware configurations and to assess the performance of these models in generating answers using RAG on the selected knowledge domain, smart agriculture.
The knowledge base was retrieved from EU Regulation 2018/848 "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32018R0848 (accessed on 1 April 2024)" and the Climate-Smart Agriculture Sourcebook "https://www.fao.org/3/i3325e/i3325e.pdf (accessed on 1 April 2024)". These documents were pre-processed manually and vectorized using the transformers of Mistral:7b, Llama2:7b, and Orca2:7b LLMs under the following parameters: chunk size—500, overlapping—100, temperature—0.2.
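A minimal sketch of such a vectorization step, assuming the LangChain wrappers for Ollama embeddings and the Chroma vectorstore, is given below; the file name, persist directory, and loader choice are illustrative, and only the chunk size and overlap parameters are reflected in this sketch.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Load a pre-processed source document and split it into overlapping chunks.
documents = PyPDFLoader("climate_smart_agriculture_sourcebook.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_documents(documents)

# Embed the chunks with one of the selected models and persist the vectorstore.
embeddings = OllamaEmbeddings(model="mistral:7b")
vectorstore = Chroma.from_documents(chunks, embeddings,
                                    persist_directory="./chroma_smart_agriculture")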
The dataset containing reference answers specific to smart agriculture was compiled
and stored as a text file. The Mistral:7b model was deployed to formulate questions based
on these reference answers. Initial trials indicated that Mistral:7b excelled in generating
questions with high relevance within this particular domain. To initiate the question
generation process, the following prompt was employed: “Imagine you are a virtual
assistant trained in the detailed regulations of organic agriculture. Your task involves
creating precise questions for a specific regulatory statement provided to you below. The
statement comes directly from the regulations, and your challenge is to reverse-engineer
the question that this statement answers. Your formulated question should be concise, clear,
and directly related to the content of the statement. Aim to craft your question without
implying the statement itself as the answer, and, where relevant, gear your question toward
eliciting specific events, facts, or regulations.”
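A minimal sketch of this question-generation step, assuming Ollama's default local REST endpoint, is shown below; the reference statement is an invented placeholder and the prompt prefix abbreviates the text quoted above.

import requests

PROMPT_PREFIX = ("Imagine you are a virtual assistant trained in the detailed "
                 "regulations of organic agriculture. ...")  # full prompt as quoted above
statement = "Organic production shall rely on renewable resources within local agricultural systems."

payload = {
    "model": "mistral:7b",
    "prompt": f"{PROMPT_PREFIX}\n\nStatement: {statement}",
    "stream": False,
    "options": {"temperature": 0.2},
}

# Ollama's generate endpoint on its default port; the reply text holds the generated question.
response = requests.post("http://localhost:11434/api/generate", json=payload, timeout=300)
print(response.json()["response"])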
For testing purposes, two Ollama APIs installed on two different hardware configura-
tions were used:
- Intel Xeon, 32 Cores, 0 GPU, 128 GB RAM, Ubuntu 22.04 OS.
- Mac Mini M1, 8 CPU, 10 GPU, 16 GB RAM, OSX 13.4.
In the PaSSER App, the installed Ollama APIs to be used can be selected. This is set in the configuration->settings menu.
The two following tests were designed: testing via the ‘Q&A Time LLM Test’ and the
‘RAG Q&A score test’.
The ‘Q&A Time LLM Test’ evaluated LLM performance across two hardware configu-
rations using a dataset of 446 questions for each model, focusing on seven specific metrics
(evaluation time, evaluation count, load duration time, prompt evaluation count, prompt
evaluation duration, total duration, and tokens per second). These metrics were integral for
analyzing the efficiency and responsiveness of each model under different computational
conditions. The collected data was stored on a blockchain, ensuring both transparency and
traceability of the evaluation results.
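The seven metrics correspond closely to the duration counters that Ollama returns with a non-streamed reply (durations reported in nanoseconds). The sketch below shows one way they could be collected; the response field names are assumed to follow Ollama's current API format and may vary between versions.

import requests

reply = requests.post("http://localhost:11434/api/generate",
                      json={"model": "orca2:7b",
                            "prompt": "What is crop rotation?",
                            "stream": False},
                      timeout=300).json()

# Durations are reported in nanoseconds; convert to seconds for readability.
metrics = {
    "evaluation time (s)": reply["eval_duration"] / 1e9,
    "evaluation count": reply["eval_count"],
    "load duration (s)": reply["load_duration"] / 1e9,
    "prompt evaluation count": reply["prompt_eval_count"],
    "prompt evaluation duration (s)": reply["prompt_eval_duration"] / 1e9,
    "total duration (s)": reply["total_duration"] / 1e9,
    "tokens per second": reply["eval_count"] / (reply["eval_duration"] / 1e9),
}
print(metrics)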
The ‘RAG Q&A score test’ aimed to evaluate the performance of the models based on 13 metrics (METEOR, ROUGE-1, ROUGE-l, BLEU, perplexity, cosine similarity, Pearson correlation, and F1) applied to each of the 446 question–reference answer pairs for which RAG obtained answers.
The ‘RAG Q&A score test’ evaluated the performance of different models in a chat
environment with enhanced RAG Q&A, identifying differences and patterns in their
ability to respond to queries. Its goal was to determine the model that best provided
accurate, context-aware responses that defined terms and summarized specific content.
This evaluation can be used to select a model that ensures the delivery of accurate and
relevant information in the context of the specific knowledge provided.
The performance outcomes from the ‘Q&A Time LLM Test’ and ‘RAG Q&A score test’ for evaluating LLMs were stored on the blockchain via smart contracts. For analysis, this data was retrieved from the blockchain and stored in an xlsx file. This file was uploaded to GitHub "https://github.com/scpdxtest/PaSSER/blob/main/tests/TEST%20DATA_GENERAL%20FILE.xlsx (accessed on 1 April 2024)".
In the upcoming section, the focus is solely on presenting and analysing the mean
values derived from the test data. This approach eases the interpretation, enabling a
summarized review of the core findings and trends across the conducted evaluations.
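For instance, the mean values could be derived from a local copy of the exported spreadsheet with pandas; this is a sketch only, and the column names used below are assumptions for illustration, not the actual headers of the file.

import pandas as pd

# Local copy of the xlsx file exported from the blockchain records.
df = pd.read_excel("TEST DATA_GENERAL FILE.xlsx")

# Mean of selected timing metrics per model and hardware configuration
# (column names are illustrative).
summary = df.groupby(["model", "hardware"])[
    ["total_duration", "eval_duration", "tokens_per_second"]].mean()
print(summary)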
Figure 9. Performance of Orca2:7b.
Across all models, several trends are evident (Figures 7–9). UBUNTU generally shows longer evaluation times, indicating slower processing capabilities compared to MAC OS. Evaluation counts are relatively comparable, suggesting that the number of operations conducted within a given timeframe is similar across hardware configurations. Load duration times are consistently longer on UBUNTU, affecting readiness and response times negatively. UBUNTU tends to conduct more prompt evaluations, but also takes significantly longer, which exposes efficiency issues. UBUNTU experiences longer total durations for all tasks, reinforcing the trend of slower overall performance. MAC OS demonstrates higher tokens per second across all models, indicating more efficient data processing capabilities.
The performance indicators (Table 1) suggest that across all models, the evaluation time on the Mac M1 system is significantly less than on the Ubuntu system with Xeon processors, indicating faster overall performance. In terms of tokens per second, the Mac M1 also performs better, suggesting it is more efficient at processing information despite having fewer CPU cores and less RAM.
Table 1. Comparative performance metrics of Llama2:7b, Mistral:7b, and Orca2:7b LLMs on macOS
M1 and Ubuntu Xeon Systems (w/o GPU).
Despite the Ubuntu system's higher core count and larger RAM, its evaluation time is longer and its tokens-per-second rate is lower. This suggests that the hardware advantages of the Xeon system do not translate into performance gains for these particular models. Notably, the Ubuntu system shows a higher prompt evaluation count for Orca2:7b, which might be leveraging the greater number of CPU cores to handle more prompts simultaneously.
Orca2:7b has the lowest evaluation time on the Mac M1 system, showcasing the most ef-
ficient utilization of that hardware. Llama2:7b shows a significant difference in performance
between the two systems, indicating it may be more sensitive to hardware and operating
system optimizations. Mistral:7b has a comparatively closer performance between the two
systems, suggesting it may be more adaptable to different hardware configurations.
The table suggests that the Mac M1’s architecture provides a significant performance
advantage for these language models over the Ubuntu system equipped with a Xeon
processor. This could be due to several factors, including but not limited to the efficiency of
the M1 chip, the optimization of the language models for the specific architectures, and the
potential use of the M1’s GPU in processing.
For a more straightforward interpretation of the results, the ranges of values of the different metrics are briefly described below.
The ideal METEOR score is 1. It indicates a perfect match between the machine-generated text and the reference translations, encompassing both semantic and syntactic accuracy. For ROUGE metrics (ROUGE-1 recall, precision, f-score; ROUGE-l recall, precision, f-score), the best possible value is 1. This value denotes a perfect overlap between the content generated by the model and the reference content, indicating high levels of relevance and precision in the captured information. The BLEU score's maximum is also 1 (or 100 when expressed in percentage terms), representing an exact match between the machine's output and the reference texts, reflecting high coherence and context accuracy. For perplexity, the lower the value, the better the model's predictive performance. The best perplexity score would technically approach 1, indicating the model's predictions are highly accurate with minimal uncertainty. A cosine similarity of 1 signifies maximum similarity between the generated output and the reference. A Pearson correlation of 1 is ideal, signifying a perfect positive linear relationship between the model's outputs and the reference data, indicating high reliability of the model's performance. An F1 score reaches its best at 1, representing perfect precision and recall, meaning the model has no false positives or false negatives in its output. For a better comparison of the models, Figure 10 is presented.
Figure 10. A comparison of performance metrics.
The presented metrics provide a picture of the performance of the models on text generation and summarization tasks. The analysis for each metric is as follows.
METEOR evaluates the quality of translation by aligning the model output to reference
translations when considering precision and recall. Mistral:7b scores highest, suggesting its
translations or generated text are the most accurate.
ROUGE-1 recall measures the overlap of unigrams between the generated summary
and the reference. A higher score indicates more content overlap. Mistral:7b leads, which
implies it includes more of the reference content in its summaries or generated text.
ROUGE-1 precision (the unigram precision). Mistral:7b has the highest score, indicating
that its content is more relevant and has fewer irrelevant inclusions.
ROUGE-1 F-score is the harmonic mean of precision and recall. Orca2:7b leads slightly,
indicating a balanced trade-off between precision and recall in its content generation.
ROUGE-L recall measures the longest common subsequence and is good at evaluating
sentence-level structure similarity. Mistral:7b scores the highest, showing it is better at capturing
longer sequences from the reference text.
ROUGE-L precision. Mistral:7b again scores highest, indicating it includes longer, relevant
sequences in its summaries or generated text without much irrelevant information.
ROUGE-L F-Score. Orca2:7b has a marginally higher score, suggesting a balance in
precision and recall for longer content blocks.
BLEU assesses the quality of machine-generated translation. Mistral:7b outperforms the
others, indicating its translations may be more coherent and contextually appropriate.
“rows”: [{
“supply”: “10000000000.0000 RAMCORE”,
“base”: {
“balance”: “68660625616 RAM”,
“weight”: “0.50000000000000000”
},
“quote”: {
“balance”: “1000857.1307 SYS”,
“weight”: “0.50000000000000000”
}
}
]
In order to apply the Bancor algorithm for RAM pricing to our private network, the following clarifications should be made. The Antelope blockchain network has a so-called RAM token. The PaSSER uses our private Antelope blockchain network, whose system token is SYS. In the context of the Bancor algorithm, RAM and SYS should be considered Smart Tokens. The Smart Token is a token that has one or more connectors with other tokens in the network. The connector, in this case, is a SYS token, and it establishes a relationship between SYS and RAM. Using the Bancor algorithm [43,44] could be presented as follows:

RAM Price = cb/(STos × CW), CW = cb/STtv =>
RAM Price = cb/(STos × cb/STtv) = STtv/STos    (21)
where cb is the connector balance, STos is the Smart Token's outstanding supply (= base.balance), and STtv is the Smart Token's total value (= connector balance = quote.balance).
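As a quick check, the rammarket snapshot shown above can be substituted into Equation (21); treating base.balance as bytes and converting to KiB is an assumption made only for illustration.

# Values taken from the rammarket row listed above.
base_balance = 68_660_625_616      # STos: outstanding RAM supply (assumed to be bytes)
quote_balance = 1_000_857.1307     # cb = STtv: connector balance (SYS)

# Equation (21): RAM Price = STtv / STos
ram_price_per_byte = quote_balance / base_balance
ram_price_per_kib = ram_price_per_byte * 1024
print(f"{ram_price_per_byte:.10f} SYS/byte = {ram_price_per_kib:.8f} SYS/KiB")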
The cost evaluation of RAM, CPU, and NET resources in SYS tokens during a test
execution occurs as follows. The PaSSER App uses SYS tokens. The CPU price is measured
in (SYS token/ms/Day) and is valid for a specific account on the network. The NET Price
is measured in (SYS token/KiB/Day) and is valid for a specific account on the network.
This study only considers the cost of the RAM resources required to run the tests.
Data on the current market price of the RAM resource is retrieved every 60 min from an oracle [16] that runs within the SCPDx platform, whose blockchain infrastructure is being used.
The current price of RAM is 0.01503345 SYS/kB as of 23 March 2024, 2:05 PM. Assum-
ing the price of 1 SYS is equal to the price of 1 EOS, it is possible to compare the price of the
RAM used if the tests are run on the public Antelope blockchain because there is a quote
of the EOS RAM Token in USD and it does not depend on the account used. The quote is
available at “https://coinmarketcap.com/ (accessed on 23 March 2024)”.
Table 3 shows that in terms of RAM usage and the associated costs in SYS and USD, the
score tests require more resources than the timing tests. The total cost of using blockchain
resources for these tests is less than 50 USD. This gives reason to assume that using
blockchain to manage and document test results has promise. The RAM price, measured in
SYS per kilobyte (kB), remains constant across different tests in the blockchain network.
This means that blockchain developers and users can anticipate and plan for the costs
associated with their blockchain operations. The blockchain resource pricing model is
designed to maintain a predictable and reliable cost structure. This predictability mat-
ters for the long-term sustainability and scalability of blockchain projects as it allows for
accurate cost estimation and resource allocation. However, the real value of implement-
ing blockchain must also consider the benefits of increased transparency, security and
traceability against these costs.
5. Discussion
The PaSSER App testing observations reveal several aspects that affect the acquired results and the performance and that can be managed: data cleaning and pre-processing, chunk sizes, GPU usage, and RAM size.
Data cleaning and pre-processing cannot be fully automated. In addition to removing
special characters and hyperlinks, it is also necessary to remove non-essential or erroneous
information, standardize formats, and correct errors from the primary data. This is done
manually. At this stage, the PaSSER App processes only textual information; therefore, the
normalization of data, handling of missing data, and detection and removal of deviations
are not considered.
Selecting documents with current, validated, and accurate data is pivotal, yet this
process cannot be entirely automated. What can be achieved is to ensure traceability
and record the updates and origins of both primary and processed data, along with their
secure storage. Blockchain and distributed file systems can be used for this purpose.
Here, this objective is partially implemented since blockchain is used solely to record the
testing results.
The second aspect is chunk sizes when creating and using vectorstores. Smaller chunks
require less memory and computational resources. This is at the expense of increased
iterations and overall execution time, which is balanced by greater concurrency in query
processing. On the other hand, larger chunks provide more context but may be more
demanding on resources and potentially slow down processes if not managed efficiently.
Adjusting the chunk size affects both the recall and precision of the results. Adequate
chunk size is essential to ensure a balance between the retrieval and generation tasks in
RAG, as over- or undersized chunks can negatively impact one or both components. In
the tests, 500-character chunk sizes were found to give the best results. In this particular implementation, no added metadata (such as document type or section labels) is used in the vectorstore creation process; such metadata would facilitate more targeted processing when using smaller chunks.
GPU usage and RAM size obviously affect the performance of the models. It is evident
from the results that hardware configurations that do not use the GPU perform significantly
slower on the text generation and summarization tasks. Models with fewer parameters
(up to 13b) can run reasonably well on 16 GB RAM configurations. Larger models need
more resources, in terms of both RAM and GPU. This is the reason why, in this particular implementation, the selected small LLMs were used, as they are suitable for standard and commonly available hardware configurations and platforms.
It is important to note that the choice of a model may vary depending on the specific
requirements of a given task, including the desired balance between creativity and accuracy,
the importance of fluency versus content fidelity, and the computational resources available.
Therefore, while Mistral:7b appears to be the most versatile and capable model based on
the provided metrics, the selection of a model should be guided by the specific objectives
and constraints of the application in question.
While promising, the use of RAG-equipped LLMs requires caution regarding data
accuracy, privacy concerns, and ethical implications. This is of particular importance in the
healthcare domain, where the goal is to assist medical professionals and researchers by ac-
cessing the latest medical research, clinical guidelines, and patient data, as well as assisting
diagnostic processes, treatment planning, and medical education. Pre-trained open-source
models can be found on Huggingface [45]. For example, TheBloke/medicine-LLM-13B-
GPTQ is used for medical question answering, patient record summarization, aiding
medical diagnosis, and general health Q&A [46]. Another model is m42-health/med42-
70b [47]. However, this application requires measures to ensure accuracy, privacy, and
compliance with health regulations.
6. Conclusions
This paper presented the development, integration, and use of the PaSSER web
application, designed to leverage RAG technology with LLMs for enhanced document
retrieval and analysis. Despite the explicit focus on smart agriculture as the chosen specific
domain, the application can be used in other areas.
The web application integrates the Mistral:7b, Llama2:7b, and Orca2:7b LLMs, selected for their performance and compatibility with hardware of medium computational capacity. It has built-in testing modules that evaluate the performance of the LLMs in real time by a set of 13 evaluation metrics (METEOR; ROUGE-1 recall, precision, f-score; ROUGE-l recall, precision, f-score; BLEU; Laplace perplexity; Lidstone perplexity; cosine similarity; Pearson correlation; F1 score).
The LLMs were tested via the ‘Q&A Time LLM Test’ and ‘RAG Q&A score test’
functionalities of the PaSSER App. The ‘Q&A Time LLM Test’ was focused on assessing
LLMs across two hardware configurations. From the results of the ‘Q&A Time LLM Test’, it
can be concluded that even when working with 7b models, the presence of GPUs is crucial
for text generation speed. The lowest total duration times were shown by Orca2:7b on the
Mac M1 system. From the results of the ‘RAG Q&A Score Test’ applied to the selected
metrics over the dataset of 446 question–answer pairs, the Mistral:7b model exhibited
superior performance.
The PaSSER App leverages a private, permissionless Antelope blockchain network
for documenting and verifying results from LLMs’ testing. The system operates on a
token-based economy (SYS) to manage RAM, CPU, and NET resources. RAM usage and
associated costs, measured in SYS and USD, indicate that the total cost for blockchain re-
sources for conducted tests remains below 50 USD. This pricing model guarantees reliability
and predictability by facilitating accurate cost estimations and efficient resource distribu-
tion. Beyond the monetary aspects, the value of implementing blockchain encompasses
increased transparency, security, and traceability, highlighting its benefits.
Future development will focus on leveraging other pre-trained open-source LLMs (over 40b parameters), exploring fine-tuning approaches, and further integration with the existing Antelope blockchain/IPFS infrastructure of the SCPDx platform.
Author Contributions: Conceptualization, I.R. and I.P.; methodology, I.R. and I.P.; software, I.R. and
M.D.; validation, I.R. and M.D.; formal analysis, I.P. and I.R.; investigation, L.D.; resources, L.D.;
data curation, I.R. and M.D.; writing—original draft preparation, I.R. and I.P.; writing—review and
editing, I.R., I.P. and M.D.; visualization, I.R. and M.D.; supervision, I.P.; project administration, L.D.;
funding acquisition, I.R., I.P. and L.D. All authors have read and agreed to the published version of
the manuscript.
Funding: This work was supported by the Bulgarian Ministry of Education and Science under
the National Research Program “Smart crop production” approved by the Ministry Council No.
866/26.11.2020.
Data Availability Statement: All data and source codes are available at: “https://github.com/
scpdxtest/PaSSER (accessed on 1 April 2024). Git Structure: ‘README.md’—information about
the project and instructions on how to use it; ‘package.json’- the list of project dependencies and
other metadata; ‘src’- all the source code for the project; ‘src/components’—all the React compo-
nents for the project; ‘src/components/configuration.json’—various configuration options for the app;
‘src/App.js’—the main React component that represents the entire app; ‘src/index.js’—JavaScript entry
point file; ‘public’—static files like the ‘index.html’ file; ‘scripts’—Python backend scripts; ‘Installation
Instructions.md’—contains instructions on how to install and set up the project.
Conflicts of Interest: The authors declare no conflicts of interest. The funders had no role in the design
of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or
in the decision to publish the results.
References
1. Howard, J.; Ruder, S. Universal Language Model Fine-Tuning for Text Classification. In Proceedings of the 56th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 1 July 2018. [CrossRef]
2. Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-
Efficient Transfer Learning for NLP. No. 97. In Proceedings of the 36th International Conference on Machine Learning, Long
Beach, CA, USA, 9–15 June 2019; Chaudhuri, K., Salakhutdinov, R., Eds.; pp. 2790–2799.
3. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al.
Language Models Are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. Available online: https://arxiv.org/abs/2005.14165v4
(accessed on 26 March 2024).
4. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv 2020, arXiv:2005.11401. Available online: http:
//arxiv.org/abs/2005.11401 (accessed on 2 February 2024).
5. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Guo, Q.; Wang, M.; et al. Retrieval-Augmented Generation for
Large Language Models: A Survey. arXiv 2023, arXiv:2312.10997. Available online: http://arxiv.org/abs/2312.10997 (accessed on
18 February 2024).
6. Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W. Dense Passage Retrieval for Open-Domain
Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP),
Online, 1 November 2020. [CrossRef]
7. Guu, K.; Lee, K.; Tung, Z.; Pasupat, P.; Chang, M. Retrieval Augmented Language Model Pre-Training. Proc. Mach. Learn. Res.
2020, 119, 3929–3938.
8. Izacard, G.; Grave, E. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume,
Online, 20 April 2021. [CrossRef]
9. GitHub. GitHub—Scpdxtest/PaSSER. Available online: https://github.com/scpdxtest/PaSSER (accessed on 8 March 2024).
10. Popchev, I.; Doukovska, L.; Radeva, I. A Framework of Blockchain/IPFS-Based Platform for Smart Crop Production. In
Proceedings of the ICAI’22, Varna, Bulgaria, 6–8 October 2022. [CrossRef]
11. Popchev, I.; Doukovska, L.; Radeva, I. A Prototype of Blockchain/Distributed File System Platform. In Proceedings of the IEEE
International Conference on Intelligent Systems IS’22, Warsaw, Poland, 12–14 October 2022. [CrossRef]
12. IPFS Docs. IPFS Documentation. Available online: https://docs.ipfs.tech/ (accessed on 25 March 2024).
13. GitHub. Antelope. Available online: https://github.com/AntelopeIO (accessed on 11 January 2024).
14. Ilieva, G.; Yankova, T.; Radeva, I.; Popchev, I. Blockchain Software Selection as a Fuzzy Multi-Criteria Problem. Computers 2021,
10, 120. [CrossRef]
15. Radeva, I.; Popchev, I. Blockchain-Enabled Supply-Chain in Crop Production Framework. Cybern. Inf. Technol. 2022, 22, 151–170.
[CrossRef]
16. Popchev, I.; Radeva, I.; Doukovska, L. Oracles Integration in Blockchain-Based Platform for Smart Crop Production Data Exchange.
Electronics 2023, 12, 2244. [CrossRef]
17. Ollama. Available online: https://ollama.com. (accessed on 25 March 2024).
18. GitHub. GitHub—Chroma-Core/Chroma: The AI-Native Open-Source Embedding Database. Available online: https://github.
com/chroma-core/chroma (accessed on 26 February 2024).
19. PrimeReact. React UI Component Library. Available online: https://primereact.org (accessed on 25 March 2024).
20. WharfKit. Available online: https://wharfkit.com/ (accessed on 25 March 2024).
21. LangChain. Available online: https://www.langchain.com/ (accessed on 25 March 2024).
22. NLTK: Natural Language Toolkit. Available online: https://www.nltk.org/ (accessed on 26 February 2024).
23. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch:
An Imperative Style, High-Performance Deep Learning Library. Adv. Neural Inf. Process. Syst. 2019, 32, 8024–8035.
24. NumPy Documentation—NumPy v1.26 Manual. Available online: https://numpy.org/doc/stable/ (accessed on 26 February 2024).
25. Paul Tardy. Rouge: Full Python ROUGE Score Implementation (Not a Wrapper). Available online: https://github.com/pltrdy/
rouge (accessed on 1 April 2024).
26. The Hugging Face Team and Contributors. Transformers: State-of-the-Art Machine Learning for JAX,
PyTorch and TensorFlow. Available online: https://github.com/huggingface/transformers (accessed on 1 April 2024).
27. SciPy Documentation—SciPy v1.12.0 Manual. Available online: https://docs.scipy.org/doc/scipy/ (accessed on 26 February 2024).
28. Pyntelope. PyPI. Available online: https://pypi.org/project/pyntelope/ (accessed on 27 February 2024).
29. Rastogi, R. Papers Explained: Mistral 7B. DAIR.AI. Available online: https://medium.com/dair-ai/papers-explained-mistral-7b-
b9632dedf580 (accessed on 24 October 2023).
30. ar5iv. Mistral 7B. Available online: https://ar5iv.labs.arxiv.org/html/2310.06825 (accessed on 6 March 2024).
31. The Cloudflare Blog. Workers AI Update: Hello, Mistral 7B! Available online: https://blog.cloudflare.com/workers-ai-update-
hello-mistral-7b (accessed on 6 March 2024).
32. Hugging Face. Meta-Llama/Llama-2-7b. Available online: https://huggingface.co/meta-llama/Llama-2-7b (accessed on 6
March 2024).
33. Mitra, A.; Corro, L.D.; Mahajan, S.; Codas, A.; Ribeiro, C.S.; Agrawal, S.; Chen, X.; Razdaibiedina, A.; Jones, E.; Aggarwal, K.; et al.
Orca-2: Teaching Small Language Models How to Reason. arXiv 2023, arXiv:2311.11045.
34. Popchev, I.; Radeva, I.; Dimitrova, M. Towards Blockchain Wallets Classification and Implementation. In Proceedings of the 2023
International Conference Automatics and Informatics (ICAI), Varna, Bulgaria, 5–7 October 2023. [CrossRef]
35. Chen, J.; Lin, H.; Han, X.; Sun, L. Benchmarking Large Language Models in Retrieval-Augmented Generation. arXiv 2023,
arXiv:2309.01431. [CrossRef]
36. Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In
Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization,
Ann Arbor, MI, USA, 22 June 2005.
37. Lin, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries; Association for Computational Linguistics: Barcelona,
Spain, 2004.
38. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of
the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002. [CrossRef]
39. Arora, K.; Rangarajan, A. Contrastive Entropy: A New Evaluation Metric for Unnormalized Language Models. arXiv 2016,
arXiv:1601.00248. Available online: https://arxiv.org/abs/1601.00248v2 (accessed on 2 February 2024).
40. Jurafsky, D.; Martin, J.H. Speech and Language Processing. Available online: https://web.stanford.edu/~jurafsky/slp3/
(accessed on 8 February 2024).
41. Li, B.; Han, L. Distance Weighted Cosine Similarity Measure for Text Classification; Springer: Berlin/Heidelberg, Germany, 2013.
42. Sokolova, M.; Japkowicz, N.; Szpakowicz, S. Beyond Accuracy, F-Score and ROC: A Family of Discriminant Measures for
Performance Evaluation. Adv. Artif. Intell. 2006, 4304, 1015–1021.
43. issuu. Bancor Protocol Whitepaper En. Available online: https://issuu.com/readthewhitepaper/docs/bancor_protocol_
whitepaper_en (accessed on 24 March 2024).
44. Medium; Binesh, A. EOS Resource Usage. Available online: https://medium.com/shyft-network/eos-resource-usage-f0a80988
27d7 (accessed on 24 March 2024).
45. Hugging Face. Models. Available online: https://huggingface.co/models (accessed on 23 March 2024).
46. Cheng, D.; Huang, S.; Wei, F. Adapting Large Language Models via Reading Comprehension. arXiv 2024, arXiv:2309.09530.
[CrossRef]
47. Hugging Face. M42-Health/Med42-70b. Available online: https://huggingface.co/m42-health/med42-70b (accessed on 26
March 2024).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.