Article
Web Application for Retrieval-Augmented Generation:
Implementation and Testing
Irina Radeva 1, * , Ivan Popchev 2 , Lyubka Doukovska 1 and Miroslava Dimitrova 1
Abstract: The purpose of this paper is to explore the implementation of retrieval-augmented genera-
tion (RAG) technology with open-source large language models (LLMs). A dedicated web-based
application, PaSSER, was developed, integrating RAG with Mistral:7b, Llama2:7b, and Orca2:7b mod-
els. Various software instruments were used in the application’s development. PaSSER employs a set
of evaluation metrics, including METEOR, ROUGE, BLEU, perplexity, cosine similarity, Pearson cor-
relation, and F1 score, to assess LLMs’ performance, particularly within the smart agriculture domain.
The paper presents the results and analyses of two tests. One test assessed the performance of LLMs
across different hardware configurations, while the other determined which model delivered the
most accurate and contextually relevant responses within RAG. The paper discusses the integration
of blockchain with LLMs to manage and store assessment results within a blockchain environment.
The tests revealed that GPUs are essential for fast text generation, even for 7b models. Orca2:7b
on Mac M1 was the fastest, and Mistral:7b had superior performance on the 446 question–answer
dataset. The discussion addresses technical and hardware considerations affecting LLMs’ performance.
The conclusion outlines future developments in leveraging other LLMs, fine-tuning approaches, and
further integration with blockchain and IPFS.
In [3], the prompt engineering method is presented. This technique does not involve training network weights; instead, the input to the model is crafted so as to influence the desired output. This approach includes zero-shot prompting, few-shot prompting, and chain-of-thought prompting, each offering a way to guide the model’s response without direct modification of its parameters. This method leverages the flexibility and capability of LLMs and provides a tool to adapt the model without the computational cost of retraining.
RAG, introduced in [4], enhances language models by combining prompt engineering
and database querying to provide context-rich answers, reducing errors and adapting to
new data efficiently. The main concepts involve a combination of pre-trained language
models with external knowledge retrieval, enabling dynamic, informed content generation.
It is cost-effective and allows for traceable responses, making it interpretable. The develop-
ment of retrieval-augmented generation (RAG) represents a significant advancement in
the field of natural language processing (NLP). However, for deeper task-specific adapta-
tions, like analysing financial or medical records, fine-tuning may be preferable. RAG’s
integration of retrieval and generation techniques addresses LLM issues like inaccuracies
and opaque logic, yet incorporating varied knowledge and ensuring information relevance
and accuracy remain challenges [5].
Each method offers a specific approach to improving LLM performance. Choosing
between them depends on the desired balance between the required results, the available
resources, and the nature of the tasks set.
Other methods in this field are founded on these basic approaches or applied in parallel with them. For example, dense passage retrieval (DPR) [6] and the
retrieval-augmented language model (REALM) [7] refine retrieval mechanisms similar
to RAG. Fusion-in-decoder (FiD) [8] integrates information from multiple sources into
the decoding process. There are various knowledge-based modelling and meta-learning
approaches. Each of these models reflects efforts to extend the capabilities of pre-trained
language models and offer solutions for a wide range of NLP tasks.
The purpose of this paper is to explore the implementation of retrieval-augmented generation (RAG) technology with open-source large language models (LLMs). To support this research, a web-based application, PaSSER, that allows the integration, testing, and evaluation of such models in a structured environment has been developed.
The paper discusses the architecture of the web application, the technological tools
used, the models selected for integration, and the set of functionalities developed to operate
and evaluate these models. The evaluation of the models has two aspects: operation on
different computational infrastructures and performance in text generation and summa-
rization tasks.
The domain of smart agriculture is chosen as the empirical domain for testing the
models. Furthermore, the web application is open-source, which promotes transparency
and collaborative improvement. A detailed guide on installing and configuring the ap-
plication, the datasets generated for testing purposes, and the results of the experimental
evaluations are provided and available on GitHub [9].
The application allows adaptive testing of different scenarios. It integrates three of
the leading LLMs, Mistral:7b, Llama2:7b, and Orca2:7b, which do not require significant
computational resources. The selection of the Mistral:7b, Llama2:7b, and Orca2:7b models is driven by the aim of balancing performance and affordability: their parameter counts allow installation and operation on mid-range hardware configurations. Given appropriate computational resources, and without further modification, the PaSSER application allows the use of arbitrary open-source LLMs with more parameters.
A set of standard NLP metrics—METEOR, ROUGE, BLEU, Laplace and Lidstone’s
perplexity, cosine similarity, Pearson correlation coefficient, and F1 score—was selected for
a thorough evaluation of the models’ performance.
In this paper, RAG is viewed as a technology rather than a mere method. This distinction
is due to the paper’s emphasis on the applied, practical, and integrative aspects of RAG in
the field of NLP.
The paper contributes to the field of RAG research in several areas:
1. By implementing the PaSSER application, the study provides a practical framework
that can be used and expanded upon in future RAG research.
2. The paper illustrates the integration of RAG technology with blockchain, enhancing
data security and verifiability, which could inspire further exploration into the secure
and transparent application of RAG systems.
3. By comparing different LLMs within the same RAG framework, the paper provides
insights into the relative strengths and capabilities of the models, contributing knowl-
edge on model selection in RAG contexts.
4. The focus on applying and testing within the domain of smart agriculture adds to the
understanding of how RAG technology can be tailored and utilized in specific fields,
expanding the scope of its application and relevance.
5. The use of open-source technologies in PaSSER development allows the users to
review and trust the application’s underlying mechanisms. Moreover, it enables col-
laboration, provides flexibility to adapt to specific needs or research goals, reduces
development costs, facilitates scientific accuracy by enabling exact replication of re-
search setups, and serves as a resource for learning about RAG technology and LLMs
in practical scenarios.
The paper is organized as follows: Section 2 provides an overview of the development,
implementation, and functionalities of the PaSSER Web App; Section 3 discusses selected
standard NLP metrics used to measure RAG performance; Section 4 presents the results of
tests on the models; in Section 5, the limitations and influencing factors highlighted during
the testing are discussed; and Section 6 summarizes the results and future directions for
development.
Figure 2. PaSSER site map.
The ‘Create vectorstore’ feature, as depicted in Figure 3, outlines the process of converting raw textual data into a structured, queryable vector space using LangChain. This combination of NLP and vector embedding techniques makes it possible to convert text into a format convenient for vector operations. Users can source textual data from text files, PDFs, and websites. The outlined procedure for vectorstore creation is standardized across these data types, ensuring consistency in processing and storage. At the current phase, automatic retrieval of information from websites (scraping) is considered impractical due to the necessity for in-depth analysis of website structures and the requirement for extensive manual intervention to adequately structure the retrieved text. This process involves understanding varied and complex web layouts and imposing a tailored approach to effectively extract and organize data.
Figure 3. Vectorstore construction workflow.
1. Cleaning and standardizing text data. This is achieved by removing unnecessary characters (punctuation and special characters), converting the text to a uniform case (usually lower case), and separating the text into individual words or tokens. In the implementation considered here, the text is divided into chunks with different overlaps.
2. Vector embedding. The goal is to convert tokens (text tokens) into numeric vectors. This
is achieved by using pre-trained word embedding models from selected LLMs (in
this case, Mistral:7b, Llama2:7b, and Orca2:7b). These models map words or phrases
to high-dimensional vectors. Each word or phrase in the text is transformed into a
vector that represents its semantic meaning based on the context in which it appears.
3. Aggregating embeddings for larger text units to represent whole sentences or documents
as vectors. It can be achieved by simple aggregation methods (averaging the vectors of
all words in a sentence or document) or by using sentence transformers or document
embedding techniques that take into account the sequential and contextual
nature of words. Here, transformers are used, which are taken from the selected LLMs.
4. Create a vectorstore to store the vector representations in a structured format. The data structures used are optimized for operations with high-dimensional vectors. ChromaDB is used for the vectorstore (a minimal sketch of the whole pipeline is given after this list).
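The following is a minimal Python sketch of such a pipeline, assuming a recent LangChain release, a locally running Ollama server, and ChromaDB; the file name, collection name, and exact module paths (which vary between LangChain versions) are illustrative assumptions and do not reproduce the PaSSER source.

# Sketch of the vectorstore construction workflow (assumed names and paths).
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# 1. Load the source text (here: a PDF document).
documents = PyPDFLoader("knowledge_base.pdf").load()

# 2. Split the text into overlapping chunks (parameters used in the paper's tests).
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_documents(documents)

# 3. Embed each chunk with the transformer of the selected LLM served by Ollama.
embeddings = OllamaEmbeddings(model="mistral:7b", base_url="http://localhost:11434")

# 4. Store the vectors in a ChromaDB collection for later retrieval.
vectorstore = Chroma.from_documents(
    chunks,
    embedding=embeddings,
    collection_name="smart_agriculture",
    persist_directory="./chroma_db",
)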
Figure 4 defines the PaSSER App’s mechanisms for processing user queries and
generating responses. Figure 4a represents a general Q&A chat workflow with direct
input without the augmented context provided by a vectorstore. The corresponding LLM
processes the query, formulates a response, and concurrently provides system performance
data, including metrics such as total load and evaluation timeframes. Additionally, a
numerical array captures the contextual backdrop of the query and the response, drawn
from previous dialogue or related data, which the LLM utilizes similar to short-term
memory to ensure response relevance and coherence. While the capacity of this memory is
limited and not the focus of the current study, it is pivotal in refining responses based on
specific contextual elements such as names and dates. The App enables saving this context for continued dialogue and offers features for initiating new conversations by purging the existing context.
Figure 4b represents the ‘RAG Q&A chat’ workflow, in which the user query is first used to retrieve relevant records from the vectorstore. This data informs the subsequent query to the LLM, integrating the original
question, any prompts, and a context enriched by the vectorstore’s information. The LLM
then generates a response. Within the app, a dedicated memory buffer recalls history, which
the LLM utilizes as a transient context to ensure consistent and logical responses. The
limited capacity of this memory buffer and its impact on response quality is acknowledged,
though not extensively explored in this study. In the ‘RAG Q&A chat’, context-specific
details like names and dates are crucial for enhancing the relevance of responses.
The ‘Tests’ feature is designed to streamline the testing of various LLMs within a
specific knowledge domain. It involves the following steps:
1. Selection of a specific knowledge base in a specific domain.
With ‘Create vectorstore’, the knowledge base is processed and saved in the vector
database. In order to evaluate the performance of different LLMs for generating RAG
answers on a specific domain, it is necessary to prepare a sufficiently large list of questions
and reference answers. Such a list can be prepared entirely manually by experts in a
specific domain. However, this is a slow and time-consuming process. Another widely
used approach is to generate relevant questions based on reference answers given by a
selected LLM (i.e., creating respective datasets). PaSSER allows the implementation of the
second approach.
2. To create a reference dataset for a specific domain, a collection of answers related to
the selected domain is gathered. Each response contains key information related to
potential queries in that area. These answers are then saved in a text file format.
3. A selected LLM is deployed to systematically generate a series of questions corresponding to each predefined reference answer. This operation facilitates the creation of a structured dataset comprising pairs of questions and their corresponding answers. The finalized dataset is saved in the JSON file format.
4. The finalized dataset is uploaded to the PaSSER App, initiating an automated sequence of response generation for each query within the target domain. Following that, each generated response is forwarded to a dedicated Python backend script. This script is tasked with assessing the responses based on predefined metrics and comparing them to the established reference answers. The outcomes of this evaluation are then stored on the blockchain, ensuring a transparent and immutable ledger of the model’s performance metrics.
To facilitate this process, a smart contract ‘llmtest’ has been created, managing the interaction with the blockchain and providing a structured and secure method for storing and managing the assessment results derived from the LLM performance tests.
The provided pseudocode outlines the structure ‘tests’ and its methods within a blockchain environment, which were chosen to store test-related entries. It includes identifiers (id, userid, and testid), a timestamp (created_at), numerical results (results array), and descriptive text (description). It establishes id as the primary key for indexing, with additional indices based on created_at, userid, and testid to facilitate data retrieval and sorting by these attributes. This structure organizes and accesses test records within the blockchain.
The pseudocode below defines an eosio::multi_index table ‘tests_table’ for a blockchain, which facilitates the storage and indexing of data. It specifies four indices: a primary index based on id and secondary indices using the created_at, userid, and testid attributes for enhanced query capabilities. These indices optimize data retrieval operations, allowing for efficient access based on different key attributes like timestamp, user, and test identifiers, significantly enhancing the database’s functionality within the blockchain environment.
The provided pseudocode defines an EOSIO smart contract action named add_test, which allows adding a new record to the tests_table. It accepts the creator’s name, test ID, description, and an array of results as parameters. The action assigns a unique ID to the record, stores the current timestamp, and then inserts a new entry into the table using these details. This action helps in dynamically updating the blockchain state with new test information, ensuring that each entry is time-stamped and linked to its creator.
The pseudocodes provided above and in Section 3 are generated with GitHub Copilot upon the actual source code, available at “https://github.com/features/copilot (accessed on 1 April 2024)”.
5. The results from the blockchain are retrieved for further processing and analysis.
To facilitate the execution of these procedures, the interface is structured into three specific features: ‘Q&A dataset’ for managing question and answer datasets, ‘RAG Q&A score test’ for evaluating the performance of RAG utilizing datasets, and ‘Show test results’ for displaying the results of the tests. Each submenu is designed to streamline the respective aspect of the workflow, ensuring a coherent and efficient user experience throughout the process of dataset management, performance evaluation, and result visualization.
Within the ‘Q&A dataset’, the user is guided to employ a specific prompt, aiming to instruct the LLM to generate questions that align closely with the provided reference answers, as described in step 2. This operation initiates the creation of a comprehensive dataset, subsequently organizing and storing this information within a JSON file for future accessibility and analysis. This approach ensures the generation of relevant and accurate questions, thereby enhancing the dataset’s utility for follow-up evaluation processes.
The ‘RAG Q&A score test’ is designed to streamline the evaluation of different LLMs’ performances using the RAG, as indicated in Figure 5. This evaluation process involves importing a JSON-formatted dataset and linking it with an established vectorstore relevant to the selected domain. The automation embedded within this menu facilitates a methodical assessment of the LLMs, leveraging domain-specific knowledge embedded within the vectorstore.
Figure 5. Workflow diagram for RAG LLM query processing and score storage.
Vectorstores, once created using a specific LLM’s transformers, require the consistent
application of the same LLM model during the RAG process. Within this automated frame-
work, each question from the dataset is processed by the LLM to produce a corresponding
answer. Then, both the generated answers and their associated reference answers are
evaluated by a backend Python script. This script calculates performance metrics, records
these metrics on the blockchain under a specified test series, and iterates this procedure for
every item within the dataset.
The ‘Show test results’ feature is designed to access and display the evaluation
outcomes from various tests as recorded on the blockchain, presenting them in an organized
tabular format. This feature facilitates the visualization of score results for individual
answers across different test series and also provides the functionality to export this data
into an xlsx file format. The export feature simplifies further examination of the data, supporting more thorough evaluation and analysis.
The ‘Q&A Time LLM Test’ feature evaluates model performance across various hard-
ware setups using JSON-formatted question–answer pairs. Upon submission, the PaSSER
App prompts the selected model for responses, generating detailed performance metrics
like evaluation and load times, among others. These metrics are packed in a query to
a backend Python script, which records the data on the blockchain via the ‘addtimetest’
action, interacting with the ‘llmtest’ smart contract to ensure performance tracking and
data integrity.
The ‘Show time test results’ makes it easy to access and view LLM performance data,
organized by test series, from the blockchain. When displayed in a structured table, these
metrics can be examined for comprehensive performance assessment. There is an option
to export this data into an xlsx file, thereby improving the process for further in-depth
examination and analysis.
Authentication within the system (‘Login’) is provided through the Anchor wallet,
which is compliant with the security protocols of the SCPDx platform. This process,
described in detail in [34], provides user authentication by ensuring that testing activities
are securely associated with the correct user credentials. This strengthens the integrity and
accountability of the testing process within the platform ecosystem.
The ‘Configuration’ feature is divided into ‘Settings’ and ‘Add Model’.
The ‘Settings’ is designed for configuring connectivity to the Ollama API and Chro-
maDB API, using IP addresses specified in the application’s configuration file. It also
allows users to select an LLM that is currently installed in Ollama. A key feature here is
the ability to adjust the ‘temperature’ parameter, which ranges from 0 to 1, to fine-tune the
balance between creativity and predictability in the output generated by the LLM. Setting a
higher temperature value (>0.8) increases randomness, whereas a lower value enhances
determinism, with the default set at 0.2.
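For illustration, a generation request with an adjusted temperature can be sent to the Ollama REST API as in the sketch below; the host, port, model name, and prompt are assumptions rather than the PaSSER configuration.

import requests

# Query the Ollama API with a lowered temperature for more deterministic output.
# The default value used in PaSSER is 0.2.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral:7b",
        "prompt": "What does Regulation (EU) 2018/848 cover?",
        "stream": False,
        "options": {"temperature": 0.2},
    },
    timeout=300,
)
print(response.json()["response"])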
The ‘Add Model’ enables adding and removing LLMs in the Ollama API, allowing
dynamic model management. This feature is useful when testing different models, ensuring
optimal use of computational resources.
The ‘Manage DB’ feature displays a comprehensive list of vectorstores available in
ChromaDB, offering functionalities to inspect or interact with specific dataset records. This
feature enables users to view details within a record’s JSON response. It provides the option
to delete any vectorstore that is no longer needed, enabling efficient database management
by removing obsolete or redundant data, thereby optimizing storage utilization.
A block diagram representation of the PaSSER App’s operational logic that illustrates the interactions between the various components is provided in Figure 6.
The UI (web application) interacts with users for configuration, authentication, and operation initiation. It utilizes JavaScript and the PrimeReact library for UI components. It enables user interactions for authentication (Login), configuration, and operations with the LLMs and blockchain.
The web server (Apache) hosts the web application, facilitating communication between the user interface and backend components.
The LLM API and vector database utilize the Ollama API for the management of different LLMs. ChromaDB is incorporated for storage and retrieval of vectorized data.
Data pre-processing and vectorization standardize and convert data from various sources (e.g., PDFs, websites) into numerical vectors for LLM processing using pre-trained models of the selected LLMs.
The RAG Q&A chat facilitates query responses by integrating external data retrieval with LLM processing. It enables querying the LLMs with augmented information retrieved from the vector database for generating responses.
The built-in testing modules assess LLM performance across metrics, with results recorded on the blockchain.
The Python evaluation API calculates NLP performance metrics and interacts with the blockchain for recording the testing results via smart contracts.
Smart contracts manage the recording of test results on the blockchain.
3. Evaluation Metrics
The evaluation of RAG models within the PaSSER App was performed using a set of 13 standard NLP metrics. These metrics evaluated various dimensions of model performance, including the quality of text generation and summarization, semantic similarity, predictive accuracy, and consistency of generated content compared to reference or expected results. The metrics included METEOR, ROUGE (with ROUGE-1 and ROUGE-L variants), BLEU, perplexity (using Laplace and Lidstone smoothing techniques), cosine similarity, Pearson correlation coefficient, and F1 score.
The PaSSER App ran two main tests to assess the LLMs: the “LLM Q&A Time Test” and the “RAG Q&A Assessment Test”. The latter specifically applied the selected metrics to a created dataset of question–answer pairs for the smart agriculture domain. The test aimed to determine which model provides the most accurate and contextually relevant answers within the RAG framework and the capabilities of each model in the context of text generation and summarization tasks.
The ‘RAG Q&A chat’ was assessed using a set of selected metrics: METEOR, ROUGE,
PPL (perplexity), cosine similarity, Pearson correlation coefficient, and F1 score [35].
An automated evaluation process was developed to apply these metrics to the answers
generated using RAG. The process compared generated answers against the reference
answers in the dataset, calculating scores for each metric.
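Schematically, this loop can be thought of as in the following sketch; the function and field names are illustrative assumptions and do not reproduce the actual backEnd.py interface.

import json

def evaluate_dataset(dataset_path, generate_answer, score_answer, store_on_chain):
    """Illustrative evaluation loop: generate a RAG answer for every question,
    score it against the reference answer, and persist the metric values."""
    with open(dataset_path, encoding="utf-8") as f:
        qa_pairs = json.load(f)          # list of {"question": ..., "answer": ...}

    for item in qa_pairs:
        generated = generate_answer(item["question"])        # RAG pipeline call
        metrics = score_answer(generated, item["answer"])    # METEOR, ROUGE, BLEU, ...
        store_on_chain(item["question"], metrics)            # e.g., via the 'llmtest' contract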
All calculations were implemented in backEnd.py script in Python, available at:
“https://github.com/scpdxtest/PaSSER/blob/main/scripts/backEnd.py (accessed on
1 April 2024)”.
The following is a brief explanation of the purpose of the metrics used, the simplified
calculation formulas, and the application in the context of RAG.
3.1. METEOR (Metric for Evaluation of Translation with Explicit ORdering)
METEOR combines unigram precision P and recall R into a weighted harmonic mean, where F_mean = (10 · P · R) / (R + 9 · P).
The implementation of the calculations of Equations (1)–(3) is conducted with the
nltk library, single_meteor_score function, line 58 in Python script.
This pseudocode describes the process of splitting two texts into words and calculating the METEOR score between them.
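A minimal version of this computation with the nltk library is sketched below; the example sentences are illustrative, and the tokenization details in backEnd.py may differ.

import nltk
from nltk.translate.meteor_score import single_meteor_score

nltk.download("wordnet", quiet=True)  # resource required by the METEOR scorer

reference = "Organic production must respect natural systems and cycles."
candidate = "Organic production should respect natural cycles and systems."

# Both texts are split into words; recent nltk versions expect pre-tokenized input.
score = single_meteor_score(reference.split(), candidate.split())
print(round(score, 3))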
In the context of RAG models, the METEOR score can be used to evaluate the quality of the generated responses. A high METEOR score indicates that the generated response closely matches the reference text, suggesting that the model is accurately retrieving and generating responses. Conversely, a low METEOR score could indicate areas for improvement in the model’s performance.
3.2. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
ROUGE [37] is a set of metrics used for evaluating automatic summarization and machine translation. It works by comparing an automatically produced summary or translation against one or more reference summaries (usually human-generated).
ROUGE has several variants: ROUGE-N, ROUGE-L, and ROUGE-W.
ROUGE-N focuses on the overlap of n-grams (sequences of n words) between the
system-generated summary and the reference summaries. It is computed in terms of recall,
precision, and F1 score:
– Recall ROUGE-N is the ratio of the number of overlapping n-grams between the system
summary and the reference summaries to the total number of n-grams in the reference
summaries:
– Precision ROUGE-N is the ratio of the number of overlapping n-grams in the system
summary to the total number of n-grams in the system summary itself:
ROUGE-L focuses on the longest common subsequence (LCS) between the generated
summary and the reference summaries. The LCS is the longest sequence of words that ap-
pears in both texts in the same order, though not necessarily consecutively. The parameters
for ROUGE-L include:
– Recall ROUGE-L is the length of the LCS divided by the total number of words in
the reference summary. This measures the extent to which the generated summary
captures the content of the reference summaries:
– Precision ROUGE-L is the length of the LCS divided by the total number of words in the generated summary. This assesses the extent to which the words in the generated summary appear in the reference summaries:

Precision ROUGE-L = LCS(X, Y) / (total number of words in the generated summary)   (8)

– F1 ROUGE-L is a harmonic mean of the LCS-based precision and recall:

F1 ROUGE-L = 2 × (Precision ROUGE-L × Recall ROUGE-L) / (Precision ROUGE-L + Recall ROUGE-L)   (9)
ROUGE-W is an extension of ROUGE-L with a weighting scheme that assigns more importance to longer sequences of matching words. In this application, ROUGE-W is not applied.
The implementation of the calculations of Equations (4)–(9) is conducted with the rouge library, rouge.get_scores function, line 65 in the Python script.
This pseudocode describes the process of initializing a ROUGE object and calculating the ROUGE scores between two texts.
1. Set ‘hypothesis’ to the reference text and ‘ref’ to the candidate text
2. Initialize a Rouge object
3. Calculate the ROUGE scores between ‘hypothesis’ and ‘ref’ using the
‘get_scores’ method of the Rouge object
The choice between a preference for precision, recall, or F1 scoring depends on the specific goals of the summarization task, such as whether it is more important to capture as much information as possible (recall) or to ensure that what is captured is highly relevant (precision).
In the context of RAG models, the ROUGE metric serves as a tool for assessing the quality of the generated text, especially in summarization, question answering, and content-generation tasks.
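A minimal sketch of the get_scores call described in the pseudocode above is given below; the example sentences are illustrative assumptions.

from rouge import Rouge

hypothesis = "The regulation defines rules for organic production and labelling."
ref = "The regulation lays down rules on organic production and labelling of organic products."

rouge = Rouge()
# Returns ROUGE-1, ROUGE-2, and ROUGE-L scores with recall (r), precision (p), and F1 (f).
scores = rouge.get_scores(hypothesis, ref)
print(scores[0]["rouge-l"]["f"])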
3.3. BLEU (Bilingual Evaluation Understudy)
BLEU evaluates generated text by comparing its n-grams with those of one or more reference texts, using a modified (clipped) n-gram precision:

Pn = Σ Countclip(n-gram) / Σ Count(n-gram), with both sums taken over the n-grams of the candidate translation   (10)

where, Countclip is a count of each n-gram in the candidate translation clipped by its maximum count in any single reference translation.
The brevity penalty (BP) is a component of the BLEU score that ensures translations
are not only accurate but also of appropriate length. The BP is defined as:
BP = 1, if c > r; BP = e^(1−r/c), if c ≤ r   (11)
where, c is the total length of the candidate translation, and r is the effective reference
corpus length, which is the sum of the lengths of the closest matching reference translations
for each candidate sentence.
BP = 1 if the candidate translation length c is greater than the reference length r,
indicating no penalty.
BP = e^(1−r/c) if c is less than or equal to r, indicating a penalty that increases as the
candidate translation becomes shorter relative to the reference.
The overall BLEU score is calculated using the following formula:

BLEU = BP · exp(Σn=1..N wn · log Pn)   (12)
where, N is the maximum n-gram length (typically 4), and wn is the weight for each n-gram’s precision score, often set equally such that their sum is 1 (e.g., wn = 0.25 for N = 4).
This formula aggregates the individual modified precision scores Pn for n-grams of length 1 to N, geometrically averaged and weighted by wn, then multiplied by the brevity penalty BP to yield the final BLEU score.
The implementation of the calculations of Equations (10)–(12) is conducted with the nltk library, sentence_bleu and SmoothingFunction functions, lines 74–79 in the Python script.
This pseudocode describes the process of splitting two texts into words, creating a smoothing function, and calculating the BLEU score between them.
In the context of RAG models, the BLEU score can be used to evaluate the quality of the generated responses. A high BLEU score would indicate that the generated response closely matches the reference text, suggesting that the model is accurately retrieving and generating responses. A low BLEU score could indicate areas for improvement in the model’s performance.
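A minimal nltk sketch of the same steps, with illustrative sentences and the equal weights discussed above:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Organic farming relies on crop rotation and green manure.".split()
candidate = "Organic farming is based on crop rotation and green manure.".split()

# Smoothing avoids zero scores when some higher-order n-grams have no overlap.
smoothie = SmoothingFunction().method1
score = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smoothie)
print(round(score, 3))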
3.4. Perplexity (PPL)
Perplexity (PPL) [39] is a measure used to evaluate the performance of probabilistic
language models. The introduction of smoothing techniques, such as Laplace (add-one)
smoothing and Lidstone smoothing [40], aims to address the issue of zero probabilities for
unseen events, thereby enhancing the model’s ability to deal with sparse data. Below are
the formulas for calculating perplexity.
– PPL with Laplace Smoothing adjusts the probability estimation for each word by adding
one to the count of each word in the training corpus, including unseen words. This
method ensures that no word has a zero probability. The adjusted probability estimate
with Laplace smoothing is calculated using the following formula:
PLaplace(wi | h) = (C(wi, h) + 1) / (C(h) + V)   (13)
where, PLaplace(wi | h) is the probability of word wi given its history h (the words that precede it), C(wi, h) is the count of wi following h, C(h) is the count of history h, and V is the vocabulary size (the number of unique words in the training set plus one for unseen words).
The PPL of a sequence of words W = w1, ..., wN is given by:

PPL(W) = e^(−(1/N) Σi=1..N ln(PLaplace(wi | h)))   (14)

The implementation of the calculations of Equations (13) and (14) is conducted with the nltk library, lines 84–102, in the Python script.
This pseudocode describes the process of tokenizing an input text paragraph, training a Laplace model (bigram model), and calculating the perplexity of a candidate text using the model.
1. Tokenize the input text paragraph into sentences and words, convert all
words to lowercase
2. Split the tokenized text into training data and vocabulary using a bigram
model
3. Train a Laplace model (bigram model) using the training data and vocabulary
4. Define a function ‘calculate_perplexity’ that:
a. Tokenizes the input text into words, converts all words to lowercase
b. Calculates the perplexity of the text using the Laplace model
5. Set ‘test_text’ to the candidate text
6. Calculate the Laplace perplexity of ‘test_text’ using the
‘calculate_perplexity’ function
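A hedged sketch of these steps with nltk’s language-modelling module is given below; the example texts are illustrative, and the exact tokenization and data preparation in backEnd.py may differ.

import nltk
from nltk import sent_tokenize, word_tokenize
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

nltk.download("punkt", quiet=True)

def laplace_perplexity(reference_text, candidate_text, order=2):
    # Tokenize the reference text into lowercase words per sentence (training data).
    train = [[w.lower() for w in word_tokenize(s)] for s in sent_tokenize(reference_text)]
    train_data, vocab = padded_everygram_pipeline(order, train)

    # Train a bigram model with Laplace (add-one) smoothing.
    model = Laplace(order)
    model.fit(train_data, vocab)

    # Score the candidate text as padded n-grams with the trained model.
    test = [[w.lower() for w in word_tokenize(s)] for s in sent_tokenize(candidate_text)]
    test_data, _ = padded_everygram_pipeline(order, test)
    ngrams = [ng for sent in test_data for ng in sent]
    return model.perplexity(ngrams)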
– PPL with Lidstone smoothing is a generalization of Laplace smoothing where, instead of adding one to each count, a fraction λ (where 0 < λ < 1) is added. This allows for more flexibility compared to the fixed increment in Laplace smoothing. The adjusted probability estimate with Lidstone smoothing is:

PLidstone(wi | h) = (C(wi, h) + λ) / (C(h) + λV)   (15)

The PPL of a sequence of words W = w1, ..., wN is given by:

PPL(W) = e^(−(1/N) Σi=1..N ln(PLidstone(wi | h)))   (16)

The implementation of the calculations of Equations (15) and (16) is conducted with the nltk library, lines 108–129, in the Python script.
This pseudocode describes the process of tokenizing an input text paragraph, training a Lidstone model (trigram model), and calculating the perplexity of a candidate text using the model.
1. Set the training text to the reference text
2. Tokenize the training text into sentences and then into words, convert all words to lowercase
3. Prepare the training data for a trigram model
4. Create and train a Lidstone model with Lidstone smoothing, where gamma is the Lidstone smoothing parameter
5. Set the test text to the candidate text
6. Tokenize the test text into sentences and then into words, convert all words to lowercase
7. Prepare the test data
8. Calculate the Lidstone perplexity of the test text
In both formulas, the goal is to compute how well the model predicts the test set W. The lower perplexity indicates that the model predicts the sequence more accurately. The choice between Laplace and Lidstone smoothing depends on the specific requirements of the model and dataset, as well as empirical validation.
In the context of RAG models, both metrics are useful for assessing the quality and ability of models to deal with a variety of language and information. These metrics indicate how well they can generate contextually informed, linguistically coherent, and versatile text.
3.5. Cosine Similarity
Cosine similarity [41] is a measure of vector similarity and can be used to determine the distance of embeddings between the chunk and the query. It is a distance metric that approaches 1 when the question and chunk are similar and becomes 0 when they are different. The mathematical formulation of the metric is:

Cosine Similarity = (A · B) / (‖A‖ ‖B‖) = Σi=1..n Ai Bi / (√(Σi=1..n Ai²) · √(Σi=1..n Bi²))   (17)

where, A · B is the dot product of vectors A and B, ‖A‖ and ‖B‖ are the Euclidean norms (magnitudes) of vectors A and B, calculated with √(Σi=1..n Ai²) and √(Σi=1..n Bi²), respectively, and n is the dimensionality of the vectors, assuming A and B have the same dimension.
Cosine Similarity = 1 means the vectors are identical in orientation.
Cosine Similarity = 0 means the vectors are orthogonal (independent) to each other.
Cosine Similarity = −1 means the vectors are diametrically opposed.
The implementation of the calculation of Equation (17) is conducted with the transformers library, lines 133–164, in the Python script.
This pseudocode describes the process of tokenizing two texts, generating BERT embeddings for them, and calculating the cosine similarity between the embeddings. The [CLS] token is used as the aggregate representation for classification tasks.
In RAG models, cosine similarity ensures that retrieved documents align closely with user queries, capturing relationships between the meaning of a user query and the stored chunks. This is particularly important in RAG models, as they leverage a retriever to find context documents. The use of cosine similarity between embeddings ensures that these retrieved documents align closely with user queries.

3.6. Pearson Correlation Coefficient
The Pearson correlation coefficient measures the strength and direction of the linear relationship between two variables X and Y:

r = Σi=1..n (Xi − X̄)(Yi − Ȳ) / (√(Σi=1..n (Xi − X̄)²) · √(Σi=1..n (Yi − Ȳ)²))   (18)

where, n is the number of data points, Xi and Yi are the individual data points, and X̄ and Ȳ are the means of the X and Y data sets, respectively.
The implementation of the calculation of Equation (18) is conducted with the transformers and scipy libraries, lines 167–178, in the Python script.
This pseudocode describes the process of tokenizing two texts, generating BERT embeddings for them, and calculating the Pearson correlation coefficient between the embeddings. The mean of the last hidden state of the embeddings is used as the aggregate representation.
In the context of evaluating RAG models, the Pearson correlation coefficient can be
used to measure how well the model’s predictions align with actual outcomes. A coeffi-
cient close to +1 indicates a strong positive linear relationship, meaning as one variable
increases, the other also increases. A coefficient close to -1 indicates a strong negative linear
relationship, meaning as one variable increases, the other decreases. A coefficient near 0
suggests no linear correlation between variables. In the evaluation of RAG models, a high
Pearson correlation coefficient could indicate that the model is accurately retrieving and
generating responses, while a low coefficient could suggest areas for improvement.
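A compact sketch of how both embedding-based scores can be obtained with the transformers library is given below; the bert-base-uncased checkpoint, the pooling choices, and the example texts are assumptions and may differ from the actual backEnd.py implementation.

import torch
from scipy.stats import pearsonr
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    """Return the [CLS] vector and the mean of the last hidden state for a text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_size)
    return hidden[0, 0], hidden[0].mean(dim=0)

cls_a, mean_a = embed("Generated answer about organic certification.")
cls_b, mean_b = embed("Reference answer about organic certification.")

# Equation (17): cosine similarity between the [CLS] representations.
cosine = torch.nn.functional.cosine_similarity(cls_a, cls_b, dim=0).item()

# Equation (18): Pearson correlation between the mean-pooled embeddings.
pearson, _ = pearsonr(mean_a.numpy(), mean_b.numpy())
print(round(cosine, 3), round(pearson, 3))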
3.7. F1 Score
In the context of evaluating the performance of RAG models, the F1 score [42] is used
for quantitatively assessing how well the models perform in tasks for generating or retriev-
ing textual information (question answering, document summarization, or conversational
AI). The evaluation often hinges on their ability to accurately and relevantly generate text
that aligns with reference or ground truth data.
The F1 score is the harmonic mean of precision and recall. Precision assesses the portion
of relevant information in the responses generated by the RAG model. High precision
indicates that most of the content generated by the model is relevant to the query or task at
hand, minimizing irrelevant or incorrect information. Recall (or sensitivity) evaluates the
model’s ability to capture all relevant information from the knowledge base that should be
included in the response. High recall signifies that the model successfully retrieves and
incorporates a significant portion of the pertinent information available in the context.
The formula for calculating it is:

F1 = 2 × (Precision × Recall) / (Precision + Recall)   (19)

Precision and Recall are defined as:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN)   (20)
where, TP (True Positives) is the count of correctly retrieved relevant documents, FP (False Positives) is the count of incorrectly retrieved documents (i.e., the documents that were retrieved but are not relevant), and FN (False Negatives) is the count of relevant documents that were not retrieved.
The implementation of the calculations of Equations (19) and (20) occurs on lines 185–204 in the Python script.
This pseudocode describes the process of tokenizing two texts, counting the common tokens between them, and calculating the F1 score.
For tasks of question answering, the F1 score can be used to measure how well the generated answers match the expected answers, considering both the presence of correct information (high precision) and the completeness of the answer (high recall).
For tasks of document summarization, the F1 score might evaluate the overlap between
the key phrases or sentences in the model-generated summaries and those in the reference
summaries, reflecting the model’s efficiency in capturing essential information (recall) and
avoiding extraneous content (precision).
For‚ conversational AI applications, the F1 score could assess the relevance and com-
pleteness of the model’s responses in dialogue, ensuring that responses are both pertinent
to the conversation context and comprehensive in addressing users’ intents or questions.
4. Testing
The aim of the tests presented in this section is to evaluate the performance of the Mistral:7b, Llama2:7b, and Orca2:7b models installed on two different hardware configurations and to assess the performance of these models in generating answers using RAG on the selected knowledge domain, smart agriculture.
The knowledge base was retrieved from EU Regulation 2018/848 "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32018R0848 (accessed on 1 April 2024)" and the Climate-Smart Agriculture Sourcebook "https://www.fao.org/3/i3325e/i3325e.pdf (accessed on 1 April 2024)". These documents were pre-processed manually and vectorized using the transformers of Mistral:7b, Llama2:7b, and Orca2:7b LLMs under the following parameters: chunk size—500, overlapping—100, temperature—0.2.
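A minimal sketch of such a vectorization step, assuming the LangChain wrappers for Ollama embeddings and the Chroma vectorstore, is given below; the file name, persist directory, and loader choice are illustrative, and only the chunk size and overlap parameters are reflected in this sketch.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Load a pre-processed source document and split it into overlapping chunks.
documents = PyPDFLoader("climate_smart_agriculture_sourcebook.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_documents(documents)

# Embed the chunks with one of the selected models and persist the vectorstore.
embeddings = OllamaEmbeddings(model="mistral:7b")
vectorstore = Chroma.from_documents(chunks, embeddings,
                                    persist_directory="./chroma_smart_agriculture")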
The dataset containing reference answers specific to smart agriculture was compiled
and stored as a text file. The Mistral:7b model was deployed to formulate questions based
on these reference answers. Initial trials indicated that Mistral:7b excelled in generating
questions with high relevance within this particular domain. To initiate the question
generation process, the following prompt was employed: “Imagine you are a virtual
assistant trained in the detailed regulations of organic agriculture. Your task involves
creating precise questions for a specific regulatory statement provided to you below. The
statement comes directly from the regulations, and your challenge is to reverse-engineer
the question that this statement answers. Your formulated question should be concise, clear,
and directly related to the content of the statement. Aim to craft your question without
implying the statement itself as the answer, and, where relevant, gear your question toward
eliciting specific events, facts, or regulations.”
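A minimal sketch of this question-generation step, assuming Ollama's default local REST endpoint, is shown below; the reference statement is an invented placeholder and the prompt prefix abbreviates the text quoted above.

import requests

PROMPT_PREFIX = ("Imagine you are a virtual assistant trained in the detailed "
                 "regulations of organic agriculture. ...")  # full prompt as quoted above
statement = "Organic production shall rely on renewable resources within local agricultural systems."

payload = {
    "model": "mistral:7b",
    "prompt": f"{PROMPT_PREFIX}\n\nStatement: {statement}",
    "stream": False,
    "options": {"temperature": 0.2},
}

# Ollama's generate endpoint on its default port; the reply text holds the generated question.
response = requests.post("http://localhost:11434/api/generate", json=payload, timeout=300)
print(response.json()["response"])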
For testing purposes, two Ollama APIs installed on two different hardware configura-
tions were used:
- Intel Xeon, 32 Cores, 0 GPU, 128 GB RAM, Ubuntu 22.04 OS.
- Mac Mini M1, 8 CPU, 10 GPU, 16 GB RAM, OSX 13.4.
In the PaSSER App, the installed Ollama APIs to be used can be selected. This is set in the configuration->settings menu.
The two following tests were designed: testing via the ‘Q&A Time LLM Test’ and the
‘RAG Q&A score test’.
The ‘Q&A Time LLM Test’ evaluated LLM performance across two hardware configu-
rations using a dataset of 446 questions for each model, focusing on seven specific metrics
(evaluation time, evaluation count, load duration time, prompt evaluation count, prompt
evaluation duration, total duration, and tokens per second). These metrics were integral for
analyzing the efficiency and responsiveness of each model under different computational
conditions. The collected data was stored on a blockchain, ensuring both transparency and
traceability of the evaluation results.
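The seven metrics correspond closely to the duration counters that Ollama returns with a non-streamed reply (durations reported in nanoseconds). The sketch below shows one way they could be collected; the response field names are assumed to follow Ollama's current API format and may vary between versions.

import requests

reply = requests.post("http://localhost:11434/api/generate",
                      json={"model": "orca2:7b",
                            "prompt": "What is crop rotation?",
                            "stream": False},
                      timeout=300).json()

# Durations are reported in nanoseconds; convert to seconds for readability.
metrics = {
    "evaluation time (s)": reply["eval_duration"] / 1e9,
    "evaluation count": reply["eval_count"],
    "load duration (s)": reply["load_duration"] / 1e9,
    "prompt evaluation count": reply["prompt_eval_count"],
    "prompt evaluation duration (s)": reply["prompt_eval_duration"] / 1e9,
    "total duration (s)": reply["total_duration"] / 1e9,
    "tokens per second": reply["eval_count"] / (reply["eval_duration"] / 1e9),
}
print(metrics)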
The ‘RAG Q&A score test’ aimed to evaluate the performance of the models based on 13 metrics (METEOR, ROUGE-1, ROUGE-l, BLEU, perplexity, cosine similarity, Pearson correlation, and F1) applied to each of the 446 question–reference answer pairs for which RAG obtained answers.
The ‘RAG Q&A score test’ evaluated the performance of different models in a chat
environment with enhanced RAG Q&A, identifying differences and patterns in their
ability to respond to queries. Its goal was to determine the model that best provided
accurate, context-aware responses that defined terms and summarized specific content.
This evaluation can be used to select a model that ensures the delivery of accurate and
relevant information in the context of the specific knowledge provided.
The performance outcomes from the ‘Q&A Time LLM Test’ and ‘RAG Q&A score test’ for evaluating LLMs were stored on the blockchain via smart contracts. For analysis, this data was retrieved from the blockchain and stored in an xlsx file. This file was uploaded to GitHub "https://github.com/scpdxtest/PaSSER/blob/main/tests/TEST%20DATA_GENERAL%20FILE.xlsx (accessed on 1 April 2024)".
In the upcoming section, the focus is solely on presenting and analysing the mean
values derived from the test data. This approach eases the interpretation, enabling a
summarized review of the core findings and trends across the conducted evaluations.
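For instance, the mean values could be derived from a local copy of the exported spreadsheet with pandas; this is a sketch only, and the column names used below are assumptions for illustration, not the actual headers of the file.

import pandas as pd

# Local copy of the xlsx file exported from the blockchain records.
df = pd.read_excel("TEST DATA_GENERAL FILE.xlsx")

# Mean of selected timing metrics per model and hardware configuration
# (column names are illustrative).
summary = df.groupby(["model", "hardware"])[
    ["total_duration", "eval_duration", "tokens_per_second"]].mean()
print(summary)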
Figure 9. Performance of Orca2:7b.
Across all models, several trends are evident (Figures 7–9). UBUNTU generally shows longer evaluation times, indicating slower processing capabilities compared to MAC OS. Evaluation counts are relatively comparable, suggesting that the number of operations conducted within a given timeframe is similar across hardware configurations. Load duration times are consistently longer on UBUNTU, affecting readiness and response times negatively. UBUNTU tends to conduct more prompt evaluations, but also takes significantly longer, which exposes efficiency issues. UBUNTU experiences longer total durations for all tasks, reinforcing the trend of slower overall performance. MAC OS demonstrates higher tokens per second across all models, indicating more efficient data processing capabilities.
The performance indicators (Table 1) suggest that across all models, the evaluation time on the Mac M1 system is significantly less than on the Ubuntu system with Xeon processors, indicating faster overall performance. In terms of tokens per second, the Mac M1 also performs better, suggesting it is more efficient at processing information despite having fewer CPU cores and less RAM.
Table 1. Comparative performance metrics of Llama2:7b, Mistral:7b, and Orca2:7b LLMs on macOS
M1 and Ubuntu Xeon Systems (w/o GPU).
Despite the Ubuntu system's higher core count and larger RAM, its evaluation time is longer and its tokens-per-second rate is lower. This suggests that the hardware advantages of the Xeon system do not translate into performance gains for these particular models. Notably, the Ubuntu system shows a higher prompt evaluation count for Orca2:7b, which might be leveraging the greater number of CPU cores to handle more prompts simultaneously.
Orca2:7b has the lowest evaluation time on the Mac M1 system, showcasing the most ef-
ficient utilization of that hardware. Llama2:7b shows a significant difference in performance
between the two systems, indicating it may be more sensitive to hardware and operating
system optimizations. Mistral:7b has a comparatively closer performance between the two
systems, suggesting it may be more adaptable to different hardware configurations.
The table suggests that the Mac M1’s architecture provides a significant performance
advantage for these language models over the Ubuntu system equipped with a Xeon
processor. This could be due to several factors, including but not limited to the efficiency of
the M1 chip, the optimization of the language models for the specific architectures, and the
potential use of the M1’s GPU in processing.
For a more straightforward interpretation of the results, the ranges of values of the different metrics are briefly described below.
The ideal METEOR score is 1. It indicates a perfect match between the machine-generated text and the reference translations, encompassing both semantic and syntactic accuracy. For ROUGE metrics (ROUGE-1 recall, precision, f-score; ROUGE-l recall, precision, f-score), the best possible value is 1. This value denotes a perfect overlap between the content generated by the model and the reference content, indicating high levels of relevance and precision in the captured information. The BLEU score's maximum is also 1 (or 100 when expressed in percentage terms), representing an exact match between the machine's output and the reference texts, reflecting high coherence and context accuracy. For perplexity, the lower the value, the better the model's predictive performance. The best perplexity score would technically approach 1, indicating the model's predictions are highly accurate with minimal uncertainty. A cosine similarity of 1 signifies maximum similarity between the generated output and the reference. A Pearson correlation of 1 is ideal, signifying a perfect positive linear relationship between the model's outputs and the reference data, indicating high reliability of the model's performance. An F1 score reaches its best at 1, representing perfect precision and recall, meaning the model has no false positives or false negatives in its output. For a better comparison of the models, Figure 10 is presented.
Figure 10. A comparison of performance metrics.
The presented metrics provide a picture of the performance of the models on text generation and summarization tasks. The analysis for each metric is as follows.
METEOR evaluates the quality of translation by aligning the model output to reference
translations when considering precision and recall. Mistral:7b scores highest, suggesting its
translations or generated text are the most accurate.
ROUGE-1 recall measures the overlap of unigrams between the generated summary
and the reference. A higher score indicates more content overlap. Mistral:7b leads, which
implies it includes more of the reference content in its summaries or generated text.
ROUGE-1 precision (the unigram precision). Mistral:7b has the highest score, indicating
that its content is more relevant and has fewer irrelevant inclusions.
ROUGE-1 F-score is the harmonic mean of precision and recall. Orca2:7b leads slightly,
indicating a balanced trade-off between precision and recall in its content generation.
ROUGE-L recall measures the longest common subsequence and is good at evaluating
sentence-level structure similarity. Mistral:7b scores the highest, showing it is better at capturing
longer sequences from the reference text.
ROUGE-L precision. Mistral:7b again scores highest, indicating it includes longer, relevant
sequences in its summaries or generated text without much irrelevant information.
ROUGE-L F-Score. Orca2:7b has a marginally higher score, suggesting a balance in
precision and recall for longer content blocks.
BLEU assesses the quality of machine-generated translation. Mistral:7b outperforms the
others, indicating its translations may be more coherent and contextually appropriate.
“rows”: [{
“supply”: “10000000000.0000 RAMCORE”,
“base”: {
“balance”: “68660625616 RAM”,
“weight”: “0.50000000000000000”
},
“quote”: {
“balance”: “1000857.1307 SYS”,
“weight”: “0.50000000000000000”
}
}
]
In order to apply the Bancor algorithm for RAM pricing to our private network, the following clarifications should be made. The Antelope blockchain network has a so-called RAM token. The PaSSER uses our private Antelope blockchain network, whose system token is SYS. In the context of the Bancor algorithm, RAM and SYS should be considered Smart Tokens. The Smart Token is a token that has one or more connectors with other tokens in the network. The connector, in this case, is a SYS token, and it establishes a relationship between SYS and RAM. Using the Bancor algorithm [43,44] could be presented as follows:

RAM Price = cb/(STos × CW), CW = cb/STtv =>
RAM Price = cb/(STos × cb/STtv) = STtv/STos    (21)
where cb is the connector balance, STos is the Smart Token's outstanding supply (= base.balance), and STtv is the Smart Token's total value (= connector balance = quote.balance).
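As a quick check, the rammarket snapshot shown above can be substituted into Equation (21); treating base.balance as bytes and converting to KiB is an assumption made only for illustration.

# Values taken from the rammarket row listed above.
base_balance = 68_660_625_616      # STos: outstanding RAM supply (assumed to be bytes)
quote_balance = 1_000_857.1307     # cb = STtv: connector balance (SYS)

# Equation (21): RAM Price = STtv / STos
ram_price_per_byte = quote_balance / base_balance
ram_price_per_kib = ram_price_per_byte * 1024
print(f"{ram_price_per_byte:.10f} SYS/byte = {ram_price_per_kib:.8f} SYS/KiB")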
The cost evaluation of RAM, CPU, and NET resources in SYS tokens during a test
execution occurs as follows. The PaSSER App uses SYS tokens. The CPU price is measured
in (SYS token/ms/Day) and is valid for a specific account on the network. The NET Price
is measured in (SYS token/KiB/Day) and is valid for a specific account on the network.
This study only considers the cost of the RAM resources required to run the tests.
Data on the current market price of the RAM resource is retrieved every 60 min from an oracle [16] that runs within the SCPDx platform, whose blockchain infrastructure is being used.
The current price of RAM is 0.01503345 SYS/kB as of 23 March 2024, 2:05 PM. Assum-
ing the price of 1 SYS is equal to the price of 1 EOS, it is possible to compare the price of the
RAM used if the tests are run on the public Antelope blockchain because there is a quote
of the EOS RAM Token in USD and it does not depend on the account used. The quote is
available at “https://coinmarketcap.com/ (accessed on 23 March 2024)”.
Table 3 shows that in terms of RAM usage and the associated costs in SYS and USD, the
score tests require more resources than the timing tests. The total cost of using blockchain
resources for these tests is less than 50 USD. This gives reason to assume that using
blockchain to manage and document test results has promise. The RAM price, measured in
SYS per kilobyte (kB), remains constant across different tests in the blockchain network.
This means that blockchain developers and users can anticipate and plan for the costs
associated with their blockchain operations. The blockchain resource pricing model is
designed to maintain a predictable and reliable cost structure. This predictability mat-
ters for the long-term sustainability and scalability of blockchain projects as it allows for
accurate cost estimation and resource allocation. However, the real value of implement-
ing blockchain must also consider the benefits of increased transparency, security and
traceability against these costs.
5. Discussion
The PaSSER App testing observations reveal several aspects that affect the acquired results and the performance and that can be managed: data cleaning and pre-processing, chunk sizes, GPU usage, and RAM size.
Data cleaning and pre-processing cannot be fully automated. In addition to removing
special characters and hyperlinks, it is also necessary to remove non-essential or erroneous
information, standardize formats, and correct errors from the primary data. This is done
manually. At this stage, the PaSSER App processes only textual information; therefore, the
normalization of data, handling of missing data, and detection and removal of deviations
are not considered.
Selecting documents with current, validated, and accurate data is pivotal, yet this
process cannot be entirely automated. What can be achieved is to ensure traceability
and record the updates and origins of both primary and processed data, along with their
secure storage. Blockchain and distributed file systems can be used for this purpose.
Here, this objective is partially implemented since blockchain is used solely to record the
testing results.
The second aspect is chunk sizes when creating and using vectorstores. Smaller chunks
require less memory and computational resources. This is at the expense of increased
iterations and overall execution time, which is balanced by greater concurrency in query
processing. On the other hand, larger chunks provide more context but may be more
demanding on resources and potentially slow down processes if not managed efficiently.
Adjusting the chunk size affects both the recall and precision of the results. Adequate
chunk size is essential to ensure a balance between the retrieval and generation tasks in
RAG, as over- or undersized chunks can negatively impact one or both components. In
the tests, 500-character chunk sizes were found to give the best results. In this particular implementation, no added metadata (such as document type or section labels) is used in the vectorstore creation process; such metadata would facilitate more targeted processing when using smaller chunks.
GPU usage and RAM size obviously affect the performance of the models. It is evident
from the results that hardware configurations that do not use the GPU perform significantly
slower on the text generation and summarization tasks. Models with fewer parameters
(up to 13b) can run reasonably well on 16 GB RAM configurations. Larger models need
more resources, in terms of both RAM and GPU. This is the reason why, in this particular implementation, the selected small LLMs were used, as they are suitable for standard and commonly available hardware configurations and platforms.
It is important to note that the choice of a model may vary depending on the specific
requirements of a given task, including the desired balance between creativity and accuracy,
the importance of fluency versus content fidelity, and the computational resources available.
Therefore, while Mistral:7b appears to be the most versatile and capable model based on
the provided metrics, the selection of a model should be guided by the specific objectives
and constraints of the application in question.
While promising, the use of RAG-equipped LLMs requires caution regarding data
accuracy, privacy concerns, and ethical implications. This is of particular importance in the
healthcare domain, where the goal is to assist medical professionals and researchers by ac-
cessing the latest medical research, clinical guidelines, and patient data, as well as assisting
diagnostic processes, treatment planning, and medical education. Pre-trained open-source
models can be found on Huggingface [45]. For example, TheBloke/medicine-LLM-13B-
GPTQ is used for medical question answering, patient record summarization, aiding
medical diagnosis, and general health Q&A [46]. Another model is m42-health/med42-
70b [47]. However, this application requires measures to ensure accuracy, privacy, and
compliance with health regulations.
6. Conclusions
This paper presented the development, integration, and use of the PaSSER web
application, designed to leverage RAG technology with LLMs for enhanced document
retrieval and analysis. Despite the explicit focus on smart agriculture as the chosen specific
domain, the application can be used in other areas.
The web application integrates the Mistral:7b, Llama2:7b, and Orca2:7b LLMs, selected for their performance and compatibility with hardware of medium computational capacity. It has built-in testing modules that evaluate the performance of the LLMs in real time by a set of 13 evaluation metrics (METEOR; ROUGE-1 recall, precision, f-score; ROUGE-l recall, precision, f-score; BLEU; Laplace perplexity; Lidstone perplexity; cosine similarity; Pearson correlation; F1 score).
The LLMs were tested via the ‘Q&A Time LLM Test’ and ‘RAG Q&A score test’
functionalities of the PaSSER App. The ‘Q&A Time LLM Test’ was focused on assessing
LLMs across two hardware configurations. From the results of the ‘Q&A Time LLM Test’, it
can be concluded that even when working with 7b models, the presence of GPUs is crucial
for text generation speed. The lowest total duration times were shown by Orca2:7b on the
Mac M1 system. From the results of the ‘RAG Q&A Score Test’ applied to the selected
metrics over the dataset of 446 question–answer pairs, the Mistral:7b model exhibited
superior performance.
The PaSSER App leverages a private, permissionless Antelope blockchain network
for documenting and verifying results from LLMs’ testing. The system operates on a
token-based economy (SYS) to manage RAM, CPU, and NET resources. RAM usage and
associated costs, measured in SYS and USD, indicate that the total cost for blockchain re-
sources for conducted tests remains below 50 USD. This pricing model guarantees reliability
and predictability by facilitating accurate cost estimations and efficient resource distribu-
tion. Beyond the monetary aspects, the value of implementing blockchain encompasses
increased transparency, security, and traceability, highlighting its benefits.
Future development will focus on leveraging other pre-trained open-source LLMs (over 40b parameters), exploring fine-tuning approaches, and further integration with the existing Antelope blockchain/IPFS infrastructure of the SCPDx platform.
Author Contributions: Conceptualization, I.R. and I.P.; methodology, I.R. and I.P.; software, I.R. and
M.D.; validation, I.R. and M.D.; formal analysis, I.P. and I.R.; investigation, L.D.; resources, L.D.;
data curation, I.R. and M.D.; writing—original draft preparation, I.R. and I.P.; writing—review and
editing, I.R., I.P. and M.D.; visualization, I.R. and M.D.; supervision, I.P.; project administration, L.D.;
funding acquisition, I.R., I.P. and L.D. All authors have read and agreed to the published version of
the manuscript.
Funding: This work was supported by the Bulgarian Ministry of Education and Science under
the National Research Program “Smart crop production” approved by the Ministry Council No.
866/26.11.2020.
Data Availability Statement: All data and source codes are available at: “https://github.com/
scpdxtest/PaSSER (accessed on 1 April 2024). Git Structure: ‘README.md’—information about
the project and instructions on how to use it; ‘package.json’- the list of project dependencies and
other metadata; ‘src’- all the source code for the project; ‘src/components’—all the React compo-
nents for the project; ‘src/components/configuration.json’—various configuration options for the app;
‘src/App.js’—the main React component that represents the entire app; ‘src/index.js’—JavaScript entry
point file; ‘public’—static files like the ‘index.html’ file; ‘scripts’—Python backend scripts; ‘Installation
Instructions.md’—contains instructions on how to install and set up the project.
Conflicts of Interest: The authors declare no conflicts of interest. The funders had no role in the design
of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or
in the decision to publish the results.
References
1. Howard, J.; Ruder, S. Universal Language Model Fine-Tuning for Text Classification. In Proceedings of the 56th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 1 July 2018. [CrossRef]
2. Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-
Efficient Transfer Learning for NLP. No. 97. In Proceedings of the 36th International Conference on Machine Learning, Long
Beach, CA, USA, 9–15 June 2019; Chaudhuri, K., Salakhutdinov, R., Eds.; pp. 2790–2799.
3. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al.
Language Models Are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. Available online: https://arxiv.org/abs/2005.14165v4
(accessed on 26 March 2024).
4. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv 2020, arXiv:2005.11401. Available online: http:
//arxiv.org/abs/2005.11401 (accessed on 2 February 2024).
5. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Guo, Q.; Wang, M.; et al. Retrieval-Augmented Generation for
Large Language Models: A Survey. arXiv 2023, arXiv:2312.10997. Available online: http://arxiv.org/abs/2312.10997 (accessed on
18 February 2024).
6. Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W. Dense Passage Retrieval for Open-Domain
Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP),
Online, 1 November 2020. [CrossRef]
7. Guu, K.; Lee, K.; Tung, Z.; Pasupat, P.; Chang, M. Retrieval Augmented Language Model Pre-Training. Proc. Mach. Learn. Res.
2020, 119, 3929–3938.
8. Izacard, G.; Grave, E. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume,
Online, 20 April 2021. [CrossRef]
9. GitHub. GitHub—Scpdxtest/PaSSER. Available online: https://github.com/scpdxtest/PaSSER (accessed on 8 March 2024).
10. Popchev, I.; Doukovska, L.; Radeva, I. A Framework of Blockchain/IPFS-Based Platform for Smart Crop Production. In
Proceedings of the ICAI’22, Varna, Bulgaria, 6–8 October 2022. [CrossRef]
11. Popchev, I.; Doukovska, L.; Radeva, I. A Prototype of Blockchain/Distributed File System Platform. In Proceedings of the IEEE
International Conference on Intelligent Systems IS’22, Warsaw, Poland, 12–14 October 2022. [CrossRef]
12. IPFS Docs. IPFS Documentation. Available online: https://docs.ipfs.tech/ (accessed on 25 March 2024).
13. GitHub. Antelope. Available online: https://github.com/AntelopeIO (accessed on 11 January 2024).
14. Ilieva, G.; Yankova, T.; Radeva, I.; Popchev, I. Blockchain Software Selection as a Fuzzy Multi-Criteria Problem. Computers 2021,
10, 120. [CrossRef]
15. Radeva, I.; Popchev, I. Blockchain-Enabled Supply-Chain in Crop Production Framework. Cybern. Inf. Technol. 2022, 22, 151–170.
[CrossRef]
16. Popchev, I.; Radeva, I.; Doukovska, L. Oracles Integration in Blockchain-Based Platform for Smart Crop Production Data Exchange.
Electronics 2023, 12, 2244. [CrossRef]
17. Ollama. Available online: https://ollama.com. (accessed on 25 March 2024).
18. GitHub. GitHub—Chroma-Core/Chroma: The AI-Native Open-Source Embedding Database. Available online: https://github.
com/chroma-core/chroma (accessed on 26 February 2024).
19. PrimeReact. React UI Component Library. Available online: https://primereact.org (accessed on 25 March 2024).
20. WharfKit. Available online: https://wharfkit.com/ (accessed on 25 March 2024).
21. LangChain. Available online: https://www.langchain.com/ (accessed on 25 March 2024).
22. NLTK: Natural Language Toolkit. Available online: https://www.nltk.org/ (accessed on 26 February 2024).
23. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch:
An Imperative Style, High-Performance Deep Learning Library. Adv. Neural Inf. Process. Syst. 2019, 32, 8024–8035.
24. NumPy Documentation—NumPy v1.26 Manual. Available online: https://numpy.org/doc/stable/ (accessed on 26 February 2024).
25. Paul Tardy. Rouge: Full Python ROUGE Score Implementation (Not a Wrapper). Available online: https://github.com/pltrdy/
rouge (accessed on 1 April 2024).
26. The Hugging Face Team and Contributors. Transformers: State-of-the-Art Machine Learning for JAX,
PyTorch and TensorFlow. Available online: https://github.com/huggingface/transformers (accessed on 1 April 2024).
27. SciPy Documentation—SciPy v1.12.0 Manual. Available online: https://docs.scipy.org/doc/scipy/ (accessed on 26 February 2024).
28. Pyntelope. PyPI. Available online: https://pypi.org/project/pyntelope/ (accessed on 27 February 2024).
29. Rastogi, R. Papers Explained: Mistral 7B. DAIR.AI. Available online: https://medium.com/dair-ai/papers-explained-mistral-7b-
b9632dedf580 (accessed on 24 October 2023).
30. ar5iv. Mistral 7B. Available online: https://ar5iv.labs.arxiv.org/html/2310.06825 (accessed on 6 March 2024).
31. The Cloudflare Blog. Workers AI Update: Hello, Mistral 7B! Available online: https://blog.cloudflare.com/workers-ai-update-
hello-mistral-7b (accessed on 6 March 2024).
32. Hugging Face. Meta-Llama/Llama-2-7b. Available online: https://huggingface.co/meta-llama/Llama-2-7b (accessed on 6
March 2024).
33. Mitra, A.; Corro, L.D.; Mahajan, S.; Codas, A.; Ribeiro, C.S.; Agrawal, S.; Chen, X.; Razdaibiedina, A.; Jones, E.; Aggarwal, K.; et al.
Orca-2: Teaching Small Language Models How to Reason. arXiv 2023, arXiv:2311.11045.
34. Popchev, I.; Radeva, I.; Dimitrova, M. Towards Blockchain Wallets Classification and Implementation. In Proceedings of the 2023
International Conference Automatics and Informatics (ICAI), Varna, Bulgaria, 5–7 October 2023. [CrossRef]
35. Chen, J.; Lin, H.; Han, X.; Sun, L. Benchmarking Large Language Models in Retrieval-Augmented Generation. arXiv 2023,
arXiv:2309.01431. [CrossRef]
36. Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In
Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization,
Ann Arbor, MI, USA, 22 June 2005.
37. Lin, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries; Association for Computational Linguistics: Barcelona,
Spain, 2004.
38. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of
the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002. [CrossRef]
39. Arora, K.; Rangarajan, A. Contrastive Entropy: A New Evaluation Metric for Unnormalized Language Models. arXiv 2016,
arXiv:1601.00248. Available online: https://arxiv.org/abs/1601.00248v2 (accessed on 2 February 2024).
40. Jurafsky, D.; Martin, J.H. Speech and Language Processing. Available online: https://web.stanford.edu/~jurafsky/slp3/
(accessed on 8 February 2024).
41. Li, B.; Han, L. Distance Weighted Cosine Similarity Measure for Text Classification; Springer: Berlin/Heidelberg, Germany, 2013.
42. Sokolova, M.; Japkowicz, N.; Szpakowicz, S. Beyond Accuracy, F-Score and ROC: A Family of Discriminant Measures for
Performance Evaluation. Adv. Artif. Intell. 2006, 4304, 1015–1021.
43. issuu. Bancor Protocol Whitepaper En. Available online: https://issuu.com/readthewhitepaper/docs/bancor_protocol_
whitepaper_en (accessed on 24 March 2024).
44. Medium; Binesh, A. EOS Resource Usage. Available online: https://medium.com/shyft-network/eos-resource-usage-f0a80988
27d7 (accessed on 24 March 2024).
45. Hugging Face. Models. Available online: https://huggingface.co/models (accessed on 23 March 2024).
46. Cheng, D.; Huang, S.; Wei, F. Adapting Large Language Models via Reading Comprehension. arXiv 2024, arXiv:2309.09530.
[CrossRef]
47. Hugging Face. M42-Health/Med42-70b. Available online: https://huggingface.co/m42-health/med42-70b (accessed on 26
March 2024).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.