
(IJACSA) International Journal of Advanced Computer Science and Applications,

Vol. 15, No. 3, 2024

Retrieval-Augmented Generation Approach: Document Question Answering using Large Language Model
Kurnia Muludi1, Kaira Milani Fitria2*, Joko Triloka3, Sutedi4
Informatics Engineering Graduate Program, Darmajaya Informatics and Business Institute, Bandar Lampung, Indonesia

Abstract—This study introduces the Retrieval Augmented Generation (RAG) method to improve Question-Answering (QA) systems by addressing document processing problems in Natural Language Processing. It represents the latest breakthrough in applying RAG to document question-answering applications, overcoming previous QA system obstacles. RAG combines search techniques over a vector store with the text generation mechanism of Large Language Models, offering a time-efficient alternative to the limitations of manual reading. The research evaluates RAG built on the Generative Pre-trained Transformer 3.5 (gpt-3.5-turbo) model from ChatGPT and its impact on document data processing, comparing it with other applications. This research also provides datasets to test the capabilities of the document QA system. The proposed dataset and the Stanford Question Answering Dataset (SQuAD) are used for performance testing. The study contributes theoretically by advancing methodologies and knowledge representation, supporting benchmarking in research communities. Results highlight RAG's superiority: achieving a precision of 0.74 in Recall-Oriented Understudy for Gisting Evaluation (ROUGE) testing, outperforming others at 0.5; obtaining an F1 score of 0.88 in BERTScore, surpassing other QA apps at 0.81; attaining a precision of 0.28 in Bilingual Evaluation Understudy (BLEU) testing, surpassing others with a precision of 0.09; and scoring 0.33 in Jaccard Similarity, outshining others at 0.04. These findings underscore RAG's efficiency and competitiveness, promising a positive impact on various industrial sectors through advanced Artificial Intelligence (AI) technology.

Keywords—Natural Language Processing; Large Language Model; Retrieval Augmented Generation; Question Answering; GPT

I. INTRODUCTION

This research proposes a new approach to the increasing reliance on articles and journal documents by introducing a Question-Answering (QA) document processing system [1]. The problems motivating this research are multifaceted. Firstly, manual reading and processing to comprehend document text are time-consuming, error-prone, and inefficient: manual document understanding leads to time-consuming effort, susceptibility to human error, and inefficient analysis processes. Secondly, previous methods that modify Large Language Models (LLM) for document processing demand substantial resources and are challenging to implement widely. Lastly, models relying solely on the capabilities of an LLM for QA, without modification, tend to generate hallucinatory answers lacking correctness and precision. In addition, the underutilization of the recently introduced Retrieval Augmented Generation (RAG) method, particularly for document processing within QA systems, provides an opportunity for further exploration. The motivation therefore stems from the challenges associated with manual document processing, resource-intensive LLM modification, and the underutilization of RAG in the document-based question-answering domain [2], [3]. Finally, implementing RAG in QA systems for document processing offers untapped potential to improve the ability of the system to produce accurate, non-hallucinatory responses.

Building on this line of research, this paper proposes the implementation of the Retrieval Augmented Generation model for document question answering tasks, specifically using the ChatGPT model. RAG, introduced in 2021 [4], addresses the limitations of previous methods by merging parametric and non-parametric memory. This hybrid model seamlessly integrates generative capabilities with data retrieval mechanisms, linking language models to external knowledge sources. RAG combines generative capabilities with the ability to search for data and incorporate relevant information from the knowledge base into the model. The distinct advantages of RAG lie in its ability to adapt to dynamic data, its flexibility in working with external data sources, and its ability to mitigate hallucinatory responses [5]. These characteristics make RAG particularly suitable for QA tasks on internal organizational documents, leveraging external knowledge to reduce response hallucinations [6].

The current research aims to exploit the innovative approach of RAG to construct an application capable of automatically processing external text documents. The focus of this research is to develop an application system capable of processing external document text uploaded by the user. The system automatically reads the document text, allowing users to input questions related to the document. Subsequently, the system provides answers based on the processed document text, eliminating the need for manual reading to find answers. This comprehensive solution not only overcomes the limitations of previous methods, but also promises to significantly speed up research and study exploration in various domains.


Testing of the proposed model is performed, as in several previous QA studies, by measuring the agreement between the answers produced by the model and the ground truth of the test dataset. The metrics used to calculate the performance of the model include Accuracy, ROUGE, BLEU, BERTScore, and Jaccard Similarity.

II. RELATED WORKS

This study examines the applicability of RAG and its impact on the document processing task, and compares it to previous methods. It also investigates the capability of the large language model behind ChatGPT, gpt-3.5-turbo, within the framework of RAG. This work further builds on developments in Artificial Intelligence (AI) and Natural Language Processing (NLP), focusing on improving the intelligence and capabilities of applications [7], [8], [9]. Machine Learning and Deep Learning algorithms, including BERT-Base and Text-to-Text Transfer Transformer (T5) models, as well as the RAG method, have made significant advances in QA tasks [4], [10], [11], [12]. This research motivated the implementation of RAG for processing documents, integrated into an interactive QA system.

Between 2015 and the present, the evolution of question-answering (QA) systems shows a trajectory characterized by diverse methodologies. Starting with semantic parsing-based systems in 2015, Wen-tau Yih et al. focused on transforming natural language queries into structured logical forms, achieving a performance of 52.5% in F1-score [2]. Subsequent knowledge-based paradigms (KB-QA) by Yanchao Hao et al. in 2017 reformulated questions as predicates, achieving a performance of 42.9% [3]. Progress has been made in integrating AI technologies. Caiming Xiong's exploration of dynamic memory networks (DMN) in 2016 achieved an accuracy of 28.79% [7]. In the same year, Minjoon Seo et al.'s Bi-Directional Attention Flow (BiDAF) framework demonstrated significant performance with a 68% exact match and 77.3% F1 score, albeit with a computational time of 20 hours [8]. Adams Wei Yu et al. introduced the QANet model in 2018, with a performance of 76.2% exact match and 84.6% F1-Score, within a shorter computational time of 3 hours [9]. As QA systems evolved, in 2019 Wei Yang et al. applied fine-tuning methods with data augmentation techniques, achieving remarkable results with a modified BERT-Base model of 49.2% for exact match and 65.4% for F1-Score [10]. Colin Raffel et al. introduced the Text-to-Text Transfer Transformer (T5), with impressive performance of 63.3% for exact match, 94.1% for F1 score, and a peak accuracy of 93.8%, albeit with an increased number of parameters of 11 billion [11]. In 2020, the focus was on fine-tuning pre-trained models, with Adam Roberts et al. achieving a recall performance of 34.6% using the T5 model [12]. The Retrieval-Augmented Generation (RAG) method, which combines parametric and non-parametric methods, was introduced by Patrick Lewis et al. in 2021. RAG has demonstrated its capabilities in open-domain QA tasks, overcoming previous limitations to deliver more efficient and comprehensive QA systems [4].

The large language model called GPT, or Generative Pre-Trained Transformer, was developed by OpenAI. Previous research comparing the performance of ChatGPT with other large language models such as PaLM and LLaMA in open-domain QA tasks indicates that ChatGPT consistently achieves the highest scores across various open-domain QA datasets [13]. Table I presents performance comparisons among LLMs.

TABLE I. LLM PERFORMANCE ON OPEN-DOMAIN QA DATASETS

Model                    TriviaQA   WebQuestions   NQ-Open
PaLM-540B (few-shot)     81.4       43.5           39.6
PaLM-540B (zero-shot)    76.9       10.6           21.2
LLaMA-65B (zero-shot)    68.2       -              23.8
ChatGPT (zero-shot)      85.9       50.5           48.1

The PolyQuery Synthesis test, which identifies multiple queries within a single-query prompt and extracts the answers to all of the questions from the model's latent representation, also shows that ChatGPT outperforms other GPT models from OpenAI (ada-001, babbage-001, curie, and davinci) in terms of accuracy [13]. Based on these evaluations, the gpt-3.5-turbo model was selected for implementation in this research.

III. RESEARCH METHOD

This research undergoes a development phase, starting with designing the application system and integrating the APIs of ChatGPT, LangChain, and FAISS. Subsequent stages include extensive system modeling, interface testing, and data preparation using the proposed dataset and the SQuAD dataset. The testing phase, which includes a performance comparison with other applications using ground-truth metrics (ROUGE, BERTScore, BLEU, and Jaccard Similarity), guides the exploration of the capabilities of the proposed system, as shown in Fig. 1.

Fig. 1. Research flow diagram.

A. RAG Integration

Retrieval Augmented Generation (RAG) combines retrieval and generation models. It uses a Large Language Model (LLM) to generate text based on prompts and integrates information from a separate retrieval system to improve output quality and contextual relevance [14]. The mechanism involves retrieving factual content from a knowledge base via retrieval models and using generative processes to provide additional context for more accurate output [15]. External data sources are used, and their numerical representation is produced by embedding methods to ensure compatibility. As shown in Fig. 2, user queries converted into embeddings are compared with vectors from the knowledge library. Relevant context is added to the queries before they are fed into the base language model.
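The following minimal sketch illustrates this retrieve-then-generate flow. It is not the paper's implementation: embed() and generate() are hypothetical stand-ins for an embedding model and the base LLM, and the retrieval depth is an arbitrary choice.

# Minimal sketch of the RAG flow described above (Fig. 2); not the paper's code.
# embed() and generate() are hypothetical stand-ins for an embedding model
# and the base language model.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rag_answer(query: str, knowledge_library: list[str], k: int = 3) -> str:
    # 1) Convert the user query and the stored knowledge chunks into embeddings.
    query_vec = embed(query)                               # hypothetical helper
    chunk_vecs = [embed(chunk) for chunk in knowledge_library]

    # 2) Compare the query vector with the stored vectors and keep the k most
    #    similar chunks as relevant context.
    scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    context = "\n".join(knowledge_library[i] for i in top)

    # 3) Add the retrieved context to the query before it is fed to the base LLM.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)                                # hypothetical helper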


OpenAI, the creator of the GPT family of Large Language Models, conducted a comprehensive set of RAG experiments, exploring various implementations such as cosine similarity retrieval, chunk/embedding experiments, reranking, classification steps, and prompt engineering, as depicted in Fig. 3. OpenAI's findings, presented in Fig. 3, revealed that RAG implementation with prompt engineering achieved the highest accuracy, positioning it as the most effective RAG technique to date [16]. This finding serves as a catalyst for integrating RAG with prompt engineering using the LangChain module.

Fig. 2. RAG mechanism with LLM.

Fig. 3. Accuracy of the RAG method by OpenAI.

LangChain provides a robust data processing pipeline that uses FAISS to perform efficient retrieval operations in the vector database. The query phase transforms inputs into vectors for database searches, and prompt engineering enhances the reusability of retrievals. Output parsers interpret LLM outputs, ensuring consistency [17]. Facebook AI Similarity Search (FAISS) is a highly efficient similarity search and vector clustering library [18]. It optimizes the trade-off between memory, speed, and accuracy, allowing developers to effectively navigate multimedia documents. The mechanism involves the construction of an index for efficient storage, with vector searches retrieving the most similar vectors using cosine similarity scores [19].
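As an illustration only (not the paper's code), the snippet below shows how a FAISS index can be built and queried; with L2-normalized vectors, inner-product search is equivalent to cosine similarity. The embedding dimensionality and the random matrices are placeholder assumptions.

# Illustrative FAISS index construction and search (placeholder data).
import numpy as np
import faiss

dim = 384                                                # assumed embedding size
doc_vecs = np.random.rand(1000, dim).astype("float32")   # placeholder chunk embeddings
faiss.normalize_L2(doc_vecs)          # normalize so inner product equals cosine similarity

index = faiss.IndexFlatIP(dim)        # exact inner-product index
index.add(doc_vecs)                   # store the document vectors

query_vec = np.random.rand(1, dim).astype("float32")     # placeholder query embedding
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, 4)   # retrieve the 4 most similar vectors
print(ids[0], scores[0])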
B. Proposed Model

This research employs a modified Large Language Model (LLM), ChatGPT, augmented with additional libraries to function as a Question-Answering (QA) system capable of processing external documents for supplementary information.


The chosen methodology for QA system development is the Retrieval Augmented Generation (RAG) mechanism. Unlike previous approaches such as semantic parsing, knowledge bases, and fine-tuning with LSTM or other deep learning algorithms, RAG addresses shortcomings such as the difficulty of expanding or revising model memory, the inability to provide direct insight into generated predictions, and the tendency to produce hallucinative answers [12]. The solution involves the creation of a hybrid model, merging generative and retrieval models, which forms the basis of the RAG method. RAG offers advantages such as adaptive responses to dynamic data, flexibility with external data sources, and minimization of hallucinative responses [5]. Thus, RAG is chosen to construct a text-document-based QA system that interacts with users through a chatbot interface. The system's workflow, implemented using RAG and supporting libraries such as LangChain and FAISS, is illustrated in Fig. 4.

The integration of the LangChain framework into the QA document system includes document loading, memory management, and prompting to connect to the LLM. The process starts with document loading, followed by splitting the documents into text chunks. These text chunks undergo word embedding, converting them into vectors stored in the vector database. Simultaneously, user-input questions are embedded and converted into word vectors. The system connects these vectors to the vector database, performing a semantic search and ranking the relevance between vectors. The semantic search yields relevant context between questions and answers. The system retrieves pertinent answers based on user queries and sends them to the LLM (the ChatGPT model). Finally, the system receives the LLM-generated answers and delivers them to the user. The application system interacts with users, requiring an interface connecting the user and the system. Mockups, design layouts, and elements for the web application are created using the Streamlit framework, facilitating rapid development and sharing of the AI model web application. The mockup of the application system and user interaction within the system is depicted in Fig. 5.

Fig. 4. Integration of the LangChain framework in RAG for the proposed document QA system.

Fig. 5. Mockup of the application system and user interaction for the app.
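To make this workflow concrete, the sketch below shows one way to wire the steps together using classic LangChain 0.0.x-style imports (module paths differ in newer releases). The file name, chunk sizes, and retrieval depth are illustrative assumptions rather than values reported in this paper, and an OpenAI API key is assumed to be configured.

# Minimal sketch of the document QA pipeline described above
# (classic LangChain-style API; newer releases relocate these modules).
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# 1) Document loading and splitting into text chunks.
docs = PyPDFLoader("example.pdf").load()          # illustrative file name
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100            # illustrative sizes
).split_documents(docs)

# 2) Word embedding of the chunks, stored as vectors in a FAISS index.
vectordb = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 3) Retrieval plus generation: the question is embedded, similar chunks are
#    retrieved as context, and gpt-3.5-turbo generates the answer.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    retriever=vectordb.as_retriever(search_kwargs={"k": 4}),
)
print(qa.run("What is the main contribution of the paper?"))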


C. Proposed Dataset: DocuQA

The proposed dataset, DocuQA, designed for application-based question-answering systems that process document inputs, consists of 20 diverse documents encompassing journal articles, news reports, financial documents, and tutorials. Each document file includes five questions with corresponding ground-truth answers, enabling a thorough evaluation of QA system capabilities, with a total of 100 questions in the dataset. DocuQA consists of journal documents with calculations and formulas, news documents with specific titles, financial reports and news documents with numbers and currency data, and tutorial documents with step-by-step instructions. Accuracy can be calculated based on the correct answers out of 100, providing a metric for information extraction accuracy. The dataset aims to challenge QA systems in understanding context, identifying keywords, and efficiently extracting specific information, offering a robust evaluation tool for developers and researchers across various document and question types. The dataset can be accessed publicly [20]. Proper citation of the dataset is encouraged for research or projects using DocuQA to ensure appropriate credit is given. A preview of the DocuQA dataset can be seen in Fig. 6.

Fig. 6. Preview of the DocuQA test dataset.

D. Testing and Evaluation

The tests were performed on two types of test datasets, DocuQA [20] and SQuAD 1.1 [21]. DocuQA is a dataset originally created for this research, consisting of 100 questions with ground truth and a total of 20 test documents for document-based QA systems. In addition, the SQuAD dataset was used in the form of modified PDF documents to test the QA system's ability to process documents and retrieve information based on the questions and related ground truth in the SQuAD dataset. Both types of test datasets are run against the QA system developed in this research as well as against other commercial QA systems that process PDF documents, such as typeset.io. The results of these tests indicate whether the QA system built in this research is superior to other document-based QA applications.

The proposed QA document processing system is evaluated through rigorous testing using established metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation), BERTScore, BLEU (Bilingual Evaluation Understudy), and Jaccard Similarity. These metrics provide reliable benchmarks for assessing the system's performance across various dimensions. The testing process involves two key variables: "Predictions RAG" and "Predictions Others" represent the test results from the developed application and from comparable commercial applications, respectively. Both sets of predictions are compared to the ground-truth data, which is encapsulated in the "references" variable. Different aspects of language models and question answering systems are evaluated using different metrics. ROUGE measures overlap in summarization [22]. BERTScore assesses semantic similarity using contextual embeddings [23]. BLEU evaluates n-gram precision [24], and Jaccard Similarity compares text similarity based on word or n-gram overlap [25]. Performance in question answering systems is commonly assessed through accuracy, F1 score, and precision metrics, providing insight into their effectiveness. The metrics are used to quantitatively evaluate system performance and establish its superiority over existing commercial applications in document processing and information retrieval tasks.

1) Accuracy: Accuracy is defined as the proportion of correct responses out of the total number of responses. It can be calculated as the percentage of correct predictions over the total number of references [26]. In essence, accuracy represents the ability of the system to provide correct answers, expressed as a percentage using the following formula (see Eq. (1)).

Accuracy = (correct predictions / all predictions) × 100%    (1)

This metric serves as a valuable indicator of the overall correctness of the model's responses.

2) ROUGE: Recall-Oriented Understudy for Gisting Evaluation can be used to evaluate text generation models; it is based on measuring the overlap between candidate text and reference text [27]. ROUGE has several measurement variants, each depending on the size of the overlapping n-grams. The ROUGE-L variant is the most widely used because it relies on the longest common subsequence (LCS), the longest word sequence that both sentences share. Precision refers to the proportion of words in the candidate that are also in the reference (see Eq. (2)). Recall, on the other hand, refers to the proportion of words in the reference text that are matched in the predicted candidate text, as shown in Eq. (3). The F1-score can be calculated from the precision and recall as shown in Eq. (4).

ROUGE-L_recall = LCS(candidate, reference) / #words in reference    (2)

ROUGE-L_precision = LCS(candidate, reference) / #words in candidate    (3)

ROUGE-L_F1 = 2 × (recall · precision) / (recall + precision)    (4)

where the reference is based on the ground truth in the test dataset and the candidate is the system prediction.


The score generated by the ROUGE measure is between 0 and 1; a score of 1 indicates total agreement between reference and candidate text.

3) BERTScore: BERTScore is an automatic evaluation metric for text generation tasks that evaluates the similarity of each candidate sentence token to each reference sentence token by means of contextual embeddings [23]. The embeddings in BERTScore are contextual, changing depending on the sentence context. This context awareness allows BERTScore to score semantically similar sentences despite differences in word order. For the recall calculation, each token in x is matched with the most similar token in x̂, and likewise for the precision calculation. Greedy matching is used to maximize the similarity score. The values of precision (see Eq. (5)), recall (see Eq. (6)), and F1 score (see Eq. (7)) for reference x and candidate x̂ can be calculated using the following equations.

R_BERT = (1/|x|) Σ_{x_i ∈ x} max_{x̂_j ∈ x̂} x_iᵀ x̂_j    (5)

where R_BERT is the BERTScore recall, x is the reference token sequence, x̂ is the candidate token sequence, x_i is a token vector from x, x̂_j is a token vector from x̂, the sum runs over all x_i in x, the max is taken over all x̂_j in x̂, and x_iᵀ x̂_j is the cosine similarity of x_i and x̂_j.

P_BERT = (1/|x̂|) Σ_{x̂_j ∈ x̂} max_{x_i ∈ x} x_iᵀ x̂_j    (6)

where P_BERT is the BERTScore precision, the sum runs over all x̂_j in x̂, the max is taken over all x_i in x, and x_iᵀ x̂_j is again the cosine similarity of x_i and x̂_j.

F_BERT = 2 × (P_BERT · R_BERT) / (P_BERT + R_BERT)    (7)

where F_BERT is the F1-score of BERTScore, P_BERT is the precision, and R_BERT is the recall from the BERTScore results. Although the cosine similarity value is theoretically in the interval [-1, 1], in practice the value is rescaled so that it lies between 0 and 1 in the BERTScore calculation.
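To make the greedy matching in Eqs. (5)-(7) concrete, the toy sketch below computes R_BERT, P_BERT, and F_BERT from two small matrices of token embeddings; the random matrices merely stand in for contextual embeddings produced by a BERT-style encoder.

# Toy illustration of Eqs. (5)-(7): greedy matching over token embeddings.
# Random matrices stand in for contextual (BERT-style) token embeddings.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 768))      # reference tokens  (|x| = 7)
x_hat = rng.normal(size=(9, 768))  # candidate tokens  (|x_hat| = 9)

# Normalize rows so a dot product equals cosine similarity.
x /= np.linalg.norm(x, axis=1, keepdims=True)
x_hat /= np.linalg.norm(x_hat, axis=1, keepdims=True)

sim = x @ x_hat.T                  # pairwise cosine similarities, shape (7, 9)

r_bert = sim.max(axis=1).mean()    # Eq. (5): each reference token to its best candidate match
p_bert = sim.max(axis=0).mean()    # Eq. (6): each candidate token to its best reference match
f_bert = 2 * p_bert * r_bert / (p_bert + r_bert)   # Eq. (7)
print(r_bert, p_bert, f_bert)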
4) BLEU: Bilingual Evaluation Understudy is a metric that computes a modified n-gram precision, combines it with weights, and applies a brevity penalty to obtain the final BLEU score [28]. The score range of BLEU is from 0 to 1; the greater the BLEU score, the better the system's performance is considered to be relative to the references. The formula for calculating BLEU can be seen in Eq. (8).

BLEU = BP · exp(Σ_{n=1}^{N} w_n log p_n)    (8)

BP represents the brevity penalty, adjusting the score to penalize candidates shorter than the reference. N denotes the maximum n-gram size considered. The n-gram precision, denoted p_n, is the ratio of n-grams in the candidate text that appear in any reference to the total number of n-grams in the candidate text. w_n represents the weight assigned to each n-gram precision score.

5) Jaccard Similarity: The Jaccard similarity quantifies the similarity between two sets of data by identifying the common and the differing members [29]. It is calculated by dividing the number of observations shared by both sets by the total number of distinct observations across the two sets. Jaccard similarity can be expressed as the ratio of the intersection (A ∩ B) to the union (A ∪ B) of two sets (see Eq. (9)).

J(A, B) = |A ∩ B| / |A ∪ B|    (9)

|A ∩ B| indicates the size of the intersection of sets A and B, and |A ∪ B| indicates the size of their union. The Jaccard similarity is bounded in the range from 0 to 1. A Jaccard similarity of 1 indicates complete identity between the sets, while a similarity of 0 implies that the sets have no common elements.
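A compact sketch of how Eq. (1), Eqs. (2)-(4), and Eq. (9) can be computed over whitespace-separated words is shown below; it illustrates the formulas rather than the evaluation code used for the reported results, since the paper does not specify which implementation produced its numbers.

# Illustrative implementations of Eq. (1) (accuracy), Eqs. (2)-(4) (ROUGE-L),
# and Eq. (9) (Jaccard similarity) over whitespace-separated words.

def lcs_length(a: list[str], b: list[str]) -> int:
    # Dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a, 1):
        for j, wb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if wa == wb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> dict:
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    recall = lcs / len(ref)                       # Eq. (2)
    precision = lcs / len(cand)                   # Eq. (3)
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0  # Eq. (4)
    return {"precision": precision, "recall": recall, "f1": f1}

def jaccard(candidate: str, reference: str) -> float:
    a, b = set(candidate.split()), set(reference.split())
    return len(a & b) / len(a | b)                # Eq. (9)

def accuracy(is_correct: list[bool]) -> float:
    return 100.0 * sum(is_correct) / len(is_correct)   # Eq. (1), as a percentage

print(rouge_l("the cat sat on the mat", "the cat is on the mat"))
print(jaccard("the cat sat on the mat", "the cat is on the mat"))
print(accuracy([True, True, False, True]))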
IV. RESULT AND DISCUSSION

A. Result

The interface of the proposed QA system can be seen in Fig. 7.

Fig. 7. Document QA system interface.


The interface of the proposed document QA system can accept multiple documents in PDF format. When the user clicks the submit button, the system processes the PDF documents and converts them to vector form with embeddings (as described in the RAG mechanism in Fig. 2). Once the document submission process is complete, the user can ask questions related to the submitted documents, and the QA system provides answers based on the source documents provided. The questions and answers generated from the user's interaction with the QA system take the form of a chatbot, so the communication history is stored.
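As an illustration of this interaction flow (not the authors' application code), a minimal Streamlit sketch could look like the following; build_qa_chain() is a hypothetical helper standing in for the RAG pipeline sketched earlier.

# Minimal Streamlit sketch of the described interface: upload PDFs, submit,
# ask questions, and keep the chat history. build_qa_chain() is hypothetical.
import streamlit as st

st.title("Document QA with RAG")

uploaded = st.file_uploader("Upload PDF documents", type="pdf",
                            accept_multiple_files=True)
if st.button("Submit") and uploaded:
    # Embed the uploaded documents and build the retrieval chain once.
    st.session_state.qa = build_qa_chain(uploaded)   # hypothetical helper
    st.session_state.history = []

question = st.text_input("Ask a question about the documents")
if question and "qa" in st.session_state:
    answer = st.session_state.qa.run(question)
    st.session_state.history.append((question, answer))

for q, a in st.session_state.get("history", []):
    st.markdown(f"**Q:** {q}")
    st.markdown(f"**A:** {a}")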
B. Accuracy

Accuracy in our system model is expressed as the percentage of correct answers within the entire answer key dataset. To assess accuracy, we calculate the ratio of the number of correct predictions to the total number of predictions [26]. The visualization of this accuracy result is shown in Fig. 8.

The accuracy comparison between the proposed QA document system and other applications reveals the superiority of our method. The proposed system achieved accuracy rates of 96% (our dataset) and 95.5% (SQuAD dev dataset), surpassing the other application's rates of 55% (our dataset) and 85.7% (SQuAD dataset). This underscores the consistently higher accuracy of our proposed method.

C. ROUGE

The ROUGE-L evaluation shows our proposed QA method outperforming other QA applications in terms of precision, recall, and F1-Score. Specifically, on our dataset, our proposed method demonstrated precision, recall, and F1-Score of 73.7%, 23.9%, and 33.7%, respectively. In comparison, other QA applications achieved lower performance metrics, with precision, recall, and F1-Score of 50.0%, 10.5%, and 15.2%, respectively. Similarly, on the SQuAD dev dataset, our proposed method excelled with precision, recall, and F1-Score reaching 85.5%, 16.2%, and 26.1%, while other QA applications reported lower scores of 77.2%, 10.4%, and 17.1%, respectively. These results underscore the superior performance of our proposed method across both datasets, as visualized in Fig. 9.

Fig. 8. Accuracy result of proposed method using RAG and other document QA application.

Fig. 9. ROUGE-L result of proposed method using RAG and other document QA application.


D. BERTScore

The BERTScore evaluation shows our proposed QA method outperforming other QA applications in terms of precision, recall, and F1-Score. Specifically, on our dataset, our proposed method demonstrated precision, recall, and F1-Score of 85.2%, 90.1%, and 87.6%, respectively. In comparison, other QA applications achieved lower performance metrics, with precision, recall, and F1-Score of 81.6%, 86.3%, and 83.8%, respectively. Similarly, on the SQuAD dev dataset, our proposed method excelled with precision, recall, and F1-Score reaching 82.8%, 87.0%, and 84.8%, while other QA applications reported lower scores of 80.4%, 86.3%, and 83.2%, respectively. These results underscore the superior performance of our proposed method across both datasets, as visualized in Fig. 10.

E. BLEU Accuracy

The BLEU score reported here is the precision value, which captures the ability of each model to extract keyword answers that match the ground truth. Specifically, on our dataset, our proposed method demonstrated a precision of 28.2%, whereas other QA applications achieved a lower precision of 9.7%. Similarly, on the SQuAD dev dataset, our proposed method excelled with a precision of 17.7%, while other QA applications reported a lower precision of 5.6%. These results underscore the superior performance of our proposed method across both datasets, as visualized in Fig. 11.

F. Jaccard Similarity

The performance of our QA system, as evaluated through Jaccard Similarity, is outstanding. Using the RAG method, our system achieved 33.3% on our dataset and 11.1% on SQuAD dev. In comparison, other QA applications scored lower, with 4.1% on our dataset and 9.1% on SQuAD dev. These results highlight our method's superiority in Jaccard Similarity on both datasets, as visualized in Fig. 12.

G. Discussion

The accuracy result of 95.5% on the SQuAD dev dataset outperforms other research reporting 61.5% accuracy on the SQuAD dev dataset [30] and 71.4% accuracy, also tested on the SQuAD dev dataset [31]. We also used the SQuAD dev dataset to test the other document QA application platform, which achieved an accuracy of 85.7%. Thus, the model proposed in this study has a higher accuracy score than other applications and previous research on the SQuAD test dataset.

Fig. 10. BERTScore result of proposed method using RAG and other document QA application.

Fig. 11. BLEU precision result of proposed method using RAG and other document QA application.


Fig. 12. Jaccard Similarity result of proposed method using RAG and other document QA application.

Our system's precision, recall, and F1-Score are 82.8%, 87%, and 84.8%, respectively, which surpass the precision of 62%, recall of 87%, and F1-Score of 67% reported in other research [32]. The proposed QA system's effectiveness is affirmed by the fact that it surpasses the recall result of other research at 42.70% [33] and outperforms other research [31], [34], [35] in terms of F1-Score, reported as 42.6% [31], 49% [34], and 70.8% [35]. This positions it as a leading solution for automatic document processing and information retrieval tasks across a wide range of domains.

Based on the results of testing the proposed model, the present study agrees with previous literature in that the RAG method, through the implementation of a hybrid model combining parametric and non-parametric models, is able to provide good results [4]. In this case we combine the LangChain and FAISS frameworks for the RAG technique, and this provides good results. The model is also combined with one of the best language models currently available, GPT-3.5, which likewise provides good results. This is a very promising performance that should be developed further.

V. CONCLUSION

Our proposed model for Question-Answering (QA) document processing integrates the Retrieval-Augmented Generation (RAG) model. The evaluation of our proposed QA system demonstrates its superiority over existing commercial applications in terms of Accuracy, ROUGE-L scores, BERTScore metrics, BLEU precision, and Jaccard Similarity. The proposed method achieved high accuracy rates of 96% and 95.5% on our dataset and the SQuAD dev dataset, respectively, outperforming other applications tested on the same datasets. Our system's precision, recall, and F1-Score metrics were superior to those of other QA applications on both datasets, as highlighted by the ROUGE-L evaluation. Additionally, the BERTScore metrics consistently showed higher precision, recall, and F1-Score for our proposed method compared to other applications. Finally, our QA system demonstrated superior performance in keyword extraction and text similarity compared to other applications, as assessed by BLEU precision and Jaccard Similarity.

VI. FUTURE WORKS

In the future, studies could be conducted to refine the architecture of the system, explore additional ways of using external data, and improve the scalability of the model for broader applications. The integration of user feedback mechanisms and continuous learning modules could contribute to the adaptability of the system and further improve its accuracy over time. In addition, exploring ways of processing documents in real time and extending the system's compatibility with different document formats could open up new opportunities for research and study.

REFERENCES

[1] F. Ganier and R. Querrec, "TIP-EXE: A Software Tool for Studying the Use and Understanding of Procedural Documents," IEEE Trans. Prof. Commun., vol. 55, no. 2, pp. 106-121, Jun. 2012, doi: 10.1109/TPC.2012.2194600.

[2] W. Yih, M.-W. Chang, X. He, and J. Gao, "Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Stroudsburg, PA, USA: Association for Computational Linguistics, 2015, pp. 1321-1331, doi: 10.3115/v1/P15-1128.

[3] Y. Hao et al., "An End-to-End Model for Question Answering over Knowledge Base with Cross-Attention Combining Global Knowledge," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Stroudsburg, PA, USA: Association for Computational Linguistics, 2017, pp. 221-231, doi: 10.18653/v1/P17-1021.

[4] P. Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," in NIPS'20: Proceedings of the 34th International Conference on Neural Information Processing Systems, May 2020, pp. 9459-9474, doi: 10.48550/arXiv.2005.11401.

[5] S. Siriwardhana, R. Weerasekera, E. Wen, T. Kaluarachchi, R. Rana, and S. Nanayakkara, "Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering," Trans. Assoc. Comput. Linguist., vol. 11, pp. 1-17, 2023, doi: 10.1162/tacl_a_00530.

[6] Y. Ahn, S.-G. Lee, J. Shim, and J. Park, "Retrieval-Augmented Response Generation for Knowledge-Grounded Conversation in the Wild," IEEE Access, vol. 10, pp. 131374-131385, 2022, doi: 10.1109/ACCESS.2022.3228964.


[7] C. Xiong, S. Merity, and R. Socher, "Dynamic Memory Networks for Visual and Textual Question Answering," in Proceedings of The 33rd International Conference on Machine Learning, Mar. 2016, pp. 2397-2406, doi: 10.48550/arXiv.1603.01417.

[8] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, "Bidirectional Attention Flow for Machine Comprehension," International Conference on Learning Representations, Nov. 2016, doi: 10.48550/arXiv.1611.01603.

[9] A. W. Yu et al., "QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension," International Conference on Learning Representations, Apr. 2018, doi: 10.48550/arXiv.1804.09541.

[10] W. Yang, Y. Xie, L. Tan, K. Xiong, M. Li, and J. Lin, "Data Augmentation for BERT Fine-Tuning in Open-Domain Question Answering," ArXiv, vol. abs/1904.06652, Apr. 2019, doi: 10.48550/arXiv.1904.06652.

[11] C. Raffel et al., "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer," Journal of Machine Learning Research, vol. 21, pp. 140:1-140:67, 2019, doi: 10.48550/arXiv.1910.10683.

[12] A. Roberts, C. Raffel, and N. Shazeer, "How Much Knowledge Can You Pack Into the Parameters of a Language Model?," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Stroudsburg, PA, USA: Association for Computational Linguistics, 2020, pp. 5418-5426, doi: 10.18653/v1/2020.emnlp-main.437.

[13] M. T. R. Laskar, M. S. Bari, M. Rahman, M. A. H. Bhuiyan, S. R. Joty, and J. Huang, "A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets," in Annual Meeting of the Association for Computational Linguistics, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:258967462

[14] W. Yu, "Retrieval-augmented Generation across Heterogeneous Knowledge," in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, Seattle, Washington: Association for Computational Linguistics, Jul. 2022, pp. 52-58, doi: 10.18653/v1/2022.naacl-srw.7.

[15] D. Thulke, N. Daheim, C. Dugast, and H. Ney, "Efficient Retrieval Augmented Generation from Unstructured Knowledge for Task-Oriented Dialog," Conference of the Association for the Advancement of Artificial Intelligence (AAAI), Feb. 2021, doi: 10.48550/arXiv.2102.04643.

[16] OpenAI, "A Survey of Techniques for Maximizing LLM Performance," Nov. 2023.

[17] Jacob Lee, "Building LLM-Powered Web Apps with Client-Side Technology." Accessed: Dec. 01, 2023. [Online]. Available: https://ollama.ai/blog/building-llm-powered-web-apps

[18] J. Johnson, M. Douze, and H. Jégou, "Billion-Scale Similarity Search with GPUs," IEEE Trans. Big Data, vol. 7, no. 3, pp. 535-547, 2021, doi: 10.1109/TBDATA.2019.2921572.

[19] J. Zhu, J. Jang-Jaccard, I. Welch, H. Al-Sahaf, and S. Camtepe, "A Ransomware Triage Approach using a Task Memory based on Meta-Transfer Learning Framework," 2022, doi: 10.48550/arXiv.2207.10242.

[20] K. M. Fitria, "DocuQA: Document Question Answering Dataset," Feb. 2024, doi: 10.6084/m9.figshare.25223990.v1.

[21] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ Questions for Machine Comprehension of Text," in Conference on Empirical Methods in Natural Language Processing, 2016. [Online]. Available: https://api.semanticscholar.org/CorpusID:11816014

[22] A. Chen, G. Stanovsky, S. Singh, and M. Gardner, "Evaluating Question Answering Evaluation," in Proceedings of the 2nd Workshop on Machine Reading for Question Answering, Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 119-124, doi: 10.18653/v1/D19-5817.

[23] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, "BERTScore: Evaluating Text Generation with BERT," International Conference on Learning Representations, Apr. 2019, doi: 10.48550/arXiv.1904.09675.

[24] B. Ojokoh and E. Adebisi, "A Review of Question Answering Systems," Journal of Web Engineering, vol. 17, no. 8, pp. 717-758, 2019, doi: 10.13052/jwe1540-9589.1785.

[25] J. Soni, N. Prabakar, and H. Upadhyay, "Behavioral Analysis of System Call Sequences Using LSTM Seq-Seq, Cosine Similarity and Jaccard Similarity for Real-Time Anomaly Detection," in 2019 International Conference on Computational Science and Computational Intelligence (CSCI), IEEE, Dec. 2019, pp. 214-219, doi: 10.1109/CSCI49370.2019.00043.

[26] J. F. Bell and A. H. Fielding, "A review of methods for the assessment of prediction errors in conservation presence/absence models," Environmental Conservation, vol. 24, no. 1, pp. 38-49, 1997, doi: 10.1017/S0376892997000088.

[27] C.-Y. Lin, "ROUGE: A Package for Automatic Evaluation of Summaries," in Text Summarization Branches Out, Association for Computational Linguistics, 2004, pp. 74-81. [Online]. Available: https://aclanthology.org/W04-1013/

[28] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "Bleu: a Method for Automatic Evaluation of Machine Translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, Jul. 2002, pp. 311-318, doi: 10.3115/1073083.1073135.

[29] N. C. Chung, B. Miasojedow, M. Startek, and A. Gambin, "Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data," BMC Bioinformatics, vol. 20, no. 15, p. 644, 2019, doi: 10.1186/s12859-019-3118-5.

[30] A. Stricker, "Question answering in Natural Language: the Special Case of Temporal Expressions," in Proceedings of the Student Research Workshop Associated with RANLP 2021, Online: INCOMA Ltd., Sep. 2021, pp. 184-192. [Online]. Available: https://aclanthology.org/2021.ranlp-srw.26

[31] S. Min, V. Zhong, R. Socher, and C. Xiong, "Efficient and Robust Question Answering from Minimal Context over Documents," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 1725-1735, doi: 10.18653/v1/P18-1160.

[32] H. Bahak, F. Taheri, Z. Zojaji, and A. Kazemi, "Evaluating ChatGPT as a Question Answering System: A Comprehensive Analysis and Comparison with Existing Models," ArXiv, vol. abs/2312.07592, Dec. 2023, doi: 10.48550/arXiv.2312.07592.

[33] T. Cakaloglu, C. Szegedy, and X. Xu, "Text Embeddings for Retrieval From a Large Knowledge Base," Research Challenges in Information Science, Oct. 2018, doi: 10.48550/arXiv.1810.10176.

[34] S. Gholami and M. Noori, "Zero-Shot Open-Book Question Answering," ArXiv, vol. abs/2111.11520, Nov. 2021, doi: 10.48550/arXiv.2111.11520.

[35] G. Nur Ahmad and A. Romadhony, "End-to-End Question Answering System for Indonesian Documents Using TF-IDF and IndoBERT," in 2023 10th International Conference on Advanced Informatics: Concept, Theory and Application (ICAICTA), 2023, pp. 1-6, doi: 10.1109/ICAICTA59291.2023.10390111.
