Document Question Answering using Large Language Model
Abstract—This study introduces the Retrieval Augmented Generation (RAG) method to improve Question-Answering (QA) systems by addressing document processing in Natural Language Processing problems. It represents the latest breakthrough in applying RAG to document question and answer applications, overcoming previous QA system obstacles. RAG combines search techniques over a vector store with the text generation mechanism of Large Language Models, offering a time-efficient alternative to the limitations of manual reading. The research evaluates RAG built on the Generative Pre-trained Transformer 3.5 (GPT-3.5-turbo) model behind ChatGPT and its impact on document data processing, comparing it with other applications. This research also provides a dataset to test the capabilities of the QA document system. The proposed dataset and the Stanford Question Answering Dataset (SQuAD) are used for performance testing. The study contributes theoretically by advancing methodologies and knowledge representation, supporting benchmarking in research communities. Results highlight RAG's superiority: achieving a precision of 0.74 in Recall-Oriented Understudy for Gisting Evaluation (ROUGE) testing, outperforming others at 0.5; obtaining an F1 score of 0.88 in BERTScore, surpassing other QA apps at 0.81; attaining a precision of 0.28 in Bilingual Evaluation Understudy (BLEU) testing, surpassing others with a precision of 0.09; and scoring 0.33 in Jaccard Similarity, outshining others at 0.04. These findings underscore RAG's efficiency and competitiveness, promising a positive impact on various industrial sectors through advanced Artificial Intelligence (AI) technology.

Keywords—Natural Language Processing; Large Language Model; Retrieval Augmented Generation; Question Answering; GPT

I. INTRODUCTION

This research proposes a new approach to the increasing reliance on articles and journal documents by introducing a Question-Answering (QA) document processing system [1]. The identification of several critical problems motivates this research; these problems are multifaceted. Firstly, manual reading and processing to comprehend document text are time-consuming, error-prone, and inefficient. Secondly, previous methods employed to modify Large Language Models (LLM) for document processing demanded substantial resources and were challenging to implement widely. Lastly, models relying solely on the capabilities of LLM for QA systems without modifications tend to generate hallucinatory answers, lacking correctness and precision. Manual processing for document understanding leads to time-consuming efforts, susceptibility to human error, and inefficient analysis processes. Based on previous methods, the use of modified Large Language Models (LLM) for document processing requires significant resources and poses challenges for widespread implementation. Also, the underutilization of the recently introduced Retrieval Augmented Generation (RAG) method, particularly for document processing within Question-Answering (QA) systems, provides an opportunity for further exploration. The motivation stems from the challenges associated with manual document processing, resource-intensive Large Language Model (LLM) modifications, and the underutilization of the Retrieval-Augmented Generation (RAG) method in the document-based question-answering domain [2], [3]. In addition, models that rely solely on LLM capabilities for QA systems without modifications tend to produce hallucinatory responses that lack accuracy and precision. Finally, the implementation of RAG in QA systems for document processing offers untapped potential to improve the ability of the system to produce accurate and non-hallucinatory responses.

Building on this line of research, this paper proposes the implementation of the Retrieval Augmented Generation model for document question answering tasks, specifically using the ChatGPT model. RAG, introduced in 2020 [4], addresses the limitations of previous methods by merging parametric and non-parametric memory. This hybrid model seamlessly integrates generative capabilities with data retrieval mechanisms, linking language models to external knowledge sources. RAG combines generative capabilities with the ability to search for data and incorporate relevant information from the knowledge base into the model. The distinct advantages of RAG lie in its ability to adapt to dynamic data, its flexibility in working with external data sources, and its ability to mitigate hallucinatory responses [5]. These characteristics make RAG particularly suitable for QA tasks on internal organizational documents by leveraging external knowledge to reduce response hallucinations [6].

The current research aims to exploit the innovative approach of RAG to construct an application capable of automatically processing external text documents. The focus of this research is to develop an application system capable of processing external document text uploaded by the user. The system automatically reads the document text, allowing users to input questions related to the document. Subsequently, the system provides answers based on the processed document text, eliminating the need for manual reading to find answers.
This comprehensive solution not only overcomes the limitations of previous methods, but also promises to significantly speed up research and study exploration in various domains.

Testing of the proposed model is performed, as in several previous QA studies, by measuring the agreement between the answers produced by the model and the ground truth of the test dataset. The metrics used to evaluate the performance of the model include Accuracy, ROUGE, BLEU, BERTScore, and Jaccard Similarity.

II. RELATED WORKS

This study examines the applicability of RAG, its impact on the document processing task, and compares it to previous methods. This research also investigates the capability of the large language model behind ChatGPT, gpt-3.5-turbo, within the framework of RAG. This work also highlights the development of Artificial Intelligence (AI) and Natural Language Processing (NLP), focusing on improving the intelligence and capabilities of applications [7], [8], [9]. Machine Learning and Deep Learning approaches, including the BERT-Base and Text-to-Text Transfer Transformer (T5) models and the RAG method, have made significant advances in QA tasks [4], [10], [11], [12]. This progress motivated the implementation of RAG for processing documents, integrated into an interactive QA system.

Between 2015 and the present, the evolution of question-answering (QA) systems shows a trajectory characterized by diverse methodologies. Starting with semantic parsing-based systems in 2015, Wen-tau Yih et al. focused on transforming natural language queries into structured logical forms, achieving a performance of 52.5% in F1-score [2]. Subsequent knowledge-based paradigms (KB-QA) by Yanchao Hao et al. in 2017 reformulated questions as predicates, achieving a performance of 42.9% [3]. Progress has been made in integrating AI technologies. Caiming Xiong's exploration of dynamic memory networks (DMN) in 2016 achieved an accuracy of 28.79% [7]. In the same year, Minjoon Seo et al.'s Bi-Directional Attention Flow (BiDAF) framework demonstrated significant performance with a 68% exact match and 77.3% F1 score, albeit with a computational time of 20 hours [8]. Adams Wei Yu et al. introduced the QANet model in 2018, with a performance of 76.2% exact match and 84.6% F1-score, within a shorter computational time of 3 hours [9]. As QA systems evolved, in 2019 Wei Yang et al. applied fine-tuning methods with data augmentation techniques, achieving remarkable results with a modified BERT-Base model: 49.2% exact match and 65.4% F1-score [10]. Colin Raffel et al. introduced the Text-to-Text Transfer Transformer (T5), with impressive performance of 63.3% exact match, 94.1% F1 score, and a peak accuracy of 93.8%, albeit with an increased parameter count of 11 billion [11]. In 2020, the focus shifted to fine-tuning pre-trained models, with Adam Roberts et al. achieving a recall performance of 34.6% using the T5 model [12]. The Retrieval-Augmented Generation (RAG) method, which combines parametric and non-parametric memory, was introduced by Patrick Lewis et al. in 2020. RAG has demonstrated its capabilities in open-domain QA tasks, overcoming previous limitations to deliver more efficient and comprehensive QA systems [4].

The Generative Pre-Trained Transformer (GPT) family of large language models was developed by OpenAI. Previous research comparing the performance of ChatGPT with other large language models such as PaLM and LLaMA on open-domain QA tasks indicates that ChatGPT consistently achieves the highest scores across various open-domain QA datasets [13]. Table I presents performance comparisons among LLMs.

TABLE I. LLM PERFORMANCE ON OPEN DOMAIN QA DATASET

Model                    TriviaQA    WebQuestions    NQ-Open
PaLM-540B (few-shot)     81.4        43.5            39.6
PaLM-540B (zero-shot)    76.9        10.6            21.2
LLaMA-65B (zero-shot)    68.2        -               23.8
ChatGPT (zero-shot)      85.9        50.5            48.1

The PolyQuery Synthesis test, which identifies multiple queries within a single-query prompt and extracts the answers to all of the questions from the model's latent representation, also shows that ChatGPT outperforms other GPT models from OpenAI (ada-001, babbage-001, curie, and davinci) in terms of accuracy [13]. Based on these evaluations, the gpt-3.5-turbo model has been selected for implementation in this research.

III. RESEARCH METHOD

This research undergoes a development phase, starting with designing the application system and integrating the APIs of ChatGPT, LangChain, and FAISS. Subsequent stages include extensive system modeling, interface testing, and data preparation using the proposed dataset and the SQuAD dataset. The testing phase, which includes a performance comparison with other applications against ground-truth metrics (ROUGE, BERTScore, BLEU, and Jaccard Similarity), guides the exploration of the capabilities of the proposed system, as shown in Fig. 1.

A. RAG Integration

Retrieval Augmented Generation (RAG) combines retrieval and generation models. It uses a Large Language Model (LLM) to generate text based on prompts and integrates information from a separate retrieval system to improve output quality and contextual relevance [14]. The mechanism involves retrieving factual content from a knowledge base via retrieval models and using generative processes to provide additional context for more accurate output [15]. External data sources are used, and their numerical representation is produced by embedding methods to ensure compatibility. As shown in Fig. 2, user queries converted into embeddings are compared with vectors from the knowledge library. Relevant context is added to the queries before they are fed into the base language model.
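To make the retrieval flow of Fig. 2 concrete, the following minimal sketch shows one way it could be implemented with the OpenAI and FAISS Python libraries. It assumes the documents have already been split into text chunks; the example chunks, the embedding model name, and the index type are illustrative assumptions rather than the exact configuration of our system.

```python
import faiss
import numpy as np
from openai import OpenAI  # assumes the openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed(texts):
    """Convert a list of texts into embedding vectors (knowledge library / query)."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")


# 1) Build the vector store from document chunks (hypothetical example chunks).
chunks = ["RAG merges parametric and non-parametric memory ...",
          "FAISS performs efficient similarity search over dense vectors ..."]
doc_vectors = embed(chunks)
index = faiss.IndexFlatL2(doc_vectors.shape[1])  # exact L2 search over embeddings
index.add(doc_vectors)

# 2) Embed the user query and retrieve the most relevant chunks.
question = "What does RAG combine?"
_, ids = index.search(embed([question]), 2)
context = "\n".join(chunks[i] for i in ids[0])

# 3) Add the retrieved context to the prompt before calling the base LLM.
answer = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
).choices[0].message.content
print(answer)
```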
The chosen methodology for QA system development is the Retrieval Augmented Generation (RAG) mechanism. Unlike previous approaches such as semantic parsing-based methods, knowledge-based methods, and fine-tuning with LSTM or other deep learning algorithms, RAG addresses shortcomings such as the difficulty of expanding or revising model memory, the inability to provide direct insight into generated predictions, and the tendency to produce hallucinative answers [12]. The solution involves the creation of a hybrid model, merging generative and retrieval models, which forms the basis of the RAG method. RAG offers advantages such as adaptive responses to dynamic data, flexibility with external data sources, and minimization of hallucinative responses [5]. Thus, RAG is chosen to construct a text document-based QA system that interacts with users through a chatbot interface. The system's workflow, implemented using RAG and supporting libraries such as LangChain and FAISS, is illustrated in Fig. 4.

The integration of the LangChain framework into the QA document system includes document loading, memory management, and prompting to connect to the LLM model. The process starts with document loading, followed by splitting the document into text chunks. These text chunks undergo word embedding, converting them into vectors stored in the vector database. Simultaneously, user-inputted text questions are embedded and converted into word vectors. The system connects these vectors to the vector database, performing a semantic search and ranking the relevance between vectors. The semantic search yields the relevant context linking questions and answers. The system retrieves the pertinent context based on the user query and sends it to the LLM (the ChatGPT model). Finally, the system receives the LLM-generated answer and delivers it to the user. The application system interacts with users, requiring an interface connecting the user and the system. Mockups, design layouts, and elements for the web application are created using the Streamlit framework, facilitating rapid development and sharing of the AI model web application. The mockup for the application system and user interaction within the system is depicted in Fig. 5.
Fig. 4. Integration of the LangChain framework in RAG for the proposed document QA system.
Fig. 5. Mockup of the application system and user interaction for the app.
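As a rough illustration of the workflow in Fig. 4, the snippet below wires the same stages together with LangChain and FAISS: loading a PDF, splitting it into chunks, embedding the chunks into a vector store, and answering questions with gpt-3.5-turbo over the retrieved context while keeping a chat history. It is a minimal sketch that assumes a LangChain release in which these import paths are valid (the 0.0.x/0.1 era) and an illustrative file name; it is not the exact code of the deployed application.

```python
# Minimal LangChain + FAISS pipeline sketch (older LangChain import paths assumed).
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

# 1) Document loading and splitting into text chunks.
pages = PyPDFLoader("example_document.pdf").load()  # hypothetical input file
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(pages)

# 2) Word embedding of the chunks and storage in the FAISS vector database.
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 3) Conversational QA chain: embeds the user question, runs a semantic search
#    over the vector store, and sends the retrieved context to gpt-3.5-turbo.
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    memory=memory,
)

print(qa_chain({"question": "What is the main finding of the document?"})["answer"])
```

The ConversationBufferMemory object is what gives the chatbot-style behaviour described above: previous questions and answers are carried into each new call, so follow-up questions can refer back to earlier turns.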
C. Proposed Dataset DocuQA

The proposed dataset, DocuQA, designed for application-based question-answering systems that process document inputs, consists of 20 diverse documents, encompassing journal articles, news reports, financial documents, and tutorials. Each document file includes five questions with corresponding ground truth answers, enabling a thorough evaluation of QA system capabilities, for a total of 100 questions in the dataset. DocuQA consists of journal documents with calculations and formulas, news documents with specific titles, financial reports and news documents with numbers and currency data, and tutorial documents with step-by-step instructions. Accuracy can be calculated from the number of correct answers out of 100, providing a metric for information extraction. The dataset aims to challenge QA systems in understanding context, identifying keywords, and efficiently extracting specific information, offering a robust evaluation tool for developers and researchers across various document and question types. The dataset can be accessed publicly [20]. Proper citation of the dataset is encouraged for research or projects using DocuQA to ensure appropriate credit is given. A preview of the DocuQA dataset can be seen in Fig. 6.

These datasets provide reliable benchmarks for assessing the system's performance across various dimensions. The testing process involves two key variables: “Predictions RAG” and “Predictions Others” represent the test results from the developed application and from comparable commercial applications, respectively. Both sets of predictions are compared to the ground truth data, which is encapsulated in the “references” variable. Different aspects of language models and question-answering systems are evaluated using different metrics. ROUGE measures the overlap in summarization [22]. BERTScore assesses semantic similarity using contextual embeddings [23]. BLEU evaluates n-gram precision [24], and Jaccard Similarity compares text similarity based on word or n-gram overlap [25]. Performance in question-answering systems is commonly assessed through accuracy, F1 score, and precision metrics, providing insight into their effectiveness. These metrics are used to quantitatively evaluate system performance and establish its superiority over existing commercial applications in document processing and information retrieval tasks.

1) Accuracy: Accuracy is defined as the proportion of correct responses out of the total number of responses. It is calculated as the percentage of correct predictions over the total number of references [26]. In essence, accuracy represents the ability of the system to provide correct answers, expressed as a percentage using the following formula (see Eq. (1)).

Accuracy = (correct predictions / all predictions) × 100%        (1)
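As a simple illustration of Eq. (1), the helper below computes the accuracy percentage of a prediction list against the ground-truth references. The normalization step (lower-casing and stripping) and the example answers are our own assumptions about how a correct answer might be judged, not the exact matching rule used in the reported experiments.

```python
def accuracy(predictions, references):
    """Eq. (1): percentage of predictions that match their reference answer."""
    normalize = lambda s: s.strip().lower()  # assumed matching rule
    correct = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references))
    return correct / len(references) * 100

# Hypothetical DocuQA-style example: ground-truth answers vs. model output.
predictions_rag = ["Jakarta", "2021", "retrieval augmented generation"]
references      = ["Jakarta", "2020", "Retrieval Augmented Generation"]
print(f"Accuracy: {accuracy(predictions_rag, references):.1f}%")  # 66.7%
```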
2) ROUGE: A score of 1 indicates total agreement between the reference and the candidate text.

3) BERTScore: BERTScore is an automatic evaluation metric for text generation tasks that evaluates the similarity of each candidate sentence token to each reference sentence token by means of contextual embeddings [23]. The embeddings in BERTScore are contextual, changing depending on the sentence context. This context awareness allows BERTScore to score semantically similar sentences despite differences in word order. For the recall calculation, each token in x is matched with the most similar token in x̂; for the precision calculation, each token in x̂ is matched with the most similar token in x. Greedy matching is used to maximize the similarity score. The values of recall (see Eq. (5)), precision (see Eq. (6)), and F1 score (see Eq. (7)) for reference x and candidate x̂ can be calculated using the following equations.

R_BERT = (1/|x|) Σ_{x_i ∈ x} max_{x̂_j ∈ x̂} x_i⊤ x̂_j        (5)

where R_BERT is the BERTScore recall, x is the reference sentence, x̂ is the candidate sentence, x_i is the embedding vector of a token in x, x̂_j is the embedding vector of a token in x̂, the sum runs over all tokens x_i ∈ x, the maximum is taken over all tokens x̂_j ∈ x̂, and x_i⊤ x̂_j is the cosine similarity of x_i and x̂_j (the embeddings are pre-normalized).

P_BERT = (1/|x̂|) Σ_{x̂_j ∈ x̂} max_{x_i ∈ x} x_i⊤ x̂_j        (6)

F_BERT = 2 · (P_BERT · R_BERT) / (P_BERT + R_BERT)        (7)

4) BLEU: The n-gram precision, denoted p_n, signifies the ratio of n-grams in the candidate text that appear in any reference translation to the total number of n-grams in the candidate text, and w_n represents the weight assigned to each n-gram precision score.

5) Jaccard Similarity: The Jaccard similarity quantifies the similarity between two sets of data by identifying their common and distinct members [29]. It can be calculated by dividing the number of observations shared by both sets by the total number of distinct observations in the two sets. Jaccard similarity can be expressed as the ratio of the intersection (A ∩ B) to the union (A ∪ B) of two sets (see Eq. (9)).

J(A, B) = |A ∩ B| / |A ∪ B|        (9)

|A ∩ B| indicates the size of the intersection of sets A and B, and |A ∪ B| indicates the size of their union. The Jaccard similarity is bounded in the range from 0 to 1. A Jaccard similarity of 1 indicates complete identity between the sets, while a similarity of 0 implies that the sets have no common elements.
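To make Eqs. (5) through (9) concrete, the sketch below computes the greedy-matching BERTScore recall, precision, and F1 from pre-computed, L2-normalized token embeddings, together with a word-level Jaccard similarity. The random embeddings are placeholders; in practice the vectors would come from a BERT-style encoder (or a package such as bert-score), as in [23].

```python
import numpy as np


def bert_score(ref_emb, cand_emb):
    """Greedy-matching BERTScore (Eqs. (5)-(7)) from normalized token embeddings."""
    sim = ref_emb @ cand_emb.T          # cosine similarities x_i^T x_hat_j
    recall = sim.max(axis=1).mean()     # Eq. (5): best match for each reference token
    precision = sim.max(axis=0).mean()  # Eq. (6): best match for each candidate token
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (7)
    return precision, recall, f1


def jaccard(reference, candidate):
    """Eq. (9): |A ∩ B| / |A ∪ B| over word sets."""
    a, b = set(reference.lower().split()), set(candidate.lower().split())
    return len(a & b) / len(a | b)


# Placeholder embeddings: 5 reference tokens and 4 candidate tokens of dimension 8.
rng = np.random.default_rng(0)
ref = rng.normal(size=(5, 8)); ref /= np.linalg.norm(ref, axis=1, keepdims=True)
cand = rng.normal(size=(4, 8)); cand /= np.linalg.norm(cand, axis=1, keepdims=True)

print(bert_score(ref, cand))
print(jaccard("the model answers document questions", "the model answers questions"))  # 0.8
```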
IV. RESULT AND DISCUSSION

A. Result

The interface of the proposed QA system can be seen in Fig. 7.
The interface of the proposed document QA system can accept multiple documents in PDF format. When the user clicks the submit button, the system processes the PDF documents and converts them to vector form with embeddings (as described in the RAG mechanism in Fig. 2). Once the document submission process is complete, the user can ask questions related to the submitted documents, and the QA system provides answers based on the source documents. The questions and answers generated from the user's interaction with the QA system are presented as a chatbot conversation, so the communication history is preserved.
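For the interface behaviour just described (multi-PDF upload, a submit step that builds the vector store, and a preserved chat history), a compact Streamlit sketch is shown below. It uses a placeholder answering function in place of the RAG pipeline; names such as answer_question are illustrative and not part of the deployed application.

```python
# Streamlit sketch of the chat-style document QA interface (illustrative only).
import streamlit as st


def answer_question(question, documents):
    """Placeholder for the RAG pipeline (embedding, FAISS search, gpt-3.5-turbo call)."""
    return f"Answer to '{question}' based on {len(documents)} uploaded document(s)."


st.title("Document Question Answering")

# Multiple PDF documents can be uploaded; Submit triggers embedding and indexing.
pdfs = st.file_uploader("Upload PDF documents", type="pdf", accept_multiple_files=True)
if st.button("Submit") and pdfs:
    st.session_state["docs"] = pdfs  # in the real system: split, embed, index
    st.session_state.setdefault("history", [])
    st.success("Documents processed.")

# Chat-style interaction that preserves the communication history.
if question := st.chat_input("Ask a question about the documents"):
    reply = answer_question(question, st.session_state.get("docs", []))
    st.session_state.setdefault("history", []).append((question, reply))

for q, a in st.session_state.get("history", []):
    st.chat_message("user").write(q)
    st.chat_message("assistant").write(a)
```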
B. Accuracy

Accuracy in our system model is expressed as the percentage of correct answers within the entire answer key dataset. To assess accuracy, we calculate the ratio of the number of correct predictions to the total number of predictions [26]. The visualization of this accuracy result is shown in Fig. 8.

The accuracy comparison between the proposed QA document system and other applications reveals the superiority of our method. The proposed system achieved accuracy rates of 96% (our dataset) and 95.5% (SQuAD dev dataset), surpassing the other application's rates of 55% (our dataset) and 85.7% (SQuAD dataset). This underscores the consistently higher accuracy of our proposed method.

C. ROUGE

The ROUGE-L evaluation shows that our proposed QA method outperforms other QA applications in terms of precision, recall, and F1-score. Specifically, on our dataset, the proposed method demonstrated precision, recall, and F1-score of 73.7%, 23.9%, and 33.7%, respectively. In comparison, other QA applications achieved lower performance, with precision, recall, and F1-score of 50.0%, 10.5%, and 15.2%, respectively. Similarly, on the SQuAD dev dataset, the proposed method excelled with precision, recall, and F1-score reaching 85.5%, 16.2%, and 26.1%, while other QA applications reported lower scores of 77.2%, 10.4%, and 17.1%, respectively. These results underscore the superior performance of our proposed method across both datasets, as visualized in Fig. 9.
Fig. 8. Accuracy result of the proposed method using RAG and other document QA applications.
Fig. 9. ROUGE-L result of the proposed method using RAG and other document QA applications.
Fig. 10. BERTScore result of the proposed method using RAG and other document QA applications.
Fig. 11. BLEU precision result of the proposed method using RAG and other document QA applications.
Fig. 12. Jaccard Similarity result of the proposed method using RAG and other document QA applications.
Our system's precision, recall, and F1-score are 82.8%, 87%, and 84.8%, respectively, which surpass the precision of 62%, recall of 87%, and F1-score of 67% reported in other research [32]. The proposed QA system's effectiveness is affirmed by the fact that it surpasses the recall result of other research at 42.70% [33] and outperforms other research [31], [34], [35] in terms of F1-score, at 42.6% [31], 49% [34], and 70.8% [35]. This positions it as a leading solution for automatic document processing and information retrieval tasks across a wide range of domains.

The test results of the proposed model agree with the previous literature: the RAG method, implemented as a hybrid model combining parametric and non-parametric memory, is able to provide good results [4]. In this work we combine the LangChain and FAISS frameworks for the RAG technique, together with one of the strongest language models currently available, GPT-3.5, and this combination performs well. This is a promising result that merits further development.

V. CONCLUSION

Our proposed model for Question-Answering (QA) document processing integrates the Retrieval-Augmented Generation (RAG) model. The evaluation of our proposed QA system demonstrates its superiority over existing commercial applications in terms of Accuracy, ROUGE-L scores, BERTScore metrics, BLEU precision, and Jaccard Similarity. The proposed method achieved high accuracy rates of 96% and 95.5% on our dataset and the SQuAD dev dataset, respectively, outperforming other applications tested on the same datasets. Our system's precision, recall, and F1-score metrics were superior to those of other QA applications on both datasets, as highlighted by the ROUGE-L evaluation. Additionally, the BERTScore metrics consistently showed higher precision, recall, and F1-score for our proposed method compared to other applications. Our QA system also demonstrated superior performance in keyword extraction and text similarity compared to other applications, as assessed by BLEU precision and Jaccard Similarity.

VI. FUTURE WORKS

In the future, studies could be conducted to refine the architecture of the system, explore additional ways of using external data, and improve the scalability of the model for broader applications. The integration of user feedback mechanisms and continuous learning modules could contribute to the adaptability of the system and further improve its accuracy over time. In addition, exploring ways of processing documents in real time and extending the system's compatibility with different document formats could open up new opportunities for research and study.

REFERENCES
[1] F. Ganier and R. Querrec, “TIP-EXE: A Software Tool for Studying the Use and Understanding of Procedural Documents,” IEEE Trans. Prof. Commun., vol. 55, no. 2, pp. 106–121, Jun. 2012, doi: 10.1109/TPC.2012.2194600.
[2] W. Yih, M.-W. Chang, X. He, and J. Gao, “Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Stroudsburg, PA, USA: Association for Computational Linguistics, 2015, pp. 1321–1331. doi: 10.3115/v1/P15-1128.
[3] Y. Hao et al., “An End-to-End Model for Question Answering over Knowledge Base with Cross-Attention Combining Global Knowledge,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Stroudsburg, PA, USA: Association for Computational Linguistics, 2017, pp. 221–231. doi: 10.18653/v1/P17-1021.
[4] P. Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” in NIPS’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, pp. 9459–9474, May 2020, doi: 10.48550/arXiv.2005.11401.
[5] S. Siriwardhana, R. Weerasekera, E. Wen, T. Kaluarachchi, R. Rana, and S. Nanayakkara, “Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering,” Trans. Assoc. Comput. Linguist., vol. 11, pp. 1–17, 2023, doi: 10.1162/tacl_a_00530.
[6] Y. Ahn, S.-G. Lee, J. Shim, and J. Park, “Retrieval-Augmented Response Generation for Knowledge-Grounded Conversation in the Wild,” IEEE Access, vol. 10, pp. 131374–131385, 2022, doi: 10.1109/ACCESS.2022.3228964.
[7] C. Xiong, S. Merity, and R. Socher, “Dynamic Memory Networks for Visual and Textual Question Answering,” in Proceedings of The 33rd International Conference on Machine Learning, pp. 2397–2406, Mar. 2016, doi: 10.48550/arXiv.1603.01417.
[8] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, “Bidirectional Attention Flow for Machine Comprehension,” International Conference on Learning Representations, Nov. 2016, doi: 10.48550/arXiv.1611.01603.
[9] A. W. Yu et al., “QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension,” International Conference on Learning Representations, Apr. 2018, doi: 10.48550/arXiv.1804.09541.
[10] W. Yang, Y. Xie, L. Tan, K. Xiong, M. Li, and J. Lin, “Data Augmentation for BERT Fine-Tuning in Open-Domain Question Answering,” ArXiv, vol. abs/1904.06652, Apr. 2019, doi: 10.48550/arXiv.1904.06652.
[11] C. Raffel et al., “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” Journal of Machine Learning Research, vol. 21, pp. 140:1–140:67, 2019, doi: 10.48550/arXiv.1910.10683.
[12] A. Roberts, C. Raffel, and N. Shazeer, “How Much Knowledge Can You Pack Into the Parameters of a Language Model?,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Stroudsburg, PA, USA: Association for Computational Linguistics, 2020, pp. 5418–5426. doi: 10.18653/v1/2020.emnlp-main.437.
[13] M. T. R. Laskar, M. S. Bari, M. Rahman, M. A. H. Bhuiyan, S. R. Joty, and J. Huang, “A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets,” in Annual Meeting of the Association for Computational Linguistics, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:258967462
[14] W. Yu, “Retrieval-augmented Generation across Heterogeneous Knowledge,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, Seattle, Washington: Association for Computational Linguistics, Jul. 2022, pp. 52–58. doi: 10.18653/v1/2022.naacl-srw.7.
[15] D. Thulke, N. Daheim, C. Dugast, and H. Ney, “Efficient Retrieval Augmented Generation from Unstructured Knowledge for Task-Oriented Dialog,” Conference of the Association for the Advancement of Artificial Intelligence (AAAI), Feb. 2021, doi: 10.48550/arXiv.2102.04643.
[16] OpenAI, “A Survey of Techniques for Maximizing LLM Performance,” Nov. 2023.
[17] J. Lee, “Building LLM-Powered Web Apps with Client-Side Technology.” Accessed: Dec. 01, 2023. [Online]. Available: https://ollama.ai/blog/building-llm-powered-web-apps
[18] J. Johnson, M. Douze, and H. Jégou, “Billion-Scale Similarity Search with GPUs,” IEEE Trans. Big Data, vol. 7, no. 3, pp. 535–547, 2021, doi: 10.1109/TBDATA.2019.2921572.
[19] J. Zhu, J. Jang-Jaccard, I. Welch, H. Al-Sahaf, and S. Camtepe, “A Ransomware Triage Approach using a Task Memory based on Meta-Transfer Learning Framework,” 2022. doi: 10.48550/arXiv.2207.10242.
[20] K. M. Fitria, “DocuQA: Document Question Answering Dataset,” Feb. 2024. doi: 10.6084/m9.figshare.25223990.v1.
[21] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “SQuAD: 100,000+ Questions for Machine Comprehension of Text,” in Conference on Empirical Methods in Natural Language Processing, 2016. [Online]. Available: https://api.semanticscholar.org/CorpusID:11816014
[22] A. Chen, G. Stanovsky, S. Singh, and M. Gardner, “Evaluating Question Answering Evaluation,” in Proceedings of the 2nd Workshop on Machine Reading for Question Answering, A. Fisch, A. Talmor, R. Jia, M. Seo, E. Choi, and D. Chen, Eds., Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 119–124. doi: 10.18653/v1/D19-5817.
[23] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating Text Generation with BERT,” International Conference on Learning Representations, Apr. 2019, doi: 10.48550/arXiv.1904.09675.
[24] B. Ojokoh and E. Adebisi, “A Review of Question Answering Systems,” Journal of Web Engineering, vol. 17, no. 8, pp. 717–758, 2019, doi: 10.13052/jwe1540-9589.1785.
[25] J. Soni, N. Prabakar, and H. Upadhyay, “Behavioral Analysis of System Call Sequences Using LSTM Seq-Seq, Cosine Similarity and Jaccard Similarity for Real-Time Anomaly Detection,” in 2019 International Conference on Computational Science and Computational Intelligence (CSCI), IEEE, Dec. 2019, pp. 214–219. doi: 10.1109/CSCI49370.2019.00043.
[26] J. F. Bell and A. H. Fielding, “A review of methods for the assessment of prediction errors in conservation presence/absence models,” Environ. Conserv., vol. 24, no. 1, pp. 38–49, 1997, doi: 10.1017/S0376892997000088.
[27] C.-Y. Lin, “ROUGE: A Package for Automatic Evaluation of Summaries,” in Text Summarization Branches Out, Association for Computational Linguistics, 2004, pp. 74–81. [Online]. Available: https://aclanthology.org/W04-1013/
[28] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a Method for Automatic Evaluation of Machine Translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, P. Isabelle, E. Charniak, and D. Lin, Eds., Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, Jul. 2002, pp. 311–318. doi: 10.3115/1073083.1073135.
[29] N. C. Chung, B. Miasojedow, M. Startek, and A. Gambin, “Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data,” BMC Bioinformatics, vol. 20, no. 15, p. 644, 2019, doi: 10.1186/s12859-019-3118-5.
[30] A. Stricker, “Question answering in Natural Language: the Special Case of Temporal Expressions,” in Proceedings of the Student Research Workshop Associated with RANLP 2021, S. Djabri, D. Gimadi, T. Mihaylova, and I. Nikolova-Koleva, Eds., Online: INCOMA Ltd., Sep. 2021, pp. 184–192. [Online]. Available: https://aclanthology.org/2021.ranlp-srw.26
[31] S. Min, V. Zhong, R. Socher, and C. Xiong, “Efficient and Robust Question Answering from Minimal Context over Documents,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), I. Gurevych and Y. Miyao, Eds., Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 1725–1735. doi: 10.18653/v1/P18-1160.
[32] H. Bahak, F. Taheri, Z. Zojaji, and A. Kazemi, “Evaluating ChatGPT as a Question Answering System: A Comprehensive Analysis and Comparison with Existing Models,” ArXiv, vol. abs/2312.07592, Dec. 2023, doi: 10.48550/arXiv.2312.07592.
[33] T. Cakaloglu, C. Szegedy, and X. Xu, “Text Embeddings for Retrieval From a Large Knowledge Base,” Research Challenges in Information Science, Oct. 2018, doi: 10.48550/arXiv.1810.10176.
[34] S. Gholami and M. Noori, “Zero-Shot Open-Book Question Answering,” ArXiv, vol. abs/2111.11520, Nov. 2021, doi: 10.48550/arXiv.2111.11520.
[35] G. Nur Ahmad and A. Romadhony, “End-to-End Question Answering System for Indonesian Documents Using TF-IDF and IndoBERT,” in 2023 10th International Conference on Advanced Informatics: Concept, Theory and Application (ICAICTA), 2023, pp. 1–6. doi: 10.1109/ICAICTA59291.2023.10390111.