
RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

Daniel Fleischer, Moshe Berchansky, Moshe Wasserblat, Peter Izsak


Intel Labs
{daniel.fleischer, moshe.berchansky, moshe.wasserblat, peter.izsak}@intel.com

arXiv:2408.02545v1 [cs.CL] 5 Aug 2024

Abstract

Implementing Retrieval-Augmented Generation (RAG) systems is inherently complex, requiring deep understanding of data, use cases, and intricate design decisions. Additionally, evaluating these systems presents significant challenges, necessitating assessment of both retrieval accuracy and generative quality through a multi-faceted approach. We introduce RAG Foundry, an open-source framework for augmenting large language models for RAG use cases. RAG Foundry integrates data creation, training, inference and evaluation into a single workflow, facilitating the creation of data-augmented datasets for training and evaluating large language models in RAG settings. This integration enables rapid prototyping and experimentation with various RAG techniques, allowing users to easily generate datasets and train RAG models using internal or specialized knowledge sources. We demonstrate the framework's effectiveness by augmenting and fine-tuning Llama-3 and Phi-3 models with diverse RAG configurations, showcasing consistent improvements across three knowledge-intensive datasets. Code is released as open source at https://github.com/IntelLabs/RAGFoundry.

Figure 1: An overview of the RAG Foundry framework: the Data Augmentation module persists RAG interactions into a dedicated dataset, which is then used for training, inference and evaluation. (The figure depicts the modules and their components: Loaders, Selectors, Retrievers, Samplers, Prompters and Caching in Data Augmentation; LoRA in Training; Answer Processor, EM, ROUGE, F1, Faithfulness and Relevancy in Evaluation.)

1 Introduction

Large Language Models (LLMs) have emerged as a transformative force in the field of AI, demonstrating an impressive ability to perform a wide range of tasks that traditionally required human intelligence (Brown et al., 2020; Kojima et al., 2022). Despite their impressive capabilities, LLMs have inherent limitations: these models can produce plausible-sounding but incorrect or nonsensical answers, struggle with factual accuracy, lack access to up-to-date information after their training cutoff and struggle to attend to relevant information in large contexts (Huang et al., 2023; Liu et al., 2023).

Retrieval-Augmented Generation (RAG) enhances LLM performance by integrating external information using retrieval mechanisms. Retrieval that leverages vast knowledge bases outside the model's own knowledge effectively addresses knowledge limitations; it can reduce hallucinations, improve the relevance of generated content, provide interpretability and can be vastly more cost-efficient (Lewis et al., 2021; Mallen et al., 2022; Gao et al., 2023; Asai et al., 2023; Borgeaud et al., 2021; Peng et al., 2023; de Jong et al., 2023). Furthermore, recent research indicates that fine-tuning LLMs for RAG can achieve state-of-the-art performance, surpassing that of larger, proprietary models (Yu et al., 2024b; Liu et al., 2024).

However, the implementation of RAG systems is inherently complex and requires a series of intricate decisions that can significantly impact the performance of the system. This process
demands a thorough understanding of the data and use case, and often, solutions do not generalize well to other domains (Barnett et al., 2024; Balaguer et al., 2024). Some key RAG design decisions include text embedding, indexing parameters, retrieval algorithms, query building, and prompt design, among other considerations beyond the LLM configuration (Wang et al., 2024). Another issue is reproducibility: achieving consistent and comparable results across runs, datasets and tasks. Variations in training data, pre-processing steps, model configurations, and hardware can lead to discrepancies in performance, making it challenging for researchers and practitioners to replicate findings and build upon previous work. Additionally, evaluating RAG systems presents a challenge due to the dual reliance on retrieval accuracy and generative quality. These systems require a sophisticated evaluation suite that accounts for the interplay among the retrieved information, the formalization of data, and the generated output (Chen et al., 2023; Yu et al., 2024a; Es et al., 2024).

We introduce RAG Foundry, an open-source Python framework for developing sophisticated retrieval-augmented LLMs for RAG use-cases. The library supports researchers and practitioners in the nuanced task of enhancing the capabilities of LLMs in RAG use cases. It is highly customizable, facilitating rapid prototyping and experimentation across all aspects of RAG, including data selection, aggregation and filtering, retrieval, text processing, document ranking, few-shot generation, prompt design using templates, fine-tuning, inference, and evaluation. To cater to the specific needs of researchers, we designed the framework to function as an end-to-end experimentation environment. The backbone of the library consists of four distinct modules: data creation, training, inference, and evaluation. Each module is encapsulated and controlled by a configuration file, ensuring compatibility between the output of one module and the input of the next. This modular approach allows each step to be isolated and independently experimented with, enabling the production of multiple outputs and the concurrent execution of numerous experiments. Evaluation can be conducted on the generated outputs as well as on any feature within the data, including retrieval, ranking, and reasoning.

To illustrate the utility of the framework, we conducted experiments involving retrieval, fine-tuning, chain-of-thought (CoT) reasoning (Wu et al., 2023) and a negative distractor-documents technique (Zhang et al., 2024). We compared two widely accepted baseline models using various enhancement methods across three knowledge-intensive question-answering tasks, demonstrating the effectiveness of RAG Foundry.

2 Related Work

There are numerous open-source tools related to the different aspects of RAG, namely inference, training and evaluation. LlamaIndex (Liu, 2022), LangChain (Chase, 2022) and Haystack (Pietsch et al., 2019) are well-known libraries for composing RAG pipelines; however, they are not focused on evaluation and their training capability is under-developed.

Hoshi et al. (2023) propose a framework for developing RAG-based LLMs; while our processing may be similar in the sense of being comprised of custom individual steps, they do not introduce any form of training. Khattab et al. (2023, 2022) present a different approach, where LLM prompting is represented as a programming language to be optimized and compiled; a rather unique and general approach that could benefit RAG but has a high level of complexity due to the abstractions introduced. Saad-Falcon et al. (2024) focus more on the evaluation aspect, creating synthetic data and training an LLM critic to evaluate the RAG system. Hsia et al. (2024) study the effect of retrieval choices on the performance of RAG; our RAG Foundry library is general and enables experimentation on all aspects of RAG: retrieval, text processing, prompt design, model selection, inference and evaluation. Recently, a concurrent work by Jin et al. (2024) proposes a RAG building framework, including some RAG implementations and datasets; we focus on extensibility, letting users define custom types of pipelines with custom components. Rau et al. (2024) present a framework sharing a similar design principle of extensibility-through-configuration; their library imposes a specific workflow structure (retriever, ranker, LLM) while our library is more general and does not impose any specific paradigm.

3 RAG Foundry

The RAG Foundry framework facilitates rapid prototyping and experimentation with various RAG settings and configurations. The library is composed of four modules: dataset creation, training, inference, and evaluation. Below, we expand on each of the modules and provide example configurations for running them.
3.1 Data Creation and Processing

The processing module facilitates the creation of context-enhanced datasets by persisting RAG interactions, which are essential for RAG-oriented training and inference (Berchansky et al., 2024; Liu et al., 2024; Yu et al., 2024b). These interactions encompass dataset loading, column normalization, data aggregation, information retrieval, template-based prompt creation, and various other forms of pre-processing. The processed data can be saved in a consistent, model-independent format, along with all associated metadata, ensuring compatibility and reproducibility across different models and experiments.

The processing module is comprised of an abstract pipeline with multiple steps, each defined by Python classes that implement specific data processing functionalities. These steps are categorized into two types:

• Global Steps: can act on the dataset as a whole, making them useful for operations such as aggregations, group-by, example filtering, join operations, and more.
• Local Steps: operate on individual examples, making them suitable for tasks such as retrieval, text processing, and field manipulation.

The modular design allows for building flexible and efficient data processes, tailored to the needs of RAG-oriented training and inference. Steps can be categorized into the following non-exclusive categories:

• Loaders: load datasets from the Hugging Face hub (https://huggingface.co/) or from local sources.
• Selectors: filter examples, shuffle datasets, and select dataset subsets.
• Retrievers: integrate information from external databases, tools, libraries and pipelines.
• Samplers: collect random examples or features from any dataset to compile few-shot or negative examples.
• Prompters: format prompts using custom templates and keyword mappings.

The processing module supports the handling of multiple datasets at once, through global dataset sharing. This feature allows each step of the pipeline to access any of the loaded datasets, enhancing flexibility and allowing for complex processing procedures. Furthermore, the module includes step caching, which caches each pipeline step locally. This improves compute efficiency and facilitates easy reproduction of results.
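To make the step abstraction concrete, below is a minimal sketch of what a custom local step might look like in Python. The base-class name, constructor arguments and process signature are illustrative assumptions rather than the library's actual interfaces; under the configuration convention shown in the listings below, such a class would be referenced by its _target_ path and receive the remaining keys as arguments.

# Illustrative sketch only: class names and signatures are assumptions,
# not the actual RAG Foundry API.
class LocalStep:
    """Hypothetical base class for steps that edit one example at a time."""

    def __init__(self, inputs, **kwargs):
        self.inputs = inputs  # name(s) of the dataset(s) the step acts upon

    def process(self, example, **datasets):
        raise NotImplementedError


class NormalizeQuestion(LocalStep):
    """Example local step: strip whitespace and enforce a trailing '?'."""

    def __init__(self, inputs, question_key="query", **kwargs):
        super().__init__(inputs, **kwargs)
        self.question_key = question_key

    def process(self, example, **datasets):
        question = example[self.question_key].strip()
        if not question.endswith("?"):
            question += "?"
        example[self.question_key] = question
        return example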
3.1.1 Example: Enhancing a Q&A Dataset

To showcase the effectiveness of the processing module, we demonstrate how to enrich a question-answering dataset with external information fetched using a retrieval pipeline, prepare few-shot examples and combine everything together using a prompt template. Listing 1 demonstrates how such a processing pipeline is defined using a YAML configuration. The main structure of the file is a list of steps, each defined by a _target_ which points to the step implementation. Each step has inputs, which is a name or list of dataset names to act upon. Other keys in a step relate to specific step logic.

name: my_pipeline
cache: true
steps:
- _target_: dataset_loaders.loaders.HFLoader
  inputs: main
  dataset_config:
    path: "Tevatron/wikipedia-trivia"
    split: train
- _target_: dataset_loaders.loaders.LocalLoader
  inputs: fewshot-data
  filename: prepared-fewshot-data.jsonl
- _target_: global_steps.sampling.ShuffleSelect
  inputs: main
  shuffle: 42
  limit: 10000
- _target_: local_steps.retrievers.HaystackRetriever
  inputs: main
  pipeline_path: configs/qdrant.yaml
  query_key: query
  docs_key: positive_passages
- _target_: global_steps.sampling.FewShot
  inputs: main
  input_dataset: fewshot-data
  k: 3
  output_key: fewshot_examples
- _target_: local_steps.prompter.TextPrompter
  inputs: main
  prompt_file: prompts/basic.txt
  output_key: my_prompt
  mapping:
    question: query
    context: positive_passages
    fewshot: fewshot_examples
    answer: answers
- _target_: global_steps.output.OutputData
  inputs: main
  file_name: TQA_train_processed.jsonl

Listing 1: Example of a dataset creation configuration. The example contains data loading, shuffling, sampling, retrieval, few-shot collection, prompt building and saving steps.

The first two steps in listing 1 load datasets from the Hugging Face hub and from a local path. The third step shuffles and selects 10k examples from the main dataset. The fourth step runs a Haystack-based (Pietsch et al., 2019) retrieval pipeline to retrieve relevant passages using questions from the loaded dataset as queries, storing them in docs_key. We note that different retrieval processes or frameworks (Liu, 2022; Chase, 2022; Lin et al., 2021) can be used in retrieval steps. The fifth step selects 3 few-shot examples from the secondary dataset, followed by a prompter step that loads a prompt template and fills in all given information according to the defined mapping dictionary. Lastly, the dataset is saved to a local path.

3.2 Training

We provide a training module to fine-tune models given the datasets created by the previous processing module. The training module relies on the well-established training framework TRL (https://github.com/huggingface/trl) and supports advanced and efficient training techniques, e.g. LoRA (Hu et al., 2021). An example of a training configuration is presented in listing 2.

model:
  _target_: ragfoundry.models.hf.HFTrain
  model_name_or_path: "microsoft/Phi-3-mini-128k-instruct"
  load_in_8bit: true
  lora:
    peft_type: "LORA"
    r: 16
    target_modules: ["qkv_proj"]
  completion_start: "<|assistant|>"

train:
  gradient_accumulation_steps: 4
  learning_rate: 2e-05
  lr_scheduler_type: "cosine"
  num_train_epochs: 1
  optim: "paged_adamw_8bit"

instruction: prompts/prompt_instructions/qa.txt
data_file: TQA_train_processed.jsonl

Listing 2: Example of a training configuration. Model and training parameters are specified, in addition to an instruction file containing the system prompt.
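For orientation, the sketch below shows roughly what a comparable LoRA fine-tuning run looks like when written directly against TRL and PEFT with the hyperparameters of listing 2. It is an illustration of the underlying libraries, not the training module's internal code; the prompt and answer field names are assumptions, and argument placement can differ between TRL versions.

from datasets import load_dataset
from peft import LoraConfig
from transformers import TrainingArguments
from trl import SFTTrainer

# Assumed schema: each processed example holds a fully built prompt and a gold answer.
dataset = load_dataset("json", data_files="TQA_train_processed.jsonl", split="train")
dataset = dataset.map(lambda ex: {"text": ex["prompt"] + ex["answer"]})

peft_config = LoraConfig(r=16, lora_alpha=16, lora_dropout=0.1,
                         target_modules=["qkv_proj"], task_type="CAUSAL_LM")

args = TrainingArguments(output_dir="phi3-rag-sft",
                         gradient_accumulation_steps=4,
                         learning_rate=2e-5,
                         lr_scheduler_type="cosine",
                         num_train_epochs=1,
                         optim="paged_adamw_8bit")

trainer = SFTTrainer(model="microsoft/Phi-3-mini-128k-instruct",
                     args=args,
                     train_dataset=dataset,
                     peft_config=peft_config,
                     dataset_text_field="text")  # newer TRL versions move this into SFTConfig
trainer.train()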

3.3 Inference

The inference module generates predictions given the processed datasets created by the processing module. Inference is conceptually separated from the evaluation step, since it is more computationally demanding than evaluation. Additionally, one can run multiple evaluations on a single, prepared inference results file. An example configuration for generating predictions given a dataset is presented in listing 3.

model:
  _target_: ragfoundry.models.hf.HFInference
  model_name_or_path: "microsoft/Phi-3-mini-128k-instruct"
  load_in_8bit: true
  instruction: prompts/prompt_instructions/qa.txt
  lora_path: /path/to/adapter
  generation:
    do_sample: false
    max_new_tokens: 50
    return_full_text: false

data_file: my-processed-data.jsonl
generated_file: model-predictions.jsonl

Listing 3: Example of an inference configuration. In addition to model and generation options, a system prompt can be defined.
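As a rough picture of what such an inference pass involves, the sketch below loads the base model with a trained LoRA adapter and greedily generates short answers for each processed example. File names and generation settings mirror listing 3; the prompt field name is an assumption, and this is an illustration rather than the module's actual implementation.

import json
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "microsoft/Phi-3-mini-128k-instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "/path/to/adapter")  # LoRA weights from the training step
model.eval()

predictions = []
with open("my-processed-data.jsonl") as f:
    for line in f:
        example = json.loads(line)
        inputs = tokenizer(example["prompt"], return_tensors="pt").to(model.device)
        with torch.no_grad():
            output = model.generate(**inputs, do_sample=False, max_new_tokens=50)
        # Keep only the newly generated tokens, mirroring return_full_text: false.
        answer = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        predictions.append({"output": answer})

with open("model-predictions.jsonl", "w") as f:
    f.writelines(json.dumps(p) + "\n" for p in predictions)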
3.4 Evaluation

The goal of the framework is augmenting LLMs for RAG. The evaluation module allows users to run collections of metrics to evaluate RAG techniques and tuning processes. The evaluation module loads the output of the inference module and runs a configurable list of metrics. Metrics are classes implemented in the library. These classes can be as simple as wrappers around other evaluation libraries, or can be implemented by the user. Local metrics can be run on individual examples, like Exact Match (EM), while Global metrics run on the entire dataset as a whole, e.g. Recall (for classification-based metrics). Metrics can use any field and metadata in the dataset, not just the input-output pairs. Some of the metrics implemented in the library include: a wrapper for the Hugging Face evaluate library, EM, F1, classification metrics, BERTScore (Zhang et al., 2019), Semantic Similarity and a wrapper for DeepEval (https://github.com/confident-ai/deepeval), used for the RAGAS metrics (Es et al., 2024). After the evaluation is completed, a results file is written to disk with the local and global metrics results.
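To illustrate the local/global distinction, here is a minimal sketch of the two kinds of metrics in Python; the class layout and method names are illustrative assumptions, not the library's actual interfaces.

import re

def normalize(text):
    # Lowercase, drop punctuation and collapse whitespace before comparing.
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())

class ExactMatch:
    """Local metric: scored per example, then averaged over the dataset."""

    def measure(self, prediction, answers):
        return float(any(normalize(prediction) == normalize(a) for a in answers))

class Recall:
    """Global metric for classification-style tasks: computed over the whole dataset."""

    def measure_all(self, predictions, labels, positive="yes"):
        true_pos = sum(1 for p, l in zip(predictions, labels)
                       if normalize(l) == positive and normalize(p) == positive)
        positives = sum(1 for l in labels if normalize(l) == positive)
        return true_pos / positives if positives else 0.0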
Furthermore, the evaluation module uses a processing step called an Answer Processor, which can implement custom logic and serve many purposes, including cleaning and aligning outputs; for example, using regex, one can isolate answers, remove stop words, strip chain-of-thought reasoning, define a stopping criterion, process citations and attributions, and perform any other form of processing needed for a given evaluation. See listing 4 for a configuration example; it contains an answer processor that extracts an answer from an output, and a list of metrics to run.

answer_processor:
  _target_: ragfoundry.processing.RegexAnswer
  capture_pattern: "Answer: (.*)"
  stopping_pattern:

metrics:
- _target_: ragfoundry.evaluation.HFEvaluate
  metric_names: ["rouge"]
- _target_: ragfoundry.evaluation.EM
- _target_: ragfoundry.evaluation.F1
- _target_: ragfoundry.evaluation.BERTScore
  model: "microsoft/deberta-large-mnli"
- _target_: ragfoundry.evaluation.Faithfulness
- _target_: ragfoundry.evaluation.Relevancy
  embeddings: "BAAI/bge-small-en-v1.5"

results_file: my-evaluation.yaml
generated_file: model-prediction.jsonl
data_file: my-processed-data.jsonl

Listing 4: Example of an evaluation configuration; it contains an answer processor, as well as the list of metrics, with optional parameters, to run.
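As an illustration of the kind of logic the RegexAnswer processor in listing 4 configures, the sketch below applies a capture pattern and an optional stopping pattern to a raw generation. It is a simplified stand-in for such a processor, not the library's implementation.

import re

def extract_answer(output, capture_pattern=r"Answer: (.*)", stopping_pattern=None):
    """Isolate the final answer from a raw, possibly chain-of-thought, generation."""
    text = output
    if stopping_pattern:
        # Discard everything after the first occurrence of the stopping pattern.
        text = re.split(stopping_pattern, text, maxsplit=1)[0]
    match = re.search(capture_pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else text.strip()

# Example: strip the reasoning produced by the CoT template in appendix A.1.
raw = "The context says ##begin_quote##...##end_quote##.\n<ANSWER>: Paris"
print(extract_answer(raw, capture_pattern=r"<ANSWER>:\s*(.*)"))  # prints "Paris"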
4 Experiments: RAG Tuning

To illustrate the usage and usefulness of the RAG Foundry library, we experiment with several possible RAG improvements to LLMs, and evaluate the results on three knowledge-intensive tasks.

4.1 RAG Augmentation Techniques

We explore several techniques for RAG augmentation, and use RAG Foundry to easily implement and evaluate their benefit. As an initial step, we evaluate unmodified models; we set Baseline as a configuration defined by running unmodified models without any external knowledge. We define a RAG setting that introduces top-relevant documents in a consistent prompt template format with a system instruction, and a CoT scheme which guides the model to use the retrieved context, explain the steps, quote relevant parts and produce a final answer. Complementing that, we explore fine-tuning recipes. We fine-tune the model in the RAG setup and denote it as RAG-sft. To complement CoT, we implemented a fine-tuning recipe, denoted as CoT-sft, introduced in Zhang et al. (2024), where gold documents and purely distractor documents are used in the prompt, determined by probability, in conjunction with a CoT prompt. All prompt templates are included in appendix A.1.
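To make the CoT-sft data recipe concrete, the sketch below shows one way the document list for a training example could be assembled: with some probability the gold passages are kept and padded with distractors, otherwise only distractors are used. The probability and context size here are illustrative placeholders, not the values used in our experiments.

import random

def build_cot_sft_context(gold_docs, distractor_docs, p_gold=0.8, k=5, rng=random):
    """Assemble the passages shown to the model for one CoT-sft training example."""
    if rng.random() < p_gold:
        # Keep the gold passages and pad the context with distractors.
        docs = list(gold_docs) + list(distractor_docs)[: max(0, k - len(gold_docs))]
    else:
        # Context made purely of distractor passages.
        docs = list(distractor_docs)[:k]
    docs = docs[:k]
    rng.shuffle(docs)
    return docs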
4.2 Datasets

We evaluate our models on TriviaQA (Joshi et al., 2017), PubmedQA (Jin et al., 2019), and ASQA (Stelmakh et al., 2022), which are knowledge-intensive question-answering datasets that benefit from external sources. The TriviaQA and PubmedQA datasets contain relevant context; for ASQA, retrieval was done over a Wikipedia corpus using a dense retriever (BAAI/llm-embedder). Dataset sources and sizes are included in appendix A.2.

4.3 Models

We experiment with two representative models: Llama-3 (meta-llama/Meta-Llama-3-8B-Instruct; Touvron et al., 2023; AI@Meta, 2024) and Phi-3 (microsoft/Phi-3-mini-128k-instruct; Abdin et al., 2024), as they represent robust capabilities and are ideal candidate models for RAG use case deployments.

4.4 Evaluation

We measure and report Exact Match (EM) for TriviaQA, STR-EM for ASQA, and accuracy and F1 for PubmedQA. Additionally, we evaluate two RAGAS metrics (Es et al., 2024): Faithfulness and Relevancy. Faithfulness measures the relation between the generated text and the context. Relevancy measures the relation between the generated text and the query. These two metrics use the context as input for the LLM critic, so they are only relevant in the RAG settings. The critic LLM used is GPT4-32k, version 0613. An embedder (BAAI/bge-small-en-v1.5) is required for the relevancy evaluation.
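For intuition about the critic-based metrics, the sketch below outlines one common way a faithfulness score can be computed: the critic decomposes the answer into claims and judges each claim against the retrieved context, and the score is the supported fraction. This mirrors the general idea behind the RAGAS metric but is only a conceptual sketch, not its implementation; call_critic is a placeholder for whatever LLM client is available.

CLAIM_PROMPT = ("Break the following answer into short, standalone factual claims, "
                "one per line.\nAnswer: {answer}")
VERDICT_PROMPT = ("Context:\n{context}\n\nClaim: {claim}\n"
                  "Can this claim be inferred from the context alone? Reply Yes or No.")

def faithfulness(answer, context, call_critic):
    """Fraction of the answer's claims that the critic judges as supported by the context."""
    raw_claims = call_critic(CLAIM_PROMPT.format(answer=answer)).splitlines()
    claims = [c.strip() for c in raw_claims if c.strip()]
    if not claims:
        return 0.0
    supported = sum(
        call_critic(VERDICT_PROMPT.format(context=context, claim=c)).strip().lower().startswith("yes")
        for c in claims
    )
    return supported / len(claims)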
4.5 Results

We present a comparative study of RAG augmentation techniques on the TriviaQA, ASQA and PubmedQA datasets. Results are presented in table 1: the main metrics for each dataset are displayed, as well as faithfulness and relevancy scores, as defined in Es et al. (2024).

Model        Method    | TriviaQA               | ASQA                     | PubmedQA
                       | EM      Faith.  Rel.   | STR-EM  Faith.  Rel.     | Acc     F1      Faith.  Rel.
Phi-3 3.8B   Baseline  | 0.630   -       -      | 0.109   -       -        | 0.476   0.290   -       -
             RAG       | 0.876   0.821   0.836  | 0.294   0.685   0.895    | 0.530   0.281   -       -
             RAG-sft   | 0.878   0.777   0.750  | 0.252   0.717   0.833    | 0.720*  0.491*  -       -
             CoT       | 0.923*  0.555   0.741  | 0.367   0.263   0.826    | 0.574   0.439   0.477   0.705
             CoT-sft   | 0.795   0.793   0.749  | 0.386*  0.749   0.839    | 0.620   0.458   0.631   0.853
Llama-3 8B   Baseline  | 0.722   -       -      | 0.200   -       -        | 0.560   0.366   -       -
             RAG       | 0.828   0.783   0.746  | 0.285   0.610   0.861    | 0.556   0.398   -       -
             RAG-sft   | 0.916*  0.704   0.714  | 0.291   0.653   0.854    | 0.770*  0.537*  -       -
             CoT       | 0.896   0.518   0.764  | 0.395   0.536   0.730    | 0.684   0.480   0.378   0.732
             CoT-sft   | 0.851   0.808   0.697  | 0.422*  0.768   0.790    | 0.694   0.485   0.777   0.883

Table 1: Evaluation results of baseline and different RAG settings, for the three datasets and two models tested. In addition to the main metrics for each dataset, faithfulness and relevancy are reported for the relevant configurations. The best configurations per dataset, based on the main metrics, are marked with *.

For TriviaQA we observe the following: retrieved context improves the results, fine-tuning the RAG setting improves the results further, and fine-tuning on CoT reasoning (which includes training on a combination of gold passages and distractor passages) decreases performance. The best method is model dependent for this dataset. For ASQA, we similarly observe that every method improves upon the baseline; CoT reasoning produces a consistent improvement in both models, as does fine-tuning the CoT configuration, which performs best. Finally, for PubmedQA, we observe that almost all methods improve upon the baseline (with one exception); CoT reasoning improves upon the untrained RAG setting, but after fine-tuning, the RAG method appears to perform best in both models.

Inspecting the faithfulness and relevancy scores, we notice that not all configurations are valid to be measured: these metrics require context, so they are irrelevant for the baseline method. Additionally, in the PubmedQA dataset the answers are binary Yes/No; only in the CoT configurations do the LLMs produce reasoning, which can be evaluated. Finally, the faithfulness and relevancy scores often do not correlate with the main metrics, nor with each other, possibly indicating that they capture different aspects of the retrieval and generated results and represent a trade-off in performance.

The results demonstrate the usefulness of RAG techniques for improving performance, as well as the need to carefully evaluate different aspects of a RAG system on a diverse set of datasets, as the effort to develop generalized techniques is ongoing.

5 Conclusion

We introduced RAG Foundry, an open-source library dedicated to the task of RAG augmentation of LLMs, namely fine-tuning LLMs to become better at RAG settings. The library is designed to serve as an end-to-end experimentation environment, enabling users to quickly prototype and experiment with different RAG techniques. We demonstrated the usefulness of the library by augmenting two models with RAG configurations and evaluating them on three Q&A datasets, showing the benefit of RAG techniques, as well as of using multi-aspect metrics relevant for RAG system evaluation.

Limitations and Future Plans

Our hope is that the library will be useful to as many people and use-cases as possible. However, due to time and resource constraints, we were able to demonstrate its usefulness only on a subset of tasks and datasets. Future work can expand the evaluation to other tasks, as well as implement other RAG techniques and evaluations.

Although we designed the library to be general and customizable, there might be specific workflows which will be difficult to run as-is and some code changes may be required. The library proved useful for our own research projects on a diverse set of datasets and tasks, and extending it is easy and straightforward.

Finally, despite our best efforts to offer detailed documentation in the library, there could be some missing details regarding some functionality or specific use-cases. The code repository will accept suggestions, bug-fixes and pull requests.
Ethics Statement

In conducting our research, we strive to abide by the highest ethical standards, including integrity, fairness, and societal benefit of our work. We prioritized data privacy and security throughout our research; any data used in our experiments was publicly available and did not contain any private information. We are committed to the principles of transparency and reproducibility; the methodologies, including data pre-processing, model training, and evaluation, are documented in order to enable others to replicate our findings. Code is made available in an open repository. We advocate for the responsible use of LLMs and RAG augmentation. It is essential to exercise caution and verify the accuracy and reliability of generated text produced by LLMs. Hallucinations can have negative implications, and even though RAG methods can ameliorate some of these aspects, verification and inspection are needed.
References

Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Qin Cai, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Yen-Chun Chen, Yi-Ling Chen, Parul Chopra, Xiyang Dai, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Victor Fragoso, Dan Iter, Mei Gao, Min Gao, Jianfeng Gao, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, Ce Liu, Mengchen Liu, Weishung Liu, Eric Lin, Zeqi Lin, Chong Luo, Piyush Madan, Matt Mazzola, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini, Xin Wang, Lijuan Wang, Chunyu Wang, Yu Wang, Rachel Ward, Guanhua Wang, Philipp Witte, Haiping Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu, Sonali Yadav, Fan Yang, Jianwei Yang, Ziyi Yang, Yifan Yang, Donghan Yu, Lu Yuan, Chengruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, and Xiren Zhou. 2024. Phi-3 technical report: A highly capable language model locally on your phone. Preprint, arXiv:2404.14219.
AI@Meta. 2024. Llama 3 model card.
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-rag: Learning to retrieve, generate, and critique through self-reflection. Preprint, arXiv:2310.11511.
Angels Balaguer, Vinamra Benara, Renato Luiz de Freitas Cunha, Roberto de M. Estevão Filho, Todd Hendry, Daniel Holstein, Jennifer Marsman, Nick Mecklenburg, Sara Malvar, Leonardo O. Nunes, Rafael Padilha, Morris Sharp, Bruno Silva, Swati Sharma, Vijay Aski, and Ranveer Chandra. 2024. RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture. arXiv preprint. ArXiv:2401.08406 [cs].
Scott Barnett, Stefanus Kurniawan, Srikanth Thudumu, Zach Brannelly, and Mohamed Abdelrazek. 2024. Seven failure points when engineering a retrieval augmented generation system. Preprint, arXiv:2401.05856.
Moshe Berchansky, Daniel Fleischer, Moshe Wasserblat, and Peter Izsak. 2024. Cotar: Chain-of-thought attribution reasoning with multi-level granularity. Preprint, arXiv:2404.10513.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, T. W. Hennigan, Saffron Huang, Lorenzo Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and L. Sifre. 2021. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv preprint. ArXiv:2005.14165 [cs].
Harrison Chase. 2022. LangChain.
Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. 2023. Benchmarking Large Language Models in Retrieval-Augmented Generation. arXiv.
Michiel de Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Joshua Ainslie, Sumit Sanghai, Fei Sha, and William Cohen. 2023. Pre-computed memory or on-the-fly encoding? A hybrid approach to retrieval augmentation makes the most of your compute. Publisher: arXiv Version Number: 2.
Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. 2024. RAGAs: Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150–158, St. Julians, Malta. Association for Computational Linguistics.
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint. ArXiv:2312.10997 [cs].
Yasuto Hoshi, Daisuke Miyashita, Youyang Ng, Kento Tatsuno, Yasuhiro Morioka, Osamu Torii, and Jun Deguchi. 2023. RaLLe: A Framework for Developing and Evaluating Retrieval-Augmented Large Language Models. arXiv preprint.
Jennifer Hsia, Afreen Shaikh, Zhiruo Wang, and Graham Neubig. 2024. RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems. arXiv preprint. ArXiv:2403.09040 [cs].
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint. ArXiv:2106.09685 [cs].
Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2023. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv preprint. ArXiv:2311.05232 [cs].
Jiajie Jin, Yutao Zhu, Xinyu Yang, Chenghao Zhang, and Zhicheng Dou. 2024. FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research.
Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, and Xinghua Lu. 2019. PubMedQA: A Dataset for Biomedical Research Question Answering. arXiv preprint. ArXiv:1909.06146 [cs, q-bio].
Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. arXiv preprint. ArXiv:1705.03551 [cs].
Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv preprint arXiv:2212.14024.
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2023. Dspy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714.
Takeshi Kojima, S. Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large Language Models are Zero-Shot Reasoners. ArXiv.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv preprint.
Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. 2021. Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), pages 2356–2362.
Jerry Liu. 2022. LlamaIndex.
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the middle: How language models use long contexts. Preprint, arXiv:2307.03172.
Zihan Liu, Wei Ping, Rajarshi Roy, Peng Xu, Chankyu Lee, Mohammad Shoeybi, and Bryan Catanzaro. 2024. ChatQA: Surpassing GPT-4 on Conversational QA and RAG. arXiv preprint. ArXiv:2401.10225 [cs].
Alex Troy Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi. 2022. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Annual Meeting of the Association for Computational Linguistics.
Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, and Jianfeng Gao. 2023. Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback. Publisher: arXiv Version Number: 3.
Malte Pietsch, Timo Möller, Bogdan Kostic, Julian Risch, Massimiliano Pippi, Mayank Jobanputra, Sara Zanzottera, Silvano Cerza, Vladimir Blagojevic, Thomas Stadelmann, Tanay Soni, and Sebastian Lee. 2019. Haystack: the end-to-end NLP framework for pragmatic builders.
David Rau, Hervé Déjean, Nadezhda Chirkova, Thibault Formal, Shuai Wang, Vassilina Nikoulina, and S. Clinchant. 2024. BERGEN: A Benchmarking Library for Retrieval-Augmented Generation.
Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. 2024. ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. arXiv preprint. ArXiv:2311.09476 [cs].
Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. 2022. ASQA: Factoid Questions Meet Long-Form Answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8273–8288, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models. Preprint, arXiv:2302.13971.
Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, and Xuanjing Huang. 2024. Searching for Best Practices in Retrieval-Augmented Generation. arXiv preprint.
Dingjun Wu, Jing Zhang, and Xinmei Huang. 2023. Chain of thought prompting elicits knowledge augmentation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6519–6534, Toronto, Canada. Association for Computational Linguistics.
Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, and Zhaofeng Liu. 2024a. Evaluation of Retrieval-Augmented Generation: A Survey. arXiv preprint. ArXiv:2405.07437 [cs].
Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, and Bryan Catanzaro. 2024b. RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs. arXiv preprint. ArXiv:2407.02485 [cs].
Tianjun Zhang, Shishir G. Patil, Naman Jain, Sheng Shen, Matei Zaharia, Ion Stoica, and Joseph E. Gonzalez. 2024. Raft: Adapting language model to domain specific rag.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating Text Generation with BERT. ArXiv.
A Implementation Details
A.1 Prompts

You are a helpful question answerer who can provide an answer given a question and relevant context.

Listing 5: System instruction used in the experiments.

Question: {query}
Context: {docs}

Listing 6: Template for inserting relevant documents as context.

Question: {query}
Context: {docs}

Answer this question using the information given in the context above. Here is things to pay attention to:
- First provide step-by-step reasoning on how to answer the question.
- In the reasoning, if you need to copy paste some sentences from the context, include them in
##begin_quote## and ##end_quote##. This would mean that things outside of ##begin_quote## and
##end_quote## are not directly copy paste from the context.
- End your response with final answer in the form <ANSWER>: $answer, the answer should be succinct.

Listing 7: Template for Chain-of-Thought reasoning.

A.2 Datasets
Datasets used:
• TriviaQA
• ASQA
• PubmedQA
Context size was k = 5, unless indicated otherwise.
Dataset sizes are:
Dataset Training Evaluation
TriviaQA 6000 1000
ASQA 4353 948
PubmedQA 10000 500

A.3 Training Details


Parameter Value
LoRA r 16
LoRA α 16
LoRA Dropout 0.1
LoRA Bias None
LoRA Modules qkv_proj (Phi-3); q_proj, v_proj (Llama-3)
LR 1e-4
LR Scheduler cosine
Warmup Ratio 0.03
Weight Decay 0.001
Batch Size 1
Epochs 1
