RAG Foundry
arXiv:2408.02545v1 [cs.CL] 5 Aug 2024

Abstract

Implementing Retrieval-Augmented Generation (RAG) systems is inherently complex, […]
[…]tion fetched using a retrieval pipeline, prepare few-shot examples and combine everything together using a prompt template. Listing 1 demonstrates how such a processing pipeline is defined using a YAML configuration. The main structure of the file is a list of steps, each defined by a _target_ which points to the step implementation. Each step has inputs, which is a name or a list of dataset names to act upon. Other keys in a step relate to the specific step logic.

The first two steps in listing 1 load datasets from the Hugging Face hub and from a local path. The third step shuffles and selects 10k examples from the main dataset. The fourth step runs a Haystack-based (Pietsch et al., 2019) retrieval pipeline to retrieve relevant passages, using questions from the loaded dataset as queries and storing them in docs_key. We note that different retrieval processes or frameworks (Liu, 2022; Chase, 2022; Lin et al., 2021) can be used in retrieval steps. The fifth step selects 3 few-shot examples from the secondary dataset, followed by a prompt generator step that loads a prompt template and replaces all given information according to the defined mapping dictionary. Lastly, the dataset is saved to a local path.
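Listing 1 itself is not reproduced in this excerpt. The sketch below illustrates the structure described above: a list of steps, each with a _target_ pointing to a step implementation and an inputs key naming the dataset(s) to act upon. Apart from _target_, inputs and docs_key, which are named in the text, every class name, key and value here is an illustrative assumption rather than the library's documented API.

# Hypothetical processing pipeline in the style described above;
# the file is a list of steps, as in the text.
- _target_: ragfoundry.processing.LoadHFDataset       # assumed loader class
  inputs: main
  dataset: trivia_qa                                   # assumed dataset id
- _target_: ragfoundry.processing.LoadLocalDataset     # assumed loader class
  inputs: fewshot
  path: my-fewshot-data.jsonl
- _target_: ragfoundry.processing.ShuffleSelect        # assumed sampling step
  inputs: main
  limit: 10000
- _target_: ragfoundry.processing.HaystackRetriever    # assumed retrieval step
  inputs: main
  query_key: query
  docs_key: positive_passages
- _target_: ragfoundry.processing.FewShotSampler       # assumed few-shot step
  inputs: main
  from_dataset: fewshot
  k: 3
- _target_: ragfoundry.processing.PromptGenerator      # assumed prompt step
  inputs: main
  prompt_file: prompts/qa.txt
  mapping:
    question: query
    context: positive_passages
- _target_: ragfoundry.processing.OutputData           # assumed save step
  inputs: main
  filename: my-processed-data.jsonl

The file name my-processed-data.jsonl is reused here only so that it lines up with the data_file entry in listing 4 below.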
3.2 Training

We provide a training module to fine-tune models given the datasets created by the previous processing module. The training module relies on the well-established training framework TRL [2] and supports […]

[2] https://github.com/huggingface/trl
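The training configuration referenced by this module is not part of this excerpt. The following is a rough sketch of what a TRL-based fine-tuning configuration could look like; every key and value (the model name, the LoRA hyperparameters, the file names) is an assumption for illustration, not the library's documented schema.

# Hypothetical training configuration; all keys and values are assumptions.
model:
  model_name_or_path: microsoft/Phi-3-mini-128k-instruct   # one of the models used in section 4.3
  lora:                                                     # assumed PEFT/LoRA settings
    r: 16
    lora_alpha: 32
train:
  output_dir: ./rag-sft-checkpoint
  num_train_epochs: 1
  learning_rate: 1.0e-4
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 4
data_file: my-processed-data.jsonl      # output of the processing module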
3.3 Inference

The inference module generates predictions given the processed datasets created by the processing module. Inference is conceptually separated from the evaluation step, since it is more computationally demanding than evaluation. Additionally, one can run multiple evaluations on a single, prepared inference results file. An example configuration for generating predictions given a dataset is presented in listing 3.
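Listing 3 is not included in this excerpt. A minimal sketch of an inference configuration, kept consistent with the file names that appear in listing 4 below, might look as follows; the model and generation keys are assumptions for illustration.

# Hypothetical inference configuration; only the file names echo listing 4.
model: meta-llama/Meta-Llama-3-8B-Instruct    # one of the models used in section 4.3
max_new_tokens: 128                           # assumed generation setting
data_file: my-processed-data.jsonl            # produced by the processing module
generated_file: model-prediction.jsonl        # consumed later by the evaluation module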
3.4 Evaluation

The goal of the framework is augmenting LLMs for RAG. The evaluation module allows users to run collections of metrics to evaluate RAG techniques and tuning processes. The evaluation module loads the output of the inference module and runs a configurable list of metrics. Metrics are classes implemented in the library. These classes can be as simple as wrappers around other evaluation libraries, or can be implemented by the user. Local metrics can be run on individual examples, like Exact Match (EM), while Global metrics run on the entire dataset as a whole, e.g. Recall (for classification-based metrics). Metrics can use any field and metadata in the dataset, not just the input-output pairs. Some of the metrics implemented in the library include: a wrapper for the Hugging Face evaluate library, EM, F1, classification metrics, BERTScore (Zhang et al., 2019), Semantic Similarity and a wrapper for DeepEval [3] (for using the RAGAS metrics (Es et al., 2024)). After the evaluation is completed, a results file is written to disk with the local and global metrics results.

Furthermore, the evaluation module uses a processing step called an Answer Processor, which can implement custom logic and serve many purposes, including cleaning and aligning outputs; for example, using regex, one can isolate answers, remove stop words or chain-of-thought reasoning, define a stopping criterion, process citations and attributions, and perform any other form of processing needed for a given evaluation.

See listing 4 for a configuration example; it contains an answer processor that extracts an answer from an output, and a list of metrics to run.

[3] https://github.com/confident-ai/deepeval

answer_processor:
  _target_: ragfoundry.processing.RegexAnswer
  capture_pattern: "Answer: (.*)"
  stopping_pattern:

metrics:
  - _target_: ragfoundry.evaluation.HFEvaluate
    metric_names: ["rouge"]
  - _target_: ragfoundry.evaluation.EM
  - _target_: ragfoundry.evaluation.F1
  - _target_: ragfoundry.evaluation.BERTScore
    model: "microsoft/deberta-large-mnli"
  - _target_: ragfoundry.evaluation.Faithfulness
  - _target_: ragfoundry.evaluation.Relevancy
    embeddings: "BAAI/bge-small-en-v1.5"

results_file: my-evaluation.yaml
generated_file: model-prediction.jsonl
data_file: my-processed-data.jsonl

Listing 4: Example of an evaluation configuration; it contains an answer processor, as well as the list of metrics, with optional parameters, to run.
4 Experiments: RAG Tuning

To illustrate the usage and usefulness of the RAG Foundry library, we experiment with several possible RAG improvements to LLMs, and evaluate the results on three knowledge-intensive tasks.

4.1 RAG Augmentation Techniques

We explore several techniques for RAG augmentation, and use RAG Foundry to easily implement and evaluate their benefit. As an initial step, we evaluate unmodified models; we define Baseline as a configuration in which unmodified models are run without any external knowledge. We define a RAG setting that introduces top-relevant documents in a consistent prompt template format with a system instruction, and a CoT scheme which guides the model to use the retrieved context, explain the steps, quote relevant parts and produce a final answer. Complementing that, we explore fine-tuning recipes. We fine-tune the model in the RAG setup and denote it as RAG-sft. To complement CoT, we implemented a fine-tuning recipe, denoted as CoT-sft, introduced in (Zhang et al., 2024), where gold documents and purely distractor documents are used in the prompt, determined by probability, in conjunction with a CoT prompt. All prompt templates are included in appendix A.1.

4.2 Datasets

We evaluate our models on TriviaQA (Joshi et al., 2017), PubmedQA (Jin et al., 2019), and ASQA (Stelmakh et al., 2022), which are knowledge-intensive question-answering datasets that benefit from external sources. The TriviaQA and PubmedQA datasets contain relevant context; for ASQA, retrieval was done over a Wikipedia corpus using a dense retriever [4]. Dataset sources and sizes are included in appendix A.2.

4.3 Models

We experiment with two representative models: Llama-3 [5] (Touvron et al., 2023; AI@Meta, 2024) and Phi-3 [6] (Abdin et al., 2024), as they represent robust capabilities and are ideal candidate models for RAG use-case deployments.

4.4 Evaluation

We measure and report Exact Match (EM) for TriviaQA, STR-EM for ASQA, and accuracy and F1 for PubmedQA. Additionally, we evaluate two RAGAS metrics (Es et al., 2024): Faithfulness and Relevancy. Faithfulness measures the relation between the generated text and the context. Relevancy measures the relation between the generated text and the query. These two metrics use the context as input for the LLM critic, so they are only relevant in the RAG settings. The critic LLM used is GPT4-32k, version 0613. An embedder [7] is required for the relevancy evaluation.
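As a concrete illustration of how this protocol maps back onto the evaluation module of section 3.4, an ASQA run could use a metrics configuration in the style of listing 4. The STREM class name and the results file name below are assumptions; only RegexAnswer, Faithfulness, Relevancy and the embedder follow listing 4 and footnote 7.

# Hypothetical evaluation configuration for the ASQA setting;
# STREM is an assumed class name for the STR-EM metric described above.
answer_processor:
  _target_: ragfoundry.processing.RegexAnswer
  capture_pattern: "Answer: (.*)"
  stopping_pattern:
metrics:
  - _target_: ragfoundry.evaluation.STREM          # assumed metric class
  - _target_: ragfoundry.evaluation.Faithfulness
  - _target_: ragfoundry.evaluation.Relevancy
    embeddings: "BAAI/bge-small-en-v1.5"
results_file: asqa-evaluation.yaml                 # assumed output name
generated_file: model-prediction.jsonl
data_file: my-processed-data.jsonl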
4.5 Results

We present a comparative study of RAG augmentation techniques on the TriviaQA, ASQA and PubmedQA datasets. Results are presented in table 1.

[4] BAAI/llm-embedder
[5] meta-llama/Meta-Llama-3-8B-Instruct
[6] microsoft/Phi-3-mini-128k-instruct
[7] BAAI/bge-small-en-v1.5
Model        Method     | TriviaQA             | ASQA                  | PubmedQA
                        | EM     Faith.  Rel.  | STR-EM  Faith.  Rel.  | Acc    F1     Faith.  Rel.
Phi-3 3.8B   Baseline   | 0.630  -       -     | 0.109   -       -     | 0.476  0.290  -       -
             RAG        | 0.876  0.821   0.836 | 0.294   0.685   0.895 | 0.530  0.281  -       -
             RAG-sft    | 0.878  0.777   0.750 | 0.252   0.717   0.833 | 0.720  0.491  -       -
             CoT        | 0.923  0.555   0.741 | 0.367   0.263   0.826 | 0.574  0.439  0.477   0.705
             CoT-sft    | 0.795  0.793   0.749 | 0.386   0.749   0.839 | 0.620  0.458  0.631   0.853
Llama-3 8B   Baseline   | 0.722  -       -     | 0.200   -       -     | 0.560  0.366  -       -
             RAG        | 0.828  0.783   0.746 | 0.285   0.610   0.861 | 0.556  0.398  -       -
             RAG-sft    | 0.916  0.704   0.714 | 0.291   0.653   0.854 | 0.770  0.537  -       -
             CoT        | 0.896  0.518   0.764 | 0.395   0.536   0.730 | 0.684  0.480  0.378   0.732
             CoT-sft    | 0.851  0.808   0.697 | 0.422   0.768   0.790 | 0.694  0.485  0.777   0.883

Table 1: Evaluation results of baseline and different RAG settings, for the three datasets and two models tested. In addition to the main metrics for each dataset, faithfulness and relevancy are reported for the relevant configurations. In bold are the best configurations per dataset, based on the main metrics.
References

Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Qin Cai, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Yen-Chun Chen, Yi-Ling Chen, Parul Chopra, Xiyang Dai, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Victor Fragoso, Dan Iter, Mei Gao, Min Gao, Jianfeng Gao, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, Ce Liu, Mengchen Liu, Weishung Liu, Eric Lin, Zeqi Lin, Chong Luo, Piyush Madan, Matt Mazzola, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini, Xin Wang, Lijuan Wang, Chunyu Wang, Yu Wang, Rachel Ward, Guanhua Wang, Philipp Witte, Haiping Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu, Sonali Yadav, Fan Yang, Jianwei Yang, Ziyi Yang, Yifan Yang, Donghan Yu, Lu Yuan, Chengruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, and Xiren Zhou. 2024. Phi-3 technical report.

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, T. W. Hennigan, Saffron Huang, Lorenzo Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and L. Sifre. 2021. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv preprint. ArXiv:2005.14165 [cs].

Harrison Chase. 2022. LangChain.

Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. 2023. Benchmarking Large Language Models in Retrieval-Augmented Generation. arXiv.

Michiel de Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Joshua Ainslie, Sumit Sanghai, Fei Sha, and William Cohen. 2023. Pre-computed memory or on-the-fly encoding? A hybrid approach to retrieval augmentation makes the most of your compute. arXiv preprint, version 2.

Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. 2024. RAGAs: Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150–158, St. Julians, Malta. Association for Computational Linguistics.

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint. ArXiv:2312.10997 [cs].

Yasuto Hoshi, Daisuke Miyashita, Youyang Ng, Kento Tatsuno, Yasuhiro Morioka, Osamu Torii, and Jun Deguchi. 2023. RaLLe: A Framework for Developing and Evaluating Retrieval-Augmented Large Language Models. arXiv preprint.

Jennifer Hsia, Afreen Shaikh, Zhiruo Wang, and Graham Neubig. 2024. RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems. arXiv preprint. ArXiv:2403.09040 [cs].

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint. ArXiv:2106.09685 [cs].

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2023. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv preprint. ArXiv:2311.05232 [cs].

Jiajie Jin, Yutao Zhu, Xinyu Yang, Chenghao Zhang, and Zhicheng Dou. 2024. FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research.

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, and Xinghua Lu. 2019. PubMedQA: A Dataset for Biomedical Research Question Answering. arXiv preprint. ArXiv:1909.06146 [cs, q-bio].

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. arXiv preprint. ArXiv:1705.03551 [cs].

Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv preprint arXiv:2212.14024.

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2023. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714.

Takeshi Kojima, S. Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large Language Models are Zero-Shot Reasoners. ArXiv.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv preprint.

Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. 2021. Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), pages 2356–2362.

Jerry Liu. 2022. LlamaIndex.

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the middle: How language models use long contexts. Preprint, arXiv:2307.03172.

Zihan Liu, Wei Ping, Rajarshi Roy, Peng Xu, Chankyu Lee, Mohammad Shoeybi, and Bryan Catanzaro. 2024. ChatQA: Surpassing GPT-4 on Conversational QA and RAG. arXiv preprint. ArXiv:2401.10225 [cs].

Alex Troy Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi. 2022. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Annual Meeting of the Association for Computational Linguistics.

Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, and Jianfeng Gao. 2023. Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback. arXiv preprint, version 3.

Malte Pietsch, Timo Möller, Bogdan Kostic, Julian Risch, Massimiliano Pippi, Mayank Jobanputra, Sara Zanzottera, Silvano Cerza, Vladimir Blagojevic, Thomas Stadelmann, Tanay Soni, and Sebastian Lee. 2019. Haystack: the end-to-end NLP framework for pragmatic builders.

David Rau, Hervé Déjean, Nadezhda Chirkova, Thibault Formal, Shuai Wang, Vassilina Nikoulina, and S. Clinchant. 2024. BERGEN: A Benchmarking Library for Retrieval-Augmented Generation.

Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. 2024. ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. arXiv preprint. ArXiv:2311.09476 [cs].

Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. 2022. ASQA: Factoid Questions Meet Long-Form Answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8273–8288, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models. Preprint, arXiv:2302.13971.
A.1 Prompt Templates

The QA prompt template:

You are a helpful question answerer who can provide an answer given a question and relevant context.
Question: {query}
Context: {docs}

The chain-of-thought (CoT) prompt template:

Question: {query}
Context: {docs}
Answer this question using the information given in the context above. Here is things to pay attention to:
- First provide step-by-step reasoning on how to answer the question.
- In the reasoning, if you need to copy paste some sentences from the context, include them in ##begin_quote## and ##end_quote##. This would mean that things outside of ##begin_quote## and ##end_quote## are not directly copy paste from the context.
- End your response with final answer in the form <ANSWER>: $answer, the answer should be succinct.
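The CoT template ends by requiring a final answer in the form <ANSWER>: $answer. In an evaluation configuration, this format would naturally be paired with a regex-based answer processor in the style of listing 4; the sketch below is an assumed pairing that reuses the RegexAnswer class and the capture_pattern/stopping_pattern keys from listing 4, not a configuration taken from the paper.

answer_processor:
  _target_: ragfoundry.processing.RegexAnswer
  # Capture only the text that follows the final-answer marker required by the CoT template.
  capture_pattern: "<ANSWER>: (.*)"
  # Left empty, as in listing 4.
  stopping_pattern: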
A.2 Datasets
Datasets used:
• TriviaQA
• ASQA
• PubmedQA
Context size was k = 5, unless indicated otherwise.
Dataset sizes are:
Dataset     Training   Evaluation
TriviaQA        6000         1000
ASQA            4353          948
PubmedQA       10000          500