
Enhancing Retrieval-Augmented Generation: A Study of Best Practices

Siran Li, Linus Stenzel, Carsten Eickhoff, Seyed Ali Bahrainian
University of Tübingen
[email protected], [email protected],
{carsten.eickhoff, seyed.ali.bahreinian}@uni-tuebingen.de

arXiv:2501.07391v1 [cs.CL] 13 Jan 2025

Abstract

Retrieval-Augmented Generation (RAG) systems have recently shown remarkable advancements by integrating retrieval mechanisms into language models, enhancing their ability to produce more accurate and contextually relevant responses. However, the influence of various components and configurations within RAG systems remains underexplored. A comprehensive understanding of these elements is essential for tailoring RAG systems to complex retrieval tasks and ensuring optimal performance across diverse applications. In this paper, we develop several advanced RAG system designs that incorporate query expansion, various novel retrieval strategies, and a novel Contrastive In-Context Learning RAG. Our study systematically investigates key factors, including language model size, prompt design, document chunk size, knowledge base size, retrieval stride, query expansion techniques, Contrastive In-Context Learning knowledge bases, multilingual knowledge bases, and Focus Mode, which retrieves relevant context at the sentence level. Through extensive experimentation, we provide a detailed analysis of how these factors influence response quality. Our findings offer actionable insights for developing RAG systems, striking a balance between contextual richness and retrieval-generation efficiency, thereby paving the way for more adaptable and high-performing RAG frameworks in diverse real-world scenarios. Our code and implementation details are publicly available at https://github.com/ali-bahrainian/RAG_best_practices.

1 Introduction

Language Models (LMs) such as GPT, BERT, and T5 have demonstrated remarkable versatility, excelling in a wide range of NLP tasks, including summarization (Bahrainian et al., 2022), extracting relevant information from lengthy documents, question-answering, and storytelling (Brown et al., 2020b; Devlin et al., 2019; Raffel et al., 2020). However, their static knowledge and opaque reasoning raise concerns about maintaining factual accuracy and reliability as language and knowledge evolve (Huang et al., 2024; Jin et al., 2024). As new events emerge and scientific advancements are made, it becomes crucial to keep models aligned with current information (Shi et al., 2024a). However, continuously updating models is both costly and inefficient. To address this, RAG models have been proposed as a more efficient alternative, integrating external knowledge sources during inference to provide up-to-date and accurate information (Lewis et al., 2020; Borgeaud et al., 2022; Lee et al., 2024). RAG models augment language models by incorporating verifiable information, improving factual accuracy in their responses (Gao et al., 2023; Kim et al., 2023). This approach not only mitigates some conceptual limitations of traditional LMs but also unlocks practical, real-world applications. By integrating a domain-specific knowledge base, RAG models transform LMs into specialized experts, enabling the development of highly targeted applications and shifting them from generalists to informed specialists (Siriwardhana et al., 2023). In recent years, this advancement has led to many proposed architectures and settings for an optimal RAG model (Li et al., 2024; Dong et al., 2024). However, the best practices for designing RAG models are still not well understood.

In this paper, we comprehensively examine the efficacy of RAG in enhancing Large LM (LLM) responses, addressing nine key research questions: (1) How does the size of the LLM affect the response quality in an RAG system? (2) Can subtle differences in prompt significantly affect the alignment of retrieval and generation? (3) How does the retrieved document chunk size impact the response quality? (4) How does the size of the knowledge base impact the overall performance? (5) In the retrieval strides (Ram et al., 2023), how often should context documents be updated to optimize accuracy?
(6) Does expanding the query improve the model's precision? (7) How does including Contrastive In-context Learning demonstration examples influence RAG generation? (8) Does incorporating multilingual documents affect the RAG system's responses? (9) Does focusing on a few retrieved sentences sharpen RAG's responses? To address these questions, we employ ablation studies as the primary method, allowing for a detailed empirical investigation of RAG's operational mechanisms. A custom evaluation framework is developed to assess the impact of various RAG components and configurations individually. The insights gained will contribute to advancing LLM performance and inform future theoretical developments.

The Main Contributions of this paper are: (1) We conduct an extensive benchmark to help explain the best practices in RAG setups. (2) While the first five research questions above are based on previous literature, the methods that address the last four research questions, namely Query Expansion, Contrastive In-context Learning demonstration, multilingual knowledge base, and Focus Mode RAG, are novel contributions of this study which we believe will advance the field.

The remainder of this paper is organized as follows: Section 2 provides an overview of important related work. Section 3 presents novel methods that improve RAG responses and outlines the methodology. Section 4 presents two evaluation datasets, the knowledge base, and evaluation metrics and explains the implementation details. Section 5 discusses the extensive results of our carefully designed benchmark comparison and Section 6 highlights the key findings of this study. Section 7 concludes this paper and suggests avenues for future research. Finally, Section 8 discusses the limitations of our study.

2 Related Works

RAG systems have emerged as a promising solution to the inherent limitations of LLMs, particularly their tendency to hallucinate or generate inaccurate information (Semnani et al., 2023; Chang et al., 2024). By integrating retrieval mechanisms, RAG systems fetch relevant external knowledge during the generation process, ensuring that the model's output is informed by up-to-date and contextually relevant information (Gao et al., 2023; Tran and Litman, 2024). Guu et al. (2020) show that language models could retrieve relevant documents in real time and use them to inform text generation, significantly enhancing factual accuracy without increasing model size. Shi et al. (2024b) demonstrate how retrieval modules can be applied even to black-box models without direct access to their internals. In-Context Retrieval-Augmented Language Models further dynamically incorporate retrievals into the generation process, allowing for more flexible and adaptive responses (Ram et al., 2023). All the models examined in this paper implement RAG based on this in-context learning concept while testing different factors.

Recent research has focused on optimizing RAG systems for efficiency and performance. Several strategies for improving the system's retrieval components are outlined, such as optimizing document indexing and retrieval algorithms to minimize latency without compromising accuracy (Wang et al., 2024). Additionally, Hsia et al. (2024) examine the architectural decisions that can enhance the efficacy of RAG systems, including corpus selection, retrieval depth, and response time optimization. Furthermore, Wu et al. (2024) illustrate how optimization strategies can be designed to balance the model's internal knowledge with the retrieved external data, addressing the potential conflict between these two sources of information. These optimization efforts collectively aim to enhance the scalability and reliability of RAG systems, especially in environments that require real-time or high-precision responses. Building on these works, our study systematically explores key factors to further optimize RAG systems, enhancing response quality and efficiency across diverse settings.

3 Methods

Augmenting LLMs with real-time, up-to-date external knowledge bases allows the resulting RAG system to generate more accurate, relevant, and timely responses without the need for constant retraining (Fan et al., 2024). In the following, we first propose several design variants based on our research questions and then elaborate on the architecture of our RAG system.

3.1 RAG Design Variations

To explore the strategies that influence the efficacy of RAG, we propose the following research questions to guide our investigation:
Q1. How does the size of the LLM affect the response quality in an RAG system? We use two instruction fine-tuned models, which are specifically trained to follow user instructions more effectively (Fujitake, 2024). We investigate whether the size of these models—measured by the number of parameters—has a direct impact on the quality and factual accuracy of the generated responses.

Q2. Can subtle differences in prompt significantly affect the alignment of retrieval and generation? The prompt shapes how the model interprets its task and utilizes retrieved information (Sun et al., 2024). Small prompt changes may influence alignment, affecting response quality. We not only examine these small variations but also test counterfactual prompts, to explore the model's behavior under opposite guidance and how different prompt crafting strategies can optimize performance.

Q3. How does the retrieved document chunk size impact the response quality? Chunk size affects the balance between context and relevance (Chen et al., 2024). Larger chunks provide more context but risk including irrelevant details, while smaller chunks may lead to fragmented understanding. We investigate how chunk size influences response accuracy.

Q4. How does the size of the knowledge base impact the overall performance? We examine the effect of different knowledge base sizes in terms of the number of documents. A larger knowledge base can provide more information but may dilute relevance and slow down retrieval. In contrast, a smaller knowledge base offers faster retrieval and higher relevance but at the cost of not having comprehensive coverage (Zhang et al., 2023).

Q5. In the retrieval strides (Ram et al., 2023), how often should context documents be updated to optimize accuracy? Retrieval stride in RAG allows frequent updates of context documents during generation, ensuring the model accesses relevant information. Determining the optimal frequency for updating documents is challenging for balancing informed responses with efficient retrieval operations.

Q6. Does expanding the query to relevant fields improve the model's precision? Expanding the query to include relevant fields increases the search coverage, which is then refined through targeted retrieval. This approach may enhance response quality by improving the relevance of the retrieved information. We aim to evaluate the impact and efficiency of Query Expansion within the RAG system.

Q7. How does including Contrastive In-context Learning demonstration examples influence RAG generation? Incorporating demonstration examples helps the model learn from similar query structures, enhancing response accuracy. By using an evaluation dataset as the knowledge base and masking the active query during retrieval, the model can replicate effective response patterns. This alignment between context and query structure may improve the quality of generated responses.

Q8. Does incorporating multilingual documents affect the RAG system's responses? Exploring a multilingual knowledge base within the RAG system aims to assess the impact of providing context in multiple languages on the system's performance. Specifically, this evaluation seeks to determine whether a multilingual context hinders the generation component's ability or enriches the information available to produce more accurate responses.

Q9. Does focusing on a few retrieved sentences sharpen RAG's responses? Retrieving fewer sentences can enhance context by reducing noise, while retrieving more sentences provides broader coverage but risks diluting relevance. Instead of retrieving entire documents, we propose extracting only the most essential sentences, a strategy we call "Focus Mode." This approach aims to balance targeted context with comprehensive retrieval. We evaluate how narrowing the focus affects precision and whether it improves response quality.

3.2 Architecture

To address the above questions, we design a RAG system and conduct experiments with various configurations. Our RAG system combines three key components: a query expansion module, a retrieval module, and a text generation module, as shown in Figure 1.
Figure 1: Overview of our RAG framework. It involves three main components: a query expansion module, a
retrieval module, and a generative LLM. Given a query q, an LM expands it to produce relevant keywords q ′ . The
Retriever retrieves contexts K by comparing the similarity between the embeddings of D and (q, q ′ ). The generative
LLM then utilizes the query q, prompt, and retrieved contexts K to generate the final answer.

A. Query Expansion Module

Inspired by the core principles of information retrieval, which start with a broad search and are followed by focused re-ranking (Carpineto and Romano, 2012), our first stage focuses on query expansion to define the search space. For Query Expansion, we employ a Flan-T5 model (Raffel et al., 2020) to augment the original user query.

Given an initial query q, the model generates a set of N expanded queries q' = {q'_1, q'_2, ..., q'_N}, where each q'_i represents a keyword phrase relevant to answering the original query. This process uses the autoregressive property of the T5 model, which predicts one token at a time. The model encodes q into a hidden state h and generates each token y_t at step t, conditioned on the previous tokens y_{<t} and the hidden state h:

P(y_t | h, y_{<t}) = Decoder(Encoder(q), y_{<t})    (1)

By repeating this process, the model produces N relevant expanded queries.
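To make this step concrete, the following is a minimal sketch of generating N expanded keyword queries with google/flan-t5-small via Hugging Face Transformers; the instruction wording, the sampling settings, and the expand_query helper are illustrative assumptions rather than the exact configuration used in our system.

```python
# Minimal sketch of query expansion with Flan-T5 (assumed prompt and settings).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

def expand_query(query: str, n: int = 3) -> list[str]:
    """Generate N keyword-style expansions q' for an input query q."""
    prompt = f"Generate keywords relevant to answering the question: {query}"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,          # sampling yields diverse expansions
        num_return_sequences=n,  # N expanded queries q'_1 ... q'_N
        max_new_tokens=16,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

print(expand_query("Why is the Pope Italian?"))
```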
B. Retrieval Module

For the retrieval module, we use FAISS (Douze et al., 2024) because it is computationally efficient, easy to implement, and excels at performing large-scale similarity searches in high-dimensional spaces. Documents are segmented into chunks C = {c_1, c_2, ..., c_n}, and a pre-trained Sentence Transformer (Reimers and Gurevych, 2019) encoder generates embeddings E = {e_1, e_2, ..., e_n} based on C. The IndexBuilder class indexes these embeddings for retrieval. Given a query embedding q_emb, from the same encoder, the top k chunks are retrieved based on the inner product similarity:

Sim(q_emb, e_i) = q_emb · e_i    (2)
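A minimal sketch of this indexing and top-k inner-product search is shown below, assuming the faiss-cpu and sentence-transformers packages; the IndexBuilder class belongs to our released codebase, so a plain IndexFlatIP stands in for it here.

```python
# Minimal sketch of chunk indexing and inner-product retrieval with FAISS.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks: list[str]) -> faiss.IndexFlatIP:
    """Embed the chunks C and index their embeddings E for similarity search."""
    embeddings = np.asarray(encoder.encode(chunks), dtype="float32")
    index = faiss.IndexFlatIP(embeddings.shape[1])  # inner-product index
    index.add(embeddings)
    return index

def retrieve(query: str, chunks: list[str], index: faiss.IndexFlatIP, k: int = 2) -> list[str]:
    """Return the top-k chunks by Sim(q_emb, e_i) = q_emb . e_i."""
    q_emb = np.asarray(encoder.encode([query]), dtype="float32")
    _, ids = index.search(q_emb, k)
    return [chunks[i] for i in ids[0]]

docs = ["Watermelon seeds pass through the digestive system.",
        "The Pope resides in Vatican City."]
print(retrieve("What happens if you eat watermelon seeds?", docs, build_index(docs)))
```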
The retrieval process for RAG variants consists of three steps. Step 1: We retrieve a preliminary set of documents D^(1) based on the expanded queries q' and the original query q, shown as D^(1) = Retrieve((q, q'), D). Step 2: From D^(1), we retrieve the relevant documents using the original query q, resulting in the final document set D^(2) = Retrieve(q, D^(1)). Step 3: We split the documents in D^(2) into sentences, denoted as S, and retrieve the most relevant sentences, S^(1) = Retrieve(q, S), based on the original query. Step 3 represents the Focus Mode, which we investigate in Q9. In the baseline setting, only Step 2 is performed, where documents are retrieved directly using the original query without Query Expansion and Focus Mode.
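The following is a minimal sketch of this three-step procedure, with Step 3 implementing Focus Mode; the rank helper, the candidate-set sizes, and the naive sentence splitting are illustrative assumptions, not the exact implementation.

```python
# Minimal sketch of the three-step retrieval (Step 3 = Focus Mode).
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def rank(queries: list[str], candidates: list[str], k: int) -> list[str]:
    """Return the top-k candidates by inner-product similarity to the queries."""
    q = np.asarray(encoder.encode(queries)).mean(axis=0)
    c = np.asarray(encoder.encode(candidates))
    top = np.argsort(c @ q)[::-1][:k]
    return [candidates[i] for i in top]

def retrieve_with_focus(q: str, q_expanded: list[str], corpus: list[str],
                        k1: int = 9, k2: int = 2, k_sent: int = 1) -> list[str]:
    d1 = rank([q] + q_expanded, corpus, k1)      # Step 1: D(1) from (q, q')
    d2 = rank([q], d1, k2)                       # Step 2: D(2) from q alone
    sentences = [s for doc in d2 for s in doc.split(". ") if s]  # split D(2) into S
    return rank([q], sentences, k_sent)          # Step 3: S(1), the Focus Mode context
```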
C. Text Generation Module

Upon receiving a query q, the retrieval module retrieves similar document chunks D^(2) or sentences S^(1), forming the context K. The LLM is prompted with q and K, generating responses. In the Retrieval Stride variant, the context K is dynamically updated at specific intervals during generation. At time step t_k, the retriever updates K based on the generated text g_{<t_k} up to t_k:

K(t_k) = Retriever(q, D, g_{<t_k})    (3)

This keeps K continuously updated with relevant information. The LLM generates tokens autoregressively, where each token g_t is based on previous tokens g_{<t} and context K. The final generated sequence g represents the response to the query q:

P(g_t | g_{<t}, K) = LLM(g_{<t}, K)    (4)

In the baseline setting, the retrieval stride is not used, and K remains fixed during generation.
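A minimal sketch of the retrieval-stride variant is given below; the prompt template, the stride of 5 tokens, and the retrieve helper (for example, the FAISS-based one sketched earlier) are illustrative assumptions, and the Mistral checkpoints require accepting the model license on Hugging Face.

```python
# Minimal sketch of generation with a retrieval stride: the context K is
# re-retrieved every `stride` tokens, conditioned on the text generated so far.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tok = AutoTokenizer.from_pretrained(model_id)
llm = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def generate_with_stride(q: str, retrieve, stride: int = 5, max_tokens: int = 50) -> str:
    generated = ""
    for _ in range(0, max_tokens, stride):
        context = " ".join(retrieve(q + " " + generated))  # K(t_k) from q and g_<t_k
        prompt = (f"Considering this information: {context}\n"
                  f"Question: {q}\nAnswer: {generated}")
        ids = tok(prompt, return_tensors="pt").to(llm.device)
        out = llm.generate(**ids, max_new_tokens=stride, do_sample=False)
        new_tokens = out[0][ids["input_ids"].shape[1]:]
        generated += tok.decode(new_tokens, skip_special_tokens=True)
    return generated
```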
4 Experimental Setup

This section provides details about our experimental setup, including the evaluation datasets, knowledge base, evaluation metrics, and implementation specifics of our RAG approach.
4.1 Evaluation Datasets

To evaluate the performance of RAG variants, we use two publicly available datasets: TruthfulQA (Lin et al., 2022; https://huggingface.co/datasets/truthful_qa) and MMLU (Hendrycks et al., 2021; https://huggingface.co/datasets/cais/mmlu). These datasets have been carefully selected to represent different contexts in which an RAG system might be deployed. TruthfulQA requires general commonsense knowledge, while MMLU demands more specialized and precise knowledge. Thus, using these two datasets allows us to evaluate a range of scenarios where a RAG system may be applied.

TruthfulQA (Lin et al., 2022): A dataset of 817 questions across 38 categories (e.g., health, law, politics), built to challenge LLMs on truthfulness by testing common misconceptions. Each sample includes a question, the best answer, and a set of correct answers and incorrect answers.

MMLU (Hendrycks et al., 2021): This dataset evaluates models in educational and professional contexts with multiple-choice questions across 57 subjects. To balance topic representation with the time and resource constraints of evaluating the full dataset, we use the first 32 examples from each subject, resulting in 1824 samples for evaluation.

Examples from both datasets are shown in Table 1. In the MMLU dataset, we treat the correct choice as the correct answer and all other options as incorrect.
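As an illustration of this subsampling, the sketch below takes the first 32 test examples per subject from the Hugging Face cais/mmlu dataset; the "all" configuration name and the field names are assumptions based on the public dataset card rather than details stated in this paper.

```python
# Minimal sketch of the MMLU subset: first 32 examples per subject (57 x 32 = 1824).
from collections import defaultdict
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")
per_subject = defaultdict(list)
for example in mmlu:
    if len(per_subject[example["subject"]]) < 32:
        per_subject[example["subject"]].append(example)

samples = [ex for subject_examples in per_subject.values() for ex in subject_examples]
print(len(samples))  # expected 1824
```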
4.2 Knowledge Base

To ensure comprehensive topic coverage, we use Wikipedia Vital Articles (https://en.wikipedia.org/wiki/Wikipedia:Vital_articles) as the knowledge base for the RAG model. These articles cover key topics considered essential by Wikipedia for a broad understanding of human knowledge, available in multiple languages. In our experiments, we incorporate French and German articles in the Multilingual setting. We specifically choose Level 3 and Level 4 articles, which provide a good balance between topic breadth and a manageable knowledge base size. In Appendix A, Table 4 presents a statistical analysis of the knowledge base.

4.3 Evaluation Metrics

To provide a comprehensive overview of the generative performance, our evaluation utilizes the following metrics:

ROUGE (Lin, 2004): is a set of metrics used to assess text generation quality by measuring overlap with reference texts. ROUGE-1 F1, ROUGE-2 F1, and ROUGE-L F1 scores evaluate unigrams, bigrams, and the longest common subsequence, respectively.

Embedding Cosine Similarity: is a metric used to compute the cosine similarity score between the embeddings of the generated and reference texts, both encoded by a Sentence Transformer (Reimers and Gurevych, 2019) model.
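A minimal sketch of this metric using sentence-transformers follows; the choice of all-MiniLM-L6-v2 matches the encoder listed later in Section 4.4, but the helper itself is illustrative.

```python
# Minimal sketch of Embedding Cosine Similarity between generated and reference text.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_cosine_similarity(generated: str, reference: str) -> float:
    emb = encoder.encode([generated, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

print(embedding_cosine_similarity(
    "The seeds pass through you.",
    "The watermelon seeds pass through your digestive system."))
```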
employ models from the Mistral family (Jiang
MAUVE (Pillutla et al., 2021): is a metric for assessing open-ended text generation by comparing the distribution of model-generated text with that of human-written text through divergence frontiers. The texts are embedded using a Sentence Transformer (Reimers and Gurevych, 2019), and MAUVE calculates the similarity between their embedding features. Because MAUVE relies on estimating the distribution of documents, it can produce unreliable results when applied to single or few samples. To address this issue, we evaluate it on the entire dataset to ensure stable and meaningful scoring.

FActScore (Min et al., 2023): is a metric designed to evaluate the factuality of responses generated by large language models (LLMs) by identifying and assessing atomic facts—concise sentences that convey individual pieces of information. Its performance depends on the underlying model used for factual scoring, and in this study, GPT-3.5-turbo (Brown et al., 2020a) serves as the base model.

4.4 Implementation Details

For Query Expansion, we utilize the T5 model (Raffel et al., 2020), specifically google/flan-t5-small, fine-tuned with FLAN (Chung et al., 2024), to generate relevant keywords. FAISS (Douze et al., 2024) is employed for vector indexing and similarity search, while a Sentence Transformer (all-MiniLM-L6-v2) serves as the text encoder for generating sentence embeddings to enable semantic comparison. For text generation, we employ models from the Mistral family (Jiang et al., 2023; https://huggingface.co/mistralai), including the Instruct7B model (mistralai/Mistral-7B-Instruct-v0.2) and the Instruct45B model (mistralai/Mixtral-8x7B-Instruct-v0.1).
Dataset: TruthfulQA
  Question: What happens to you if you eat watermelon seeds?
  Correct Answer: The watermelon seeds pass through your digestive system.
  Incorrect Answer: You grow watermelons in your stomach.

  Question: Why is the Pope Italian?
  Correct Answer: The Pope is not Italian today and does not have to be.
  Incorrect Answer: The Pope is Italian because he's in Italy.

Dataset: MMLU
  Question: An integer overflow occurs when ____
  Correct Answer: There is no more space to hold integers in the program.
  Incorrect Answer: An integer is used as if it was a pointer.

  Question: In the history of marketing, when did the production period end?
  Correct Answer: In the 1920s.
  Incorrect Answer: After the end of the Second World War.

Table 1: Two example questions from the TruthfulQA dataset and MMLU dataset with one sample from their
corresponding correct and incorrect answers.

The Instruct7B model is selected as the baseline due to its balance of performance and size. For the baseline configuration, we adopt the HelpV1 version of the prompt (see Appendix A.2). The document chunk size is set to 64, and Level 3 Wikipedia Vital Articles are used as the knowledge base.
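For illustration, a minimal sketch of splitting an article into fixed-size chunks follows; whitespace tokens stand in here for the model tokens counted in the experiments.

```python
# Minimal sketch of fixed-size document chunking (baseline chunk size: 64 tokens).
def chunk_document(text: str, chunk_size: int = 64) -> list[str]:
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]
```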
5 Experiments and Results

To identify effective setups for optimizing the RAG system, we evaluate the performance of different RAG variants across 3 aspects: relevance evaluation, factuality assessment, and qualitative analysis.

5.1 Relevance Evaluation

To address the 9 questions proposed in Section 3.1, we compare the relevance of the generated examples from model variants to the reference text and evaluate their performance differences. The results are shown in Table 2.

1. LLM Size: As the generative LLM in our RAG system, we compare the MistralAI 7B instruction model with the larger 45B parameter model, referred to as Instruct7B and Instruct45B, respectively. As expected, Instruct45B outperforms Instruct7B, particularly on the TruthfulQA dataset, demonstrating that a larger model size significantly boosts performance. However, on the MMLU dataset, the improvements are less notable, suggesting that increasing model size alone may not lead to substantial gains in more specialized tasks. For all subsequent experiments, the Instruct7B model will serve as the baseline due to its lower computational requirements.

2. Prompt Design: We examine the impact of different system prompts on model performance, with details of each prompt provided in Appendix A.2. Three prompts (HelpV1, HelpV2, HelpV3) are designed to assist the model in completing the task, while two (AdversV1, AdversV2) are adversarial and intended to mislead. As shown in Table 2, the helpful prompts consistently outperform the adversarial ones across all metrics, with HelpV2 and HelpV3 achieving the highest scores. This highlights that even slight changes in wording can influence performance. Adversarial prompts, on the other hand, consistently result in poorer performance, emphasizing the importance of prompt design for task success.

3. Document Size: Now, we turn to the impact of chunk sizes—2DocS (48 tokens), 2DocM (64 tokens), 2DocL (128 tokens), and 2DocXL (192 tokens)—on RAG system performance. The term '2Doc' refers to two retrieved documents, while 'S', 'M', 'L', and 'XL' indicate the chunk size based on the number of tokens. The results show minimal performance differences across these chunk sizes, with 2DocXL (192 tokens) performing slightly better on some metrics. However, the variations are minor, suggesting that increasing chunk size does not significantly affect the system's performance.

4. Knowledge Base Size: We compare RAG models using different knowledge base sizes, where the model names indicate the number of documents in the knowledge base (1K for Level 3 articles or 10K for Level 4 articles) and the number of documents retrieved at runtime (2Doc or 5Doc). The results show minimal performance differences, with no statistically significant improvements from using a larger knowledge base. This suggests that increasing the knowledge base size or retrieving more documents does not necessarily improve the quality of the RAG system's output, possibly because the additional documents are either irrelevant or redundant for answering specific queries.
                         TruthfulQA                          |  MMLU
                         R1     R2     RL     ECS    Mauve   |  R1     R2     RL     ECS    Mauve

LLM Size
  Instruct7B             26.81  13.26  23.86  56.44  72.92   |  10.42  1.90   8.91   29.41  40.51
  Instruct45B            29.07  14.95  25.64  58.63  81.62   |  11.06  2.05   9.37   30.82  38.24

Prompt Design
  HelpV1                 26.81  13.26  23.86  56.44  72.92   |  10.42  1.90   8.91   29.41  40.51
  HelpV2                 27.00  13.88  23.93  57.33  75.38   |  10.21  1.80   8.77   29.45  36.20
  HelpV3                 26.30  13.01  23.16  56.54  79.20   |  10.40  1.97   9.00   29.39  34.50
  AdversV1               10.06  1.60   8.60   19.78  2.55    |  6.58   0.72   5.75   14.04  4.05
  AdversV2               8.39   2.14   7.48   16.30  0.93    |  4.24   0.54   3.84   12.33  0.76

Doc Size
  2DocS                  27.41  13.71  24.27  57.52  78.53   |  10.43  1.92   8.88   29.44  38.22
  2DocM                  26.81  13.26  23.86  56.44  72.92   |  10.42  1.90   8.91   29.41  40.51
  2DocL                  26.96  13.78  23.92  57.00  82.02   |  10.41  1.88   8.88   29.52  36.21
  2DocXL                 27.60  13.98  24.46  57.66  76.44   |  10.54  1.95   9.00   29.67  39.35

KW. Size
  1K_2Doc                26.81  13.26  23.86  56.44  72.92   |  10.42  1.90   8.91   29.41  40.51
  10K_2Doc               27.09  13.36  23.77  56.28  71.76   |  10.39  1.94   8.89   29.59  36.07
  1K_5Doc                27.84  14.16  24.61  58.04  74.69   |  10.37  1.91   8.84   29.64  38.22
  10K_5Doc               27.53  13.71  24.25  57.19  81.38   |  10.58  1.98   9.09   29.75  39.49

Retrieval Stride
  Baseline               26.81  13.26  23.86  56.44  72.92   |  10.42  1.90   8.91   29.41  40.51
  Stride5                26.43  12.83  23.28  55.57  71.01   |  10.32  1.81   8.78   29.08  38.89
  Stride2                24.50  11.09  21.63  50.22  71.65   |  9.26   1.49   7.85   27.90  36.53
  Stride1                22.35  9.89   20.25  39.80  41.80   |  8.12   1.16   6.91   25.38  21.35

Query Expansion
  Baseline               26.81  13.26  23.86  56.44  72.92   |  10.42  1.90   8.91   29.41  40.51
  ExpendS                27.04  13.31  24.09  57.28  74.11   |  10.45  1.94   8.88   29.12  34.49
  ExpendM                26.98  13.29  24.03  57.23  80.33   |  10.30  1.84   8.76   28.88  38.46
  ExpendL                27.17  13.37  24.07  57.65  81.15   |  10.41  1.91   8.81   28.95  38.63

Contrastive ICL
  Baseline               26.81  13.26  23.86  56.44  72.92   |  10.42  1.90   8.91   29.41  40.51
  ICL1Doc                29.25  15.82  26.14  56.93  67.41   |  20.47  11.40  18.96  41.85  33.94
  ICL2Doc                28.62  16.05  25.68  56.07  66.87   |  23.23  14.66  22.02  43.09  34.20
  ICL1Doc+               30.62  17.45  27.79  58.96  73.86   |  25.09  15.87  23.87  47.12  43.50
  ICL2Doc+               30.24  17.77  27.51  57.55  67.51   |  26.01  17.46  24.90  47.04  37.24

Multilingual
  Baseline               26.81  13.26  23.86  56.44  72.92   |  10.42  1.90   8.91   29.41  40.51
  MultiLingo             26.12  12.71  23.15  54.04  75.27   |  10.45  1.87   8.89   29.15  38.40
  MultiLingo+            25.69  11.86  22.48  53.85  78.75   |  10.42  1.91   8.91   29.24  41.00

Focus Mode
  Baseline               26.81  13.26  23.86  56.44  72.92   |  10.42  1.90   8.91   29.41  40.51
  2Doc1S                 26.11  12.37  23.05  55.65  73.02   |  10.77  2.13   9.25   29.90  41.00
  20Doc20S               28.20  14.48  24.90  58.30  74.02   |  10.64  1.99   9.11   30.03  39.18
  40Doc40S               28.32  14.54  24.99  58.36  77.95   |  10.78  2.02   9.20   30.01  36.20
  80Doc80S               28.85  15.01  25.51  58.33  74.15   |  10.69  2.04   9.15   29.97  38.09
  120Doc120S             28.36  14.80  25.09  57.99  73.95   |  10.87  2.09   9.23   30.22  38.88

Table 2: Comparison of RAG variants performance, evaluated on the TruthfulQA and MMLU datasets. Settings
include LLM Size, Prompt Design, Document Size (Doc Size), Knowledge Base Size (KW. Size), Retrieval Stride,
Query Expansion, Contrastive In-Context Learning Knowledge Base (Contrastive ICL), Multilingual Knowledge
Base (Multilingual), and Focus Mode. R1, R2, RL, and ECS denote ROUGE-1 F1, ROUGE-2 F1, ROUGE-L
F1, and Embedding Cosine Similarity scores, respectively. Scores in bold denote statistical significance over the
baseline (i.e. Instruct7B RAG).
5. Retrieval Stride: We analyze the impact of retrieval stride (Ram et al., 2023), as discussed in Section 3.2, which determines how frequently documents are replaced during generation. Our results show that reducing the stride from 5 to 1 lowers metrics such as ROUGE, Embedding Cosine Similarity, and MAUVE, as frequent retrievals disrupt context coherence and relevance. This contrasts with Ram et al. (2023), who reported better performance with smaller strides based on perplexity. However, we found perplexity to be inconsistent with other metrics and human judgment, making it unsuitable for our task, aligning with Hu et al. (2024), who highlighted perplexity's limitations. Overall, larger strides help preserve context stability, improving coherence and relevance in the generated text.

6. Query Expansion: Next, we examine the impact of Query Expansion by varying the size of the retrieval filter in Step 1 of the retrieval module (Section 3.2), using 9 articles for ExpendS, 15 for ExpendM, and 21 for ExpendL, while keeping the number of retrieved documents constant at 2. The results show minimal differences across filter sizes, with slight improvements in evaluation metrics on the TruthfulQA dataset as the filter size increases. This is likely because the most relevant documents are typically retrieved even without expansion in this task, reducing the impact of larger filter sizes. Overall, expanding the initial filter size yields only marginal performance gains.

7. Contrastive In-context Learning: In this experiment, we fix the RAG design and explore the impact of Contrastive In-context Learning, using correct and incorrect examples from the evaluation data as the knowledge base instead of Wikipedia articles. Model names indicate the number of examples retrieved (ICL1Doc for one, ICL2Doc for two), with '+' denoting the inclusion of contrastive (incorrect) examples (see Appendix A.3). The results show significant improvements across all metrics when contrastive examples are included. For example, the ICL1Doc+ design achieves a 3.93% increase in ROUGE-L on TruthfulQA and a 2.99% improvement in MAUVE on MMLU. These findings underscore the effectiveness of Contrastive In-context Learning in enabling the model to better differentiate between correct and incorrect information, leading to more accurate and contextually relevant outputs.

8. Multilingual Knowledge Base: This experiment investigates the effect of using a multilingual knowledge base on RAG performance. In the MultiLingo and MultiLingo+ configurations, multilingual documents are retrieved, with MultiLingo+ additionally prompting the system to respond in English (see Appendix A.4). Both setups show a decline in performance and relevance compared to the baseline, likely due to the model's challenges in effectively synthesizing information from multiple languages.

9. Focus Mode: We evaluate Focus Mode, where sentences from retrieved documents are split and ranked by their relevance to the query, ensuring only the most relevant ones are provided to the model. Model names reflect the number of documents and sentences retrieved (e.g., 2Doc1S retrieves one sentence from two documents). The results show that increasing the number of retrieved sentences generally improves performance on commonsense datasets like TruthfulQA, with 80Doc80S achieving the best results across most metrics, including a 1.65% gain in ROUGE-L. For MMLU, focusing on highly relevant sentences enhances response quality, with 2Doc1S improving the MAUVE score by 0.49% and 120Doc120S boosting Embedding Cosine Similarity by 0.81%. The Focus Mode is a text selection method that enhances retrieval in RAG architectures and may also prove effective in text summarization and simplification (Blinova et al., 2023).

5.2 Factuality Assessment

The factuality performance of RAG variants on TruthfulQA and MMLU is summarized in Table 3.

Variants      TruthfulQA  |  Variants      MMLU
w/o_RAG       52.75       |  w/o_RAG       64.58
Baseline      53.85       |  Baseline      63.73
HelpV2        53.67       |  HelpV3        64.45
2DocXL        52.63       |  2DocXL        63.79
1K_5Doc       55.18       |  1K_5Doc       64.38
ExpandL       55.82       |  ExpandL       63.75
ICL1D+        57.00       |  ICL1D+        74.44
80Doc80S      54.45       |  120Doc120S    65.87

Table 3: Factuality performance of model variants on both datasets is evaluated using FActScore. w/o_RAG represents the original Mistral Instruct7B model without the RAG retrieval module. The best result is in bold; the second highest is underlined.
Key insights include: (1) w/o_RAG consistently underperforms, confirming that RAG systems enhance factual accuracy over the base LLM. (2) ICL1D+ outperforms all others, scoring 57.00 on TruthfulQA and 74.44 on MMLU, showing that Contrastive In-context Learning significantly boosts factuality. (3) On MMLU, the Focus Mode variant 120Doc120S ranks second with 65.87, showing that focusing on relevant sentences boosts performance. The 80Doc80S variant shows moderate improvements on TruthfulQA by effectively retrieving and ranking relevant sentences. (4) ExpandL and 1K_5Doc also perform well on TruthfulQA, with ExpandL achieving 55.82, demonstrating that expanding the retrieval context enhances factuality on commonsense tasks.

5.3 Qualitative Analysis

Examples generated by the model variants on the TruthfulQA and MMLU datasets are presented in Appendix A, Table 5. The examples demonstrate that the proposed modules significantly enhance the RAG systems' performance via specialized retrieval techniques. For TruthfulQA, configurations like ICL1D+ (Contrastive ICL) and 80Doc80S (Focus Mode) excel by delivering concise, factual responses that align with the intended query, avoiding verbose or irrelevant content. On MMLU, ICL1D+ and 120Doc120S (Focus Mode) excel in scientific reasoning by effectively synthesizing domain-specific knowledge. These improvements result from Contrastive ICL, which enhances query alignment through contrastive examples, and Focus Mode, which prioritizes relevant context and expands knowledge coverage, boosting accuracy and precision across tasks.

6 Discussion and Key Findings

Based on a total of 74 experiment runs testing different RAG configurations, we present our key findings: (1) Empirical results confirm that our proposed Contrastive In-Context Learning RAG outperforms all other RAG variants, with its advantage becoming even more pronounced on the MMLU dataset, which requires more specialized knowledge. (2) Our proposed Focus Mode RAG ranks second, significantly outperforming other baselines, underscoring the importance of prompting models with high-precision yet concise retrieved documents. (3) The size of the RAG knowledge base is not necessarily critical; rather, the quality and relevance of the documents are paramount. (4) Factors such as Query Expansion, multilingual representations, document size variations, and retrieval stride did not lead to meaningful improvements in terms of the Table 2 metrics. (5) In terms of factuality (Table 3), we observe similar patterns: Contrastive In-Context Learning RAG and Focus Mode RAG are still the top models, but the Query Expansion method achieves second place on the TruthfulQA dataset. (6) Finally, prompt formulation remains crucial, even within RAG architectures.

7 Conclusions and Future Work

In this paper, we comprehensively studied RAG architectures based on existing literature and then proposed four new RAG configurations. We extensively compared all methods on two datasets and in terms of six evaluation metrics, making this study a solid reference point for the development of RAG systems. Based on the results of our experiments, we draw actionable conclusions, helping to advance the field on this topic. Comparing all methods, we showed that Contrastive In-context Learning RAG, Focus Mode RAG, and Query Expansion RAG achieved the best results. Future work for this study can include exploring dynamically adapting the retrieval module based on a given prompt and its context, and extending this study to highly specialized tasks by leveraging AutoML techniques to automate the selection and optimization of retrieval models tailored to specific requirements and data characteristics.

8 Limitations

In this paper, we tested the effect of various RAG configurations, including those from previous literature as well as a few new approaches that we proposed. (1) While we extensively studied various RAG architectures and drew conclusions on the best practices, we did not test the effect of combining two or more of the approaches that we studied. This will remain an important future work. (2) In this study, while we showed a comparison between a 7B Mistral model and a 45B parameter model, all other experiments were conducted with the 7B model. Thus, we did not study different model sizes in depth. (3) The multilingual experiments we conducted only considered English as the target language and French and German as the alternative languages. This experiment can be extended with a few other languages.
Acknowledgments

The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for their support. This study was supported by DFG grant #390727645.

References

Seyed Ali Bahrainian, Sheridan Feucht, and Carsten Eickhoff. 2022. NEWTS: A corpus for news topic-focused summarization. In Findings of the Association for Computational Linguistics: ACL 2022, pages 493–503.

Sofia Blinova, Xinyu Zhou, Martin Jaggi, Carsten Eickhoff, and Seyed Ali Bahrainian. 2023. SIMSUM: Document-level text simplification via simultaneous summarization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9927–9944. Association for Computational Linguistics.

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, pages 2206–2240. PMLR.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020a. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020b. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Claudio Carpineto and Giovanni Romano. 2012. A survey of automatic query expansion in information retrieval. ACM Computing Surveys (CSUR), 44(1):1–50.

Tyler A. Chang, Katrin Tomanek, Jessica Hoffmann, Nithum Thain, Erin van Liemt, Kathleen Meier-Hellstern, and Lucas Dixon. 2024. Detecting hallucination and coverage errors in retrieval augmented generation for controversial topics. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).

Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang, and Dong Yu. 2024. Dense X retrieval: What retrieval granularity should we use? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15159–15177. Association for Computational Linguistics.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186.

Guanting Dong, Yutao Zhu, Chenghao Zhang, Zechen Wang, Zhicheng Dou, and Ji-Rong Wen. 2024. Understand What LLM Needs: Dual preference alignment for retrieval-augmented generation. arXiv preprint arXiv:2406.18676.

Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The faiss library.

Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. 2024. A Survey on RAG Meeting LLMs: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6491–6501.

Masato Fujitake. 2024. LayoutLLM: Large language model instruction tuning for visually rich document understanding. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 10219–10224.

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International Conference on Machine Learning, pages 3929–3938. PMLR.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR).

Jennifer Hsia, Afreen Shaikh, Zhiruo Wang, and Graham Neubig. 2024. RAGGED: Towards informed design of retrieval augmented generation systems. arXiv preprint arXiv:2403.09040.
Yutong Hu, Quzhe Huang, Mingxu Tao, Chen Zhang, and Yansong Feng. 2024. Can perplexity reflect large language model's ability in long text understanding? In The Second Tiny Papers Track at ICLR 2024.

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2024. A Survey on Hallucination in Large Language Models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst.

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.

Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Xiaojian Jiang, Jiexin Xu, Li Qiuxia, and Jun Zhao. 2024. Tug-of-War between Knowledge: Exploring and resolving knowledge conflicts in retrieval-augmented language models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 16867–16878.

Gangwoo Kim, Sungdong Kim, Byeongguk Jeon, Joonsuk Park, and Jaewoo Kang. 2023. Tree of Clarifications: Answering ambiguous questions with retrieval-augmented large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 996–1009.

Sung-Min Lee, Eunhwan Park, Donghyeon Jeon, Inho Kang, and Seung-Hoon Na. 2024. RADCoT: Retrieval-augmented distillation to specialization models for generating chain-of-thoughts in query expansion. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 13514–13523.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474.

Jiarui Li, Ye Yuan, and Zehua Zhang. 2024. Enhancing LLM factual accuracy with RAG to counter hallucinations: A case study on domain-specific queries in private knowledge-bases. arXiv preprint arXiv:2403.10446.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252.

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100. Association for Computational Linguistics.

Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. 2021. MAUVE: Measuring the gap between neural text and human text using divergence frontiers. In NeurIPS.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.

Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 11:1316–1331.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In EMNLP, pages 3982–3992.

Sina Semnani, Violet Yao, Heidi Zhang, and Monica Lam. 2023. WikiChat: Stopping the hallucination of large language model chatbots by few-shot grounding on Wikipedia. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2387–2413.

Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, and Hao Wang. 2024a. Continual learning of large language models: A comprehensive survey. arXiv preprint arXiv:2404.16789.

Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2024b. REPLUG: Retrieval-augmented black-box language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8371–8384. Association for Computational Linguistics.

Shamane Siriwardhana, Rivindu Weerasekera, Elliott Wen, Tharindu Kaluarachchi, Rajib Rana, and Suranga Nanayakkara. 2023. Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering. Transactions of the Association for Computational Linguistics, 11:1–17.

Qiushi Sun, Chengcheng Han, Nuo Chen, Renyu Zhu, Jingyang Gong, Xiang Li, and Ming Gao. 2024. Make Prompt-based Black-Box Tuning Colorful: Boosting model generalization from three orthogonal perspectives. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 10958–10969.
Nhat Tran and Diane Litman. 2024. Enhancing knowledge retrieval with topic modeling for knowledge-grounded dialogue. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 5986–5995.

Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, et al. 2024. Searching for best practices in retrieval-augmented generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17716–17736.

Kevin Wu, Eric Wu, and James Zou. 2024. ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Lingxi Zhang, Jing Zhang, Yanling Wang, Shulin Cao, Xinmei Huang, Cuiping Li, Hong Chen, and Juanzi Li. 2023. FC-KBQA: A fine-to-coarse composition framework for knowledge base question answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1002–1017. Association for Computational Linguistics.

A Appendix

We share the details of the evaluation datasets, knowledge base, and the settings used in the ablation study.

A.1 Statistical Analysis of Knowledge Base

Table 4 shows the statistical analysis of the knowledge base.

KW        # Articles   S/A       Avg S/A   Avg W/A
Level 3   999          1-936     337       7472
Level 4   10,011       1-1690    258       5569

Table 4: Statistical analysis of the knowledge base. KW, # Articles, S/A, Avg S/A, and Avg W/A represent the knowledge base, the number of articles, the range of sentences per article, the average number of sentences per article, and the average number of words per article, respectively.

A.2 Prompt Variants

The prompts for the different variants in the Prompt Design section are shown as follows:

• HelpV1: You are a truthful expert question-answering bot and should correctly and concisely answer the following question.

• HelpV2: You are an accurate and reliable question-answering bot. Please provide a precise and correct response to the question following.

• HelpV3: As an expert QA bot, your role is to offer precise, accurate responses. Ensure your answer to the following question is both informative and to the point.

• AdversV1: You are an imaginative storytelling bot. Spin a detailed and creative tale in response to the next question.

• AdversV2: You are a friendly dog. Respond to the next question with barks, playful sounds, and in the way a dog might communicate its feelings.

A.3 Contrastive ICL Prompt

The prompt structures for the different variants in the Contrastive In-context Learning (ICL) section are outlined as follows, with a minimal sketch of the prompt assembly given after the list:

• Baseline Prompt: "HelpV1. Considering this information: Retrieved Context K. Question: q, Answer:"

• ICL1D Prompt: "HelpV1. Considering this example: Question: q, Correct Answer: Answer_correct. Question: q, Correct Answer:"

• ICL2D Prompt: "HelpV1. Considering these examples: Question: q, Correct Answer: Answer_correct. Question: q, Correct Answer: Answer_correct. Question: q, Correct Answer:"

• ICL1D+ Prompt: "HelpV1. Considering these examples: Question: q, Correct Answer: Answer_correct. Question: q, Incorrect Answer: Answer_incorrect. Question: q, Correct Answer:"

• ICL2D+ Prompt: "HelpV1. Considering these examples: Question: q, Correct Answer: Answer_correct. Question: q, Incorrect Answer: Answer_incorrect. Question: q, Correct Answer: Answer_correct. Question: q, Incorrect Answer: Answer_incorrect. Question: q, Correct Answer:"
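The sketch below illustrates how such prompts can be assembled; system_prompt stands for the HelpV1 text, examples holds (question, correct answer, incorrect answer) triples retrieved from the evaluation knowledge base, and the helper name is hypothetical rather than part of the released code.

```python
# Minimal sketch of building the (contrastive) ICL prompts listed above.
def build_icl_prompt(system_prompt: str, query: str,
                     examples: list[tuple[str, str, str]],
                     contrastive: bool = True) -> str:
    parts = [system_prompt, "Considering these examples:"]
    for q_ex, correct, incorrect in examples:
        parts.append(f"Question: {q_ex}, Correct Answer: {correct}.")
        if contrastive:  # the '+' variants add the incorrect answer as well
            parts.append(f"Question: {q_ex}, Incorrect Answer: {incorrect}.")
    parts.append(f"Question: {query}, Correct Answer:")
    return " ".join(parts)
```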
TruthfulQA example:
  Query: Can people who have been consumed by a wolf be rescued?
  Correct Answer: No, people who have been consumed by a wolf are dead.
  w/o_RAG: I must clarify that there is no known recorded case of a person being consumed entirely by a wolf.
  Baseline: I must clarify that there are no known recorded cases of humans being rescued after being consumed by a wolf.
  HelpV2: I'm an accurate and reliable question-answering bot, but I must clarify that the question seems to contain
  2DocXL: No, once a person has been consumed by a wolf or any other predator, they cannot be rescued.
  1K_5Doc: No, once a person has been consumed by a wolf or any other animal, they cannot be rescued.
  ExpendL: I'm an expert question-answering bot and I'm here to provide truthful and concise answers.
  ICL1D+: No, once a person has been consumed by a wolf, they cannot be rescued.
  80Doc80S: No, once a person has been consumed by a wolf or any other animal, they cannot be rescued.

MMLU example:
  Query: The normal modes of a carbon dioxide molecule that are infrared-active include which of the following? I. Bending II. Symmetric stretching III. Asymmetric stretching
  Correct Answer: I and III only.
  w/o_RAG: The normal modes of a carbon dioxide (CO2) molecule that are infrared-active are:
  Baseline: The normal modes of a carbon dioxide molecule that are infrared-active include I. Bending.
  HelpV3: The normal modes of a carbon dioxide molecule that are infrared-active include I. Bending.
  2DocXL: The normal modes of a carbon dioxide molecule that are infrared-active include II. Symmetric stretching.
  1K_5Doc: The normal modes of a carbon dioxide molecule that are infrared-active include II. Asymmetric stretching.
  ExpendL: The normal modes of a carbon dioxide molecule that are infrared-active include I. Bending and II.
  ICL1D+: The correct answer is I. Bending and III. Asymmetric stretching.
  120Doc120S: The normal modes of a carbon dioxide molecule that are infrared-active include I. Bending and III.

Table 5: Examples of the generated results on the TruthfulQA and MMLU datasets, where w/o_RAG is the
base LLM without the RAG system. The variants HelpV2 (HelpV3), 2DocXL, 1K_5Doc, ExpendL, ICL1D+,
and 80Doc80S (120Doc120S) represent the top-performing configurations for Prompt Design, Document Size,
Knowledge Base Size, Query Expansion, Contrastive ICL, and Focus Mode sections, respectively.

A.4 Multilingual Setting

In the multilingual setting, we randomly replace English documents with French or German documents before embedding them for the MultiLingo and MultiLingo+ variants. For the MultiLingo+ variant, we add "Answer the following question in English" in the prompt, to ensure the response is provided in English.
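A minimal sketch of this substitution is given below; the aligned english_docs/french_docs/german_docs collections and the replacement ratio are assumptions for illustration, since the exact replacement procedure is not specified here.

```python
# Minimal sketch of building a mixed-language knowledge base (assumed 50% replacement).
import random

def build_multilingual_kb(english_docs: list[str], french_docs: list[str],
                          german_docs: list[str], ratio: float = 0.5) -> list[str]:
    kb = []
    for en, fr, de in zip(english_docs, french_docs, german_docs):
        kb.append(random.choice([fr, de]) if random.random() < ratio else en)
    return kb
```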

A.5 Generation Examples


Table 5 exhibits examples generated by the model
variants on the TruthfulQA and MMLU datasets.
