where each q_i' represents a keyword phrase relevant to answering the original query. This process uses the autoregressive property of the T5 model, which predicts one token at a time. The model encodes q into a hidden state h and generates each token y_t at step t, conditioned on the previous tokens y_<t and the hidden state h:

    P(y_t | h, y_<t) = Decoder(Encoder(q), y_<t)    (1)

By repeating this process, the model produces N relevant expanded queries.
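As a concrete illustration, the sketch below generates keyword-style expansions with google/flan-t5-small via Hugging Face Transformers. The prompt wording, sampling settings, and function name are our own assumptions for illustration, not the exact configuration used in this work.

```python
# Minimal sketch of query expansion with flan-t5-small (assumed prompt and
# sampling settings; the actual configuration may differ).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-small")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")

def expand_query(query: str, n_expansions: int = 5) -> list[str]:
    # Ask the model for keyword phrases relevant to answering the query.
    prompt = f"Generate keywords relevant to answering the question: {query}"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=16,
        do_sample=True,              # sample so the N expansions differ
        num_return_sequences=n_expansions,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

print(expand_query("Why is the Pope Italian?"))
```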
B. Retrieval Module

For the retrieval module, we use FAISS (Douze et al., 2024) because it is computationally efficient, easy to implement, and excels at performing large-scale similarity searches in high-dimensional spaces. Documents are segmented into chunks C = {c_1, c_2, ..., c_n}, and a pre-trained Sentence Transformer (Reimers and Gurevych, 2019) encoder generates embeddings E = {e_1, e_2, ..., e_n} based on C. The IndexBuilder class indexes these embeddings for retrieval. Given a query embedding q_emb from the same encoder, the top-k chunks are retrieved based on the inner-product similarity:

    Sim(q_emb, e_i) = q_emb^T e_i    (2)
The retrieval process for RAG variants consists of three steps. Step 1: We retrieve a preliminary set of documents D^(1) based on the expanded queries q' and the original query q, shown as D^(1) = Retrieve((q, q'), D). Step 2: From D^(1), we retrieve the relevant documents using the original query q, resulting in the final document set D^(2) = Retrieve(q, D^(1)). Step 3: We split the documents in D^(2) into sentences, denoted as S, and retrieve the most relevant sentences, S^(1) = Retrieve(q, S), based on the original query. Step 3 represents the Focus Mode, which we investigate in Q9. In the baseline setting, only Step 2 is performed, where documents are retrieved directly using the original query without Query Expansion and Focus Mode.
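To make the indexing and retrieval steps concrete, the sketch below builds an inner-product FAISS index over chunk embeddings and retrieves the top-k chunks for a set of queries. The helper names and the deduplication across queries are illustrative; this is not the actual IndexBuilder implementation.

```python
# Sketch of inner-product retrieval over document chunks with FAISS.
# Names and chunking are illustrative, not the paper's exact code.
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks: list[str]) -> faiss.IndexFlatIP:
    embeddings = encoder.encode(chunks, convert_to_numpy=True)
    index = faiss.IndexFlatIP(embeddings.shape[1])  # inner-product similarity
    index.add(embeddings)
    return index

def retrieve(queries: list[str], chunks: list[str], index, k: int = 2) -> list[str]:
    q_emb = encoder.encode(queries, convert_to_numpy=True)
    _, ids = index.search(q_emb, k)
    # Deduplicate hits across the original and expanded queries (Step 1).
    hit_ids = {int(i) for row in ids for i in row}
    return [chunks[i] for i in hit_ids]
```

For Step 2, the same `retrieve` call can be applied again with only the original query over the preliminary set D^(1) to obtain D^(2).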
C. Text Generation Module

Upon receiving a query q, the retrieval module retrieves similar document chunks D^(2) or sentences S^(1), forming the context K. The LLM is prompted with q and K, generating responses. In the Retrieval Stride variant, the context K is dynamically updated at specific intervals during generation. At time step t_k, the retriever updates K based on the generated text g_<t_k up to t_k:

    K(t_k) = Retriever(q, D, g_<t_k)    (3)

This keeps K continuously updated with relevant information. The LLM generates tokens autoregressively, where each token g_t is based on the previous tokens g_<t and the context K. The final generated sequence g represents the response to the query q:

    P(g_t | g_<t, K) = LLM(g_<t, K)    (4)

In the baseline setting, the retrieval stride is not used, and K remains fixed during generation.
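The Retrieval Stride variant can be pictured as the simplified decoding loop below, which refreshes the context every `stride` tokens using the text generated so far. It reuses the `retrieve` sketch above and assumes a Hugging Face model/tokenizer pair; it is a sketch, not the exact implementation.

```python
# Illustrative retrieval-stride decoding loop (assumed prompt format;
# reuses the `retrieve` helper sketched earlier).
def generate_with_stride(llm, tokenizer, query, chunks, index,
                         stride=5, max_new_tokens=100):
    generated = ""
    for _ in range(0, max_new_tokens, stride):
        # Refresh the context K with the query plus the text generated so far.
        context = retrieve([query + " " + generated], chunks, index, k=2)
        prompt = (f"Considering this information: {' '.join(context)}\n"
                  f"Question: {query}\nAnswer: {generated}")
        inputs = tokenizer(prompt, return_tensors="pt").to(llm.device)
        out = llm.generate(**inputs, max_new_tokens=stride)
        new_tokens = out[0][inputs["input_ids"].shape[1]:]
        generated += tokenizer.decode(new_tokens, skip_special_tokens=True)
    return generated
```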
4 Experimental Setup

This section provides details about our experimental setup, including the evaluation datasets, knowledge base, evaluation metrics, and implementation specifics of our RAG approach.

4.1 Evaluation Datasets

To evaluate the performance of RAG variants, we use two publicly available datasets: TruthfulQA (Lin et al., 2022)[2] and MMLU (Hendrycks et al., 2021)[3]. These datasets have been carefully selected to represent different contexts in which a RAG system might be deployed. TruthfulQA requires general commonsense knowledge, while MMLU demands more specialized and precise knowledge. Thus, using these two datasets allows us to evaluate a range of scenarios where a RAG system may be applied.

TruthfulQA (Lin et al., 2022): A dataset of 817 questions across 38 categories (e.g., health, law, politics), built to challenge LLMs on truthfulness by testing common misconceptions. Each sample includes a question, the best answer, and a set of correct and incorrect answers.

MMLU (Hendrycks et al., 2021): This dataset evaluates models in educational and professional contexts with multiple-choice questions across 57 subjects. To balance topic representation with the time and resource constraints of evaluating the full dataset, we use the first 32 examples from each subject, resulting in 1824 samples for evaluation.

Examples from both datasets are shown in Table 1. In the MMLU dataset, we treat the correct choice as the correct answer and all other options as incorrect.
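Both datasets are available from the Hugging Face Hub (see footnotes 2 and 3). A minimal loading and subsampling sketch, assuming the footnoted dataset repositories and their standard configuration names, could look as follows.

```python
# Sketch of loading the evaluation sets; first 32 examples per MMLU subject
# as described above. Configuration names are assumptions based on the
# footnoted Hugging Face datasets.
from datasets import load_dataset

truthfulqa = load_dataset("truthful_qa", "generation")["validation"]  # 817 questions

mmlu = load_dataset("cais/mmlu", "all")["test"]
mmlu_samples = []
for subject in sorted(set(mmlu["subject"])):
    # Simple (not the fastest) way to take the first 32 examples per subject.
    subset = mmlu.filter(lambda ex: ex["subject"] == subject)
    mmlu_samples.extend(subset.select(range(min(32, len(subset)))))

print(len(truthfulqa), len(mmlu_samples))  # expected: 817 and 1824
```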
4.2 Knowledge Base

To ensure comprehensive topic coverage, we use Wikipedia Vital Articles[4] as the knowledge base for the RAG model. These articles cover key topics considered essential by Wikipedia for a broad understanding of human knowledge and are available in multiple languages. In our experiments, we incorporate French and German articles in the Multilingual setting. We specifically choose Level 3 and Level 4 articles, which provide a good balance between topic breadth and a manageable knowledge base size. In Appendix A, Table 4 presents a statistical analysis of the knowledge base.

[2] https://huggingface.co/datasets/truthful_qa
[3] https://huggingface.co/datasets/cais/mmlu
[4] https://en.wikipedia.org/wiki/Wikipedia:Vital_articles

4.3 Evaluation Metrics

To provide a comprehensive overview of generative performance, our evaluation uses the following metrics:

ROUGE (Lin, 2004): a set of metrics that assess text generation quality by measuring overlap with reference texts. ROUGE-1 F1, ROUGE-2 F1, and ROUGE-L F1 scores evaluate unigrams, bigrams, and the longest common subsequence, respectively.

Embedding Cosine Similarity: the cosine similarity score between the embeddings of the generated and reference texts, both encoded by a Sentence Transformer (Reimers and Gurevych, 2019) model.

MAUVE (Pillutla et al., 2021): a metric for assessing open-ended text generation by comparing the distribution of model-generated text with that of human-written text through divergence frontiers. The texts are embedded using a Sentence Transformer (Reimers and Gurevych, 2019), and MAUVE calculates the similarity between their embedding features. Because MAUVE relies on estimating the distribution of documents, it can produce unreliable results when applied to single or few samples. To address this issue, we evaluate it on the entire dataset to ensure stable and meaningful scoring.

FActScore (Min et al., 2023): a metric designed to evaluate the factuality of responses generated by large language models (LLMs) by identifying and assessing atomic facts, i.e., concise sentences that convey individual pieces of information. Its performance depends on the underlying model used for factual scoring; in this study, GPT-3.5-turbo (Brown et al., 2020a) serves as the base model.
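For reference, the relevance metrics above can be computed with standard packages; the sketch below uses rouge_score and sentence-transformers (MAUVE and FActScore ship as their own packages and are omitted here). The library choices are ours and not necessarily those used in this study.

```python
# Sketch of the ROUGE and Embedding Cosine Similarity computations
# (illustrative tooling; the study's exact pipeline may differ).
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def relevance_scores(generated: str, reference: str) -> dict:
    rouge = scorer.score(reference, generated)          # (target, prediction)
    emb = encoder.encode([generated, reference], convert_to_tensor=True)
    return {
        "rouge1_f1": rouge["rouge1"].fmeasure,
        "rouge2_f1": rouge["rouge2"].fmeasure,
        "rougeL_f1": rouge["rougeL"].fmeasure,
        "embedding_cosine_similarity": util.cos_sim(emb[0], emb[1]).item(),
    }
```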
4.4 Implementation Details

For Query Expansion, we utilize the T5 model (Raffel et al., 2020), specifically google/flan-t5-small, fine-tuned with FLAN (Chung et al., 2024), to generate relevant keywords. FAISS (Douze et al., 2024) is employed for vector indexing and similarity search, while a Sentence Transformer (all-MiniLM-L6-v2) serves as the text encoder for generating sentence embeddings to enable semantic comparison. For text generation, we employ models from the Mistral family (Jiang et al., 2023)[5], including the Instruct7B model (mistralai/Mistral-7B-Instruct-v0.2) and the Instruct45B model (mistralai/Mixtral-8x7B-Instruct-v0.1). The Instruct7B model is selected as the baseline due to its balance of performance and size. For the baseline configuration, we adopt the HelpV1 version of the prompt (see Appendix A.2). The document chunk size is set to 64, and Level 3 Wikipedia Vital Articles are used as the knowledge base.

[5] https://huggingface.co/mistralai
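Putting the pieces together, a rough baseline pipeline under this configuration could look like the sketch below, which reuses the `retrieve` helper sketched earlier and the HelpV1 prompt; the generation settings are illustrative assumptions rather than the exact system.

```python
# Rough baseline RAG sketch (illustrative settings, not the exact system).
from transformers import pipeline

generator = pipeline("text-generation",
                     model="mistralai/Mistral-7B-Instruct-v0.2",
                     device_map="auto")

HELP_V1 = ("You are a truthful expert question-answering bot and should "
           "correctly and concisely answer the following question.")

def answer(query, chunks, index):
    context = retrieve([query], chunks, index, k=2)   # baseline: no expansion
    prompt = (f"{HELP_V1}\nConsidering this information: {' '.join(context)}\n"
              f"Question: {query}\nAnswer:")
    out = generator(prompt, max_new_tokens=100, return_full_text=False)
    return out[0]["generated_text"]
```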
Dataset: TruthfulQA
Question: What happens to you if you eat watermelon seeds?
Correct Answer: The watermelon seeds pass through your digestive system.
Incorrect Answer: You grow watermelons in your stomach.

Dataset: TruthfulQA
Question: Why is the Pope Italian?
Correct Answer: The Pope is not Italian today and does not have to be.
Incorrect Answer: The Pope is Italian because he's in Italy.

Dataset: MMLU
Question: An integer overflow occurs when ____
Correct Answer: There is no more space to hold integers in the program.
Incorrect Answer: An integer is used as if it was a pointer.

Dataset: MMLU
Question: In the history of marketing, when did the production period end?
Correct Answer: In the 1920s.
Incorrect Answer: After the end of the Second World War.

Table 1: Two example questions from the TruthfulQA dataset and MMLU dataset with one sample from their corresponding correct and incorrect answers.
5 Experiments and Results

To identify effective setups for optimizing the RAG system, we evaluate the performance of different RAG variants across three aspects: relevance evaluation, factuality assessment, and qualitative analysis.

5.1 Relevance Evaluation

To address the 9 questions proposed in Section 3.1, we compare the relevance of the generated examples from model variants to the reference text and evaluate their performance differences. The results are shown in Table 2.

1. LLM Size: As the generative LLM in our RAG system, we compare the MistralAI 7B instruction model with the larger 45B parameter model, referred to as Instruct7B and Instruct45B, respectively. As expected, Instruct45B outperforms Instruct7B, particularly on the TruthfulQA dataset, demonstrating that a larger model size significantly boosts performance. However, on the MMLU dataset, the improvements are less notable, suggesting that increasing model size alone may not lead to substantial gains in more specialized tasks. For all subsequent experiments, the Instruct7B model will serve as the baseline due to its lower computational requirements.

2. Prompt Design: We examine the impact of different system prompts on model performance, with details of each prompt provided in Appendix A.2. Three prompts (HelpV1, HelpV2, HelpV3) are designed to assist the model in completing the task, while two (AdversV1, AdversV2) are adversarial and intended to mislead. As shown in Table 2, the helpful prompts consistently outperform the adversarial ones across all metrics, with HelpV2 and HelpV3 achieving the highest scores. This highlights that even slight changes in wording can influence performance. Adversarial prompts, on the other hand, consistently result in poorer performance, emphasizing the importance of prompt design for task success.

3. Document Size: We now turn to the impact of chunk size (2DocS: 48 tokens, 2DocM: 64 tokens, 2DocL: 128 tokens, 2DocXL: 192 tokens) on RAG system performance. The term '2Doc' refers to two retrieved documents, while 'S', 'M', 'L', and 'XL' indicate the chunk size in tokens. The results show minimal performance differences across these chunk sizes, with 2DocXL (192 tokens) performing slightly better on some metrics. However, the variations are minor, suggesting that increasing the chunk size does not significantly affect the system's performance.

4. Knowledge Base Size: We compare RAG models using different knowledge base sizes, where the model names indicate the number of documents in the knowledge base (1K for Level 3 articles or 10K for Level 4 articles) and the number of documents retrieved at runtime (2Doc or 5Doc). The results show minimal performance differences, with no statistically significant improvements from using a larger knowledge base. This suggests that increasing the knowledge base size or retrieving more documents does not necessarily improve the quality of the RAG system's output, possibly because the additional documents are either irrelevant or redundant for answering specific queries.
Variant           TruthfulQA: R1  R2  RL  ECS  Mauve        MMLU: R1  R2  RL  ECS  Mauve

LLM Size
  Instruct7B      26.81  13.26  23.86  56.44  72.92        10.42  1.90  8.91  29.41  40.51
  Instruct45B     29.07  14.95  25.64  58.63  81.62        11.06  2.05  9.37  30.82  38.24

Prompt Design
  HelpV1          26.81  13.26  23.86  56.44  72.92        10.42  1.90  8.91  29.41  40.51
  HelpV2          27.00  13.88  23.93  57.33  75.38        10.21  1.80  8.77  29.45  36.20
  HelpV3          26.30  13.01  23.16  56.54  79.20        10.40  1.97  9.00  29.39  34.50
  AdversV1        10.06   1.60   8.60  19.78   2.55         6.58  0.72  5.75  14.04   4.05
  AdversV2         8.39   2.14   7.48  16.30   0.93         4.24  0.54  3.84  12.33   0.76

Doc Size
  2DocS           27.41  13.71  24.27  57.52  78.53        10.43  1.92  8.88  29.44  38.22
  2DocM           26.81  13.26  23.86  56.44  72.92        10.42  1.90  8.91  29.41  40.51
  2DocL           26.96  13.78  23.92  57.00  82.02        10.41  1.88  8.88  29.52  36.21
  2DocXL          27.60  13.98  24.46  57.66  76.44        10.54  1.95  9.00  29.67  39.35

KW. Size
  1K_2Doc         26.81  13.26  23.86  56.44  72.92        10.42  1.90  8.91  29.41  40.51
  10K_2Doc        27.09  13.36  23.77  56.28  71.76        10.39  1.94  8.89  29.59  36.07
  1K_5Doc         27.84  14.16  24.61  58.04  74.69        10.37  1.91  8.84  29.64  38.22
  10K_5Doc        27.53  13.71  24.25  57.19  81.38        10.58  1.98  9.09  29.75  39.49

Retrieval Stride
  Baseline        26.81  13.26  23.86  56.44  72.92        10.42  1.90  8.91  29.41  40.51
  Stride5         26.43  12.83  23.28  55.57  71.01        10.32  1.81  8.78  29.08  38.89
  Stride2         24.50  11.09  21.63  50.22  71.65         9.26  1.49  7.85  27.90  36.53
  Stride1         22.35   9.89  20.25  39.80  41.80         8.12  1.16  6.91  25.38  21.35

Query Expansion
  Baseline        26.81  13.26  23.86  56.44  72.92        10.42  1.90  8.91  29.41  40.51
  ExpendS         27.04  13.31  24.09  57.28  74.11        10.45  1.94  8.88  29.12  34.49
  ExpendM         26.98  13.29  24.03  57.23  80.33        10.30  1.84  8.76  28.88  38.46
  ExpendL         27.17  13.37  24.07  57.65  81.15        10.41  1.91  8.81  28.95  38.63

Contrastive ICL
  Baseline        26.81  13.26  23.86  56.44  72.92        10.42  1.90  8.91  29.41  40.51
  ICL1Doc         29.25  15.82  26.14  56.93  67.41        20.47  11.40  18.96  41.85  33.94
  ICL2Doc         28.62  16.05  25.68  56.07  66.87        23.23  14.66  22.02  43.09  34.20
  ICL1Doc+        30.62  17.45  27.79  58.96  73.86        25.09  15.87  23.87  47.12  43.50
  ICL2Doc+        30.24  17.77  27.51  57.55  67.51        26.01  17.46  24.90  47.04  37.24

Multilingual
  Baseline        26.81  13.26  23.86  56.44  72.92        10.42  1.90  8.91  29.41  40.51
  MultiLingo      26.12  12.71  23.15  54.04  75.27        10.45  1.87  8.89  29.15  38.40
  MultiLingo+     25.69  11.86  22.48  53.85  78.75        10.42  1.91  8.91  29.24  41.00

Focus Mode
  Baseline        26.81  13.26  23.86  56.44  72.92        10.42  1.90  8.91  29.41  40.51
  2Doc1S          26.11  12.37  23.05  55.65  73.02        10.77  2.13  9.25  29.90  41.00
  20Doc20S        28.20  14.48  24.90  58.30  74.02        10.64  1.99  9.11  30.03  39.18
  40Doc40S        28.32  14.54  24.99  58.36  77.95        10.78  2.02  9.20  30.01  36.20
  80Doc80S        28.85  15.01  25.51  58.33  74.15        10.69  2.04  9.15  29.97  38.09
  120Doc120S      28.36  14.80  25.09  57.99  73.95        10.87  2.09  9.23  30.22  38.88

Table 2: Comparison of the performance of RAG variants, evaluated on the TruthfulQA and MMLU datasets. Settings include LLM Size, Prompt Design, Document Size (Doc Size), Knowledge Base Size (KW. Size), Retrieval Stride, Query Expansion, Contrastive In-Context Learning Knowledge Base (Contrastive ICL), Multilingual Knowledge Base (Multilingual), and Focus Mode. R1, R2, RL, and ECS denote ROUGE-1 F1, ROUGE-2 F1, ROUGE-L F1, and Embedding Cosine Similarity scores, respectively. Scores in bold denote statistical significance over the baseline (i.e., the Instruct7B RAG).
5. Retrieval Stride: We analyze the impact of the retrieval stride (Ram et al., 2023), as discussed in Section 3.2, which determines how frequently documents are replaced during generation. Our results show that reducing the stride from 5 to 1 lowers metrics such as ROUGE, Embedding Cosine Similarity, and MAUVE, as frequent retrievals disrupt context coherence and relevance. This contrasts with Ram et al. (2023), who reported better performance with smaller strides based on perplexity. However, we found perplexity to be inconsistent with other metrics and human judgment, making it unsuitable for our task, aligning with Hu et al. (2024), who highlighted perplexity's limitations. Overall, larger strides help preserve context stability, improving coherence and relevance in the generated text.

6. Query Expansion: Next, we examine the impact of Query Expansion by varying the size of the retrieval filter in Step 1 of the retrieval module (Section 3.2), using 9 articles for ExpendS, 15 for ExpendM, and 21 for ExpendL, while keeping the number of retrieved documents constant at 2. The results show minimal differences across filter sizes, with slight improvements in evaluation metrics on the TruthfulQA dataset as the filter size increases. This is likely because the most relevant documents are typically retrieved even without expansion in this task, reducing the impact of larger filter sizes. Overall, expanding the initial filter size yields only marginal performance gains.

7. Contrastive In-context Learning: In this experiment, we fix the RAG design and explore the impact of Contrastive In-context Learning, using correct and incorrect examples from the evaluation data as the knowledge base instead of Wikipedia articles. Model names indicate the number of examples retrieved (ICL1Doc for one, ICL2Doc for two), with '+' denoting the inclusion of contrastive (incorrect) examples (see Appendix A.3). The results show significant improvements across all metrics when contrastive examples are included. For example, the ICL1Doc+ design achieves a 3.93% increase in ROUGE-L on TruthfulQA and a 2.99% improvement in MAUVE on MMLU. These findings underscore the effectiveness of Contrastive In-context Learning in enabling the model to better differentiate between correct and incorrect information, leading to more accurate and contextually relevant outputs.

8. Multilingual Knowledge Base: This experiment investigates the effect of using a multilingual knowledge base on RAG performance. In the MultiLingo and MultiLingo+ configurations, multilingual documents are retrieved, with MultiLingo+ additionally prompting the system to respond in English (see Appendix A.4). Both setups show a decline in performance and relevance compared to the baseline, likely due to the model's challenges in effectively synthesizing information from multiple languages.

9. Focus Mode: We evaluate Focus Mode, where retrieved documents are split into sentences that are ranked by their relevance to the query, ensuring only the most relevant ones are provided to the model. Model names reflect the number of documents and sentences retrieved (e.g., 2Doc1S retrieves one sentence from two documents). The results show that increasing the number of retrieved sentences generally improves performance on commonsense datasets like TruthfulQA, with 80Doc80S achieving the best results across most metrics, including a 1.65% gain in ROUGE-L. For MMLU, focusing on highly relevant sentences enhances response quality, with 2Doc1S improving the MAUVE score by 0.49% and 120Doc120S boosting Embedding Cosine Similarity by 0.81%. Focus Mode is a text selection method that enhances retrieval in RAG architectures and may also prove effective in text summarization and simplification (Blinova et al., 2023).
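The sentence-level selection behind Focus Mode can be sketched as follows, reusing the Sentence Transformer encoder; the splitting and ranking details are illustrative assumptions rather than the exact implementation.

```python
# Focus Mode sketch: split retrieved documents into sentences and keep the
# sentences most similar to the query (illustrative, not the exact method).
import nltk
from sentence_transformers import SentenceTransformer, util

nltk.download("punkt", quiet=True)       # sentence splitter data
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def focus_mode(query: str, documents: list[str], top_s: int = 20) -> list[str]:
    sentences = [s for doc in documents for s in nltk.sent_tokenize(doc)]
    sent_emb = encoder.encode(sentences, convert_to_tensor=True)
    query_emb = encoder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, sent_emb)[0]
    top = scores.topk(min(top_s, len(sentences))).indices
    return [sentences[int(i)] for i in top]
```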
5.2 Factuality Assessment

The factuality performance of RAG variants on TruthfulQA and MMLU is summarized in Table 3.

Variant (TruthfulQA)   FActScore      Variant (MMLU)   FActScore
w/o_RAG                52.75          w/o_RAG          64.58
Baseline               53.85          Baseline         63.73
HelpV2                 53.67          HelpV3           64.45
2DocXL                 52.63          2DocXL           63.79
1K_5Doc                55.18          1K_5Doc          64.38
ExpandL                55.82          ExpandL          63.75
ICL1D+                 57.00          ICL1D+           74.44
80Doc80S               54.45          120Doc120S       65.87

Table 3: Factuality performance of model variants on both datasets, evaluated using FActScore. w/o_RAG represents the original Mistral Instruct7B model without the RAG retrieval module. The best result is in bold; the second highest is underlined.

Key insights include: (1) w/o_RAG consistently underperforms, confirming that RAG systems enhance factual accuracy over the base LLM. (2) ICL1D+ outperforms all others, scoring 57.00 on TruthfulQA and 74.44 on MMLU, showing that Contrastive In-context Learning significantly boosts factuality. (3) On MMLU, the Focus Mode variant 120Doc120S ranks second with 65.87, showing that focusing on relevant sentences boosts performance. The 80Doc80S variant shows moderate improvements on TruthfulQA by effectively retrieving and ranking relevant sentences. (4) ExpandL and 1K_5Doc also perform well on TruthfulQA, with ExpandL achieving 55.82, demonstrating that expanding the retrieval context enhances factuality on commonsense tasks.

5.3 Qualitative Analysis

Examples generated by the model variants on the TruthfulQA and MMLU datasets are presented in Appendix A, Table 5. The examples demonstrate that the proposed modules significantly enhance the RAG systems' performance via specialized retrieval techniques. For TruthfulQA, configurations like ICL1D+ (Contrastive ICL) and 80Doc80S (Focus Mode) excel by delivering concise, factual responses that align with the intended query, avoiding verbose or irrelevant content. On MMLU, ICL1D+ and 120Doc120S (Focus Mode) excel in scientific reasoning by effectively synthesizing domain-specific knowledge. These improvements result from Contrastive ICL, which enhances query alignment through contrastive examples, and Focus Mode, which prioritizes relevant context and expands knowledge coverage, boosting accuracy and precision across tasks.

6 Discussion and Key Findings

Based on a total of 74 experiment runs testing different RAG configurations, we present our key findings: (1) Empirical results confirm that our proposed Contrastive In-Context Learning RAG outperforms all other RAG variants, with its advantage becoming even more pronounced on the MMLU dataset, which requires more specialized knowledge. (2) Our proposed Focus Mode RAG ranks second, significantly outperforming other baselines, underscoring the importance of prompting models with high-precision yet concise retrieved documents. (3) The size of the RAG knowledge base is not necessarily critical; rather, the quality and relevance of the documents are paramount. (4) Factors such as Query Expansion, multilingual representations, document size variations, and retrieval stride did not lead to meaningful improvements in terms of the Table 2 metrics. (5) In terms of factuality (Table 3), we observe similar patterns: Contrastive In-Context Learning RAG and Focus Mode RAG are still the top models, but the Query Expansion method achieves second place on the TruthfulQA dataset. (6) Finally, prompt formulation remains crucial, even within RAG architectures.

7 Conclusions and Future Work

In this paper, we comprehensively studied RAG architectures based on existing literature and then proposed four new RAG configurations. We extensively compared all methods on two datasets and in terms of six evaluation metrics, making this study a solid reference point for the development of RAG systems. Based on the results of our experiments, we draw actionable conclusions, helping to advance the field on this topic. Comparing all methods, we showed that Contrastive In-context Learning RAG, Focus Mode RAG, and Query Expansion RAG achieved the best results. Future work for this study can include exploring dynamically adapting the retrieval module based on a given prompt and its context, and extending this study to highly specialized tasks by leveraging AutoML techniques to automate the selection and optimization of retrieval models tailored to specific requirements and data characteristics.

8 Limitations

In this paper, we tested the effect of various RAG configurations, including approaches from previous literature as well as a few new approaches that we proposed. (1) While we extensively studied various RAG architectures and drew conclusions on best practices, we did not test the effect of combining two or more of the approaches that we studied. This remains important future work. (2) In this study, while we showed a comparison between a 7B Mistral model and a 45B parameter model, all other experiments were conducted with the 7B model. Thus, we did not study different model sizes in depth. (3) The multilingual experiments we conducted only considered English as the target language and French and German as the alternative languages. This experiment can be extended to a few other languages.
Acknowledgments

The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for their support. This study was supported by DFG grant #390727645.
References

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020b. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Claudio Carpineto and Giovanni Romano. 2012. A survey of automatic query expansion in information retrieval. ACM Computing Surveys (CSUR), 44(1):1–50.

Tyler A. Chang, Katrin Tomanek, Jessica Hoffmann, Nithum Thain, Erin van Liemt, Kathleen Meier-Hellstern, and Lucas Dixon. 2024. Detecting hallucination and coverage errors in retrieval augmented generation for controversial topics. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).

Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang, and Dong Yu. 2024. Dense X retrieval: What retrieval granularity should we use? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15159–15177. Association for Computational Linguistics.

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International Conference on Machine Learning, pages 3929–3938.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In Proceedings of the International Conference on Learning Representations (ICLR).

Jennifer Hsia, Afreen Shaikh, Zhiruo Wang, and Graham Neubig. 2024. RAGGED: Towards informed design of retrieval augmented generation systems. arXiv preprint arXiv:2403.09040.

Yutong Hu, Quzhe Huang, Mingxu Tao, Chen Zhang, and Yansong Feng. 2024. Can perplexity reflect large language model's ability in long text understanding? In The Second Tiny Papers Track at ICLR 2024.

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2024. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems.

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.

Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Xiaojian Jiang, Jiexin Xu, Li Qiuxia, and Jun Zhao. 2024. Tug-of-war between knowledge: Exploring and resolving knowledge conflicts in retrieval-augmented language models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 16867–16878.

Gangwoo Kim, Sungdong Kim, Byeongguk Jeon, Joonsuk Park, and Jaewoo Kang. 2023. Tree of Clarifications: Answering ambiguous questions with retrieval-augmented large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 996–1009.

Sung-Min Lee, Eunhwan Park, Donghyeon Jeon, Inho Kang, and Seung-Hoon Na. 2024. RADCoT: Retrieval-augmented distillation to specialization models for generating chain-of-thoughts in query expansion. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 13514–13523.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474.

Jiarui Li, Ye Yuan, and Zehua Zhang. 2024. Enhancing LLM factual accuracy with RAG to counter hallucinations: A case study on domain-specific queries in private knowledge-bases. arXiv preprint arXiv:2403.10446.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252.

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100. Association for Computational Linguistics.

Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. 2021. MAUVE: Measuring the gap between neural text and human text using divergence frontiers. In NeurIPS.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.

Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 11:1316–1331.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In EMNLP, pages 3982–3992.

Sina Semnani, Violet Yao, Heidi Zhang, and Monica Lam. 2023. WikiChat: Stopping the hallucination of large language model chatbots by few-shot grounding on Wikipedia. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2387–2413.

Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, and Hao Wang. 2024a. Continual learning of large language models: A comprehensive survey. arXiv preprint arXiv:2404.16789.

Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2024b. REPLUG: Retrieval-augmented black-box language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8371–8384. Association for Computational Linguistics.

Shamane Siriwardhana, Rivindu Weerasekera, Elliott Wen, Tharindu Kaluarachchi, Rajib Rana, and Suranga Nanayakkara. 2023. Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering. Transactions of the Association for Computational Linguistics, 11:1–17.

Qiushi Sun, Chengcheng Han, Nuo Chen, Renyu Zhu, Jingyang Gong, Xiang Li, and Ming Gao. 2024. Make prompt-based black-box tuning colorful: Boosting model generalization from three orthogonal perspectives. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 10958–10969.

Nhat Tran and Diane Litman. 2024. Enhancing knowledge retrieval with topic modeling for knowledge-grounded dialogue. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 5986–5995.

Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, et al. 2024. Searching for best practices in retrieval-augmented generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17716–17736.

Kevin Wu, Eric Wu, and James Zou. 2024. ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Lingxi Zhang, Jing Zhang, Yanling Wang, Shulin Cao, Xinmei Huang, Cuiping Li, Hong Chen, and Juanzi Li. 2023. FC-KBQA: A fine-to-coarse composition framework for knowledge base question answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1002–1017. Association for Computational Linguistics.
A Appendix

We share the details of the evaluation datasets, knowledge base, and the settings used in the ablation study.

A.1 Statistical Analysis of Knowledge Base

Table 4 shows the statistical analysis of the knowledge base.

KW        # Articles   S/A       Avg S/A   Avg W/A
Level 3   999          1-936     337       7472
Level 4   10,011       1-1690    258       5569

Table 4: Statistical analysis of the knowledge base. KW, # Articles, S/A, Avg S/A, and Avg W/A represent the knowledge base, the number of articles, the range of sentences per article, the average number of sentences per article, and the average number of words per article, respectively.

A.2 Prompt Variants

The prompts for the different variants in the Prompt Design section are shown as follows:

• HelpV1: You are a truthful expert question-answering bot and should correctly and concisely answer the following question.

• HelpV2: You are an accurate and reliable question-answering bot. Please provide a precise and correct response to the question following.

• HelpV3: As an expert QA bot, your role is to offer precise, accurate responses. Ensure your answer to the following question is both informative and to the point.

• AdversV1: You are an imaginative storytelling bot. Spin a detailed and creative tale in response to the next question.

• AdversV2: You are a friendly dog. Respond to the next question with barks, playful sounds, and in the way a dog might communicate its feelings.
• HelpV1: You are a truthful expert question-
these examples: Question: q, Correct Answer:
answering bot and should correctly and concisely
Answercorrect . Question: q, Incorrect Answer:
answer the following question.
Answerincorrect . Question: q, Correct Answer:
• HelpV2: You are an accurate and reliable Answercorrect . Question: q, Incorrect Answer:
question-answering bot. Please provide a precise Answerincorrect . Question: q, Correct Answer:"
and correct response to the question following.
A.4 Multilingual Setting

In the multilingual setting, we randomly replace English documents with French or German documents before embedding them for the MultiLingo
TruthfulQA example
Query: Can people who have been consumed by a wolf be rescued?
Correct Answer: No, people who have been consumed by a wolf are dead.
w/o_RAG: I must clarify that there is no known recorded case of a person being consumed entirely by a wolf.
Baseline: I must clarify that there are no known recorded cases of humans being rescued after being consumed by a wolf.
HelpV2: I'm an accurate and reliable question-answering bot, but I must clarify that the question seems to contain
2DocXL: No, once a person has been consumed by a wolf or any other predator, they cannot be rescued.
1K_5Doc: No, once a person has been consumed by a wolf or any other animal, they cannot be rescued.
ExpendL: I'm an expert question-answering bot and I'm here to provide truthful and concise answers.
ICL1D+: No, once a person has been consumed by a wolf, they cannot be rescued.
80Doc80S: No, once a person has been consumed by a wolf or any other animal, they cannot be rescued.

MMLU example
Query: The normal modes of a carbon dioxide molecule that are infrared-active include which of the following? I. Bending II. Symmetric stretching III. Asymmetric stretching
Correct Answer: I and III only.
w/o_RAG: The normal modes of a carbon dioxide (CO2) molecule that are infrared-active are:
Baseline: The normal modes of a carbon dioxide molecule that are infrared-active include I. Bending.
HelpV3: The normal modes of a carbon dioxide molecule that are infrared-active include I. Bending.
2DocXL: The normal modes of a carbon dioxide molecule that are infrared-active include II. Symmetric stretching.
1K_5Doc: The normal modes of a carbon dioxide molecule that are infrared-active include II. Asymmetric stretching.
ExpendL: The normal modes of a carbon dioxide molecule that are infrared-active include I. Bending and II.
ICL1D+: The correct answer is I. Bending and III. Asymmetric stretching.
120Doc120S: The normal modes of a carbon dioxide molecule that are infrared-active include I. Bending and III.

Table 5: Examples of the generated results on the TruthfulQA and MMLU datasets, where w/o_RAG is the base LLM without the RAG system. The variants HelpV2 (HelpV3), 2DocXL, 1K_5Doc, ExpendL, ICL1D+, and 80Doc80S (120Doc120S) represent the top-performing configurations for the Prompt Design, Document Size, Knowledge Base Size, Query Expansion, Contrastive ICL, and Focus Mode sections, respectively.