Regression Analysis
Abstract
Retrieval-Augmented Generation (RAG) is essential for integrating external knowledge into
Large Language Model (LLM) outputs. While the literature on RAG is growing, it primarily
focuses on systematic reviews and comparisons of new state-of-the-art (SoTA) techniques against
their predecessors, with a gap in extensive experimental comparisons. This study begins to address
this gap by assessing various RAG methods’ impacts on retrieval precision and answer similarity.
We found that Hypothetical Document Embedding (HyDE) and LLM reranking significantly enhance retrieval precision. However, Maximal Marginal Relevance (MMR) and Cohere rerank did
not exhibit notable advantages over a baseline Naive RAG system, and Multi-query approaches
underperformed. Sentence Window Retrieval emerged as the most effective for retrieval precision,
despite its variable performance on answer similarity. The study confirms the potential of the
Document Summary Index as a competent retrieval approach. All resources related to this research are publicly accessible for further investigation through our GitHub repository ARAGOG.
We welcome the community to further this exploratory study in RAG systems.
1 Introduction
Large Language Models (LLMs) have significantly advanced the field of natural language processing,
enabling a wide range of applications from text generation to question answering. However, integrating dynamic, external information remains a challenge for these models. Retrieval Augmented Generation (RAG) techniques address this limitation by incorporating external knowledge sources into the generation process, thus enhancing the models’ ability to produce contextually relevant and informed outputs. This integration of retrieval mechanisms with generative models is a key development
in improving the performance and versatility of LLMs, facilitating more accurate and context-aware
responses. See Figure 1 for an overview of the standard RAG workflow.
Despite the growing interest in RAG techniques within the domain of LLMs, the existing body of
literature primarily consists of systematic reviews (Gao et al., 2024) and direct comparisons between
successive state-of-the-art (SoTA) models (Gao et al., 2022; Jiang et al., 2023). This pattern reveals
a notable gap: a comprehensive experimental comparison across a broad spectrum of advanced RAG
techniques is missing. Such a comparison is crucial for understanding the relative strengths and
weaknesses of these techniques in enhancing LLMs’ performance across various tasks. This study seeks
to contribute to bridging this gap by providing an extensive evaluation of multiple RAG techniques
and their combinations, thereby offering insights into their efficacy and applicability in real-world
scenarios.
The focus of this investigation is a spectrum of advanced RAG techniques aimed at optimizing the
retrieval process. These techniques can be categorized into several areas:
RAG Technique                        Type
Sentence-window retrieval            Decoupling of Retrieval and Generation
Document summary index               Decoupling of Retrieval and Generation
HyDE                                 Query Expansion
Multi-query                          Query Expansion
Maximal Marginal Relevance (MMR)     Enhancement Mechanism
Cohere Re-ranker                     Re-rankers
LLM-based Re-ranker                  Re-rankers
To evaluate the RAG techniques, this study leverages two metrics: Retrieval Precision and Answer
Similarity (Tonic AI, 2023). Retrieval Precision measures the relevance of the retrieved context to the
question asked, while Answer Similarity assesses how closely the system’s answers align with reference
responses, on a scale from 0 to 5.
Figure 1: A high-level overview of the workflow within a Retrieval-Augmented Generation (RAG) system. This
process diagram shows how a user query is processed by the system to retrieve relevant documents from a database
and how these documents inform the generation of a response.
2 RAG Techniques
2.1 Sentence-window retrieval
The Sentence-window Retrieval technique is grounded in the principle of optimizing both retrieval
and generation processes by tailoring the text chunk size to the specific needs of each stage (Yang,
2023). For retrieval, this technique emphasizes single sentences, taking advantage of small data chunks for potentially better retrieval performance. On the generation side, it adds more sentences around the initial one to offer the LLM extended context, aiming for richer, more detailed outputs. This decoupling is intended to improve both retrieval and generation, ultimately leading to better performance of the whole RAG system.
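To make the decoupling concrete, the following minimal sketch indexes individual sentences for retrieval but hands the generator a window of neighbouring sentences. It is an illustration only, not the pipeline evaluated in this study: the similarity function is a toy lexical overlap standing in for an embedding model, and window_size is an assumed parameter.

```python
import re

def similarity(a: str, b: str) -> float:
    """Toy lexical overlap; a real system would compare dense embeddings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def sentence_window_retrieve(document: str, query: str, window_size: int = 2) -> str:
    """Retrieve on single sentences, then return the best sentence plus its neighbours."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    # Retrieval stage: score each individual sentence against the query.
    best = max(range(len(sentences)), key=lambda i: similarity(sentences[i], query))
    # Generation stage: expand to a window of surrounding sentences for extra context.
    lo, hi = max(0, best - window_size), min(len(sentences), best + window_size + 1)
    return " ".join(sentences[lo:hi])

doc = ("RAG couples retrieval with generation. Sentence-window retrieval indexes single "
       "sentences. At answer time the neighbouring sentences are added back. This gives the "
       "LLM richer context. Other techniques tune the query instead.")
print(sentence_window_retrieve(doc, "How does sentence-window retrieval work?"))
```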
2.2 Document summary index
The Document Summary Index method enhances RAG systems by indexing document summaries for
efficient retrieval, while providing LLMs with full text documents for response generation (Liu, 2023a).
This decoupling strategy optimizes retrieval speed and accuracy through summary-based indexing and
supports comprehensive response synthesis by utilizing the original text.
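A minimal sketch of this decoupling is shown below: summaries are searched, but the full document text is returned for generation. The summarize and similarity helpers are placeholders (in practice an LLM generates the summaries offline and a dense embedding model scores them), and the class name is ours rather than a library API.

```python
def summarize(text: str) -> str:
    """Placeholder: in practice an LLM would produce this summary offline."""
    return text[:120]

def similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

class DocumentSummaryIndex:
    def __init__(self, documents: dict[str, str]):
        # The index maps doc_id -> summary; retrieval never touches the full text.
        self.documents = documents
        self.summaries = {doc_id: summarize(text) for doc_id, text in documents.items()}

    def retrieve(self, query: str) -> str:
        """Match the query against summaries, but return the full document."""
        best_id = max(self.summaries, key=lambda d: similarity(self.summaries[d], query))
        return self.documents[best_id]

index = DocumentSummaryIndex({
    "hyde": "HyDE generates a hypothetical answer and embeds it for retrieval.",
    "mmr": "MMR balances relevance against redundancy when selecting chunks.",
})
print(index.retrieve("What does HyDE do?"))
```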
2.3 HyDE
The Hypothetical Document Embedding (HyDE) technique (Gao et al., 2022) enhances document retrieval by leveraging LLMs to generate a hypothetical answer to a query. HyDE capitalizes on the ability of LLMs to produce context-rich answers, which, once embedded, serve as a powerful tool to refine and focus document retrieval efforts. See Figure 2 for an overview of the HyDE RAG system workflow.
Figure 2: The process flow of Hypothetical Document Embedding (HyDE) technique within a Retrieval-
Augmented Generation system. The diagram illustrates the steps from the initial query input to the generation
of a hypothetical answer and its use in retrieving relevant documents to inform the final generated response.
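The sketch below follows the HyDE flow (query → hypothetical answer → embedding → nearest documents) under simplifying assumptions: generate_hypothetical_answer stands in for a GPT-3.5-turbo call, and embed/cosine stand in for a real embedding model.

```python
import math

def generate_hypothetical_answer(query: str) -> str:
    """Stand-in for an LLM call that writes a plausible (possibly wrong) answer."""
    return f"A detailed passage answering the question: {query}"

def embed(text: str) -> dict[str, float]:
    """Toy bag-of-words 'embedding'; a dense embedding model would be used in practice."""
    vec: dict[str, float] = {}
    for token in text.lower().split():
        vec[token] = vec.get(token, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def hyde_retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Embed the hypothetical answer instead of the raw query, then rank documents by it.
    hypo_vec = embed(generate_hypothetical_answer(query))
    return sorted(corpus, key=lambda d: cosine(embed(d), hypo_vec), reverse=True)[:k]

corpus = ["HyDE embeds a generated answer to focus retrieval.",
          "Sentence-window retrieval expands context after retrieval.",
          "Cohere rerank reorders retrieved chunks by relevance."]
print(hyde_retrieve("How does HyDE focus retrieval?", corpus, k=1))
```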
2.4 Multi-query
The Multi-query technique (Langchain, 2023) enhances document retrieval by expanding a single user
query into multiple similar queries with the assistance of an LLM. This process involves generating N
alternative questions that echo the intent of the original query but from different angles, thereby capturing a broader spectrum of potential answers. Each query, including the original, is then vectorized
and subjected to its own retrieval process, which increases the chances of fetching a higher volume
of relevant information from the document repository. To manage the resultant expanded dataset,
a re-ranker is often employed, utilizing machine learning models to sift through the retrieved chunks
and prioritize those most relevant with regard to the initial query. See Figure 3 for an overview of the Multi-query RAG system workflow.
Figure 3: This diagram showcases how multiple similar queries are generated from an initial user query, and how
they contribute to retrieving a wider range of relevant documents.
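A compact sketch of the multi-query idea follows, with assumed helpers: rewrite_query stands in for the LLM that produces the N alternative phrasings, retrieval is a toy overlap search, and the per-query rankings are merged with reciprocal-rank fusion (which the "Multi Query + Reciprocal" variant reported in Section 4 presumably refers to).

```python
def rewrite_query(query: str, n: int = 3) -> list[str]:
    """Stand-in for an LLM prompt that rephrases the query from different angles."""
    return [f"{query} (variant {i})" for i in range(1, n + 1)]

def similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    return sorted(corpus, key=lambda d: similarity(d, query), reverse=True)[:k]

def multi_query_retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    queries = [query] + rewrite_query(query)
    # Reciprocal-rank fusion: a document earns 1/(60 + rank) for every ranking it appears in.
    scores: dict[str, float] = {}
    for q in queries:
        for rank, doc in enumerate(retrieve(q, corpus, k)):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (60 + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]

corpus = ["Multi-query expands one question into several.",
          "A re-ranker prioritises the merged results.",
          "HyDE embeds a hypothetical answer."]
print(multi_query_retrieve("How does multi-query retrieval work?", corpus, k=2))
```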
2.5 Maximal Marginal Relevance (MMR)
The Maximal Marginal Relevance (MMR) technique selects documents not only for their relevance to the query’s intent but also for their uniqueness compared to documents already selected. This approach mitigates the issue of redundancy, ensuring that the set of retrieved documents covers a broader range of information.
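Under its usual formulation, MMR greedily selects the next chunk by trading off similarity to the query against similarity to chunks already selected. The sketch below uses a toy lexical similarity and an assumed trade-off weight lambda_; it illustrates the selection rule rather than any particular implementation.

```python
def similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def mmr_select(query: str, candidates: list[str], k: int = 3, lambda_: float = 0.7) -> list[str]:
    """Greedy MMR: score = lambda * sim(query, d) - (1 - lambda) * max sim(d, already selected)."""
    selected: list[str] = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(d: str) -> float:
            redundancy = max((similarity(d, s) for s in selected), default=0.0)
            return lambda_ * similarity(query, d) - (1 - lambda_) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```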
2.6 Re-rankers
Re-rankers reassess the relevance of the initially retrieved chunks and refine the set of documents passed to the generation step. Two variants are evaluated in this study: the Cohere Re-ranker and an LLM-based Re-ranker. See Figure 4 for an overview of the reranking step.
Figure 4: This flowchart outlines the reranking process in a RAG system. It illustrates how retrieved documents are further assessed for relevance using a reranking step, which refines the set of documents that will inform the generated response.
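As a concrete illustration of the reranking step in Figure 4, the sketch below re-scores each retrieved chunk and reorders the list. The score_with_llm helper is a placeholder for a real grading call (an LLM prompted to rate relevance, or a hosted re-ranker such as Cohere's); the 0-10 scale and prompt wording are assumptions.

```python
def score_with_llm(query: str, chunk: str) -> float:
    """Placeholder for an LLM call such as:
    'Rate from 0 to 10 how relevant this passage is to the question.'
    A lexical overlap fallback is used here so the sketch runs end to end."""
    tq, tc = set(query.lower().split()), set(chunk.lower().split())
    return 10.0 * len(tq & tc) / max(1, len(tq | tc))

def llm_rerank(query: str, retrieved: list[str], top_n: int = 3) -> list[str]:
    # Re-score every retrieved chunk and keep only the highest-rated ones.
    scored = sorted(retrieved, key=lambda c: score_with_llm(query, c), reverse=True)
    return scored[:top_n]
```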
3 Methods
3.1 Data
This study utilizes a tailored dataset derived from the AI ArXiv collection, accessible via Hugging
Face (James Calam, 2023). The dataset consists of 423 selected research papers centered around the
themes of AI and LLMs, sourced from arXiv. This selection offers a comprehensive foundation for
constructing a database to test the RAG techniques and creating a set of evaluation data to assess
their effectiveness.
3.1.3 Evaluation Data Preparation
The evaluation dataset comprises 107 question-answer (QA) pairs generated with the assistance of
GPT-4. The generation process was guided by specific criteria to ensure that the questions were
challenging, technically precise, and reflective of potential user inquiries sent to a RAG system. Each
QA pair was then reviewed by humans to validate its relevance and accuracy, ensuring that the
evaluation data accurately measures the RAG techniques’ performance in real-world applications.
The QA dataset is available in this paper’s associated GitHub repository ARAGOG. See Figure 5 for
an overview of the data preparation process.
Figure 5: The visualization of the AI ArXiv dataset preparation process. This diagram shows the selection of
papers for question-answer generation, the employment of the full dataset to provide ample noise for the RAG
system, and the chunking approaches used to process the documents for the vector database.
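The sketch below gives a rough idea of how such QA pairs can be generated programmatically; the prompt wording and the ask_llm helper are illustrative assumptions, not the exact GPT-4 prompt used for this dataset.

```python
import json

def ask_llm(prompt: str) -> str:
    """Stand-in for a GPT-4 call; returns a canned example so the sketch runs."""
    return json.dumps({"question": "What does HyDE embed for retrieval?",
                       "answer": "A hypothetical answer generated by an LLM."})

def generate_qa_pair(paper_excerpt: str) -> dict[str, str]:
    prompt = (
        "From the excerpt below, write one technically precise question a user might "
        "ask a RAG system, plus a short reference answer. Respond as JSON with keys "
        f"'question' and 'answer'.\n\nExcerpt:\n{paper_excerpt}"
    )
    return json.loads(ask_llm(prompt))

pair = generate_qa_pair("HyDE generates a hypothetical answer and embeds it ...")
print(pair["question"], "->", pair["answer"])
```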
3.3 Metrics
To evaluate the performance of various RAG techniques within this study, two primary metrics were
employed from the Tonic Validate package/platform: Retrieval Precision and Answer Similarity (Tonic
AI, 2023). These metrics were selected to evaluate both the retrieval process and the generative
capabilities of the LLMs used, with a primary focus on the precision of information retrieval.
Figure 6: This diagram illustrates the evaluation process in a RAG system, where each chunk’s relevance is scored,
contributing to the overall retrieval precision metric. The process highlights how individual chunks are evaluated
for their utility in responding to user queries.
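As a rough illustration of the two scores (the actual judging in this study is performed by Tonic Validate with an OpenAI model, not by these toy helpers): retrieval precision is the fraction of retrieved chunks judged relevant to the question, and answer similarity grades the generated answer against the reference on a 0 to 5 scale.

```python
def judge_relevant(question: str, chunk: str) -> bool:
    """Placeholder for the LLM judgment 'is this chunk relevant to the question?'."""
    q = set(question.lower().split())
    return len(q & set(chunk.lower().split())) >= 2

def retrieval_precision(question: str, retrieved_chunks: list[str]) -> float:
    # Fraction of retrieved chunks the judge marks as relevant.
    relevant = sum(judge_relevant(question, c) for c in retrieved_chunks)
    return relevant / max(1, len(retrieved_chunks))

def answer_similarity(answer: str, reference: str) -> float:
    """Placeholder for an LLM grade on a 0-5 scale comparing answer and reference."""
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return 5.0 * len(a & r) / max(1, len(a | r))
```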
3.4 LLM
For the experimental setup, GPT-3.5-turbo was selected because of its cost-effectiveness and ease of implementation. Tonic Validate requires OpenAI models for grading, and GPT-3.5-turbo was preferred over GPT-4, which might provide more precise grading but at a higher expense. It is important to note that the choice of the LLM used for generation itself was less critical for the main objective of evaluating retrieval precision, since answer similarity can be regarded as a supplementary metric.
4 Results
The study systematically evaluates a variety of advanced RAG techniques using metrics of Retrieval
Precision and Answer Similarity. A comparative analysis is presented through boxplots to visualize the
distribution of these metrics, followed by ANOVA and Tukey’s HSD tests to determine the statistical
significance of the differences observed.
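A minimal sketch of this statistical comparison is given below, assuming the per-question retrieval precision scores are collected in a long-format table with columns experiment and retrieval_precision (the column names and the toy numbers are illustrative). It uses SciPy's one-way ANOVA and statsmodels' Tukey HSD.

```python
import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Long-format results: one row per (technique, question) score; values are illustrative.
df = pd.DataFrame({
    "experiment": ["naive_rag"] * 3 + ["hyde"] * 3 + ["hyde_llm_rerank"] * 3,
    "retrieval_precision": [0.55, 0.60, 0.52, 0.70, 0.68, 0.74, 0.78, 0.73, 0.80],
})

# One-way ANOVA: do the techniques differ at all?
groups = [g["retrieval_precision"].values for _, g in df.groupby("experiment")]
print(f_oneway(*groups))

# Tukey's HSD: which pairs of techniques differ significantly?
tukey = pairwise_tukeyhsd(df["retrieval_precision"], df["experiment"], alpha=0.05)
print(tukey.summary())
```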
Figure 7: Boxplot of Retrieval Precision by Experiment. Each boxplot demonstrates the range and distribution of retrieval precision scores across different RAG techniques. Higher median values and tighter
interquartile ranges suggest better performance and consistency.
The analysis of answer similarity (Figure 8) presents intriguing patterns that both align with and
diverge from those observed in retrieval precision. For Classic Vector Database (VDB) techniques
and Document Summary Index, there is a notable positive correlation between retrieval precision and
answer similarity, suggesting that when relevant information is accurately retrieved, it can lead to
answers that more closely mirror reference responses.
In contrast, Sentence Window Retrieval displays a disparity between high retrieval precision and
lower answer similarity scores. This could indicate that while the technique is adept at identifying
relevant passages, it may not be translating this information into answers that are semantically parallel
to the reference, possibly due to the generation phase not fully leveraging the retrieved context.
Figure 8: Boxplot of Answer Similarity by Experiment. Each boxplot demonstrates the range and distribution of answer similarity scores across different RAG techniques. Higher median values and tighter
interquartile ranges suggest better performance and consistency.
4.2.1 Classic VDB
The Tukey post-hoc test results for the Classic VDB setup confirm that both HyDE and its combinations with Cohere Rerank and LLM Rerank significantly outperform the Naive RAG, aligning with the boxplot observations of higher retrieval precision. A significant improvement is also observed for LLM Rerank alone. However, Maximal Marginal Relevance (MMR) and Cohere Rerank do not show a significant improvement over Naive RAG, which is also reflected in their closer median precision scores in the boxplots. Interestingly, the Multi Query + Reciprocal technique, while statistically significant, presents a mean difference indicating lower performance compared to Naive RAG, contradicting the anticipated outcome and calling for additional scrutiny.
Table 1: Tukey’s HSD test results comparing RAG techniques to Naive RAG, all within the
Classic VDB framework
Next, the focus is on techniques which have shown statistically significant improvement over the
Naive RAG approach. The table below presents the results from Tukey’s post-hoc tests, contrasting
each of the high-performing techniques against each other.
Table 2: Tukey’s HSD test results of RAG techniques that offer significant improvement over
Naive RAG
The combination of HyDE and LLM Rerank emerged as the most potent in enhancing retrieval
precision within the Classic VDB framework, surpassing other techniques. However, this superior
performance comes with higher latency and cost implications due to the additional LLM calls required
for both reranking and hypothetical document embedding. A close second is HyDE alone, which shows no significant difference from the HyDE + LLM Rerank combination. Experiments including Cohere Rerank did not demonstrate the anticipated benefits.
Technique             Comparison                         Mean Diff.   P-adj    Reject Null
HyDE + LLM Rerank     Sentence Window + Cohere Rerank    0.1021       0.0000   True
Table 3: Tukey’s HSD test results for the best-performing Classic VDB technique against the worst-performing Sentence Window retrieval technique.
Table 4: Tukey’s HSD test results for Sentence Window retrieval enhancements
The Tukey’s HSD test identifies Sentence Window retrieval with LLM Rerank as the only variant offering a statistically significant improvement over the base Sentence Window technique.
Table 5: Tukey’s HSD test results comparing Document Summary Index variants
To conclude the analysis, a basic version of each vector database setup was compared against the others, i.e. Sentence Window, Naive RAG, and Document Summary Index with Cohere Rerank. Utilizing plain Document Summary without enhancements was not feasible for this analysis, as it aggregates multiple chunks into one summary, leading to results not directly comparable to other techniques that operate on different chunk quantities.
Table 6: Tukey’s HSD test results comparing the performance of Sentence Window retrieval and Document Summary
Index + Cohere Rerank against the baseline Naive RAG
The Tukey’s HSD test results establish the Sentence Window retrieval as the leading technique, surpassing the Document Summary Index in precision. Document Summary Index with Cohere Rerank
trails behind as a viable second, whereas the Classic VDB, in its standard form, demonstrates the
least retrieval precision among the evaluated techniques.
5 Limitations
• Model selection: We used GPT-3.5-turbo for evaluating responses due to the constraints of
Tonic Validate, which requires the use of OpenAI models. The choice of GPT-3.5-turbo, while
cost-effective, may not offer the same depth of analysis as more advanced models like GPT-4.
• Data and question scope: The study was conducted using a single dataset and a set of 107
questions, which may affect the generalizability of the findings across different LLM applications.
Expanding the variety of datasets and questions could potentially yield more comprehensive
insights.
• Chunking variability: While the use of multiple chunking strategies allowed for a comprehensive evaluation of different retrieval methods, it also highlighted the inherent challenges in
directly comparing their performance against the same metrics. Each retrieval method required
a distinct chunking approach tailored to its specific needs. For instance, the sentence window
retrieval method necessitated overlapping chunks of consecutive sentences, while the document
summary index used larger chunks to leverage the language model’s summarization capabilities
effectively. Consequently, the retrieval methods were evaluated on chunk types with varying degrees of context and information density, making it difficult to draw definitive conclusions about
their relative strengths and weaknesses. This limitation stems from the fundamental differences
in how these retrieval methods operate and the distinct chunking requirements they impose.
• Evaluation metrics: The lack of a clear consensus on the optimal metrics for evaluating RAG
systems means our chosen metrics—Retrieval Precision and Answer Similarity—are based on
conceptual alignment rather than empirical evidence of their efficacy. This highlights an area
for future research to solidify the evaluation framework for RAG systems.
• Technique Selection: The subset of RAG techniques evaluated, while selected based on current relevance and potential, is not exhaustive. Excluded techniques such as Step-back prompting (Dai et al., 2023), Auto-merging retrieval (Phaneendra, 2023), and Hybrid search (Akash,
2023) reflect the study’s scope limitation and the subjective nature of selection. Future research
should consider these and other emerging methods to broaden the understanding of RAG system
enhancements.
6 Conclusion
Our investigation into Retrieval-Augmented Generation (RAG) techniques has identified HyDE and
LLM reranking as notable enhancers of retrieval precision in LLMs. These approaches, however,
necessitate additional LLM queries, incurring greater latency and cost. Surprisingly, established
techniques like MMR and Cohere rerank did not demonstrate significant benefits, and Multi-query
was found to be less effective than baseline Naive RAG.
The results demonstrate the efficacy of the Sentence Window Retrieval technique in achieving high precision for retrieval tasks, although a discrepancy was observed between retrieval precision and answer similarity scores. Given its conceptual similarity to Sentence Window retrieval, we suggest that Auto-merging retrieval (Phaneendra, 2023) might offer comparable benefits, warranting future investigation. The Document Summary Index approach also exhibited satisfactory performance; however, it requires an upfront investment in generating summaries for each document in the corpus.
Due to constraints such as reliance on a single dataset, a limited question set, and the use of GPT-3.5-turbo for evaluation, the results may not fully capture the potential of more advanced models. Future studies
with broader datasets and higher-capability LLMs could provide more comprehensive insights. This
research contributes a foundational perspective to the field, encouraging subsequent works to refine,
validate, and expand upon our findings.
To facilitate this continuation of research and allow for the replication and extension of our work, we have made our experimental pipeline available through a publicly accessible GitHub repository (Predlico, 2024).
7 Future Work
• Knowledge Graph RAG: Integrating Knowledge Graphs (KGs) with RAG systems represents
a promising direction for enhancing retrieval precision and contextual relevance. KGs offer a
well-organized framework of relationship-rich data that could refine the retrieval phase of RAG
systems (Bratanic, 2023). Although setting up such systems is resource-demanding, the potential
for significantly improved retrieval processes justifies further investigation.
• Unfrozen RAG systems: Unlike the static application of RAG systems in our study, future
investigations can benefit from adapting RAG components, including embedding models and
rerankers, directly to specific datasets (Gao et al., 2024; Kiela, 2024). This “unfrozen” approach
allows for fine-tuning on nuanced use-case data, potentially enhancing system specificity and
output quality. Exploring these adaptations could lead to more adaptable and effective RAG
systems tailored to diverse application needs.
• Experiment replication across diverse datasets: To ensure the robustness and general-
izability of our findings, it is imperative for future research to replicate our experiments using
a variety of datasets. Conducting these experiments across multiple datasets is important to
verify the applicability of our results and to identify any context-specific adjustments needed.
• Auto-RAG: The idea of automatically optimizing RAG systems, akin to Auto-ML’s approach
in traditional machine learning, presents a significant opportunity for future exploration. Currently, selecting the optimal configuration of RAG components — e.g., chunking strategies, window sizes, and parameters within rerankers — relies on manual experimentation and intuition. An automated system could systematically explore a vast space of RAG configurations and select the best-performing setup (Markr.AI, 2024); a toy sketch of such a search follows this list.
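As a toy sketch of what such automation could look like, the loop below scores a small grid of hypothetical RAG configurations against an evaluation routine; the configuration fields and the evaluate_config stub are assumptions, not part of any existing Auto-RAG tool.

```python
from itertools import product

def evaluate_config(chunk_size: int, window_size: int, reranker: str) -> float:
    """Placeholder: build the RAG pipeline with these settings and return mean retrieval precision."""
    return 0.5 + 0.01 * window_size - 0.0001 * abs(chunk_size - 512) + (0.05 if reranker == "llm" else 0.0)

search_space = {
    "chunk_size": [256, 512, 1024],
    "window_size": [1, 2, 3],
    "reranker": ["none", "cohere", "llm"],
}

# Exhaustively evaluate every configuration and keep the best one.
best_score, best_config = float("-inf"), None
for chunk_size, window_size, reranker in product(*search_space.values()):
    score = evaluate_config(chunk_size, window_size, reranker)
    if score > best_score:
        best_score, best_config = score, (chunk_size, window_size, reranker)

print("best configuration:", dict(zip(search_space, best_config)), "score:", round(best_score, 3))
```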
References
Akash. Hybrid search: Optimizing RAG implementation. https://medium.com/@csakash03/hybrid-search-is-a-method-to-optimize-rag-implementation-98d9d0911341, 2023. Accessed: 2024-04-01.
L. Gao, X. Ma, J. Lin, and J. Callan. Precise zero-shot dense retrieval without relevance labels, 2022.
Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, Q. Guo, M. Wang, and H. Wang.
Retrieval-augmented generation for large language models: A survey, 2024.
James Calam. AI ArXiv dataset. https://huggingface.co/datasets/jamescalam/ai-arxiv, 2023. Accessed: 2024-03-24.
Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang, J. Callan, and G. Neubig. Active
retrieval augmented generation, 2023.
D. Kiela. Stanford CS25: V3 I Retrieval Augmented Language Models. https://www.youtube.com/watch?v=mE7IDf2SmJg, 2024. Accessed: 2024-03-24.