Regression Analysis
Abstract
Retrieval-Augmented Generation (RAG) is essential for integrating external knowledge into
Large Language Model (LLM) outputs. While the literature on RAG is growing, it primarily
focuses on systematic reviews and comparisons of new state-of-the-art (SoTA) techniques against
their predecessors, with a gap in extensive experimental comparisons. This study begins to address
this gap by assessing various RAG methods’ impacts on retrieval precision and answer similarity.
We found that Hypothetical Document Embedding (HyDE) and LLM reranking significantly enhance retrieval precision. However, Maximal Marginal Relevance (MMR) and Cohere rerank did
not exhibit notable advantages over a baseline Naive RAG system, and Multi-query approaches
underperformed. Sentence Window Retrieval emerged as the most effective for retrieval precision,
despite its variable performance on answer similarity. The study confirms the potential of the
Document Summary Index as a competent retrieval approach. All resources related to this research are publicly accessible for further investigation through our GitHub repository ARAGOG.
We welcome the community to further this exploratory study in RAG systems.
1 Introduction
Large Language Models (LLMs) have significantly advanced the field of natural language processing,
enabling a wide range of applications from text generation to question answering. However, integrating dynamic, external information remains a challenge for these models. Retrieval Augmented Generation (RAG) techniques address this limitation by incorporating external knowledge sources into the generation process, thus enhancing the models’ ability to produce contextually relevant and informed outputs. This integration of retrieval mechanisms with generative models is a key development
in improving the performance and versatility of LLMs, facilitating more accurate and context-aware
responses. See Figure 1 for an overview of the standard RAG workflow.
Despite the growing interest in RAG techniques within the domain of LLMs, the existing body of
literature primarily consists of systematic reviews (Gao et al., 2024) and direct comparisons between
successive state-of-the-art (SoTA) models (Gao et al., 2022; Jiang et al., 2023). This pattern reveals
a notable gap: a comprehensive experimental comparison across a broad spectrum of advanced RAG
techniques is missing. Such a comparison is crucial for understanding the relative strengths and
weaknesses of these techniques in enhancing LLMs’ performance across various tasks. This study seeks
to contribute to bridging this gap by providing an extensive evaluation of multiple RAG techniques
and their combinations, thereby offering insights into their efficacy and applicability in real-world
scenarios.
The focus of this investigation is a spectrum of advanced RAG techniques aimed at optimizing the
retrieval process. These techniques can be categorized into several areas:
RAG Technique                        Type
Sentence-window retrieval            Decoupling of Retrieval and Generation
Document summary index               Decoupling of Retrieval and Generation
HyDE                                 Query Expansion
Multi-query                          Query Expansion
Maximal Marginal Relevance (MMR)     Enhancement Mechanism
Cohere Re-ranker                     Re-rankers
LLM-based Re-ranker                  Re-rankers
To evaluate the RAG techniques, this study leverages two metrics: Retrieval Precision and Answer
Similarity (Tonic AI, 2023). Retrieval Precision measures the relevance of the retrieved context to the
question asked, while Answer Similarity assesses how closely the system’s answers align with reference
responses, on a scale from 0 to 5.
Figure 1: A high-level overview of the workflow within a Retrieval-Augmented Generation (RAG) system. This
process diagram shows how a user query is processed by the system to retrieve relevant documents from a database
and how these documents inform the generation of a response.
2 RAG Techniques
2.1 Sentence-window retrieval
The Sentence-window Retrieval technique is grounded in the principle of optimizing both retrieval
and generation processes by tailoring the text chunk size to the specific needs of each stage (Yang,
2023). For retrieval, this technique emphasizes single sentences, taking advantage of small data chunks for potentially better retrieval performance. On the generation side, it adds more sentences around the initial one to offer the LLM extended context, aiming for richer, more detailed outputs. This decoupling is intended to improve both retrieval and generation, ultimately leading to better performance of the whole RAG system.
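To make the decoupling concrete, the following minimal sketch indexes individual sentences for retrieval but hands the generator a window of neighbouring sentences. It is an illustration only, not the pipeline evaluated in this study: the similarity function is a toy lexical overlap standing in for an embedding model, and window_size is an assumed parameter.

```python
import re

def similarity(a: str, b: str) -> float:
    """Toy lexical overlap; a real system would compare dense embeddings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def sentence_window_retrieve(document: str, query: str, window_size: int = 2) -> str:
    """Retrieve on single sentences, then return the best sentence plus its neighbours."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    # Retrieval stage: score each individual sentence against the query.
    best = max(range(len(sentences)), key=lambda i: similarity(sentences[i], query))
    # Generation stage: expand to a window of surrounding sentences for extra context.
    lo, hi = max(0, best - window_size), min(len(sentences), best + window_size + 1)
    return " ".join(sentences[lo:hi])

doc = ("RAG couples retrieval with generation. Sentence-window retrieval indexes single "
       "sentences. At answer time the neighbouring sentences are added back. This gives the "
       "LLM richer context. Other techniques tune the query instead.")
print(sentence_window_retrieve(doc, "How does sentence-window retrieval work?"))
```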
2.2 Document summary index
The Document Summary Index method enhances RAG systems by indexing document summaries for
efficient retrieval, while providing LLMs with full text documents for response generation (Liu, 2023a).
This decoupling strategy optimizes retrieval speed and accuracy through summary-based indexing and
supports comprehensive response synthesis by utilizing the original text.
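A minimal sketch of this decoupling is shown below: summaries are searched, but the full document text is returned for generation. The summarize and similarity helpers are placeholders (in practice an LLM generates the summaries offline and a dense embedding model scores them), and the class name is ours rather than a library API.

```python
def summarize(text: str) -> str:
    """Placeholder: in practice an LLM would produce this summary offline."""
    return text[:120]

def similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

class DocumentSummaryIndex:
    def __init__(self, documents: dict[str, str]):
        # The index maps doc_id -> summary; retrieval never touches the full text.
        self.documents = documents
        self.summaries = {doc_id: summarize(text) for doc_id, text in documents.items()}

    def retrieve(self, query: str) -> str:
        """Match the query against summaries, but return the full document."""
        best_id = max(self.summaries, key=lambda d: similarity(self.summaries[d], query))
        return self.documents[best_id]

index = DocumentSummaryIndex({
    "hyde": "HyDE generates a hypothetical answer and embeds it for retrieval.",
    "mmr": "MMR balances relevance against redundancy when selecting chunks.",
})
print(index.retrieve("What does HyDE do?"))
```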
2.3 HyDE
The Hypothetical Document Embedding (HyDE) technique (Gao et al., 2022) enhances document retrieval by leveraging LLMs to generate a hypothetical answer to a query. HyDE capitalizes on the ability of LLMs to produce context-rich answers, which, once embedded, serve as a powerful tool to refine and focus document retrieval efforts. See Figure 2 for an overview of the HyDE RAG system workflow.
Figure 2: The process flow of Hypothetical Document Embedding (HyDE) technique within a Retrieval-
Augmented Generation system. The diagram illustrates the steps from the initial query input to the generation
of a hypothetical answer and its use in retrieving relevant documents to inform the final generated response.
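The sketch below follows the HyDE flow (query → hypothetical answer → embedding → nearest documents) under simplifying assumptions: generate_hypothetical_answer stands in for a GPT-3.5-turbo call, and embed/cosine stand in for a real embedding model.

```python
import math

def generate_hypothetical_answer(query: str) -> str:
    """Stand-in for an LLM call that writes a plausible (possibly wrong) answer."""
    return f"A detailed passage answering the question: {query}"

def embed(text: str) -> dict[str, float]:
    """Toy bag-of-words 'embedding'; a dense embedding model would be used in practice."""
    vec: dict[str, float] = {}
    for token in text.lower().split():
        vec[token] = vec.get(token, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def hyde_retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Embed the hypothetical answer instead of the raw query, then rank documents by it.
    hypo_vec = embed(generate_hypothetical_answer(query))
    return sorted(corpus, key=lambda d: cosine(embed(d), hypo_vec), reverse=True)[:k]

corpus = ["HyDE embeds a generated answer to focus retrieval.",
          "Sentence-window retrieval expands context after retrieval.",
          "Cohere rerank reorders retrieved chunks by relevance."]
print(hyde_retrieve("How does HyDE focus retrieval?", corpus, k=1))
```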
2.4 Multi-query
The Multi-query technique (Langchain, 2023) enhances document retrieval by expanding a single user
query into multiple similar queries with the assistance of an LLM. This process involves generating N
alternative questions that echo the intent of the original query but from different angles, thereby capturing a broader spectrum of potential answers. Each query, including the original, is then vectorized
and subjected to its own retrieval process, which increases the chances of fetching a higher volume
of relevant information from the document repository. To manage the resultant expanded dataset,
a re-ranker is often employed, utilizing machine learning models to sift through the retrieved chunks
and prioritize those most relevant with regard to the initial query. See Figure 3 for an overview of the Multi-query RAG system workflow.
Figure 3: This diagram showcases how multiple similar queries are generated from an initial user query, and how
they contribute to retrieving a wider range of relevant documents.
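A compact sketch of the multi-query idea follows, with assumed helpers: rewrite_query stands in for the LLM that produces the N alternative phrasings, retrieval is a toy overlap search, and the per-query rankings are merged with reciprocal-rank fusion (which the "Multi Query + Reciprocal" variant reported in Section 4 presumably refers to).

```python
def rewrite_query(query: str, n: int = 3) -> list[str]:
    """Stand-in for an LLM prompt that rephrases the query from different angles."""
    return [f"{query} (variant {i})" for i in range(1, n + 1)]

def similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    return sorted(corpus, key=lambda d: similarity(d, query), reverse=True)[:k]

def multi_query_retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    queries = [query] + rewrite_query(query)
    # Reciprocal-rank fusion: a document earns 1/(60 + rank) for every ranking it appears in.
    scores: dict[str, float] = {}
    for q in queries:
        for rank, doc in enumerate(retrieve(q, corpus, k)):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (60 + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]

corpus = ["Multi-query expands one question into several.",
          "A re-ranker prioritises the merged results.",
          "HyDE embeds a hypothetical answer."]
print(multi_query_retrieve("How does multi-query retrieval work?", corpus, k=2))
```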
2.5 Maximal Marginal Relevance (MMR)
The Maximal Marginal Relevance (MMR) technique selects documents not only for their relevance to the query’s intent but also for their uniqueness compared to documents already selected. This approach mitigates the issue of redundancy, ensuring that the set of retrieved documents covers a broader range of information.
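Under its usual formulation, MMR greedily selects the next chunk by trading off similarity to the query against similarity to chunks already selected. The sketch below uses a toy lexical similarity and an assumed trade-off weight lambda_; it illustrates the selection rule rather than any particular implementation.

```python
def similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def mmr_select(query: str, candidates: list[str], k: int = 3, lambda_: float = 0.7) -> list[str]:
    """Greedy MMR: score = lambda * sim(query, d) - (1 - lambda) * max sim(d, already selected)."""
    selected: list[str] = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(d: str) -> float:
            redundancy = max((similarity(d, s) for s in selected), default=0.0)
            return lambda_ * similarity(query, d) - (1 - lambda_) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```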
2.6 Re-rankers
Re-rankers reassess the relevance of the initially retrieved chunks and refine the set of documents passed to the generation step. Two variants are evaluated in this study: the Cohere Re-ranker and an LLM-based Re-ranker. See Figure 4 for an overview of the reranking step.
Figure 4: This flowchart outlines the reranking process in a RAG system. It illustrates how retrieved documents are further assessed for relevance using a reranking step, which refines the set of documents that will inform the generated response.
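As a concrete illustration of the reranking step in Figure 4, the sketch below re-scores each retrieved chunk and reorders the list. The score_with_llm helper is a placeholder for a real grading call (an LLM prompted to rate relevance, or a hosted re-ranker such as Cohere's); the 0-10 scale and prompt wording are assumptions.

```python
def score_with_llm(query: str, chunk: str) -> float:
    """Placeholder for an LLM call such as:
    'Rate from 0 to 10 how relevant this passage is to the question.'
    A lexical overlap fallback is used here so the sketch runs end to end."""
    tq, tc = set(query.lower().split()), set(chunk.lower().split())
    return 10.0 * len(tq & tc) / max(1, len(tq | tc))

def llm_rerank(query: str, retrieved: list[str], top_n: int = 3) -> list[str]:
    # Re-score every retrieved chunk and keep only the highest-rated ones.
    scored = sorted(retrieved, key=lambda c: score_with_llm(query, c), reverse=True)
    return scored[:top_n]
```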
3 Methods
3.1 Data
This study utilizes a tailored dataset derived from the AI ArXiv collection, accessible via Hugging
Face (James Calam, 2023). The dataset consists of 423 selected research papers centered around the
themes of AI and LLMs, sourced from arXiv. This selection offers a comprehensive foundation for
constructing a database to test the RAG techniques and creating a set of evaluation data to assess
their effectiveness.
3.1.3 Evaluation Data Preparation
The evaluation dataset comprises 107 question-answer (QA) pairs generated with the assistance of
GPT-4. The generation process was guided by specific criteria to ensure that the questions were
challenging, technically precise, and reflective of potential user inquiries sent to a RAG system. Each
QA pair was then reviewed by humans to validate its relevance and accuracy, ensuring that the
evaluation data accurately measures the RAG techniques’ performance in real-world applications.
The QA dataset is available in this paper’s associated GitHub repository ARAGOG. See Figure 5 for
an overview of the data preparation process.
Figure 5: The visualization of the AI ArXiv dataset preparation process. This diagram shows the selection of
papers for question-answer generation, the employment of the full dataset to provide ample noise for the RAG
system, and the chunking approaches used to process the documents for the vector database.
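The sketch below gives a rough idea of how such QA pairs can be generated programmatically; the prompt wording and the ask_llm helper are illustrative assumptions, not the exact GPT-4 prompt used for this dataset.

```python
import json

def ask_llm(prompt: str) -> str:
    """Stand-in for a GPT-4 call; returns a canned example so the sketch runs."""
    return json.dumps({"question": "What does HyDE embed for retrieval?",
                       "answer": "A hypothetical answer generated by an LLM."})

def generate_qa_pair(paper_excerpt: str) -> dict[str, str]:
    prompt = (
        "From the excerpt below, write one technically precise question a user might "
        "ask a RAG system, plus a short reference answer. Respond as JSON with keys "
        f"'question' and 'answer'.\n\nExcerpt:\n{paper_excerpt}"
    )
    return json.loads(ask_llm(prompt))

pair = generate_qa_pair("HyDE generates a hypothetical answer and embeds it ...")
print(pair["question"], "->", pair["answer"])
```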
3.3 Metrics
To evaluate the performance of various RAG techniques within this study, two primary metrics were
employed from the Tonic Validate package/platform: Retrieval Precision and Answer Similarity (Tonic
AI, 2023). These metrics were selected to evaluate both the retrieval process and the generative
capabilities of the LLMs used, with a primary focus on the precision of information retrieval.
Figure 6: This diagram illustrates the evaluation process in a RAG system, where each chunk’s relevance is scored,
contributing to the overall retrieval precision metric. The process highlights how individual chunks are evaluated
for their utility in responding to user queries.
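As a rough illustration of the two scores (the actual judging in this study is performed by Tonic Validate with an OpenAI model, not by these toy helpers): retrieval precision is the fraction of retrieved chunks judged relevant to the question, and answer similarity grades the generated answer against the reference on a 0 to 5 scale.

```python
def judge_relevant(question: str, chunk: str) -> bool:
    """Placeholder for the LLM judgment 'is this chunk relevant to the question?'."""
    q = set(question.lower().split())
    return len(q & set(chunk.lower().split())) >= 2

def retrieval_precision(question: str, retrieved_chunks: list[str]) -> float:
    # Fraction of retrieved chunks the judge marks as relevant.
    relevant = sum(judge_relevant(question, c) for c in retrieved_chunks)
    return relevant / max(1, len(retrieved_chunks))

def answer_similarity(answer: str, reference: str) -> float:
    """Placeholder for an LLM grade on a 0-5 scale comparing answer and reference."""
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return 5.0 * len(a & r) / max(1, len(a | r))
```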
3.4 LLM
For the experimental setup, GPT-3.5-turbo was selected because of its cost-effectiveness and ease of implementation. Tonic Validate requires OpenAI models for grading, and GPT-3.5-turbo was preferred over GPT-4, which might provide more precise grading but at a higher expense. It is important to note that the choice of the LLM used for generation itself was less critical for the main objective of evaluating retrieval precision, since answer similarity can be regarded as a supplementary metric.
4 Results
The study systematically evaluates a variety of advanced RAG techniques using metrics of Retrieval
Precision and Answer Similarity. A comparative analysis is presented through boxplots to visualize the
distribution of these metrics, followed by ANOVA and Tukey’s HSD tests to determine the statistical
significance of the differences observed.
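A minimal sketch of this statistical comparison is given below, assuming the per-question retrieval precision scores are collected in a long-format table with columns experiment and retrieval_precision (the column names and the toy numbers are illustrative). It uses SciPy's one-way ANOVA and statsmodels' Tukey HSD.

```python
import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Long-format results: one row per (technique, question) score; values are illustrative.
df = pd.DataFrame({
    "experiment": ["naive_rag"] * 3 + ["hyde"] * 3 + ["hyde_llm_rerank"] * 3,
    "retrieval_precision": [0.55, 0.60, 0.52, 0.70, 0.68, 0.74, 0.78, 0.73, 0.80],
})

# One-way ANOVA: do the techniques differ at all?
groups = [g["retrieval_precision"].values for _, g in df.groupby("experiment")]
print(f_oneway(*groups))

# Tukey's HSD: which pairs of techniques differ significantly?
tukey = pairwise_tukeyhsd(df["retrieval_precision"], df["experiment"], alpha=0.05)
print(tukey.summary())
```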
Figure 7: Boxplot of Retrieval Precision by Experiment. Each boxplot demonstrates the range and distribution of retrieval precision scores across different RAG techniques. Higher median values and tighter
interquartile ranges suggest better performance and consistency.
The analysis of answer similarity (Figure 8) presents intriguing patterns that both align with and
diverge from those observed in retrieval precision. For Classic Vector Database (VDB) techniques
and Document Summary Index, there is a notable positive correlation between retrieval precision and
answer similarity, suggesting that when relevant information is accurately retrieved, it can lead to
answers that more closely mirror reference responses.
In contrast, Sentence Window Retrieval displays a disparity between high retrieval precision and
lower answer similarity scores. This could indicate that while the technique is adept at identifying
relevant passages, it may not be translating this information into answers that are semantically parallel
to the reference, possibly due to the generation phase not fully leveraging the retrieved context.
Figure 8: Boxplot of Answer Similarity by Experiment. Each boxplot demonstrates the range and distribution of answer similarity scores across different RAG techniques. Higher median values and tighter
interquartile ranges suggest better performance and consistency.
4.2.1 Classic VDB
The Tukey post-hoc test results for the Classic VDB setup confirm that both HyDE and its combinations with Cohere Rerank and LLM Rerank significantly outperform the Naive RAG, aligning with the boxplot observations of higher retrieval precision. A significant improvement is also observed for LLM Rerank alone. However, Maximal Marginal Relevance (MMR) and Cohere Rerank do not show a significant improvement over Naive RAG, which is also reflected in their closer median precision scores in the boxplots. Interestingly, the Multi Query + Reciprocal technique, while statistically significant, presents a mean difference indicating lower performance compared to Naive RAG, contradicting the anticipated outcome and calling for additional scrutiny.
Table 1: Tukey’s HSD test results comparing RAG techniques to Naive RAG, all within the
Classic VDB framework
Next, the focus is on techniques which have shown statistically significant improvement over the
Naive RAG approach. The table below presents the results from Tukey’s post-hoc tests, contrasting
each of the high-performing techniques against each other.
Table 2: Tukey’s HSD test results of RAG techniques that offer significant improvement over
Naive RAG
The combination of HyDE and LLM Rerank emerged as the most potent in enhancing retrieval
precision within the Classic VDB framework, surpassing other techniques. However, this superior
performance comes with higher latency and cost implications due to the additional LLM calls required
for both reranking and hypothetical document embedding. A close second is HyDE alone, which shows no significant difference from the HyDE + LLM Rerank combination. Experiments including Cohere Rerank did not demonstrate the anticipated benefits.
Technique             Comparison                         Mean Diff.   P-adj    Reject Null
HyDE + LLM Rerank     Sentence Window + Cohere Rerank    0.1021       0.0000   True
Table 3: Tukey’s HSD test results for the best-performing Classic VDB technique against the worst-performing Sentence Window retrieval technique.
Table 4: Tukey’s HSD test results for Sentence Window retrieval enhancements
The Tukey’s HSD test identifies Sentence Window retrieval with LLM Rerank as the only variant offering a statistically significant improvement over the base Sentence Window technique.
Table 5: Tukey’s HSD test results comparing Document Summary Index variants
To conclude the analysis, a basic version of each vector database setup was compared against the others, i.e. Sentence Window, Naive RAG, and Document Summary Index with Cohere Rerank. Utilizing plain Document Summary without enhancements was not feasible for this analysis, as it aggregates multiple chunks into one summary, leading to results not directly comparable to other techniques that operate on different chunk quantities.
Table 6: Tukey’s HSD test results comparing the performance of Sentence Window retrieval and Document Summary
Index + Cohere Rerank against the baseline Naive RAG
The Tukey’s HSD test results establish the Sentence Window retrieval as the leading technique, surpassing the Document Summary Index in precision. Document Summary Index with Cohere Rerank
trails behind as a viable second, whereas the Classic VDB, in its standard form, demonstrates the
least retrieval precision among the evaluated techniques.
5 Limitations
• Model selection: We used GPT-3.5-turbo for evaluating responses due to the constraints of
Tonic Validate, which requires the use of OpenAI models. The choice of GPT-3.5-turbo, while
cost-effective, may not offer the same depth of analysis as more advanced models like GPT-4.
• Data and question scope: The study was conducted using a single dataset and a set of 107
questions, which may affect the generalizability of the findings across different LLM applications.
Expanding the variety of datasets and questions could potentially yield more comprehensive
insights.
• Chunking variability: While the use of multiple chunking strategies allowed for a comprehensive evaluation of different retrieval methods, it also highlighted the inherent challenges in
directly comparing their performance against the same metrics. Each retrieval method required
a distinct chunking approach tailored to its specific needs. For instance, the sentence window
retrieval method necessitated overlapping chunks of consecutive sentences, while the document
summary index used larger chunks to leverage the language model’s summarization capabilities
effectively. Consequently, the retrieval methods were evaluated on chunk types with varying degrees of context and information density, making it difficult to draw definitive conclusions about
their relative strengths and weaknesses. This limitation stems from the fundamental differences
in how these retrieval methods operate and the distinct chunking requirements they impose.
• Evaluation metrics: The lack of a clear consensus on the optimal metrics for evaluating RAG
systems means our chosen metrics—Retrieval Precision and Answer Similarity—are based on
conceptual alignment rather than empirical evidence of their efficacy. This highlights an area
for future research to solidify the evaluation framework for RAG systems.
• Technique Selection: The subset of RAG techniques evaluated, while selected based on current relevance and potential, is not exhaustive. Excluded techniques such as Step-back prompting (Dai et al., 2023), Auto-merging retrieval (Phaneendra, 2023), and Hybrid search (Akash,
2023) reflect the study’s scope limitation and the subjective nature of selection. Future research
should consider these and other emerging methods to broaden the understanding of RAG system
enhancements.
6 Conclusion
Our investigation into Retrieval-Augmented Generation (RAG) techniques has identified HyDE and
LLM reranking as notable enhancers of retrieval precision in LLMs. These approaches, however,
necessitate additional LLM queries, incurring greater latency and cost. Surprisingly, established
techniques like MMR and Cohere rerank did not demonstrate significant benefits, and Multi-query
was found to be less effective than baseline Naive RAG.
The results demonstrate the efficacy of the Sentence Window Retrieval technique in achieving high precision for retrieval tasks, although a discrepancy was observed between retrieval precision and answer similarity scores. Given its conceptual similarity to Sentence Window retrieval, we suggest that Auto-merging retrieval (Phaneendra, 2023) might offer comparable benefits, warranting future investigation. The Document Summary Index approach also exhibited satisfactory performance; however, it requires an upfront investment in generating summaries for each document in the corpus.
Due to constraints such as reliance on a single dataset, a limited question set, and the use of GPT-3.5-turbo for evaluation, the results may not fully capture the potential of more advanced models. Future studies
with broader datasets and higher-capability LLMs could provide more comprehensive insights. This
research contributes a foundational perspective to the field, encouraging subsequent works to refine,
validate, and expand upon our findings.
To facilitate this continuation of research and allow for the replication and extension of our work, we have made our experimental pipeline available through a publicly accessible GitHub repository (Predlico, 2024).
7 Future Work
• Knowledge Graph RAG: Integrating Knowledge Graphs (KGs) with RAG systems represents
a promising direction for enhancing retrieval precision and contextual relevance. KGs offer a
well-organized framework of relationship-rich data that could refine the retrieval phase of RAG
systems (Bratanic, 2023). Although setting up such systems is resource-demanding, the potential
for significantly improved retrieval processes justifies further investigation.
• Unfrozen RAG systems: Unlike the static application of RAG systems in our study, future
investigations can benefit from adapting RAG components, including embedding models and
rerankers, directly to specific datasets (Gao et al., 2024; Kiela, 2024). This “unfrozen” approach
allows for fine-tuning on nuanced use-case data, potentially enhancing system specificity and
output quality. Exploring these adaptations could lead to more adaptable and effective RAG
systems tailored to diverse application needs.
• Experiment replication across diverse datasets: To ensure the robustness and general-
izability of our findings, it is imperative for future research to replicate our experiments using
a variety of datasets. Conducting these experiments across multiple datasets is important to
verify the applicability of our results and to identify any context-specific adjustments needed.
• Auto-RAG: The idea of automatically optimizing RAG systems, akin to Auto-ML’s approach
in traditional machine learning, presents a significant opportunity for future exploration. Currently, selecting the optimal configuration of RAG components — e.g., chunking strategies, window sizes, and parameters within rerankers — relies on manual experimentation and intuition. An automated system could systematically explore a vast space of RAG configurations and select the best-performing setup (Markr.AI, 2024); a toy sketch of such a search follows this list.
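As a toy sketch of what such automation could look like, the loop below scores a small grid of hypothetical RAG configurations against an evaluation routine; the configuration fields and the evaluate_config stub are assumptions, not part of any existing Auto-RAG tool.

```python
from itertools import product

def evaluate_config(chunk_size: int, window_size: int, reranker: str) -> float:
    """Placeholder: build the RAG pipeline with these settings and return mean retrieval precision."""
    return 0.5 + 0.01 * window_size - 0.0001 * abs(chunk_size - 512) + (0.05 if reranker == "llm" else 0.0)

search_space = {
    "chunk_size": [256, 512, 1024],
    "window_size": [1, 2, 3],
    "reranker": ["none", "cohere", "llm"],
}

# Exhaustively evaluate every configuration and keep the best one.
best_score, best_config = float("-inf"), None
for chunk_size, window_size, reranker in product(*search_space.values()):
    score = evaluate_config(chunk_size, window_size, reranker)
    if score > best_score:
        best_score, best_config = score, (chunk_size, window_size, reranker)

print("best configuration:", dict(zip(search_space, best_config)), "score:", round(best_score, 3))
```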
References
Akash. Hybrid search: Optimizing RAG implementation. https://medium.com/@csakash03/hybrid-search-is-a-method-to-optimize-rag-implementation-98d9d0911341, 2023. Accessed: 2024-04-01.
L. Gao, X. Ma, J. Lin, and J. Callan. Precise zero-shot dense retrieval without relevance labels, 2022.
Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, Q. Guo, M. Wang, and H. Wang.
Retrieval-augmented generation for large language models: A survey, 2024.
James Calam. AI ArXiv dataset. https://huggingface.co/datasets/jamescalam/ai-arxiv, 2023. Accessed: 2024-03-24.
Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang, J. Callan, and G. Neubig. Active
retrieval augmented generation, 2023.
D. Kiela. Stanford CS25: V3 I Retrieval Augmented Language Models. https://www.youtube.com/watch?v=mE7IDf2SmJg, 2024. Accessed: 2024-03-24.