
E2E-AFG: An End-to-End Model with Adaptive Filtering for Retrieval-Augmented Generation

Yun Jiang, Zilong Xie, Wei Zhang, Yun Fang and Shuai Pan*

Advanced Institute of Information Technology, Peking University, China


[email protected]

Abstract. Retrieval-augmented generation methods often neglect the quality of content retrieved from external knowledge bases, resulting in irrelevant information or potential misinformation that negatively affects the generation results of large language models. In this paper, we propose an end-to-end model with adaptive filtering for retrieval-augmented generation (E2E-AFG), which integrates answer existence judgment and text generation into a single end-to-end framework. This enables the model to focus more effectively on relevant content while reducing the influence of irrelevant information and generating accurate answers. We evaluate E2E-AFG on six representative knowledge-intensive language datasets, and the results show that it consistently outperforms baseline models across all tasks, demonstrating the effectiveness and robustness of the proposed approach.¹

Keywords: Retrieval Augmented Generation, Large Language Model, Question Answering, Multitask Learning.

1 Introduction

The remarkable natural language understanding and generation capabilities demonstrated by Large Language Models (LLMs) have led to their success in knowledge-intensive tasks, such as open-domain question answering and fact verification [4, 28, 1]. However, LLMs are prone to generating hallucinatory content that contains factual errors in the absence of supporting documentation. To address this issue, [21] proposed the retrieval-augmented generation (RAG) method, which retrieves relevant context from external knowledge bases to provide additional evidence for LLMs when answering input queries. Other approaches [31] directly utilize a pre-trained LLM to generate a relatively accurate pseudo-answer as an extended document for the input query. However, these methods often fail to adequately consider the quality of the retrieved or generated content, which may include distracting irrelevant content or erroneous information, leading LLMs to still produce hallucinatory answers.

* Corresponding Author
1 Our code is available at: https://github.com/XieZilongAI/E2E-AFG

Earlier studies [32, 23] attempted to select more relevant content by re-ranking the retrieved contexts, but the re-ranked contexts may still contain irrelevant information. [7] achieved automatic decontextualization of sentences by training a coreference resolution model, although this requires extensive manual annotation effort. Recent research, such as HyDE [12], employs unsupervised contrastive learning in which an encoder's dense bottleneck acts as a lossy compressor to filter out hallucinatory content. FILCO [33] trains a filtering model to remove irrelevant contexts, improving the quality of the context provided to the generation model. However, these methods typically involve multiple independent models and complex preprocessing operations, which not only increase system complexity but also elevate training and inference costs.
To address the aforementioned issues, we propose an End-to-End Model with Adaptive Filtering for Retrieval-Augmented Generation (E2E-AFG), which integrates classification and generation tasks into an end-to-end framework, allowing the model to simultaneously learn context filtering and answer generation. Specifically, we first employ a pre-trained large language model to generate a pseudo-answer related to the input query, enriching the available content. We then apply three context filtering strategies to obtain silver classification labels. The end-to-end model is built on the generation model and augmented with a classification module that uses a cross-attention mechanism to predict whether sentences in the context contain answers, enabling the model to answer the input query based on its judgment of the context.
We conducted experiments on six knowledge-intensive language datasets, covering three tasks: question answering (Natural Questions [19], TriviaQA [17], HotpotQA [36], ELI5 [10]), fact verification (FEVER [30]), and knowledge-based dialogue generation (Wizard of Wikipedia [9]). Compared to baseline models, our approach achieved state-of-the-art results across all six datasets, with improvements ranging from +0.13 to +1.83 points, validating the effectiveness of the proposed method.

2 Related Work

Retrieval-Augmented Generation. Early research methods, such as REALM [13] and RAG [21], laid the foundation for the field of retrieval-augmented generation (RAG) by combining retrievers with large language models (LLMs). Subsequently, RETRO [3] introduced the concept of training language models on fixed retrievers, while Atlas [16] further explored dedicated loss functions and training strategies, achieving improved results, particularly in few-shot learning scenarios. Recent studies have shifted towards optimizing the retrieval component while leveraging pre-trained, fixed LLMs. For instance, RePlug [34] and In-context RALM [29] demonstrated that fine-tuning the retrieval module can surpass end-to-end trained models in certain tasks, such as question answering. In contrast, SAIL [22] integrated real search engines with information denoising processes, aiming to enhance the relevance and accuracy of retrieval results, showcasing potential in broader application contexts. Our work seeks to enhance attention to reliable information by performing answer existence judgment on the retrieved passages prior to generation, thereby reducing the interference caused by irrelevant information.

Fig. 1: The overall architecture diagram of the proposed method.

Retrieval Content Filtering Strategies. In knowledge-intensive tasks, post-processing of retrieved content is crucial for enhancing system performance, with common practices including re-ranking and context filtering. In early studies, [32] and [20] explored passage re-ranking methods based on BiLSTM, while [23] and [27] employed BERT-based cross-encoders to achieve more precise passage re-ranking. Subsequently, [26] proposed a method for re-ranking passages by updating the query, and [15] directly applied heuristic re-ranking to the answers. In recent years, several context filtering strategies have been introduced. For example, FILCO [33] trains a context filtering model to perform fine-grained sentence-level filtering on the retrieved passages. Multi-Meta-RAG [25] utilizes a specific set of domain queries and formats to select the most relevant documents through database filtering. In contrast, our approach constructs a single end-to-end model that can simultaneously perform context filtering and answer generation.
Multi-task Learning. Multi-task learning (MTL) enhances overall model performance by jointly learning multiple tasks, allowing it to capture the correlations and shared features among tasks [5]. In natural language processing applications, MTL not only leverages task relevance to mitigate issues of data scarcity and model overfitting but also improves the generalization capability of the model. For instance, [6] proposed a hierarchical multi-task learning approach that enhances the model's ability to capture inter-task dependencies. ROM [11] introduced a generalizable Retrieval Optimized Multi-task framework that reduces the model's parameters. Our method applies MTL to the retrieval-augmented generation domain by jointly learning binary classification and generation tasks, enabling the model to acquire context filtering and answer generation capabilities.

3 Method

Problem Statement. In knowledge-intensive tasks, each entry consists of an input query Q, a ground-truth answer A, and a set of passages $P = \{p_i\}_{i=1}^{K}$ retrieved from a database. We provide the generator with one or more passages along with a pre-generated pseudo-answer S to generate a response to the query Q. Specifically, in the question-answering tasks, Q and A are natural language questions and their corresponding ground-truth answers; in the fact verification tasks, Q is a statement and A ∈ {SUPPORTS, REFUTES} indicates the correctness of the statement; in the knowledge-based dialogue generation tasks, Q consists of a dialogue history, and A is a response that accurately continues the conversation.
Overview. The overall architecture of our proposed method is illustrated in Fig. 1. First, a pre-trained large language model generates a pseudo-answer S for the query Q. Next, the query Q, the retrieved set of passages P, and the pseudo-answer S are input into the E2E-AFG model, where both generation and binary classification tasks are performed. The generation task uses the generator E2Egen to produce an answer. The binary classification task employs E2EEncoder to obtain embeddings for the three inputs, which are then processed through cross-attention and a feedforward neural network to predict the category scores. Finally, the cross-entropy loss for both the generation and binary classification tasks is computed. This approach updates the internal parameters of the shared E2EEncoder, implicitly learning a filtering capability that prioritizes sentences more likely to contain answers while reducing interference from irrelevant sentences.

Fig. 2: Three kinds of LLM prompts and their generated pseudo-answer examples.

3.1 Generating Pseudo-Answers

In knowledge-intensive tasks, models typically rely on passages retrieved from databases to generate answers. However, these passages often do not perfectly match the questions, leaving the model without reliable evidence for generating accurate answers. To mitigate this limitation, we introduce a strategy that utilizes a pre-trained large language model to generate pseudo-answers, which serve as an additional reference to assist the model in producing more accurate responses. To explore how to generate high-quality pseudo-answers, we devised several different prompts, as illustrated in Fig. 2. The first directly generates concise answers, which may lead to the generation of hallucinatory content; the second encourages the model to make its best guess for the correct answer when it is uncertain; and the third, a structured prompt, instructs the model to also provide the reasoning behind the derived answer.
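To make the three prompt styles concrete, the sketch below gives minimal Python templates in the spirit of Fig. 2. The exact wording of the paper's prompts is not reproduced here, so these template strings and the helper function are illustrative assumptions only.

```python
# Illustrative prompt templates for pseudo-answer generation (assumptions,
# not the paper's exact prompts from Fig. 2).

PROMPT_DIRECT = (
    "Answer the following question concisely.\n"
    "Question: {query}\nAnswer:"
)

PROMPT_BEST_GUESS = (
    "Answer the following question concisely. If you are uncertain, "
    "give your best guess instead of refusing.\n"
    "Question: {query}\nAnswer:"
)

PROMPT_STRUCTURED = (
    "Answer the following question. First give your best guess for the "
    "answer, then briefly explain the reasoning behind it.\n"
    "Question: {query}\nAnswer and reasoning:"
)

def build_prompt(query: str, style: str = "structured") -> str:
    """Return the pseudo-answer generation prompt for a query."""
    templates = {
        "direct": PROMPT_DIRECT,
        "best_guess": PROMPT_BEST_GUESS,
        "structured": PROMPT_STRUCTURED,
    }
    return templates[style].format(query=query)
```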

3.2 Obtaining Silver Classification Labels

To determine whether the retrieved passage set P and the generated pseudo-answer S contain answers, we introduce three context filtering methods based on [33]: (i) String Inclusion (STRINC), which checks whether the context directly contains the ground-truth answer; (ii) Lexical Overlap (LEXICAL), which measures the word overlap between the context and the ground-truth answer; and (iii) Conditional Cross-Mutual Information (CXMI), which assesses the likelihood of the generator producing the ground-truth answer given the context. For a specific task, we select the most appropriate filtering method to obtain silver classification labels. For instance, in question-answering tasks, we may use STRINC to evaluate whether each passage or pseudo-answer contains the ground-truth answer. In contrast, for fact verification tasks, where the ground-truth answer resembles a boolean value and cannot be assessed using the first two methods, we employ CXMI to compute the corresponding probability and set a threshold t_0 to derive the silver classification label. We concatenate the obtained labels with the ground-truth answer A to facilitate loss calculation.
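The following is a minimal sketch of how the STRINC and LEXICAL labeling rules could be applied to passages and pseudo-answers. The function names, the simple word-overlap measure used for LEXICAL, and the example threshold are assumptions; CXMI is omitted because it requires generator log-probabilities.

```python
# Sketch of STRINC and LEXICAL silver-label rules (assumed implementations).

def strinc_label(context: str, answer: str) -> int:
    """1 if the context literally contains the ground-truth answer string."""
    return int(answer.lower() in context.lower())

def lexical_label(context: str, answer: str, threshold: float = 0.5) -> int:
    """1 if a sufficient fraction of answer words appears in the context.
    A simple word-overlap stand-in; the exact lexical measure follows [33]."""
    ctx = set(context.lower().split())
    ans = answer.lower().split()
    overlap = sum(w in ctx for w in ans) / len(ans)
    return int(overlap >= threshold)

# Example: label each retrieved passage and the pseudo-answer for one query.
passages = ["Paris is the capital of France.", "The Seine flows through the city."]
pseudo_answer = "The capital of France is Paris."
labels = [strinc_label(p, "Paris") for p in passages] + [strinc_label(pseudo_answer, "Paris")]
print(labels)  # [1, 0, 1]
```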

3.3 Generation Task

For each training sample (Q, A, P, S), we first insert a special character between the different fields to ensure they can be distinguished after encoding with E2EEncoder. We then input the encoded query Q_embs, the retrieved passage set P_embs, and the pseudo-answer S_embs into E2Egen to produce the output answer O. The sequence probability is calculated as follows:

$$P_o(O \mid Q, P, S) = \prod_{i=1}^{L} p(o_i \mid O_{<i}, Q, P, S) \tag{1}$$

where $o_i$ represents the i-th token of the generated output O, and L is the final output length. To simplify the notation, we continue to use Q, P, S in place of Q_embs, P_embs, and S_embs respectively in the equations above and in the subsequent content. The loss function for the generation task is calculated as follows:

$$L_{\text{gen}} = -\sum_{i=1}^{L} \log p(o_i^{gt} \mid O_{<i}, Q, P, S) \tag{2}$$

where $o_i^{gt}$ denotes the i-th token of the ground-truth answer A.
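A compact sketch of Eq. (2) in PyTorch-style code is shown below; the tensor shapes and the teacher-forcing setup are assumptions, not the released implementation.

```python
# Sketch of the generation loss in Eq. (2): token-level negative log-likelihood
# of the ground-truth answer given the encoded (Q, P, S) inputs.
import torch
import torch.nn.functional as F

def generation_loss(logits: torch.Tensor, target_ids: torch.Tensor,
                    pad_id: int = 0) -> torch.Tensor:
    """
    logits:     (batch, seq_len, vocab) decoder outputs under teacher forcing
    target_ids: (batch, seq_len) ground-truth answer tokens
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # flatten token positions
        target_ids.reshape(-1),                # align ground-truth targets
        ignore_index=pad_id,                   # skip padding positions
    )
```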

3.4 Classification Task

To enhance the model's context filtering capability, we introduce a classification module specifically designed to determine whether the input context contains the answer. The generator and the classification module share the same encoder E2EEncoder, allowing the classification module to indirectly improve the model's context filtering capabilities by influencing the encoder's parameters.
The classification module comprises two main components: a cross-attention layer and a feedforward neural network. First, the encoded query Q, each retrieved passage p_i, and the pseudo-answer S are fed into the cross-attention layer. In this layer, the model computes the attention weights between Q and p_i, as well as between Q and S, generating cross-attention representations:

$$\alpha_i = \mathrm{softmax}\!\left(\frac{Q p_i^{\mathsf{T}}}{\sqrt{d_k}}\right) p_i \tag{3}$$

$$\beta = \mathrm{softmax}\!\left(\frac{Q S^{\mathsf{T}}}{\sqrt{d_k}}\right) S \tag{4}$$

where d_k is the dimensionality of the encoder's feature channels.


Next, the generated cross-attention representations are fed into a feedforward neural network to predict two binary classification results:

$$\varepsilon_i = \mathrm{FFN}(\alpha_i), \qquad \xi = \mathrm{FFN}(\beta) \tag{5}$$

where FFN denotes a two-layer feedforward neural network. The loss function for the classification task is defined as the cross-entropy:

$$L_{\text{cls}} = -\sum_{i=1}^{K} \log \varepsilon_i^{gt} - \log \xi^{gt} \tag{6}$$

Here, $\varepsilon_i^{gt}$ and $\xi^{gt}$ represent the predicted probability values corresponding to the ground-truth classes of each passage p_i and the pseudo-answer S, respectively, and K is the number of retrieved passages.
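A minimal PyTorch sketch of this classification module (Eqs. 3–6) follows; the mean-pooling over query positions, the module name, and the FFN hidden size are assumptions made for illustration.

```python
# Sketch of the answer-existence classification head: scaled dot-product
# cross-attention between the query and each passage / pseudo-answer,
# followed by a two-layer FFN (hypothetical hyperparameters).
import math
import torch
import torch.nn as nn

class AnswerExistenceHead(nn.Module):
    def __init__(self, d_k: int = 2048, hidden: int = 512):
        super().__init__()
        self.d_k = d_k
        self.ffn = nn.Sequential(
            nn.Linear(d_k, hidden), nn.ReLU(), nn.Linear(hidden, 2)
        )

    def cross_attend(self, q: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # q: (Lq, d_k), ctx: (Lc, d_k); Eqs. (3)/(4): softmax(Q ctx^T / sqrt(d_k)) ctx
        attn = torch.softmax(q @ ctx.T / math.sqrt(self.d_k), dim=-1)
        return (attn @ ctx).mean(dim=0)  # pool over query tokens (assumption)

    def forward(self, q_embs, passage_embs, pseudo_embs):
        # One score pair per passage (epsilon_i) and one for the pseudo-answer (xi).
        eps = torch.stack([self.ffn(self.cross_attend(q_embs, p)) for p in passage_embs])
        xi = self.ffn(self.cross_attend(q_embs, pseudo_embs))
        return eps, xi
```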

3.5 Model Training

During the training process, we simultaneously optimize the loss functions of both the generator and the classification module. The overall loss function is defined as a weighted sum of the two losses:

$$L_{\text{TOTAL}} = (1-\sigma)\,L_{\text{gen}} + \sigma\,L_{\text{cls}} \tag{7}$$

where L_gen is the loss from the generator, L_cls is the loss from the classification module, and σ is the weighting factor.
To further enhance the training efficiency and performance of the model, we employ Low-Rank Adaptation (LoRA) [14], which adds low-rank matrices to the weight matrices of the pre-trained model for fine-tuning. This approach reduces computational overhead and accelerates the training process.
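The sketch below illustrates the weighted objective of Eq. (7) together with LoRA-style fine-tuning via the peft library; the LoRA rank, alpha, and target modules are assumptions, since the paper does not report these hyperparameters.

```python
# Sketch of the joint training objective in Eq. (7) with LoRA adapters.
# LoRA settings below are illustrative assumptions, not the paper's values.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSeq2SeqLM

sigma = 0.2  # classification loss weight reported in Sec. 4.2

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q", "v"],
                      task_type="SEQ_2_SEQ_LM")
model = get_peft_model(base_model, lora_cfg)  # only low-rank adapters are trained

def total_loss(loss_gen, loss_cls):
    """Weighted sum of generation and classification losses (Eq. 7)."""
    return (1 - sigma) * loss_gen + sigma * loss_cls
```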

4 Experiments

4.1 Datasets and Evaluation Metrics

As shown in Table 1, we conducted experiments on six retrieval-augmented knowledge-intensive language datasets, which use data constructed from Wikipedia articles as supporting documents. Each dataset is divided into a training set (train), a development set (dev), and a test set (test). We report the following metrics. Exact Match (EM) measures the percentage of predictions that exactly match the ground truth. Unigram F1 (F1) evaluates the harmonic mean of precision and recall based on individual word overlap between the prediction and the ground truth. Accuracy (Acc) represents the proportion of correct predictions out of the total number of predictions. Top-20 recall [2] measures whether the answer string is included among the top 20 passages in the development set (applicable to Natural Questions [19] and TriviaQA-unfiltered [17]), or whether it originates from the relevant annotated source articles in the KILT benchmark [24] (applicable to FEVER [30] and Wizard of Wikipedia [9]).

Table 1: Statistics and evaluation metrics for the six datasets.

Dataset                # train    # dev    # test   Evaluation metric   Top-20 recall (%)
Natural Questions       79,168    8,757     3,610   EM                  82.1
TriviaQA-unfiltered     78,785    8,837    11,313   EM                  75.2
FEVER                  104,966   10,444    10,100   Acc                 98.1
HotpotQA                88,924    5,947     5,631   F1                  63.5
ELI5                   273,036    3,098     2,367   F1                  56.5
Wizard of Wikipedia     63,734    3,054     2,944   F1                  96.2
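For reference, a minimal sketch of the EM and unigram F1 computations described above is given below; the normalization (lowercasing, whitespace tokenization) is an assumption, as the exact normalization rules are not specified here.

```python
# Sketch of the EM and unigram F1 metrics (assumed normalization).
from collections import Counter

def normalize(text: str) -> list[str]:
    return text.lower().split()

def exact_match(prediction: str, ground_truth: str) -> int:
    """1 if the normalized prediction exactly matches the ground truth."""
    return int(normalize(prediction) == normalize(ground_truth))

def unigram_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of word-level precision and recall."""
    pred, gold = normalize(prediction), normalize(ground_truth)
    common = sum((Counter(pred) & Counter(gold)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                        # 1
print(round(unigram_f1("the city of Paris", "Paris"), 2))   # 0.4
```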

Open-Domain Question Answering: The Natural Questions (NQ) and TriviaQA-unfiltered (TQA) datasets consist of questions, answers, and relevant passages from Wikipedia, using short answers limited to five tokens. Fact Verification: The FEVER dataset contains paraphrased claims from Wikipedia, labeled as "SUPPORTS" or "REFUTES" based on their alignment with the original content. Multi-Hop Question Answering: The HotpotQA dataset features complex questions requiring reasoning over multiple passages to find answers, with 113K question-answer pairs derived from Wikipedia. Long-Form Question Answering: The ELI5 dataset includes 270K Reddit posts requiring detailed, multi-sentence answers to open-ended questions. Knowledge-Based Dialogue Generation: The Wizard of Wikipedia (WoW) dataset requires generating dialogue responses based on a history of turns, utilizing information from Wikipedia articles.

4.2 Implementation Details

We loaded the model checkpoints from HuggingFace Transformers [35], using FLAN-T5-xl [8] as our backbone model architecture. We employed prompt 3 and the Llama-3 model to generate pseudo-answers, limiting their generation length to no more than 200 tokens. For the queries in each dataset, we utilized the Dense Passage Retriever (DPR) [18] to extract the top 5 most relevant passages from Wikipedia. To obtain silver classification labels, we adopted the optimized settings from FILCO, using STRINC for NQ and TQA, LEXICAL for WoW, and CXMI for FEVER, HotpotQA, and ELI5, with the threshold t_0 set to 0.5.
For the generator E2Egen, we allowed a maximum input sequence length of 512 tokens during both training and inference. We generated up to 64 tokens for open-domain question answering, multi-hop question answering, fact verification, and dialogue generation tasks, and up to 256 tokens for long-form question answering. We used greedy decoding to produce the final answers. Regarding model parameters, we set the encoder's feature channel dimension d_k to 2048, trained for 3 epochs with a learning rate of 5e-5 and a batch size of 8, and set the weight factor σ to 0.2.
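The generation setup described above can be sketched with the HuggingFace Transformers API as follows; the field separator used to join the query, passages, and pseudo-answer is a hypothetical placeholder, not the exact format used in the released code.

```python
# Minimal sketch of the generation setup: FLAN-T5-xl backbone, 512-token
# inputs, greedy decoding, 64 new tokens for short-answer tasks.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

def generate_answer(query, passages, pseudo_answer, max_new_tokens=64):
    # Join fields with a hypothetical separator token (assumption).
    text = " <sep> ".join([query, *passages, pseudo_answer])
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```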

4.3 Baseline Methods

In this section, we introduce three baseline methods: FULL [21], HyDE [12], and FILCO [33], along with the proposed E2E-AFG and SILVER configurations. To ensure a fair comparison, we employed the same backbone model architecture across all methods as that used in our proposed E2E-AFG.
FULL: A common approach in retrieval-augmented generation where all passages, including pseudo-answers, are input into the generation model with the query.
HyDE: Filters passages through a dense bottleneck using unsupervised contrastive learning, encoding them before inputting them into the generation model.
FILCO: Uses a trained model to filter sentences within passages, passing only the selected sentences to the generation model.
E2E-AFG: Our end-to-end model, which assesses the existence of answers in the input passages before feeding all passages into the model for answer generation.
SILVER: This configuration inputs only those passages labeled as containing an answer, testing the upper-bound performance of E2E-AFG.

Table 2: Comparison with baseline methods using top-1 retrieved passages.

Method     NQ      TQA     FEVER   HotpotQA   ELI5    WoW
FULL       41.64   60.90   88.32   59.58      67.50   65.73
HyDE       43.37   62.28   90.27   60.62      71.38   67.60
FILCO      46.65   64.33   94.46   62.71      74.99   70.12
E2E-AFG    48.48   65.99   95.45   64.39      75.12   71.47
SILVER     51.77   68.73   96.64   65.50      77.89   72.68

Table 3: The impact of different modules on the overall performance of E2E-AFG.


Method NQ FEVER WoW
Metric EM Acc F1
Ours 48.48 95.45 71.47
- pseudo answer 44.76 92.63 68.35
- cross attention layer 43.60 91.02 67.81
- classification module 40.03 87.52 65.12

Table 4: The recall rates (%) of pseudo-answers generated by different prompts.

Dataset                Prompt 1   Prompt 2   Prompt 3
Natural Questions      40.3       45.6       46.8
TriviaQA-unfiltered    51.0       57.4       57.2
FEVER                  62.8       63.7       65.3
HotpotQA               12.5       15.6       16.6
ELI5                    9.3       11.9       13.4
Wizard of Wikipedia    28.7       30.2       30.5

Table 5: The impact of different top-K retrieved passages on the generated results.

Method        NQ                        FEVER                     WoW
            top-1   top-3   top-5     top-1   top-3   top-5     top-1   top-3   top-5
FULL        41.64   50.84   52.22     88.32   88.26   87.34     65.73   65.86   64.34
HyDE        43.37   52.91   58.77     90.27   91.69   91.82     67.60   68.07   68.15
FILCO       46.65   54.38   62.03     94.46   93.83   92.60     70.12   70.65   69.38
E2E-AFG     48.48   56.92   63.24     95.45   96.14   95.67     71.47   71.80   71.62

4.4 Comparison with Baseline Methods

Table 2 presents the experimental results of E2E-AFG across six datasets, demonstrating that our model outperforms the baseline models in all cases. Specifically, for the extractive question-answering tasks NQ and TQA, we achieved improvements of at least 1.83% and 1.56% in EM, respectively. This indicates that our model focuses more on credible passages and reduces attention to irrelevant information, thereby generating more accurate answers. In the fact verification task FEVER, we attained an accuracy increase of at least 1.09%. For the complex multi-hop question-answering task HotpotQA and the long-form question-answering task ELI5, we observed improvements of at least 1.68% and 0.13% in F1 score, respectively. We hypothesize that the relatively modest performance gain on ELI5 is due to the fact that it requires detailed, lengthy answers, while the generated pseudo-answers tend to be relatively brief, limiting the model's filtering capabilities. Additionally, in the dialogue generation task WoW, we improve the F1 score by at least 1.35%. Furthermore, the performance of E2E-AFG approaches the upper-bound performance of SILVER, indicating its exceptional capabilities in context filtering and text generation and allowing it to achieve near-optimal results without relying on specific annotations.

4.5 Ablation Studies

Table 3 illustrates the ablation studies conducted on E2E-AFG, assessing the contribution of key components to the overall performance by progressively removing them from the model. First, when the pseudo-answer generation module is removed, the generator relies solely on the retrieved passages, resulting in a significant decline in performance across the three tasks. Building on this, further removing the cross-attention layer in the classification module results in a slight additional decrease in performance. Without the cross-attention mechanism, the classification module no longer aligns the encoded query Q with the retrieved passages P and pseudo-answer S separately; instead, Q is concatenated with both representations, and the concatenated features are fed into the feedforward neural network to predict answer existence. Finally, when the classification module is completely removed, the model's performance drops sharply, as it loses its context filtering capability.
Table 4 demonstrates the impact of different prompts on pseudo-answer generation, revealing that the pseudo-answers generated with prompt 3 achieve the highest average recall rate, indicating that they are the most likely to support the generator in producing correct answers. While simpler prompts may also generate useful pseudo-answers, detailed and structured prompts help align the model's output more closely with the expected format, such as avoiding the generation of nonsensical text and alleviating issues related to hallucinatory content.
Table 5 shows the effect of different top-K retrieved passages on the generation results. We observed that aggregating multiple top-ranked passages significantly enhances the performance of extractive tasks. However, this improvement comes with a linear or quadratic increase in computational load. Furthermore, the performance on the FEVER and WoW datasets did not show substantial improvements and even declined for some methods. We believe this may be attributed to the lower content quality of the lower-ranked retrieved passages.
Fig. 3(a) illustrates the impact of the weight factor σ on model performance. When σ is around 0.2 to 0.3, the model achieves optimal performance. As σ increases further, the F1 scores across the three datasets begin to decline, with a notable drop when σ reaches 0.9. This indicates that in multi-task learning, the distribution of loss weights across tasks significantly affects model performance, necessitating careful tuning of the weight factors for specific tasks.

Fig. 3: (a) The impact of the weight factor 𝜎𝜎 on model performance. (b) Comparison of model
parameters for each method.

4.6 Further Analysis

Fig. 3(b) compares the number of model parameters for each method. Our proposed E2E-AFG method has fewer parameters than the other methods, particularly compared to the FILCO model, which has the most parameters. This indicates that, by integrating the filtering and generation models, our method achieves a smaller parameter count while maintaining strong performance.

5 Conclusion

The End-to-End Model with Adaptive Filtering (E2E-AFG) proposed in this paper effectively addresses the issue of the generator being distracted by irrelevant information retrieved during retrieval-augmented generation tasks. By integrating answer existence judgment with the generation task into a single end-to-end model, E2E-AFG achieves synchronous learning of context filtering and answer generation. Experimental results demonstrate that our model outperforms baseline models across six knowledge-intensive language datasets, with performance improvements ranging from +0.13 to +1.83 points. E2E-AFG not only enhances generation quality but also simplifies model complexity and reduces training costs. Future research could further optimize the model architecture and filtering strategies to explore its potential in various application scenarios.

Acknowledgments. This work was supported by the National Key Research and Development
Program of China under Grant 2022YFF0903302.

References
1. Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.B., Yu, J., Soricut, R., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
2. Asai, A., Gardner, M., Hajishirzi, H.: Evidentiality-guided generation for knowledge-intensive NLP tasks. In: ACL. pp. 2226–2243 (2022)
3. Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., et al.: Improving Language Models by Retrieving from Trillions of Tokens. In: ICML. pp. 2206–2240 (2022)
4. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., et al.: Language Models are Few-Shot Learners. NeurIPS 33 (2020)
5. Caruana, R.: Multitask learning. Machine Learning. pp. 41–75 (1997)
6. Chen, S., Zhang, Y., Yang, Q.: Multi-task learning in natural language processing: An overview. ACM Computing Surveys 56(12), 1–32 (2024)
7. Choi, E., Palomaki, J., Lamm, M., Kwiatkowski, T., Das, D., Collins, M.: Decontextualization: Making sentences stand-alone. In: ACL. pp. 447–461 (2021)
8. Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., et al.: Scaling instruction-finetuned language models. Journal of Machine Learning Research 25(70), 1–53 (2024)
9. Dinan, E., Roller, S., Shuster, K., Fan, A., Auli, M., Weston, J.: Wizard of Wikipedia: Knowledge-powered conversational agents. In: ICLR (2019)
10. Fan, A., Jernite, Y., Perez, E., Grangier, D., Weston, J., Auli, M.: ELI5: Long form question answering. In: ACL. pp. 3558–3567 (2019)
11. Fun, H., Gandhi, S., Ravi, S.: Efficient retrieval optimized multi-task learning. arXiv preprint arXiv:2104.10129 (2021)
12. Gao, L., Ma, X., Lin, J., Callan, J.: Precise zero-shot dense retrieval without relevance labels. In: ACL. pp. 1762–1777 (2023)
13. Guu, K., Lee, K., Tung, Z., Pasupat, P., Chang, M.: Retrieval Augmented Language Model Pre-Training. In: ICML. pp. 3929–3938 (2020)
14. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-Rank Adaptation of Large Language Models. In: ICLR (2021)
15. Iyer, S., Min, S., Mehdad, Y., Yih, W.T.: RECONSIDER: re-ranking using span-focused cross-attention for open domain question answering. In: ACL. pp. 1280–1287 (2020)
16. Izacard, G., Lewis, P., Lomeli, M., Hosseini, et al.: Atlas: Few-shot learning with retrieval augmented language models. Journal of Machine Learning Research 24(251), 1–43 (2023)
17. Joshi, M., Choi, E., Weld, D.S., Zettlemoyer, L.: TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In: ACL. pp. 1601–1611 (2017)
18. Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., Yih, W.T.: Dense passage retrieval for open-domain question answering. In: EMNLP. pp. 6769–6781 (2020)
19. Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., et al.: Natural questions: A benchmark for question answering research. In: ACL. pp. 452–466 (2019)
20. Lee, J., Yun, S., Kim, H., Ko, M., Kang, J.: Ranking passages for improving answer recall in open-domain question answering. In: ACL. pp. 565–569 (2018)
21. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., et al.: Retrieval-Augmented Generation for knowledge-intensive NLP tasks. NeurIPS 33, 9459–9474 (2020)
22. Luo, H., Chuang, Y.S., Gong, Y., Zhang, T., Kim, Y., Wu, X., Fox, D., Meng, H., Glass, J.: SAIL: Search-augmented instruction learning. arXiv preprint arXiv:2305.15225 (2023)
23. Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019)
24. Petroni, F., Piktus, A., Fan, A., Lewis, P., Yazdani, M., De Cao, N., Thorne, J., et al.: KILT: a benchmark for knowledge intensive language tasks. In: NAACL. pp. 2523–2544 (2021)
25. Poliakov, M., Shvai, N.: Multi-Meta-RAG: Improving RAG for Multi-Hop Queries using Database Filtering with LLM-Extracted Metadata. arXiv preprint arXiv:2406.13213 (2024)
26. Qi, P., Lee, H., Sido, O., et al.: Retrieve, rerank, read, then iterate: Answering open-domain questions of arbitrary complexity from text. arXiv preprint arXiv:2010.12527 (2020)
27. Qiao, Y., Xiong, C., Liu, Z., Liu, Z.: Understanding the behaviors of BERT in ranking. arXiv preprint arXiv:1904.07531 (2019)
28. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI blog (2019)
29. Ram, O., Levine, Y., Dalmedigos, I., Muhlgay, D., Shashua, A., Leyton-Brown, K., Shoham, Y.: In-context retrieval-augmented language models. In: ACL. pp. 1316–1331 (2023)
30. Thorne, J., Vlachos, A., Christodoulopoulos, C., Mittal, A.: FEVER: a large-scale dataset for fact extraction and VERification. In: NAACL. pp. 809–819 (2018)
31. Wang, L., Yang, N., Wei, F.: Query2doc: Query expansion with large language models. arXiv preprint arXiv:2303.07678 (2023)
32. Wang, S., Yu, M., Guo, X., Wang, Z., Klinger, T., Zhang, W., et al.: R3: Reinforced ranker-reader for open-domain question answering. In: AAAI (2018)
33. Wang, Z., Araki, J., Jiang, Z., Parvez, M.R., Neubig, G.: Learning to filter context for retrieval-augmented generation. arXiv preprint arXiv:2311.08377 (2023)
34. Weijia, S., Sewon, M., Michihiro, Y., Minjoon, S., Rich, J., Mike, L., et al.: REPLUG: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652 (2023)
35. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., et al.: Transformers: State-of-the-art natural language processing. In: EMNLP. pp. 38–45 (2020)
36. Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W.W., et al.: HotpotQA: A dataset for diverse, explainable multi-hop question answering. In: EMNLP. pp. 2369–2380 (2018)
