An End-to-End Model With Adaptive Filtering For Retrieval-Augmented Generation
Yun Jiang, Zilong Xie, Wei Zhang, Yun Fang and Shuai Pan*
* Corresponding Author
1 Our code is available at: https://github.com/XieZilongAI/E2E-AFG
1 Introduction
Earlier studies [32, 23] attempted to select more relevant content by re-ranking the retrieved contexts, but the results may still contain irrelevant information. [7] achieved automatic decontextualization of sentences by training a coreference resolution model, although this requires extensive manual annotation effort. Recent research, such as HyDE [12], employs unsupervised contrastive learning in which an encoder's dense bottleneck acts as a lossy compressor to filter out hallucinatory content. FILCO [33] trains a filtering model to remove irrelevant contexts, improving the quality of the context provided to the generation model. However, these methods typically involve multiple independent models and complex preprocessing operations, which not only increase system complexity but also elevate training and inference costs.
To address the aforementioned issues, we propose an End-to-End Model with Adaptive Filtering for Retrieval-Augmented Generation (E2E-AFG), which integrates the classification and generation tasks into a single end-to-end framework, allowing the model to learn context filtering and answer generation simultaneously. Specifically, we first employ a pre-trained large language model to generate a pseudo-answer related to the input query, enriching the available context. We then apply one of three context filtering strategies to obtain silver classification labels. The end-to-end model is built on the generation model, augmented with a classification module that uses a cross-attention mechanism to predict whether sentences in the context contain answers, enabling the model to answer the input query with an informed judgment of the context.
We conducted experiments on six knowledge-intensive language datasets, covering
three tasks: question answering (Natural Questions [19], TriviaQA [17], HotpotQA
[36], ELI5 [10]), fact verification (FEVER [30]), and knowledge-based dialogue gen-
eration (Wizard of Wikipedia [9]). Compared to baseline models, our approach
achieved state-of-the-art results across all six datasets, with improvements ranging from
+0.13 to +1.83 points, validating the effectiveness of the proposed method.
2 Related Work
Fig. 1: Overview of E2E-AFG. A pre-trained LLM generates a pseudo-answer S for the input query Q; one of three filtering strategies (STRINC, LEXICAL, CXMI) is chosen to produce silver classification labels for the retrieved passages P and for S. E2EEncoder encodes Q, P, and S into Q_embs, P_embs, and S_embs; a classification module (cross-attention producing α_i and β, followed by an FFN) predicts answer-existence probabilities ε_i and ξ, trained with Loss_cls against the silver labels, while E2E_gen produces the output answer O, trained with Loss_gen against the ground-truth answer A.
3 Method
Fig. 2: Three kinds of LLM prompts and their generated pseudo-answer examples.
To determine whether the retrieved passage set P and the generated pseudo-answer S contain answers, we introduce three context filtering methods based on [33]: (i) String Inclusion (STRINC): checking whether the context directly contains the ground truth answer; (ii) Lexical Overlap (LEXICAL): measuring the word overlap between the context and the ground truth answer; and (iii) Conditional Cross-Mutual Information (CXMI): assessing the likelihood of the generator producing the ground truth answer given the context. For a specific task, we select the most appropriate filtering method to obtain silver classification labels. For instance, in question-answering tasks, we may use STRINC to evaluate whether each passage or pseudo-answer contains the ground truth answer. In contrast, for fact verification tasks, where the ground truth answer resembles a boolean value and cannot be assessed using the first two methods, we employ CXMI to compute the corresponding probability and set a threshold t₀ to derive the silver classification label. We concatenate the obtained labels with the ground truth answer A to facilitate loss calculation.
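To make these strategies concrete, below is a minimal Python sketch of how the three labeling functions might look; the exact overlap measure for LEXICAL and the formulation of CXMI as a ratio of generator probabilities are assumptions on our part, not the paper's reference implementation.

```python
import math

def strinc(context: str, answer: str) -> bool:
    """STRINC: the context directly contains the ground-truth answer."""
    return answer.lower() in context.lower()

def lexical(context: str, answer: str, threshold: float = 0.5) -> bool:
    """LEXICAL: unigram F1 overlap between context and answer (assumed measure)."""
    ctx_tokens = context.lower().split()
    ans_tokens = answer.lower().split()
    common = sum(min(ctx_tokens.count(w), ans_tokens.count(w)) for w in set(ans_tokens))
    if common == 0:
        return False
    precision = common / len(ctx_tokens)
    recall = common / len(ans_tokens)
    return 2 * precision * recall / (precision + recall) >= threshold

def cxmi(logp_answer_with_context: float, logp_answer_alone: float,
         t0: float = 0.5) -> bool:
    """CXMI: how much more likely the generator is to produce the ground-truth
    answer when the context is present; both log-probabilities are assumed to
    come from scoring the answer with the generation model."""
    return math.exp(logp_answer_with_context - logp_answer_alone) >= t0
```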
For each training sample (Q, A, P, S), we first insert a special character between the different fields to ensure they can be distinguished after encoding with E2EEncoder. We then input the encoded query Q_embs, the retrieved passage set P_embs, and the pseudo-answer S_embs into E2E_gen to produce the output answer O. The sequence probability is calculated as follows:

P_o(O \mid Q, P, S) = \prod_{i=1}^{L} p(o_i \mid O_{<i}, Q, P, S) \quad (1)

where o_i represents the i-th token of the generated output O, and L is the final output length. To simplify the notation, we continue to use Q, P, S in place of Q_embs, P_embs,
and S_embs respectively in the equations above and in the subsequent content. The loss function for the generation task is calculated as follows:

L_{\mathrm{gen}} = -\sum_{i=1}^{L} \log p(o_i^{gt} \mid O_{<i}^{gt}, Q, P, S) \quad (2)

where o_i^{gt} denotes the i-th token of the ground truth answer A.
In the classification module, the encoded query is aligned with each retrieved passage and with the pseudo-answer through cross-attention:

\alpha_i = \mathrm{softmax}\left( \frac{Q P_i^{\top}}{\sqrt{d_k}} \right) P_i \quad (3)

\beta = \mathrm{softmax}\left( \frac{Q S^{\top}}{\sqrt{d_k}} \right) S \quad (4)

The attended representations are then mapped to answer-existence probabilities:

\varepsilon_i = \mathrm{FFN}(\alpha_i), \quad \xi = \mathrm{FFN}(\beta) \quad (5)

where FFN denotes a two-layer feedforward neural network. The loss function for the classification task is defined as the cross-entropy:

L_{\mathrm{cls}} = -\sum_{i=1}^{K} \log \varepsilon_i^{gt} - \log \xi^{gt} \quad (6)

Here, \varepsilon_i^{gt} and \xi^{gt} represent the predicted probability values corresponding to the ground truth classes of each passage p_i and the pseudo-answer S, respectively, while K is the number of retrieved passages.
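As an illustration, Eqs. (3)–(5) could be realized with a module along the following lines; this is a sketch under our own assumptions (mean-pooling of the attended tokens and a sigmoid output head), not the authors' released code.

```python
import torch
import torch.nn as nn

class AnswerExistenceClassifier(nn.Module):
    """Cross-attention between the encoded query and a context representation
    (a passage P_i or the pseudo-answer S), followed by a two-layer FFN."""

    def __init__(self, d_k: int):
        super().__init__()
        self.scale = d_k ** 0.5
        self.ffn = nn.Sequential(nn.Linear(d_k, d_k), nn.ReLU(), nn.Linear(d_k, 1))

    def forward(self, q: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # q: (batch, query_len, d_k); ctx: (batch, ctx_len, d_k)
        attn = torch.softmax(q @ ctx.transpose(-2, -1) / self.scale, dim=-1)
        attended = attn @ ctx                    # alpha_i or beta, Eqs. (3)/(4)
        pooled = attended.mean(dim=1)            # assumed pooling over query positions
        return torch.sigmoid(self.ffn(pooled)).squeeze(-1)  # epsilon_i or xi, Eq. (5)
```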
During the training process, we simultaneously optimize the loss functions of both the generator and the classification module. The overall loss function is defined as a weighted sum of the two losses:

L_{\mathrm{TOTAL}} = (1 - \sigma) L_{\mathrm{gen}} + \sigma L_{\mathrm{cls}} \quad (7)
where L_gen is the loss from the generator, L_cls is the loss from the classification module, and σ is the weighting factor.
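In training code, Eq. (7) amounts to a one-line combination of the two losses, for example:

```python
def total_loss(loss_gen, loss_cls, sigma: float = 0.2):
    """Eq. (7): weighted sum of generation and classification losses.
    sigma = 0.2 is the value reported in Section 4."""
    return (1 - sigma) * loss_gen + sigma * loss_cls
```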
To further enhance training efficiency and model performance, we employ the Low-Rank Adaptation (LoRA) [14] technique, which adds trainable low-rank matrices to the weight matrices of the pre-trained model for fine-tuning. This approach reduces computational overhead and accelerates the training process.
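For reference, a LoRA setup of this kind can be expressed with the HuggingFace peft library; the rank, scaling factor, and target modules below are illustrative choices, not the paper's reported configuration.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                        # rank of the low-rank update matrices (assumed)
    lora_alpha=16,              # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q", "v"],  # T5 attention query/value projections
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are updated
```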
4 Experiments
We loaded the model checkpoints from HuggingFace Transformers [35], using FLAN-
T5-xl [8] as our backbone model architecture. We employed prompt 3 and the Llama-
3 model to generate pseudo-answers, limiting their generation length to no more than
200 tokens. For the queries in each dataset, we utilized the Dense Passage Retriever
(DPR) [18] to extract the top 5 most relevant passages from Wikipedia. To obtain silver
classification labels, we adopted the optimized settings from FILCO, using STRINC for
NQ and TQA, LEXICAL for WoW, and CXMI for FEVER, HotpotQA, and ELI5, with a
threshold t₀ set to 0.5.
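A sketch of the pseudo-answer generation step is shown below; the model identifier and prompt wording are placeholders (the paper's prompt 3 is not reproduced here).

```python
from transformers import pipeline

# Hypothetical model id; the paper only specifies "Llama-3".
generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

def generate_pseudo_answer(query: str) -> str:
    # Placeholder prompt; the actual prompt 3 is more detailed and structured.
    prompt = f"Answer the following question concisely.\nQuestion: {query}\nAnswer:"
    output = generator(prompt, max_new_tokens=200, return_full_text=False)
    return output[0]["generated_text"].strip()
```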
For the generator E2Egen, we allowed a maximum input sequence length of 512 to-
kens during both training and inference. We generated up to 64 tokens for open-domain
question answering, multi-hop question answering, fact verification, and dialogue gen-
eration tasks, and up to 256 tokens for long-form question answering. We used greedy
decoding to produce the final answers. Regarding model parameters, we set the encoder's feature channel dimension d_k to 2048 and trained for 3 epochs with a learning rate of 5e−5 and a batch size of 8. The weight factor σ was set to 0.2.
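These inference settings map onto a standard HuggingFace generate call; a minimal sketch follows, with a hypothetical input format, since the paper only says a special character is inserted between fields.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

query = "who wrote the novel dracula"                      # example inputs
passages = ["Dracula is an 1897 novel by Bram Stoker."]    # top-K retrieved passages
pseudo_answer = "Bram Stoker wrote Dracula."

# Hypothetical field separator; the paper's exact special character is unspecified.
text = " </s> ".join([query] + passages + [pseudo_answer])
inputs = tok(text, truncation=True, max_length=512, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy decoding
print(tok.decode(output[0], skip_special_tokens=True))
```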
In this section, we introduce three baseline methods: FULL [21], HyDE [12], and
FILCO [33], along with the proposed E2E-AFG and SILVER configurations. To ensure
a fair comparison, we employed the same backbone model architecture across all meth-
ods as that used in our proposed E2E-AFG.
FULL: A common approach in retrieval-augmented generation where all passages,
including pseudo-answers, are input into the generation model with the query.
HyDE: Filters passages through a dense bottleneck using unsupervised contrastive
learning, encoding them before inputting into the generation model.
FILCO: Uses a trained model to filter sentences within passages, passing only the
selected sentences to the generation model.
E2E-AFG: Our end-to-end model, which first assesses whether answers exist in the input passages and then feeds all passages into the model for answer generation.
SILVER: This configuration inputs only those passages labeled as containing an an-
swer, testing the performance upper bound of E2E-AFG.
Table 4: The impact of different prompts on pseudo-answer generation (Recall, %).
Dataset                 Prompt1   Prompt2   Prompt3
Natural Questions        40.3      45.6      46.8
TriviaQA-unfiltered      51.0      57.4      57.2
FEVER                    62.8      63.7      65.3
HotpotQA                 12.5      15.6      16.6
ELI5                      9.3      11.9      13.4
Wizard of Wikipedia      28.7      30.2      30.5
Table 5: The impact of different top-K retrieved passages on the generated results.
Method          NQ                      FEVER                   WoW
           top-1  top-3  top-5     top-1  top-3  top-5     top-1  top-3  top-5
FULL       41.64  50.84  52.22     88.32  88.26  87.34     65.73  65.86  64.34
HyDE       43.37  52.91  58.77     90.27  91.69  91.82     67.60  68.07  68.15
FILCO      46.65  54.38  62.03     94.46  93.83  92.60     70.12  70.65  69.38
E2E-AFG    48.48  56.92  63.24     95.45  96.14  95.67     71.47  71.80  71.62
Table 2 presents the experimental results of E2E-AFG across six datasets, demonstrat-
ing that our model outperforms the baseline models in all cases. Specifically, for ex-
tractive question-answering tasks NQ and TQA, we achieved improvements of at least
1.83% and 1.56% in EM, respectively. This indicates that our model focuses more on
credible passages and reduces attention to irrelevant information, thereby generating
more accurate answers. In the fact verification task FEVER, we attained an accuracy
increase of at least 1.09%. For the complex multi-hop question-answering task Hot-
potQA and the long-form question-answering task ELI5, we observed improvements
of at least 1.68% and 0.13% in F1 score, respectively. We hypothesize that the relatively
modest performance gain on ELI5 may be due to the fact that it requires detailed,
lengthy answers, while the generated pseudo-answers tend to be relatively brief, limit-
ing the model’s filtering capabilities. Additionally, in the dialogue generation task
WoW, we improved the F1 score by at least 1.35%. Furthermore, the performance of
E2E-AFG approaches the upper bound performance of SILVER, indicating its excep-
tional capabilities in context filtering and text generation, allowing it to achieve near-
optimal results without relying on specific annotations.
Table 3 illustrates the ablation studies conducted on E2E-AFG, assessing the contribu-
tion of key components to the overall performance by progressively removing them
from the model. First, when the pseudo-answer generation module is removed, the gen-
erator relies solely on the retrieved passages, resulting in a significant decline in per-
formance across the three different tasks. Building on this, further removal of the cross-
attention layer in the classification module results in a slight decrease in performance.
Without the cross-attention mechanism, the classification module no longer aligns the encoded query Q with the retrieved passages P and pseudo-answer S separately. Instead, Q is concatenated with both representations, and the
concatenated features are fed into the feedforward neural network to predict answer
existence. Finally, when the classification module is completely removed, the model’s
performance drops sharply, as it loses its context filtering capability.
Table 4 demonstrates the impact of different prompts on pseudo-answer generation,
revealing that the pseudo-answers generated using prompt 3 achieve the highest aver-
age recall rate, indicating that they are most likely to support the generator in producing
correct answers. While simpler prompts may also generate useful pseudo-answers, de-
tailed and structured prompts help align the model’s output more closely with stand-
ards, such as avoiding the generation of nonsensical text and alleviating issues related
to hallucinatory content.
Table 5 shows the effect of different top-K retrieved passages on the generation re-
sults. We observed that aggregating multiple top-ranked passages significantly en-
hances the performance of extraction tasks. However, this improvement comes with a
linear or quadratic increase in computational load. Furthermore, the performance on the
FEVER and WoW datasets did not show substantial improvements and even declined
for some methods. We believe this may be attributed to the decreased content quality of
the lower-ranked retrieved passages.
Fig. 3(a) illustrates the impact of the weight factor σ on model performance. When σ is around 0.2 to 0.3, the model achieves optimal performance. As σ increases further, the F1 scores across the three datasets begin to decline, with a notable drop when σ reaches 0.9. This indicates that in multi-task learning, the distribution of loss weights
across different tasks significantly affects model performance, necessitating careful
tuning of weight factors for specific tasks.
Fig. 3: (a) The impact of the weight factor σ on model performance. (b) Comparison of model
parameters for each method.
Fig. 3(b) compares the model parameters for each method. It can be seen that our pro-
posed E2E-AFG method has fewer parameters than the other methods, particularly
when compared to the FILCO model, which has the most parameters. This indicates
that, by integrating the filtering and generation models, our method maintains strong performance with fewer parameters.
5 Conclusion
The End-to-End Model with Adaptive Filtering (E2E-AFG) proposed in this paper ef-
fectively addresses the issue of the generator being distracted by irrelevant information
retrieved during retrieval-augmented generation tasks. By integrating answer existence
judgment with the generation task into a single end-to-end model, E2E-AFG achieves
synchronous learning of context filtering and answer generation. Experimental results
demonstrate that our model outperforms baseline models across six knowledge-inten-
sive language datasets, with performance improvements ranging from +0.13 to +1.83
points. E2E-AFG not only enhances generation quality but also reduces system complexity and training costs. Future research could further optimize the model
architecture and filtering strategies to explore its potential in various application sce-
narios.
Acknowledgments. This work was supported by the National Key Research and Development
Program of China under Grant 2022YFF0903302.
References
1. Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.B., Yu, J., Soricut, R., et al.: Gemini: a family of
highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
2. Asai, A., Gardner, M., Hajishirzi, H.: Evidentiality-guided generation for knowledge-inten-
sive NLP tasks. In: ACL. pp. 2226–2243 (2022)
3. Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., et al.: Improving Language
Models by Retrieving from Trillions of Tokens. In: ICML. pp. 2206–2240 (2022)
4. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., et al.: Language
Models are Few-Shot Learners. NeurIPS 33 (2020)
5. Caruana, R.: Multitask learning. Machine learning. pp. 41–75 (1997)
6. Chen, S., Zhang, Y., Yang, Q.: Multi-task learning in natural language processing: An over-
view. ACM Computing Surveys 56(12), 1–32 (2024)
7. Choi, E., Palomaki, J., Lamm, M., Kwiatkowski, T., Das, D., Collins, M.: Decontextualization: Making sentences stand-alone. TACL 9, 447–461 (2021)
8. Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., et al.: Scaling instruction-
finetuned language models. Journal of Machine Learning Research 25(70), 1–53 (2024)
9. Dinan, E., Roller, S., Shuster, K., Fan, A., Auli, M., Weston, J.: Wizard of Wikipedia:
Knowledge-powered conversational agents. In: ICLR (2019)
10. Fan, A., Jernite, Y., Perez, E., Grangier, D., Weston, J., Auli, M.: ELI5: Long form question
answering. In: ACL. pp. 3558–3567 (2019)
11. Fun, H., Gandhi, S., Ravi, S.: Efficient retrieval optimized multi-task learning. arXiv pre-
print arXiv:2104.10129 (2021)
12. Gao, L., Ma, X., Lin, J., Callan, J.: Precise zero-shot dense retrieval without relevance la-
bels. In: ACL. pp. 1762–1777 (2023)
13. Guu, K., Lee, K., Tung, Z., Pasupat, P., Chang, M.: Retrieval Augmented Language Model
Pre-Training. In: ICML. pp. 3929–3938 (2020)
14. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA:
Low-Rank Adaptation of Large Language Models. In: ICLR (2021)
15. Iyer, S., Min, S., Mehdad, Y., Yih, W.T.: RECONSIDER: re-ranking using span-focused
cross-attention for open domain question answering. In: ACL. pp. 1280–1287 (2020)
16. Izacard, G., Lewis, P., Lomeli, M., Hosseini, et al.: Atlas: Few-shot learning with retrieval
augmented language models. Journal of Machine Learning Research 24(251), 1–43 (2023)
17. Joshi, M., Choi, E., Weld, D.S., Zettlemoyer, L.: TriviaQA: A large scale distantly super-
vised challenge dataset for reading comprehension. In: ACL. pp. 1601–1611 (2017)
18. Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., Yih, W.T.: Dense
passage retrieval for open-domain question answering. In: EMNLP. pp. 6769–6781 (2020)
19. Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., et al.: Natural questions: A benchmark for question answering research. TACL 7, 452–466 (2019)
20. Lee, J., Yun, S., Kim, H., Ko, M., Kang, J.: Ranking passages for improving answer recall
in open-domain question answering. In: ACL. pp. 565–569 (2018)
21. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., et al.: Retrieval-Aug-
mented Generation for knowledge-intensive NLP tasks. NeurIPS 33, 9459–9474 (2020)
22. Luo, H., Chuang, Y.S., Gong, Y., Zhang, T., Kim, Y., Wu, X., Fox, D., Meng, H., Glass, J.:
Sail: Search-augmented instruction learning. arXiv preprint arXiv:2305.15225 (2023)
23. Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085
(2019)
24. Petroni, F., Piktus, A., Fan, A., Lewis, P., Yazdani, M., De Cao, N., Thorne, J., et al.: KILT:
a benchmark for knowledge intensive language tasks. In: NAACL. pp. 2523–2544 (2021)
25. Poliakov, M., Shvai, N.: Multi-Meta-RAG: Improving RAG for Multi-Hop Queries using
Database Filtering with LLM-Extracted Metadata. arXiv preprint arXiv:2406.13213 (2024)
26. Qi, P., Lee, H., Sido, O., et al.: Retrieve, rerank, read, then iterate: Answering open-domain
questions of arbitrary complexity from text. arXiv preprint arXiv:2010.12527 (2020)
27. Qiao, Y., Xiong, C., Liu, Z., Liu, Z.: Understanding the behaviors of BERT in ranking. arXiv
preprint arXiv:1904.07531 (2019)
28. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are
unsupervised multitask learners. OpenAI blog (2019)
29. Ram, O., Levine, Y., Dalmedigos, I., Muhlgay, D., Shashua, A., Leyton-Brown, K., Sho-
ham, Y.: In-context retrieval-augmented language models. In: ACL. pp. 1316–1331 (2023)
30. Thorne, J., Vlachos, A., Christodoulopoulos, C., Mittal, A.: FEVER: a large-scale dataset
for fact extraction and VERification. In: NAACL. pp. 809–819 (2018)
31. Wang, L., Yang, N., Wei, F.: Query2doc: Query expansion with large language mod-
els. arXiv preprint arXiv:2303.07678 (2023)
32. Wang, S., Yu, M., Guo, X., Wang, Z., Klinger, T., Zhang, W., et al.: R3: Reinforced ranker-
reader for open-domain question answering. In: AAAI (2018)
33. Wang, Z., Araki, J., Jiang, Z., Parvez, M.R., Neubig, G.: Learning to filter context for re-
trieval-augmented generation. arXiv preprint arXiv:2311.08377 (2023)
34. Shi, W., Min, S., Yasunaga, M., Seo, M., James, R., Lewis, M., et al.: REPLUG: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652 (2023)
35. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., et al.: Transformers:
State-of-the-art natural language processing. In: EMNLP. pp. 38–45 (2020)
36. Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W.W., et al.: HotpotQA: A dataset for di-
verse, explainable multi-hop question answering. In: EMNLP. pp. 2369–2380 (2018)