A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models

Wenqi Fan, Yujuan Ding*, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li
The Hong Kong Polytechnic University, HK SAR; Baidu Inc., China

ABSTRACT

As one of the most advanced techniques in AI, Retrieval-Augmented Generation (RAG) can offer reliable and up-to-date external knowledge, providing great convenience for numerous tasks. Particularly in the era of AI-Generated Content (AIGC), the powerful capacity of retrieval in providing additional knowledge enables RAG to assist existing generative AI in producing high-quality outputs. Recently, Large Language Models (LLMs) have demonstrated revolutionary abilities in language understanding and generation, while still facing inherent limitations such as hallucinations and out-of-date internal knowledge. Given the powerful abilities of RAG in providing the latest and helpful auxiliary information, Retrieval-Augmented Large Language Models (RA-LLMs) have emerged to harness external and authoritative knowledge bases, rather than relying solely on the model's internal knowledge, to augment the generation quality of LLMs. In this survey, we comprehensively review existing research on RA-LLMs, covering three primary technical perspectives: architectures, training strategies, and applications. As preliminary knowledge, we briefly introduce the foundations and recent advances of LLMs. Then, to illustrate the practical significance of RAG for LLMs, we systematically review mainstream relevant work by architecture, training strategy, and application area, detailing the specific challenges of each and the corresponding capabilities of RA-LLMs. Finally, to deliver deeper insights, we discuss current limitations and several promising directions for future research. Updated information about this survey can be found at https://advanced-recommender-systems.github.io/RAG-Meets-LLMs/.1

KEYWORDS

Retrieval-Augmented Generation (RAG), Large Language Model (LLM), Pre-training, Fine-tuning, In-context Learning, Prompting.

* Corresponding author: Yujuan Ding
1 This is the long version of the survey to appear at KDD 2024 [33].

1 INTRODUCTION

As one of the most fundamental data mining techniques, retrieval aims to understand the input query and extract relevant information from external data sources [24, 30, 67, 140]. It has found extensive application in various fields [8, 28, 106, 179], such as search, question answering, and recommender systems. For instance, search engines (e.g., Google, Bing, and Baidu) are the most successful industrial applications of retrieval; they can filter and retrieve the web pages or documents most relevant to a user's query [19, 179], enabling users to find the desired information effectively. Meanwhile, retrieval models, through effective data maintenance in external databases, can provide faithful and timely external knowledge, thereby serving vital functions in various knowledge-intensive tasks. Due to their powerful capacities, retrieval techniques have been successfully incorporated into advanced generative models in the era of AI-Generated Content (AIGC) [77, 132, 163]. Notably, the integration of retrieval models with language models has given rise to Retrieval-Augmented Generation (RAG) [74], which has emerged as one of the most representative techniques in the field of generative AI, aiming to enhance the quality of the generated text with retrieved information [6, 74, 77].

To advance generation models and enhance the generated results, RAG incorporates information or knowledge from external data sources, which serves as a supplement to the input query or the generated output [62, 103]. Specifically, RAG first invokes the retriever to search and extract the relevant documents from external databases, which are then leveraged as context to enhance the generation process [54]. In practice, RAG techniques are feasible and efficient to apply to various generation tasks with simple adaptation of the retrieval component, requiring minimal or even no additional training [117]. Recent studies have demonstrated the great potential of RAG not only for knowledge-intensive tasks such as Open-domain Question Answering (OpenQA) [6, 46, 109, 133], but also for general language tasks [48, 62, 170] and various downstream applications [90, 163].
Figure 1: Retrieval-Augmented Generation (RAG) meets Large Language Models (LLMs). When the user's query is out-of-scope, e.g., unseen content in training data or the need for the latest information for the answer, LLMs might show inferior generation performance. With the help of RAG, LLMs can leverage additional relevant information from an external database to enhance their text generation capability. (The figure contrasts an LLM's reply, "As of my last update in January 2022, I can't provide which country won ... 2023.", with the RAG-assisted reply, "Spain won the Women's World Cup 2023.", drawing additional new and domain-specific information from an external database.)

Recent years have witnessed the rapid development of pre-trained foundation models, particularly Large Language Models (LLMs), which have demonstrated impressive performance across various tasks [1, 18], including recommender systems [195], molecule discovery [77], and report generation [27]. Technically, the great success of LLMs can be attributed to advanced architectures with billion-level parameters pre-trained on huge training corpora from various sources. These technical improvements have given rise to the remarkable emergent capabilities of LLMs [194, 195], particularly in language understanding and generation, in-context learning, and others. For instance, GPT-FAR introduces detailed prompts to teach GPT-4 to perform image tagging, statistical analysis, and text analysis for multi-modal fashion report generation [27]. LLMs also achieve promising performance in recommender systems by understanding users' preferences towards items [154, 195]. Despite these successes, LLMs still suffer from intrinsic limitations [194, 195], such as the lack of domain-specific knowledge, the problem of "hallucination", and the substantial computational resources required for updating the models. These problems are particularly notable in domain-specific fields like medicine and law. For instance, a recent study has demonstrated that legal hallucinations are pervasive and disturbing, with hallucination rates ranging from 69% to 88% in responses to specific legal queries for state-of-the-art LLMs [21]. Moreover, tackling the hallucination problem becomes even harder due to the substantial computational resources required for fine-tuning LLMs with domain-specific or the latest data. This, in turn, significantly hinders the widespread adoption of LLMs in various real-world applications.

To address these limitations, recent efforts have been made to take advantage of RAG to enhance the capabilities of LLMs in various tasks [6, 53, 62, 135], especially those demanding the latest and reliable knowledge, such as Question Answering (QA), AI4Science, and software engineering. For example, Lozano et al. [92] introduce a scientific QA system based on retrieving scientific literature. As illustrated in Figure 1, an LLM-based dialog system will not be able to answer out-of-scope queries well; with the help of RAG to retrieve relevant knowledge from an external database and integrate it into the generation process, the dialog system succeeds in giving correct answers. Given the remarkable progress in advancing LLMs with RAG, there is an imperative need for a systematic review of recent advances in Retrieval-Augmented Large Language Models (RA-LLMs).

This survey aims to provide a comprehensive overview of RA-LLMs by summarizing representative methods from the aspects of architecture, training strategy, and application area, respectively. More specifically, following a brief introduction to the background knowledge of LLMs in Section 2, we review existing research from several primary perspectives of RA-LLMs in terms of retrieval, generation, and augmentation in Section 3, as well as the necessity and frequency of retrieval in RAG. Then, we summarize the main training techniques of RA-LLMs in Section 4 and various RA-LLM applications in Section 5. Finally, in Section 6, we discuss key challenges and potential directions for future exploration. Concurrent to our survey, several related surveys have different focuses on RAG and LLMs. For example, Zhao et al. [193] specifically review multi-modal information-based RAG techniques, and Zhao et al. [192] discuss RAG for AIGC. Gao et al. [41] conduct a relatively comprehensive overview of RAG for LLMs. Our survey differs from these surveys in concentrating on technical perspectives and systematically reviewing models according to their architecture and training paradigm in RA-LLMs, as well as application tasks.

2 BACKGROUND

In this section, we briefly present the background of large language models and prompt learning.

2.1 Large Language Models (LLMs)

Recently, the significant breakthrough of LLMs has revolutionized the field of artificial intelligence [7, 37, 194]. Advanced LLMs are typically pre-trained on extensive data with billion-level parameters and have demonstrated the ability to understand and generate human-like text, leading to advancements in various natural language processing tasks such as text generation and information retrieval [194, 195]. LLMs can be adapted to a variety of downstream tasks by fine-tuning them on specific datasets, allowing them to specialize in particular domains or applications. In general, most existing LLMs can be broadly divided into three main categories: Encoder-only, Decoder-only, and Encoder-Decoder models.
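Before each family is described in turn below, the following minimal sketch contrasts the three categories. It assumes the Hugging Face transformers package and the public bert-base-uncased, gpt2, and t5-small checkpoints purely for illustration; none of these choices are prescribed by the surveyed work.

```python
# Illustrative only: an Encoder-only model used with a cloze-style input,
# a Decoder-only model used for left-to-right generation, and an
# Encoder-Decoder model used for text-to-text transformation.
from transformers import pipeline

# Encoder-only (BERT): bi-directional encoding suits cloze-style prediction.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Retrieval provides [MASK] knowledge for language models.")[0]["token_str"])

# Decoder-only (GPT-2): predicts the next token from the left context.
generator = pipeline("text-generation", model="gpt2")
print(generator("Retrieval-augmented generation is", max_new_tokens=20)[0]["generated_text"])

# Encoder-Decoder (T5): casts tasks such as translation as text generation.
seq2seq = pipeline("text2text-generation", model="t5-small")
print(seq2seq("translate English to German: Retrieval helps language models.")[0]["generated_text"])
```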
Encoder-only models, such as the BERT (Bidirectional Encoder Representations from Transformers) [25] family of models, process input text by encoding it into a high-dimensional space. The key feature of Encoder-only models is their bi-directional nature, meaning that they can take into account both the left and right context of each token when encoding it. This bi-directionality allows Encoder-only models to better understand the meaning of words in context, which is crucial for tasks like sentiment analysis, review reading, and text classification [25, 169]. In contrast to these models, Decoder-only models generate text in a left-to-right fashion. As a representative Decoder-only model, GPT (Generative Pre-trained Transformer) [114] predicts the next token in a sequence based on the context provided by the previous tokens. This architecture makes such models particularly effective for tasks like language generation, code generation, and creative writing. Encoder-Decoder models, such as T5 (Text-To-Text Transfer Transformer) [116], uniquely transform a variety of NLP tasks into text generation problems. To be more specific, the encoder in T5 processes the input sequence to capture its meaning, while the decoder generates the output sequence based on the encoded information. This architecture is well-suited for tasks that involve converting one sequence into another, such as machine translation, summarization, and conversational response generation.

Figure 2: Representative RAG and RA-LLM methods organized by their main design focus (RAG framework/pipeline, RAG learning, retriever learning, and pre-/post-retrieval techniques), proposed time, and impact (indicated by citation count). Note that the first author and year shown in the figure along with the model name can be used to locate the corresponding reference.

2.2 Prompt Learning

2.2.1 Prompting Engineering. Due to the massive parameters of LLMs, prompt learning emerged as a paradigm to leverage the power of LLMs to implement various tasks [194, 195], instead of fine-tuning the LLMs extensively. Prompt learning carefully designs the input that guides the LLM to perform downstream tasks. For example, early methods [7, 110] provide manually crafted templates to handle various tasks in NLP. Specifically, Encoder-only models like BERT typically adopt cloze prompts because they very closely match the form of their pre-training task [20, 110]. For other models like GPT, prefix prompts tend to be more suitable as they mesh well with generation tasks [7]. However, manually designed prompts rely on human experience without effectiveness guarantees. To address this limitation, soft prompt tuning was developed to learn trainable continuous prompt embeddings [83, 150, 151]. For instance, Prefix-Tuning [83] prepends a series of prefix embeddings to the input, which can be trained and updated. This approach allows prompts not to be real text, giving more flexibility in the design of prompts. However, due to the lack of domain-specific knowledge, the model might still not generate accurate responses when facing new tasks.

2.2.2 In-Context Learning (ICL). To overcome the limitations of vanilla prompt learning, recent efforts [66, 89, 191] have developed in-context learning (ICL). ICL is a specific method of prompt learning that gives the model a few demonstrations of the task within the prompt. This paradigm allows pre-trained LLMs to understand the pattern provided by the demonstrations and solve novel tasks without the need for fine-tuning. For example, by carefully selecting a few demonstrations, GPT-3 [7] has shown the capability to perform few-shot tasks [89]. This success indicates that LLMs have a remarkable ability to rapidly adapt to new tasks based on task-specific knowledge.
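As a concrete illustration of the prompting styles above, the sketch below builds a cloze-style prompt and a few-shot ICL prompt. The templates, demonstrations, and the call_llm placeholder are illustrative assumptions, not prompts taken from any cited work.

```python
def cloze_prompt(review: str) -> str:
    # Cloze-style template suited to Encoder-only models such as BERT.
    return f"Review: {review} Overall, the sentiment of the review is [MASK]."

def icl_prompt(demonstrations: list[tuple[str, str]], query: str) -> str:
    # Prefix-style few-shot prompt for a Decoder-only LLM: demonstrations are
    # prepended so the model can infer the task pattern without fine-tuning.
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in demonstrations]
    blocks.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(blocks)

demos = [
    ("Which country hosted the 2016 Summer Olympics?", "Brazil"),
    ("Which planet is known as the Red Planet?", "Mars"),
]
prompt = icl_prompt(demos, "Which ocean is the largest on Earth?")
# answer = call_llm(prompt)  # hypothetical call to a frozen LLM
print(prompt)
```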
Despite its effectiveness, ICL usually relies heavily on the quality of the provided demonstrations [143], which may lead to the generation of sub-optimal outputs. Even worse, ICL may not have enough necessary information or prior knowledge to guide the LLMs in generating accurate responses. To address the aforementioned limitations of ICL, more recent studies introduce Retrieval-Augmented Generation (RAG) technologies [74, 117, 135]. By integrating retrieval with generation, RAG models provide a promising direction for enhancing the performance and adaptability of LLMs across various tasks.

3 RETRIEVAL-AUGMENTED LARGE LANGUAGE MODELS (RA-LLMS)

The RAG framework in the era of LLMs consists of several major processes: retrieval, generation, and augmentation, as well as the mechanism to determine whether retrieval is needed. In this section, we introduce the important techniques involved in each process.

3.1 Retrieval

Given the query from the input of LLMs, the retrieval process in RAG aims to provide relevant information from external knowledge sources, which can be either open-sourced or closed-sourced, as shown in Figure 3. The key component, the retriever, as further detailed in Figure 4, consists of several procedures that function as a whole to measure the relevance between the query and the documents in the database for effective information retrieval. The specific retrieval pipeline is further determined by whether pre- and post-retrieval processes are included. In this subsection, we introduce the major techniques involved in the retrieval of traditional and LLM-based RAGs, including the retriever type, retrieval granularity, pre- and post-retrieval enhancement, and database construction.

3.1.1 Retriever Type. Retrieval methods can be generally categorized into two types, sparse and dense, based on the information encoding methods. Sparse retrieval is word-based and mostly applied to text retrieval, while dense retrieval embeds queries and external knowledge into vector spaces and can be applied to various data formats.

As a straightforward approach, sparse retrieval, e.g., TF-IDF and BM25 [125, 142], usually relies on inverted index matching along with the raw data input. For example, many studies directly apply BM25 for passage-level retrieval to facilitate their RAG [10, 57, 117, 168, 196, 197], where passages are represented as bags of words and ranked based on term and inverse document frequencies [54]. Beyond offering supplementary information to enhance the input of the generator, sparse retrieval has also been used to find demonstrations for in-context learning in RA-LLMs [2, 96, 126, 138, 176]. The main limitation of applying sparse retrieval in RAG is its no-training nature, which makes the retrieval performance rely heavily on the quality of the database and the query. Moreover, such fixed term-based methods only support similarity-based retrieval and cannot be adapted to other retrieval criteria that may exist in LLM applications, such as diversity [31].

Dense retrieval, on the contrary, embeds the query and documents into a continuous vector space with certain criteria, for example, semantic similarity [61]. Dense retrieval methods are usually trainable, therefore holding more flexibility and potential for adaptation. As the key component of dense retrievers, the embedding models have subtly different designs in existing RAG models. A simple design [62, 72, 165] is to directly use a part of the generation model as the embedding layer of the retriever, which might enhance the alignment between the retrieval and generation processes. BERT-based backbones [25] are widely applied in retrieval models. One common retriever design in RAG is to construct two-stream encoders with the BERT structure (one encoder for the query and the other for the documents), which is also called a bi-encoder [135, 164]. Early-stage RAG methods tend to freeze [6, 117] or partially freeze [74] the parameters of the retriever to perform general-level relevant knowledge extraction and pay more attention to knowledge leveraging and generator fine-tuning. Large-scale specialized pre-training further enhances RAG models to excel in more knowledge-intensive tasks. One typical success is the Dense Passage Retriever (DPR) [61], which uses a BERT-based backbone and is pre-trained specifically for the OpenQA task with question-answer pair data. DPR has shown strong capacity as a pre-trained retriever, facilitating many RAG models to succeed in various downstream tasks [54, 74, 135, 139, 141]. It has also been regarded as the first step in the RAG paradigm for improving the performance of LLMs, which may further enhance the alignment of the embeddings between queries and relevant textual data through fine-tuning [16]. A recent study [122] has also discovered that DPR training decentralizes how knowledge is stored in the network, creating multiple access pathways to the same information. With effective fine-tuning, bi-encoder retrievers are also widely applied in ICL-based RAG [82, 93, 101, 111, 126, 176]. Specifically, they have been used more often for sentence embedding similarity-based retrieval, as well as for some special requirements in ICL, such as diverse example retrieval [176].

Another stream of dense retrievers widely applied in RA-LLMs uses only one encoder, which may be based on a Transformer, BERT, or other off-the-shelf sequence modeling backbones. These one-encoder retrievers are generally pre-trained on large-scale unaligned documents by contrastive learning [122], and may therefore excel in versatility, meaning that they can transfer and generalize better to new domains or tasks. Such general-purpose pre-trained retrievers, e.g., Contriever [42] and Spider [118], are more flexible to use in LLMs targeting various tasks and have demonstrated their effectiveness in many RA-LLM methods, such as In-Context RALM [117], Atlas [55], and Self-RAG [5]. According to experimental results in existing studies [182], for open-domain QA tasks, when cooperating with InstructGPT [107], applying a general-purpose pre-trained retriever (Contriever) without fine-tuning achieves comparable performance to a sparse retriever (BM25). However, both are worse than the DPR model fine-tuned on target datasets, showing the effectiveness of fine-tuning on targeted tasks and data.
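To ground the sparse/dense distinction of Section 3.1.1, the sketch below scores a toy corpus with a from-scratch BM25 implementation and with an inner-product dense score. The corpus, the embed placeholder, and the parameter values are illustrative assumptions rather than settings used by the surveyed systems, which rely on inverted indexes and trained bi-encoders such as DPR.

```python
# A minimal sketch contrasting sparse (BM25) and dense (inner-product) retrieval.
import math
from collections import Counter

corpus = [
    "Spain won the Women's World Cup in 2023.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Dense retrievers embed queries and documents into a shared vector space.",
]
tokenized = [doc.lower().split() for doc in corpus]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Sparse retrieval: classic BM25 scoring over a toy in-memory corpus."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            s += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def dense_scores(query, docs, embed):
    """Dense retrieval: inner product between query and document embeddings
    produced by a (here hypothetical) trained encoder `embed`."""
    q = embed(query)
    return [sum(qi * di for qi, di in zip(q, embed(" ".join(d)))) for d in docs]

print(bm25_scores("who won the women's world cup", tokenized))
```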
Figure 3: Illustration of the basic Retrieval-Augmented Large Language Models (RA-LLMs) framework for a specific QA task, which consists of three main components: retrieval, augmentation, and generation. Retrieval may have different procedures with various designs, optionally including pre-retrieval and post-retrieval processes. The retrieved documents are further leveraged in generation with the augmentation module, which may operate at different integration stages. (The figure walks a medical query, "I had very bad cardiac pain this morning, also felt dizzy and nauseous...", through pre-retrieval enhancement (rewrite, expand), the retriever, post-retrieval enhancement (filter, compress), and input-layer, output-layer, or intermediate-layer augmentation of the generator.)

Figure 4: Illustration of the retriever in RA-LLMs, which can be implemented in either dense or sparse manners, each with several key operations (chunking/tokenizing and indexing the database, embedding the query, and relevance scoring to return chunks, documents, tokens, entities, etc.).

3.1.2 Retrieval Granularity. Retrieval granularity denotes the retrieval unit in which the corpus is indexed, e.g., document, passage, token, or other levels like entity. For RAG, the choice of retrieval granularity can significantly impact the overall performance of the model in terms of effectiveness and efficiency, as it determines the storage space of the database as well as the computational cost of searching [4]. Early retrieval-augmented language models [10] propose to retrieve whole documents and then apply a machine comprehension model trained to detect answer spans in the returned documents, which focuses more on language reading and locating key information in the documents. In generative language models, chunk retrieval (chunks are also called passages in some references [46, 57, 61]) is common and has been used in both traditional and LLM-based RAG models such as REALM [46], RAG [74], and Atlas [55]. A more fine-grained option, token retrieval, can instead be done with faster searching but places more burden on database storage. Token retrieval is more suitable in cases requiring rare patterns or out-of-domain data [62], and it cooperates well with the every-token retrieval strategy applied in kNN-LM and similar work [47, 104, 180]. In comparison, a text chunk may contain compact and complete information with less redundancy and irrelevancy, therefore becoming the mainstream retrieval text granularity in RAG.

Another major retrieval granularity proposed in RAG is entity retrieval. Unlike the above types of granularity, entity retrieval is designed from the perspective of knowledge rather than language. Févry et al. [39] introduce the Entities as Experts (EAE) model, which divides the parameter space of the language model according to entity identity. The EAE model aims to learn entity representations from the text along with the other model parameters on the Wikipedia database and to represent knowledge with an entity memory. At a more fine-grained level, de Jong et al. [22] propose to build the knowledge base by learning and retrieving mentions rather than entities. Overall, applying entity- or mention-level retrieval in RAG is more effective for entity-centric tasks and more space-efficient than token-wise retrieval.
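A minimal sketch of chunk-level indexing, the mainstream granularity discussed above: documents are split into disjoint fixed-size word chunks (here following the common 100-word passage-style setting) and flattened into retrieval units with provenance. The toy corpus is an illustrative assumption.

```python
def chunk_document(text: str, chunk_size: int = 100) -> list[str]:
    """Split one document into disjoint chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def build_chunk_index(articles: dict[str, str], chunk_size: int = 100) -> list[dict]:
    """Flatten a corpus into retrieval units, keeping provenance for each chunk."""
    index = []
    for title, text in articles.items():
        for position, chunk in enumerate(chunk_document(text, chunk_size)):
            index.append({"title": title, "position": position, "text": chunk})
    return index

toy_corpus = {"RAG": "Retrieval-Augmented Generation combines a retriever with a generator ..."}
print(len(build_chunk_index(toy_corpus, chunk_size=20)))
```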
3.1.3 Pre-retrieval and Post-retrieval Enhancement. To ensure retrieval quality, i.e., to increase the accuracy and relevance of the retrieved results, various pre- and post-retrieval strategies have been proposed to further enhance the input and output of the retriever. Wang et al. [156] propose a query expansion approach, Query2doc, which generates pseudo-documents by few-shot prompting LLMs and expands the query with the relevant information in the pseudo-documents to improve query disambiguation and guide the retrievers. They empirically demonstrate that such a method can boost the performance of both sparse and dense retrievers [61] on ad-hoc information retrieval datasets. Similarly, Gao et al. [40] propose the Hypothetical Document Embedding (HyDE) method, which instructs an LLM to generate hypothetical documents for the given query. The hypothetical documents are then used as new queries that are embedded to search for neighbors with the dense retriever. Another pre-retrieval strategy, query rewriting [98], aims to close the gap between the input text and the knowledge needed for retrieval by reformulating the original question into a version more conducive to retrieval. Specifically, Ma et al. [98] propose the Rewrite-Retrieve-Read framework, which prompts an LLM to generate the query for the retrieval function. The motivation of the rewriting step is to clarify the retrieval need in the new query, easing the burden on the retrieval function to comprehend the input and enhancing its output, i.e., the retrieved relevant information. They test both a frozen LLM and a trainable model as the rewriter; both outperform naive RAG or generation models, though with diverse performance across the tested QA datasets. Tan et al. [146] also formulate a query rewriting strategy in their model that decomposes the heuristic answer from a proxy generation model into distinct claims.

Yu et al. [183] propose query augmentation, which combines the original query and the preliminarily generated outputs as a new query that is further used to retrieve relevant information from the external database. The retrieved results can inspire the language model to rethink the generated results and enhance them. Compared to applying only the original query, such augmentation may contribute more relevant information retrieved from the corpus for directly clarifying the query-output relationship. Including the initial output in the new query further enhances the lexical and semantic overlap between the supporting documents to be retrieved and the given question. Query augmentation achieves overall better performance among these query enhancement strategies since it may process all retrieved knowledge collectively while generating answers [155].

Post-retrieval enhancement denotes the procedure of processing the extracted top-k documents from the retriever before feeding them to the generator, for the sake of better alignment between the retrieval and generation stages [173], particularly for closed-source generators such as LLMs. For example, Yang et al. [173] propose the Pluggable Reward-driven Context Adapter (PRCA), which fine-tunes a lightweight adapter instead of the generator on specific datasets. It also distills the retrieved documents through reinforcement learning with rewards derived from the generator. Glass et al. [44] propose the Retrieve-Rerank-Generate (R2G) method, which assembles the retrieved documents of different retrieval approaches with a rerank operation to boost the robustness of the retrieval results. Another consideration for applying post-retrieval enhancement is that the retrieved information may sometimes be irrelevant or contain noise, which might not help the generation model with the task, or even worse, harm the generation process [159]. Wang et al. [159], Asai et al. [5], and Yu et al. [183] propose different strategies to mitigate the noise in retrieved knowledge documents. However, Xiong et al. [166] empirically find that these methods depend on the LLM's confidence levels, which might not be as precise as expected. To address this problem, Wang et al. [155] propose BlendFilter, which simultaneously considers pre-retrieval query generation blending and post-retrieval knowledge filtering. This method can tackle both complex questions and noisy retrieved knowledge, thereby comprehensively enhancing RA-LLM performance.

More recently, advanced RAG pipelines have been proposed that use LLMs to generate reasoning paths and plans, with an Information Retrieval (IR) module iteratively retrieving knowledge to enhance LLM-based generation [130, 172, 175]. However, Zhu et al. [198] point out that if the outputs of the IR module and the LLM are of low quality, the retrieval and generation processes hinder each other in such an iterative guidance pipeline. To overcome this barrier, they propose a new reasoning approach to enhance the query and the retrieved knowledge. Post-retrieval strategies may also function to enhance the compatibility between the retrieved results and the generation models. For example, one of the main limitations of existing LLMs is the length of the input tokens, which prevents long retrieved documents from being directly incorporated into existing RA-LLMs. For this limitation, Xu et al. [168] propose Retrieve, Compress, Prepend (RECOMP), which adds an intermediate step to process the retrieved documents into a textual summary before in-context augmentation in the generation process. From another perspective, a long retrieved passage list leads to high inference latency when using auto-regressive decoding at the generation stage, which hurts the model's efficiency. For this limitation, Hofstätter et al. [50] propose a light version of the FiD model that compresses the encoded vectors per retrieved passage before concatenating and feeding them through the decoder, and also includes a re-ranker on the retrieved results before applying them in the generation.
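The sketch below outlines the pre-retrieval ideas discussed in this subsection: query expansion with an LLM-written pseudo-document (in the spirit of Query2doc/HyDE) and query rewriting (in the spirit of Rewrite-Retrieve-Read). The llm, embed, and dense_search callables and the prompt wording are hypothetical placeholders, not the original papers' implementations.

```python
def expand_with_pseudo_document(query: str, llm) -> str:
    # Query2doc/HyDE-style expansion: let an LLM write a hypothetical passage
    # that would answer the query, then retrieve with the expanded text.
    pseudo_doc = llm(f"Write a short passage that answers the question:\n{query}")
    return f"{query}\n{pseudo_doc}"

def rewrite_query(query: str, llm) -> str:
    # Rewrite-Retrieve-Read-style rewriting: make the retrieval need explicit.
    return llm(f"Rewrite the following question as a concise search query:\n{query}")

def retrieve_with_pre_enhancement(query: str, llm, embed, dense_search, top_k: int = 5):
    # Chain the two pre-retrieval steps, then run dense retrieval on the result.
    expanded = expand_with_pseudo_document(rewrite_query(query, llm), llm)
    return dense_search(embed(expanded), top_k=top_k)
```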
Table 1: Basic publication information and main technical designs of high-impact RAG and RA-LLM models.

Time | Model | Cite | Retriever | RetTrain | RetAug Stage | Pre-/Post-Retrieval | Generator | Aug | Evaluation
2019 | kNN-LM [62] | 619 | DR(GP) | No | Inf | RA | DT | Output | LG
2020 | REALM [46] | 1437 | DR(BE, BT) | Yes | PT+FT | / | ET | Input | OpenQA (NQ, WQ, CT)
2020 | RAG [74] | 2125 | DR(DPR) | Yes | FT | / | ED (BART) | Input | OpenQA, AQA, Jeopardy QG, FV
2021 | FiD [54] | 780 | SR(BM25)/DR(DPR) | No | FT | / | ED (T5/BART) | Input | OpenQA
2021 | SE-FiD [68] | 286 | SE(Bing) | No | Inf | RQG | FiD | Input | WizInt, WoW
2021 | FiD-KD [53] | 190 | DR(BE) | Yes | FT | CR | FiD | Input | OpenQA
2021 | RETRO [6] | 683 | DR(BERT, DPR) | No | PT | / | ED | Inter | LM, OpenQA
2021 | EPR [126] | 384 | DR(DPR) | Yes | Inf | CR | GPT-3/J/Neo, CODEX | Demon | UR
2022 | OpenBook [70] | 145 | SE+SR | No | | QE | GOPHER | Input | LM, QA, FV
2022 | DSP [63] | 117 | ColBERTv2 | No | Inf | RQG, RF | GPT-3.5 | Demon | OpenQA, MHQA, CQA
2023 | In-Context RALM [117] | 211 | DR/SR | No | Inf | TRR | GPT-2/J/Neo | Input | LM, OpenQA
2023 | Atlas [55] | 367 | DR(OE) | Yes | PT+FT | / | ED | Input | OpenQA, FV, WoW, EL, SF, MMLU
2023 | FLARE [57] | 133 | SR(BM25)/SE(Bing) | No | Inf | RQG | GPT-3.5 | Input | MHQA, CR, LongQA, OS
2023 | IRCoT [149] | 114 | SR(BM25) | No | Inf | / | GPT-3, Flan-T5 | Input | OpenQA
2023 | Self-RAG [5] | 85 | DR(OE) | No | FT | CM | tunable LLM | | OpenQA, LongQA, FV, BG
2023 | REPLUG [135] | 48 | DR(BE) | Yes | FT | TRA | GPT-2/3 | Input | MMLU, OpenQA
2023 | UDR [80] | 42 | DR(DPR) | Yes | FT | CR | GPT-Neo | Demon | 40 NLP tasks
2023 | ITER-RETGEN [130] | 40 | DR(DPR) | Yes | FT | RR | InstructGPT, Llama-2 | Input | MHQA, FV, CR

Abbreviations: Retrievers: [BE: Bi-Encoder (also referred to as dual encoder), OE: One-Encoder, BT: BERT-style Transformer, GP: Partial Generation, SE: Search Engine, SR: Sparse Retrieval, DPR: [61]]; Generators: [DT: Decoder-only Transformer, ET: Encoder-only Transformer, ED: Encoder-Decoder]; Pre-/Post-Retrieval techniques: [RQG: Retrieval Query Generation, QE: Query Expansion, (T)RR: (Trainable) Re-Ranker, TRA: Trainable Retriever Adaptor, CR: Candidate Retrieval, CM: Critic Model]; Augmentations: [Output: Output-layer Integration, Input: Input-layer Integration, Inter: Intermediate-layer Integration, Demon: As demonstration]; Tasks: [AQA: Abstractive Question Answering, QG: Question Generation, NQ: Natural Questions, WQ: WebQuestions, CT: CuratedTrec, FV: Fact Verification, TQA: TriviaQA, WizInt: Wizard of the Internet task, WoW: Wizard of Wikipedia task, MHQA: Multi-hop QA, CQA: Conversational QA, EL: Entity Linking, SF: Slot-filling, MMLU: Massively-Multitask Language Understanding, CR: Commonsense Reasoning, LongQA: Long-form QA, OS: Open-domain Summarization, BG: Biography Generation, UR: Utterance Representing, RF: Retrieval Fusion]

3.1.4 Database. Retrieval in RAG is conducted over an external knowledge source, which can be closed- or open-sourced [98, 100], as illustrated in Figure 3. A closed-sourced database generally stores key-value pairs for knowledge and can be constructed in various ways. The keys are primarily used for similarity matching, being either sparse vectors as in BM25 or dense embeddings from the retrieval encoder. The value depends on the specific retrieval target, which is raw text in most cases [6, 46, 54, 72, 74, 129]. For example, in early RAG [74], each Wikipedia article is split into disjoint 100-word chunks, making a total of 21M documents; each document is encoded into a dense embedding and saved in the database as the value and key, respectively. The value can also store tokens, one per key, as applied in kNN-LM [62] and SPALM [180]. The source of the database depends on the specific application domains and tasks. Wikipedia is one of the most commonly applied general retrieval corpora in previous RAG work, which stores factual, structured information and has several versions differing in scale, from the billion-token level [22, 39, 46, 62, 74, 117, 135, 168, 180] to the trillion-token level [6]. Domain-specific databases are also used for downstream tasks. For example, for the code generation task, Zan et al. [185] collect API information and code files of public libraries to build their APIretriever database. In addition, Zhou et al. [197] propose to use a documentation pool that is frequently updated with new content (newly released libraries) in their model.

Applying an Internet search engine [95] such as Bing or Google avoids maintaining a search index and gives access to up-to-date knowledge [68, 70]. Meanwhile, it provides a broader knowledge base than a closed-sourced database [5, 70]. It can also provide high-quality ranking, having been tuned over decades of use. Internet search has been widely incorporated with black-box LLMs and shows effectiveness for different functions such as knowledge augmentation [70], fact-checking [100], and LLM agent enhancement [175]. Compared to traditional RAG, Internet search has been leveraged more as the retriever in RA-LLMs owing to the extraordinary capability of LLMs to act as the reader and comprehend the search results, i.e., the retrieved documents, as well as LLMs' ability to use tools to process and analyze them [98]. Existing studies [182] have shown that leveraging search engines is particularly effective for LLMs (e.g., InstructGPT) on zero-shot knowledge-intensive tasks such as OpenQA and fact checking.
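A minimal sketch of the key-value datastore described in Section 3.1.4: dense keys for similarity matching, raw text chunks as values, and brute-force inner-product search. The toy_embed hashing embedder stands in for a trained encoder such as DPR, and a production system would typically use an approximate nearest-neighbor index instead of exhaustive NumPy search.

```python
import numpy as np

class VectorStore:
    def __init__(self, embed):
        self.embed = embed          # text -> 1-D np.ndarray (hypothetical encoder)
        self.keys = []              # list of embedding vectors (keys)
        self.values = []            # list of raw text chunks (values)

    def add(self, chunks: list[str]) -> None:
        for chunk in chunks:
            self.keys.append(self.embed(chunk))
            self.values.append(chunk)

    def search(self, query: str, top_k: int = 5) -> list[str]:
        scores = np.stack(self.keys) @ self.embed(query)   # inner-product relevance
        best = np.argsort(-scores)[:top_k]
        return [self.values[i] for i in best]

# Toy hashing-based embedder standing in for a trained dense encoder.
def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    v = np.zeros(dim)
    for token in text.lower().split():
        v[hash(token) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

store = VectorStore(toy_embed)
store.add(["Spain won the Women's World Cup 2023.", "BM25 is a sparse retrieval function."])
print(store.search("Who won the Women's World Cup in 2023?", top_k=1))
```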
3.2 Generation

The design of the generator heavily depends on the downstream task. For most text generation tasks, Decoder-only and Encoder-Decoder are the two dominant structures [194]. The recent development of commercial, closed-source large foundation models has made black-box generation models mainstream in RA-LLMs. In this part, we briefly review studies with these two types of generators: parameter-accessible (white-box) and parameter-inaccessible (black-box).

3.2.1 Parameter-Accessible Generators (White-box). The Encoder-Decoder structure processes the input and the target independently with different sets of parameters, in which a cross-attention component is developed to connect input tokens to target tokens. Representative Encoder-Decoder models include T5 [116] and BART [73]. In comparison, Decoder-only models process inputs and targets after concatenation, so the representations of the two parts are built concurrently layer by layer as they propagate up the network. These two types of generators are widely applied in existing RAG work. For example, RAG [74] and Re2G [44] employ BART; FiD [54] and EMDR2 utilize T5. Other models [6, 84] leverage Transformer-based Encoder-Decoder architectures with some customized designs. Generators in RAG differ from general ones by incorporating retrieved data to enhance the generation accuracy and relevance. Furthermore, white-box generators allow parameter optimization; they can be trained to adapt to different retrieval and augmentation approaches for better generation performance.

3.2.2 Parameter-Inaccessible Generators (Black-box). A certain proportion of LLMs are released without the disclosure of internal structures or the accessibility of parameters, especially particularly large-scale ones such as the GPT series [1], Codex [12], and Claude; these are called black-box generation models. Such generators only allow the operations of feeding queries (input) and receiving responses (output), while not allowing the internal structure to be altered or the parameters to be updated. From another perspective, LLMs, even those open for fine-tuning, are large in scale and difficult to tune for downstream domain-specific tasks with only a limited amount of data. Black-box RA-LLMs, therefore, focus more on the retrieval and augmentation processes, trying to enhance the generator by augmenting the input (also called the prompt in the context of LLMs) with better knowledge, guidance, or examples for the generation. For example, Rubin et al. [126] propose to train a prompt retriever with data labeled by language models themselves, which can be used to provide better examples for in-context learning, thereby enhancing the final generation performance. Xu et al. [168] propose to compress the retrieved documents before in-context integration, which can reduce the computational costs and also relieve the burden on LMs of identifying relevant information in long retrieved documents.

3.3 Retrieval Integration for Generation Augmentation

Augmentation describes the technical process that integrates the retrieval and generation parts, which is the essential part of RA-LLMs. In this subsection, we introduce three main designs of augmentation, which are conducted at the input, output, and intermediate layers of the generator, respectively, as illustrated in Figure 3.

3.3.1 Input-Layer Integration. A common way to integrate retrieved information/documents is to combine them with the original input/query and jointly pass them to the generator, which is called input-layer integration. For example, In-Context RALM [117] applies input-layer integration by concatenating the original input and all retrieved documents into a single sequence as the new input for the generation model. Despite its effectiveness, such integration is limited by the number of retrieved documents, since the concatenated new input may be too long to be processed by the generation model. In-Context RALM alleviates this limitation by removing tokens from the beginning of the new input. To avoid the information loss of such a token-removal strategy, FiD [54] employs a different integration method that processes each retrieved document independently in the encoder. This strategy is scalable to a large number of contexts as it only performs self-attention over one context at a time in the follow-up processing. Atlas [55] and REPLUG [135] apply a similar parallel integration by concatenating the query and one retrieved document at a time. In general, most black-box generation-based RAG methods apply input-layer integration since neither the intermediate layers of the generation model nor the output distribution is accessible.

More specifically for LLMs, input-layer integration may use the retrieved content as (additional) prompts or demonstrations, on top of using it as a supplement to the original input as in traditional RAGs [126]. Prompt retrieval aims to automatically find suitable natural language prompts through retrieval to teach the LLM to learn in context [7] or to induce the LLM to reason [162]. It may boost the zero-shot ability of LLMs without delicate prompt engineering. For example, Cheng et al. [16] propose to learn a prompt retriever based on input-prompt pair data with score labels derived from a frozen LLM.

3.3.2 Output-Layer Integration. Another kind of augmentation is post-hoc, i.e., output-layer integration, which joins the retrieval and generation results. For example, kNN-LM [62] interpolates two next-token distributions in prediction: one induced by the LM and the other induced by the nearest neighbors from the retrieval corpus. Output-layer linear integration [45, 196] is flexible to apply since it can be plugged into most generation models without additional training. However, the simplicity of output-layer integration also limits the model's ability to reason about the retrieved text. To tackle this limitation, Yogatama et al. [180] propose to add an extra gating network to post-process the retrieved data, achieving comparatively better performance. For LLMs, output-layer integration is as reasonable and adaptive as input-layer integration. REFEED [183] proposes an answer-refining mechanism that applies an LLM to evaluate the retrieved information and adjust the initial answer accordingly to enhance the accuracy of the response. Similarly, Zhang et al. [190] propose the COMBO framework, which matches LLM-generated passages with retrieved counterparts into compatible pairs based on pre-trained discriminators. The passage pairs are then handled by a Fusion-in-Decoder-based model [54] to derive a final answer.

3.3.3 Intermediate-Layer Integration. Compared to the above two non-parametric approaches, a more engaging augmentation is to design a semi-parametric module to integrate the retrieved results through the internal layers of the generation model, which is called intermediate-layer integration. Such integration might add additional complexity but promises to enhance the capability of the generation model with effective training. Typically, a Transformer module is introduced to bring the retrieved information (mostly encoded into dense representations) into the generation model and let it interact with the representations in the middle stages of generation. For example, RETRO [6] introduces a Chunked Cross-Attention (CCA) layer to process the retrieved chunks in the generator blocks, and Wu et al. [165] introduce the kNN-Augmented Attention Layer. Similarly, EAE [39] and TOME [22] use an Entity Memory and a MemoryAttention layer to incorporate the retrieved entities and entity mentions, respectively. Such intermediate-layer integration can use many blocks frequently and efficiently to enhance the capability of the whole RAG model. It offers an efficient way to incorporate the large number of frequently retrieved text chunks that are challenging to process with input-layer integration due to the input length limit of LMs [6]. However, it should also be noted that intermediate-layer integration requires high access to the generation models, which is not feasible for most LLMs that are accessible only through inference APIs [98].
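To make output-layer integration (Section 3.3.2) concrete, the sketch below interpolates an LM's next-token distribution with a distribution induced by retrieved neighbors, p = lambda * p_knn + (1 - lambda) * p_lm, in the spirit of kNN-LM. The neighbor list, distances, interpolation weight, and toy vocabulary are illustrative assumptions.

```python
import numpy as np

def knn_distribution(neighbor_tokens, neighbor_distances, vocab_size, temperature=1.0):
    """Turn retrieved (token, distance) pairs into a distribution over the vocabulary."""
    weights = np.exp(-np.asarray(neighbor_distances, dtype=float) / temperature)
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for token_id, w in zip(neighbor_tokens, weights):
        p_knn[token_id] += w
    return p_knn

def interpolate(p_lm, p_knn, lam=0.25):
    """Output-layer integration: mix the LM and retrieval-induced distributions."""
    return lam * p_knn + (1.0 - lam) * p_lm

vocab_size = 10
p_lm = np.full(vocab_size, 1.0 / vocab_size)               # placeholder LM distribution
p_knn = knn_distribution([3, 3, 7], [1.0, 1.2, 2.5], vocab_size)
print(interpolate(p_lm, p_knn).round(3))
```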
3.4 Retrieval Augmentation Necessity and Frequency

The retrieval operation in LLM-based generation generally aims to supplement knowledge to enhance generation. Although retrieval-augmented models have emerged as promising, they have been criticized for not being a universal solution [75, 109], as indiscriminately augmenting LLMs with irrelevant passages can override potentially correct knowledge already possessed by LLMs and result in incorrect responses instead [99]. Thakur et al. [147] contribute a human-annotated dataset to help evaluate the robustness of LLMs against errors in external retrieved knowledge and observe that LLMs may double their hallucination rate on non-relevant retrieved passages compared to relevant ones. It is therefore critical for RA-LLMs to accurately recall their prior knowledge while selectively incorporating retrieved information only when necessary, which is the path to robust RA-LLMs.

Most existing methods determine the necessity of retrieval based on the preliminary answers of LLMs or their internal reasoning results [102, 117]. For example, Self-RAG [5] introduces special tokens to assess the necessity of retrieval and control retrieval behavior. Other methods design iterative prompts to decide whether extra information is required during generation, thereby determining whether to invoke retrieval or other actions for LLMs [162, 175]. Wang et al. [159] propose the Self-Knowledge guided Retrieval augmentation (SKR) method, which uses LLMs themselves or explicit small trainable models to offer self-knowledge as the reference for adaptively calling a retriever. In traditional RAGs, retrieval necessity judgment has also been explored and addressed with intuitive approaches such as assessing the confidence of the logits produced by the generation models [47, 56, 59]. Such a solution is also applicable to RA-LLMs; for example, FLARE [57] dynamically triggers RAG if the logits fall below a specific threshold. Tan et al. [146] introduce a more flexible model, SlimPLM, which detects missing knowledge in LLMs with a slim proxy model that generates a "heuristic answer". The "heuristic answer" is used to assess the necessity of retrieval and to facilitate the retrieval process through query rewriting when necessary.

In traditional RAGs, which rarely consider retrieval necessity, retrieval frequency (also called retrieval stride) is an important design aspect that determines the degree to which retrieval is used in generation, thereby greatly affecting the overall performance of RAG models [117]. Retrieval frequency controls how much the model relies on the retrieval results, thereby affecting both the efficiency and effectiveness of the model. When the necessity of retrieval is not considered, the retrieval frequency is often pre-defined and fixed, with three common settings: one-time, every-n-token, and every-token. One-time retrieval invokes the retrieval function only once and tries to find all the desired information in that single operation. One-time retrieval is usually performed at the beginning of the generation process, and then provides all retrieved documents to the generation model along with the original input, as applied in REALM [46]. One-time retrieval is more suitable for cases in which the information needed from external databases is obvious to LLMs [57]. However, for language tasks requiring long-form output, such as open-domain summarization, the dependencies among the output tokens are more important to consider during generation. In these cases, documents pre-retrieved through one-time retrieval might not be enough to support the generation of the whole output sequence, which calls for in-generation retrieval operations. To this end, In-Context RALM [117] and RETRO [6] apply every-n-token retrieval during generation for better augmentation. In comparison, kNN-LM [62] adopts a more frequent retrieval strategy, retrieving information for the prediction of every token during generation. Overall, the retrieval frequency impacts both the effectiveness and efficiency of the whole RAG method. For example, more frequent retrieval leads to better performance but also increases the computing cost [117]. Choosing the retrieval frequency is therefore largely a trade-off between computing cost and performance.
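The sketch below illustrates confidence-triggered retrieval in the spirit of FLARE, as discussed in this subsection: a sentence is generated tentatively, and retrieval is invoked only when its minimum token probability falls below a threshold. The generate_with_probs and retrieve callables, the threshold, and the prompt layout are hypothetical placeholders rather than the original method.

```python
def adaptive_generate(query, generate_with_probs, retrieve, threshold=0.4, max_steps=10):
    """Sketch of 'when to retrieve': retrieve only on low-confidence continuations."""
    context = ""
    answer = ""
    for _ in range(max_steps):
        # Tentatively generate the next sentence and its per-token probabilities.
        sentence, token_probs = generate_with_probs(context + query + answer)
        if sentence == "":
            break
        if min(token_probs) < threshold:
            # Low confidence: retrieve evidence for the uncertain sentence,
            # prepend it as context, and regenerate that sentence.
            context = "\n".join(retrieve(sentence)) + "\n"
            sentence, _ = generate_with_probs(context + query + answer)
        answer += sentence
    return answer
```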
4 RA-LLMS TRAINING

Based on whether training is required, existing RAG methods can be categorized into two main classes: training-free approaches and training-based approaches. Training-free methods usually leverage the retrieved knowledge directly at inference time by inserting the retrieved text into the prompt, without introducing extra training, which is computationally efficient. However, one potential challenge is that the retriever and generator components are not specifically optimized for downstream tasks, which could easily lead to sub-optimal utilization of the retrieved knowledge. To fully exploit the external knowledge, extensive methods have been proposed to fine-tune the retriever and generator, thereby guiding large language models to effectively adapt to and integrate retrieved information [127, 128, 130, 135, 153, 199].

According to their training strategies, we categorize these training-based approaches into three classes: 1) Independent Training approaches train each component in the RAG procedure independently, 2) Sequential Training methods train one module first and freeze the well-trained component to guide the tuning process of the other part, and 3) Joint Training approaches train the retriever and generator simultaneously. In the following sections, we comprehensively review the training-free, independent training, sequential training, and joint training methods. The comparison of these different training methods is depicted in Figure 5.

Figure 5: An illustration of the different training methods in Retrieval-Augmented Large Language Models (RA-LLMs). Existing RA-LLM approaches can be categorized into two classes: training-free approaches usually leverage retrieved information directly at inference time by integrating the retrieved knowledge into the prompt, while training-based approaches fine-tune the retriever and generator to enhance the generation performance. Based on the training strategies, training-based methods can be further categorized into three groups: independent training, where the retriever and generator components are trained independently; sequential training, where they are trained sequentially; and joint training, where they are trained jointly.

4.1 Training-free

With their huge number of parameters, LLMs have exhibited human-level intelligence and achieved promising prediction performance on various downstream tasks. However, it is extremely challenging to frequently perform fine-tuning and update the knowledge stored in the model parameters [74] due to the considerable time and computational resources required. Recently, numerous studies have suggested enhancing LLMs with retrieval mechanisms, enabling them to dynamically acquire new knowledge from external sources without extra training (i.e., training-free) [54, 57, 63], instead of relying solely on the implicit knowledge encoded in the model's parameters. These approaches have shown significant performance improvements on various knowledge-intensive tasks, such as open-domain question answering [74]. According to the different ways in which LLMs utilize the retrieved information, we categorize these training-free methods into two classes: 1) Prompt Engineering-based Methods integrate the retrieved knowledge directly into the original prompt, and 2) Retrieval-Guided Token Generation Methods retrieve information to calibrate the token generation process.

4.1.1 Prompt Engineering-based Methods. As the generation performance of LLMs highly depends on the input query, numerous training-free RAG approaches employ external knowledge by refining the original prompt [57, 63, 81]. Specifically, the retrieved texts are usually used as contextual information and combined with the original prompt to guide the generation of LLMs [54, 57, 63, 65, 81, 112, 158]. For example, In-Context RALM [117] keeps the LLM parameters unchanged and directly places the retrieved document before the original prompt to augment the generation process. IRCoT [149] interleaves chain-of-thought (CoT) generation and knowledge retrieval steps, enabling the retrieval of more relevant information for subsequent reasoning steps compared to standard retrieval methods that rely solely on the question as the query. Instead of retrieving knowledge from a large corpus, GENREAD [182] first prompts an LLM to generate contextual documents based on the query, and then generates answers based on the given context and question. SKR [159] proposes guiding LLMs to determine whether they can answer a given question based on their internal knowledge, enabling flexible utilization of both internal and external knowledge by selectively calling the retriever. TOC [65] first retrieves relevant knowledge for ambiguous questions and recursively constructs a tree structure by clarifying ambiguous questions into multiple disambiguated questions, which are further aggregated to generate long-form answers.

4.1.2 Retrieval-Guided Token Generation Methods. In addition to directly integrating external knowledge into the original prompt, the auxiliary information can be employed to adjust the token generation process. For example, kNN-LM [62] first retrieves the k most relevant contexts from the datastore based on the given query and computes a neighbor distribution based on the distances. The output distribution is then calibrated by interpolating the neighbor distribution with the original model's output distribution. REST [49] proposes to replace the parametric draft model with a non-parametric retrieval datastore and retrieves relevant tokens based on the current context for speculative decoding [9, 71, 145].
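A minimal sketch of a training-free, prompt engineering-based RAG call in the spirit of In-Context RALM: retrieved chunks are simply placed before the question and a frozen LLM is queried once, with no parameter updates anywhere. The retrieve and call_llm callables and the prompt template are hypothetical placeholders.

```python
def rag_answer(question: str, retrieve, call_llm, top_k: int = 3) -> str:
    """Training-free RAG: retrieve, build a prompt, and query a frozen LLM once."""
    chunks = retrieve(question, top_k=top_k)
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    prompt = (
        "Answer the question using the retrieved context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return call_llm(prompt)   # no parameter update anywhere: training-free
```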
A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models Conference’17, July 2017, Washington, DC, USA
4.2 Independent Training
Independent training refers to training the retriever and the LLM as two entirely independent processes, in which there is no interaction between the retriever and the LLM during training [61, 69, 197]. Compared with training-free methods, the performance of RAG-empowered models can be effectively enhanced by training LLMs to leverage the retrieved knowledge, or by training retrievers to bridge the gap between information retrieval and language generation. For the training of LLMs, the negative log-likelihood loss is the most representative training objective [115, 148], which guides the LLM to generate the desired output based on the given input. Retrievers can be categorized into two types: 1) sparse retrievers [120, 125], and 2) dense retrievers [61, 69, 197]. Sparse retrievers usually exploit sparse features, e.g., word frequencies, to represent documents and calculate relevance scores based on task-specific metrics [77, 120, 125] such as TF-IDF and BM25; a minimal scoring sketch is given below.
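As a concrete illustration of such sparse scoring, the self-contained sketch below computes Okapi BM25 scores from term frequencies, document frequencies, and document lengths. It is a didactic simplification with the usual k1 and b hyper-parameters, not the optimized implementation used by any production retriever.

# A self-contained sketch of BM25 relevance scoring over a tokenized corpus.
import math
from collections import Counter
from typing import List

def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs                 # average document length
    df = Counter(term for d in docs for term in set(d))        # document frequencies
    scores = []
    for doc in docs:
        tf = Counter(doc)                                      # term frequencies in this document
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores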
As for dense retrievers, deep neural networks are employed to encode the query and documents into dense representations, and the inner product is then typically used to calculate relevance scores and retrieve the relevant external knowledge. For example, DPR [61] adopts two independent BERT [25] networks to encode the query and passages respectively, and trains these encoders with contrastive learning. CoG [69] trains a prefix encoder and a phrase encoder for retrieval and reformulates text generation as a sequence of copy-and-paste operations from an existing source text collection.
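The following schematic sketch illustrates the two ingredients just described for DPR-style dense retrieval: inner-product scoring between separately encoded queries and passages, and a contrastive objective that treats the other passages in a batch as negatives. The BERT-like encoders are omitted and assumed to be supplied externally; this is an illustration of the idea, not the original DPR code.

# A schematic sketch of dense retrieval training and search with inner-product scoring.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb: torch.Tensor,   # (B, d) query embeddings
                              p_emb: torch.Tensor    # (B, d) embeddings of the gold passages
                              ) -> torch.Tensor:
    scores = q_emb @ p_emb.T                          # (B, B) inner-product relevance scores
    labels = torch.arange(q_emb.size(0), device=q_emb.device)  # i-th passage matches i-th query
    return F.cross_entropy(scores, labels)            # other passages act as in-batch negatives

def retrieve_top_k(q_vec: torch.Tensor,               # (d,) a single encoded query
                   corpus_emb: torch.Tensor,          # (N, d) pre-computed passage embeddings
                   k: int = 5) -> torch.Tensor:
    scores = corpus_emb @ q_vec                       # inner product against the whole corpus
    return torch.topk(scores, k).indices              # ids of the k most relevant passages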
4.3 Sequential Training
Independent training is an efficient way to exploit external knowledge during the generation process, since the retriever and generator can be trained offline and any off-the-shelf models can be reused, avoiding extra training costs. To better enhance the synergy between the retriever and generator, several methods have been proposed to train the retriever and the LLM sequentially. In these sequential training methods, the process typically begins with the independent pre-training of either the retriever or the generator, after which the pre-trained module is fixed while the other module is trained. Note that various existing models (e.g., BERT [25, 64, 123], CLIP [113], T5 [116]) can be directly employed as the fixed retriever or generator, thereby bypassing the first pre-training step. Compared to independent training, sequential training involves coordinated training of the retriever and generator, where the trainable module benefits from the assistance of the fixed module. Based on the training order, sequential training can be categorized into two classes: 1) Retriever First [5, 127, 128, 153, 199], and 2) LLMs First [130, 135, 157].

4.3.1 Retriever First. These methods first train the retrieval model and then fix it; the LLM is subsequently trained to utilize the retrieved knowledge. For instance, RETRO [6] adopts an independently pre-trained BERT model as the retriever and trains an encoder-decoder architecture to integrate retrieved chunks into the model's predictions. RALMs [181] adopts Google Search and the open-source COLBERTV2 [64] as pre-trained retrievers and fine-tunes the LLM to effectively leverage the retrieved passages. ITER-RTGEN [124] utilizes the pre-trained S-BERT [123] as the retriever and introduces an adaptive hybrid retrieval strategy for retrieving demonstrations; it further leverages T5 [116] as the generator, which is fine-tuned on the target label given an input that combines the original prompt with the retrieved demonstrations. SMALLCAP [121] uses CLIP [113], a powerful pre-trained multi-modal network, to encode the input image and the textual data in the external datastore and retrieves the most relevant items based on cosine similarity; a cross-attention layer is trained and GPT-2 [115] is used as the decoder to produce captions.
4.3.2 LLMs First. Conversely, one can also pre-train the LLM first and then tune the retriever under the supervision of the well-trained LLM. For example, DKRR [53] shows that the attention scores of a sequence-to-sequence reader can indicate document relevance and therefore proposes to use the reader's attention scores as synthetic labels to train the retriever. AAR [184] uses a small language model to generate the supervision signal for training retrievers; the well-trained retriever can then be leveraged to enhance the performance of black-box LLMs. RA-DIT [86] first fine-tunes the LLM to enhance its ability to leverage retrieved knowledge and then trains the retriever to better align its output with the LLM. UPRISE [16] proposes a lightweight method to enhance the zero-shot performance of LLMs on unseen tasks by introducing a prompt retriever: a frozen LLM guides the fine-tuning of the prompt retriever, which then retrieves prompts for different tasks and different LLMs during inference.

4.4 Joint Training
Joint training methods [17, 51, 60, 79, 167, 196] employ an end-to-end paradigm to optimize the retriever and generator simultaneously. Instead of training each module sequentially, joint training effectively enhances both the retriever's ability to locate external knowledge for generation and the generator's capacity to leverage the retrieved information. For instance, RAG [74] minimizes the negative log-likelihood to jointly train the retriever and generator. REALM [46] adopts a similar training paradigm to RAG [74] and uses the Maximum Inner Product Search (MIPS) [15, 29, 119, 131] technique to locate the most relevant documents. To employ MIPS, all external documents are first embedded and a search index is built over the embeddings; an asynchronous index-updating strategy [46, 52, 55, 141] refreshes the index every several hundred training steps to avoid the cost of re-indexing all documents at every update.
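To make the joint objective concrete, the simplified sketch below marginalizes the generator likelihood over the top-k retrieved documents in the spirit of RAG [74], so that the negative log-likelihood back-propagates into both modules. The inputs doc_scores and gen_logprobs are assumptions of this illustration, standing for the outputs of a trainable retriever and a trainable generator.

# A simplified sketch of a RAG-style joint training loss.
import torch

def joint_rag_nll(doc_scores: torch.Tensor,    # (k,) retriever scores for the top-k documents
                  gen_logprobs: torch.Tensor   # (k,) log p(y | x, z_i) from the generator
                  ) -> torch.Tensor:
    retr_logprobs = torch.log_softmax(doc_scores, dim=-1)         # log p(z_i | x)
    # log p(y | x) = log sum_i p(z_i | x) * p(y | x, z_i)
    log_marginal = torch.logsumexp(retr_logprobs + gen_logprobs, dim=-1)
    return -log_marginal                                          # differentiable w.r.t. both modules

In practice, the document embeddings scored by the retriever are held in a MIPS index that is refreshed asynchronously during training, as described above.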
Figure 6: A summary of applications of RA-LLMs categorized by NLP applications, downstream tasks, and domain-specific applications. Specifically, NLP applications include QA systems, ChatBots, and fact verification; downstream tasks include recommendations and software engineering; and domain-specific applications include AI for Science and Finance.
5.1 NLP Applications
Due to their intrinsic capability in text generation, RA-LLMs have various applications in the NLP field, such as question answering (QA) systems, chatbots, and fact verification.

5.1.1 QA Systems. QA systems aim to provide precise answers to users' queries. However, even when trained on extensive data, these systems may lack the latest information or specific domain knowledge that is not included in their training data [54, 91]. To address this limitation, the integration of RA-LLMs has played a crucial role in advancing the capabilities of QA systems by enhancing their ability to retrieve and synthesize relevant information [6, 54]. Specifically, RA-LLMs can provide coherent and contextually relevant answers by leveraging their retrieval component to access a vast knowledge base. For example, REALM [46] integrates a knowledge retriever that can retrieve information from a large corpus during pre-training, fine-tuning, and inference, which allows it to draw on a vast knowledge corpus and thereby improve the accuracy of its responses. Similarly, Fusion-in-Decoder [54] retrieves passages from support documents and then fuses them with the question to generate the answer, achieving higher accuracy. In addition, Borgeaud et al. [6] indicate that the quality of the answers may depend more on the output of the retrieval component.

5.1.2 ChatBots. Chatbots are designed to interact with users in a natural and conversational manner [87]. Different from QA systems, chatbots focus on maintaining a coherent and contextually rich conversation with the user. To enhance these capabilities, recent methods integrate RA-LLMs [60, 68, 188] for their ability to augment the chatbot with relevant external knowledge, facilitating more engaging and context-rich interactions with users. For example, some studies [14, 43] retrieve relevant knowledge from static databases (e.g., a Wikipedia dump) to augment the conversation. Komeili et al. [68] propose retrieving information from internet search to further augment conversation performance. Considering the dynamic nature of knowledge in the world, another model [152] further accesses large amounts of dynamic information from search engines to generate responses.

5.1.3 Fact Verification. Fact verification is the critical task of verifying the accuracy and reliability of information. With the need for trusted evidence, RA-LLMs are being utilized to enhance the capabilities of fact verification [55, 74]. Lewis et al. [74] first propose retrieving external knowledge to augment a range of knowledge-intensive tasks including fact verification. Atlas [55] examines the performance of RA-LLMs for fact verification under few-shot learning. Recently, Self-RAG [5] has made a notable impression by incorporating a self-reflective mechanism: it reflects on whether the retrieved information is helpful and judges its reliability, thereby greatly improving verification accuracy.

5.2 Downstream Tasks
In addition to NLP applications, RA-LLMs can also be applied to various downstream tasks, such as recommendations and software engineering.

5.2.1 Recommendations. Recommender systems play an important role in modeling users' preferences and providing personalized recommendations [34–36, 154, 189, 195]. Recently, RA-LLMs have demonstrated great potential in providing personalized and contextually relevant recommendations by integrating retrieval and generation processes [26, 94, 163]. For example, Di Palma [26] proposes a simple retrieval-augmented recommendation model that leverages knowledge from movie and book datasets to enhance recommendations. Additionally, Lu et al. [94] further retrieve from reviews to enrich item information in recommender systems.
CoRAL [163] utilizes reinforcement learning to retrieve collaborative information from the dataset and align it with semantic information for more accurate recommendations.

5.2.2 Software Engineering. The rise of RA-LLMs has influenced many aspects of software engineering [105, 177, 197]. For example, some studies propose retrieval-augmented generation paradigms for code generation [197] and program repair [105]. Similarly, Parvez et al. [108] retrieve top-ranked code or summaries from a codebase and aggregate them with the input to enhance code generation and summarization. In addition, RA-LLMs show potential in tabular data processing [76, 177] and text-to-SQL semantic parsing [111, 134].

5.3 Domain-specific Applications
RA-LLMs have been widely adopted for various domain-specific tasks, such as AI for science and finance.

5.3.1 AI for Science. RA-LLMs have proven to be beneficial in scientific domains such as molecular and protein research. Molecular tasks include identifying molecular properties and generating new molecules, thereby facilitating drug discovery. Currently, some RA-LLMs have been applied to molecules by integrating retrieval over molecular structures and biomedical entities such as proteins, molecules, and diseases [90, 160, 161, 174]. Li et al. [77] and Wang et al. [160] propose retrieval-based frameworks that retrieve from a database to guide molecule generation. Liu et al. [90] introduce a multi-modal molecule structure-text model that retrieves textual knowledge from a large-scale dataset for molecular property prediction. In addition, RA-LLMs also significantly influence protein representation and generation [97, 144]. For instance, RSA [97] queries protein sequences associated with a collection of structurally or functionally similar sequences in the database to enhance protein representations. Furthermore, Lozano et al. [92] present a clinical QA system based on retrieving published review articles.

5.3.2 Finance. In the highly data-driven and information-intensive field of finance, RA-LLMs have proved to be a significant technology for enhancing decision-making [78, 178, 187]. For example, Zhang et al. [187] retrieve financial information from external sources, such as news platforms (e.g., Bloomberg and Reuters) and social media platforms (e.g., Twitter and Reddit), and combine it with the original query to enhance the precision of financial sentiment analysis. In addition, financial QA is another primary task of financial analysis, which usually requires extracting relevant knowledge from financial documents. As professional documents are usually stored in PDF format, Lin [85] introduces a PDF parser combined with RA-LLMs to retrieve knowledge from financial reports. On the other hand, Yepes et al. [178] propose a document chunking method based on document structure instead of paragraphs, further improving the quality of RA-LLM outputs.

6 FUTURE CHALLENGES AND OPPORTUNITIES
Since the studies of RA-LLMs are still at an early stage, we present several potential research directions in the field of RA-LLMs that can be explored in the future.

Trustworthy RA-LLMs. The essential objective of developing RAG-empowered LLMs is to enhance the capability of language models, thereby benefiting users and society by alleviating redundant and meaningless labor, increasing convenience, and spurring social progress. However, recent research indicates that RA-LLMs can be maliciously or unintentionally manipulated into making unreliable decisions that harm humans [23, 200], which may have serious consequences in safety-critical scenarios [11, 13, 32, 38, 88]. In addition, private retrieval databases are at risk of leakage, raising concerns regarding the privacy of RA-LLMs [186]. Therefore, developing trustworthy RA-LLMs is of paramount importance, as it can significantly mitigate the potential negative impacts of LLM technology and provide people with powerful AI models that can be fully trusted. Specifically, an ideal trustworthy RA-LLM system should possess the following characteristics: 1) robustness, 2) fairness, 3) explainability, and 4) privacy. Robustness means a trustworthy RA-LLM system should be robust against malicious or inadvertent perturbations introduced by attackers. Fairness indicates that a trustworthy RA-LLM system ought to avoid discrimination during the decision-making process. Explainability requires a complete understanding of the intrinsic workings of RA-LLM systems, i.e., their predictions should be explainable and transparent. Privacy entails safeguarding the private information housed within the datastore when establishing trustworthy RA-LLM systems.

Multi-Lingual RA-LLMs. The ability to leverage knowledge in multiple languages can greatly enhance the capabilities of retrieval-augmented language models. As the world becomes increasingly interconnected, there is a growing need for AI systems that can understand and communicate across different languages. By incorporating multilingual knowledge retrieval and generation, these models can access and synthesize information from diverse linguistic sources, enabling more comprehensive and nuanced understanding and generation capabilities. Additionally, multilingual models can facilitate cross-cultural communication and knowledge sharing and break down language barriers, thereby benefiting people across different regions of the world, especially those in areas with minority languages [58, 81]. For example, users from countries with less prevalent languages can utilize abundant English and Chinese corpora for knowledge retrieval, enhancing the performance of large language models on downstream tasks.

Multi-modal RA-LLMs. Multi-modal retrieval-augmented generation extends the knowledge sources beyond text to include various data modalities such as images, videos, and audio. By integrating multiple modalities, LLMs can leverage richer contextual information than single-modal RAG and develop a more comprehensive understanding of users' needs, bringing precise, fine-grained, and high-quality generation. For instance, an image or video can provide valuable visual cues that complement textual information, leading to more precise language generation [51, 199]. By fusing multiple modalities, multi-modal RA-LLMs can develop a more comprehensive understanding of the world, leading to more accurate and insightful outputs and benefiting a wide range of domains, including healthcare [199], drug discovery [136], and molecular analysis [3, 90], among others.
Quality of External Knowledge. As a commonly used datastore in current RAG systems, Wikipedia [61, 199] serves as a vast repository of external textual knowledge used to augment the generation process, containing millions of articles covering various disciplines. However, the reliability and accuracy of individual articles within Wikipedia vary significantly, and the introduction of texts that deviate from facts might even mislead the model's generation process. Therefore, it is crucial to enhance the quality of the external knowledge corpus and mitigate the negative impact of low-quality knowledge on the performance of LLMs. By enhancing the quality of the external knowledge and tailoring robust mechanisms for filtering out low-quality or unreliable information, RA-LLM systems can produce more accurate and reliable outputs, thereby improving their effectiveness in various real-world applications.

7 CONCLUSION
Retrieval-augmented generation (RAG), a cutting-edge AI technique, has achieved remarkable success across various applications, including recommendation, molecule generation, protein representation, and software engineering, owing to the potent capability of retrieval to provide supplementary information that enhances generation performance. Recently, increasing efforts have been made to alleviate the limitations of large language models (LLMs), such as hallucination and out-of-date internal knowledge, by leveraging retrieval to provide the latest auxiliary information and by teaching LLMs to harness the retrieved external knowledge. With the rapid advancements in retrieval-augmented large language models (RA-LLMs), there is a pressing need for a comprehensive and systematic overview. To bridge this gap, in this paper we comprehensively review RA-LLMs from the perspectives of model architecture, training strategy, and application area, providing researchers with an in-depth understanding. Moreover, since the studies of RA-LLMs are still at an early stage, we also discuss the current limitations and several potential directions for future research.

REFERENCES
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
[2] Sweta Agrawal, Chunting Zhou, Mike Lewis, Luke Zettlemoyer, and Marjan Ghazvininejad. 2023. In-context Examples Selection for Machine Translation. In ACL (Findings). Association for Computational Linguistics, 8857–8873.
[3] Miles C Andrews, Junna Oba, Chang-Jiun Wu, Haifeng Zhu, Tatiana Karpinets, Caitlin A Creasy, Marie-Andrée Forget, Xiaoxing Yu, Xingzhi Song, Xizeng Mao, et al. 2022. Multi-modal molecular programs regulate melanoma cell state. Nature Communications 13, 1 (2022), 4000.
[4] Akari Asai, Sewon Min, Zexuan Zhong, and Danqi Chen. 2023. Retrieval-based language models and applications. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 6: Tutorial Abstracts). 41–46.
[5] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. In The Twelfth International Conference on Learning Representations.
[6] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning. PMLR, 2206–2240.
[7] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[8] Stefan Buttcher, Charles LA Clarke, and Gordon V Cormack. 2016. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press.
[9] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318 (2023).
[10] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to Answer Open-Domain Questions. In ACL (1). Association for Computational Linguistics, 1870–1879.
[11] Jingfan Chen, Wenqi Fan, Guanghui Zhu, Xiangyu Zhao, Chunfeng Yuan, Qing Li, and Yihua Huang. 2022. Knowledge-enhanced Black-box Attacks for Recommendations. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 108–117.
[12] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
[13] Xiao Chen, Wenqi Fan, Jingfan Chen, Haochen Liu, Zitao Liu, Zhaoxiang Zhang, and Qing Li. 2023. Fairly adaptive negative sampling for recommendations. In Proceedings of the ACM Web Conference 2023. 3723–3733.
[14] Xiuyi Chen, Fandong Meng, Peng Li, Feilong Chen, Shuang Xu, Bo Xu, and Jie Zhou. 2020. Bridging the gap between prior and posterior knowledge selection for knowledge-grounded dialogue generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 3426–3437.
[15] Yudong Chen, Zhihui Lai, Yujuan Ding, Kaiyi Lin, and Wai Keung Wong. 2019. Deep supervised hashing with anchor graph. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9796–9804.
[16] Daixuan Cheng, Shaohan Huang, Junyu Bi, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Furu Wei, Weiwei Deng, and Qi Zhang. 2023. UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 12318–12337.
[17] Xin Cheng, Di Luo, Xiuying Chen, Lemao Liu, Dongyan Zhao, and Rui Yan. 2024. Lift yourself up: Retrieval-augmented text generation with self-memory. Advances in Neural Information Processing Systems 36 (2024).
[18] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113.
[19] W Bruce Croft, Donald Metzler, and Trevor Strohman. 2010. Search Engines: Information Retrieval in Practice. Vol. 520. Addison-Wesley Reading.
[20] Leyang Cui, Yu Wu, Jian Liu, Sen Yang, and Yue Zhang. 2021. Template-Based Named Entity Recognition Using BART. In ACL/IJCNLP (Findings) (Findings of ACL, Vol. ACL/IJCNLP 2021). Association for Computational Linguistics, 1835–1845.
[21] Matthew Dahl, Varun Magesh, Mirac Suzgun, and Daniel E Ho. 2024. Large legal fictions: Profiling legal hallucinations in large language models. arXiv preprint arXiv:2401.01301 (2024).
[22] Michiel de Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Fei Sha, and William W. Cohen. 2022. Mention Memory: Incorporating textual knowledge into Transformers through entity mention attention. In ICLR. OpenReview.net.
[23] Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei Zhang, and Yang Liu. 2024. Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning. arXiv preprint arXiv:2402.08416 (2024).
[24] Ziqing Deng, Zhihui Lai, Yujuan Ding, Heng Kong, and Xu Wu. 2024. Deep Scaling Factor Quantization Network for Large-scale Image Retrieval. In ICMR. ACM, 851–859.
[25] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (1). Association for Computational Linguistics, 4171–4186.
[26] Dario Di Palma. 2023. Retrieval-augmented recommender system: Enhancing recommender systems with large language models. In Proceedings of the 17th ACM Conference on Recommender Systems. 1369–1373.
[27] Yujuan Ding, Yunshan Ma, Wenqi Fan, Yige Yao, Tat-Seng Chua, and Qing Li. 2024. FashionReGen: LLM-Empowered Fashion Report Generation. arXiv preprint arXiv:2403.06660 (2024).
[28] Yujuan Ding, P. Y. Mok, Yunshan Ma, and Yi Bin. 2023. Personalized fashion outfit generation with user coordination preference learning. Inf. Process. Manag. 60, 5 (2023), 103434.
[29] Yujuan Ding, Wai Keung Wong, Zhihui Lai, and Zheng Zhang. 2020. Bilinear Supervised Hashing Based on 2D Image Features. IEEE Trans. Circuits Syst. Video Technol. 30, 2 (2020), 590–602.
[30] Yujuan Ding, Wai Keung Wong, Zhihui Lai, and Zheng Zhang. 2020. Discriminative dual-stream deep hashing for large-scale image retrieval. Information Processing & Management 57, 6 (2020), 102288.
[31] Andrew Drozdov, Nathanael Schärli, Ekin Akyürek, Nathan Scales, Xinying Song, Xinyun Chen, Olivier Bousquet, and Denny Zhou. 2022. Compositional semantic parsing with large language models. In The Eleventh International Conference on Learning Representations.
[32] Wenqi Fan, Tyler Derr, Xiangyu Zhao, Yao Ma, Hui Liu, Jianping Wang, Jiliang Tang, and Qing Li. 2021. Attacking black-box recommendations via copying
cross-domain user profiles. In 2021 IEEE 37th International Conference on Data Grave. 2023. Atlas: Few-shot Learning with Retrieval Augmented Language
Engineering (ICDE). IEEE, 1583–1594. Models. Journal of Machine Learning Research 24, 251 (2023), 1–43.
[33] Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, [56] Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham Neubig. 2021. How can we
Tat-Seng Chua, and Qing Li. 2024. A Survey on RAG Meeting LLMs: Towards know when language models know? on the calibration of language models for
Retrieval-Augmented Large Language Models. Proroceedings of the 30th ACM question answering. Transactions of the Association for Computational Linguistics
SIGKDD Conference on Knowledge Discovery & Data Mining (2024). 9 (2021), 962–977.
[34] Wenqi Fan, Xiaorui Liu, Wei Jin, Xiangyu Zhao, Jiliang Tang, and Qing Li. 2022. [57] Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-
Graph Trend Filtering Networks for Recommendation. In Proceedings of the 45th Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active Retrieval
International ACM SIGIR Conference on Research and Development in Information Augmented Generation. In Proceedings of the 2023 Conference on Empirical
Retrieval. 112–121. Methods in Natural Language Processing. 7969–7992.
[35] Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. [58] Anubha Kabra, Emmy Liu, Simran Khanuja, Alham Fikri Aji, Genta Winata,
2019. Graph neural networks for social recommendation. In The world wide web Samuel Cahyawijaya, Anuoluwapo Aremu, Perez Ogayo, and Graham Neubig.
conference. 417–426. 2023. Multi-lingual and Multi-cultural Figurative Language Understanding. In
[36] Wenqi Fan, Yao Ma, Qing Li, Jianping Wang, Guoyong Cai, Jiliang Tang, and The 61st Annual Meeting Of The Association For Computational Linguistics.
Dawei Yin. 2020. A graph neural network framework for social recommenda- [59] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain,
tions. IEEE Transactions on Knowledge and Data Engineering (2020). Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-
[37] Wenqi Fan, Shijie Wang, Jiani Huang, Zhikai Chen, Yu Song, Wenzhuo Tang, Johnson, et al. 2022. Language models (mostly) know what they know. arXiv
Haitao Mao, Hui Liu, Xiaorui Liu, Dawei Yin, et al. 2024. Graph Machine Learn- preprint arXiv:2207.05221 (2022).
ing in the Era of Large Language Models (LLMs). arXiv preprint arXiv:2404.14928 [60] Minki Kang, Jin Myung Kwak, Jinheon Baek, and Sung Ju Hwang. 2023. Knowl-
(2024). edge graph-augmented language models for knowledge-grounded dialogue
[38] Wenqi Fan, Xiangyu Zhao, Xiao Chen, Jingran Su, Jingtong Gao, Lin Wang, generation. arXiv preprint arXiv:2305.18846 (2023).
Qidong Liu, Yiqi Wang, Han Xu, Lei Chen, et al. 2022. A Comprehensive Survey [61] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu,
on Trustworthy Recommender Systems. arXiv preprint arXiv:2209.10117 (2022). Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval
[39] Thibault Févry, Livio Baldini Soares, Nicholas FitzGerald, Eunsol Choi, and Tom for Open-Domain Question Answering. In EMNLP (1). Association for Compu-
Kwiatkowski. 2020. Entities as Experts: Sparse Memory Access with Entity tational Linguistics, 6769–6781.
Supervision. In EMNLP (1). Association for Computational Linguistics, 4937– [62] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike
4951. Lewis. 2020. Generalization through Memorization: Nearest Neighbor Language
[40] Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. 2023. Precise Zero- Models. In International Conference on Learning Representations.
Shot Dense Retrieval without Relevance Labels. In ACL (1). Association for [63] Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang,
Computational Linguistics, 1762–1777. Christopher Potts, and Matei Zaharia. 2022. Demonstrate-search-predict: Com-
[41] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, posing retrieval and language models for knowledge-intensive nlp. arXiv
Jiawei Sun, and Haofen Wang. 2023. Retrieval-augmented generation for large preprint arXiv:2212.14024 (2022).
language models: A survey. arXiv preprint arXiv:2312.10997 (2023). [64] Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage
[42] Izacard Gautier, Caron Mathilde, Hosseini Lucas, Riedel Sebastian, Bojanowski search via contextualized late interaction over bert. In Proceedings of the 43rd
Piotr, Joulin Armand, and Grave Edouard. 2022. Unsupervised dense information International ACM SIGIR conference on research and development in Information
retrieval with contrastive learning. Transactions on Machine Learning Research Retrieval. 39–48.
(2022). [65] Gangwoo Kim, Sungdong Kim, Byeongguk Jeon, Joonsuk Park, and Jaewoo Kang.
[43] Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng 2023. Tree of Clarifications: Answering Ambiguous Questions with Retrieval-
Gao, Wen-tau Yih, and Michel Galley. 2018. A knowledge-grounded neural con- Augmented Large Language Models. In The 2023 Conference on Empirical Methods
versation model. In Proceedings of the AAAI Conference on Artificial Intelligence, in Natural Language Processing.
Vol. 32. [66] Hyuhng Joon Kim, Hyunsoo Cho, Junyeob Kim, Taeuk Kim, Kang Min Yoo,
[44] Michael R. Glass, Gaetano Rossiello, Md. Faisal Mahbub Chowdhury, Ankita and Sang-goo Lee. 2022. Self-generated in-context learning: Leveraging auto-
Naik, Pengshan Cai, and Alfio Gliozzo. 2022. Re2G: Retrieve, Rerank, Generate. regressive language models as a demonstration generator. arXiv preprint
In NAACL-HLT. Association for Computational Linguistics, 2701–2715. arXiv:2206.08082 (2022).
[45] Edouard Grave, Armand Joulin, and Nicolas Usunier. 2017. Improving Neural [67] Mei Kobayashi and Koichi Takeda. 2000. Information retrieval on the web. ACM
Language Models with a Continuous Cache. In ICLR (Poster). OpenReview.net. computing surveys (CSUR) 32, 2 (2000), 144–173.
[46] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. [68] Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2022. Internet-Augmented
2020. Retrieval augmented language model pre-training. In International confer- Dialogue Generation. In ACL (1). Association for Computational Linguistics,
ence on machine learning. PMLR, 3929–3938. 8460–8478.
[47] Junxian He, Graham Neubig, and Taylor Berg-Kirkpatrick. 2021. Efficient Near- [69] Tian Lan, Deng Cai, Yan Wang, Heyan Huang, and Xian-Ling Mao. 2022. Copy
est Neighbor Language Models. In EMNLP (1). Association for Computational is All You Need. In The Eleventh International Conference on Learning Represen-
Linguistics, 5703–5714. tations.
[48] Qiuxiang He, Guoping Huang, Qu Cui, Li Li, and Lemao Liu. 2021. Fast and [70] Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grig-
accurate neural machine translation with translation memory. In Proceedings of orev. 2022. Internet-augmented language models through few-shot prompting
the 59th Annual Meeting of the Association for Computational Linguistics and the for open-domain question answering. arXiv preprint arXiv:2203.05115 (2022).
11th International Joint Conference on Natural Language Processing (Volume 1: [71] Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from
Long Papers). 3170–3180. transformers via speculative decoding. In International Conference on Machine
[49] Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D Lee, and Di He. 2023. Rest: Learning. PMLR, 19274–19286.
Retrieval-based speculative decoding. arXiv preprint arXiv:2311.08252 (2023). [72] Mike Lewis, Marjan Ghazvininejad, Gargi Ghosh, Armen Aghajanyan, Sida
[50] Sebastian Hofstätter, Jiecao Chen, Karthik Raman, and Hamed Zamani. 2023. Wang, and Luke Zettlemoyer. 2020. Pre-training via paraphrasing. Advances in
FiD-Light: Efficient and Effective Retrieval-Augmented Text Generation. In Neural Information Processing Systems 33 (2020), 18470–18481.
SIGIR. ACM, 1437–1447. [73] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman
[51] Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART:
Cordelia Schmid, David A Ross, and Alireza Fathi. 2023. Reveal: Retrieval- Denoising Sequence-to-Sequence Pre-training for Natural Language Genera-
augmented visual-language pre-training with multi-source multimodal knowl- tion, Translation, and Comprehension. In ACL. Association for Computational
edge memory. In Proceedings of the IEEE/CVF conference on computer vision and Linguistics, 7871–7880.
pattern recognition. 23369–23379. [74] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir
[52] Jie Huang, Wei Ping, Peng Xu, Mohammad Shoeybi, Kevin Chen-Chuan Chang, Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock-
and Bryan Catanzaro. 2023. Raven: In-context learning with retrieval augmented täschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp
encoder-decoder language models. arXiv preprint arXiv:2308.07922 (2023). tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474.
[53] Gautier Izacard and Edouard Grave. 2021. Distilling Knowledge from Reader to [75] Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik,
Retriever for Question Answering. In ICLR 2021-9th International Conference on Andreas Veit, Felix Yu, and Sanjiv Kumar. 2022. Large language models with
Learning Representations. controllable working memory. arXiv preprint arXiv:2211.05110 (2022).
[54] Gautier Izacard and Edouard Grave. 2021. Leveraging Passage Retrieval with [76] Hongxin Li, Jingran Su, Yuntao Chen, Qing Li, and ZHAO-XIANG ZHANG.
Generative Models for Open Domain Question Answering. In EACL 2021-16th 2024. SheetCopilot: Bringing Software Productivity to the Next Level through
Conference of the European Chapter of the Association for Computational Linguis- Large Language Models. Advances in Neural Information Processing Systems 36
tics. Association for Computational Linguistics, 874–880. (2024).
[55] Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni,
Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard
[77] Jiatong Li, Yunqing Liu, Wenqi Fan, Xiao-Yong Wei, Hui Liu, Jiliang Tang, Language Models. arXiv preprint arXiv:2402.13492 (2024).
and Qing Li. 2023. Empowering Molecule Discovery for Molecule-Caption [100] Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song,
Translation with Large Language Models: A ChatGPT Perspective. arXiv preprint Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham,
arXiv:2306.06615 (2023). Geoffrey Irving, et al. 2022. Teaching language models to support answers with
[78] Xiang Li, Zhenyu Li, Chen Shi, Yong Xu, Qing Du, Mingkui Tan, Jun Huang, verified quotes. arXiv preprint arXiv:2203.11147 (2022).
and Wei Lin. 2024. AlphaFin: Benchmarking Financial Analysis with Retrieval- [101] Aristides Milios, Siva Reddy, and Dzmitry Bahdanau. 2023. In-context learning
Augmented Stock-Chain Framework. arXiv preprint arXiv:2403.12582 (2024). for text classification with many labels. In Proceedings of the 1st GenBench
[79] Xinze Li, Zhenghao Liu, Chenyan Xiong, Shi Yu, Yu Gu, Zhiyuan Liu, and Workshop on (Benchmarking) Generalisation in NLP. 173–184.
Ge Yu. 2023. Structure-Aware Language Model Pretraining Improves Dense [102] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh
Retrieval on Structured Data. In The 61st Annual Meeting Of The Association For Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the Role of Demonstra-
Computational Linguistics. tions: What Makes In-Context Learning Work?. In EMNLP. Association for
[80] Xiaonan Li, Kai Lv, Hang Yan, Tianyang Lin, Wei Zhu, Yuan Ni, Guotong Xie, Computational Linguistics, 11048–11064.
Xiaoling Wang, and Xipeng Qiu. 2023. Unified Demonstration Retriever for [103] Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020.
In-Context Learning. In ACL (1). Association for Computational Linguistics, AmbigQA: Answering Ambiguous Open-domain Questions. In EMNLP (1). As-
4644–4668. sociation for Computational Linguistics, 5783–5797.
[81] Xiaoqian Li, Ercong Nie, and Sheng Liang. 2023. From Classification to Gen- [104] Sewon Min, Weijia Shi, Mike Lewis, Xilun Chen, Wen-tau Yih, Hannaneh Ha-
eration: Insights into Crosslingual Retrieval Augmented ICL. In NeurIPS 2023 jishirzi, and Luke Zettlemoyer. 2023. Nonparametric Masked Language Model-
Workshop on Instruction Tuning and Instruction Following. ing. In ACL (Findings). Association for Computational Linguistics, 2097–2118.
[82] Xiaonan Li and Xipeng Qiu. 2023. MoT: Memory-of-Thought Enables ChatGPT [105] Noor Nashid, Mifta Sintaha, and Ali Mesbah. 2023. Retrieval-based prompt
to Self-Improve. In Proceedings of the 2023 Conference on Empirical Methods selection for code-related few-shot learning. In 2023 IEEE/ACM 45th International
in Natural Language Processing. Association for Computational Linguistics, Conference on Software Engineering (ICSE). IEEE, 2450–2462.
Singapore, 6354–6374. [106] Neil O’Hare, Paloma De Juan, Rossano Schifanella, Yunlong He, Dawei Yin,
[83] Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous and Yi Chang. 2016. Leveraging user interaction signals for web image search.
Prompts for Generation. In ACL/IJCNLP (1). Association for Computational In Proceedings of the 39th International ACM SIGIR conference on Research and
Linguistics, 4582–4597. Development in Information Retrieval. 559–568.
[84] Zonglin Li, Ruiqi Guo, and Sanjiv Kumar. 2022. Decoupled context processing [107] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela
for context augmented language modeling. Advances in Neural Information Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022.
Processing Systems 35 (2022), 21698–21710. Training language models to follow instructions with human feedback. Advances
[85] Demiao Lin. 2024. Revolutionizing Retrieval-Augmented Generation with En- in neural information processing systems 35 (2022), 27730–27744.
hanced PDF Structure Recognition. arXiv preprint arXiv:2401.12599 (2024). [108] Md. Rizwan Parvez, Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray,
[86] Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Richard and Kai-Wei Chang. 2021. Retrieval Augmented Code Generation and Sum-
James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, et al. 2023. marization. In EMNLP (Findings). Association for Computational Linguistics,
RA-DIT: Retrieval-Augmented Dual Instruction Tuning. In The Twelfth Interna- 2719–2734.
tional Conference on Learning Representations. [109] Fabio Petroni, Patrick S. H. Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang
[87] Haochen Liu, Jamell Dacon, Wenqi Fan, Hui Liu, Zitao Liu, and Jiliang Tang. 2020. Wu, Alexander H. Miller, and Sebastian Riedel. 2020. How Context Affects
Does Gender Matter? Towards Fairness in Dialogue Systems. In Proceedings of Language Models’ Factual Predictions. In AKBC.
the 28th International Conference on Computational Linguistics. 4403–4416. [110] Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu,
[88] Haochen Liu, Yiqi Wang, Wenqi Fan, Xiaorui Liu, Yaxin Li, Shaili Jain, Yunhao Alexander H Miller, and Sebastian Riedel. 2019. Language models as knowledge
Liu, Anil K Jain, and Jiliang Tang. 2021. Trustworthy ai: A computational bases? arXiv preprint arXiv:1909.01066 (2019).
perspective. arXiv preprint arXiv:2107.06641 (2021). [111] Gabriel Poesia, Alex Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher
[89] Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Meek, and Sumit Gulwani. 2022. Synchromesh: Reliable Code Generation from
Weizhu Chen. 2022. What Makes Good In-Context Examples for GPT-3?. In Pre-trained Language Models. In ICLR. OpenReview.net.
DeeLIO@ACL. Association for Computational Linguistics, 100–114. [112] Anupam Purwar and Rahul Sundar. 2023. Keyword Augmented Retrieval: Novel
[90] Shengchao Liu, Weili Nie, Chengpeng Wang, Jiarui Lu, Zhuoran Qiao, Ling Liu, framework for Information Retrieval integrated with speech interface. arXiv
Jian Tang, Chaowei Xiao, and Animashree Anandkumar. 2023. Multi-modal preprint arXiv:2310.04205 (2023).
molecule structure–text model for text-based retrieval and editing. Nature [113] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sand-
Machine Intelligence 5, 12 (2023), 1447–1457. hini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al.
[91] Ye Liu, Semih Yavuz, Rui Meng, Dragomir Radev, Caiming Xiong, and Yingbo 2021. Learning transferable visual models from natural language supervision.
Zhou. 2022. Uni-Parser: Unified Semantic Parser for Question Answering In International conference on machine learning. PMLR, 8748–8763.
on Knowledge Base and Database. In EMNLP. Association for Computational [114] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018.
Linguistics, 8858–8869. Improving language understanding by generative pre-training. (2018).
[92] Alejandro Lozano, Scott L Fleming, Chia-Chun Chiang, and Nigam Shah. 2023. [115] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya
Clinfo. ai: An open-source retrieval-augmented large language model system for Sutskever, et al. 2019. Language models are unsupervised multitask learners.
answering medical questions using scientific literature. In PACIFIC SYMPOSIUM OpenAI blog 1, 8 (2019), 9.
ON BIOCOMPUTING 2024. World Scientific, 8–23. [116] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
[93] Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits
Rajpurohit, Peter Clark, and Ashwin Kalyan. 2023. Dynamic Prompt Learn- of transfer learning with a unified text-to-text transformer. Journal of machine
ing via Policy Gradient for Semi-structured Mathematical Reasoning. In ICLR. learning research 21, 140 (2020), 1–67.
OpenReview.net. [117] Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin
[94] Yu Lu, Junwei Bao, Yan Song, Zichen Ma, Shuguang Cui, Youzheng Wu, and Leyton-Brown, and Yoav Shoham. 2023. In-context retrieval-augmented lan-
Xiaodong He. 2021. RevCore: Review-Augmented Conversational Recommen- guage models. Transactions of the Association for Computational Linguistics 11
dation. In ACL/IJCNLP (Findings) (Findings of ACL, Vol. ACL/IJCNLP 2021). As- (2023), 1316–1331.
sociation for Computational Linguistics, 1161–1173. [118] Ori Ram, Gal Shachaf, Omer Levy, Jonathan Berant, and Amir Globerson. 2022.
[95] Hongyin Luo, Tianhua Zhang, Yung-Sung Chuang, Yuan Gong, Yoon Kim, Learning to Retrieve Passages without Supervision. In NAACL-HLT. Association
Xixin Wu, Helen Meng, and James R. Glass. 2023. Search Augmented Instruction for Computational Linguistics, 2687–2700.
Learning. In EMNLP (Findings). Association for Computational Linguistics, 3717– [119] Parikshit Ram and Alexander G Gray. 2012. Maximum inner-product search
3729. using cone trees. In Proceedings of the 18th ACM SIGKDD international conference
[96] Man Luo, Xin Xu, Zhuyun Dai, Panupong Pasupat, Mehran Kazemi, Chitta Baral, on Knowledge discovery and data mining. 931–939.
Vaiva Imbrasaite, and Vincent Y Zhao. 2023. Dr. icl: Demonstration-retrieved [120] Juan Ramos et al. 2003. Using tf-idf to determine word relevance in document
in-context learning. arXiv preprint arXiv:2305.14128 (2023). queries. In Proceedings of the first instructional conference on machine learning,
[97] Chang Ma, Haiteng Zhao, Lin Zheng, Jiayi Xin, Qintong Li, Lijun Wu, Zhi- Vol. 242. Citeseer, 29–48.
hong Deng, Yang Lu, Qi Liu, and Lingpeng Kong. 2023. Retrieved Sequence [121] Rita Ramos, Bruno Martins, Desmond Elliott, and Yova Kementchedjhieva. 2023.
Augmentation for Protein Representation Learning. bioRxiv (2023), 2023–02. Smallcap: lightweight image captioning prompted with retrieval augmenta-
[98] Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. 2023. Query tion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
rewriting for retrieval-augmented large language models. arXiv preprint Recognition. 2840–2849.
arXiv:2305.14283 (2023). [122] Benjamin Z. Reichman and Larry Heck. 2024. Retrieval-Augmented Generation:
[99] Seiji Maekawa, Hayate Iso, Sairam Gurajada, and Nikita Bhutani. 2024. Retrieval Is Dense Passage Retrieval Retrieving? CoRR abs/2402.11035 (2024).
Helps or Hurts? A Deeper Dive into the Efficacy of Retrieval Augmentation to
[123] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings When and What to Retrieve for LLMs. arXiv preprint arXiv:2402.12052 (2024).
using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical [147] Nandan Thakur, Luiz Bonifacio, Xinyu Zhang, Odunayo Ogundepo, Ehsan
Methods in Natural Language Processing and the 9th International Joint Conference Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Boxing Chen,
on Natural Language Processing (EMNLP-IJCNLP). 3982–3992. Mehdi Rezagholizadeh, et al. 2023. NoMIRACL: Knowing When You Don’t
[124] Yubing Ren, Yanan Cao, Ping Guo, Fang Fang, Wei Ma, and Zheng Lin. 2023. Know for Robust Multilingual Retrieval-Augmented Generation. arXiv preprint
Retrieve-and-sample: Document-level event argument extraction via hybrid re- arXiv:2312.11361 (2023).
trieval augmentation. In Proceedings of the 61st Annual Meeting of the Association [148] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi,
for Computational Linguistics (Volume 1: Long Papers). 293–306. Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti
[125] Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models.
framework: BM25 and beyond. Foundations and Trends® in Information Retrieval arXiv preprint arXiv:2307.09288 (2023).
3, 4 (2009), 333–389. [149] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal.
[126] Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2022. Learning To Retrieve 2023. Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-
Prompts for In-Context Learning. In NAACL-HLT. Association for Computa- Intensive Multi-Step Questions. In The 61st Annual Meeting Of The Association
tional Linguistics, 2655–2671. For Computational Linguistics.
[127] Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2022. Retrieval- [150] Lifu Tu, Caiming Xiong, and Yingbo Zhou. 2022. Prompt-Tuning Can Be Much
augmented transformer for image captioning. In Proceedings of the 19th interna- Better Than Fine-Tuning on Cross-lingual Understanding With Multilingual
tional conference on content-based multimedia indexing. 1–7. Language Models. In EMNLP (Findings). Association for Computational Linguis-
[128] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, tics, 5478–5485.
Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2024. [151] Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou’, and Daniel Cer. 2022. SPoT:
Toolformer: Language models can teach themselves to use tools. Advances in Better Frozen Model Adaptation through Soft Prompt Transfer. In ACL (1).
Neural Information Processing Systems 36 (2024). Association for Computational Linguistics, 5039–5059.
[129] Minjoon Seo, Jinhyuk Lee, Tom Kwiatkowski, Ankur P Parikh, Ali Farhadi, and [152] Ante Wang, Linfeng Song, Qi Liu, Haitao Mi, Longyue Wang, Zhaopeng Tu,
Hannaneh Hajishirzi. 2019. Real-time open-domain question answering with Jinsong Su, and Dong Yu. 2023. Search-engine-augmented dialogue response
dense-sparse phrase index. arXiv preprint arXiv:1906.05807 (2019). generation with cheaply supervised query production. Artificial Intelligence 319
[130] Zhihong Shao, Yeyun Gong, Minlie Huang, Nan Duan, Weizhu Chen, et al. (2023), 103874.
2023. Enhancing Retrieval-Augmented Large Language Models with Iterative [153] Boxin Wang, Wei Ping, Peng Xu, Lawrence McAfee, Zihan Liu, Mohammad
Retrieval-Generation Synergy. In The 2023 Conference on Empirical Methods in Shoeybi, Yi Dong, Oleksii Kuchaiev, Bo Li, Chaowei Xiao, et al. 2023. Shall We
Natural Language Processing. Pretrain Autoregressive Language Models with Retrieval? A Comprehensive
[131] Fumin Shen, Wei Liu, Shaoting Zhang, Yang Yang, and Heng Tao Shen. 2015. Study. In Proceedings of the 2023 Conference on Empirical Methods in Natural
Learning binary codes for maximum inner product search. In Proceedings of the Language Processing. 7763–7786.
IEEE International Conference on Computer Vision. 4148–4156. [154] Hanbing Wang, Xiaorui Liu, Wenqi Fan, Xiangyu Zhao, Venkataramana Kini,
[132] Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Devendra Yadav, Fei Wang, Zhen Wen, Jiliang Tang, and Hui Liu. 2024. Re-
Nachmani, and Yaniv Taigman. 2023. kNN-Diffusion: Image Generation via thinking Large Language Model Architectures for Sequential Recommendations.
Large-Scale Retrieval. In ICLR. OpenReview.net. arXiv preprint arXiv:2402.09543 (2024).
[133] Kaize Shi, Xueyao Sun, Qing Li, and Guandong Xu. 2024. Compressing Long [155] Haoyu Wang, Tuo Zhao, and Jing Gao. 2024. BlendFilter: Advancing Retrieval-
Context for Enhancing RAG with AMR-based Concept Distillation. arXiv preprint arXiv:2405.03085 (2024).
[134] Peng Shi, Rui Zhang, He Bai, and Jimmy Lin. 2022. XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing. In EMNLP (Findings). Association for Computational Linguistics, 5248–5259.
[135] Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. Replug: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652 (2023).
[136] Guy Shtar. 2021. Multimodal machine learning for drug knowledge discovery. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 1115–1116.
[137] Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval Augmentation Reduces Hallucination in Conversation. In EMNLP (Findings). Association for Computational Linguistics, 3784–3803.
[138] Suzanna Sia and Kevin Duh. 2023. In-context learning as maintaining coherency: A study of on-the-fly machine translation using large language models. arXiv preprint arXiv:2305.03573 (2023).
[139] Devendra Singh, Siva Reddy, Will Hamilton, Chris Dyer, and Dani Yogatama. 2021. End-to-end training of multi-document reader and retriever for open-domain question answering. Advances in Neural Information Processing Systems 34 (2021), 25968–25981.
[140] Amit Singhal et al. 2001. Modern information retrieval: A brief overview. IEEE Data Eng. Bull. 24, 4 (2001), 35–43.
[141] Shamane Siriwardhana, Rivindu Weerasekera, Elliott Wen, Tharindu Kaluarachchi, Rajib Rana, and Suranga Nanayakkara. 2023. Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering. Transactions of the Association for Computational Linguistics 11 (2023), 1–17.
[142] Karen Sparck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation 28, 1 (1972), 11–21.
[143] Hongjin Su, Jungo Kasai, Chen Henry Wu, Weijia Shi, Tianlu Wang, Jiayi Xin, Rui Zhang, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Tao Yu. 2023. Selective Annotation Makes Language Models Better Few-Shot Learners. In ICLR. OpenReview.net.
[144] Fang Sun, Zhihao Zhan, Hongyu Guo, Ming Zhang, and Jian Tang. 2023. Graphvf: Controllable protein-specific 3d molecule generation with variational flow. arXiv preprint arXiv:2304.12825 (2023).
[145] Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, and Felix Yu. 2024. Spectr: Fast speculative decoding via optimal transport. Advances in Neural Information Processing Systems 36 (2024).
[146] Jiejun Tan, Zhicheng Dou, Yutao Zhu, Peidong Guo, Kun Fang, and Ji-Rong Wen. 2024. Small Models, Big Insights: Leveraging Slim Proxy Models To Decide
Augmented Large Language Models via Query Generation Blending and Knowledge Filtering. arXiv preprint arXiv:2402.11129 (2024).
[156] Liang Wang, Nan Yang, and Furu Wei. 2023. Query2doc: Query Expansion with Large Language Models. In EMNLP. Association for Computational Linguistics, 9414–9423.
[157] Liang Wang, Nan Yang, and Furu Wei. 2024. Learning to Retrieve In-Context Examples for Large Language Models. In EACL (1). Association for Computational Linguistics, 1752–1767.
[158] Xintao Wang, Qianwen Yang, Yongting Qiu, Jiaqing Liang, Qianyu He, Zhouhong Gu, Yanghua Xiao, and Wei Wang. 2023. Knowledgpt: Enhancing large language models with retrieval and storage access on knowledge bases. arXiv preprint arXiv:2308.11761 (2023).
[159] Yile Wang, Peng Li, Maosong Sun, and Yang Liu. 2023. Self-Knowledge Guided Retrieval Augmentation for Large Language Models. In The 2023 Conference on Empirical Methods in Natural Language Processing.
[160] Zichao Wang, Weili Nie, Zhuoran Qiao, Chaowei Xiao, Richard G. Baraniuk, and Anima Anandkumar. 2023. Retrieval-based Controllable Molecule Generation. In ICLR. OpenReview.net.
[161] Zifeng Wang, Zichen Wang, Balasubramaniam Srinivasan, Vassilis N Ioannidis, Huzefa Rangwala, and Rishita Anubhai. 2023. BioBridge: Bridging Biomedical Foundation Models via Knowledge Graph. arXiv preprint arXiv:2310.03320 (2023).
[162] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.
[163] Junda Wu, Cheng-Chun Chang, Tong Yu, Zhankui He, Jianing Wang, Yupeng Hou, and Julian McAuley. 2024. CoRAL: Collaborative Retrieval-Augmented Large Language Models Improve Long-tail Recommendation. arXiv preprint arXiv:2403.06447 (2024).
[164] Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. 2020. Scalable Zero-shot Entity Linking with Dense Entity Retrieval. In EMNLP (1). Association for Computational Linguistics, 6397–6407.
[165] Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, and Christian Szegedy. 2022. Memorizing Transformers. In ICLR. OpenReview.net.
[166] Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. 2023. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. arXiv preprint arXiv:2306.13063 (2023).
[167] Benfeng Xu, Chunxu Zhao, Wenbin Jiang, PengFei Zhu, Songtai Dai, Chao Pang, Zhuo Sun, Shuohuan Wang, and Yu Sun. 2023. Retrieval-augmented domain adaptation of language models. In Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023). 54–64.
[168] Fangyuan Xu, Weijia Shi, and Eunsol Choi. 2023. RECOMP: Improving retrieval-augmented LMs with context compression and selective augmentation. In The Twelfth International Conference on Learning Representations.
[169] Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2019. BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis. In NAACL-HLT (1). Association for Computational Linguistics, 2324–2335.
[170] Jitao Xu, Josep-Maria Crego, and Jean Senellart. 2020. Boosting neural machine translation with similar translations. In Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1570–1579.
[171] Jing Xu, Arthur Szlam, and Jason Weston. 2022. Beyond Goldfish Memory: Long-Term Open-Domain Conversation. In ACL (1). Association for Computational Linguistics, 5180–5197.
[172] Shicheng Xu, Liang Pang, Huawei Shen, Xueqi Cheng, and Tat-seng Chua. 2023. Search-in-the-chain: Towards the accurate, credible and traceable content generation for complex knowledge-intensive tasks. arXiv preprint arXiv:2304.14732 (2023).
[173] Haoyan Yang, Zhitao Li, Yong Zhang, Jianzong Wang, Ning Cheng, Ming Li, and Jing Xiao. 2023. PRCA: Fitting Black-Box Large Language Models for Retrieval Question Answering via Pluggable Reward-Driven Contextual Adapter. In EMNLP. Association for Computational Linguistics, 5364–5375.
[174] Ling Yang, Zhilin Huang, Xiangxin Zhou, Minkai Xu, Wentao Zhang, Yu Wang, Xiawu Zheng, Wenming Yang, Ron O Dror, Shenda Hong, et al. 2023. Prompt-based 3d molecular diffusion models for structure-based drug design. (2023).
[175] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In ICLR. OpenReview.net.
[176] Jiacheng Ye, Zhiyong Wu, Jiangtao Feng, Tao Yu, and Lingpeng Kong. 2023. Compositional exemplars for in-context learning. In International Conference on Machine Learning. PMLR, 39818–39833.
[177] Yunhu Ye, Binyuan Hui, Min Yang, Binhua Li, Fei Huang, and Yongbin Li. 2023. Large Language Models are Versatile Decomposers: Decomposing Evidence and Questions for Table-based Reasoning. In SIGIR. ACM, 174–184.
[178] Antonio Jimeno Yepes, Yao You, Jan Milczek, Sebastian Laverde, and Leah Li. 2024. Financial Report Chunking for Effective Retrieval Augmented Generation. arXiv preprint arXiv:2402.05131 (2024).
[179] Dawei Yin, Yuening Hu, Jiliang Tang, Tim Daly, Mianwei Zhou, Hua Ouyang, Jianhui Chen, Changsung Kang, Hongbo Deng, Chikashi Nobata, et al. 2016. Ranking relevance in yahoo search. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 323–332.
[180] Dani Yogatama, Cyprien de Masson d’Autume, and Lingpeng Kong. 2021. Adaptive semiparametric language models. Transactions of the Association for Computational Linguistics 9 (2021), 362–373.
[181] Ori Yoran, Tomer Wolfson, Ori Ram, and Jonathan Berant. 2023. Making Retrieval-Augmented Language Models Robust to Irrelevant Context. In The Twelfth International Conference on Learning Representations.
[182] Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, and Meng Jiang. 2023. Generate rather than Retrieve: Large Language Models are Strong Context Generators. In ICLR. OpenReview.net.
[183] Wenhao Yu, Zhihan Zhang, Zhenwen Liang, Meng Jiang, and Ashish Sabharwal. 2023. Improving language models via plug-and-play retrieval feedback. arXiv preprint arXiv:2305.14002 (2023).
[184] Zichun Yu, Chenyan Xiong, Shi Yu, and Zhiyuan Liu. 2023. Augmentation-Adapted Retriever Improves Generalization of Language Models as Generic Plug-In. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2421–2436.
[185] Daoguang Zan, Bei Chen, Zeqi Lin, Bei Guan, Yongji Wang, and Jian-Guang Lou. 2022. When Language Model Meets Private Library. In EMNLP (Findings). Association for Computational Linguistics, 277–288.
[186] Shenglai Zeng, Jiankun Zhang, Pengfei He, Yue Xing, Yiding Liu, Han Xu, Jie Ren, Shuaiqiang Wang, Dawei Yin, Yi Chang, et al. 2024. The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG). arXiv preprint arXiv:2402.16893 (2024).
[187] Boyu Zhang, Hongyang Yang, Tianyu Zhou, Muhammad Ali Babar, and Xiao-Yang Liu. 2023. Enhancing financial sentiment analysis via retrieval augmented large language models. In Proceedings of the Fourth ACM International Conference on AI in Finance. 349–356.
[188] Houyu Zhang, Zhenghao Liu, Chenyan Xiong, and Zhiyuan Liu. 2020. Grounded Conversation Generation as Guided Traverses in Commonsense Knowledge Graphs. In ACL. Association for Computational Linguistics, 2031–2043.
[189] Jiahao Zhang, Rui Xue, Wenqi Fan, Xin Xu, Qing Li, Jian Pei, and Xiaorui Liu. 2024. Linear-Time Graph Neural Networks for Scalable Recommendations. arXiv preprint arXiv:2402.13973 (2024).
[190] Yunxiang Zhang, Muhammad Khalifa, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, and Lu Wang. 2023. Merging generated and retrieved knowledge for open-domain QA. arXiv preprint arXiv:2310.14393 (2023).
[191] Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2023. Automatic Chain of Thought Prompting in Large Language Models. In ICLR. OpenReview.net.
[192] Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. 2024. Retrieval-Augmented Generation for AI-Generated Content: A Survey. arXiv preprint arXiv:2402.19473 (2024).
[193] Ruochen Zhao, Hailin Chen, Weishi Wang, Fangkai Jiao, Xuan Long Do, Chengwei Qin, Bosheng Ding, Xiaobao Guo, Minzhi Li, Xingxuan Li, et al. 2023. Retrieving multimodal information for augmented generation: A survey. arXiv preprint arXiv:2303.10868 (2023).
[194] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023).
[195] Zihuai Zhao, Wenqi Fan, Jiatong Li, Yunqing Liu, Xiaowei Mei, Yiqi Wang, Zhen Wen, Fei Wang, Xiangyu Zhao, Jiliang Tang, et al. 2024. Recommender systems in the era of large language models (llms). IEEE Transactions on Knowledge and Data Engineering (2024).
[196] Zexuan Zhong, Tao Lei, and Danqi Chen. 2022. Training Language Models with Memory Augmentation. In 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022.
[197] Shuyan Zhou, Uri Alon, Frank F Xu, Zhengbao Jiang, and Graham Neubig. 2022. Docprompting: Generating code by retrieving the docs. In The Eleventh International Conference on Learning Representations.
[198] Yin Zhu, Zhiling Luo, and Gong Cheng. 2023. Furthest Reasoning with Plan Assessment: Stable Reasoning Path with Retrieval-Augmented Large Language Models. arXiv preprint arXiv:2309.12767 (2023).
[199] Yinghao Zhu, Changyu Ren, Shiyun Xie, Shukai Liu, Hangyuan Ji, Zixiang Wang, Tao Sun, Long He, Zhoujun Li, Xi Zhu, et al. 2024. REALM: RAG-Driven Enhancement of Multimodal Electronic Health Records Analysis via Large Language Models. arXiv preprint arXiv:2402.07016 (2024).
[200] Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. 2024. PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models. arXiv preprint arXiv:2402.07867 (2024).