
From Matching to Generation: A Survey on Generative

Information Retrieval
XIAOXI LI and JIAJIE JIN, Renmin University of China, China
YUJIA ZHOU, Tsinghua University, China
YUYAO ZHANG and PEITIAN ZHANG, Renmin University of China, China

YUTAO ZHU and ZHICHENG DOU∗ , Renmin University of China, China


Information Retrieval (IR) systems are crucial tools for users to access information, which have long been
dominated by traditional methods relying on similarity matching. With the advancement of pre-trained
language models, generative information retrieval (GenIR) emerges as a novel paradigm, attracting increasing
attention. Based on the form of information provided to users, current research in GenIR can be categorized
into two aspects: (1) Generative Document Retrieval (GR) leverages the generative model’s parameters for
memorizing documents, enabling retrieval by directly generating relevant document identifiers without explicit
indexing. (2) Reliable Response Generation employs language models to directly generate information
users seek, breaking the limitations of traditional IR in terms of document granularity and relevance matching
while offering flexibility, efficiency, and creativity to meet practical needs. This paper aims to systematically
review the latest research progress in GenIR. We will summarize the advancements in GR regarding model
training and structure, document identifier, incremental learning, etc., as well as progress in reliable response
generation in aspects of internal knowledge memorization, external knowledge augmentation, etc. We also
review the evaluation, challenges and future developments in GenIR systems. This review aims to offer a
comprehensive reference for researchers, encouraging further development in the GenIR field. 1
CCS Concepts: • Information systems → Retrieval models and ranking.
Additional Key Words and Phrases: Generative Information Retrieval; Generative Document Retrieval; Reliable
Response Generation
ACM Reference Format:
Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yuyao Zhang, Peitian Zhang, Yutao Zhu, and Zhicheng Dou. 2025. From
Matching to Generation: A Survey on Generative Information Retrieval. ACM Trans. Inf. Syst. 1, 1 (March 2025),
63 pages. https://doi.org/XXXXXXX.XXXXXXX

1 INTRODUCTION
Information retrieval (IR) systems are crucial for navigating the vast sea of online information
in today’s digital landscape. From using search engines such as Google [76], Bing [196], and
Baidu [209], to engaging with question-answering or dialogue systems like ChatGPT [209] and
Bing Chat [197], and discovering content via recommendation platforms like Amazon [4] and
∗ Zhicheng Dou is the corresponding author.
1 Github Repository: https://github.com/RUC-NLPIR/GenIR-Survey

Authors’ addresses: Xiaoxi Li, [email protected]; Jiajie Jin, [email protected], Renmin University of China, Beijing, China; Yujia Zhou, [email protected], Tsinghua University, Beijing, China; Yuyao Zhang, [email protected]; Peitian Zhang, [email protected], Renmin University of China, Beijing, China; Yutao Zhu, [email protected]; Zhicheng Dou, [email protected], Renmin University of China, Beijing, China.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2025 Association for Computing Machinery.
1046-8188/2025/3-ART $15.00
https://doi.org/XXXXXXX.XXXXXXX


[Figure 1 illustration: (a) Traditional Retrieval, (b) Generative Retrieval, and (c) Response Generation (Direct Information Accessing), using the example query "How to cook a steak?": a ranked list of web results versus constrained DocID generation versus a directly generated, cited step-by-step answer.]
Fig. 1. Exploring IR Evolution: From Traditional to Generative Methods - This diagram illustrates the shift
from traditional similarity-based document matching (a) to GenIR techniques. Current GenIR methods can
be categorized into two types: generative retrieval (b), which retrieves documents by directly generating
relevant DocIDs constrained by a DocID prefix tree; and response generation (c), which directly generates
reliable and user-centric answers.

YouTube [77], IR technologies are integral to our everyday online experiences. These systems are
reliable and play a key role in spreading knowledge and ideas globally.
Traditional IR systems primarily rely on sparse retrieval methods based on word-level match-
ing. These methods, which include Boolean Retrieval [242], BM25 [238], SPLADE [65], and Uni-
COIL [163], establish connections between vocabulary and documents, offering high retrieval
efficiency and robust system performance. With the rise of deep learning, dense retrieval methods
such as DPR [117] and ANCE [324], based on the bidirectional encoding representations from the
BERT model [121], capture the deep semantic information of documents, significantly improv-
ing retrieval precision. Although these methods have achieved leaps in accuracy, they rely on
large-scale document indices [57, 187] and cannot be optimized in an end-to-end way. Moreover,
when people search for information, what they really need is a precise and reliable answer. This ranked-list-based IR approach still requires users to spend time synthesizing the answers they need from the returned documents, which is suboptimal for information seeking [195].
Transformer-based pre-trained language models such as T5 [231], BART [138], and GPT [228] have demonstrated strong text generation capabilities. In
recent years, large language models (LLMs) have brought about revolutionary changes in the
field of AI-generated content (AIGC) [19, 359]. Based on large pre-training corpora and advanced
training techniques like RLHF [36], LLMs [8, 105, 209, 286] have made significant progress in
natural language tasks, such as dialogue [209, 282] and question answering [174, 225]. The rapid
development of LLMs is transforming IR systems, giving rise to a new paradigm of generative
information retrieval (GenIR), which achieves IR goals through generative approaches.
As envisioned by Metzler et al. [195], in order to build an IR system that can respond like a
domain expert, the system should not only provide accurate responses but also include source
citations to ensure the credibility of the results. To achieve this, GenIR models must possess both
sufficient memorized knowledge and the ability to recall the associations between knowledge and
source documents, which could be the final goal of GenIR systems. Currently, research in GenIR
primarily focuses on two main patterns: (1) Generative Document Retrieval (GR), which involves
retrieving documents by generating their identifiers; and (2) Reliable Response Generation,


which entails directly generating user-centric responses with reliability enhancement strategies.
Note that although these two methods have not yet been technically integrated, they represent the two primary forms by which IR systems present information to users in a generative manner: either
by generating lists of document identifiers or by generating reliable and user-centric responses.
Figure 1 illustrates the difference between these two forms. These strategies are essential to the
next generation of information retrieval and constitute the central focus of this survey.
Generative document retrieval, a new retrieval paradigm based on generative models, is
garnering increasing attention. This approach leverages the parametric memory of generative
models to directly generate document identifiers (DocIDs) related to the documents [18, 281, 307,
371]. Figure 1 illustrates this transition, where traditional IR systems match queries to documents
based on an indexed database (Figure 1(a)), while generative methods use language models to
retrieve by directly generating relevant document identifiers (Figure 1(b)). Specifically, GR assigns
a unique identifier to each document, which can be numeric-based or text-based, and then trains a
generative retrieval model to learn the mapping from queries to the relevant DocIDs. This allows
the model to index documents using its internal parameters. During inference, GR models use
constrained beam search to limit the generated DocIDs to be valid within the corpus, ranking them
based on generation probability to produce a ranked list of DocIDs. This eliminates the need for
large-scale document indexes in traditional methods, enabling end-to-end training of the model.
Recent studies on generative retrieval have delved into model training and structure [6, 153,
281, 307, 365, 369, 372], document identifier design [18, 265, 281, 288, 330], continual learning on
dynamic corpora [80, 124, 192], downstream task adaptation [27, 28, 152], multi-modal generative
retrieval [157, 178, 357], and generative recommender systems [74, 233, 304]. The progress in GR
is shifting retrieval systems from matching to generation. It has also led to the emergence of
workshops [10] and tutorials [279]. However, there is currently no comprehensive review that
systematically organizes the research, challenges, and prospects of this emerging field.
Reliable response generation is also a promising direction in the IR field, offering user-
centric and accurate answers that directly meet users' needs. LLMs are particularly adept at following instructions [359], can generate customized responses, and can even cite their knowledge sources [204, 223], making direct response generation a new and intuitive way
to access information [54, 75, 241, 315, 367]. As illustrated in Figure 1, the generative approach
marks a significant shift from traditional IR systems, which return a ranked list of documents (as
shown in Figure 1(a,b)). Instead, response generation methods (depicted in Figure 1(c)) offer a more
dynamic form of information access by directly generating detailed, user-centric responses, thereby
providing a richer and more immediate understanding of the information need behind the users’
queries.
However, the responses generated by language models may not always be reliable. They have
the potential to generate irrelevant answers [85], contradict factual information [90, 104], provide
outdated data [291], or generate toxic content [93, 263]. Consequently, these limitations render them
unsuitable for many scenarios that require accurate and up-to-date information. To address these
challenges, the academic community has developed strategies across four key aspects: enhancing
internal knowledge [16, 37, 56, 119, 132, 193, 243, 267, 285]; augmenting external knowledge [5, 113,
139, 151, 204, 245, 333]; generating responses with citation [129, 142, 156, 204, 314]; and improving
personal information assistance [149, 172, 295, 327]. Despite these efforts, there is still a lack of
a systematic review that organizes the existing research under this new paradigm of generative
information access.
This survey systematically reviews the latest research progress and future developments in the field of GenIR, as shown in Figure 2, which displays the classification of research related to
the GenIR system. We will introduce background knowledge in Section 2, generative document


• Generative Document Retrieval (Sec 3)
  - Model Training and Structure (Sec 3.1)
    Training (3.1.1): DSI [281], DynamicRetriever [370], NCI [307], DSI-QG [375], Chen et al. [29], LTRGR [159], GenRRL [365], DGR [161], ListGR [280]
    Structure (3.1.2): NCI [307], TOME [237], NP Decoding [130], MEVI [346], DiffusionRet [224], GDR [342], Self-Retrieval [275], PAG [345]
  - Document Identifier (Sec 3.2)
    Numeric (3.2.1): DSI [281], DynamicRetriever [370], Ultron [371], GenRet [265], Tied-Atomic [206], MEVI [346], LMIndexer [112], ASI [330], RIPOR [344]
    Text (3.2.2): GENRE [18], SEAL [13], Ultron [371], LLM-URL [376], UGR [26], MINDER [160], AutoTSG [352], SE-DSI [278], NOVO [311], GLEN [135]
  - Incremental Learning (Sec 3.3): DSI++ [192], IncDSI [124], CLEVER [25], CorpusBrain++ [80]
  - Downstream Task Adaptation (Sec 3.4)
    Separate Training (3.4.1): GERE [27], CorpusBrain [28], GMR [131], DearDR [283], CodeDSI [203], UGR [26], GCoQA [158], Re3val [257]
    Joint Training (3.4.2): UniGen [155], CorpusLM [152], RetroLLM [153]
    Multi-Modal (3.4.3): IRGen [357], GeMKR [178], GRACE [157]
    Generative Recommender Systems (3.4.4): P5 [74], TIGER [233], SEATER [254], IDGenRec [273], LC-Rec [360], ColaRec [309]
• Reliable Response Generation (Sec 4)
  - Internal Knowledge Memorization (Sec 4.1)
    Structure (4.1.1): Model Scaling: GPT-3 [16], BLOOM [243], LLaMA [285]; Model Structure: PaLM [34], Mixtral 8x7B [106]
    Training and Inference (4.1.2): Training: Sadeq et al. [239], FactTune [211]; Inference: GenRead [339], RECITE [268], DoLa [37]
    Knowledge Updating (4.1.3): Incremental Learning: Ernie 2.0 [267], DAP [119], DynaInst [201]; Knowledge Editing: KE [17], MEND [199], ROME [193]
  - External Knowledge Augmentation (Sec 4.2)
    Retrieval (4.2.1): Sequential: RAG [139], RRR [183], ARL2 [350]; Branching: TOC [123], BlendFilter [297], REPLUG [252]; Conditional: SKR [308], Self-DC [296], Rowen [52]; Loop: Iter-RetGen [248], IR-COT [287], FLARE [111], Self-RAG [5], Search-o1 [151]
    Tool (4.2.2): Search Engine: ReAct [333], WebGPT [204]; Knowledge Graph: StructGPT [110], ToG [262], RoG [181]; API-based Tools: Toolformer [245], ToolLLM [226], AssistGPT [69]; Model-based Tools: HuggingGPT [250], Visual ChatGPT [317]
  - Generating Response with Citation (Sec 4.3)
    Direct Citation (4.3.1): According-to Prompting [314], IFL [129], Fierro et al. [63], Credible without Credit [217], 1-PAGER [97], Khalifa et al. [122]
    Retrieval-based Citation (4.3.2): WebGPT [204], WebBrain [223], RARR [70], SearChain [326], LLatrieval [71], VTG [261], CEG [150], APO [142]
  - Personal Information Assistant (Sec 4.4)
    Personalized Dialogue (4.4.1): Zhang et al. [354], P2Bot [172], Wu et al. [322], SAFARI [295], Personalized Soups [98], OPPU [274]
    Domain-specific (4.4.2): Healthcare: Zhongjing [177], Mental-LLM [327]; Academic: RevGAN [149], Pearl [202]; Education: EduChat [48]; Recipe, Robot, etc.
• Evaluation (Sec 5)
  - Generative Document Retrieval (Sec 5.1)
    Metrics (5.1.1): Recall, MRR [40], R-Precision, MAP, nDCG [101]
    Benchmarks (5.1.2): MS MARCO [205], NQ [126], TriviaQA [115], KILT [218], TREC DL 19 & 20 [41, 42], DynamicIR [337], Liu et al. [176], ExcluIR [356]
    Analysis (5.1.3): Chen et al. [29], Pradeep et al. [220], Liu et al. [176], Wu et al. [319]
    Experiments (5.1.4): Performance comparison on the MS MARCO [205], NQ [126], and KILT [218] benchmarks
  - Reliable Response Generation (Sec 5.2)
    Metrics (5.2.1): Rule-based: EM, BLEU [213], ROUGE [162], Perplexity; Model-based: BERTScore [355], BLEURT [247], GPTScore [66], FActScore [198]; Human Evaluation: Comprehensibility, Relevance, Fluency
    Benchmarks (5.2.2): General: MMLU [84], BIG-bench [259], LLM-Eval [166]; Tool: API-Bank [148], ToolBench [226]; Factuality: TruthfulQA [164], ALCE [71], HaluEval [144]; Real Time: RealTime QA [118], FreshQA [291]; Trustworthy: SafetyBench [358], TrustGPT [93], TrustLLM [263]
• Challenges and Prospects (Sec 6)
  - Generative Document Retrieval (Sec 6.1): Scalability (6.1.1); Dynamic Corpora (6.1.2); Document Representation (6.1.3); Efficiency (6.1.4); Multi-modal (6.1.5)
  - Reliable Response Generation (Sec 6.2): Accuracy and Factuality (6.2.1); Real-time Property (6.2.2); Bias and Fairness (6.2.3); Privacy and Security (6.2.4)
  - Unified Framework (Sec 6.3): Unified Framework for Retrieval and Generation (6.3.1); Towards End2end Framework for Various IR Tasks (6.3.2)

Fig. 2. Taxonomy of research on generative information retrieval: investigating generative document retrieval,
reliable response generation, evaluation, challenges and prospects.


retrieval technologies in Section 3, direct information accessing with generative language models in
Section 4, evaluation in Section 5, current challenges and future directions in Section 6, respectively.
Section 7 will summarize the content of this review. This article is the first to systematically organize
the research, evaluation, challenges and prospects of generative IR, while also looking forward to
the potential and importance of GenIR’s future development. Through this review, readers will
gain a deep understanding of the latest progress in developing GenIR systems and how it shapes
the future of information access. The main contribution of this survey is summarized as follows:
• First comprehensive survey on generative information retrieval (GenIR): This survey is
the first to comprehensively organize the techniques, evaluation, challenges, and prospects on
the emerging field of GenIR, providing a deep understanding of the latest progress in developing
GenIR systems and its future in shaping information access.
• Systematic categorization and in-depth analysis: The survey offers a systematic catego-
rization of research related to GenIR systems, including generative document retrieval and reliable response generation. It provides an in-depth analysis of each category, covering model training
and structure, document identifier, etc. in generative document retrieval; internal knowledge
memorization, external knowledge enhancement, etc. for reliable response generation.
• Comprehensive review of evaluation metrics and benchmarks: The survey reviews a
range of widely used evaluation metrics and benchmark datasets for assessing GenIR methods, alongside analysis of the effectiveness and weaknesses of existing GenIR methods.
• Discussions of current challenges and future directions: The survey identifies and discusses
the current challenges faced in the GenIR field. We also provide potential solutions for each
challenge and outline future research directions for building GenIR systems.

2 BACKGROUND AND PRELIMINARIES


Information retrieval techniques aim at efficiently obtaining, processing, and understanding infor-
mation from massive data. Technological advancements have continuously driven the evolution of
these methods: from early keyword-based sparse retrieval to deep learning-based dense retrieval,
and more recently, to generative retrieval, large language models, and their augmentation tech-
niques. Each advancement enhances retrieval accuracy and efficiency, catering to the complex and
diverse query needs of users.

2.1 Traditional Information Retrieval


Sparse Retrieval. In the field of traditional information retrieval, sparse retrieval techniques
implement fast and accurate document retrieval through the inverted index method. Inverted
indexing technology maps each unique term to a list of all documents containing that term,
providing an efficient means for information retrieval in large document collections. Among these
methods, TF-IDF (Term Frequency-Inverse Document Frequency) [235] is a particularly important
statistical tool used to assess the importance of a word in a document collection, thereby widely
applied in various traditional retrieval systems.
The core of sparse retrieval technology lies in evaluating the relevance between documents
and user queries. Specifically, given a document collection D and a user query 𝑞, traditional
information retrieval systems identify and retrieve information by calculating the relevance R
between document 𝑑 and query 𝑞. This relevance evaluation typically relies on the similarity
measure between document 𝑑 and query 𝑞, as shown below:
$\mathcal{R}(q, d) = \sum_{t \in q \cap d} \text{tf-idf}(t, d) \cdot \text{tf-idf}(t, q)$,  (1)


where 𝑡 represents the terms common to both query 𝑞 and document 𝑑, and tf-idf(𝑡, 𝑑) and tf-idf(𝑡, 𝑞)
represent the TF-IDF weights of term 𝑡 in document 𝑑 and query 𝑞, respectively. Although sparse retrieval methods like TF-IDF [235] and BM25 [238] excel at fast retrieval, they struggle with complex queries involving synonyms, specialized terms, or context, as term matching and TF-IDF may not fully meet users' information needs [180].
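To make Equation (1) concrete, the following minimal Python sketch scores a small toy corpus with plain TF-IDF weights; the corpus, tokenization, and the log-based IDF variant are illustrative assumptions, not any particular system's implementation.

```python
import math
from collections import Counter

# Toy corpus; any tokenized document collection works the same way.
corpus = {
    "d1": "how to cook a steak on a stovetop".split(),
    "d2": "steak recipes and cooking times".split(),
    "d3": "how to train a retrieval model".split(),
}

def idf(term, docs):
    # Inverse document frequency: log(N / df), a common TF-IDF variant.
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, tokens, docs):
    # Term frequency in this text times the corpus-level IDF of the term.
    return tokens.count(term) * idf(term, docs)

def score(query_tokens, doc_tokens, docs):
    # Equation (1): sum over terms shared by the query and the document.
    shared = set(query_tokens) & set(doc_tokens)
    return sum(tf_idf(t, doc_tokens, docs) * tf_idf(t, query_tokens, docs) for t in shared)

query = "how to cook steak".split()
ranking = sorted(corpus, key=lambda d: score(query, corpus[d], corpus), reverse=True)
print(ranking)  # documents ordered by TF-IDF relevance to the query
```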
Dense Retrieval. The advent of pre-trained language models like BERT [121] has revolutionized
information retrieval, leading to the development of dense retrieval methods, like DPR [117],
ANCE [324], E5 [298], SimLM [299]. Unlike traditional sparse retrieval, these methods leverage
Transformer-based encoders to create dense vector representations for both queries and documents.
This approach enhances the capability to grasp the underlying semantics, thereby improving
retrieval accuracy.
The core of dense retrieval lies in converting documents and queries into vector representations. Each document $d$ is transformed into a dense vector $\mathbf{v}_d$ through a pre-trained language model, and similarly, each query $q$ is transformed into a vector $\mathbf{v}_q$. Specifically, we can use encoder functions $E_d(\cdot)$ and $E_q(\cdot)$ to represent the encoding process for documents and queries, respectively:
$\mathbf{v}_d = E_d(d), \quad \mathbf{v}_q = E_q(q)$,  (2)
where 𝐸𝑑 (·) and 𝐸𝑞 (·) can be the same model or different models optimized for specific tasks.
Dense retrieval methods evaluate relevance by calculating the similarity between the query
vector and document vector, which can be measured by cosine similarity, expressed as follows:
$\mathcal{R}(q, d) = \cos(\mathbf{v}_q, \mathbf{v}_d) = \dfrac{\mathbf{v}_q \cdot \mathbf{v}_d}{|\mathbf{v}_q|\,|\mathbf{v}_d|}$,  (3)
where v𝑞 · v𝑑 represents the dot product of query vector v𝑞 and document vector v𝑑 , and |v𝑞 | and
|v𝑑 | respectively represent the magnitudes of the query and document vector. Finally, documents
are ranked based on these similarity scores to identify the most relevant ones for the user.
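As a minimal illustration of Equations (2) and (3), the sketch below ranks documents by cosine similarity between query and document vectors; the encode function is a random-vector stand-in for a real pre-trained bi-encoder such as those mentioned above.

```python
import numpy as np

def encode(texts):
    # Placeholder encoder: in practice this would be a pre-trained bi-encoder
    # (e.g., a BERT-based model); random vectors keep the sketch self-contained.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 768))

docs = ["doc about cooking steak", "doc about dense retrieval", "doc about BM25"]
query = "how to cook a steak"

doc_vecs = encode(docs)          # v_d = E_d(d)
query_vec = encode([query])[0]   # v_q = E_q(q)

# Equation (3): cosine similarity between the query vector and each document vector.
sims = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
ranking = np.argsort(-sims)
print([docs[i] for i in ranking])
```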

2.2 Generative Retrieval


With the significant progress of language models, generative retrieval has emerged as a new
direction in the field of information retrieval [195, 281, 328]. Unlike traditional index-based retrieval
methods, generative retrieval relies on pre-trained generative language models, such as T5 [231]
and BART [138], to directly generate document identifiers (DocIDs) relevant to the query, thereby
achieving end-to-end retrieval without relying on large-scale pre-built document indices.
DocID Construction and Prefix Constraints. To facilitate generative retrieval, each document $d$ in the corpus $\mathcal{D} = \{d_1, d_2, \ldots, d_N\}$ is assigned a unique document identifier $d'$, forming the set $\mathcal{D}' = \{d'_1, d'_2, \ldots, d'_N\}$. This mapping is typically established via a bijective function $\phi: \mathcal{D} \rightarrow \mathcal{D}'$, ensuring that:
$\phi(d_i) = d'_i, \quad \forall d_i \in \mathcal{D}$.  (4)
To enable the language model to generate only valid DocIDs during inference, we construct prefix
constraints based on D ′ . This is typically implemented using a trie (prefix tree), where each path
from the root to a leaf node corresponds to a valid DocID.
Constrained Beam Search. Given a query 𝑞, the generative retrieval model aims to generate
the top-𝑘 DocIDs that are most relevant to 𝑞. The language model 𝑃 (·|𝑞; 𝜃 ) generates DocIDs token
by token, guided by the prefix constraints. At each decoding step 𝑖, only those tokens that extend
the current partial sequence $d'_{<i}$ into a valid prefix of some DocID in $\mathcal{D}'$ are considered. Formally, the set of allowable next tokens is:
$\mathcal{V}(d'_{<i}) = \{ v \mid \exists\, d' \in \mathcal{D}' \text{ such that } d'_{<i}\, v \text{ is a prefix of } d' \}$.  (5)


By employing constrained beam search, the model efficiently explores the space of valid DocIDs,
maintaining a beam of the most probable sequences at each decoding step while adhering to the
DocID prefix constraints.
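The following sketch illustrates the prefix-tree constraint of Equation (5) together with a simplified constrained beam search; the DocIDs, token ids, and the next_token_probs function are hypothetical placeholders for the outputs of a trained seq2seq GR model conditioned on the query.

```python
from math import log

def build_trie(docids):
    # Each DocID is a sequence of token ids; the trie stores all valid prefixes.
    root = {}
    for seq in docids:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root

def allowed_next(trie, prefix):
    # Equation (5): tokens v such that prefix + [v] is a prefix of some valid DocID.
    node = trie
    for tok in prefix:
        node = node.get(tok, {})
    return set(node.keys())

def constrained_beam_search(trie, next_token_probs, beam_size=2, max_len=3):
    # next_token_probs(prefix) -> {token: probability}; in a real system this
    # comes from the generative model conditioned on the query (hypothetical here).
    beams = [([], 0.0)]  # (partial DocID, accumulated log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            probs = next_token_probs(prefix)
            for tok in allowed_next(trie, prefix):              # prune invalid tokens
                candidates.append((prefix + [tok], score + log(probs.get(tok, 1e-9))))
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
    return beams  # ranked list of (DocID, log P(DocID | q))

# Example: three valid DocIDs, each three tokens long.
trie = build_trie([[1, 4, 2], [1, 4, 7], [3, 5, 9]])
uniform = lambda prefix: {t: 0.1 for t in range(10)}            # stand-in for the model
print(constrained_beam_search(trie, uniform))
```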
Document Relevance. The relevance between the query 𝑞 and a document 𝑑 is quantified by
the probability of generating its corresponding DocID 𝑑 ′ given 𝑞. This is computed as:
$\mathcal{R}(q, d) = P(d' \mid q; \theta) = \prod_{i=1}^{T} P(d'_i \mid d'_{<i}, q; \theta)$,  (6)
where $T$ is the length of the DocID $d'$ in tokens, $d'_i$ is the token at position $i$, and $d'_{<i}$ denotes the sequence of tokens generated before position $i$. The constrained beam search produces a ranked list of top-$k$ DocIDs $\{d'^{(1)}, d'^{(2)}, \ldots, d'^{(k)}\}$ based on their generation probabilities $\{\mathcal{R}(q, d^{(1)}), \mathcal{R}(q, d^{(2)}), \ldots, \mathcal{R}(q, d^{(k)})\}$. The corresponding documents $\{d^{(1)}, d^{(2)}, \ldots, d^{(k)}\}$ are then considered the most relevant to the query $q$.
Model Optimization. Generative retrieval models are typically optimized using cross-entropy
loss, which measures the discrepancy between the generated DocID sequence and the ground truth
DocID. Given a query 𝑞 and its corresponding DocID 𝑑 ′ , the cross-entropy loss is defined as:
$\mathcal{L} = -\sum_{i=1}^{T} \log P(d'_i \mid d'_{<i}, q; \theta)$,  (7)
where $T$ is the length of the DocID in tokens, $d'_i$ is the token at position $i$, and $d'_{<i}$ denotes the sequence of tokens generated before position $i$. This loss function encourages the model to learn the association between the query and the labeled DocID sequence.
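The relevance score of Equation (6) and the loss of Equation (7) reduce to simple products and sums of per-step token probabilities, as the short sketch below shows; the probability values are hypothetical model outputs for the gold DocID.

```python
import math

def docid_relevance(step_probs):
    # Equation (6): R(q, d) = product over positions of P(d'_i | d'_<i, q).
    p = 1.0
    for prob in step_probs:
        p *= prob
    return p

def docid_nll_loss(step_probs):
    # Equation (7): negative log-likelihood of the labeled DocID sequence.
    return -sum(math.log(p) for p in step_probs)

# Hypothetical per-token probabilities the model assigns to the gold DocID.
step_probs = [0.9, 0.7, 0.8]
print(docid_relevance(step_probs))  # 0.504
print(docid_nll_loss(step_probs))   # ~0.685
```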
This approach allows the generative retrieval model to produce a relevance-ordered list of
documents without relying on traditional indexing structures. The core of this approach lies in
leveraging the language model’s capability to generate DocID sequences within prefix constraints.
This section discusses the simplest generative retrieval method. In Section 3, we will delve into
advanced methods from multiple perspectives, including model architectures, training strategies,
and DocID design, to further enhance retrieval performance across various scenarios.

2.3 Large Language Models


The evolution of Large Language Models (LLMs) marks a significant leap in natural language processing (NLP), rooted in early statistical and neural network-based language models [374]. Through pre-training on vast text corpora, these models learn deep semantic features of language, greatly enriching the understanding of text. Subsequently, generative language models, most notably the GPT series [16, 228, 229], significantly improved text generation and understanding capabilities as model sizes and parameter counts increased.
LLMs can be mainly divided into two categories: encoder-decoder models and decoder-only
models. Encoder-decoder models, like T5 [231] and BART [138], convert input text into vector
representations through their encoder, then the decoder generates output text based on these
representations. This model perspective treats various NLP tasks as text-to-text conversion prob-
lems, solving them through text generation. On the other hand, decoder-only models, like GPT [228] and GPT-2 [229], rely entirely on the Transformer decoder, generating text step by
step through the self-attention mechanism. The introduction of GPT-3 [16], with its 175 billion
parameters, marked a significant milestone in this field and led to the creation of models like
InstructGPT [210], Falcon [215], PaLM [34] and Llama series [59, 285, 286]. These models, all using
a decoder-only architecture, trained on large-scale datasets, have shown astonishing language
processing capabilities [359].
For information retrieval tasks, large language models (LLMs) play a crucial role in directly
generating the exact information users seek [55, 173, 374]. This capability marks a significant


step towards a new era of generative information retrieval. In this era, the retrieval process is
not solely about locating existing information but also about creating new content that meets the
specific needs of users. This feature is especially advantageous in situations where users might not
know how to phrase their queries or when they are in search of complex and highly personalized
information, scenarios where traditional matching-based methods fall short.

2.4 Augmented Language Models


Despite the advances of LLMs, they still face significant challenges such as hallucination, particu-
larly in complex tasks or those requiring access to long-tail or real-time information [90, 359]. To
address these issues, retrieval augmentation and tool augmentation have emerged as effective strate-
gies. Retrieval augmentation involves integrating external knowledge sources into the language
model’s workflow. This integration allows the model to access up-to-date and accurate information
during the generation process, thereby grounding its responses in verified data and reducing the
likelihood of hallucinations [139, 252, 271]. Tool augmentation, on the other hand, extends the
capabilities of LLMs by incorporating specialized tools or APIs that can perform specific functions
like mathematical computations, data retrieval, or executing predefined commands [226, 245, 276].
With retrieval and tool augmentations, language models can provide more precise and contextually
relevant responses, thereby improving factuality and functionality in practical applications.
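A minimal sketch of the retrieval-augmented workflow described above is given below; the word-overlap retriever and the llm_generate callback are illustrative stand-ins for a real retriever and LLM API, not a specific published system.

```python
def retrieve(query, corpus, k=3):
    # Stand-in retriever: any sparse or dense retriever from Section 2.1 works here.
    scored = sorted(corpus, key=lambda doc: len(set(query.split()) & set(doc.split())), reverse=True)
    return scored[:k]

def answer_with_rag(query, corpus, llm_generate):
    # llm_generate(prompt) -> str is a hypothetical wrapper around any LLM API.
    passages = retrieve(query, corpus)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the passages below and cite them as [n].\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )
    return llm_generate(prompt)

corpus = ["Steak should rest after cooking.", "BM25 is a sparse retrieval model.", "Sear steak in a hot pan."]
fake_llm = lambda prompt: "Sear it in a hot pan and let it rest [3][1]."
print(answer_with_rag("How do I cook a steak?", corpus, fake_llm))
```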
Moreover, due to the aforementioned issue of hallucinations, the responses generated by LLMs are
often considered unreliable because users are unaware of the sources behind the generated content,
making it difficult to verify its accuracy. To enhance the credibility of responses, some studies have
focused on generating responses with citations [143, 204, 256]. This approach involves enabling
language models to cite the source documents of their generated content, thereby increasing the
trustworthiness of the responses. All these methods are effective strategies for improving both the
quality and reliability of language model outputs and are essential technologies for building the
next generation of generative information retrieval systems.

3 GENERATIVE DOCUMENT RETRIEVAL: FROM SIMILARITY MATCHING TO GENERATING DOCUMENT IDENTIFIERS
In recent advancements in AIGC, generative retrieval (GR) has emerged as a promising approach
in the field of information retrieval, garnering increasing interest from the academic community.
Figure 3 showcases a timeline of the GR methods. Initially, GENRE [18] proposed to retrieve entities
by generating their unique names through constrained beam search via a pre-built entity prefix
tree, achieving advanced entity retrieval performance. Subsequently, Metzler et al. [195] envisioned
a model-based information retrieval framework aiming to combine the strengths of traditional
document retrieval systems and pre-trained language models to create systems capable of providing
expert-quality answers in various domains.
Following their lead, a diverse range of methods including DSI [281], DynamicRetriever [370],
SEAL [13], NCI [307], etc., have been developed, with a continuously growing body of work. These
methods explore various aspects such as model training and architectures, document identifiers,
incremental learning, task-specific adaptation, and generative recommendations. Figure 4 presents
an overview of the GR system and we’ll provide an in-depth discussion of each associated challenge
in the following sections.

3.1 Model Training and Structure


One of the core components of GR is model training and structure, which aims to enhance the model's ability to memorize documents in the corpus.



Fig. 3. Timeline of research in generative retrieval: focus on model training and structure, document identifier
design, incremental learning and downstream task adaptation.

3.1.1 Model Training. To effectively train generative models for indexing documents, the standard approach is to learn the mapping from queries to relevant DocIDs with standard sequence-to-sequence (seq2seq) training, as described in Equation (7). This method has been widely
used in numerous GR research works, such as DSI [281], NCI [307], SEAL [13], etc. Moreover, a
series of works have proposed various model training methods tailored for GR tasks to further
enhance GR retrieval performance, such as sampling documents or generating queries based on
document content to serve as pseudo queries for data augmentation; or including training objectives
for document ranking.
Specifically, DSI [281] proposed two training strategies: one is “indexing”, that is, training the
model to associate document tokens with their corresponding DocIDs, where DocIDs are pre-built
based on documents in corpus, which will be discussed in detail in Section 3.2; the other is “retrieval”,
using labeled query-DocID pairs to fine-tune the model. Notably, DSI was the first to realize a
differentiable search index based on the Transformer [290] structure, showing good performance
in web search [205] and question answering [126] scenarios. Next, we discuss a series of methods that propose training strategies for data augmentation and for improving the ranking ability of GR models.
Sampling Document Pieces as Pseudo Queries. In the same era, DynamicRetriever [370],
also based on the encoder-decoder model, constructed a model-based IR system by initializing
the encoder with a pre-trained BERT [121]. Besides, DynamicRetriever utilizes passages, sampled
terms and N-grams to serve as pseudo queries to enhance the model’s memorization of DocIDs.
Formally, the training methods can be summarized as follows:

Sampled Document: $d_s^i \longrightarrow \text{DocID}, \quad i \in \{1, \ldots, k_{ds}\}$,  (8)
Labeled Query: $q_i \longrightarrow \text{DocID}, \quad i \in \{1, \ldots, k_q\}$,  (9)
where $d_s^i$ and $q_i$ denote each of the $k_{ds}$ sampled document texts and each of the $k_q$ labeled queries for the corresponding DocID, respectively.
Generating Pseudo Queries from Documents. Following DSI, the NCI [307] model was
trained using a combination of labeled query-document pairs and augmented pseudo query-
document pairs. Specifically, NCI proposed two strategies: one using the DocT5Query [208] model
as a query generator, generating pseudo queries for each document in the corpus through beam
search; the other strategy directly uses the document as a query, as stated in Equation (8). Similarly,
DSI-QG [375] also proposed using a query generator to enhance training data, establishing a
bridge between indexing and retrieval in DSI. This data augmentation strategy has been shown in subsequent works to be effective at enhancing the model's memorization of DocIDs,


[Figure 4 illustration] Generative retrieval: retrieve documents by directly generating their identifiers (corpus → DocID construction → prefix constraints → language model → ranked DocIDs for a search query). The key components and their challenges are:
• Model Training and Structure. Description: the language model is the core component that memorizes documents. Challenge: how to design training strategies and model structures that effectively memorize and generate DocIDs?
• Document Identifier. Description: assigning each document in the corpus a unique identifier to represent it. Challenge: how to design DocIDs that the language model can easily memorize and generalize?
• Incremental Learning. Description: new documents are added to the corpus. Challenge: how to effectively index new documents without forgetting old ones?
• Downstream Tasks. Description: adapting generative retrieval to specific applications. Challenge: how to leverage the strengths of the generative retrieval model to achieve improved downstream performance?
• Generative Recommendation. Description: directly generate item IDs without similarity matching. Challenge: how to effectively recommend new items through generative approaches?

Fig. 4. A conceptual framework for a generative retrieval system, with a focus on challenges in incremental
learning, identifier construction, model training and structure, and integration with downstream tasks and
recommendation systems.

which can be expressed as follows:
Pseudo Query: $q_s^i \longrightarrow \text{DocID}, \quad i \in \{1, \ldots, k_{qs}\}$,  (10)
where $q_s^i$ represents each of the $k_{qs}$ generated pseudo queries for the corresponding DocID.
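The following sketch assembles seq2seq training pairs corresponding to Equations (8), (9), and (10); the toy corpus, labeled queries, and the doc2query-style query_generator are hypothetical placeholders for the real data and model.

```python
def build_training_pairs(corpus, labeled_queries, query_generator, k_ds=2, k_qs=2):
    """Assemble (input text -> DocID) pairs for seq2seq GR training.

    corpus:          {docid: document text}
    labeled_queries: {docid: [annotated queries]}                 (Eq. 9)
    query_generator: hypothetical doc2query-style function
                     returning pseudo queries for a document      (Eq. 10)
    """
    pairs = []
    for docid, text in corpus.items():
        sentences = text.split(". ")
        for piece in sentences[:k_ds]:                 # Eq. 8: sampled document pieces
            pairs.append((piece, docid))
        for q in labeled_queries.get(docid, []):       # Eq. 9: labeled queries
            pairs.append((q, docid))
        for q in query_generator(text)[:k_qs]:         # Eq. 10: generated pseudo queries
            pairs.append((q, docid))
    return pairs

corpus = {"doc42": "Rest the steak after searing. Season it with salt."}
labeled = {"doc42": ["how long should steak rest"]}
fake_doc2query = lambda text: ["how to rest a steak", "should I salt steak"]
print(build_training_pairs(corpus, labeled, fake_doc2query))
```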
Improving Ranking Capability. Additionally, a series of methods focus on further optimizing
the ranking capability of GR models. Chen et al. [30] proposed a multi-task distillation method to
improve retrieval quality without changing the model structure, thereby obtaining better indexing
and ranking capabilities. Meanwhile, LTRGR [159] introduced a ranking loss to train the model
in ranking paragraphs. Subsequently, [365] proposed GenRRL, which improves ranking quality
through reinforcement learning with relevance feedback, aligning token-level DocID generation
with document-level relevance estimation. Moreover, [161] introduced DGR, which enhances
generative retrieval through knowledge distillation. Specifically, DGR uses a cross-encoder as a
teacher model, providing fine-grained passage ranking supervision signals, and then optimizes
the model with a distilled RankNet loss. ListGR [280] defined positional conditional probabilities,
emphasizing the importance of the generation order of each DocID in the list. In addition, ListGR
employs relevance calibration that adjusts the generated list of DocIDs to better align with the
labeled ranking list. See Table 1 for a detailed comparison of GR methods.
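As a generic illustration of such ranking objectives (not the exact loss of LTRGR, GenRRL, DGR, or ListGR), the sketch below computes a pairwise hinge loss over generation scores of relevant and irrelevant DocIDs; the scores and margin are illustrative.

```python
def pairwise_margin_loss(pos_score, neg_score, margin=1.0):
    # Generic pairwise ranking objective: penalize cases where a relevant
    # DocID does not outscore an irrelevant one by at least `margin`.
    return max(0.0, margin - (pos_score - neg_score))

def ranking_loss(query_scores, margin=1.0):
    # query_scores: {"pos": [...], "neg": [...]} generation scores, e.g., log P(DocID | q).
    losses = [
        pairwise_margin_loss(p, n, margin)
        for p in query_scores["pos"]
        for n in query_scores["neg"]
    ]
    return sum(losses) / len(losses)

print(ranking_loss({"pos": [-1.2], "neg": [-1.5, -0.8]}))  # average hinge loss over all pairs
```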
3.1.2 Model Structure. Basic generative retrieval models mostly use pre-trained encoder-decoder
structured generative models, such as T5 [231] and BART [138], fine-tuned for the DocID generation
task. To better adapt to the GR task, researchers have proposed a series of specifically designed
model structures [130, 224, 237, 275, 307, 342, 346].
Model Decoding Methods. For the semantic structured DocID proposed by DSI [281], NCI [307]
designed a Prefix-Aware Weight-Adaptive (PAWA) decoder. By adjusting the weights at different
positions of DocIDs, this decoder can capture the semantic hierarchy of DocIDs. To allow the GR
model to utilize both own parametric knowledge and external information, NP-Decoding [130]
proposed using non-parametric contextualized word embeddings (as external memory) instead
of traditional word embeddings as the input to the decoder. Additionally, PAG [345] proposed a
planning-ahead generation approach, which first decodes the set-based DocID to approximate
document-level scores, and then continues to decode the sequence-based DocID on this basis.
Combining Generative and Dense Retrieval Methods. Combining seq2seq generative mod-
els with dual-encoder retrieval models, MEVI [346] utilizes Residual Quantization (RQ) [189] to
organize documents into hierarchical clusters, enabling efficient retrieval of candidate clusters and
precise document retrieval within those clusters. Similarly, Generative Dense Retrieval (GDR) [342]
proposed to first broadly match queries to document clusters, optimizing for interaction depth and


memory efficiency, and then perform precise, cluster-specific document retrieval, boosting both
recall and scalability.
Utilizing Multiple Models. TOME [237] proposed to decompose the GR task into two stages,
first generating text paragraphs related to the query through an additional model, then using the
GR model to generate the URL related to the paragraph. DiffusionRet [224] proposed to first use
a diffusion model (SeqDiffuSeq [341]) to generate a pseudo-document from a query, where the
generated pseudo-document is similar to real documents in length, format, and content, rich in
semantic information; then, it employs another generative model to perform retrieval based on N-grams, similar to the process used by SEAL [13], leveraging an FM-Index [62] for generating
N-grams found in the corpus. Self-Retrieval [275] fully integrated indexing, retrieval, and evaluation
into a single large language model. It generates natural language indices and document segments,
and performs self-evaluation to score and rank the generated documents.

3.2 Design of Document Identifiers


Another essential component of generative retrieval is document representation, also known
as document identifiers (DocIDs), which act as the target outputs for the GR model. Accurate
document representations are crucial as they enable the model to more effectively memorize
document information, leading to enhanced retrieval performance. Table 1 provides a detailed
comparison of the states, data types, and order of DocIDs across numerous GR methods. In the
following sections, we will explore the design of DocIDs from two categories: numeric-based
identifiers and text-based identifiers.

3.2.1 Numeric-based Identifiers. An intuitive method to represent documents is by using a single number or a series of numbers, referred to as DocIDs. Existing methods have designed both static and learnable DocIDs.
Static DocIDs. Initially, DSI [281] introduced three numeric DocIDs to represent documents: (1)
Unstructured Atomic DocID: a unique integer identifier is randomly assigned to each document,
containing no structure or semantic information. (2) Naively Structured String DocID: treating
random integers as divisible strings, implementing character-level DocID decoding to replace large
softmax output layers. (3) Semantically Structured DocID: introducing semantic structure through
hierarchical 𝑘-means method, allowing semantically similar documents to share prefixes in their
identifiers, effectively reducing the search space. Concurrently, DynamicRetriever [370] also built a
model-based IR system based on unstructured atomic DocID. Subsequently, Ultron [371] encoded
documents into a latent semantic space using BERT [121], and compressed vectors into a smaller
semantic space via Product Quantization (PQ) [73, 102], preserving semantic information. Each
document’s PQ code serves as its semantic identifier. MEVI [346] clusters documents using Residual
Quantization (RQ) [189] and utilizes dual-tower and seq2seq model embeddings for a balanced
performance in large-scale document retrieval.
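The hierarchical $k$-means construction behind semantically structured DocIDs can be sketched as follows; the random document embeddings are placeholders for real encoder outputs, and the branching factor and leaf size are illustrative choices rather than the settings used by any specific method.

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_docids(embeddings, k=2, leaf_size=2, prefix=()):
    """Hierarchical k-means DocIDs in the spirit of semantically structured
    identifiers: semantically similar documents share identifier prefixes."""
    n = len(embeddings)
    if n <= leaf_size or n < k:
        # Small cluster: simply number the remaining documents under this prefix.
        return {doc: prefix + (j,) for j, doc in enumerate(embeddings.keys())}
    ids = list(embeddings.keys())
    X = np.stack([embeddings[i] for i in ids])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    docids = {}
    for c in range(k):
        member = {ids[j]: X[j] for j in range(n) if labels[j] == c}
        docids.update(semantic_docids(member, k, leaf_size, prefix + (c,)))
    return docids

# Placeholder embeddings; in practice these come from a document encoder such as BERT.
rng = np.random.default_rng(0)
emb = {f"doc{i}": rng.normal(size=32) for i in range(8)}
print(semantic_docids(emb))  # e.g., {'doc0': (1, 0, 1), ...}
```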
Learnable DocIDs. Unlike previous static DocIDs, GenRet [265] proposed learnable document
representations, transforming documents into DocIDs through an encoder, then reconstructs doc-
uments from DocIDs using a decoder, trained to minimize reconstruction error. Furthermore, it
used progressive training and diversity clustering for optimization. To ensure that DocID embed-
dings can reflect document content, Tied-Atomic [206] proposed to link document text with token
embeddings and employs contrastive loss for DocID generation. LMIndexer [112] and ASI [330]
learned optimal DocIDs through semantic indexing, with LMIndexer using a reparameterization
mechanism for unified optimization, facilitating efficient retrieval by aligning semantically simi-
lar documents under common DocIDs. ASI extends this by establishing an end-to-end retrieval
framework, incorporating semantic loss functions and reparameterization to enable joint training.


Table 1. Comparisons of representative generative retrieval methods, focusing on document identifier, training
data augmentation, and training objective.

Columns are grouped as: Document Identifier (State, Data Type, Order), Training Data Augmentation (Sample Doc, Doc2Query), and Training Objective (Seq2seq, DocID, Ranking).

| Method | State | Data Type | Order | Sample Doc | Doc2Query | Seq2seq | DocID | Ranking |
|---|---|---|---|---|---|---|---|---|
| GENRE [18] | Static | Text | Sequence | - | - | ✓ | - | - |
| DSI [281] | Static | Numeric | Sequence | ✓ | - | ✓ | - | - |
| DynamicRetriever [370] | Static | Numeric | Sequence | ✓ | - | ✓ | - | - |
| SEAL [13] | Static | Text | Sequence | ✓ | - | ✓ | - | - |
| DSI-QG [375] | Static | Numeric | Sequence | - | ✓ | ✓ | - | - |
| NCI [307] | Static | Numeric | Sequence | ✓ | ✓ | ✓ | - | - |
| Ultron [371] | Static | Numeric/Text | Sequence | ✓ | ✓ | ✓ | - | - |
| CorpusBrain [28] | Static | Text | Sequence | ✓ | - | ✓ | - | - |
| GenRet [265] | Learnable | Numeric | Sequence | - | ✓ | ✓ | ✓ | - |
| AutoTSG [352] | Static | Text | Set | - | ✓ | ✓ | - | - |
| SE-DSI [278] | Static | Text | Sequence | ✓ | - | ✓ | - | - |
| Chen et al. [30] | Static | Numeric | Sequence | ✓ | ✓ | ✓ | - | ✓ |
| LLM-URL [376] | Static | Text | Sequence | - | - | - | - | - |
| MINDER [160] | Static | Text | Sequence | - | ✓ | ✓ | - | - |
| LTRGR [159] | Static | Text | Sequence | - | ✓ | ✓ | - | ✓ |
| NOVO [311] | Learnable | Text | Set | ✓ | - | - | ✓ | - |
| GenRRL [365] | Static | Text | Sequence | - | ✓ | ✓ | - | ✓ |
| LMIndexer [112] | Learnable | Numeric | Sequence | - | ✓ | ✓ | ✓ | - |
| ASI [330] | Learnable | Numeric | Sequence | - | ✓ | ✓ | ✓ | - |
| RIPOR [344] | Learnable | Numeric | Sequence | - | ✓ | ✓ | ✓ | ✓ |
| GLEN [135] | Learnable | Text | Sequence | - | ✓ | ✓ | ✓ | ✓ |
| DGR [161] | Static | Text | Sequence | - | ✓ | ✓ | - | ✓ |
| ListGR [280] | Static | Numeric | Sequence | - | ✓ | ✓ | - | ✓ |

Furthermore, RIPOR [344] treats the GR model as a dense encoder to encode document content.
It then splits these representations into vectors via RQ [189], creating unique DocID sequences.
In addition, RIPOR implements a prefix-guided ranking optimization, increasing relevance scores for prefixes of pertinent DocIDs through a margin-decomposed pairwise loss during decoding.
In summary, numeric-based document representations can utilize the embeddings of dense
retrievers, obtaining semantically meaningful DocID sequences through methods such as 𝑘-means,
PQ [102], and RQ [189]; they can also combine encoder-decoder GR models with bi-encoder DR
models to achieve complementary advantages [206, 346].

3.2.2 Text-based Identifiers. Text-based DocIDs have the inherent advantage of effectively leverag-
ing the strong capabilities of pre-trained language models and offering better interpretability.
Document Titles. The most straightforward text-based identifier is the document title, which
requires each title to uniquely represent a document in the corpus, otherwise, it would not be
possible to accurately retrieve a specific document. The Wikipedia corpus used in the KILT [218]
benchmark, due to its well-regulated manual annotation, has a unique title corresponding to each
document. Thus, GENRE [18], based on the title as DocID and leveraging the generative model
BART [138] and pre-built DocID prefix, achieved superior retrieval performance across 11 datasets
in KILT. Following GENRE, GERE [27], CorpusBrain [28], Re3val [257], and CorpusBrain++ [80]
also based their work on title DocIDs for Wikipedia-based tasks. Notably, LLM-URL [376] directly
generated URLs using ChatGPT prompts, achieving commendable performance after removing
invalid URLs. However, in the web search scenario [205], document titles in the corpus often have
significant duplication and many meaningless titles, making it unfeasible to use titles alone as
DocIDs. Thus, Ultron [371] effectively addressed this issue by combining URLs and titles as DocIDs,
identifying documents through keywords in web page URLs and titles.


Sub-strings of Documents. To increase the flexibility of DocIDs, SEAL [13] proposed a sub-
string identifier, representing documents with any N-grams within them. Using FM-Index (a
compressed full-text sub-string index) [62], SEAL could generate N-grams present in the corpus to
retrieve all documents containing those N-grams, scoring and ranking documents based on the
frequency of N-grams in each document and the importance of N-grams. Following SEAL, various
GR models [26, 159–161] also utilized sub-string DocIDs and FM-Index during inference. For a
more comprehensive representation of documents, MINDER [160] proposed multi-view identifiers,
including generated pseudo queries from document content via DocT5Query [208], titles, and
sub-strings. This multi-view DocID was also used in LTRGR [159] and DGR [161].
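A highly simplified stand-in for the substring-as-identifier idea is sketched below: instead of a real FM-Index, a dictionary maps word N-grams to the documents containing them, and documents are scored by the generated N-grams they match; the corpus, N-gram length, and uniform weighting are illustrative assumptions.

```python
from collections import defaultdict

def build_ngram_index(corpus, n=3):
    # Naive stand-in for an FM-Index: map every word n-gram to the documents containing it.
    index = defaultdict(set)
    for docid, text in corpus.items():
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            index[" ".join(tokens[i:i + n])].add(docid)
    return index

def score_documents(generated_ngrams, index):
    # Score documents by how many generated n-grams they contain
    # (real systems also weight n-grams by importance and frequency).
    scores = defaultdict(float)
    for ngram in generated_ngrams:
        for docid in index.get(ngram, ()):
            scores[docid] += 1.0
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

corpus = {"d1": "how to cook the perfect steak at home", "d2": "how to train a generative retrieval model"}
index = build_ngram_index(corpus)
print(score_documents(["how to cook", "the perfect steak"], index))  # d1 ranked first
```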
Term Sets. Unlike the sequential DocIDs described earlier, AutoTSG [352] proposed a term
set-based document representation, using keywords extracted from titles and content, rather than
predefined sequences, allowing for retrieval of the target document as long as the generated term
set is included in the extracted keywords. Recently, PAG [345] also constructed DocIDs based on
sets of key terms, disregarding the order of terms, which is utilized for approximating document
relevance in decoding.
Learnable DocIDs. Text-based identifiers can also be learnable. Similarly based on term-sets,
NOVO [311] proposed learnable continuous N-grams constituting term-set DocIDs. Through de-
noising query modeling, the model learned to generate queries from documents with noise, thereby
implicitly learning to filter out document N-grams more relevant to queries. NOVO also improves
the document’s semantic representation by updating N-gram embeddings. Later, GLEN [135] uses
dynamic lexical DocIDs and follows a two-phase index learning strategy. First, it assigns DocIDs
by extracting keywords from documents using self-supervised signals. Then, it refines DocIDs by
integrating query-document relevance through two loss functions. During inference, GLEN ranks
documents using DocID weights without additional overhead.

3.3 Incremental Learning on Dynamic Corpora


Prior studies have focused on generative retrieval from static document corpora. However, in
reality, the documents available for retrieval are continuously updated and expanded. To address
this challenge, researchers have developed a range of methods to optimize GR models for adapting
to dynamic corpora.
Optimizer and Document Rehearsal. At first, DSI++ [192] aims to address the incremental
learning challenges encountered by DSI [281]. DSI++ modifies the training by optimizing flat loss
basins through the Sharpness-Aware Minimization (SAM) optimizer, stabilizing the learning process
of the model. It also employs DocT5Query [208] to generate pseudo queries for documents in the
existing corpus as training data augmentation, mitigating the forgetting issue of GR models.
Constrained Optimization. Addressing the scenario of real-time addition of new documents,
such as news or scientific literature IR systems, IncDSI [124] views the addition of new documents
as a constrained optimization problem to find optimal representations for the new documents. This
approach aims to (1) ensure new documents can be correctly retrieved by their relevant queries,
and (2) maintain the retrieval performance of existing documents unaffected.
Incremental Product Quantization. CLEVER [25], based on Product Quantization (PQ) [102],
proposes Incremental Product Quantization (IPQ) for generating PQ codes as DocIDs for documents.
Compared to traditional PQ methods, IPQ designs two adaptive thresholds to update only a subset
of centroids instead of all, maintaining the indices of updated centroids constant. This method
reduces computational costs and allows the system to adapt flexibly to new documents.
Fine-tuning Adapters for Specific Tasks. CorpusBrain++ [80] introduces the KILT++ bench-
mark for continuously updated KILT [218] tasks and designs a dynamic architecture paradigm
with a backbone-adapter structure. By fixing a shared backbone model to provide basic retrieval


capabilities while introducing task-specific adapters to incrementally learn new documents for
each task, it effectively avoids catastrophic forgetting. During training, CorpusBrain++ generates
pseudo queries for new document sets and continues to pre-train adapters for specific tasks.

3.4 Downstream Task Adaptation


Generative retrieval methods, apart from addressing retrieval tasks individually, have been tailored
to various downstream generative tasks. These include fact verification [284], entity linking [86],
open-domain QA [126], dialogue [51], slot filling [137], among others, as well as knowledge-
intensive tasks [218], code [179], conversational QA [3], and multi-modal retrieval scenarios [165],
demonstrating superior performance and efficiency. These methods are discussed in terms of
separate training, joint training, and multi-modal generative retrieval.
3.4.1 Separate Training. For fact verification tasks [284], which involve determining the correctness
of input claims, GERE [27] proposed using an encoder-decoder-based GR model to replace traditional
indexing-based methods. Specifically, GERE first utilizes a claim encoder to encode input claims,
and then generates document titles related to the claim through a title decoder to obtain candidate
sentences for corresponding documents.
Knowledge-Intensive Language Tasks. For Knowledge-Intensive Language Tasks (KILT) [218],
CorpusBrain [28] introduced three pre-training tasks to enhance the model’s understanding of
query-document relationships at various granularities: Internal Sentence Selection, Leading Para-
graph Selection, and Hyperlink Identifier Prediction. Similarly, UGR [26] proposed using different
granularities of N-gram DocIDs to adapt to various downstream tasks, unifying different retrieval
tasks into a single generative form. UGR achieves this by letting the GR model learn prompts
specific to tasks, generating corresponding document, passage, sentence, or entity identifiers.
Furthermore, DearDR [283] utilizes distant supervision and self-supervised learning techniques,
using Wikipedia page titles and hyperlinks as training data. The model samples sentences from
Wikipedia documents as input and trains an autoregressive model to decode page titles or hyperlinks,
or both, without the need for manually labeled data. Re3val [257] proposes a retrieval framework
combining generative reordering and reinforcement learning. It first reorders retrieved page titles
using context information obtained from a dense retriever, then optimizes the reordering using the
REINFORCE algorithm to maximize rewards generated by constrained decoding.
Multi-hop Retrieval. In multi-hop retrieval tasks, which require iterative document retrievals to
gather adequate evidence for answering a query, GMR [131] proposed to employ language model
memory and multi-hop memory to train a generative retrieval model, enabling it to memorize the
target corpus and simulate real retrieval scenarios through constructing pseudo multi-hop query
data, achieving dynamic stopping and efficient performance in multi-hop retrieval tasks.
Code Retrieval. CodeDSI [203] is an end-to-end generative code search method that directly
maps queries to pre-stored code samples’ DocIDs instead of generating new code. Similar to
DSI [281], it includes indexing and retrieval stages, learning to map code samples and real queries
to their respective DocIDs. CodeDSI explores different DocID representation strategies, including
direct and clustered representation, as well as numerical and character representations.
Conversational Question Answering. GCoQA [158] is a generative retrieval method for
conversational QA systems that directly generates DocIDs for passage retrieval. This method
focuses on key information in the dialogue context at each decoding step, achieving more precise
and efficient passage retrieval and answer generation, thereby improving retrieval performance
and overall system efficiency.
3.4.2 Joint Training. The methods in the previous section involve separately training generative
retrievers and downstream task generators. However, due to the inherent nature of GR models as


generative models, a natural advantage lies in their ability to be jointly trained with downstream
generators to obtain a unified model for retrieval and generation tasks.
Multi-decoder Structure. UniGen [155] proposes a unified generation framework to integrate
retrieval and question answering tasks, bridging the gap between query input and generation
targets using connectors generated by large language models. UniGen employs shared encoders
and task-specific decoders for retrieval and question answering, introducing iterative enhancement
strategies to continuously improve the performance of both tasks.
Multi-task Training. Later, CorpusLM [152] introduces a unified language model that integrates
GR, closed-book generation, and retrieval-augmented generation to handle various knowledge-
intensive tasks. The model adopts a multi-task learning approach and introduces ranking-guided
DocID decoding strategies and continuous generation strategies to improve retrieval and generation
performance. In addition, CorpusLM designs a series of auxiliary DocID understanding tasks to
deepen the model’s understanding of DocID semantics.

3.4.3 Multi-modal Generative Retrieval. Generative retrieval methods can also leverage multi-
modal data such as text, images, etc., to achieve end-to-end multi-modal retrieval.
Tokenizing Images to DocID Sequences. At first, IRGen [357] transforms image retrieval
problems into generative problems, predicting relevant discrete visual tokens, i.e., image identifiers,
through a seq2seq model given a query image. IRGen proposed a semantic image tokenizer, which
converts global image features into short sequences capturing high-level semantic information.
Advanced Model Training and Structure. Later, GeMKR [178] combines LLMs’ generation
capabilities with visual-text features, designing a generative knowledge retrieval framework. It
first guides multi-granularity visual learning using object-aware prefix tuning techniques to align
visual features with LLMs’ text feature space, achieving cross-modal interaction. GeMKR then
employs a two-step retrieval process: generating knowledge clues closely related to the query and
then retrieving corresponding documents based on these clues. GRACE [178] achieves generative
cross-modal retrieval by assigning unique identifier strings to images and training multi-
modal large language models (MLLMs) [7] to memorize the association between images and their
identifiers. The training process includes (1) learning to memorize images and their corresponding
identifiers, and (2) learning to generate the target image identifiers from textual queries. GRACE
explores various types of image identifiers, including strings, numbers, semantic and atomic
identifiers, to adapt to different memory and retrieval requirements.

3.4.4 Generative Recommender Systems. Recommendation systems, as an integral part of information retrieval, are currently undergoing a paradigm shift from discriminative models to
generative models. Generative recommendation systems do not require the computation of ranking
scores for each item followed by database indexing, but instead accomplish item recommendations
through the direct generation of IDs. In this section, several seminal works, including P5 [74],
GPT4Rec [146], TIGER [233], SEATER [254], IDGenRec [273], LC-Rec [360] and ColaRec [309], are
summarized to outline the development trends in generative recommendations.
P5 [74] transforms various recommendation tasks into different natural language sequences,
designing a universal, shared framework for recommendation completion. This method, by setting
unique training objectives, prompts, and prediction paradigms for each recommendation domain’s
downstream tasks, serves well as a backbone model, accomplishing various recommendation tasks
through generated text. In generative retrieval, effective indexing identifiers have been proven
to significantly enhance the performance of generative methods. Similarly, TIGER [233] initially
learns a residual quantized autoencoder to generate semantically informative indexing identifiers
for different items. It then trains a transformer-based encoder-decoder model with this semantically


informative indexing identifier sequence to generate item identifiers for recommending the next
item based on historical sequences.
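To illustrate how residual quantization turns an item embedding into a short identifier sequence, the toy sketch below encodes an embedding level by level against fixed random codebooks; TIGER instead learns its codebooks jointly inside an RQ-VAE, so this is only a schematic of the encoding step.

# Toy sketch of residual quantization for semantic item identifiers (in the spirit of TIGER,
# but with random placeholder codebooks rather than learned ones; purely illustrative).
import numpy as np

def residual_quantize(embedding, codebooks):
    """Return one code index per level; the code sequence serves as the item identifier."""
    codes, residual = [], embedding.copy()
    for codebook in codebooks:                          # codebook shape: (n_codes, dim)
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))                     # nearest centroid at this level
        codes.append(idx)
        residual = residual - codebook[idx]             # quantize the remaining residual
    return codes

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(3)]            # 3 levels, 256 codes each
item_identifier = residual_quantize(rng.normal(size=64), codebooks)   # a 3-token identifier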
Focusing solely on semantic information and overlooking the collaborative filtering informa-
tion under the recommendation context might limit the further development of generative mod-
els. Therefore, after generating semantic indexing identifiers similar to TIGER using a residual
quantized autoencoder with uniform semantic mapping, LC-Rec [360] also engages in a series
of alignment tasks, including sequential item prediction, explicit index-language alignment, and
recommendation-oriented implicit alignment. Based on the learned item identifiers, it integrates
semantic and collaborative information, enabling large language models to better adapt to sequence
recommendation tasks.
IDGenRec [273] innovatively combines generative recommendation systems with large lan-
guage models by using human language tokens to generate unique, concise, semantically rich and
platform-agnostic textual identifiers for recommended items. The framework includes a text ID
generator trained on item metadata with a diversified ID generation algorithm, and an alternating
training strategy that optimizes both the ID generator and the LLM-based recommendation model
for improved performance and accuracy in sequential recommendations. SEATER [254] designs a
balanced k-ary tree-structured index, using a constrained k-means clustering method to recur-
sively cluster vectors encoded from item texts, obtaining equal-length identifiers. Compared to the
method proposed by DSI [281], this balanced k-ary tree index maintains semantic consistency at
every level. It then trains a Transformer-based encoder-decoder model and enhances the semantics
of each level of indexing through contrastive learning and multi-task learning. ColaRec [309] inte-
grates collaborative filtering signals and content information by deriving generative item identifiers
from a pretrained recommendation model and representing users via aggregated item content.
Then it uses an item indexing generation loss and contrastive loss to align content-based semantic
spaces with collaborative interaction spaces, enhancing the model’s ability to recommend items in
an end-to-end framework.

4 RELIABLE RESPONSE GENERATION: DIRECT INFORMATION ACCESSING WITH GENERATIVE LANGUAGE MODELS
The rapid advancement of large language models has positioned them as a novel form of IR system,
capable of generating reliable responses directly aligned with users’ informational needs. This not
only saves the time users would otherwise spend on collecting and integrating information but
also provides personalized, user-centric answers tailored to individual users.
However, challenges remain in creating a grounded system that delivers faithful answers, such
as hallucination, prolonged inference time, and high operational costs. This section will outline
strategies for constructing a faithful GenIR system by: (1) Optimizing the GenIR model internally,
(2) Enhancing the model with external knowledge, (3) Increasing accountability, and (4) Developing
personalized information assistants.

4.1 Internal Knowledge Memorization


To develop a user-friendly and reliable IR system, the generative model should be equipped with
comprehensive internal knowledge. Optimization of the backbone generative model can be catego-
rized into three aspects: structural enhancement, training and inference, and knowledge updating.
The overview of this section is shown in the green part of Figure 5.
4.1.1 Model Structure. With the advent of generative models, various methods have been intro-
duced to improve model structure and enhance generative reliability. We aim to discuss the crucial
technologies contributing to this advancement in this subsection.


Fig. 5. An illustration of strategies for enhancing language models to generate user-centric and reliable
responses, covering internal knowledge memorization (structural enhancement, training and inference, and
knowledge updating) and external knowledge augmentation (retrieval augmentation and tool augmentation).

(1) Model Scaling Model parameter scaling is a pivotal factor influencing performance. Con-
temporary language models predominantly employ the Transformer architecture, and scaling both
the model parameters and the training data enhances the model’s capacity to retain knowledge
and capabilities [116]. For instance, in the GPT [2, 16, 228, 229] series and LLaMA [285, 286] family,
larger models tend to perform better on diverse downstream tasks, including few-shot learning, lan-
guage understanding, and generation [34]. Additionally, scaling the model contributes to improved
instruction-following capabilities [227], enabling a more adept comprehension of user intent and
generating responses that better satisfy user requests.
(2) Model Integration Model integration is an effective method to enhance the reliability of
generated outputs by capitalizing on the diverse strengths of various models. The predominant
approach is the Mixture of Experts (MoE) [96], which utilizes a gating mechanism to selectively
activate sections of network parameters during inference, greatly increasing the effective parameters
without inflating inference costs [58, 61, 106, 136]. This method also boasts impressive scalability,
with efficacy augmented alongside the expanding parameter volume and the number of expert
models [38]. Alternatively, the LLM-Blender framework [107] employs a ranker and a fuser to
combine answers from various LLMs, including black-box models, but faces high deployment costs.
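The gating idea behind MoE can be illustrated with a small toy sketch: a router scores all experts, only the top-k experts are evaluated, and their outputs are mixed with normalized gate weights. The shapes, random experts, and plain softmax routing below are illustrative assumptions rather than any production MoE implementation.

# Toy sketch of top-k Mixture-of-Experts gating (illustrative only).
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """x: (d,) input; gate_w: (n_experts, d) router weights; experts: list of callables."""
    logits = gate_w @ x                            # router score for each expert
    top = np.argsort(logits)[-k:]                  # activate only the top-k experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                       # softmax over the selected experts
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(n_experts)]
output = moe_forward(rng.normal(size=d), rng.normal(size=(n_experts, d)), experts, k=2)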
4.1.2 Training and Inference. In the model training stage, methods to enhance the reliability of
answers can be categorized into two aspects: training data optimization and training methods
optimization.
(1) Training Data Optimization The quality of training data substantially affects the reliability
of model outputs. Noise, misinformation, and incomplete information can disrupt the learning
process, leading to hallucinations and other issues. To address this, [79] used GPT-3.5 to artificially
create textbooks filled with examples and language descriptions as training data, resulting in
significant improvements on downstream tasks after minor fine-tuning. LIMA [363] used dialogues
from community forums to construct a small-scale fine-tuning dataset, enhancing the model’s
conversation capabilities during the alignment phase. To reduce redundancies in crawled internet
data, Lee et al. [132] combined suffix arrays [188] and MinHash [15] to approximate matching and
deduplicate the training dataset, reducing direct reproduction from the same source.


(2) Training Methods Optimization Beyond conventional training methods, additional tech-
niques have been proposed to improve the factuality of model outputs. MixCL [264] incorporates
contrastive learning into the training objective, using an external knowledge base to identify
correct snippets and reduce the probability of generating incorrect tokens, thus enhancing model
reliability. CaliNet [56] utilizes a contrastive method to assess erroneous knowledge learned by
the model and fine-tunes the parameters of the FFN layer to rectify these errors. FactTune [211]
incorporates factuality assessment during the RLHF phase, using automatic evaluation methods like
FactScore [198] to rank outputs and employing DPO [230] to teach the model factuality preference.
Apart from enhancing the internal knowledge reliability during training, the inference stage
significantly impacts the reliability of answers. The overall inference process consists of user input
and the model’s token decoding, and approaches to increase generation reliability can be divided
into prompt engineering and decoding strategy.
(3) Prompt Engineering Prompting methods play a vital role in guiding the model. A well-
designed prompt can better promote the model’s internal capabilities to provide more accurate
answers. The Chain-of-Thought (CoT) [313] prompting method guides the model to explicitly
decompose the question into a reasoning chain during decoding, improving response accuracy by
grounding the final answer on accurate intermediate steps. Further, CoT-SC [306] samples multiple
answers and chooses the most consistent one as the final answer. The Tree of Thoughts [332]
expands CoT’s single reasoning path to multiple paths, synthesizing their outcomes to arrive at the
final answer. The Chain-of-Verification (CoVE) [49] introduces a self-reflection mechanism where
the LLM generates a draft response, then validates each statement for factual inaccuracies, correcting
errors to enhance factual accuracy. Additionally, methods like RECITE [268] and GenRead [339]
prompt the model to output relevant internal knowledge fragments, which are then used to bolster
the question-answering process.
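As a minimal sketch of the self-consistency idea behind CoT-SC, the snippet below samples several reasoning chains and returns the most frequent final answer; sample_cot_answer is a hypothetical stub standing in for a sampled LLM call and is not part of any published implementation.

# Sketch of self-consistency (CoT-SC): sample several chains of thought and majority-vote
# over their final answers. `sample_cot_answer` is a hypothetical LLM-sampling stub.
from collections import Counter

def self_consistent_answer(question, sample_cot_answer, n_samples=5):
    finals = []
    for seed in range(n_samples):
        chain, final_answer = sample_cot_answer(question, seed=seed)  # one sampled CoT + answer
        finals.append(final_answer.strip().lower())
    return Counter(finals).most_common(1)[0][0]        # the most consistent answer wins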
(4) Decoding Strategy Decoding strategies are another critical factor influencing the reliability
of model-generated responses. An appropriate decoding method can maintain the reliability and
diversity of a model’s response. Nucleus Sampling [133] samples from the smallest set of tokens whose
cumulative probability exceeds a preset threshold, balancing diversity and reliability. Building on this, Factual-
Nucleus Sampling [134] employs a dynamic, decaying threshold for token sampling, ensuring later
tokens are not influenced by earlier less factual tokens. Wan et al. [292] proposed a faithfulness-
aware decoding method to enhance the faithfulness of the beam-search approach by incorporating
a Ranker to reorder generated sequences and a lookahead method to avoid unfaithful tokens.
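The following toy sketch illustrates nucleus (top-p) sampling with a threshold that decays over decoding steps, in the spirit of Factual-Nucleus Sampling; the decay schedule and lower bound are arbitrary assumptions rather than the published hyperparameters.

# Toy sketch of nucleus (top-p) sampling with a decaying threshold (illustrative only).
import numpy as np

def nucleus_sample(probs, p, rng):
    order = np.argsort(probs)[::-1]                    # tokens sorted by probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1               # smallest prefix with mass >= p
    kept = order[:cutoff]
    return rng.choice(kept, p=probs[kept] / probs[kept].sum())

def decode(step_probs, p0=0.9, decay=0.95, p_min=0.3, seed=0):
    rng = np.random.default_rng(seed)
    tokens, p = [], p0
    for probs in step_probs:                           # one vocabulary distribution per step
        tokens.append(int(nucleus_sample(probs, p, rng)))
        p = max(p * decay, p_min)                      # later tokens are sampled more conservatively
    return tokens

# Example: four decoding steps over a toy 10-token vocabulary with uniform distributions.
sampled = decode([np.full(10, 0.1) for _ in range(4)])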
Apart from directly modifying the decoding method, several studies influence the decoding
distribution by leveraging hidden layer information. DoLa [37] uses distributional differences
between hidden and output layers to prioritize newly learned factual knowledge or key terms,
increasing their generation likelihood. Inference-Time Intervention (ITI) [147] identifies attention
heads strongly correlated with response correctness, adjusts their orientations, and moderates their
activation, achieving more truthful generation with minimal model interference. Shi et al. [251]
proposed CAD, comparing output distributions before and after adding extra information, reducing
reliance on the model’s own knowledge to avoid conflicts leading to inaccuracies.

4.1.3 Knowledge Updating. In real-life scenarios, information is constantly evolving, and therefore,
the GenIR system needs to continuously acquire the latest knowledge to meet users’ information
needs. Since the model’s knowledge storage is limited, knowledge updating is necessary to ensure
more reliable generated responses. In this section, we will discuss existing methods for knowledge
updating from two perspectives: incremental learning and knowledge editing.
(1) Incremental Learning Incremental learning refers to the ability of machine learning models
to continuously learn new skills and tasks while retaining previously acquired knowledge [301,


Table 2. Comparison of representative reliable response generation methods, considering model configu-
rations, specializations, and evaluations. For simplicity, "LM" stands for Language Modeling and "ODQA"
stands for Open-Domain Question Answering.

Method Backbone Parameters Trained Capability Evaluation Task
(Backbone, Parameters, and Trained describe the Model Configuration; Capability and Evaluation Task describe the Target Domain.)
GPT-3 [16] Transformer 175B ✓ General General Tasks (LM, QA, Reasoning, ...)
Llama-3.1 [59] Transformer 8B/70B/405B ✓ General General Tasks
Mistral [105] Transformer 7B/22B/123B ✓ General General Tasks
PaLM [34] Transformer 540B ✓ General General Tasks
FactTune [211] Llama-2 7B ✓ Factuality Domain-specific QA
GenRead [339] InstructGPT 175B × Factuality Knowledge-intensive Tasks
DoLa [37] LLaMA 7B-65B × Factuality Multi-choice QA, Open-ended Generation
RAG [139] BART 400M ✓ Factuality Knowledge-intensive Tasks
REPLUG [252] GPT-3 175B × Factuality LM, Multi-choice QA, ODQA
FLARE [111] GPT-3 175B × Factuality Knowledge-intensive Tasks
Self-RAG [5] Llama-2 7B/13B ✓ Factuality ODQA, Reasoning, Fact Check.
IR-CoT [287] GPT-3/Flan-T5 175B/11B × Factuality Multi-hop QA
ReAct [333] PaLM 540B × Tools Multi-hop QA, Fact Check., Decision Making
StructGPT [110] GPT-3/GPT-3.5 175B/- × Tools KG-based QA, Table-based QA, Text-to-SQL
ToolFormer [245] GPT-J 6B ✓ Tools LM, Math, QA, Temporal Tasks
ToolLLM [226] LLaMA 7B ✓ Tools Tool Use
HuggingGPT [250] GPT-3.5 - × Tools Various Complex AI Tasks
According to [314] GPT-3/Flan-T5/... 175B/11B/... × Accountability ODQA
IFL [129] GPT-J 6B ✓ Accountability Long-form QA
WebGPT [204] GPT-3 175B ✓ Accountability Long-form QA
WebBrain [223] BART 400M ✓ Accountability Long-form QA
RARR [70] PaLM 540B × Accountability ODQA, Reasoning, Conversational QA
SearChain [326] GPT-3.5 - × Accountability Knowledge-intensive Tasks
P2Bot [172] Transformer - ✓ Personalization Personalized Dialogue
P-Soups [98] Tulu 7B ✓ Personalization Personalized Dialogue
OPPU [274] Llama-2 7B ✓ Personalization Language Model Personalization Tasks
Zhongjing [11] Ziya-LLaMA 13B ✓ Healthcare Chinese Medical Dialogue
Mental-LLM [327] Alpaca/GPT-3.5/... 7B/-/... ✓/ × Healthcare Mental Health Reasoning Tasks
Edu-Chat [48] LLaMA 13B ✓ Education ODQA, Education Tasks

303, 321, 351]. In the GenIR system, it is crucial to enable the language model to memorize the
latest information while preventing the forgetting of previous knowledge.
One approach is Incremental Pre-training, which does not rely on supervised data but continues
pre-training on continuously updated corpora to alleviate catastrophic forgetting. For example,
Baidu proposed the ERNIE 2.0 framework [267], enhancing language understanding through
continuous multi-task learning. Jang et al. [100] introduced Continual Knowledge Learning (CKL)
to explore how LLMs update and retain knowledge amidst rapidly changing information, creating
benchmarks like FUAR. Cossu et al. [39] studied continual pre-training for language and vision,
finding that self-supervised or unsupervised methods are more effective in retaining previous
knowledge compared to supervised learning. Additionally, Ke et al. [119] proposed Domain Adaptive
Pre-training (DAP-training) to improve the model’s adaptability to new domains while preventing
forgetting using techniques like soft masking and contrastive learning. For domain-specific model
construction, Xie et al. [323] introduced FinPythia-6.9B, an efficient continual pre-training method
specifically designed for large-scale language models in the financial domain.
On the other hand, Incremental Fine-tuning utilizes only labeled data for training. Progressive
Prompts [236] appends new soft prompts for each new task, facilitating knowledge transfer and
reducing forgetting. DynaInst [201] enhances lifelong learning in pre-trained language models
through parameter regularization and experience replay, employing dynamic instance and task


selection for efficient learning under resource constraints. Jang et al. [99] challenge traditional multi-
task prompt fine-tuning by refining expert models on individual tasks. Suhr et al. [260] introduce
a feedback-driven continual learning approach for instruction-following agents, where natural
language feedback is converted into immediate rewards via contextual bandits to optimize learning.
O-LoRA [305] achieves superior continual learning by training new tasks in orthogonal low-rank
subspaces, significantly minimizing task interference. Peng et al. [216] propose a scalable language
model that dynamically adjusts parameters based on task requirements, effectively preventing the
forgetting of previously learned tasks.
(2) Knowledge Editing Knowledge editing refers to the process of modifying and updating
existing knowledge within language models [191, 303], distinct from incremental learning that
focuses on adapting to new domains or tasks. By editing the weights or layers of a model, knowledge
editing methods can correct erroneous facts and incorporate new knowledge, making it important
before deploying GenIR systems. There are primarily three paradigms for internal knowledge
editing within language models: adding trainable parameters, locate-then-edit, and meta-learning.
One method of Adding Trainable Parameters is by integrating new single neurons (patches) in
the final feed-forward neural network (FFN) layer, as in T-Patcher [94] and CaliNet [56], which
serve as trainable parameters to adjust the model’s behavior. Alternatively, discrete code-book
modules are introduced in the middle layers of the language model, as in GRACE [83], to adjust
and correct information.
Moreover, the Locate-then-Edit method first identifies the parameters corresponding to specific
knowledge and then updates these targeted parameters directly. Common techniques involve
identifying key-value pairs in the FFN matrix, known as "knowledge neurons," and updating
them [45]. Techniques like ROME [193] use causal mediation analysis to pinpoint areas needing
editing, and MEMIT [194] builds on ROME to implement synchronized editing in various scenarios.
Methods such as PMET [154] employ attention mechanisms for editing, while BIRD [182] introduces
a bidirectional inverse relation modeling approach.
Meta-Learning, another paradigm, uses hyper-networks to generate the necessary updates for
model editing. KE (Knowledge Editor) [17] predicts weight updates for each data point using a
hyper-network. MEND [199], by taking low-order decomposition of gradients as input, learns to
rapidly edit language models to enhance performance. Additionally, MALMEN [270] separates the
computations of hyper-networks and language models, facilitating the editing of multiple facts
under a limited memory budget. These meta-learning mechanisms enable models to swiftly adapt
to new knowledge and tasks. A detailed comparison of representative reliable response generation
methods is provided in Table 2.

4.2 External Knowledge Augmentation


Although large language models have demonstrated significant effectiveness in response genera-
tion, issues such as susceptibility to hallucinations, difficulty handling in-domain knowledge, and
challenges with knowledge updating persist. Augmenting the model’s generative process with
external knowledge sources can serve as an effective way to tackle these issues. Based on the form
of external knowledge employed, these approaches can be classified into retrieval augmentation
and tool augmentation. The blue area in Figure 5 provides an overview of this section.

4.2.1 Retrieval Augmentation. Retrieval-Augmented Generation (RAG) enhances the response of generative models by combining them with a retrieval mechanism [95, 139, 368]. By querying a
large collection of documents, information that is relevant to the input query can be fetched and
integrated into the input of the generative model. RAG enables generative models to be grounded in
existing reliable knowledge, significantly improving the reliability of model generation. Typically,


a RAG method involves a retriever and a generator. Based on the interaction flow between these
two, RAG methods can be divided into four categories [72].
(1) Sequential RAG: Sequential RAG operates on a linear progression, where the retriever first
retrieves relevant information and the generator utilizes this information to directly complete the
response generation process.
The basic form of sequential RAG is a “Retrieve-Read” framework [183], where early works
perform joint [14, 81, 139] or separate [95] training of retriever and generator but require costly
pre-training. In-Context RALM [234] addresses this by directly using retrieved documents as input,
leveraging the model’s in-context learning without additional training.
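A minimal sketch of this “Retrieve-Read” pipeline is shown below; retrieve and llm_generate are hypothetical placeholders for a retriever and a frozen instruction-following LLM, and the prompt template is only illustrative.

# Minimal "Retrieve-Read" sketch of sequential RAG (illustrative only).
# `retrieve` and `llm_generate` are hypothetical stand-ins for a real retriever / LLM API.
from typing import Callable, List

def sequential_rag(query: str,
                   retrieve: Callable[[str, int], List[str]],
                   llm_generate: Callable[[str], str],
                   top_k: int = 3) -> str:
    passages = retrieve(query, top_k)                  # 1) Retrieve the top-k passages
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = ("Answer the question using the passages below.\n\n"
              f"{context}\n\nQuestion: {query}\nAnswer:")
    return llm_generate(prompt)                        # 2) Read: the frozen LLM answers in context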
With the widespread adoption of LLMs, most subsequent works are built on the foundation of a
frozen generator. AAR [340] fine-tunes a general retriever to adapt to the information acquisition
preferences of the generative model. LLM-embedder [353] uses rewards produced by LLM to train
an embedding model dedicated to retrieval augmentation. ARL2 [350] leverages LLM to annotate
relevance scores in the training set and trains a retriever using contrastive learning.
Several works introduce pre-retrieval and post-retrieval processes [72] into the sequential pipeline
to enhance the overall efficiency. In the pre-retrieval process, the RRR model [183] introduces a
rewriter module before the retriever, trained using the generator’s feedback to enable the retrieval
system to provide more suitable information for generation.
In the post-retrieval process, information compressors are proposed to filter out irrelevant content
from documents, avoiding misleading the generator’s response [43, 114, 170]. RECOMP [325] uses
both abstractive and extractive compressors to generate concise summaries of retrieved documents.
LLMLingua [109] retains important tokens by calculating token importance based on the perplexity
provided by the generative model. LongLLMLingua [108] introduces query-aware compression
and reranks retrieved documents based on importance scores to alleviate the “lost in the middle”
phenomenon [170]. PRCA [329] employs reinforcement learning to train a text compressor adaptable
to black-box LLMs and various retrievers, serving as a versatile plug-in.
(2) Branching RAG: In the Branching RAG framework, the input query is processed across
multiple pipelines, and each pipeline may involve the entire process in the sequential pipeline.
The outputs from all pipelines are merged to form the final response, allowing for finer-grained
handling of the query or retrieval results.
In the pre-retrieval stage, TOC [123] uses few-shot prompting to recursively decompose complex
questions into clear sub-questions in a tree structure, retrieving relevant documents for each and
generating a comprehensive answer. BlendFilter [297] enhances the original query using prompts
with internal and external knowledge, retrieves related documents with the augmented queries,
and merges them for a comprehensive response.
In the post-retrieval stage, REPLUG [252] processes each retrieved document with the query
through the generator separately and combines the resulting probability distributions to form the
final prediction. GenRead [339] prompts LLM to generate related documents and merges them with
retrieved documents from the retriever as input, enhancing content coverage.
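The post-retrieval branching of REPLUG can be sketched as follows: the frozen generator is run once per retrieved document, and the per-document next-token distributions are mixed with weights derived from the retrieval scores. The softmax weighting and the next_token_probs stub below are illustrative assumptions.

# Toy sketch of REPLUG-style branching: mix per-document next-token distributions.
# `next_token_probs` is a hypothetical per-document LM call; everything here is illustrative.
import numpy as np

def ensemble_next_token(query, documents, retrieval_scores, next_token_probs):
    scores = np.asarray(retrieval_scores, dtype=float)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # normalize retrieval scores
    mixed = None
    for doc, w in zip(documents, weights):
        probs = next_token_probs(doc, query)           # distribution over the vocabulary
        mixed = w * probs if mixed is None else mixed + w * probs
    return mixed                                       # combined prediction over the vocabulary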
(3) Conditional RAG: The Conditional RAG framework adapts to various query types through
distinct processes, improving the system’s flexibility. Since there can be knowledge conflict between
the knowledge from retrieved documents and the generator’s own knowledge, RAG’s effectiveness
isn’t consistent across all scenarios. To address this, common conditional RAG methods include a
decision-making module that determines whether to engage the retrieval process for each query.
SKR [308] trains a binary classifier on a dataset of questions LLMs can or cannot answer,
determining at inference whether to use retrieval. Training labels are obtained by prompting
the model to assess if external knowledge is needed. Self-DC [296] uses the model’s confidence
score to decide on retrieval necessity, categorizing queries into unknown, uncertain, and known,


with unknown queries processed through sequential RAG and uncertain ones decomposed into
sub-questions. Rowen [52] introduces a multilingual detection module that perturbs the original
question and measures response consistency to decide on retrieval.
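A toy sketch of such a decision module, in the spirit of SKR and Self-DC, is given below: a confidence estimate routes each query either to the model’s parametric answer or to a retrieval-augmented pipeline. The thresholds and all callables are hypothetical.

# Illustrative sketch of conditional RAG routing (in the spirit of SKR / Self-DC).
# `llm_answer_with_confidence`, `sequential_rag`, and the thresholds are hypothetical.
def conditional_rag(query, llm_answer_with_confidence, sequential_rag,
                    known_threshold=0.8, unknown_threshold=0.4):
    answer, confidence = llm_answer_with_confidence(query)
    if confidence >= known_threshold:
        return answer                     # "known": trust parametric knowledge, skip retrieval
    if confidence < unknown_threshold:
        return sequential_rag(query)      # "unknown": fall back to retrieval augmentation
    # "uncertain": real systems may decompose the query into sub-questions;
    # for brevity we simply retrieve here as well.
    return sequential_rag(query)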
(4) Loop RAG: Loop RAG involves deep interactions between the retriever and generator com-
ponents. Owing to multi-turn retrieval and generation processes, accompanied by comprehensive
interactions, it excels at handling complex and diverse input queries, yielding superior results in
response generation.
ITER-RETGEN [248] introduces an iterative framework alternating between retrieval-augmented
generation and generation-augmented retrieval, repeating this process to produce the final answer.
IR-COT [287] follows a similar procedure to ITER-RETGEN but the iteration pauses based on the
model’s own generative process. FLARE [111] conducts concurrent retrieval during generation,
evaluating the need for retrieval at each new sentence based on the LLM’s confidence score, dynam-
ically supplementing information to enhance content reliability. COG [128] models generation as
continual retrieval and copying of segments from an external corpus, with the generator producing
conjunctions to maintain fluency. Self-RAG [5] adds special tokens into the vocabulary, allowing
the generator to decide on retrieval, document importance, and whether to perform a critique.
Some works focus on deconstructing complex inquiries into sub-questions, addressing these
individually to produce a more dependable response. [221] guides LLM to decompose complex
questions into sub-questions, answer each using retrieved results, and synthesize the answers;
RET-Robust [338] builds upon this by incorporating an NLI model to verify retrieved documents
support the sub-question answers, reducing misinformation.
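The iterative flavor of loop RAG can be summarized with the toy sketch below, which alternates retrieval and generation until the draft answer stops changing; real systems such as ITER-RETGEN and FLARE rely on model confidence or special tokens rather than this simplistic convergence check, and all callables are placeholders.

# Illustrative loop-RAG sketch: alternate retrieval and generation for a few rounds.
def loop_rag(query, retrieve, llm_generate, max_rounds=3):
    answer = ""
    for _ in range(max_rounds):
        # Generation-augmented retrieval: the previous draft enriches the search query.
        passages = retrieve(f"{query} {answer}".strip(), 3)
        context = "\n".join(passages)
        new_answer = llm_generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
        if new_answer == answer:           # simplistic convergence check
            break
        answer = new_answer
    return answer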

4.2.2 Tool Augmentation. Although retrieval-augmented techniques have significantly improved upon the blind spots of a generator's self-knowledge, these methods struggle with the rapid and
flexible update of information since they rely on the existence of information within an external
corpus of documents. Tool augmentation, on the other hand, excels in addressing this issue by
invoking various tools that allow for the timely acquisition and usage of the latest data, including
finance, news, and more. Moreover, tool augmentation expands the scope of responses a model can
offer, such as language translation, image generation, and other tasks, to more comprehensively
meet users’ information retrieval needs.
There are four categories of tools that can be utilized to construct a more reliable information
retrieval system:
(1) Search Engine: Common search engine tools like Google Search and Bing Search help
answer frequent and time-sensitive queries effectively. Self-Ask [221] initially decomposes complex
questions into multiple sub-questions, then uses a search engine to answer each sub-question, and
finally generates a comprehensive answer to the complex question. ReAct [333] embeds search
engine calls into the model’s reasoning process, allowing the generative model to determine when
to make calls and what queries to input for more flexible reasoning. New Bing can automatically
search relevant information from Bing based on user input, yielding reliable and detailed answers,
including citation annotations in the generated content.
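A highly simplified sketch of such a reasoning-and-search loop is shown below; the Search[...] / Finish[...] action format and the callables are illustrative assumptions and do not reproduce the exact ReAct prompt or parser.

# Simplified ReAct-style loop: the model emits "Search[...]" actions or a final "Finish[...]".
# The action syntax and all callables are illustrative assumptions.
import re

def react_search_agent(question, llm_generate, web_search, max_steps=5):
    trace = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm_generate(trace + "Thought:")        # model continues with a thought + action
        trace += f"Thought:{step}\n"
        search = re.search(r"Search\[(.*?)\]", step)
        if search:
            observation = web_search(search.group(1))  # call the search engine tool
            trace += f"Observation: {observation}\n"
            continue
        finish = re.search(r"Finish\[(.*?)\]", step)
        if finish:
            return finish.group(1)
    return "No answer found within the step budget."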
Some works have also built advanced conversational systems based on tools like search engines.
Internet-Augmented Generation [125] enhances the quality of conversational replies by using search
engines during conversations. LaMDA [282] and BlenderBot [253] combine search engines with
conversational agents, constantly accessing internet information to enrich conversation factualness.
WebGPT [204] and WebCPM [225] directly teach models to perform human-like browser operations
by generating commands such as Search, Click, and Quote, facilitating the automated retrieval and
acquisition of information.


(2) Knowledge Graph (KG): Compared to search engines, KGs are particularly useful for
extracting structured, explicit knowledge. Relevant knowledge from a knowledge graph can be
extracted and used as a prompt input to enhance the generative process [262]. StructGPT [110]
introduces an iterative reading-and-reasoning framework where the model can access a knowledge
graph through a well-designed interface, continually acquiring information and reasoning until an
answer is obtained. RoG [181] generates plausible reasoning paths from a KG, executes them in
parallel, and integrates outcomes for a final answer; ToG [262] allows the model to explore entities
and links without pre-planning paths, continuously assessing reasoning feasibility.
(3) API-based Tools: An important part of the tools is the real-world APIs, which enable the
model to obtain information from specific data sources, such as real-time stock information, movie
services, code interpreters, and so on. However, the multitude and diversity of APIs, coupled with
the adherence to certain operational protocols, make the teaching of API usage to models a focal
point of this area.
Toolformer [245] trains language models in a self-supervised manner to automatically call
APIs when needed, using prompts to generate API calls, executing them, and filtering ineffective
ones to form the dataset. Training with standard language modeling objectives yields models
that can autonomously invoke APIs across tasks without losing language modeling capabilities.
RestGPT [258] formulates a framework prompting LLMs to invoke RESTful APIs, comprising an
online planner, an API selector, and an executor. ToolLLM [226] uses a large corpus of scraped
APIs to build a dataset for fine-tuning. Gorilla [214] introduces an information retriever providing
the model with reference API documentation, facilitating retrieval-based information utilization
during fine-tuning. ToolkenGPT [82] incorporates each tool as a new token into the vocabulary,
enabling the model to invoke APIs during inference as naturally as generating text.
Beyond learning to invoke APIs, CREATOR [222] prompts models to write code based on actual
problems as new tool implementations, with generated tools functioning through a code interpreter
and demonstrating impressive results on complex tasks.
Some works additionally support multimodal inputs, broadening the application scope of the
models. AssistGPT [69] offers a framework including modules like Planner, Executor, Inspector, and
Learner, utilizing language and code for intricate inference. ViperGPT [269] feeds CodeX with user
queries and visual API information to generate Python code invoking APIs, successfully completing
complex visual tasks.
(4) Model-based Tools: With the swift expansion of diverse AI communities (e.g., Huggingface,
ModelScope, GitHub), various types of AI models have become readily accessible for use, serving
as a pivotal tool in enhancing generative retrieval systems. These AI models encompass a wide
array of tasks, each accompanied by comprehensive model descriptions and usage examples.
HuggingGPT [250] employs ChatGPT as a controller to deconstruct user queries into tasks,
determining which models to invoke for execution. Similarly, Visual ChatGPT [317] integrates a
visual foundation model with LLMs, leveraging ChatGPT as a prompt manager to mobilize visual
foundation models like BLIP [145] and ControlNet [349], adept at processing image-based requests
efficiently compared to multi-modal models.

4.3 Generating Response with Citation


To build a reliable GenIR system, generating responses with citations is a promising approach [88,
168, 195]. Citations allow users to clearly understand the source of each piece of knowledge in the
response, enhancing trust and facilitating widespread adoption. Existing methods can be divided
into directly generating responses with citations and using a retrieval module to enhance the
generated content. Refer to the green portion in Figure 6 for an overview of this section.


4.3.1 Direct Generating Response with Citation. This method uses the model’s intrinsic memory to generate source citations without relying on a retrieval module.
(1) Model Intrinsic Knowledge. Leveraging the capabilities of the language model itself, according-to prompting [314] guides LLMs to more accurately cite information from pre-training data by adding phrases like "according to Wikipedia" in the prompts.
To improve citation quality, Iterative Feedback Learning (IFL) [129] employs a critique model to assess and provide feedback on generated text, iteratively enhancing LLMs’ citation accuracy, content correctness, and fluency. Additionally, Fierro et al. [63] introduce a plan-based approach using a series of questions as a blueprint for content generation, with abstract and extractive attribution models, showing that planning improves citation quality.
Fig. 6. Generating response with citation and personal information assistant are also crucial approaches for building a reliable and user-centric GenIR system.
(2) Incorporating Generative Retrieval. As envisioned by Metzler et al. [195], allowing the
model to directly generate responses with citations is a promising approach for building an expert-
level reliable IR system. Users receive reliable responses tailored to their needs without searching
through returned documents. Moreover, the cited document is generated by the model through the
generative retrieval approach described in Section 3, directly producing corresponding DocIDs.
Utilizing generative retrieval, 1-PAGER [97] combines answer generation and evidence retrieval
by generating N-gram DocIDs through constrained decoding using FM-Index [62], enabling step-
by-step corpus partitioning, document selection, and response generation. This method matches
retrieval-then-read methods in accuracy and surpasses closed-book QA models by attributing
predictions to specific evidence, offering a new scheme for integrating retrieval into seq2seq
generation.
Recently, [122] proposes a source-aware training method where models learn to associate DocIDs
with knowledge during pre-training and provide supporting citations during instruction tuning,
effectively achieving knowledge attribution and enhancing LLM verifiability.

4.3.2 Retrieval-based Response with Citation. To enhance the accuracy of citations, several methods
have been developed based on retrieval techniques to fetch relevant documents, thereby improving
the quality of responses with embedded citations.
(1) Citation within Generation. Following retrieval, models directly generate responses that
include citations. Initially, systems like WebGPT [204], LaMDA [282], and WebBrain [223] utilized
web pages or Wikipedia to construct large-scale pre-training datasets, teaching models how to
generate responses with citations.
Subsequently, more advanced strategies for citation generation were proposed. For instance,
Search-in-the-Chain (SearChain) [326] first generates a reasoning chain (Chain-of-Query, CoQ) via
LLM prompts, then interacts with each CoQ node using retrieval for verification and completion,
ultimately generating the reasoning process and marking citations at each inference step.
LLatrieval [156] suggests continuously improving retrieval results through iterative updating, verifying at each round whether the retrieved documents sufficiently support the generated answers. AGREE [335]


uses a Natural Language Inference (NLI) model to verify consistency between LLM-generated
answers and retrieved documents, employing a Test-Time Adaptation (TTA) strategy that allows
LLMs to actively search and cite current information during generation, enhancing response accu-
racy and reliability. VTG [261] integrates an evolved memory system and a dual-layer validator
for generating verifiable text, combining long-term and short-term memories to adapt to dynamic
content, and uses an NLI model to evaluate logical support between claims and evidence.
Based on the Graph of Thoughts (GoT) [12], HGOT [60] improves context learning in retrieval-
augmented settings by constructing a hierarchical GoT, leveraging the LLM’s planning capabilities
to break down complex queries into smaller sub-queries and introducing a scoring mechanism to
assess the quality of retrieved paragraphs.
Employing reinforcement learning, Huang et al. [87] introduce a fine-grained reward mechanism
to train language models, allocating specific rewards for each generated sentence and citation
to teach models accurate external source citation. This approach uses rejection sampling and
reinforcement learning algorithms to enhance citation-inclusive text generation through localized
reward signals. APO [142] reimagines attributive text generation as a preference learning problem,
automatically generating preference data pairs to reduce annotation costs, and uses progressive
preference optimization and experience replay to reinforce preference signals without overfitting
or text degradation.
(2) Citation after Generation. This approach involves models first generating a response, then
adding citations through models like NLI. RARR [70] improves attributability by automatically
finding external evidence for the language model’s output and post-editing to correct content while
preserving the original output, enhancing attribution capabilities without altering the existing
model. PURR [24] employs an unsupervised learning method where LLMs generate text noise
themselves, then trains an editor to eliminate this noise, improving attribution performance and
significantly speeding up generation. CEG [150] searches for supporting documents related to
generated content and uses an NLI-based citation generation module to ensure each statement is
supported by citations. "Attribute First, then Generate" [256] decomposes the generation process,
first selecting relevant source text details and then generating based on these details, achieving
localized attributability with each sentence supported by a clear source, greatly reducing manual
fact-checking workload.
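A minimal sketch of this citation-after-generation pattern is given below: each generated sentence is checked against retrieved evidence with an NLI model, and citation markers are attached only when entailment holds. The retrieve and nli_entails callables are hypothetical stubs, and real systems additionally post-edit unsupported statements rather than merely flagging them.

# Sketch of post-hoc citation attachment: verify each sentence with NLI, then cite.
# All callables are hypothetical placeholders; purely illustrative.
import re

def add_citations(draft, retrieve, nli_entails, k=3):
    cited = []
    for sent in re.split(r"(?<=[.!?])\s+", draft.strip()):
        evidence = retrieve(sent, k)                   # candidate supporting passages
        support = [i for i, passage in enumerate(evidence, 1)
                   if nli_entails(premise=passage, hypothesis=sent)]
        marker = "".join(f"[{i}]" for i in support) if support else " [unsupported]"
        cited.append(sent + marker)
    return " ".join(cited)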

4.4 Personal Information Assistant


The core of the GenIR system is the user, so understanding user intent is crucial. Researchers
have explored various methods like personalized search [302, 364, 373], dialogue [172, 184, 354],
and recommender [46, 169, 361] systems to explore users’ interests. Specifically, personalized
information assistants aim to better understand users’ personalities and preferences, generating
personalized responses to better meet their information needs. This section reviews the progress in
research on personalized dialogue and domain-specific personalization. An overview of this section
is provided in the blue area of Figure 6.

4.4.1 Personalized Dialogue System. To better understand user needs, researchers have explored
two main approaches: personalized prompt design and model fine-tuning.
(1) Personalized Prompt. For personalized prompt design, Liu et al. [169] and Dai et al. [46]
input users’ interaction and rating history into ChatGPT [209] for in-context learning, effectively
generating personalized responses. LaMP [240] enhances the language model’s personalized output
by retrieving personalized history from user profiles. Using long-term history, [35] designs prompts
describing users’ long-term interests, needs, and goals for input into LLMs. BookGPT [361] uses LLM
prompts, interactive querying methods, and result verification frameworks to obtain personalized


book recommendations. PerSE [294] infers preferences from several reviews by a specific reviewer
and provides personalized evaluations for new story inputs.
Using prompt rewriting, [140] proposes a method combining supervised and reinforcement
learning to better generate responses from frozen LLMs. Similarly, [31] rewrites user input prompts
using extensive user text-to-image interaction history to align better with expected visual outputs.
(2) Personalized Fine-tuning. This line of work focuses on fine-tuning models for personalized
response generation. Zhang et al. [354] introduced the Persona-Chat dataset, which pairs dialogues with
speaker personas, to train models for more personalized dialogues. Mazaré et al. [190] created a dataset of over 700
million conversations extracted from Reddit, demonstrating the effectiveness of training dialogue
models on large-scale personal profiles. P2Bot [172] generates personalized and consistent dialogues
by simulating the perception of personalities between conversation participants. DHAP [184]
designs a novel Transformer structure to automatically learn implicit user profiles from dialogue
history without explicit personal information. Wu et al. [322] propose a generative segmentation
memory network to integrate diverse personal information. Fu et al. [67] developed a variational
approach to model the relationship between personal memory and knowledge selection, with a
bidirectional learning mechanism.
Using reinforcement learning, Cheng et al. [32] collected a domain-specific preference (DSP)
dataset and proposed a three-stage reward model learning scheme, including base model training,
general preference fine-tuning, and customized preference fine-tuning. Jang et al. [98] developed
"Personalized Soups," first optimizing multiple policy models with different preferences using
PPO [246], then dynamically combining parameters during inference.
Using retrieval-enhanced methods, LAPDOG [91] retrieves relevant information from story
documents to enhance personal profiles and generate better personalized responses. SAFARI [295]
leverages LLMs’ planning and knowledge integration to generate responses consistent with char-
acter settings. Inspired by writing education, Li et al. [141] proposed a multi-stage, multi-task
framework including retrieval, ranking, summarization, synthesis, and generation to teach LLMs
personalized responses. For subjective tasks, [316] studied the superior performance of personalized
fine-tuning in subjective text perception tasks compared to non-personalized models.
To achieve a personalized information assistant for every user, OPPU [274] uses personalized
PEFT [53] to store user-specific behavioral patterns and preferences, showing superior performance.
For multimodal scenarios, PMG [249] proposes a personalized multi-modal generation method that
transforms user behavior into natural language, allowing LLMs to understand and extract user
preferences.

4.4.2 Domain-specific Personalization. By understanding users’ personalized information needs, the GenIR system has broad applications across various domains such as healthcare, academia, education, and recipes.
(1) Healthcare. In AI-assisted healthcare, personalization plays a crucial role. Liu et al. [175]
utilize few-shot tuning to process time-series physiological and behavioral data. Zhang et al. [347]
implement medical diagnosis identification and diagnostic assistance using prompts from Chat-
GPT [209] and GPT-4 [2]. Yang et al. [11] propose an LLM for traditional Chinese medicine called
Zhongjing, based on LLaMA [285], undergoing pre-training, supervised fine-tuning, and RLHF [36].
Abbasian et al. [1] introduce an open-source LLM-based conversational health agent framework
called openCHA, which collects necessary information through specific actions and generates
personalized responses. MedAgents [277] propose a multidisciplinary collaboration framework
where LLM-based agents engage in multi-round cooperative discussions to enhance expertise and
reasoning.


For mental healthcare, Mental-LLM [327] presents a framework using LLMs to predict mental
health from social media text data, with prompting-based and fine-tuning methods for real-time
monitoring of issues like depression and anxiety. Lai et al. [127] introduce Psy-LLM, a psychological
consultation aid combining pre-trained LLMs with real psychologist Q&As and psychological
articles.
For medication suggestions, Liu et al. [177] propose PharmacyGPT, a framework for generating
personalized patient groups, formulating medication plans, and predicting outcomes.
(2) Academic. In the academic domain, RevGAN [149] can automatically generate controllable
and personalized user reviews based on users’ emotional tendencies and stylistic information. For
writing assistants, Porsdam et al. [219] explore personalized enhancement of academic writing
using LLMs like GPT-3 [16], showing higher quality after training with authors’ published works.
Similarly, to address the lack of personalized outputs in LLMs, Mysore et al. [202] propose Pearl,
a personalized LLM writing assistant trained on users’ historical documents, and develop a KL
divergence training objective for retrievers.
(3) Education. Cui et al. [44] propose an adaptive and personalized exercise generation method
that adjusts difficulty to match students’ progress by combining knowledge tracing and controlled
text generation. EduChat [48] learns education-specific functionalities through pre-training on
educational corpora and fine-tuning on customized instructions, addressing delayed knowledge
updates and lack of expertise in LLMs.
(4) Other Domains. For recipe generation tasks, Majumder et al. [186] propose a personalized
generation model based on users’ historical recipe consumption, enhancing personalization. For
personalized headline generation, Zhang et al. [348] simulate users’ interests based on browsing
history to generate news headlines. Salemi et al. [240] propose the LaMP benchmark, including
personalized generation tasks like news headline, academic title, email subject, and tweet rewriting.
Additionally, for personalized assistance with home cleaning robots, TidyBot [318] uses LLMs to
generalize from user examples to infer user preference rules.

5 EVALUATION
This section will provide a range of evaluation metrics and benchmarks for generative information
retrieval methods, along with analysis and discussions on their performance.

5.1 Evaluation for Generative Document Retrieval


5.1.1 Metrics. In this section, we discuss several core metrics for evaluating Generative Retrieval
(GR) methods. These metrics provide different perspectives on the effectiveness of a GR system,
including its accuracy, efficiency, and the relevance of its results. Specifically, we consider Recall,
R-Precision, Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), and Normalized Dis-
counted Cumulative Gain (nDCG). Each metric captures unique aspects of retrieval performance,
allowing for a comprehensive assessment of the system’s capabilities.
• Recall measures the proportion of relevant documents retrieved by the search system, reflecting
its ability to find all relevant items.
• R-Precision evaluates the precision at a rank position corresponding to the number of relevant
documents, balancing precision and recall at a specific cutoff.
• Mean Reciprocal Rank (MRR) captures the average rank position of the first relevant document,
emphasizing the system’s ability to return relevant results early in the ranking.
• Mean Average Precision (MAP) calculates the average precision across multiple queries,
considering the exact positions of all relevant documents and providing a comprehensive measure
of retrieval accuracy.


• Normalized Discounted Cumulative Gain (nDCG) takes into account not only the relevance
of the documents returned but also their positions in the result list, reflecting both the quality
and the ordering of the results.

For detailed mathematical formulations of these metrics, please refer to Appendix A.1.
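To make these definitions concrete, the following minimal Python sketch computes Recall@k and MRR over per-query ranked DocID lists; the query identifiers, DocIDs, and relevance judgments are purely illustrative and are not drawn from any benchmark.

```python
# Toy sketch of Recall@k and MRR over ranked DocID lists (illustrative data only).
def recall_at_k(ranked, relevant, k):
    # Fraction of relevant DocIDs that appear in the top-k results
    return len(set(ranked[:k]) & relevant) / max(len(relevant), 1)

def reciprocal_rank(ranked, relevant, k):
    # Inverse rank of the first relevant DocID within the top-k results
    for rank, docid in enumerate(ranked[:k], start=1):
        if docid in relevant:
            return 1.0 / rank
    return 0.0

runs = {"q1": ["d3", "d7", "d1"], "q2": ["d5", "d2", "d9"]}   # model's ranked DocIDs
qrels = {"q1": {"d1"}, "q2": {"d2", "d8"}}                     # relevance judgments

recall_10 = sum(recall_at_k(runs[q], qrels[q], 10) for q in runs) / len(runs)
mrr_10 = sum(reciprocal_rank(runs[q], qrels[q], 10) for q in runs) / len(runs)
print(f"Recall@10 = {recall_10:.3f}, MRR@10 = {mrr_10:.3f}")   # 0.750, 0.417
```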

5.1.2 Benchmarks. Evaluating the effectiveness of GR methods relies on high-quality and chal-
lenging benchmark datasets.
MS MARCO [205] is a large-scale dataset designed for machine reading comprehension, retrieval,
and question-answering tasks in web search environments. It contains millions of documents and
passages derived from real user queries, providing a realistic benchmark for assessing GR systems
in complex search scenarios.
Natural Questions (NQ) [126] is a question-answering dataset introduced by Google, utilizing
Wikipedia as its primary corpus. It includes a vast number of natural user queries and their
corresponding answers, making it suitable for evaluating the retrieval performance of GR systems
in addressing real-world informational needs.
KILT (Knowledge-Intensive Language Tasks) [218] is a comprehensive benchmark integrat-
ing multiple categories of knowledge-intensive tasks such as fact checking, entity linking, slot
filling, open-domain QA, and dialogue. Utilizing Wikipedia as its corpus, KILT aims to evaluate
the effectiveness of information retrieval systems in handling complex language tasks that require
extensive background knowledge.
TREC Deep Learning Track 2019 & 2020 [41, 42] focus on leveraging deep learning techniques
to enhance information retrieval efficiency, primarily through document and passage ranking tasks.
These tracks utilize the MS MARCO dataset to emulate real-world search queries, providing a
standardized environment for benchmarking various retrieval methodologies.
DynamicIR. For dynamic corpora, DynamicIR [337] proposes a task framework based on the
StreamingQA [167] benchmark for evaluating IR models within dynamically updated corpora.
Through experimental analysis, DynamicIR revealed that GR systems are superior in adapting to
evolving knowledge, handling temporally informed data, and are more efficient in terms of memory,
indexing time, and FLOPs compared to dense retrieval systems.
ExcluIR. For exclusionary retrieval tasks, where users explicitly indicate in their queries that
they do not want certain information, ExcluIR [356] provides a set of resources. This includes
an evaluation benchmark and a training set to help retrieval models understand and process
exclusionary queries.
For detailed descriptions and comprehensive information about benchmark datasets, please refer
to Appendix A.2.

5.1.3 Analysis. In addition to the benchmarks and metrics for evaluating the performance of GR
methods, there is a series of works that have conducted detailed analyses and discussions to study
the behavior of GR models.
Understanding Generative Retrieval. To understand the performance of DSI [281] in text
retrieval, Chen et al. [29] examines uniqueness, completeness, and relevance ordering. These
respectively reflect the system’s ability to distinguish between different documents, retrieve all
relevant documents, and accurately rank documents by relevance. Experimental analysis finds
that DSI excels in remembering the mapping from pseudo queries to DocIDs, indicating a strong
capability to recall specific DocIDs from particular queries. However, the study also points out
DSI’s deficiency in distinguishing relevant documents from random ones, negatively impacting its
retrieval effectiveness.


Exploring the connection between generative and dense retrieval, [206] demonstrates that generative
retrieval can be viewed as a form of the bi-encoder architecture used in dense retrieval. Specifically, the authors analyze the computation
of dot products during the generative retrieval process, which is similar to the calculation of dot
products between query vectors and document vectors in dense retrieval. Following this, [319]
revisits generative retrieval from the perspective of multi-vector dense retrieval (MVDR), revealing a
common framework in computing document-query relevance between the two methods. This work
also analyzes their differences in document encoding and alignment strategies, further confirming
through experiments the phenomenon of term matching in the alignment matrices and their
commonalities in retrieval.
Large-scale Experimental Analysis. Later, Pradeep et al. [220] conduct the first comprehensive
experimental study on GR techniques over large document sets, such as the 8.8M MS MARCO
passages. They find that, among all the techniques examined, using generated pseudo queries to
augment training data remains the only effective method on large document corpora. The strongest
result in the experiments was achieved by a training task that used only synthetic queries mapped
to naive DocIDs, scaling the model to T5-XL (3B parameters) to reach an MRR@10 of
26.7. Surprisingly, increasing the parameters to T5-XXL (11B) in the same setup did not improve
performance but rather led to a decline. These findings suggest that more research and in-depth
analysis are needed in the GR field, and possibly additional improvements to the paradigm, to fully
leverage the power of larger language models.
Out-of-distribution Perspective. For out-of-distribution (OOD) robustness of GR models, Liu
et al. [176] investigate three aspects: query variations, new query types, and new tasks. Their study
showed that all types of retrieval models suffer from performance drops with query variations,
indicating sensitivity to query quality and structure. However, when dealing with new query types
and tasks, GR models showed different levels of adaptability, with pre-training enhancing their
flexibility. The research highlights the critical need for OOD robustness in GR models for dealing
with ever-changing real-world information sources.

5.1.4 Experiments. Analyzing experimental results is essential for understanding the performance
of different GR models. This section provides a comprehensive evaluation of current GR models on
widely used benchmark tests and examines their applicability and limitations in scenarios such as
web search, question answering, and knowledge-intensive tasks. The overall results are presented
in Table 3 and Table 4.
Experimental Settings. Our evaluation is based on the MS MARCO [205], NQ [126], and
KILT [218] benchmarks, which are commonly used datasets for existing GR methods. For the MS
MARCO dataset, following previous works [265, 352, 371], we use the MS MARCO 300K subset,
which contains 320k documents, 360k training instances, and 772 testing instances. For the NQ
dataset, following [135, 265, 281, 307, 352], we use the NQ320K subset, which, after deduplication
based on titles, contains 109k documents, 320k training instances, and 7,830 testing instances. For
the KILT benchmark, we use the standard development sets. Detailed statistics are available in
previous works [28, 218].
Regarding evaluation metrics, we employ Recall@{1, 10, 100} and MRR@{10, 100} for the MS
MARCO and NQ datasets, and R-Precision for the KILT benchmark. In our comparisons, we include
not only existing representative GR methods but also sparse retrieval methods such as BM25 [238]
and SPLADEv2 [64], which are based on bag-of-words representations, and dense retrieval methods
like DPR [117], GTR [207], RAG [139] and MT-DPR [185], which rely on dense embeddings.
Due to variations in datasets, corpus sizes, and evaluation metrics across different methods,
alignment is necessary for a fair comparison. For the methods evaluated in our experiments, we


Table 3. Overall retrieval performance on the MS MARCO (300K) and Natural Questions (320K) Datasets.
The best results are Bold and the second-best are Underlined. Symbol "*" indicates results from our own
implementation, while other results are consistent with those reported in existing papers.

Model | Doc Rep. | MS MARCO: R@1, R@10, R@100, M@10, M@100 | Natural Questions (NQ): R@1, R@10, R@100, M@10, M@100
Sparse&Dense Retrieval
BM25 [265] Bag-of-words 0.196 0.591 0.861 0.313 0.325 0.297 0.603 0.821 - 0.402
SPLADEv2 [352] Bag-of-words 0.328 0.779 0.956 0.443 0.452 0.624 0.873 0.954 0.726 0.731
DPR [265] Dense Vector 0.271 0.764 0.948 0.424 0.433 0.502 0.777 0.909 - 0.489
GTR-Base [265] Dense Vector 0.332 0.793 0.960 0.484 0.485 0.560 0.844 0.937 - 0.662
Generative Retrieval
GENRE [352] Title 0.266 0.579 0.751 0.361 0.368 0.591 0.756 0.814 0.653 0.656
DSI [352] Semantic ID 0.257 0.538 0.692 0.339 0.346 0.533 0.715 0.816 0.594 0.598
DSI-QG [265, 371] Semantic ID 0.288 0.623 - 0.385 - 0.631 0.807 0.880 - 0.695
NCI [265] Semantic ID 0.301 0.643 0.851 0.408 - 0.659 0.852 0.924 - 0.731
SEAL [265] Sub-string 0.259 0.686 0.879 0.393 0.402 0.570 0.800 0.914 - 0.655
Ultron [352] Title+URL 0.304 0.676 0.794 0.432 0.437 0.654 0.854 0.911 0.726 0.729
GenRet [265] Learnable - - - - - 0.681 0.888 0.952 - 0.759
MINDER [352] Multi-view 0.289 0.728 0.916 0.431 0.435 0.627 0.869 0.933 0.709 0.713
LTRGR* Multi-view 0.327 0.759 0.929 0.463 0.469 0.644 0.879 0.941 0.721 0.726
GLEN [135] Learnable - - - - - 0.691 0.860 - - 0.754
TSGen [352] Term Set 0.384 0.781 0.931 0.502 0.505 0.708 0.889 0.948 0.771 0.774
NOVO [311] Term Set - - - - - 0.693 0.897 0.959 - 0.767
DGR* Multi-view 0.359 0.779 0.934 0.498 0.504 0.682 0.887 0.949 0.759 0.764

primarily use results reported in existing papers. For methods where settings are not aligned, we
provide results based on our own implementations.
Results on MS MARCO and NQ Datasets. MS MARCO and Natural Questions (NQ) are among
the most widely used benchmarks for evaluating generative retrieval (GR) methods, particularly
in the contexts of web search and question answering. Table 3 presents a detailed comparison of
various GR models against traditional sparse and dense retrieval methods on these datasets.
(1) Overall Performance Comparison. Overall, GR methods demonstrate competitive performance
compared to sparse and dense retrieval baselines. Specifically, on the MS MARCO dataset, GR
models such as TSGen and DGR achieve Recall@1 scores of 0.384 and 0.359 respectively, surpassing
dense methods like DPR (0.271) and being comparable to SPLADEv2 (0.328). On the NQ dataset,
GR models also show strong performance, with TSGen attaining the highest Recall@1 of 0.708,
outperforming both SPLADEv2 (0.624) and DPR (0.502).
(2) Term Set DocID Methods. Analyzing models that utilize term set-based document identifiers,
such as TSGen and NOVO, reveals that these methods excel in both datasets. TSGen leads with the
highest Recall@1 and MRR@10 on both MS MARCO and NQ, indicating robust retrieval
capabilities. NOVO also performs exceptionally well on the NQ dataset, achieving the second-best
Recall@1 and MRR@100, demonstrating the effectiveness of term set representations in capturing
relevant document information.
(3) Multi-view DocID Methods. Multi-view approaches, exemplified by MINDER, LTRGR, and
DGR, show consistent improvements across several metrics. For instance, LTRGR achieves a strong
Recall@10 on MS MARCO (0.759) and maintains solid performance across other metrics and on
the NQ dataset. These results suggest that leveraging multi-view DocIDs, ranking and distillation
training methods enhances retrieval effectiveness by capturing diverse aspects of the documents.


Table 4. Overall retrieval performance on the KILT Benchmark. The best results are Bold and the second-best
are Underlined. Symbol "*" indicates results from our own implementation, while other results are consistent
with those reported in existing papers.

Model | Doc Rep. | FC: FEVER | Entity Linking: AY2, WnWi, WnCw | Slot Filling: TREx, zsRE | Open Domain QA: NQ, HoPo, TQA, ELI5 | Dial.: WoW
Sparse&Dense Retrieval
BM25 [185] Bag-of-words 0.501 0.035 - - 0.586 0.664 0.258 0.440 0.294 - 0.275
RAG [218] Dense Vector 0.635 0.774 0.490 0.467 0.293 0.654 0.603 0.308 0.493 0.104 0.467
MT-DPR [185] Dense Vector 0.747 0.838 - - 0.692 0.772 0.615 0.442 0.620 - 0.397
Generative Retrieval
BART* Semantic ID 0.003 0.001 0.000 0.000 0.000 0.001 0.000 0.000 0.000 0.000 0.000
BART [28] Title 0.819 0.892 0.676 0.623 0.752 0.911 0.586 0.487 0.676 0.121 0.510
T5 [218] Title - 0.866 0.474 0.465 - - - - - - -
GENRE [18] Title 0.847 0.928 0.877 0.706 0.797 0.948 0.643 0.518 0.711 0.135 0.563
SEAL* Sub-string 0.826 0.866 0.809 0.651 0.704 0.919 0.658 0.565 0.715 0.124 0.527
CorpusBrain [28] Title 0.821 0.908 0.723 0.662 0.776 0.983 0.591 0.501 0.688 0.129 0.538

(4) Learnable DocID Methods. Learnable DocID models, such as GenRet and GLEN, exhibit mixed
performance. While GenRet shows competitive Recall@1 on NQ (0.681), it does not report results
on MS MARCO. GLEN reports a competitive MRR@100 on NQ (0.754) but lags behind in other
metrics. This indicates that learnable DocID approaches may benefit from further refinement to
consistently outperform other representation methods across different datasets.
(5) Other DocID Methods. Other methods like GENRE, DSI, NCI, SEAL, and Ultron, generally un-
derperform compared to term set and multi-view DocID methods. For example, on the MS MARCO
dataset, GENRE achieves a Recall@1 of 0.266 and an MRR@10 of 0.361, which are significantly
lower than TSGen (Recall@1 = 0.384, MRR@10 = 0.502) and LTRGR (Recall@1 = 0.327, MRR@10 =
0.463). The lower performance of methods utilizing simpler DocID designs (e.g. titles, semantic IDs)
highlights the need for more sophisticated or alternative DocID strategies to effectively capture
key information for high-quality retrieval across different scenarios.
Results on KILT Benchmark. The KILT benchmark provides a comprehensive evaluation
across various knowledge-intensive tasks, utilizing a large-scale Wikipedia corpus comprising 5.9
million documents. Overall results are shown in Table 4.
(1) Overall Performance Comparison. GR methods generally outperform traditional sparse and
dense retrieval approaches in most tasks. Notably, GENRE achieves the highest scores in several
categories, including FEVER (0.847), AY2 (0.928), WnWi (0.877), and WnCw (0.706), outperforming
the best sparse method BM25 and dense methods like MT-DPR.
(2) Title DocID Methods. Models utilizing title-based document identifiers consistently perform
well on the KILT benchmark. For instance, GENRE and BART achieve FEVER scores of 0.847 and
0.819, respectively. This superior performance can be attributed to the fact that Wikipedia document
titles accurately represent the key entities within each document, making the task of predicting titles
relatively straightforward. Moreover, these models effectively leverage the pre-trained knowledge
embedded within language models, enhancing their ability to generalize and retrieve relevant
documents based on titles.
(3) Sub-string DocID Methods. Methods based on sub-string document identifiers also demon-
strate strong performance on the KILT benchmark, particularly in question answering (QA) tasks.
SEAL achieves the highest scores on several open-domain QA tasks, including NQ (0.658), HoPo (0.565), and TQA (0.715), and also performs strongly on the WoW dialogue task (0.527). The ability of sub-string DocID methods to capture meaningful frag-
ments of the documents likely contributes to their high accuracy in retrieving precise information
necessary for answering questions effectively.
(4) DSI-based Numeric DocID Methods. In contrast, methods employing numeric Semantic DocIDs
based on hierarchical k-means clustering [281] exhibit significantly diminished performance on the
KILT benchmark. The BART model, which uses Semantic IDs and trained with just labeled queries,
records scores close to zero across all tasks (e.g., FEVER: 0.003, AY2: 0.001). This decline is primarily
due to the substantial increase in corpus size, and <query, document> pairs in training data cover
only a small fraction of the entire document set. Consequently, these models struggle to generalize
beyond the training pairs, just "memorizing" DocIDs without capturing the broader diversity of
the corpus. This observation aligns with findings from [220], which reported similar challenges of
DSI [281] when scaling to an 8.8 million passage corpus in the MS MARCO benchmark.

5.2 Evaluation for Response Generation


5.2.1 Metrics. Evaluating the quality of generated responses includes aspects such as accuracy,
fluency, relevance, etc. In this section, we’ll introduce the main metrics for evaluating reliable
response generation, categorized into rule-based, model-based, and human evaluation metrics.
(1) Rule-based Metrics. Exact Match (EM) is a straightforward evaluation method requiring
the model’s output to be completely identical to the reference answer at the word level. This full
character-level matching is stringent, often used in tasks requiring precise and concise answers,
such as question answering systems, e.g., NQ [126], TriviaQA [115], SQuAD [232], etc. It simply
calculates the ratio of perfectly matched instances to the total number of instances.
For the generation of longer text sequences, BLEU [213] is a common metric initially used to
evaluate the quality of machine translation. It compares the similarity between the model’s output
and a set of reference texts by calculating the overlap of n-grams, thereby deriving a score. This
method assumes that high-quality generation should have a high lexical overlap with the labeled
answer. Optimized from BLEU, METEOR [9] is an alignment-based metric that considers not only
exact word matches but also synonyms and stem matches. Additionally, METEOR introduces
considerations for word order and syntactic structure to better assess the fluency and consistency
of the generated text.
ROUGE [162] is also a commonly used metric for evaluating longer texts; it measures the extent
of overlap in words, sentences, n-grams, and so forth, between the generated text and a collection
of reference texts. It focuses on recall, meaning it evaluates how much of the information in the
reference text is covered by the generated text. ROUGE comes in various forms, including ROUGE-N,
which evaluates based on n-gram overlap, and ROUGE-L, which considers the longest common
subsequence, catering to diverse evaluation requirements.
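As a rough illustration of this rule-based family, the sketch below implements Exact Match and a simple n-gram precision in plain Python; in practice, BLEU and ROUGE are usually computed with standard toolkits (e.g., sacrebleu or rouge-score), and the example strings here are assumptions used only for demonstration.

```python
# Minimal, illustrative rule-based metrics: Exact Match and n-gram precision.
from collections import Counter

def exact_match(prediction: str, reference: str) -> int:
    normalize = lambda s: " ".join(s.lower().strip().split())
    return int(normalize(prediction) == normalize(reference))

def ngram_precision(prediction: str, reference: str, n: int = 1) -> float:
    grams = lambda s: Counter(tuple(s.split()[i:i + n]) for i in range(len(s.split()) - n + 1))
    pred, ref = grams(prediction), grams(reference)
    overlap = sum((pred & ref).values())          # clipped n-gram matches
    return overlap / max(sum(pred.values()), 1)

print(exact_match("Charles Darwin", "charles darwin"))                # 1
print(round(ngram_precision("the cat sat", "the cat ran", n=1), 2))   # 0.67
```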
Perplexity (PPL) is a metric for evaluating the performance of language models, defined as the
exponentiation of the average negative log-likelihood, reflecting the model’s average predictive
ability for a given corpus of text sequences. The lower the perplexity, the stronger the model’s
predictive ability. Specifically, given a sequence of words $W = w_1, w_2, \ldots, w_N$, where $N$ is the total number of words in the sequence, PPL can be expressed as:
\[
  \mathrm{PPL}(W) = \exp\left\{ -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_{<i}) \right\}, \tag{11}
\]
where $p(w_i \mid w_{<i})$ represents the pre-trained language model's probability of predicting the $i$-th word $w_i$ given the previous words $w_{<i}$.
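As a worked illustration of Eq. (11), the sketch below computes perplexity with an autoregressive language model via HuggingFace Transformers; the model name and input text are placeholders, and passing the input ids as labels makes the returned loss the mean per-token negative log-likelihood.

```python
# Sketch: perplexity of a text under a causal LM (Eq. 11); model and text are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Generative retrieval predicts document identifiers token by token."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # With labels equal to the input ids, the loss is the average negative log-likelihood per token
    out = model(**enc, labels=enc.input_ids)
ppl = torch.exp(out.loss).item()
print(f"PPL = {ppl:.2f}")
```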


(2) Model-based Metrics. With the rise of pre-trained language models, a series of model-based
evaluation metrics have emerged. These metrics utilize neural models to capture the deep semantic
relationships between texts.
Unlike traditional rule-based metrics, BERTScore [355] utilizes the contextual embeddings of
BERT [121] to capture the deep semantics of words, evaluating the similarity between candidate
and reference sentences through the cosine similarity of embeddings. BERTScore employs a greedy
matching strategy to optimize word-level matching and uses optional inverse document frequency
weighting to emphasize important words, ultimately providing a comprehensive evaluation through
a combination of recall, precision, and F1 score. BERTScore captures not only surface lexical overlap
but also a deeper understanding of the semantic content of sentences.
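A minimal usage sketch with the open-source bert-score package is shown below; the candidate and reference sentences are illustrative, and the (P, R, F1) return values follow that package's public interface.

```python
# Sketch: sentence-level BERTScore with the bert-score package (illustrative inputs).
from bert_score import score

candidates = ["The capital of France is Paris."]
references = ["Paris is the capital city of France."]
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1 = {F1.mean().item():.3f}")
```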
Similarly based on BERT [121], BLEURT [247] designed multiple pre-training tasks, enhancing
the model’s ability to recognize textual differences with millions of synthetic training pairs. These
pre-training tasks include automatic evaluation metrics (such as BLEU [213], ROUGE [162], and
BERTScore [355]), back-translation likelihood, textual entailment, etc. Each task provides different
signals to help the model learn how to evaluate the quality of text generation.
BARTScore [343], based on the pre-trained seq2seq generative model BART [138], treats the
evaluation of generated text as a text generation problem. Specifically, BARTScore determines
the quality of text based on the transition probability between the generated text and reference
text. BARTScore does not require additional parameters or labeled data and can flexibly evaluate
generated text from multiple perspectives (such as informativeness, fluency, factuality, etc.) and
further enhance evaluation performance through text prompts or fine-tuning for specific tasks.
FActScore [198] focuses on the factual accuracy of each independent information point in long
texts. It calculates a score representing factual accuracy by decomposing the text into atomic
facts and verifying whether these facts are supported by reliable knowledge sources. This method
provides a more detailed evaluation than traditional binary judgments and can be implemented
efficiently and accurately through human evaluation and automated models (combining retrieval
and powerful language models).
GPTScore [66] is a flexible, multi-faceted evaluation tool that allows users to evaluate text using
natural language instructions without the need for complex training processes or costly annotations.
GPTScore constructs an evaluation protocol dynamically through task specification and aspect
definition and utilizes the zero-shot capability of pre-trained language models to evaluate text
quality, optionally using demonstration samples to improve evaluation accuracy.
(3) Human Evaluation Metrics. Human evaluation is an important method for assessing the
performance of language models, especially in complex tasks where automated evaluation tools
struggle to provide accurate assessments. Compared to rule-based and model-based metrics, human
evaluation is more accurate and reliable in real-world applications. This evaluation method requires
human evaluators (such as experts, researchers, or everyday users) to provide comprehensive
assessments of the model-generated content based on their intuition and knowledge.
Human evaluation measures the quality of language model outputs by integrating multiple
assessment criteria, following [23]: Accuracy [255] primarily evaluates the correctness of informa-
tion and its correspondence with facts; Relevance [362] focuses on whether the model’s output is
pertinent to the specific context and user query; Fluency [289] examines whether the text is coher-
ent, natural, and facilitates smooth communication with users; Safety [103] scrutinizes whether
the content may lead to potential adverse consequences or harm. These indicators collectively
provide a comprehensive assessment of the model’s performance in real-world settings, ensuring
its effectiveness and applicability.
However, human evaluation also faces numerous challenges, primarily including high costs
and time consumption, difficulty in controlling evaluation quality, inconsistency in evaluation dimensions, issues of consistency due to evaluators' subjectivity, and the need for professional
evaluators for specific tasks. These problems limit the widespread application of human evaluation
and the comparability of results [22].

5.2.2 Benchmarks and Analysis. In this section, we explore various benchmarks for evaluating
the performance of language models in generating reliable responses. These benchmarks assess
language understanding, factual accuracy, reliability, and the ability to provide timely information.
(1) General Evaluation. To comprehensively assess the language models’ understanding capa-
bilities across a wide range of scenarios, MMLU [84] utilizes a multiple-choice format covering
57 different tasks, from basic mathematics to American history, computer science, and law. This
benchmark spans evaluations in humanities, social science, and science, technology, engineering,
and mathematics, providing a comprehensive and challenging test. It has been widely used in the
evaluation of Large Language Models (LLMs) in recent years [105, 285, 286].
Furthermore, BIG-bench [259] introduces a large-scale and diverse benchmark designed to
measure and understand the capabilities and limitations of LLMs across a broad range of tasks.
Including 204 tasks contributed by 450 authors from 132 institutions, it covers areas such as
linguistics, mathematics, and common sense reasoning. It focuses on tasks beyond the capabilities
of language models, exploring how model performance and societal biases evolve with scale and
complexity.
LLM-Eval [166] offers a unified multi-dimensional automatic evaluation method for open-domain
dialogue of LLMs, eliminating the need for manual annotation. The performance of LLM-Eval
across various datasets demonstrates its effectiveness, efficiency, and adaptability, improving over
existing evaluation methods. The research also analyzes the impact of different LLMs and decoding
strategies on the evaluation outcomes, underscoring the importance of selecting suitable LLMs and
decoding strategies.
For Chinese, C-Eval [92] aims to comprehensively evaluate LLMs’ advanced knowledge and
reasoning capabilities in the Chinese context. It is based on a multiple-choice format, covering
four difficulty levels and 52 different academic fields from secondary school to professional levels.
C-Eval also introduces C-Eval Hard, a subset containing highly challenging subjects to test the
models’ advanced reasoning capabilities. Through evaluating state-of-the-art English and Chinese
LLMs, C-Eval reveals areas where current models still fall short in handling complex tasks, guiding
the development and optimization of Chinese LLMs.
(2) Tool Evaluation. To assess the ability of language models to utilize tools, API-Bank [148]
provides a comprehensive evaluation framework containing 73 APIs and 314 tool usage dialogs,
along with a rich training dataset of 1,888 dialogs covering 1,000 domains to improve LLMs’
tool usage capabilities. Experiments show that different LLMs perform variably in tool usage,
highlighting their strengths and areas for improvement.
Later, ToolBench [226] developed a comprehensive framework including a dataset and evaluation
tools to facilitate and assess the ability of LLMs to use over 16,000 real-world APIs. It enhances
reasoning capabilities by automatically generating diverse instruction and API usage scenario
paths, introducing a decision tree based on depth-first search. ToolBench significantly enhances
LLMs’ performance in executing complex instructions and in their ability to generalize to unseen
APIs. ToolLLaMA, an LLM fine-tuned from LLaMA [285], exhibits remarkable zero-shot capabilities
and performance comparable to state-of-the-art LLMs like ChatGPT [209].
(3) Factuality Evaluation. TruthfulQA [164] measures the truthfulness of language models in
answering questions. This benchmark consists of 817 questions covering 38 categories, including
health, law, finance, and politics. This evaluation reveals that, even in optimal conditions, the
truthfulness of model responses only reaches 58%, in stark contrast to human performance at 94%. Moreover, they proposed an automated evaluation metric named GPT-judge, which classifies
the truthfulness of answers by fine-tuning the GPT-3 [16] model, achieving 90-96% accuracy in
predicting human evaluations.
HaluEval [144] is a benchmark for evaluating LLM hallucinations, constructed using a dataset containing 35K hallucinated samples, employing a combination of automated generation and manual annotation.
This provides effective tools and methods for assessing and enhancing large language models'
capabilities in identifying and reducing hallucinations. For Chinese scenarios, HalluQA [33] designs 450
meticulously selected adversarial questions to assess the hallucination phenomenon in Chinese LLMs,
covering multiple domains and reflecting Chinese culture and history, identifying two main types
of hallucinations: imitative falsehoods and factual errors.
To evaluate the ability of LLMs to generate answers with cited text, ALCE [71] builds an end-
to-end system for retrieving relevant text passages and generating answers with citations. ALCE
contains three datasets, covering different types of questions, and evaluates the generated text’s
quality from ’fluency’, ’correctness’, and ’citation quality’ dimensions, combining human evaluation
to verify the effectiveness of the evaluation metrics. The experimental results show that while
LLMs excel at generating fluent text, there is significant room for improvement in ensuring the
factual correctness of content and the quality of citations, especially on the ELI5 dataset, where the
best model lacked complete citation support half of the time.
(4) Real-Time Evaluation. RealTime QA [118] created a dynamic question-and-answer platform
that regularly releases questions and evaluates systems weekly to ask and answer questions about
the latest events or information. It challenges the static assumptions of traditional QA datasets and
aims for immediate, real-world application. Experiments based on LLMs like GPT-3 and T5 found that models
could effectively update their generated results based on newly retrieved documents. However,
when the retrieved documents failed to provide sufficient information, models tended to return
outdated answers.
Furthermore, FreshQA [291] evaluates large language models’ performance in challenges in-
volving time-sensitive and erroneous premise questions by creating a new benchmark containing
questions of this nature. Evaluating various open and closed-source LLMs revealed significant
limitations in handling questions involving rapidly changing knowledge and erroneous premises.
Based on these findings, the study proposed a simple in-context learning method, FreshPrompt,
significantly improving LLMs’ performance on FreshQA by integrating relevant and up-to-date
information sourced from search engines into the prompt.
(5) Safety, Ethic, and Trustworthiness. To comprehensively evaluate the safety of LLMs,
SafetyBench [358] implements an efficient and accurate evaluation of LLMs’ safety through 11,435
multiple-choice questions covering 7 safety categories in multiple languages (Chinese and English).
The diversity of question types and the broad data sources ensure rigorous testing of LLMs in
various safety-related scenarios. Comparing the performance of 25 popular LLMs, SafetyBench
revealed GPT-4’s significant advantage and pointed out the areas where current models need
improvements in safety to promote the rapid development of safer LLMs.
For ethics, TrustGPT [93] aims to assess LLMs’ ethical performance from toxicity, bias, and
value alignment, three key dimensions. The benchmark uses predefined prompt templates based on
social norms to guide LLMs in generating content and employs multiple metrics to quantitatively
assess the toxicity, bias, and value consistency of these contents. Experimental analysis revealed
that even the most advanced LLMs still have significant issues and potential risks in these ethical
considerations.
For trustworthiness, TrustLLM [263] explores principles and benchmarks including truthful-
ness, safety, fairness, robustness, privacy, and machine ethics across six dimensions. Extensive
experiments, including assessing 16 mainstream LLMs' performance on 30 datasets, found that trustworthiness usually positively correlates with functional effectiveness. While proprietary mod-
els typically outperform open-source models in trustworthiness, some open-source models like
Llama2 showed comparably high performance.
These benchmarks provide important tools and metrics for evaluating and improving the capa-
bilities of language models, contributing to the development of more accurate, reliable, safe, and
timely GenIR systems. For further understanding of the evaluation works, [23, 47, 90, 293] offer
more detailed introductions.

6 CHALLENGES AND PROSPECTS


This section discusses the key challenges faced in the fields of generative document retrieval and
reliable response generation, as well as potential directions for future research.

6.1 Challenges on Generative Document Retrieval


6.1.1 Scalability Issues. As extensively studied by [220], generative retrieval demonstrates signifi-
cantly lower retrieval accuracy compared to dense retrieval when handling million-level document
corpora in web search scenarios. Merely increasing the model size does not yield stable performance
improvements. However, GR outperforms dense retrieval in document collections smaller than
300K documents, posing a question: what impedes GR methods from scaling to larger corpora? This
issue encompasses several aspects:
Training Data. Current LLMs are pre-trained on huge datasets ranging from hundreds of billions
to several trillion tokens, covering vast knowledge sources such as the internet, books, and news
articles, consuming substantial computational power [359]. They are then extensively fine-tuned
with high-quality, human-annotated data to achieve substantial generalization capabilities [138,
210, 231, 285]. In contrast, generative retrieval (GR) models often begin with a pre-trained language
model and are fine-tuned on labeled data comprising <query, DocID> pairs, which does not
sufficiently prepare them to fully grasp GR tasks. For numeric-based DocIDs, the models, having
not encountered these numbers in their pre-training phase, tend to rote memorize the DocIDs seen
during training, struggling to predict unseen ones effectively. Similarly, if text-based DocIDs fail to
precisely represent the documents, the model also tends toward rote memorization.
A potential solution is to create a large-scale pre-training dataset for generative retrieval on a
general corpus, possibly including a variety of common DocIDs such as URLs, titles, and numerical
sequences. Instructions can be used to distinguish the generation targets for different DocID types.
Pre-training a GR model from scratch on such data would then allow it to understand generative retrieval
across diverse domains. This method could bridge the gap between language model pre-training data and
GR tasks, enhancing the generalization ability of GR models across different corpora.
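As a purely hypothetical illustration of such instruction-formatted pre-training data, the snippet below sketches how <query, DocID> pairs with different DocID types (URL, title, numeric semantic ID) could be unified under task instructions; the field names, instructions, and examples are assumptions rather than an existing dataset schema.

```python
# Hypothetical instruction-formatted instances for GR pre-training (schema is illustrative).
pretraining_instances = [
    {"instruction": "Generate the URL of the relevant web page.",
     "query": "symptoms of vitamin D deficiency",
     "docid": "https://example.org/health/vitamin-d-deficiency"},      # URL-style DocID
    {"instruction": "Generate the Wikipedia title of the relevant document.",
     "query": "who proposed the theory of general relativity",
     "docid": "Albert Einstein"},                                       # title-style DocID
    {"instruction": "Generate the semantic ID of the relevant document.",
     "query": "best practices for unit testing",
     "docid": "3-17-5-2"},                                              # numeric semantic DocID
]

# Each instance can be serialized into a seq2seq (input, target) pair for pre-training.
pairs = [(f"{ex['instruction']} Query: {ex['query']}", ex["docid"])
         for ex in pretraining_instances]
```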
Training Method. As described in Section 3.1.1, existing training methods explore various
training objectives, including seq2seq training, learning DocID, and ranking capabilities. Other
methods involve knowledge distillation [30], reinforcement learning [365], etc. Is there a better
training method to enable GR models to master generating DocID ranking lists? For example,
RLHF [36] has been effectively used to train LLMs [210, 286], though at a high cost. Exploring
RLHF in the GR field is also worthwhile.
Model Structure. As discussed in Section 3.1.2, most current GR models are based on encoder-
decoder Transformers structures [281, 307, 371], such as T5 [231] and BART [138]. Some GR methods
like CorpusLM [152] have experimented with a decoder-only structure of the LLM Llama2 [286],
requiring more training computational power but not significantly improving performance. Re-
search is needed to determine which structure is more suitable for generative retrieval. Additionally,
whether increasing model and data size could lead to emergent phenomena similar to those observed
in LLMs [244, 312] is also a promising research direction.


6.1.2 Handling Dynamic Corpora. Real-world applications often involve dynamically changing
corpora, such as the web and news archives, where incremental learning is essential. However,
for language models, indexing new documents inevitably leads to forgetting old ones, posing a
challenge for GR systems. Existing methods like DSI++ [192], IncDSI [124], CLEVER [25], and Cor-
pusBrain++ [80] propose solutions such as experience replay, constrained optimization, incremental
product quantization, and continual generative pre-training frameworks to address incremental
learning issues. Yet, these methods have their specific applicable scenarios, and more effective and
universally applicable incremental learning strategies remain a key area for exploration.
6.1.3 Document Identifier. Accurately representing a document with high-quality DocIDs is crucial
for generative retrieval.
For example, the KILT dataset based on the Wikipedia corpus, which includes 5.9 million
documents, demonstrates optimistic retrieval performance for GR methods using titles as Do-
cIDs [18, 28, 152]. This is because each document in Wikipedia has a unique manually annotated
title that represents the core entity discussed in that page. However, in the web search scenario,
such as in the MS MARCO dataset [205], many documents lack a unique title, are overlapping, and
the titles do not accurately represent the core content of the documents. Thus, GR performance
significantly declines in the MS MARCO corpus of 8.8 million passages.
Therefore, how to construct high-quality titles (or other types of DocIDs) in general corpora,
similar to those in Wikipedia, that not only accurately represent documents but also are lightweight,
is a critical factor for implementing GR methods and warrants in-depth research.
Text or Numeric? As discussed in Section 3.2, current methods include text-based and numeric-
based DocIDs, each with their advantages and disadvantages. Text-based DocIDs effectively leverage
the linguistic capabilities of pre-trained generative language models and offer better interpretability.
Numeric-based DocIDs can utilize dense retriever embeddings to obtain semantic DocID sequences;
they can also complement dense retrievers to achieve synergistic benefits.
However, to ensure good generalization ability of GR models without extensive pre-training, it is
essential to utilize the inherent pre-trained parameters of the model. Coherent textual DocIDs can
naturally leverage this aspect, but they also need to capture key document semantics and maintain
linguistic sequence characteristics. Numeric DocIDs, however, do not offer this advantage. Thus, as
mentioned in 6.1.1, extensive pre-training is necessary to enable models to fully understand the
meanings behind these numerical strings, which is a costly endeavor.
Do We Need a Unique ID for Each Document? Most current GR methods use a unique DocID
to uniquely identify a document. However, as the number of documents in a corpus increases, main-
taining a unique DocID becomes increasingly challenging. Even if a unique DocID is maintained, it
becomes difficult to distinguish it semantically from other DocIDs, leading to reduced retrieval
precision. Some methods, such as using sub-string as DocIDs [13, 26], have proven effective. These
methods utilize the FM-Index [62] to ensure the generated sub-string exists in the corpus and use
the number of generated sub-strings in different documents to rank documents, demonstrating
good performance and generalization ability.
However, since this method is based on FM-Index, its inference latency is high, which is an issue
that needs addressing. Furthermore, exploring other more efficient alternatives to FM-Index and
even considering not using constrained search but freely generating a DocID sequence followed by
a lightweight matching and scoring module to efficiently return a document ranking list are also
worthy of exploration.
6.1.4 Efficiency Concerns. Current GR methods generally rely on constrained beam search to
generate multiple DocID sequences during inference, resulting in high latency. This is particu-
larly severe when returning 100 or more documents, with latencies reaching several hundred milliseconds [307], which is unacceptable for low-latency IR systems. Therefore, designing more
efficient inference methods is crucial. To reduce inference latency, the length of the DocID sequence
should not be too long; 16 tokens or fewer is an efficient range. This necessitates designing DocIDs
that are precise and concise enough to represent documents while maintaining performance and
improving efficiency. Additionally, developing more efficient decoding strategies is a valuable
research direction for the future.
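To illustrate where this inference cost comes from, the sketch below shows one common way to implement DocID-constrained beam search with HuggingFace Transformers, restricting each decoding step to tokens that extend a valid DocID prefix stored in a trie. The model checkpoint, DocIDs, and query are placeholders, real systems would build and cache the trie offline, and this is a simplified sketch rather than the implementation of any specific surveyed method.

```python
# Sketch: trie-constrained beam search over DocIDs (placeholders throughout).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

docids = ["nq-001-23", "nq-001-47", "nq-002-05"]    # hypothetical identifiers
trie = {}
for d in docids:                                     # nested dict as a token-level prefix trie
    node = trie
    for tok in tokenizer(d).input_ids:
        node = node.setdefault(tok, {})

def allowed_tokens(batch_id, prefix_ids):
    # Walk the trie along the DocID tokens generated so far (skip the decoder start token)
    node = trie
    for tok in prefix_ids.tolist()[1:]:
        node = node.get(tok, {})
    return list(node.keys()) or [tokenizer.eos_token_id]

query = tokenizer("who wrote the origin of species", return_tensors="pt")
outputs = model.generate(
    **query,
    max_new_tokens=16,
    num_beams=10,
    num_return_sequences=10,
    prefix_allowed_tokens_fn=allowed_tokens,         # beams can only follow valid DocID prefixes
)
ranked_docids = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Because every beam expansion must consult the constraint at each decoding step, returning long ranked lists with wide beams directly inflates latency, which motivates the shorter DocIDs and lighter decoding strategies discussed above.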
6.1.5 Multi-modal Generative Retrieval. Existing multi-modal generative retrieval models aim to
retrieve images by converting each image in the collection into a unique sequence that serves as its
identifier. A language model is then employed to predict these image identifiers, enabling effective
image retrieval. However, there are still potential areas for future optimization: (1) Image Repre-
sentation: Developing advanced image representation techniques is essential for enhancing the
performance of multi-modal generative retrieval. These techniques should capture the key features
of an image within its identifier sequence. (2) End-to-end Training: Existing methods perform
image representation and image identifier prediction separately for generative retrieval. Exploring
how to train these two tasks in a fully end-to-end manner is also worth investigating. (3) Extend to
Additional Modalities: Current multi-modal generative retrieval methods predominantly focus
on text and image modalities. Expanding these approaches to incorporate additional modalities
such as audio and video presents a valuable research opportunity.

6.2 Challenges on Reliable Response Generation


6.2.1 Improving Accuracy and Factuality. In GenIR systems, ensuring content accuracy and factual-
ity is crucial. To achieve this, as mentioned in Section 4, there are two main areas of improvement:
Internal Knowledge Memorization. Firstly, training stronger generative models is critical
for building reliable GenIR systems. Various commercial LLMs continue to progress, utilizing vast
training data and computational resources, but exploring better model structures is also worthwhile.
Recent research such as Retentive Networks [266], Mamba [78], and others, have shown potential to
challenge the performance and efficiency of Transformers [290]. However, whether these can scale
and truly surpass Transformer-based LLMs in generation quality is still an open question. Moreover,
what types of training data and methods can consistently produce models capable of generating
high-quality, reliable text also deserve thorough investigation and summary. The mechanisms by
which language models recall knowledge during inference are not yet clear and need to be fully
understood to better serve user information needs.
External Knowledge Enhancement. As described in Section 4.2.1, retrieval-augmented genera-
tion is an effective method widely applied in LLMs. However, there is still room for improvement. (1)
For example, whether inserting retrieved documents directly into generative models via prompts is
the best method, or if there are better ways, such as inputting embeddings [336], needs exploration.
(2) Additionally, whether models can autonomously decide whether to perform retrieval [272, 296],
and when in the generation process to perform it [111]. (3) Third, in dialogue scenarios, enhancing
RAG models to better utilize long conversational history is also worth further exploration [200].
Tool-augmented generation, as discussed in Section 4.2.2, is also a popular method for endowing
LLMs with fine-grained world knowledge and performing complex tasks. Recent research has raised
questions, such as "Should tools always be used?" [310]. More specifically, whether the performance
improvements brought by using tools justify the extra computational costs incurred during model
training or the inference costs during testing. Existing work mainly focuses on task accuracy, but
studying the cost-effectiveness of these methods is also a valuable topic.
6.2.2 Real-time Properties of GenIR Systems. Timeliness is critical for GenIR systems, as well as
traditional IR systems, to provide users with the most up-to-date information. However, since the knowledge of pre-trained generative models is fixed after training, methods like retrieval and tool
augmentation are needed to acquire new external knowledge. Research on real-time knowledge
acquisition remains limited, making it a valuable area for investigation.
Moreover, continually relying on outdated knowledge from language models is inadequate, as
models cannot comprehend the significance of given contexts or backgrounds in the current era,
thus reducing the reliability of the generated content. Therefore, updating the information in
language models while avoiding the forgetting of existing knowledge, such as through continual
learning [301, 320], knowledge editing [191, 212, 303, 334], etc., is a topic worth further exploring.
6.2.3 Bias and Fairness. Since LLMs are often trained on large, unfiltered datasets, GenIR sys-
tems may propagate stereotypes and biases present in the data regarding race, culture, and other
aspects [68]. Researchers have explored various methods to enhance the fairness of generated
content during training data selection, training methods, generation techniques, and rewriting
phases. However, biases have not been eradicated; a thorough understanding of the mechanisms
by which generative models produce biases is still needed in order to design mitigation methods and
build fair GenIR systems that further the practical application of GenIR.
6.2.4 Privacy and Security. Firstly, the content generated by GenIR systems risks plagiarism [50,
120]. Studies such as [21, 89] indicate that pre-trained language models can reproduce large segments
of their training data, leading to inadvertent plagiarism and causing academic dishonesty or
copyright issues. On one hand, legal regulations regarding the copyright of AI-generated content
will gradually emerge and evolve. On the other hand, technical research aimed at reducing plagiarism
by generative models, such as generating text with correct citations [88, 171, 195], is a promising
research direction for reliable GenIR that has received increasing attention in recent years.
Moreover, due to the unclear mechanisms of memory and generation in pre-trained language
models, GenIR systems inevitably return unsafe content. For example, [20, 21, 366] show that
when attacked, LLMs may return private information of users seen in training data. Therefore,
understanding the mechanisms by which LLMs recall training data and designing effective defense
mechanisms to enhance security are crucial for the widespread use of GenIR systems. Additionally,
developing effective detection methods for content generated by LLMs is essential for enhancing
the security of GenIR systems [331].

6.3 Unified Framework


This article discusses two mainstream forms of GenIR: generative document retrieval and reliable
response generation. However, each approach has its advantages and limitations. Generative
document retrieval still returns a list of documents, whereas the reliable response generation model
itself cannot effectively capture document-level relationships. Therefore, integrating these two
approaches is a promising research direction.
6.3.1 Unified Framework for Retrieval and Generation. Given that both generative retrieval and
downstream generation tasks can be based on generative language models, could a single model
perform both retrieval and generation tasks? Indeed, it could.
Current attempts, such as UniGen [155], use a shared encoder and two decoders for GR and
QA tasks respectively, and show superior performance on small-scale retrieval and QA datasets.
However, they struggle to generalize across multiple downstream tasks and to integrate with
powerful LLMs. Additionally, CorpusLM [152] uses a multi-task training approach to obtain a
universal model for GR, QA, and RAG. Yet, merely merging training data does not significantly
improve retrieval and generation performance, and CorpusLM remains limited to the Wikipedia
corpus. Facing a broader internet corpus presents significant challenges.


In the future, can we construct a large search model (LSM) that allows an LLM to have the
capability to generate DocIDs and reliable responses autonomously? Such an LSM could even decide when
to generate DocIDs to access the required knowledge before continuing generation. Unlike the
large search model defined in [300], which unifies models beyond the first-stage retrieval (such
as re-ranking, snippet, and answer models), we aim to integrate the first-stage retrieval as well,
enabling the LSM to fully understand the meaning of retrieval and its connection with various
downstream generation tasks.

6.3.2 Towards End-to-End Framework for Various IR Tasks. Metzler et al. [195] envisioned an
expert-level corpus model that not only possesses linguistic capabilities but also understands
document-level DocIDs and knows the sources of its own knowledge. Such a model could not only
solve the issue of hallucinations common in traditional language models but could also generate
texts with references pointing to the source documents, thus achieving a reliable end-to-end GenIR
model. By understanding DocIDs and knowledge sources, this end-to-end system could also perform
additional IR tasks, such as returning the main content of a document given its DocID or returning
other related document DocIDs, as well as enabling multi-lingual and multi-modal retrieval.
Current methods, as discussed in this GenIR survey, primarily focus on generative document
retrieval (GR) and response generation as separate entities. GR models excel at comprehending
document identifiers at the document-level, while downstream models demonstrate powerful task
generation capabilities. However, existing methods face challenges when it comes to effectively
integrating these two generative abilities, limiting the overall performance and effectiveness of
the GenIR system. The integration of these generative abilities in a seamless and efficient manner
remains a key challenge in the field.
In the future, we can design training methods that align knowledge and DocIDs and construct
high-quality training datasets for generating answers with references, to train such an end-to-end
GenIR model. Achieving this goal remains challenging and requires the collaborative efforts of
researchers to contribute to building the next generation of GenIR systems.

7 CONCLUSION
In this survey, we explore the latest research developments, evaluations, current challenges, and
future directions in generative information retrieval (GenIR). We discuss two main directions in
the GenIR field: generative document retrieval (GR) and reliable response generation. Specifically,
we systematically review the progress of GR covering model training, document identifier design,
incremental learning, adaptability to downstream tasks, multi-modal GR, and generative recom-
mendation systems; as well as advancements in reliable response generation in terms of internal
knowledge memorization, external knowledge enhancement, generating responses with citations,
and personal information assistance. Additionally, we have sorted out the existing evaluation
methods and benchmarks for GR and response generation. We organize the current limitations
and future directions of GR systems, addressing scalability, handling dynamic corpora, document
representation, and efficiency challenges. Furthermore, we identify challenges in reliable response
generation, such as accuracy, real-time capabilities, bias and fairness, privacy, and security. We
propose potential solutions and future research directions to tackle these challenges. Finally, we also
envision a unified framework, including unified retrieval and generation tasks, and even building
an end-to-end framework capable of handling various information retrieval tasks. Through this
review, we hope to provide a comprehensive reference for researchers in the GenIR field to further
promote the development of this area.


A DETAILS FOR EVALUATION


A.1 Evaluation Metrics for Generative Document Retrieval
Recall. Recall is a metric that measures the proportion of relevant documents retrieved by the
search system. For a given cutoff point $k$, the recall Recall@$k$ is defined as:
\[
  \text{Recall@}k = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \frac{ret_{q,k}}{rel_q}, \tag{12}
\]
where $|Q|$ is the number of queries in the set, $ret_{q,k}$ is the number of relevant documents retrieved for the $q$-th query within the top $k$ results, and $rel_q$ is the total number of relevant documents for the $q$-th query.
R-Precision. R-Precision measures the precision at the rank position 𝑅, which corresponds to
the number of relevant documents for a given query $q$. It is calculated as:
\[
  \text{R-Precision} = \frac{ret_{q,R}}{rel_q}, \tag{13}
\]
where $ret_{q,R}$ is the number of relevant documents retrieved within the top $R$ positions, and $R$ is equivalent to $rel_q$.
Mean Reciprocal Rank (MRR). MRR reflects the average rank position of the first relevant
document returned in the search results. It is computed as follows:
\[
  \text{MRR} = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \frac{1}{\text{rank}_q}, \tag{14}
\]
where $\text{rank}_q$ is the rank of the first relevant document returned for the $q$-th query.
Mean Average Precision (MAP). MAP calculates the average precision across multiple queries.
It considers the exact position of all relevant documents and is calculated using the following
formula:
\[
  \text{MAP} = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \left( \frac{1}{rel_q} \sum_{k=1}^{n_q} \text{P@}k \times I(q, k) \right), \tag{15}
\]
where $\text{P@}k$ is the precision at cutoff $k$, and $I(q, k)$ is an indicator function that is 1 if the document at position $k$ is relevant to the $q$-th query and 0 otherwise.
Normalized Discounted Cumulative Gain (nDCG). nDCG takes into account not only the
relevance of the documents returned but also their positions in the result list. It is defined by:
\[
  \text{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2 (i + 1)}, \tag{16}
\]
\[
  \text{nDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}, \tag{17}
\]
where $rel_i$ represents the graded relevance of the $i$-th document, DCG@$k$ is the discounted cumulative gain, and IDCG@$k$ represents the maximum possible DCG@$k$.
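As a small worked example of Eqs. (16) and (17), the following sketch computes DCG@k and nDCG@k from a list of graded relevance labels given in ranked order; the gain values are illustrative.

```python
# Sketch of DCG@k and nDCG@k (Eqs. 16-17) with illustrative graded relevance labels.
import math

def dcg_at_k(gains, k):
    return sum((2 ** g - 1) / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

gains = [3, 2, 0, 1]                  # relevance of the returned documents, in ranked order
print(round(ndcg_at_k(gains, 4), 3))  # 0.993
```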

A.2 Benchmarks for Generative Document Retrieval


MS MARCO. MS MARCO (Microsoft Machine Reading Comprehension) is a large-scale dataset
developed by Microsoft for evaluating machine reading comprehension, retrieval, and question-
answering capabilities within web search contexts. It comprises two primary benchmarks:


• Document Ranking: This benchmark provides a corpus of approximately 3.2 million documents
together with real user queries extracted from Microsoft Bing’s search logs. Each query is paired
with annotated relevant documents, facilitating the evaluation of retrieval accuracy and scalability.
• Passage Ranking: Containing around 8.8 million passages, this benchmark focuses on more
granular retrieval tasks, assessing the system’s ability to identify relevant information at the
passage level.
The diversity of question types and document genres in MS MARCO aims to mimic complex
web search scenarios, making it a pivotal resource for testing the robustness and effectiveness of
GR systems.
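As one way to work with the benchmark in practice, the sketch below iterates over MS MARCO passage queries and relevance judgments with the open-source ir_datasets package; the package choice and the dataset identifier are assumptions about the experimental setup, not part of MS MARCO itself:

```python
import ir_datasets  # assumed third-party package: pip install ir_datasets

# Load the MS MARCO passage dev (small) split; the identifier follows ir_datasets conventions.
dataset = ir_datasets.load("msmarco-passage/dev/small")

for query in dataset.queries_iter():
    print(query.query_id, query.text)  # real user queries from Bing logs
    break

for qrel in dataset.qrels_iter():
    print(qrel.query_id, qrel.doc_id, qrel.relevance)  # relevance judgments
    break
```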
Natural Questions (NQ). Natural Questions (NQ) is a question-answering dataset introduced
by Google, utilizing Wikipedia as its foundational corpus. It encompasses approximately 3.2 million
documents, each corresponding to a Wikipedia page. The dataset includes a wide array of natural
user queries along with their respective answers extracted directly from web pages in Google
search results. NQ is designed to evaluate the retrieval performance of GR systems in addressing
real-world, information-seeking questions, emphasizing the ability to understand and retrieve
precise answers from a vast knowledge base.
KILT. KILT (Knowledge Intensive Language Tasks) is an extensive benchmark dataset that
integrates five categories of knowledge-intensive tasks, including:
• Fact Checking: Utilizing datasets like FEVER, KILT assesses the system’s ability to verify
factual claims against a knowledge base.
• Entity Linking: Incorporates datasets such as AIDA CoNLL-YAGO, WNED-WIKI, and
WNED-CWEB to evaluate the linking of entities mentioned in text to their corresponding
entries in a knowledge base.
• Slot Filling: Includes T-REx and Zero Shot RE datasets to test the system’s ability to populate
predefined slots with relevant information extracted from the text.
• Open-Domain QA: Combines datasets like Natural Questions, HotpotQA, TriviaQA, and
ELI5 to evaluate the retrieval and comprehension capabilities of the system in answering
open-ended questions.
• Dialogue: Utilizes the Wizard of Wikipedia dataset to assess the system’s performance in
maintaining informative and coherent dialogues based on retrieved knowledge.
KILT employs Wikipedia as its primary corpus, consisting of approximately 5.9 million wiki pages.
The benchmark aims to evaluate the effectiveness of information retrieval systems in handling
complex language tasks that require extensive background knowledge and the ability to integrate
information across multiple domains.
TREC Deep Learning Track 2019 & 2020. The TREC Deep Learning Tracks for 2019 and 2020
are specialized evaluation campaigns focusing on the application of deep learning techniques to
enhance the efficiency and effectiveness of information retrieval systems. The primary tasks in
these tracks include:
• Document Ranking: Assessing the ability of retrieval systems to rank entire documents
based on their relevance to a given query.
• Passage Ranking: Evaluating the system’s capability to identify and rank specific passages
within documents that are most relevant to the query.
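Runs on these tracks are typically scored against the released relevance judgments with trec_eval-style tooling. A minimal sketch using the pytrec_eval package is given below; the package choice is an assumption, and the qrels and run dictionaries are toy placeholders:

```python
import pytrec_eval  # assumed third-party package: pip install pytrec_eval

# qrels: query id -> {doc id: graded relevance}; run: query id -> {doc id: retrieval score}.
qrels = {"q1": {"d1": 2, "d2": 1}}
run = {"q1": {"d1": 12.3, "d3": 9.8, "d2": 7.5}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "recip_rank", "ndcg"})
print(evaluator.evaluate(run))  # per-query MAP, reciprocal rank, and nDCG scores
```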

REFERENCES
[1] Mahyar Abbasian, Iman Azimi, Amir M. Rahmani, and Ramesh C. Jain. 2023. Conversational Health Agents: A
Personalized LLM-Powered Agent Framework. CoRR abs/2310.02374 (2023). https://doi.org/10.48550/ARXIV.2310.
02374 arXiv:2310.02374


[2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida,
Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774
(2023).
[3] Vaibhav Adlakha, Shehzaad Dhuliawala, Kaheer Suleman, Harm de Vries, and Siva Reddy. 2022. TopiOCQA: Open-
domain Conversational Question Answering with Topic Switching. Trans. Assoc. Comput. Linguistics 10 (2022),
468–483. https://doi.org/10.1162/TACL_A_00471
[4] Amazon. 2023. Amazon. https://www.amazon.com (2023).
[5] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-RAG: Learning to Retrieve,
Generate, and Critique through Self-Reflection. CoRR abs/2310.11511 (2023). https://doi.org/10.48550/ARXIV.2310.
11511 arXiv:2310.11511
[6] Arian Askari, Chuan Meng, Mohammad Aliannejadi, Zhaochun Ren, Evangelos Kanoulas, and Suzan Verberne. 2024.
Generative Retrieval with Few-shot Indexing. arXiv preprint arXiv:2408.02152 (2024).
[7] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan
Bitton, Samir Yitzhak Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell
Wortsman, and Ludwig Schmidt. 2023. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive
Vision-Language Models. CoRR abs/2308.01390 (2023). https://doi.org/10.48550/ARXIV.2308.01390 arXiv:2308.01390
[8] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang,
Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin
Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang,
Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen
Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou,
Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023. Qwen Technical Report. CoRR abs/2309.16609 (2023).
https://doi.org/10.48550/ARXIV.2309.16609 arXiv:2309.16609
[9] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures
for Machine Translation and/or Summarization@ACL 2005, Ann Arbor, Michigan, USA, June 29, 2005. Association for
Computational Linguistics, 65–72. https://aclanthology.org/W05-0909/
[10] Garbiel Bénédict, Ruqing Zhang, and Donald Metzler. 2023. Gen-ir@ sigir 2023: The first workshop on generative
information retrieval. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in
Information Retrieval. 3460–3463.
[11] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda,
Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. 2024. Graph of thoughts: Solving elaborate problems
with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 17682–17690.
[12] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda,
Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. 2024. Graph of thoughts: Solving elaborate problems
with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 17682–17690.
[13] Michele Bevilacqua, Giuseppe Ottaviano, Patrick S. H. Lewis, Scott Yih, Sebastian Riedel, and Fabio Petroni.
2022. Autoregressive Search Engines: Generating Substrings as Document Identifiers. In Advances in Neural
Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS
2022, New Orleans, LA, USA, November 28 - December 9, 2022. http://papers.nips.cc/paper_files/paper/2022/hash/
cd88d62a2063fdaf7ce6f9068fb15dcd-Abstract-Conference.html
[14] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den
Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman
Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini,
Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and Laurent Sifre. 2022.
Improving Language Models by Retrieving from Trillions of Tokens. In International Conference on Machine Learning,
ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA (Proceedings of Machine Learning Research, Vol. 162). PMLR,
2206–2240. https://proceedings.mlr.press/v162/borgeaud22a.html
[15] A.Z. Broder. 1997. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of
SEQUENCES 1997 (Cat. No.97TB100171). 21–29. https://doi.org/10.1109/SEQUEN.1997.666900
[16] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural
information processing systems 33 (2020), 1877–1901.
[17] Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021. Editing Factual Knowledge in Language Models. In Proceedings
of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta
Cana, Dominican Republic, 7-11 November, 2021. Association for Computational Linguistics, 6491–6506. https:
//doi.org/10.18653/V1/2021.EMNLP-MAIN.522


[18] Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. 2021. Autoregressive Entity Retrieval. In 9th
International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
https://openreview.net/forum?id=5k8F6UU39V
[19] Yihan Cao, Siyu Li, Yixin Liu, Zhiling Yan, Yutong Dai, Philip S. Yu, and Lichao Sun. 2023. A Comprehensive Survey
of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT. CoRR abs/2303.04226 (2023).
https://doi.org/10.48550/ARXIV.2303.04226 arXiv:2303.04226
[20] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. 2019. The Secret Sharer: Evaluating
and Testing Unintended Memorization in Neural Networks. In 28th USENIX Security Symposium, USENIX Security
2019, Santa Clara, CA, USA, August 14-16, 2019. USENIX Association, 267–284. https://www.usenix.org/conference/
usenixsecurity19/presentation/carlini
[21] Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts,
Tom B. Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. 2021. Extracting Training Data from
Large Language Models. In 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021. USENIX
Association, 2633–2650. https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting
[22] Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. 2020. Evaluation of Text Generation: A Survey. CoRR
abs/2006.14799 (2020). arXiv:2006.14799 https://arxiv.org/abs/2006.14799
[23] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang,
Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. 2023. A Survey on Evaluation of
Large Language Models. CoRR abs/2307.03109 (2023). https://doi.org/10.48550/ARXIV.2307.03109 arXiv:2307.03109
[24] Anthony Chen, Panupong Pasupat, Sameer Singh, Hongrae Lee, and Kelvin Guu. 2023. PURR: Efficiently Editing
Language Model Hallucinations by Denoising Language Model Corruptions. CoRR abs/2305.14908 (2023). https:
//doi.org/10.48550/ARXIV.2305.14908 arXiv:2305.14908
[25] Jiangui Chen, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Wei Chen, Yixing Fan, and Xueqi Cheng. 2023. Continual
Learning for Generative Retrieval over Dynamic Corpora. In Proceedings of the 32nd ACM International Conference on
Information and Knowledge Management. 306–315.
[26] Jiangui Chen, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yiqun Liu, Yixing Fan, and Xueqi Cheng. 2023. A
Unified Generative Retriever for Knowledge-Intensive Language Tasks via Prompt Learning. In Proceedings of the
46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, Taipei,
Taiwan, July 23-27, 2023. ACM, 1448–1457. https://doi.org/10.1145/3539618.3591631
[27] Jiangui Chen, Ruqing Zhang, Jiafeng Guo, Yixing Fan, and Xueqi Cheng. 2022. GERE: Generative Evidence Retrieval
for Fact Verification. arXiv preprint arXiv:2204.05511 (2022).
[28] Jiangui Chen, Ruqing Zhang, Jiafeng Guo, Yiqun Liu, Yixing Fan, and Xueqi Cheng. 2022. CorpusBrain: Pre-train a
Generative Retrieval Model for Knowledge-Intensive Language Tasks. In Proceedings of the 31st ACM International
Conference on Information & Knowledge Management, Atlanta, GA, USA, October 17-21, 2022. ACM, 191–200. https:
//doi.org/10.1145/3511808.3557271
[29] Xiaoyang Chen, Yanjiang Liu, Ben He, Le Sun, and Yingfei Sun. 2023. Understanding Differential Search Index for
Text Retrieval. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023.
Association for Computational Linguistics, 10701–10717. https://doi.org/10.18653/V1/2023.FINDINGS-ACL.681
[30] Xiaoyang Chen, Yanjiang Liu, Ben He, Le Sun, and Yingfei Sun. 2023. Understanding Differential Search Index for
Text Retrieval. (2023), 10701–10717. https://doi.org/10.18653/V1/2023.FINDINGS-ACL.681
[31] Zijie Chen, Lichao Zhang, Fangsheng Weng, Lili Pan, and Zhenzhong Lan. 2023. Tailored Visions: Enhancing Text-to-
Image Generation with Personalized Prompt Rewriting. CoRR abs/2310.08129 (2023). https://doi.org/10.48550/ARXIV.
2310.08129 arXiv:2310.08129
[32] Pengyu Cheng, Jiawen Xie, Ke Bai, Yong Dai, and Nan Du. 2023. Everyone Deserves A Reward: Learning Customized
Human Preferences. CoRR abs/2309.03126 (2023). https://doi.org/10.48550/ARXIV.2309.03126 arXiv:2309.03126
[33] Qinyuan Cheng, Tianxiang Sun, Wenwei Zhang, Siyin Wang, Xiangyang Liu, Mozhi Zhang, Junliang He, Mianqiu
Huang, Zhangyue Yin, Kai Chen, and Xipeng Qiu. 2023. Evaluating Hallucinations in Chinese Large Language Models.
CoRR abs/2310.03368 (2023). https://doi.org/10.48550/ARXIV.2310.03368 arXiv:2310.03368
[34] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham,
Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua
Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben
Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke,
Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson,
Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan
Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai,
Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou,
Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas


Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2023. PaLM: Scaling Language Modeling with Pathways. J. Mach. Learn.
Res. 24 (2023), 240:1–240:113. http://jmlr.org/papers/v24/22-1144.html
[35] Konstantina Christakopoulou, Alberto Lalama, Cj Adams, Iris Qu, Yifat Amir, Samer Chucri, Pierce Vollucci, Fabio
Soldo, Dina Bseiso, Sarah Scodel, Lucas Dixon, Ed H. Chi, and Minmin Chen. 2023. Large Language Models for User
Interest Journeys. CoRR abs/2305.15498 (2023). https://doi.org/10.48550/ARXIV.2305.15498 arXiv:2305.15498
[36] Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep Reinforcement
Learning from Human Preferences. In Advances in Neural Information Processing Systems 30: Annual Conference on
Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. 4299–4307. https://proceedings.
neurips.cc/paper/2017/hash/d5e2c0adad503c91f91df240d0cd4e49-Abstract.html
[37] Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James R. Glass, and Pengcheng He. 2023. DoLa: Decoding by
Contrasting Layers Improves Factuality in Large Language Models. CoRR abs/2309.03883 (2023). https://doi.org/10.
48550/ARXIV.2309.03883 arXiv:2309.03883
[38] Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc,
Blake A. Hechtman, Trevor Cai, Sebastian Borgeaud, George van den Driessche, Eliza Rutherford, Tom Hennigan,
Matthew J. Johnson, Katie Millican, Albin Cassirer, Chris Jones, Elena Buchatskaya, David Budden, Laurent Sifre, Simon
Osindero, Oriol Vinyals, Jack W. Rae, Erich Elsen, Koray Kavukcuoglu, and Karen Simonyan. 2022. Unified Scaling
Laws for Routed Language Models. CoRR abs/2202.01169 (2022). arXiv:2202.01169 https://arxiv.org/abs/2202.01169
[39] Andrea Cossu, Tinne Tuytelaars, Antonio Carta, Lucia Passaro, Vincenzo Lomonaco, and Davide Bacciu. 2022.
Continual pre-training mitigates forgetting in language and vision. arXiv preprint arXiv:2205.09357 (2022).
[40] Nick Craswell. 2009. Mean Reciprocal Rank. In Encyclopedia of Database Systems. Springer US, 1703. https:
//doi.org/10.1007/978-0-387-39940-9_488
[41] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, and Daniel Campos. 2021. Overview of the TREC 2020 deep learning
track. CoRR abs/2102.07662 (2021). arXiv:2102.07662 https://arxiv.org/abs/2102.07662
[42] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2020. Overview of the TREC
2019 deep learning track. CoRR abs/2003.07820 (2020). arXiv:2003.07820 https://arxiv.org/abs/2003.07820
[43] Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek,
Nicola Tonellotto, and Fabrizio Silvestri. 2024. The Power of Noise: Redefining Retrieval for RAG Systems.
arXiv:2401.14887 [cs.IR]
[44] Peng Cui and Mrinmaya Sachan. 2023. Adaptive and Personalized Exercise Generation for Online Language Learning.
In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL
2023, Toronto, Canada, July 9-14, 2023. Association for Computational Linguistics, 10184–10198. https://doi.org/10.
18653/V1/2023.ACL-LONG.567
[45] Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. Knowledge Neurons in Pretrained
Transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022. Association for Computational Linguistics, 8493–8502.
https://doi.org/10.18653/V1/2022.ACL-LONG.581
[46] Sunhao Dai, Ninglu Shao, Haiyuan Zhao, Weijie Yu, Zihua Si, Chen Xu, Zhongxiang Sun, Xiao Zhang, and Jun Xu.
2023. Uncovering ChatGPT’s Capabilities in Recommender Systems. In Proceedings of the 17th ACM Conference on
Recommender Systems, RecSys 2023, Singapore, Singapore, September 18-22, 2023. ACM, 1126–1132. https://doi.org/10.
1145/3604915.3610646
[47] Sunhao Dai, Chen Xu, Shicheng Xu, Liang Pang, Zhenhua Dong, and Jun Xu. 2024. Unifying Bias and Unfairness
in Information Retrieval: A Survey of Challenges and Opportunities with Large Language Models. arXiv preprint
arXiv:2404.11457 (2024).
[48] Yuhao Dan, Zhikai Lei, Yiyang Gu, Yong Li, Jianghao Yin, Jiaju Lin, Linhao Ye, Zhiyan Tie, Yougen Zhou, Yilei Wang,
et al. 2023. Educhat: A large-scale language model-based chatbot system for intelligent education. arXiv preprint
arXiv:2308.02773 (2023).
[49] Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. 2023.
Chain-of-Verification Reduces Hallucination in Large Language Models. arXiv:2309.11495 [cs.CL]
[50] Joseph Dien. 2023. Generative artificial intelligence as a plagiarism problem. , 108621 pages.
[51] Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of Wikipedia:
Knowledge-Powered Conversational Agents. In International Conference on Learning Representations.
[52] Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen, and Xueqi Cheng. 2024. Retrieve Only When It Needs: Adaptive
Retrieval Augmentation for Hallucination Mitigation in Large Language Models. arXiv:2402.10612 [cs.CL]
[53] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min
Chan, Weize Chen, Jing Yi, Weilin Zhao, Xiaozhi Wang, Zhiyuan Liu, Hai-Tao Zheng, Jianfei Chen, Yang Liu, Jie Tang,
Juanzi Li, and Maosong Sun. 2023. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nat.
Mac. Intell. 5, 3 (2023), 220–235. https://doi.org/10.1038/S42256-023-00626-4


[54] Guanting Dong, Xiaoshuai Song, Yutao Zhu, Runqi Qiao, Zhicheng Dou, and Ji-Rong Wen. 2024. Toward General
Instruction-Following Alignment for Retrieval-Augmented Generation. CoRR abs/2410.09584 (2024). https://doi.org/
10.48550/ARXIV.2410.09584 arXiv:2410.09584
[55] Guanting Dong, Yutao Zhu, Chenghao Zhang, Zechen Wang, Zhicheng Dou, and Ji-Rong Wen. 2024. Understand
What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation. CoRR abs/2406.18676 (2024).
https://doi.org/10.48550/ARXIV.2406.18676 arXiv:2406.18676
[56] Qingxiu Dong, Damai Dai, Yifan Song, Jingjing Xu, Zhifang Sui, and Lei Li. 2022. Calibrating Factual Knowledge
in Pretrained Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu
Dhabi, United Arab Emirates, December 7-11, 2022. Association for Computational Linguistics, 5937–5947. https:
//doi.org/10.18653/V1/2022.FINDINGS-EMNLP.438
[57] Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria
Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The Faiss library. CoRR abs/2401.08281 (2024). https://doi.org/10.
48550/ARXIV.2401.08281 arXiv:2401.08281
[58] Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi
Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma
Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathy Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang,
Quoc V. Le, Yonghui Wu, Zhifeng Chen, and Claire Cui. 2021. GLaM: Efficient Scaling of Language Models with
Mixture-of-Experts. CoRR abs/2112.06905 (2021). arXiv:2112.06905 https://arxiv.org/abs/2112.06905
[59] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil
Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783
(2024).
[60] Yihao Fang, Stephen W. Thomas, and Xiaodan Zhu. 2024. HGOT: Hierarchical Graph of Thoughts for Retrieval-
Augmented In-Context Learning in Factuality Evaluation. CoRR abs/2402.09390 (2024). https://doi.org/10.48550/
ARXIV.2402.09390 arXiv:2402.09390
[61] William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: scaling to trillion parameter models with
simple and efficient sparsity. J. Mach. Learn. Res. 23, 1, Article 120 (jan 2022), 39 pages.
[62] Paolo Ferragina and Giovanni Manzini. 2000. Opportunistic Data Structures with Applications. In 41st Annual
Symposium on Foundations of Computer Science, FOCS 2000, 12-14 November 2000, Redondo Beach, California, USA.
IEEE Computer Society, 390–398. https://doi.org/10.1109/SFCS.2000.892127
[63] Constanza Fierro, Reinald Kim Amplayo, Fantine Huot, Nicola De Cao, Joshua Maynez, Shashi Narayan, and Mirella
Lapata. 2024. Learning to Plan and Generate Text with Citations. arXiv preprint arXiv:2404.03381 (2024).
[64] Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE v2: Sparse Lexical
and Expansion Model for Information Retrieval. CoRR abs/2109.10086 (2021). arXiv:2109.10086 https://arxiv.org/abs/
2109.10086
[65] Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE: Sparse Lexical and Expansion Model
for First Stage Ranking. In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in
Information Retrieval, Virtual Event, Canada, July 11-15, 2021. ACM, 2288–2292. https://doi.org/10.1145/3404835.
3463098
[66] Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023. GPTScore: Evaluate as You Desire. CoRR
abs/2302.04166 (2023). https://doi.org/10.48550/ARXIV.2302.04166 arXiv:2302.04166
[67] Tingchen Fu, Xueliang Zhao, Chongyang Tao, Ji-Rong Wen, and Rui Yan. 2022. There Are a Thousand Hamlets in a
Thousand People’s Eyes: Enhancing Knowledge-grounded Dialogue with Personal Memory. In Proceedings of the 60th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May
22-27, 2022. Association for Computational Linguistics, 3901–3913. https://doi.org/10.18653/V1/2022.ACL-LONG.270
[68] Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md. Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi
Zhang, and Nesreen K. Ahmed. 2023. Bias and Fairness in Large Language Models: A Survey. CoRR abs/2309.00770
(2023). https://doi.org/10.48550/ARXIV.2309.00770 arXiv:2309.00770
[69] Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, and Mike Zheng Shou. 2023. AssistGPT:
A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn. arXiv:2306.08640 [cs.CV]
[70] Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Y. Zhao, Ni
Lao, Hongrae Lee, Da-Cheng Juan, and Kelvin Guu. 2023. RARR: Researching and Revising What Language Models
Say, Using Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023. Association for Computational Linguistics,
16477–16508. https://doi.org/10.18653/V1/2023.ACL-LONG.910
[71] Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. 2023. Enabling Large Language Models to Generate Text with
Citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023,
Singapore, December 6-10, 2023. Association for Computational Linguistics, 6465–6488. https://aclanthology.org/2023.


emnlp-main.398
[72] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and
Haofen Wang. 2024. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997 [cs.CL]
[73] Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. 2014. Optimized Product Quantization. IEEE Trans. Pattern Anal.
Mach. Intell. 36, 4 (2014), 744–755. https://doi.org/10.1109/TPAMI.2013.240
[74] Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022. Recommendation as Language
Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5). In RecSys ’22: Sixteenth ACM
Conference on Recommender Systems, Seattle, WA, USA, September 18 - 23, 2022. ACM, 299–315. https://doi.org/10.
1145/3523227.3546767
[75] Lukas Gienapp, Harrisen Scells, Niklas Deckers, Janek Bevendorff, Shuai Wang, Johannes Kiesel, Shahbaz Syed,
Maik Fröbe, Guido Zuccon, Benno Stein, Matthias Hagen, and Martin Potthast. 2023. Evaluating Generative Ad Hoc
Information Retrieval. ArXiv abs/2311.04694 (2023). https://api.semanticscholar.org/CorpusID:265050661
[76] Google. 2023. Google. https://www.google.com (2023).
[77] Google. 2023. YouTube. https://www.youtube.com (2023).
[78] Albert Gu and Tri Dao. 2023. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. CoRR abs/2312.00752
(2023). https://doi.org/10.48550/ARXIV.2312.00752 arXiv:2312.00752
[79] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan
Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang,
Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. 2023. Textbooks Are All You Need.
arXiv:2306.11644 [cs.CL]
[80] Jiafeng Guo, Changjiang Zhou, Ruqing Zhang, Jiangui Chen, Maarten de Rijke, Yixing Fan, and Xueqi Cheng. 2024.
CorpusBrain++: A Continual Generative Pre-Training Framework for Knowledge-Intensive Language Tasks. arXiv
preprint arXiv:2402.16767 (2024).
[81] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language
model pre-training. In International conference on machine learning. PMLR, 3929–3938.
[82] Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. 2023. ToolkenGPT: Augmenting Frozen Language Models
with Massive Tools via Tool Embeddings. In Advances in Neural Information Processing Systems 36: Annual Conference
on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http:
//papers.nips.cc/paper_files/paper/2023/hash/8fd1a81c882cd45f64958da6284f4a3f-Abstract-Conference.html
[83] Tom Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon Kim, and Marzyeh Ghassemi. 2023. Aging with
GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors. In Advances in Neural Information Processing
Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA,
December 10 - 16, 2023. http://papers.nips.cc/paper_files/paper/2023/hash/95b6e2ff961580e03c0a662a63a71812-
Abstract-Conference.html
[84] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021.
Measuring Massive Multitask Language Understanding. In 9th International Conference on Learning Representations,
ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net. https://openreview.net/forum?id=d7KBjmI3GmQ
[85] William R. Hersh. 2023. Search Still Matters: Information Retrieval in the Era of Generative AI. CoRR abs/2311.18550
(2023). https://doi.org/10.48550/ARXIV.2311.18550 arXiv:2311.18550
[86] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana
Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In EMNLP.
782–792.
[87] Chengyu Huang, Zeqiu Wu, Yushi Hu, and Wenya Wang. 2024. Training Language Models to Generate Text
with Citations via Fine-grained Rewards. CoRR abs/2402.04315 (2024). https://doi.org/10.48550/ARXIV.2402.04315
arXiv:2402.04315
[88] Jie Huang and Kevin Chen-Chuan Chang. 2023. Citation: A Key to Building Responsible and Accountable Large
Language Models. CoRR abs/2307.02185 (2023). https://doi.org/10.48550/ARXIV.2307.02185 arXiv:2307.02185
[89] Jie Huang, Hanyin Shao, and Kevin Chen-Chuan Chang. 2022. Are Large Pre-Trained Language Models Leaking Your
Personal Information?. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United
Arab Emirates, December 7-11, 2022. Association for Computational Linguistics, 2038–2047. https://doi.org/10.18653/
V1/2022.FINDINGS-EMNLP.148
[90] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng,
Xiaocheng Feng, Bing Qin, and Ting Liu. 2023. A Survey on Hallucination in Large Language Models: Principles,
Taxonomy, Challenges, and Open Questions. CoRR abs/2311.05232 (2023). https://doi.org/10.48550/ARXIV.2311.05232
arXiv:2311.05232
[91] Qiushi Huang, Shuai Fu, Xubo Liu, Wenwu Wang, Tom Ko, Yu Zhang, and Lilian H. Y. Tang. 2023. Learning Retrieval
Augmentation for Personalized Dialogue Generation. In Proceedings of the 2023 Conference on Empirical Methods in


Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023. Association for Computational Linguistics,
2523–2540. https://doi.org/10.18653/V1/2023.EMNLP-MAIN.154
[92] Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng
Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. 2023. C-Eval: A Multi-Level Multi-Discipline
Chinese Evaluation Suite for Foundation Models. In Advances in Neural Information Processing Systems 36: Annual
Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16,
2023. http://papers.nips.cc/paper_files/paper/2023/hash/c6ec1844bec96d6d32ae95ae694e23d8-Abstract-Datasets_
and_Benchmarks.html
[93] Yue Huang, Qihui Zhang, Philip S. Yu, and Lichao Sun. 2023. TrustGPT: A Benchmark for Trustworthy and Responsible
Large Language Models. CoRR abs/2306.11507 (2023). https://doi.org/10.48550/ARXIV.2306.11507 arXiv:2306.11507
[94] Zeyu Huang, Yikang Shen, Xiaofeng Zhang, Jie Zhou, Wenge Rong, and Zhang Xiong. 2023. Transformer-Patcher:
One Mistake Worth One Neuron. In The Eleventh International Conference on Learning Representations, ICLR 2023,
Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=4oYUGeGBPm
[95] Gautier Izacard and Edouard Grave. 2020. Leveraging passage retrieval with generative models for open domain
question answering. arXiv preprint arXiv:2007.01282 (2020).
[96] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts.
Neural computation 3, 1 (1991), 79–87.
[97] Palak Jain, Livio Soares, and Tom Kwiatkowski. 2023. 1-PAGER: One Pass Answer Generation and Evidence Retrieval.
In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023. Association
for Computational Linguistics, 14529–14543. https://doi.org/10.18653/V1/2023.FINDINGS-EMNLP.967
[98] Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin
Choi, and Prithviraj Ammanabrolu. 2023. Personalized Soups: Personalized Large Language Model Alignment via Post-
hoc Parameter Merging. CoRR abs/2310.11564 (2023). https://doi.org/10.48550/ARXIV.2310.11564 arXiv:2310.11564
[99] Joel Jang, Seungone Kim, Seonghyeon Ye, Doyoung Kim, Lajanugen Logeswaran, Moontae Lee, Kyungjae Lee, and
Minjoon Seo. 2023. Exploring the benefits of training expert language models over instruction tuning. In International
Conference on Machine Learning. PMLR, 14702–14729.
[100] Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Stanley Jungkyu Choi, and
Minjoon Seo. 2021. Towards continual knowledge learning of language models. arXiv preprint arXiv:2110.03215
(2021).
[101] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf.
Syst. 20, 4 (2002), 422–446. https://doi.org/10.1145/582415.582418
[102] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2011. Product Quantization for Nearest Neighbor Search. IEEE
Trans. Pattern Anal. Mach. Intell. 33, 1 (2011), 117–128. https://doi.org/10.1109/TPAMI.2010.57
[103] Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and
Yaodong Yang. 2023. BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset. In
Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems
2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http://papers.nips.cc/paper_files/paper/2023/hash/
4dbb61cb68671edc4ca3712d70083b9f-Abstract-Datasets_and_Benchmarks.html
[104] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and
Pascale Fung. 2023. Survey of hallucination in natural language generation. Comput. Surveys 55, 12 (2023), 1–38.
[105] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas,
Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux,
Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B.
CoRR abs/2310.06825 (2023). https://doi.org/10.48550/ARXIV.2310.06825 arXiv:2310.06825
[106] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Deven-
dra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guil-
laume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia
Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and
William El Sayed. 2024. Mixtral of Experts. CoRR abs/2401.04088 (2024). https://doi.org/10.48550/ARXIV.2401.04088
arXiv:2401.04088
[107] Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. 2023. LLM-Blender: Ensembling Large Language Models with Pairwise
Comparison and Generative Fusion. In Proceedings of the 61th Annual Meeting of the Association for Computational
Linguistics (ACL 2023).
[108] Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. LongLLM-
Lingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression. ArXiv preprint
abs/2310.06839 (2023). https://arxiv.org/abs/2310.06839


[109] Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. LLMLingua: Compressing Prompts
for Accelerated Inference of Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in
Natural Language Processing. Association for Computational Linguistics, 13358–13376. https://doi.org/10.18653/v1/
2023.emnlp-main.825
[110] Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Xin Zhao, and Ji-Rong Wen. 2023. StructGPT: A General Framework
for Large Language Model to Reason over Structured Data. In Proceedings of the 2023 Conference on Empirical
Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 9237–9251. https:
//doi.org/10.18653/v1/2023.emnlp-main.574
[111] Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and
Graham Neubig. 2023. Active Retrieval Augmented Generation. In Proceedings of the 2023 Conference on Empirical
Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023. Association for Computational
Linguistics, 7969–7992. https://aclanthology.org/2023.emnlp-main.495
[112] Bowen Jin, Hansi Zeng, Guoyin Wang, Xiusi Chen, Tianxin Wei, Ruirui Li, Zhengyang Wang, Zheng Li, Yang Li,
Hanqing Lu, et al. 2023. Language Models As Semantic Indexers. arXiv preprint arXiv:2310.07815 (2023).
[113] Jiajie Jin, Yutao Zhu, Xinyu Yang, Chenghao Zhang, and Zhicheng Dou. 2024. FlashRAG: A Modular Toolkit for Efficient
Retrieval-Augmented Generation Research. CoRR abs/2405.13576 (2024). https://doi.org/10.48550/ARXIV.2405.13576
arXiv:2405.13576
[114] Jiajie Jin, Yutao Zhu, Yujia Zhou, and Zhicheng Dou. 2024. BIDER: Bridging Knowledge Inconsistency for Efficient
Retrieval-Augmented LLMs via Key Supporting Evidence. CoRR abs/2402.12174 (2024). https://doi.org/10.48550/
ARXIV.2402.12174 arXiv:2402.12174
[115] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised
Challenge Dataset for Reading Comprehension. In ACL. Association for Computational Linguistics, Vancouver,
Canada, 1601–1611.
[116] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec
Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. CoRR abs/2001.08361 (2020).
arXiv:2001.08361 https://arxiv.org/abs/2001.08361
[117] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau
Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In EMNLP. 6769–6781.
[118] Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Yu, Dragomir Radev,
Noah A. Smith, Yejin Choi, and Kentaro Inui. 2023. RealTime QA: What’s the Answer Right Now?. In Advances
in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023,
NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http://papers.nips.cc/paper_files/paper/2023/hash/
9941624ef7f867a502732b5154d30cb7-Abstract-Datasets_and_Benchmarks.html
[119] Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. 2023. Continual pre-training of
language models. arXiv preprint arXiv:2302.03241 (2023).
[120] Krishnaram Kenthapadi, Himabindu Lakkaraju, and Nazneen Rajani. 2023. Generative ai meets responsible ai:
Practical challenges and opportunities. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery
and Data Mining. 5805–5806.
[121] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In NAACL-HLT. 4171–4186.
[122] Muhammad Khalifa, David Wadden, Emma Strubell, Honglak Lee, Lu Wang, Iz Beltagy, and Hao Peng. 2024. Source-
Aware Training Enables Knowledge Attribution in Language Models. https://api.semanticscholar.org/CorpusID:
268819100
[123] Gangwoo Kim, Sungdong Kim, Byeongguk Jeon, Joonsuk Park, and Jaewoo Kang. 2023. Tree of Clarifications:
Answering Ambiguous Questions with Retrieval-Augmented Large Language Models. In Proceedings of the 2023
Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore,
996–1009. https://doi.org/10.18653/v1/2023.emnlp-main.63
[124] Varsha Kishore, Chao Wan, Justin Lovelace, Yoav Artzi, and Kilian Q. Weinberger. 2023. IncDSI: Incrementally
Updatable Document Retrieval. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu,
Hawaii, USA (Proceedings of Machine Learning Research, Vol. 202). PMLR, 17122–17134. https://proceedings.mlr.press/
v202/kishore23a.html
[125] Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2022. Internet-Augmented Dialogue Generation. In Proceedings
of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for
Computational Linguistics, Dublin, Ireland, 8460–8478. https://doi.org/10.18653/v1/2022.acl-long.579
[126] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle
Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering
research. Transactions of the Association for Computational Linguistics 7 (2019), 453–466.


[127] Tin Lai, Yukun Shi, Zicong Du, Jiajie Wu, Ken Fu, Yichao Dou, and Ziqi Wang. 2023. Psy-LLM: Scaling up Global
Mental Health Psychological Services with AI-based Large Language Models. CoRR abs/2307.11991 (2023). https:
//doi.org/10.48550/ARXIV.2307.11991 arXiv:2307.11991
[128] Tian Lan, Deng Cai, Yan Wang, Heyan Huang, and Xian-Ling Mao. 2023. Copy is All You Need. In The Eleventh
International Conference on Learning Representations. https://openreview.net/forum?id=CROlOA9Nd8C
[129] Dongyub Lee, Taesun Whang, Chanhee Lee, and Heuiseok Lim. 2023. Towards Reliable and Fluent Large Language
Models: Incorporating Feedback Learning Loops in QA Systems. CoRR abs/2309.06384 (2023). https://doi.org/10.
48550/ARXIV.2309.06384 arXiv:2309.06384
[130] Hyunji Lee, JaeYoung Kim, Hoyeon Chang, Hanseok Oh, Sohee Yang, Vladimir Karpukhin, Yi Lu, and Minjoon
Seo. 2023. Nonparametric Decoding for Generative Retrieval. In Findings of the Association for Computational
Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023. Association for Computational Linguistics, 12642–12661.
https://doi.org/10.18653/V1/2023.FINDINGS-ACL.801
[131] Hyunji Lee, Sohee Yang, Hanseok Oh, and Minjoon Seo. 2022. Generative Multi-hop Retrieval. In Proceedings of the
2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates,
December 7-11, 2022. Association for Computational Linguistics, 1417–1436. https://doi.org/10.18653/V1/2022.EMNLP-
MAIN.92
[132] Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas
Carlini. 2022. Deduplicating Training Data Makes Language Models Better. In Proceedings of the 60th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022.
Association for Computational Linguistics, 8424–8445. https://doi.org/10.18653/V1/2022.ACL-LONG.577
[133] Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale Fung, Mohammad Shoeybi, and Bryan Catanzaro. 2022.
Factuality Enhanced Language Models for Open-Ended Text Generation. In Advances in Neural Information Processing
Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, No-
vember 28 - December 9, 2022. http://papers.nips.cc/paper_files/paper/2022/hash/df438caa36714f69277daa92d608dd63-
Abstract-Conference.html
[134] Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale Fung, Mohammad Shoeybi, and Bryan Catanzaro. 2023.
Factuality Enhanced Language Models for Open-Ended Text Generation. arXiv:2206.04624 [cs.CL]
[135] Sunkyung Lee, Minjin Choi, and Jongwuk Lee. 2023. GLEN: Generative Retrieval via Lexical Index Learning. In
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore,
December 6-10, 2023. Association for Computational Linguistics, 7693–7704. https://aclanthology.org/2023.emnlp-
main.477
[136] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam
Shazeer, and Zhifeng Chen. 2020. GShard: Scaling Giant Models with Conditional Computation and Automatic
Sharding. CoRR abs/2006.16668 (2020). arXiv:2006.16668 https://arxiv.org/abs/2006.16668
[137] Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-Shot Relation Extraction via Reading
Comprehension. In CoNLL. 333–342.
[138] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov,
and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation,
Translation, and Comprehension. In ACL. 7871–7880.
[139] Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich
Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented
Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems 33: Annual
Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. https://
proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html
[140] Cheng Li, Mingyang Zhang, Qiaozhu Mei, Weize Kong, and Michael Bendersky. 2023. Automatic Prompt Rewrit-
ing for Personalized Text Generation. CoRR abs/2310.00152 (2023). https://doi.org/10.48550/ARXIV.2310.00152
arXiv:2310.00152
[141] Cheng Li, Mingyang Zhang, Qiaozhu Mei, Yaqing Wang, Spurthi Amba Hombaiah, Yi Liang, and Michael Bendersky.
2023. Teach LLMs to Personalize - An Approach inspired by Writing Education. CoRR abs/2308.07968 (2023).
https://doi.org/10.48550/ARXIV.2308.07968 arXiv:2308.07968
[142] Dongfang Li, Zetian Sun, Baotian Hu, Zhenyu Liu, Xinshuo Hu, Xuebo Liu, and Min Zhang. 2024. Improving
Attributed Text Generation of Large Language Models via Preference Learning. CoRR abs/2403.18381 (2024). https:
//doi.org/10.48550/ARXIV.2403.18381 arXiv:2403.18381
[143] Dongfang Li, Zetian Sun, Xinshuo Hu, Zhenyu Liu, Ziyang Chen, Baotian Hu, Aiguo Wu, and Min Zhang. 2023. A
Survey of Large Language Models Attribution. CoRR abs/2311.03731 (2023). https://doi.org/10.48550/ARXIV.2311.
03731 arXiv:2311.03731


[144] Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023. HaluEval: A Large-Scale Hallucination
Evaluation Benchmark for Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in
Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023. Association for Computational Linguistics,
6449–6464. https://aclanthology.org/2023.emnlp-main.397
[145] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training
for unified vision-language understanding and generation. In International conference on machine learning. PMLR,
12888–12900.
[146] Jinming Li, Wentao Zhang, Tian Wang, Guanglei Xiong, Alan Lu, and Gerard Medioni. 2023. GPT4Rec: A generative
framework for personalized recommendation and user interests interpretation. arXiv preprint arXiv:2304.03879 (2023).
[147] Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. Inference-Time Intervention:
Eliciting Truthful Answers from a Language Model. In Thirty-seventh Conference on Neural Information Processing
Systems. https://openreview.net/forum?id=aLLuYpn83y
[148] Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin
Li. 2023. API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs. In Proceedings of the 2023 Conference
on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023. Association for
Computational Linguistics, 3102–3116. https://aclanthology.org/2023.emnlp-main.187
[149] Pan Li and Alexander Tuzhilin. 2019. Towards Controllable and Personalized Review Generation. In Proceedings
of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Confer-
ence on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019. Association for
Computational Linguistics, 3235–3243. https://doi.org/10.18653/V1/D19-1319
[150] Weitao Li, Junkai Li, Weizhi Ma, and Yang Liu. 2024. Citation-Enhanced Generation for LLM-based Chatbots. CoRR
abs/2402.16063 (2024). https://doi.org/10.48550/ARXIV.2402.16063 arXiv:2402.16063
[151] Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. 2025.
Search-o1: Agentic Search-Enhanced Large Reasoning Models. CoRR abs/2501.05366 (2025). https://doi.org/10.48550/
ARXIV.2501.05366 arXiv:2501.05366
[152] Xiaoxi Li, Zhicheng Dou, Yujia Zhou, and Fangchao Liu. 2024. CorpusLM: Towards a Unified Language Model on
Corpus for Knowledge-Intensive Tasks. In Proceedings of the 47th International ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR 2024, Washington DC, USA, July 14-18, 2024, Grace Hui Yang, Hongning
Wang, Sam Han, Claudia Hauff, Guido Zuccon, and Yi Zhang (Eds.). ACM, 26–37. https://doi.org/10.1145/3626772.
3657778
[153] Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yongkang Wu, Zhonghua Li, Qi Ye, and Zhicheng Dou. 2024. RetroLLM: Empowering
Large Language Models to Retrieve Fine-grained Evidence within Generation. CoRR abs/2412.11919 (2024). https:
//doi.org/10.48550/ARXIV.2412.11919 arXiv:2412.11919
[154] Xiaopeng Li, Shasha Li, Shezheng Song, Jing Yang, Jun Ma, and Jie Yu. 2024. Pmet: Precise model editing in a
transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 18564–18572.
[155] Xiaoxi Li, Yujia Zhou, and Zhicheng Dou. 2024. UniGen: A Unified Generative Framework for Retrieval and Question
Answering with Large Language Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38.
8688–8696.
[156] Xiaonan Li, Changtai Zhu, Linyang Li, Zhangyue Yin, Tianxiang Sun, and Xipeng Qiu. 2023. LLatrieval: LLM-
Verified Retrieval for Verifiable Generation. CoRR abs/2311.07838 (2023). https://doi.org/10.48550/ARXIV.2311.07838
arXiv:2311.07838
[157] Yongqi Li, Wenjie Wang, Leigang Qu, Liqiang Nie, Wenjie Li, and Tat-Seng Chua. 2024. Generative Cross-Modal
Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond. CoRR abs/2402.10805 (2024).
https://doi.org/10.48550/ARXIV.2402.10805 arXiv:2402.10805
[158] Yongqi Li, Nan Yang, Liang Wang, Furu Wei, and Wenjie Li. 2023. Generative retrieval for conversational question
answering. Inf. Process. Manag. 60, 5 (2023), 103475. https://doi.org/10.1016/J.IPM.2023.103475
[159] Yongqi Li, Nan Yang, Liang Wang, Furu Wei, and Wenjie Li. 2023. Learning to Rank in Generative Retrieval. CoRR
abs/2306.15222 (2023). https://doi.org/10.48550/ARXIV.2306.15222 arXiv:2306.15222
[160] Yongqi Li, Nan Yang, Liang Wang, Furu Wei, and Wenjie Li. 2023. Multiview Identifiers Enhanced Generative
Retrieval. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023. Association for Computational Linguistics, 6636–6648.
https://doi.org/10.18653/V1/2023.ACL-LONG.366
[161] Yongqi Li, Zhen Zhang, Wenjie Wang, Liqiang Nie, Wenjie Li, and Tat-Seng Chua. 2024. Distillation Enhanced
Generative Retrieval. arXiv preprint arXiv:2402.10769 (2024).
[162] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Annual Meeting of the Association
for Computational Linguistics. https://api.semanticscholar.org/CorpusID:964287


[163] Jimmy Lin and Xueguang Ma. 2021. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for
Information Retrieval Techniques. CoRR abs/2106.14807 (2021). arXiv:2106.14807 https://arxiv.org/abs/2106.14807
[164] Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring How Models Mimic Human Falsehoods.
In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
ACL 2022, Dublin, Ireland, May 22-27, 2022. Association for Computational Linguistics, 3214–3252. https://doi.org/10.
18653/V1/2022.ACL-LONG.229
[165] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and
C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Computer Vision - ECCV 2014 -
13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V (Lecture Notes in Computer
Science, Vol. 8693). Springer, 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
[166] Yen-Ting Lin and Yun-Nung Chen. 2023. LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-
Domain Conversations with Large Language Models. CoRR abs/2305.13711 (2023). https://doi.org/10.48550/ARXIV.
2305.13711 arXiv:2305.13711
[167] Adam Liska, Tomás Kociský, Elena Gribovskaya, Tayfun Terzi, Eren Sezener, Devang Agrawal, Cyprien de Mas-
son d’Autume, Tim Scholtes, Manzil Zaheer, Susannah Young, Ellen Gilsenan-McMahon, Sophia Austin, Phil
Blunsom, and Angeliki Lazaridou. 2022. StreamingQA: A Benchmark for Adaptation to New Knowledge over
Time in Question Answering Models. In International Conference on Machine Learning, ICML 2022, 17-23 July
2022, Baltimore, Maryland, USA (Proceedings of Machine Learning Research, Vol. 162). PMLR, 13604–13622. https:
//proceedings.mlr.press/v162/liska22a.html
[168] Robert Litschko, Max Müller-Eberstein, Rob van der Goot, Leon Weber-Genzel, and Barbara Plank. 2023. Establishing
Trustworthiness: Rethinking Tasks and Model Evaluation. In Proceedings of the 2023 Conference on Empirical Methods in
Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023. Association for Computational Linguistics,
193–203. https://doi.org/10.18653/V1/2023.EMNLP-MAIN.14
[169] Junling Liu, Chao Liu, Renjie Lv, Kang Zhou, and Yan Zhang. 2023. Is ChatGPT a Good Recommender? A Preliminary
Study. CoRR abs/2304.10149 (2023). https://doi.org/10.48550/ARXIV.2304.10149 arXiv:2304.10149
[170] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023.
Lost in the Middle: How Language Models Use Long Contexts. arXiv preprint arXiv:2307.03172 (2023).
[171] Nelson F. Liu, Tianyi Zhang, and Percy Liang. 2023. Evaluating Verifiability in Generative Search Engines. In
Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023. Association for
Computational Linguistics, 7001–7025. https://doi.org/10.18653/V1/2023.FINDINGS-EMNLP.467
[172] Qian Liu, Yihong Chen, Bei Chen, Jian-Guang Lou, Zixuan Chen, Bin Zhou, and Dongmei Zhang. 2020. You Impress
Me: Dialogue Generation via Mutual Persona Perception. In Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics, ACL 2020, Online, July 5-10, 2020. Association for Computational Linguistics, 1417–1427.
https://doi.org/10.18653/V1/2020.ACL-MAIN.131
[173] Wenhan Liu, Xinyu Ma, Yutao Zhu, Ziliang Zhao, Shuaiqiang Wang, Dawei Yin, and Zhicheng Dou. 2024. Sliding
Windows Are Not the End: Exploring Full Ranking with Long-Context Large Language Models. CoRR abs/2412.14574
(2024). https://doi.org/10.48550/ARXIV.2412.14574 arXiv:2412.14574
[174] Xiao Liu, Hanyu Lai, Hao Yu, Yifan Xu, Aohan Zeng, Zhengxiao Du, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023.
WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences. In Proceedings
of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023, Long Beach, CA, USA,
August 6-10, 2023. ACM, 4549–4560. https://doi.org/10.1145/3580305.3599931
[175] Xin Liu, Daniel McDuff, Geza Kovacs, Isaac R. Galatzer-Levy, Jacob E. Sunshine, Jiening Zhan, Ming-Zher Poh, Shun
Liao, Paolo Di Achille, and Shwetak N. Patel. 2023. Large Language Models are Few-Shot Health Learners. CoRR
abs/2305.15525 (2023). https://doi.org/10.48550/ARXIV.2305.15525 arXiv:2305.15525
[176] Yu-An Liu, Ruqing Zhang, Jiafeng Guo, Wei Chen, and Xueqi Cheng. 2023. On the Robustness of Generative Retrieval
Models: An Out-of-Distribution Perspective. CoRR abs/2306.12756 (2023). https://doi.org/10.48550/ARXIV.2306.12756
arXiv:2306.12756
[177] Zhengliang Liu, Zihao Wu, Mengxuan Hu, Bokai Zhao, Lin Zhao, Tianyi Zhang, Haixing Dai, Xianyan Chen, Ye Shen,
Sheng Li, et al. 2023. Pharmacygpt: The ai pharmacist. arXiv preprint arXiv:2307.10432 (2023).
[178] Xinwei Long, Jiali Zeng, Fandong Meng, Zhiyuan Ma, Kaiyan Zhang, Bowen Zhou, and Jie Zhou. 2024. Generative
Multi-Modal Knowledge Retrieval with Large Language Models. CoRR abs/2401.08206 (2024). https://doi.org/10.
48550/ARXIV.2401.08206 arXiv:2401.08206
[179] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn
Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming
Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. CodeXGLUE: A Machine
Learning Benchmark Dataset for Code Understanding and Generation. In Proceedings of the Neural Information
Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021,
virtual. https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/c16a5320fa475530d9583c34fd356ef5-
Abstract-round1.html
[180] Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. 2021. Sparse, Dense, and Attentional Representa-
tions for Text Retrieval. Trans. Assoc. Comput. Linguistics 9 (2021), 329–345. https://doi.org/10.1162/TACL_A_00369
[181] Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. 2024. Reasoning on Graphs: Faithful and Interpretable
Large Language Model Reasoning. In International Conference on Learning Representations.
[182] Jun-Yu Ma, Jia-Chen Gu, Zhen-Hua Ling, Quan Liu, and Cong Liu. 2023. Untying the Reversal Curse via Bidirectional
Language Model Editing. CoRR abs/2310.10322 (2023). https://doi.org/10.48550/ARXIV.2310.10322 arXiv:2310.10322
[183] Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. 2023. Query Rewriting in Retrieval-Augmented
Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
Association for Computational Linguistics, Singapore, 5303–5315. https://doi.org/10.18653/v1/2023.emnlp-main.322
[184] Zhengyi Ma, Zhicheng Dou, Yutao Zhu, Hanxun Zhong, and Ji-Rong Wen. 2021. One Chatbot Per Person: Creating
Personalized Chatbots based on Implicit User Profiles. In SIGIR ’21: The 44th International ACM SIGIR Conference on
Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021. ACM, 555–564. https:
//doi.org/10.1145/3404835.3462828
[185] Jean Maillard, Vladimir Karpukhin, Fabio Petroni, Wen-tau Yih, Barlas Oguz, Veselin Stoyanov, and Gargi Ghosh. 2021.
Multi-Task Retrieval for Knowledge-Intensive Tasks. In Proceedings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP
2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021. Association for Computational Linguistics, 1098–1111.
https://doi.org/10.18653/V1/2021.ACL-LONG.89
[186] Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni, and Julian J. McAuley. 2019. Generating Personalized Recipes
from Historical User Preferences. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong,
China, November 3-7, 2019. Association for Computational Linguistics, 5975–5981. https://doi.org/10.18653/V1/D19-
1613
[187] Yury A. Malkov and Dmitry A. Yashunin. 2020. Efficient and Robust Approximate Nearest Neighbor Search Using
Hierarchical Navigable Small World Graphs. IEEE Trans. Pattern Anal. Mach. Intell. 42, 4 (2020), 824–836. https:
//doi.org/10.1109/TPAMI.2018.2889473
[188] Udi Manber and Gene Myers. 1990. Suffix arrays: a new method for on-line string searches. In Proceedings of the
First Annual ACM-SIAM Symposium on Discrete Algorithms (San Francisco, California, USA) (SODA ’90). Society for
Industrial and Applied Mathematics, USA, 319–327.
[189] Julieta Martinez, Holger H. Hoos, and James J. Little. 2014. Stacked Quantizers for Compositional Vector Compression.
CoRR abs/1411.2173 (2014). arXiv:1411.2173 http://arxiv.org/abs/1411.2173
[190] Pierre-Emmanuel Mazaré, Samuel Humeau, Martin Raison, and Antoine Bordes. 2018. Training Millions of Per-
sonalized Dialogue Agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Pro-
cessing, Brussels, Belgium, October 31 - November 4, 2018. Association for Computational Linguistics, 2775–2779.
https://doi.org/10.18653/V1/D18-1298
[191] Vittorio Mazzia, Alessandro Pedrani, Andrea Caciolai, Kay Rottmann, and Davide Bernardi. 2023. A Survey on
Knowledge Editing of Neural Networks. CoRR abs/2310.19704 (2023). https://doi.org/10.48550/ARXIV.2310.19704
arXiv:2310.19704
[192] Sanket Vaibhav Mehta, Jai Gupta, Yi Tay, Mostafa Dehghani, Vinh Q Tran, Jinfeng Rao, Marc Najork, Emma
Strubell, and Donald Metzler. 2022. DSI++: Updating Transformer Memory with New Documents. arXiv preprint
arXiv:2212.09744 (2022).
[193] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and Editing Factual Associations in
GPT. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing
Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. http://papers.nips.cc/paper_files/
paper/2022/hash/6f1d43d5a82a37e89b0665b33bf3a182-Abstract-Conference.html
[194] Kevin Meng, Arnab Sen Sharma, Alex J. Andonian, Yonatan Belinkov, and David Bau. 2023. Mass-Editing Memory in
a Transformer. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May
1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=MkbcAHIYgyS
[195] Donald Metzler, Yi Tay, Dara Bahri, and Marc Najork. 2021. Rethinking search: making domain experts out of
dilettantes. In ACM SIGIR Forum, Vol. 55. ACM New York, NY, USA, 1–27.
[196] Microsoft. 2023. Bing. https://www.bing.com (2023).
[197] Microsoft. 2023. Bing Chat. https://www.bing.com/new (2023).
[198] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer,
and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text
Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023,
Singapore, December 6-10, 2023. Association for Computational Linguistics, 12076–12100. https://aclanthology.org/
2023.emnlp-main.741
[199] Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. 2022. Fast Model Editing at
Scale. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.
OpenReview.net. https://openreview.net/forum?id=0DcZxeWfOPt
[200] Fengran Mo, Kelong Mao, Ziliang Zhao, Hongjin Qian, Haonan Chen, Yiruo Cheng, Xiaoxi Li, Yutao Zhu, Zhicheng
Dou, and Jian-Yun Nie. 2024. A Survey of Conversational Search. arXiv preprint arXiv:2410.15576 (2024).
[201] Jisoo Mok, Jaeyoung Do, Sungjin Lee, Tara Taghavi, Seunghak Yu, and Sungroh Yoon. 2023. Large-scale lifelong
learning of in-context instructions and how to tackle it. In Proceedings of the 61st Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers). 12573–12589.
[202] Sheshera Mysore, Zhuoran Lu, Mengting Wan, Longqi Yang, Steve Menezes, Tina Baghaee, Emmanuel Barajas
Gonzalez, Jennifer Neville, and Tara Safavi. 2023. PEARL: Personalizing Large Language Model Writing Assistants
with Generation-Calibrated Retrievers. CoRR abs/2311.09180 (2023). https://doi.org/10.48550/ARXIV.2311.09180
arXiv:2311.09180
[203] Usama Nadeem, Noah Ziems, and Shaoen Wu. 2022. CodeDSI: Differentiable Code Search. CoRR abs/2210.00328
(2022). https://doi.org/10.48550/ARXIV.2210.00328 arXiv:2210.00328
[204] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu
Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button,
Matthew Knight, Benjamin Chess, and John Schulman. 2021. WebGPT: Browser-assisted question-answering with
human feedback. CoRR abs/2112.09332 (2021). arXiv:2112.09332 https://arxiv.org/abs/2112.09332
[205] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS
MARCO: A Human Generated MAchine Reading COmprehension Dataset. In Proceedings of the Workshop on Cognitive
Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural
Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016 (CEUR Workshop Proceedings, Vol. 1773).
CEUR-WS.org. https://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf
[206] Thong Nguyen and Andrew Yates. 2023. Generative Retrieval as Dense Retrieval. CoRR abs/2306.11397 (2023).
https://doi.org/10.48550/ARXIV.2306.11397 arXiv:2306.11397
[207] Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Y. Zhao, Yi Luan, Keith B.
Hall, Ming-Wei Chang, and Yinfei Yang. 2022. Large Dual Encoders Are Generalizable Retrievers. In Proceedings
of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab
Emirates, December 7-11, 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational
Linguistics, 9844–9855. https://doi.org/10.18653/V1/2022.EMNLP-MAIN.669
[208] Rodrigo Nogueira and Jimmy Lin. 2019. From doc2query to docTTTTTquery. Online preprint (2019).
[209] OpenAI. 2022. Introducing ChatGPT. https://openai.com/blog/chatgpt (2022).
[210] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini
Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda
Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow
instructions with human feedback. In Advances in Neural Information Processing Systems 35: Annual Conference on
Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
http://papers.nips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html
[211] Oded Ovadia, Menachem Brief, Moshik Mishaeli, and Oren Elisha. 2023. Fine-Tuning or Retrieval? Comparing Knowl-
edge Injection in LLMs. CoRR abs/2312.05934 (2023). https://doi.org/10.48550/ARXIV.2312.05934 arXiv:2312.05934
[212] Oded Ovadia, Menachem Brief, Moshik Mishaeli, and Oren Elisha. 2023. Fine-Tuning or Retrieval? Comparing Knowl-
edge Injection in LLMs. CoRR abs/2312.05934 (2023). https://doi.org/10.48550/ARXIV.2312.05934 arXiv:2312.05934
[213] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of
Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July
6-12, 2002, Philadelphia, PA, USA. ACL, 311–318. https://doi.org/10.3115/1073083.1073135
[214] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2023. Gorilla: Large Language Model Connected
with Massive APIs. arXiv:2305.15334 [cs.CL]
[215] Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli, Alessandro Cap-
pelli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb Dataset for Falcon LLM:
Outperforming Curated Corpora with Web Data Only. In Advances in Neural Information Processing Systems 36:
Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, Decem-
ber 10 - 16, 2023. http://papers.nips.cc/paper_files/paper/2023/hash/fa3ed726cc5073b9c31e3e49a807789c-Abstract-
Datasets_and_Benchmarks.html
[216] Bohao Peng, Zhuotao Tian, Shu Liu, Mingchang Yang, and Jiaya Jia. 2024. Scalable Language Model with Generalized
Continual Learning. arXiv preprint arXiv:2404.07470 (2024).
[217] Denis Peskoff and Brandon Stewart. 2023. Credible without Credit: Domain Experts Assess Generative Language
Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short
Papers), ACL 2023, Toronto, Canada, July 9-14, 2023. Association for Computational Linguistics, 427–438. https:
//doi.org/10.18653/V1/2023.ACL-SHORT.37
[218] Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine
Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. 2021. KILT: a
Benchmark for Knowledge Intensive Language Tasks. In Proceedings of the 2021 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational
Linguistics, Online, 2523–2544. https://doi.org/10.18653/v1/2021.naacl-main.200
[219] Sebastian Porsdam Mann, Brian D Earp, Nikolaj Møller, Suren Vynn, and Julian Savulescu. 2023. AUTOGEN: A
personalized large language model for academic enhancement—Ethics and proof of principle. The American Journal
of Bioethics 23, 10 (2023), 28–41.
[220] Ronak Pradeep, Kai Hui, Jai Gupta, Ádám D. Lelkes, Honglei Zhuang, Jimmy Lin, Donald Metzler, and Vinh Q. Tran.
2023. How Does Generative Retrieval Scale to Millions of Passages?. In Proceedings of the 2023 Conference on Empirical
Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023. Association for Computational
Linguistics, 1305–1321. https://aclanthology.org/2023.emnlp-main.83
[221] Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah Smith, and Mike Lewis. 2023. Measuring and Narrowing
the Compositionality Gap in Language Models. In Findings of the Association for Computational Linguistics: EMNLP
2023. Association for Computational Linguistics, Singapore, 5687–5711. https://doi.org/10.18653/v1/2023.findings-
emnlp.378
[222] Cheng Qian, Chi Han, Yi Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji. 2023. CREATOR: Tool Creation for Disentangling
Abstract and Concrete Reasoning of Large Language Models. In Findings of the Association for Computational
Linguistics: EMNLP 2023. Association for Computational Linguistics, Singapore, 6922–6939. https://doi.org/10.18653/
v1/2023.findings-emnlp.462
[223] Hongjing Qian, Yutao Zhu, Zhicheng Dou, Haoqi Gu, Xinyu Zhang, Zheng Liu, Ruofei Lai, Zhao Cao, Jian-Yun Nie,
and Ji-Rong Wen. 2023. WebBrain: Learning to Generate Factually Correct Articles for Queries by Grounding on
Large Web Corpus. CoRR abs/2304.04358 (2023). https://doi.org/10.48550/ARXIV.2304.04358 arXiv:2304.04358
[224] Shanbao Qiao, Xuebing Liu, and Seung-Hoon Na. 2023. DiffusionRet: Diffusion-Enhanced Generative Retriever using
Constrained Decoding. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December
6-10, 2023. Association for Computational Linguistics, 9515–9529. https://aclanthology.org/2023.findings-emnlp.638
[225] Yujia Qin, Zihan Cai, Dian Jin, Lan Yan, Shihao Liang, Kunlun Zhu, Yankai Lin, Xu Han, Ning Ding, Huadong
Wang, Ruobing Xie, Fanchao Qi, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2023. WebCPM: Interactive Web
Search for Chinese Long-form Question Answering. In Proceedings of the 61st Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada,
8968–8988. https://doi.org/10.18653/v1/2023.acl-long.499
[226] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian,
Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2023.
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. CoRR abs/2307.16789 (2023).
https://doi.org/10.48550/ARXIV.2307.16789 arXiv:2307.16789
[227] Yiwei Qin, Kaiqiang Song, Yebowen Hu, Wenlin Yao, Sangwoo Cho, Xiaoyang Wang, Xuansheng Wu, Fei Liu, Pengfei
Liu, and Dong Yu. 2024. InFoBench: Evaluating Instruction Following Ability in Large Language Models. (2024).
arXiv:2401.03601 [cs.CL]
[228] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by
generative pre-training. (2018).
[229] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are
unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[230] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct
Preference Optimization: Your Language Model is Secretly a Reward Model. In Thirty-seventh Conference on Neural
Information Processing Systems. https://openreview.net/forum?id=HPuSIXJaa9
[231] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and
Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn.
Res. 21 (2020), 140:1–140:67. http://jmlr.org/papers/v21/20-074.html
[232] Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don’t Know: Unanswerable Questions for SQuAD.
In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne,
Australia, July 15-20, 2018, Volume 2: Short Papers. Association for Computational Linguistics, 784–789. https:
//doi.org/10.18653/V1/P18-2124
[233] Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan Hulikal Keshavan, Trung Vu, Lukasz Heldt, Lichan
Hong, Yi Tay, Vinh Q. Tran, Jonah Samost, Maciej Kula, Ed H. Chi, and Mahesh Sathiamoorthy. 2023. Recommender
Systems with Generative Retrieval. In Advances in Neural Information Processing Systems 36: Annual Conference
on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http:
//papers.nips.cc/paper_files/paper/2023/hash/20dcab0f14046a5c6b02b61da9f13229-Abstract-Conference.html
[234] Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham.
2023. In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083 (2023).
[235] Juan Ramos et al. 2003. Using tf-idf to determine word relevance in document queries. In Proceedings of the first
instructional conference on machine learning, Vol. 242. Citeseer, 29–48.
[236] Anastasia Razdaibiedina, Yuning Mao, Rui Hou, Madian Khabsa, Mike Lewis, and Amjad Almahairi. 2023. Progressive
prompts: Continual learning for language models. arXiv preprint arXiv:2301.12314 (2023).
[237] Ruiyang Ren, Wayne Xin Zhao, Jing Liu, Hua Wu, Ji-Rong Wen, and Haifeng Wang. 2023. TOME: A Two-stage
Approach for Model-based Retrieval. In Proceedings of the 61st Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023. Association for Computational
Linguistics, 6102–6114. https://doi.org/10.18653/V1/2023.ACL-LONG.336
[238] Stephen E. Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Found.
Trends Inf. Retr. 3, 4 (2009), 333–389. https://doi.org/10.1561/1500000019
[239] Nafis Sadeq, Byungkyu Kang, Prarit Lamba, and Julian J. McAuley. 2023. Unsupervised Improvement of Factual
Knowledge in Language Models. In Proceedings of the 17th Conference of the European Chapter of the Association for
Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6, 2023. Association for Computational Linguistics,
2952–2961. https://doi.org/10.18653/V1/2023.EACL-MAIN.215
[240] Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. 2023. LaMP: When Large Language Models
Meet Personalization. CoRR abs/2304.11406 (2023). https://doi.org/10.48550/ARXIV.2304.11406 arXiv:2304.11406
[241] Malik Sallam, Nesreen A. Salim, Ala’a B. Al-Tammemi, Muna M Barakat, Diaa Fayyad, Souheil Hallit, Harapan
Harapan, Rabih Hallit, and Azmi Mahafzah. 2023. ChatGPT Output Regarding Compulsory Vaccination and COVID-
19 Vaccine Conspiracy: A Descriptive Study at the Outset of a Paradigm Shift in Online Search for Information.
Cureus 15 (2023). https://api.semanticscholar.org/CorpusID:256897987
[242] Gerard Salton, Edward A. Fox, and Harry Wu. 1983. Extended Boolean Information Retrieval. Commun. ACM 26, 11
(1983), 1022–1036. https://doi.org/10.1145/182.358466
[243] Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilic, Daniel Hesslow, Roman Castagné, Alexan-
dra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson,
Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral,
Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier,
Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin
Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav,
Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa
Adelani, and et al. 2022. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. CoRR abs/2211.05100
(2022). https://doi.org/10.48550/ARXIV.2211.05100 arXiv:2211.05100
[244] Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. 2023. Are Emergent Abilities of Large Language Models a
Mirage?. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing
Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http://papers.nips.cc/paper_files/paper/
2023/hash/adc98a266f45005c403b8311ca7e8bd7-Abstract-Conference.html
[245] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and
Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761
(2023).
[246] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization
Algorithms. CoRR abs/1707.06347 (2017). arXiv:1707.06347 http://arxiv.org/abs/1707.06347
[247] Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning Robust Metrics for Text Generation. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational
Linguistics, Online, 7881–7892. https://doi.org/10.18653/v1/2020.acl-main.704
[248] Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. Enhancing Retrieval-
Augmented Large Language Models with Iterative Retrieval-Generation Synergy. arXiv:2305.15294 [cs.CL]
[249] Xiaoteng Shen, Rui Zhang, Xiaoyan Zhao, Jieming Zhu, and Xi Xiao. 2024. PMG: Personalized Multimodal Generation
with Large Language Models. arXiv preprint arXiv:2404.08677 (2024).
[250] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: Solving AI
Tasks with ChatGPT and its Friends in Hugging Face. In Advances in Neural Information Processing Systems 36: Annual
Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
http://papers.nips.cc/paper_files/paper/2023/hash/77c33e6a367922d003ff102ffb92b658-Abstract-Conference.html
[251] Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, and Scott Wen-tau Yih. 2023. Trusting
Your Evidence: Hallucinate Less with Context-aware Decoding. arXiv:2305.14739 [cs.CL]
[252] Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and
Wen-tau Yih. 2023. REPLUG: Retrieval-Augmented Black-Box Language Models. CoRR abs/2301.12652 (2023). https:
//doi.org/10.48550/ARXIV.2301.12652 arXiv:2301.12652
[253] Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal
Arora, Joshua Lane, Morteza Behrooz, William Ngan, Spencer Poff, Naman Goyal, Arthur Szlam, Y-Lan Boureau,
Melanie Kambadur, and Jason Weston. 2022. BlenderBot 3: a deployed conversational agent that continually learns to
responsibly engage. arXiv:2208.03188 [cs.CL]
[254] Zihua Si, Zhongxiang Sun, Jiale Chen, Guozhang Chen, Xiaoxue Zang, Kai Zheng, Yang Song, Xiao Zhang, and
Jun Xu. 2023. Generative Retrieval with Semantic Tree-Structured Item Identifiers via Contrastive Learning. CoRR
abs/2309.13375 (2023). https://doi.org/10.48550/ARXIV.2309.13375 arXiv:2309.13375
[255] Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Kumar
Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Nathaneal
Schärli, Aakanksha Chowdhery, Philip Andrew Mansfield, Blaise Agüera y Arcas, Dale R. Webster, Gregory S. Corrado,
Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle K. Barral, Christopher
Semturs, Alan Karthikesalingam, and Vivek Natarajan. 2022. Large Language Models Encode Clinical Knowledge.
CoRR abs/2212.13138 (2022). https://doi.org/10.48550/ARXIV.2212.13138 arXiv:2212.13138
[256] Aviv Slobodkin, Eran Hirsch, Arie Cattan, Tal Schuster, and Ido Dagan. 2024. Attribute First, then Generate: Locally-
attributable Grounded Text Generation. CoRR abs/2403.17104 (2024). https://doi.org/10.48550/ARXIV.2403.17104
arXiv:2403.17104
[257] EuiYul Song, Sangryul Kim, Haeju Lee, Joonkee Kim, and James Thorne. 2024. Re3val: Reinforced and Reranked
Generative Retrieval. CoRR abs/2401.16979 (2024). https://doi.org/10.48550/ARXIV.2401.16979 arXiv:2401.16979
[258] Yifan Song, Weimin Xiong, Dawei Zhu, Wenhao Wu, Han Qian, Mingbo Song, Hailiang Huang, Cheng Li, Ke Wang,
Rong Yao, Ye Tian, and Sujian Li. 2023. RestGPT: Connecting Large Language Models with Real-World RESTful APIs.
arXiv:2306.06624 [cs.CL]
[259] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown,
Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea
Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie,
Aman Hussain, Amanda Askell, Amanda Dsouza, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea
Santilli, Andreas Stuhlmüller, Andrew M. Dai, Andrew La, Andrew K. Lampinen, Andy Zou, Angela Jiang, Angelica
Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa
Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut
Erdem, Ayla Karakas, and et al. 2022. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of
language models. CoRR abs/2206.04615 (2022). https://doi.org/10.48550/ARXIV.2206.04615 arXiv:2206.04615
[260] Alane Suhr and Yoav Artzi. 2024. Continual learning for instruction following from realtime feedback. Advances in
Neural Information Processing Systems 36 (2024).
[261] Hao Sun, Hengyi Cai, Bo Wang, Yingyan Hou, Xiaochi Wei, Shuaiqiang Wang, Yan Zhang, and Dawei Yin. 2023.
Towards Verifiable Text Generation with Evolving Memory and Self-Reflection. CoRR abs/2312.09075 (2023). https:
//doi.org/10.48550/ARXIV.2312.09075 arXiv:2312.09075
[262] Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Heung-Yeung Shum, and Jian
Guo. 2023. Think-on-Graph: Deep and Responsible Reasoning of Large Language Model with Knowledge Graph.
arXiv:2307.07697 [cs.CL]
[263] Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan
Zhang, Xiner Li, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bhavya Kailkhura, Caiming Xiong, Chaowei
Xiao, Chunyuan Li, Eric P. Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Manolis
Kellis, Marinka Zitnik, Meng Jiang, Mohit Bansal, James Zou, Jian Pei, Jian Liu, Jianfeng Gao, Jiawei Han, Jieyu Zhao,
Jiliang Tang, Jindong Wang, John Mitchell, Kai Shu, Kaidi Xu, Kai-Wei Chang, Lifang He, Lifu Huang, Michael Backes,
Neil Zhenqiang Gong, Philip S. Yu, Pin-Yu Chen, Quanquan Gu, Ran Xu, Rex Ying, Shuiwang Ji, Suman Jana, Tianlong
Chen, Tianming Liu, Tianyi Zhou, William Wang, Xiang Li, Xiangliang Zhang, Xiao Wang, Xing Xie, Xun Chen,
Xuyu Wang, Yan Liu, Yanfang Ye, Yinzhi Cao, and Yue Zhao. 2024. TrustLLM: Trustworthiness in Large Language
Models. CoRR abs/2401.05561 (2024). https://doi.org/10.48550/ARXIV.2401.05561 arXiv:2401.05561
[264] Weiwei Sun, Zhengliang Shi, Shen Gao, Pengjie Ren, Maarten de Rijke, and Zhaochun Ren. 2022. Contrastive Learning
Reduces Hallucination in Conversations. arXiv:2212.10400 [cs.CL]
[265] Weiwei Sun, Lingyong Yan, Zheng Chen, Shuaiqiang Wang, Haichao Zhu, Pengjie Ren, Zhumin Chen, Dawei Yin,
Maarten de Rijke, and Zhaochun Ren. 2023. Learning to Tokenize for Generative Retrieval. CoRR abs/2304.04171
(2023). https://doi.org/10.48550/ARXIV.2304.04171 arXiv:2304.04171
[266] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. 2023.
Retentive Network: A Successor to Transformer for Large Language Models. CoRR abs/2307.08621 (2023). https:
//doi.org/10.48550/ARXIV.2307.08621 arXiv:2307.08621
[267] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2020. Ernie 2.0: A continual
pre-training framework for language understanding. In Proceedings of the AAAI conference on artificial intelligence,
Vol. 34. 8968–8975.
[268] Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. 2022. Recitation-augmented language models.
arXiv preprint arXiv:2210.01296 (2022).
[269] Dídac Surís, Sachit Menon, and Carl Vondrick. 2023. ViperGPT: Visual Inference via Python Execution for Reasoning.
arXiv preprint arXiv:2303.08128 (2023).
[270] Chenmien Tan, Ge Zhang, and Jie Fu. 2023. Massive Editing for Large Language Models via Meta Learning. CoRR
abs/2311.04661 (2023). https://doi.org/10.48550/ARXIV.2311.04661 arXiv:2311.04661
[271] Jiejun Tan, Zhicheng Dou, Wen Wang, Mang Wang, Weipeng Chen, and Ji-Rong Wen. 2024. HtmlRAG: HTML is
Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems. CoRR abs/2411.02959 (2024). https:
//doi.org/10.48550/ARXIV.2411.02959 arXiv:2411.02959
[272] Jiejun Tan, Zhicheng Dou, Yutao Zhu, Peidong Guo, Kun Fang, and Ji-Rong Wen. 2024. Small Models, Big Insights:
Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs. CoRR abs/2402.12052 (2024).
https://doi.org/10.48550/ARXIV.2402.12052 arXiv:2402.12052
[273] Juntao Tan, Shuyuan Xu, Wenyue Hua, Yingqiang Ge, Zelong Li, and Yongfeng Zhang. 2024. Towards LLM-RecSys
Alignment with Textual ID Learning. arXiv preprint arXiv:2403.19021 (2024).
[274] Zhaoxuan Tan, Qingkai Zeng, Yijun Tian, Zheyuan Liu, Bing Yin, and Meng Jiang. 2024. Democratizing Large
Language Models via Personalized Parameter-Efficient Fine-tuning. CoRR abs/2402.04401 (2024). https://doi.org/10.
48550/ARXIV.2402.04401 arXiv:2402.04401
[275] Qiaoyu Tang, Jiawei Chen, Bowen Yu, Yaojie Lu, Cheng Fu, Haiyang Yu, Hongyu Lin, Fei Huang, Ben He, Xianpei
Han, et al. 2024. Self-Retrieval: Building an Information Retrieval System with One Large Language Model. arXiv
preprint arXiv:2403.00801 (2024).
[276] Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, and Le Sun. 2023. ToolAlpaca: Generalized Tool
Learning for Language Models with 3000 Simulated Cases. CoRR abs/2306.05301 (2023). https://doi.org/10.48550/
ARXIV.2306.05301 arXiv:2306.05301
[277] Xiangru Tang, Anni Zou, Zhuosheng Zhang, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. 2023.
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning. CoRR abs/2311.10537 (2023).
https://doi.org/10.48550/ARXIV.2311.10537 arXiv:2311.10537
[278] Yubao Tang, Ruqing Zhang, Jiafeng Guo, Jiangui Chen, Zuowei Zhu, Shuaiqiang Wang, Dawei Yin, and Xueqi Cheng.
2023. Semantic-Enhanced Differentiable Search Index Inspired by Learning Strategies. In Proceedings of the 29th ACM
SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023, Long Beach, CA, USA, August 6-10, 2023.
ACM, 4904–4913. https://doi.org/10.1145/3580305.3599903
[279] Yubao Tang, Ruqing Zhang, Jiafeng Guo, and Maarten de Rijke. 2023. Recent Advances in Generative Information
Retrieval. In Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in
the Asia Pacific Region, SIGIR-AP 2023, Beijing, China, November 26-28, 2023. ACM, 294–297. https://doi.org/10.1145/
3624918.3629547
[280] Yubao Tang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Wei Chen, and Xueqi Cheng. 2024. Listwise Generative
Retrieval Models via a Sequential Learning Process. ACM Transactions on Information Systems (2024).
[281] Yi Tay, Vinh Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Prakash
Gupta, Tal Schuster, William W. Cohen, and Donald Metzler. 2022. Transformer Memory as a Differentiable Search
Index. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing
Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. http://papers.nips.cc/paper_files/
paper/2022/hash/892840a6123b5ec99ebaab8be1530fba-Abstract-Conference.html
[282] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia
Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo
Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng
Chen, Adam Roberts, Maarten Bosma, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett,
Kathleen S. Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker,
Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin
Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe
Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Agüera y Arcas, Claire Cui, Marian Croak, Ed H. Chi,
and Quoc Le. 2022. LaMDA: Language Models for Dialog Applications. CoRR abs/2201.08239 (2022). arXiv:2201.08239
https://arxiv.org/abs/2201.08239
[283] James Thorne. 2022. Data-Efficient Autoregressive Document Retrieval for Fact Verification. CoRR abs/2211.09388
(2022). https://doi.org/10.48550/ARXIV.2211.09388 arXiv:2211.09388
[284] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a Large-scale Dataset
for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 809–819.
[285] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste
Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models.
arXiv preprint arXiv:2302.13971 (2023).
[286] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov,
Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models.
arXiv preprint arXiv:2307.09288 (2023).
[287] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. Interleaving retrieval with
chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509 (2022).
[288] Ravisri Valluri, Akash Kumar Mohankumar, Kushal Dave, Amit Singh, Jian Jiao, Manik Varma, and Gaurav Sinha. 2024.
Scaling the vocabulary of non-autoregressive models for efficient generative retrieval. arXiv preprint arXiv:2406.06739
(2024).
[289] Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices
for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on
Natural Language Generation, INLG 2019, Tokyo, Japan, October 29 - November 1, 2019. Association for Computational
Linguistics, 355–368. https://doi.org/10.18653/V1/W19-8643
[290] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and
Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual
Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. 5998–6008.
https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
[291] Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry W. Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny
Zhou, Quoc V. Le, and Thang Luong. 2023. FreshLLMs: Refreshing Large Language Models with Search Engine
Augmentation. CoRR abs/2310.03214 (2023). https://doi.org/10.48550/ARXIV.2310.03214 arXiv:2310.03214
[292] David Wan, Mengwen Liu, Kathleen McKeown, Markus Dreyer, and Mohit Bansal. 2023. Faithfulness-aware decoding
strategies for abstractive summarization. arXiv preprint arXiv:2303.03278 (2023).
[293] Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Xiangru Tang, Tianhang Zhang, Jiayang Cheng, Yunzhi Yao, Wenyang
Gao, Xuming Hu, Zehan Qi, Yidong Wang, Linyi Yang, Jindong Wang, Xing Xie, Zheng Zhang, and Yue Zhang. 2023.
Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity. CoRR abs/2310.07521
(2023). https://doi.org/10.48550/ARXIV.2310.07521 arXiv:2310.07521
[294] Danqing Wang, Kevin Yang, Hanlin Zhu, Xiaomeng Yang, Andrew Cohen, Lei Li, and Yuandong Tian. 2023.
Learning Personalized Story Evaluation. CoRR abs/2310.03304 (2023). https://doi.org/10.48550/ARXIV.2310.03304
arXiv:2310.03304
[295] Hongru Wang, Minda Hu, Yang Deng, Rui Wang, Fei Mi, Weichao Wang, Yasheng Wang, Wai-Chung Kwan, Irwin
King, and Kam-Fai Wong. 2023. Large Language Models as Source Planner for Personalized Knowledge-grounded
Dialogues. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023.
Association for Computational Linguistics, 9556–9569. https://doi.org/10.18653/V1/2023.FINDINGS-EMNLP.641
[296] Hongru Wang, Boyang Xue, Baohang Zhou, Tianhua Zhang, Cunxiang Wang, Guanhua Chen, Huimin Wang, and
Kam-fai Wong. 2024. Self-DC: When to retrieve and When to generate? Self Divide-and-Conquer for Compositional
Unknown Questions. arXiv preprint arXiv:2402.13514 (2024).
[297] Haoyu Wang, Tuo Zhao, and Jing Gao. 2024. BlendFilter: Advancing Retrieval-Augmented Large Language Models
via Query Generation Blending and Knowledge Filtering. arXiv:2402.11129 [cs.CL]
[298] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu
Wei. 2022. Text Embeddings by Weakly-Supervised Contrastive Pre-training. CoRR abs/2212.03533 (2022). https:
//doi.org/10.48550/ARXIV.2212.03533 arXiv:2212.03533
[299] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei.
2023. SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval. In Proceedings of the 61st
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July
9-14, 2023. Association for Computational Linguistics, 2244–2258. https://doi.org/10.18653/V1/2023.ACL-LONG.125
[300] Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2023. Large Search Model:
Redefining Search Stack in the Era of LLMs. SIGIR Forum 57, 2 (2023), 23:1–23:16. https://doi.org/10.1145/3642979.
3643006
[301] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. 2023. A Comprehensive Survey of Continual Learning: Theory,
Method and Application. CoRR abs/2302.00487 (2023). https://doi.org/10.48550/ARXIV.2302.00487 arXiv:2302.00487
[302] Shuting Wang, Zhicheng Dou, Jing Yao, Yujia Zhou, and Ji-Rong Wen. 2023. Incorporating Explicit Subtopics in
Personalized Search. In Proceedings of the ACM Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023 - 4
May 2023. ACM, 3364–3374. https://doi.org/10.1145/3543507.3583488
[303] Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, and Jundong Li. 2023. Knowledge Editing
for Large Language Models: A Survey. CoRR abs/2310.16218 (2023). https://doi.org/10.48550/ARXIV.2310.16218
arXiv:2310.16218
[304] Wenjie Wang, Xinyu Lin, Fuli Feng, Xiangnan He, and Tat-Seng Chua. 2023. Generative Recommendation: Towards
Next-generation Recommender Paradigm. CoRR abs/2304.03516 (2023). https://doi.org/10.48550/ARXIV.2304.03516
arXiv:2304.03516
[305] Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuanjing Huang. 2023.
Orthogonal subspace learning for language model continual learning. arXiv preprint arXiv:2310.14152 (2023).
[306] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny
Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171 [cs.CL]
[307] Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Hao Sun, Qi Chen, Yuqing Xia, Chengmin Chi,
Guoshuai Zhao, Zheng Liu, Xing Xie, Hao Allen Sun, Weiwei Deng, Qi Zhang, and Mao Yang. 2022. A Neural Corpus
Indexer for Document Retrieval. CoRR abs/2206.02743 (2022).
[308] Yile Wang, Peng Li, Maosong Sun, and Yang Liu. 2023. Self-knowledge guided retrieval augmentation for large
language models. arXiv preprint arXiv:2310.05002 (2023).
[309] Yidan Wang, Zhaochun Ren, Weiwei Sun, Jiyuan Yang, Zhixiang Liang, Xin Chen, Ruobing Xie, Su Yan, Xu Zhang,
Pengjie Ren, et al. 2024. Enhanced Generative Recommendation via Content and Collaboration Integration. arXiv
preprint arXiv:2403.18480 (2024).
[310] Zhiruo Wang, Zhoujun Cheng, Hao Zhu, Daniel Fried, and Graham Neubig. 2024. What Are Tools Anyway? A Survey
from the Language Model Perspective. CoRR abs/2403.15452 (2024). https://doi.org/10.48550/ARXIV.2403.15452
arXiv:2403.15452
[311] Zihan Wang, Yujia Zhou, Yiteng Tu, and Zhicheng Dou. 2023. NOVO: Learnable and Interpretable Document
Identifiers for Model-Based IR. In Proceedings of the 32nd ACM International Conference on Information and Knowledge
Management, CIKM 2023, Birmingham, United Kingdom, October 21-25, 2023. ACM, 2656–2665. https://doi.org/10.
1145/3583780.3614993
[312] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten
Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean,
and William Fedus. 2022. Emergent Abilities of Large Language Models. Trans. Mach. Learn. Res. 2022 (2022).
https://openreview.net/forum?id=yzkSU5zdwD
[313] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, Quoc Le, and Denny Zhou. 2022. Chain
of Thought Prompting Elicits Reasoning in Large Language Models. CoRR abs/2201.11903 (2022). arXiv:2201.11903
https://arxiv.org/abs/2201.11903
[314] Orion Weller, Marc Marone, Nathaniel Weir, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. 2023.
"According to...": Prompting Language Models Improves Quoting from Pre-Training Data. arXiv preprint arXiv:2305.13252
(2023).
[315] Ryen W. White. 2023. Tasks, Copilots, and the Future of Search: A Keynote at SIGIR 2023. SIGIR Forum 57, 2 (2023),
4:1–4:8. https://doi.org/10.1145/3642979.3642985
[316] Stanislaw Wozniak, Bartlomiej Koptyra, Arkadiusz Janz, Przemyslaw Kazienko, and Jan Kocon. 2024. Personalized
Large Language Models. CoRR abs/2402.09269 (2024). https://doi.org/10.48550/ARXIV.2402.09269 arXiv:2402.09269
[317] Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023. Visual ChatGPT:
Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671 (2023).
[318] Jimmy Wu, Rika Antonova, Adam Kan, Marion Lepert, Andy Zeng, Shuran Song, Jeannette Bohg, Szymon
Rusinkiewicz, and Thomas A. Funkhouser. 2023. TidyBot: personalized robot assistance with large language models.
Auton. Robots 47, 8 (2023), 1087–1102. https://doi.org/10.1007/S10514-023-10139-Z
[319] Shiguang Wu, Wenda Wei, Mengqi Zhang, Zhumin Chen, Jun Ma, Zhaochun Ren, Maarten de Rijke, and Pengjie Ren.
2024. Generative Retrieval as Multi-Vector Dense Retrieval. arXiv preprint arXiv:2404.00684 (2024).
[320] Tongtong Wu, Massimo Caccia, Zhuang Li, Yuan-Fang Li, Guilin Qi, and Gholamreza Haffari. 2022. Pretrained
Language Model in Continual Learning: A Comparative Study. In The Tenth International Conference on Learning
Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net. https://openreview.net/forum?id=
figzpGMrdD
[321] Tongtong Wu, Linhao Luo, Yuan-Fang Li, Shirui Pan, Thuy-Trang Vu, and Gholamreza Haffari. 2024. Continual
learning for large language models: A survey. arXiv preprint arXiv:2402.01364 (2024).
[322] Yuwei Wu, Xuezhe Ma, and Diyi Yang. 2021. Personalized response generation via generative split memory network.
In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies. 1956–1970.
[323] Yong Xie, Karan Aggarwal, and Aitzaz Ahmad. 2023. Efficient continual pre-training for building domain specific
large language models. arXiv preprint arXiv:2311.08545 (2023).
[324] Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N Bennett, Junaid Ahmed, and Arnold Overwijk.
2020. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. In ICLR.
[325] Fangyuan Xu, Weijia Shi, and Eunsol Choi. 2023. RECOMP: Improving Retrieval-Augmented LMs with Com-
pression and Selective Augmentation. CoRR abs/2310.04408 (2023). https://doi.org/10.48550/ARXIV.2310.04408
arXiv:2310.04408
[326] Shicheng Xu, Liang Pang, Huawei Shen, Xueqi Cheng, and Tat-Seng Chua. 2023. Search-in-the-Chain: Towards the
Accurate, Credible and Traceable Content Generation for Complex Knowledge-intensive Tasks. CoRR abs/2304.14732
(2023). https://doi.org/10.48550/ARXIV.2304.14732 arXiv:2304.14732
[327] Xuhai Xu, Bingshen Yao, Yuanzhe Dong, Hong Yu, James Hendler, Anind K Dey, and Dakuo Wang. 2023. Leveraging
large language models for mental health prediction via online text data. arXiv preprint arXiv:2307.14385 (2023).
[328] Zhichao Xu, Fengran Mo, Zhiqi Huang, Crystina Zhang, Puxuan Yu, Bei Wang, Jimmy Lin, and Vivek Srikumar. 2025.
A Survey of Model Architectures in Information Retrieval. arXiv preprint arXiv:2502.14822 (2025).
[329] Haoyan Yang, Zhitao Li, Yong Zhang, Jianzong Wang, Ning Cheng, Ming Li, and Jing Xiao. 2023. PRCA: Fitting Black-
Box Large Language Models for Retrieval Question Answering via Pluggable Reward-Driven Contextual Adapter.
In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore,
December 6-10, 2023. Association for Computational Linguistics, 5364–5375. https://aclanthology.org/2023.emnlp-
main.326
[330] Tianchi Yang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, and Qi Zhang. 2023. Auto Search
Indexer for End-to-End Document Retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2023,
Singapore, December 6-10, 2023. Association for Computational Linguistics, 6955–6970. https://aclanthology.org/2023.
findings-emnlp.464
[331] Xianjun Yang, Liangming Pan, Xuandong Zhao, Haifeng Chen, Linda R. Petzold, William Yang Wang, and Wei Cheng.
2023. A Survey on Detection of LLMs-Generated Content. CoRR abs/2310.15654 (2023). https://doi.org/10.48550/
ARXIV.2310.15654 arXiv:2310.15654
[332] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023.
Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601 [cs.CL]
[333] Shunyu Yao, Jeffrey Zhao, Dian Yu, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing
Reasoning and Acting in Language Models. In NeurIPS 2022 Foundation Models for Decision Making Workshop.
https://openreview.net/forum?id=tvI4u1ylcqs
[334] Yunzhi Yao, Peng Wang, Bozhong Tian, Siyuan Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, and Ningyu Zhang.
2023. Editing Large Language Models: Problems, Methods, and Opportunities. In Proceedings of the 2023 Conference
on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023. Association for
Computational Linguistics, 10222–10240. https://doi.org/10.18653/V1/2023.EMNLP-MAIN.632
[335] Xi Ye, Ruoxi Sun, Sercan Ö. Arik, and Tomas Pfister. 2023. Effective Large Language Model Adaptation for Improved
Grounding. CoRR abs/2311.09533 (2023). https://doi.org/10.48550/ARXIV.2311.09533 arXiv:2311.09533
[336] Howard Yen, Tianyu Gao, and Danqi Chen. 2024. Long-Context Language Modeling with Parallel Context Encoding.
CoRR abs/2402.16617 (2024). https://doi.org/10.48550/ARXIV.2402.16617 arXiv:2402.16617
[337] Soyoung Yoon, Chaeeun Kim, Hyunji Lee, Joel Jang, and Minjoon Seo. 2023. Exploring the Practicality of Generative
Retrieval on Dynamic Corpora. https://api.semanticscholar.org/CorpusID:258967398
[338] Ori Yoran, Tomer Wolfson, Ori Ram, and Jonathan Berant. 2023. Making Retrieval-Augmented Language Models
Robust to Irrelevant Context. arXiv:2310.01558 [cs.CL]
[339] Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng,
and Meng Jiang. 2022. Generate rather than retrieve: Large language models are strong context generators. arXiv
preprint arXiv:2209.10063 (2022).
[340] Zichun Yu, Chenyan Xiong, Shi Yu, and Zhiyuan Liu. 2023. Augmentation-adapted retriever improves generalization
of language models as generic plug-in. arXiv preprint arXiv:2305.17331 (2023).
[341] Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Fei Huang, and Songfang Huang. 2022. SeqDiffuSeq: Text Diffusion
with Encoder-Decoder Transformers. CoRR abs/2212.10325 (2022). https://doi.org/10.48550/ARXIV.2212.10325
arXiv:2212.10325
[342] Peiwen Yuan, Xinglin Wang, Shaoxiong Feng, Boyuan Pan, Yiwei Li, Heda Wang, Xupeng Miao, and Kan Li. 2024.
Generative Dense Retrieval: Memory Can Be a Burden. CoRR abs/2401.10487 (2024). https://doi.org/10.48550/ARXIV.
2401.10487 arXiv:2401.10487
[343] Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. BARTScore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems 34, 27263–27277.
[344] Hansi Zeng, Chen Luo, Bowen Jin, Sheikh Muhammad Sarwar, Tianxin Wei, and Hamed Zamani. 2023. Scalable and
Effective Generative Information Retrieval. CoRR abs/2311.09134 (2023). https://doi.org/10.48550/ARXIV.2311.09134
arXiv:2311.09134
[345] Hansi Zeng, Chen Luo, and Hamed Zamani. 2024. Planning Ahead in Generative Retrieval: Guiding Autoregressive
Generation through Simultaneous Decoding. arXiv preprint arXiv:2404.14600 (2024).
[346] Hailin Zhang, Yujing Wang, Qi Chen, Ruiheng Chang, Ting Zhang, Ziming Miao, Yingyan Hou, Yang Ding, Xupeng
Miao, Haonan Wang, et al. 2023. Model-enhanced Vector Index. arXiv preprint arXiv:2309.13335 (2023).
[347] Jingqing Zhang, Kai Sun, Akshay Jagadeesh, Mahta Ghahfarokhi, Deepa Gupta, Ashok Gupta, Vibhor Gupta, and
Yike Guo. 2023. The Potential and Pitfalls of using a Large Language Model such as ChatGPT or GPT-4 as a Clinical
Assistant. CoRR abs/2307.08152 (2023). https://doi.org/10.48550/ARXIV.2307.08152 arXiv:2307.08152
[348] Kui Zhang, Guangquan Lu, Guixian Zhang, Zhi Lei, and Lijuan Wu. 2022. Personalized headline generation with
enhanced user interest perception. In International Conference on Artificial Neural Networks. Springer, 797–809.
[349] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023. Adding conditional control to text-to-image diffusion models.
In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3836–3847.
[350] Lingxi Zhang, Yue Yu, Kuan Wang, and Chao Zhang. 2024. ARL2: Aligning Retrievers for Black-box Large Language
Models via Self-guided Adaptive Relevance Labeling. arXiv:2402.13542 [cs.CL]
[351] Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian
Zhang, Yuansheng Ni, et al. 2024. A comprehensive study of knowledge editing for large language models. arXiv
preprint arXiv:2401.01286 (2024).
[352] Peitian Zhang, Zheng Liu, Yujia Zhou, Zhicheng Dou, Fangchao Liu, and Zhao Cao. 2024. Generative Retrieval via
Term Set Generation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in
Information Retrieval, SIGIR 2024, Washington DC, USA, July 14-18, 2024, Grace Hui Yang, Hongning Wang, Sam Han,
Claudia Hauff, Guido Zuccon, and Yi Zhang (Eds.). ACM, 458–468.
[353] Peitian Zhang, Shitao Xiao, Zheng Liu, Zhicheng Dou, and Jian-Yun Nie. 2023. Retrieve Anything To Augment Large
Language Models. CoRR abs/2310.07554 (2023). https://doi.org/10.48550/ARXIV.2310.07554 arXiv:2310.07554
[354] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243 (2018).
[355] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675 (2019).
[356] Wenhao Zhang, Mengqi Zhang, Shiguang Wu, Jiahuan Pei, Zhaochun Ren, Maarten de Rijke, Zhumin Chen, and
Pengjie Ren. 2024. ExcluIR: Exclusionary Neural Information Retrieval. arXiv preprint arXiv:2404.17288 (2024).
[357] Yidan Zhang, Ting Zhang, Dong Chen, Yujing Wang, Qi Chen, Xing Xie, Hao Sun, Weiwei Deng, Qi Zhang, Fan
Yang, Mao Yang, Qingmin Liao, and Baining Guo. 2023. IRGen: Generative Modeling for Image Retrieval. CoRR
abs/2303.10126 (2023). https://doi.org/10.48550/ARXIV.2303.10126 arXiv:2303.10126
[358] Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, and
Minlie Huang. 2023. SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions.
CoRR abs/2309.07045 (2023). https://doi.org/10.48550/ARXIV.2309.07045 arXiv:2309.07045
[359] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie
Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu
Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A Survey of Large Language Models. CoRR
abs/2303.18223 (2023). https://doi.org/10.48550/ARXIV.2303.18223 arXiv:2303.18223
[360] Bowen Zheng, Yupeng Hou, Hongyu Lu, Yu Chen, Wayne Xin Zhao, and Ji-Rong Wen. 2023. Adapting large language
models by integrating collaborative semantics for recommendation. arXiv preprint arXiv:2311.09049 (2023).
[361] Aakas Zhiyuli, Yanfang Chen, Xuan Zhang, and Xun Liang. 2023. BookGPT: A General Framework for Book
Recommendation Empowered by Large Language Model. CoRR abs/2305.15673 (2023). https://doi.org/10.48550/
ARXIV.2305.15673 arXiv:2305.15673
[362] Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, and Jiawei Han. 2022. Towards a unified multi-dimensional evaluator for text generation. arXiv preprint arXiv:2210.07197 (2022).
[363] Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. 2024. LIMA: Less is more for alignment. Advances in Neural Information Processing Systems 36.
[364] Yujia Zhou, Zhicheng Dou, and Ji-Rong Wen. 2020. Encoding History with Context-aware Representation Learning
for Personalized Search. In Proceedings of the 43rd International ACM SIGIR conference on research and development in
Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020. ACM, 1111–1120. https://doi.org/10.1145/
3397271.3401175
[365] Yujia Zhou, Zhicheng Dou, and Ji-Rong Wen. 2023. Enhancing Generative Retrieval with Reinforcement Learning
from Relevance Feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,
EMNLP 2023, Singapore, December 6-10, 2023. Association for Computational Linguistics, 12481–12490. https://aclanthology.org/2023.emnlp-main.768
[366] Yujia Zhou, Yan Liu, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Zheng Liu, Chaozhuo Li, Zhicheng Dou, Tsung-Yi Ho, and
Philip S. Yu. 2024. Trustworthiness in Retrieval-Augmented Generation Systems: A Survey. CoRR abs/2409.10102
(2024). https://doi.org/10.48550/ARXIV.2409.10102 arXiv:2409.10102
[367] Yujia Zhou, Zheng Liu, and Zhicheng Dou. 2024. Boosting the Potential of Large Language Models with an Intelli-
gent Information Assistant. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural
Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, Amir Globerson, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (Eds.).
http://papers.nips.cc/paper_files/paper/2024/hash/28d38c036365420f61ce03300418e44a-Abstract-Conference.html
[368] Yujia Zhou, Zheng Liu, Jiajie Jin, Jian-Yun Nie, and Zhicheng Dou. 2024. Metacognitive Retrieval-Augmented Large
Language Models. CoRR abs/2402.11626 (2024). https://doi.org/10.48550/ARXIV.2402.11626 arXiv:2402.11626
[369] Yujia Zhou, Jing Yao, Zhicheng Dou, Yiteng Tu, Ledell Wu, Tat-Seng Chua, and Ji-Rong Wen. 2024. ROGER: Ranking-
Oriented Generative Retrieval. ACM Trans. Inf. Syst. 42, 6 (2024), 155:1–155:25. https://doi.org/10.1145/3603167
[370] Yujia Zhou, Jing Yao, Zhicheng Dou, Ledell Wu, and Ji-Rong Wen. 2023. DynamicRetriever: A Pre-trained Model-based
IR System Without an Explicit Index. Mach. Intell. Res. 20, 2 (2023), 276–288. https://doi.org/10.1007/S11633-022-1373-9
[371] Yujia Zhou, Jing Yao, Zhicheng Dou, Ledell Wu, Peitian Zhang, and Ji-Rong Wen. 2022. Ultron: An Ultimate Retriever
on Corpus with a Model-based Indexer. CoRR abs/2208.09257 (2022).
[372] Yujia Zhou, Jing Yao, Ledell Wu, Zhicheng Dou, and Ji-Rong Wen. 2023. WebUltron: An Ultimate Retriever on
Webpages Under the Model-Centric Paradigm. IEEE Transactions on Knowledge and Data Engineering (2023).
[373] Yujia Zhou, Qiannan Zhu, Jiajie Jin, and Zhicheng Dou. 2024. Cognitive Personalized Search Integrating Large
Language Models with an Efficient Memory Mechanism. CoRR abs/2402.10548 (2024). https://doi.org/10.48550/
ARXIV.2402.10548 arXiv:2402.10548
[374] Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Zhicheng Dou, and Ji-Rong
Wen. 2023. Large language models for information retrieval: A survey. arXiv preprint arXiv:2308.07107 (2023).
[375] Shengyao Zhuang, Houxing Ren, Linjun Shou, Jian Pei, Ming Gong, Guido Zuccon, and Daxin Jiang. 2022. Bridging
the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation. CoRR abs/2206.10128
(2022). https://doi.org/10.48550/ARXIV.2206.10128 arXiv:2206.10128
[376] Noah Ziems, Wenhao Yu, Zhihan Zhang, and Meng Jiang. 2023. Large Language Models are Built-in Autoregressive
Search Engines. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14,
2023. Association for Computational Linguistics, 2666–2678. https://doi.org/10.18653/V1/2023.FINDINGS-ACL.167