Sec1 Introduction GR Tutorial Slides SIGIR

The SIGIR 2024 tutorial on Generative Information Retrieval discusses the evolution and advantages of generative models over traditional pipelined architectures in information retrieval. It highlights the effectiveness and efficiency of generative retrieval methods, including closed-book and open-book approaches, and outlines the tutorial's goals to cover key developments, challenges, and future directions in this field. The schedule includes various sections presented by different faculty members, focusing on definitions, design, training, inference strategies, and applications.


Generative Information Retrieval

SIGIR 2024 tutorial – Section 1

Yubao Tang (a), Ruqing Zhang (a), Zhaochun Ren (b), Jiafeng Guo (a) and Maarten de Rijke (c)
https://generative-ir.github.io/
July 14, 2024
(a) Institute of Computing Technology, Chinese Academy of Sciences & UCAS
(b) Leiden University
(c) University of Amsterdam
About the presenters

• Yubao Tang, PhD student @ICT, CAS
• Ruqing Zhang, Faculty @ICT, CAS
• Zhaochun Ren, Faculty @LEI
• Jiafeng Guo, Faculty @ICT, CAS
• Maarten de Rijke, Faculty @UvA
Information retrieval

Information retrieval (IR) is the activity of obtaining information resources that are
relevant to an information need from a collection of those resources.

[Diagram: an information need is connected to an information repository through a
relevance relationship.]

• Given: User query (keywords, question, image, ...)
• Rank: Information objects (passages, documents, images, products, ...)
• Ordered by: Relevance scores
Complex architecture design behind search engines

[Diagram: online components (a query parser with syntable and ontology, a retrieval
module with a matching technique, and a re-ranking module with web-page prediction)
cooperate with offline components (crawlers, an indexing module, and a structured
web-page repository) to turn a user query into search results.]

• Advantages:
  - The pipelined paradigm has withstood the test of time
  - Advanced machine learning and deep learning approaches have been applied to
    many components of modern systems
Core pipelined paradigm: Index-Retrieval-Ranking

[Diagram: a query parser (rewriting, expansion, suggestion, ...) processes the search
query and a doc parser (extraction, anti-spamming, ...) feeds the index; the query then
flows through Index -> Retrieval -> Re-ranking to produce search results.]

• Index: Build an index for each document in the entire corpus
• Retriever: Find an initial set of candidate documents for a query
• Re-ranker: Determine the relevance degree of each candidate
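The three stages above can be sketched in a few lines of code. This is a purely
illustrative toy (the corpus, the inverted index, and the overlap-based scoring
function are all invented for this example; real systems use far more sophisticated
matching and ranking models):

```python
# Toy sketch of the index-retrieval-ranking pipeline.
# Corpus, index structure, and scoring are invented for illustration only.

CORPUS = {
    "d1": "olympic games opening ceremony",
    "d2": "olympic winter symbols",
    "d3": "football world cup",
}

def build_index(corpus):
    """Index: map each term to the set of documents containing it."""
    index = {}
    for doc, text in corpus.items():
        for term in set(text.split()):
            index.setdefault(term, set()).add(doc)
    return index

def retrieve(query, index):
    """Retriever: initial candidate set = documents matching any query term."""
    return set().union(*(index.get(t, set()) for t in query.split()))

def rerank(query, candidates, corpus):
    """Re-ranker: order candidates by a toy term-overlap relevance score."""
    def score(doc):
        terms = corpus[doc].split()
        return sum(1 for t in query.split() if t in terms)
    return sorted(candidates, key=score, reverse=True)

idx = build_index(CORPUS)
ranked = rerank("olympic winter", retrieve("olympic winter", idx), CORPUS)
# -> ["d2", "d1"]
```

Note how each stage is a separate component with its own logic; this separation is
exactly what makes end-to-end optimization of the pipeline difficult.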
Index-Retrieval-Ranking: Disadvantages

• Effectiveness: Heterogeneous ranking components are usually difficult to
  optimize in an end-to-end way towards a global objective
Index-Retrieval-Ranking: Disadvantages

GTR (dense retrieval) on MS MARCO 300K:
  Memory size: 1430MB (big storage)
  Online latency: 1.97s (slow inference speed)
Source: [Sun et al., 2023]

• Efficiency: A large document index is needed to search over the corpus, leading
  to significant memory consumption and computational overhead
What if we replaced the pipelined architecture with a single consolidated
model that efficiently and effectively encodes all of the information
contained in the corpus?
Opinion paper: A single model for IR

[Diagram: the traditional pipeline (query parser, doc parser, index, retrieval,
ranking) is replaced by a single model that maps a search query directly to search
results.] Source: [Metzler et al., 2021]
Generative language models

[Image source: Zhao et al., 2023]
Two families of generative retrieval

• Closed-book: The language model is the only source of knowledge leveraged
  during generation, e.g.,
  - Capturing document ids in the language models
  - Language models as retrieval agents via prompting
• Open-book: The language model can draw on external memory prior to, during,
  and after generation, e.g.,
  - Retrieval augmented generation of answers
  - Tool-augmented generation of answers
Source: [Najork, 2023]
Closed-book generative retrieval

The IR task can be formulated as a sequence-to-sequence (Seq2Seq) generation
problem:

• Input: A sequence of query words
• Output: A sequence of document identifiers
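To make the Seq2Seq formulation concrete, here is a minimal, purely illustrative
sketch (the docids, their tokens, and the scoring function are invented for this
example; a real system would use a trained Seq2Seq model): decoding is constrained by
a prefix trie over the identifiers, so only valid docids can ever be generated.

```python
# Toy sketch of closed-book generative retrieval: a query is mapped to a
# document identifier token by token. DOCIDS and toy_score are stand-ins,
# not part of any real system.

DOCIDS = {
    "d1": ("sports", "olympics", "2022"),
    "d2": ("sports", "olympics", "symbols"),
    "d3": ("politics", "election", "2024"),
}

def build_trie(docids):
    """Prefix trie over docid token sequences, used to constrain decoding."""
    trie = {}
    for tokens in docids.values():
        node = trie
        for tok in tokens:
            node = node.setdefault(tok, {})
    return trie

def toy_score(query, token):
    """Stand-in for a Seq2Seq model's next-token score: crude word overlap."""
    return sum(1.0 for word in query.split() if word == token)

def generate_docid(query, trie):
    """Greedy constrained decoding: at each step, only tokens that extend a
    valid docid prefix are candidates."""
    node, prefix = trie, []
    while node:  # descend until a leaf, i.e. a complete identifier
        token = max(node, key=lambda t: toy_score(query, t))
        prefix.append(token)
        node = node[token]
    return tuple(prefix)

trie = build_trie(DOCIDS)
print(generate_docid("olympics opening ceremony 2022", trie))
# -> ('sports', 'olympics', '2022')
```

The trie constraint is the key idea: the model's output space is the set of valid
identifiers rather than free-form text.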
Neural IR models: Discriminative vs. Generative

[Diagram: a discriminative model matches points in a query space against points in a
document space, while a generative model directly associates queries (e.g., "olympic
games", "olympic symbols", "2022 Winter Olympics opening ceremony") with document
identifiers in a meta-identifier space.]

• Discriminative: p(R = 1 | q, d) ≈ ... ≈ argmax s(q, d)  (probabilistic ranking principle)
• Generative: p(q | d) ≈ p(docID | q) = argmax p((I_1, ..., I_L) | q)  (query likelihood)
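The two formulations can be contrasted in a few lines of code. This is a hedged
sketch with made-up toy vectors and token distributions (the similarity `s` is a
plain dot product, and the per-step probabilities are invented), meant only to show
where the two model families put their probability mass:

```python
import math

def dense_scores(q_vec, doc_vecs):
    """Discriminative: score every document vector against the query, s(q, d)."""
    return {doc: sum(qi * di for qi, di in zip(q_vec, vec))
            for doc, vec in doc_vecs.items()}

def docid_log_prob(step_probs, docid_tokens):
    """Generative: p(docID | q) factorizes autoregressively over identifier
    tokens, log p((I_1, ..., I_L) | q) = sum_i log p(I_i | I_<i, q)."""
    return sum(math.log(step_probs[i][tok]) for i, tok in enumerate(docid_tokens))

# Toy numbers, invented for illustration.
q = [1.0, 0.0]
docs = {"d1": [0.9, 0.1], "d2": [0.2, 0.8]}
scores = dense_scores(q, docs)           # discriminative: one score per document

# Per-step token distributions a generative model might output for this query.
steps = [{"I1": 0.7, "I2": 0.3}, {"Ia": 0.6, "Ib": 0.4}]
lp = docid_log_prob(steps, ("I1", "Ia"))  # generative: likelihood of one docid
```

The discriminative model must touch every document to rank; the generative model
only walks the identifier vocabulary, which is where the efficiency gains on the
next slides come from.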
Why generative retrieval?

[Diagram: the traditional pipeline (query parser, doc parser, index, retrieval,
ranking) optimizes heterogeneous objectives across its components, while a single
model maps the search query to search results under one global objective.]

• Effectiveness: Knowledge of all documents in the corpus is encoded into model
  parameters, which can be optimized directly in an end-to-end manner
Why generative retrieval?

                               Dense retrieval (GTR)   Generative retrieval (GenRet)
Memory size (MS MARCO 300K)    1430MB                  860MB
Online latency                 1.97s                   0.16s

Data source: [Sun et al., 2023]

• Efficiency: The main memory consumption of GR is the storage of document
  identifiers and model parameters
• The heavy retrieval process is replaced with a light generative process over the
  vocabulary of identifiers
Statistics of related publications

[Bar chart: number of papers per year: 2021: 4, 2022: 11, 2023: 18, 2024: 39.]
[Pie chart: venue distribution: arXiv 27%, workshop & findings & keynote 18%,
SIGIR 13%, Others 11%, NeurIPS 7%, ACL 6%, CIKM 4%, EMNLP 3%, AAAI 2%, KDD 2%,
WSDM 2%, ICLR 1%, ICML 1%, TOIS 1%, WWW 1%.]

The statistics cover publications up to July 10, 2024.
Goals of the tutorial

• We will cover key developments in generative information retrieval (mostly
  2021-2024):
  - Problem definitions
  - Docid design
  - Training approaches
  - Inference strategies
  - Applications
• We are still far from understanding how best to develop the generative IR
  architecture compared to the traditional pipelined IR architecture:
  - Taxonomies of existing research and key insights
  - Our perspectives on the current challenges & future directions
Schedule

Time           Section                                  Presenter
09:00 - 09:25  Section 1: Introduction                  Maarten de Rijke
09:25 - 09:55  Section 2: Definitions & Preliminaries   Zhaochun Ren
09:55 - 10:30  Section 3: Docid design                  Yubao Tang
               (30 min coffee break)
11:00 - 11:30  Section 4: Training approaches           Zhaochun Ren
11:30 - 11:50  Section 5: Inference strategies          Yubao Tang
11:50 - 12:00  Section 6: Applications                  Zhaochun Ren
12:00 - 12:15  Section 7: Challenges & Opportunities    Maarten de Rijke
12:15 - 12:30  Q&A                                      All
References

D. Metzler, Y. Tay, D. Bahri, and M. Najork. Rethinking search: Making domain experts
out of dilettantes. SIGIR Forum, 55(1):1-27, 2021.
M. Najork. Generative information retrieval (slides), 2023. URL
https://docs.google.com/presentation/d/19lAeVzPkh20Ly855tKDkz1uv-1pHV_9GxfntiTJPUug/.
W. Sun, L. Yan, Z. Chen, S. Wang, H. Zhu, P. Ren, Z. Chen, D. Yin, M. de Rijke, and
Z. Ren. Learning to tokenize for generative retrieval. In Thirty-seventh Conference
on Neural Information Processing Systems, 2023.
W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang,
Z. Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223,
2023.
