Sec1 Introduction GR Tutorial Slides SIGIR
Yubao Tang^a, Ruqing Zhang^a, Zhaochun Ren^b, Jiafeng Guo^a and Maarten de Rijke^c
https://generative-ir.github.io/
July 14, 2024
^a Institute of Computing Technology, Chinese Academy of Sciences & UCAS
^b Leiden University
^c University of Amsterdam
About the presenters
Yubao Tang, PhD student @ICT, CAS
Ruqing Zhang, Faculty @ICT, CAS
Zhaochun Ren, Faculty @LEI
Jiafeng Guo, Faculty @ICT, CAS
Maarten de Rijke, Faculty @UvA
Information retrieval
Information retrieval (IR) is the activity of obtaining information resources that are
relevant to an information need from a collection of those resources.
[Figure: an information need is linked to an information repository through a relevance relationship.]
Complex architecture design behind search engines
[Figure: search engine architecture, split into online and offline components. Labels: User Query, Query parser, Syntable, Ontology, Modified Query, Retrieval Module, Re-ranking Module, Search Results, Crawlers, Web-page Repository, Indexing Module, Structured Web-page Repository.]
• Advantages:
  - The pipelined paradigm has withstood the test of time
  - Advanced machine learning and deep learning approaches are applied to many components of modern systems
Core pipelined paradigm: Index-Retrieval-Ranking
[Figure: the search query passes through the Query parser (rewriting, expansion, suggestion, …); documents pass through the Doc parser (extraction, anti-spamming, …); the stages Index → Retrieval → Re-ranking produce the search results.]
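The three stages can be sketched in a few lines. This is a toy illustration, not the architecture of any real engine: the corpus, the whitespace tokenizer, and the function names `build_index`, `retrieve`, and `rerank` are all assumptions made for the example, and term-overlap counting stands in for a learned re-ranker.

```python
from collections import defaultdict

# Hypothetical toy corpus; docIDs and texts are made up for illustration.
CORPUS = {
    "d1": "olympic games history",
    "d2": "olympic symbols and rings",
    "d3": "2022 winter olympics opening ceremony",
}

def build_index(corpus):
    """Index stage (offline): map each term to the set of docIDs containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def retrieve(index, query):
    """Retrieval stage: collect every document sharing at least one query term."""
    candidates = set()
    for term in query.lower().split():
        candidates |= index.get(term, set())
    return candidates

def rerank(corpus, query, candidates):
    """Re-ranking stage: order candidates by the number of distinct shared terms."""
    q_terms = set(query.lower().split())

    def overlap(doc_id):
        return len(q_terms & set(corpus[doc_id].lower().split()))

    # Sort by descending overlap, breaking ties by docID for determinism.
    return sorted(candidates, key=lambda d: (-overlap(d), d))

index = build_index(CORPUS)
results = rerank(CORPUS, "olympic symbols", retrieve(index, "olympic symbols"))
```

Each stage has its own data structure and its own scoring rule, which is the separation the pipelined paradigm relies on.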
Index-Retrieval-Ranking: Disadvantages
(Figure label: MS MARCO 300K)
• Efficiency: A large document index is needed to search over the corpus, leading
to significant memory consumption and computational overhead
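A back-of-the-envelope estimate shows how index size scales with the corpus. The sketch below assumes a dense index that stores one uncompressed float32 vector per document; the 768-dimensional embedding size is an assumption chosen for illustration, not a figure from the slides, and only the 300K corpus size comes from the MS MARCO 300K setting.

```python
def dense_index_bytes(num_docs: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store one dense vector per document (no compression)."""
    return num_docs * dim * bytes_per_value

# 300K documents (from the slide); 768 dimensions and float32 are assumptions.
index_bytes = dense_index_bytes(300_000, 768)
index_megabytes = index_bytes / 1e6
```

Under these assumptions the vectors alone grow linearly with the number of documents, before counting the encoder parameters or any auxiliary search structure.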
What if we replaced the pipelined architecture with a single consolidated
model that efficiently and effectively encodes all of the information
contained in the corpus?
Opinion paper: A single model for IR
[Figure: the pipeline between search query and search results (Query Parser: rewriting, expansion, suggestion, …; Doc Parser: extraction, anti-spamming, …; Index → Retrieval → Ranking) is replaced by A Single Model that maps the search query directly to search results.]
Source: [Metzler et al., 2021]
Generative language models
Image source: [Zhao et al., 2023]
Two families of generative retrieval
Closed-book generative retrieval
Neural IR models: Discriminative vs. Generative
[Figure: two panels. Discriminative: queries in a Query Space are matched against documents in a Document Space. Generative: queries in a Query Space are associated with document identifiers in a Meta-identifier Space, e.g. "olympic games", "olympic symbols", "2022 Winter Olympics opening ceremony".]
Discriminative (probabilistic ranking principle): $p(R = 1 \mid q, d) \approx \ldots \approx \operatorname{argmax}\, s(\vec{q}, \vec{d})$
Generative (query likelihood): $p(q \mid d) \approx p(\mathit{docID} \mid q) = \operatorname{argmax}\, p((I_1, \ldots, I_k) \mid q)$
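The contrast between the two formulations can be sketched with toy numbers. Everything below is hypothetical: the vectors, the identifier tokens, and the per-token probabilities are invented for illustration, and a real generative retriever would obtain the token probabilities from a sequence-to-sequence model rather than a lookup table. The generative score uses the standard autoregressive factorization, in which the probability of an identifier given the query is the product of the probabilities of its tokens, each conditioned on the query and the preceding tokens.

```python
import math

# --- Discriminative: score each (query, document) pair, then take the argmax.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

query_vec = [1.0, 0.0]                           # hypothetical query embedding
doc_vecs = {"d1": [0.9, 0.1], "d2": [0.2, 0.8]}  # hypothetical document embeddings
best_doc = max(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]))

# --- Generative: score an identifier as the product of its token probabilities,
# accumulated in log space for numerical stability.
# The table maps an identifier prefix to next-token probabilities for ONE fixed
# query; the numbers are made up.
NEXT_TOKEN_PROBS = {
    (): {"olympic": 0.9, "fifa": 0.1},
    ("olympic",): {"games": 0.7, "symbols": 0.3},
}

def identifier_log_prob(tokens):
    """Sum of log p(token_t | tokens before t, query) over the identifier."""
    log_p = 0.0
    for t, token in enumerate(tokens):
        prefix = tuple(tokens[:t])
        log_p += math.log(NEXT_TOKEN_PROBS[prefix][token])
    return log_p

candidates = [("olympic", "games"), ("olympic", "symbols")]
best_identifier = max(candidates, key=identifier_log_prob)
```

The discriminative branch needs a stored vector for every document, while the generative branch only needs the model that produces the next-token probabilities.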
Why generative retrieval?
[Figure: the pipelined architecture (Query Parser: rewriting, expansion, suggestion, …; Doc Parser: extraction, anti-spamming, …; Index → Retrieval → Ranking) optimizes heterogeneous objectives, whereas A Single Model that maps the search query to search results optimizes a global objective.]
Why generative retrieval?
Memory size (MS MARCO 300K)
  GTR      1430MB
  GenRet    860MB
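The relative saving implied by these two numbers is simple arithmetic. The sketch below only restates the figures from the slide and introduces no new measurements; the variable names are illustrative.

```python
gtr_mb = 1430      # memory size reported for GTR on MS MARCO 300K
genret_mb = 860    # memory size reported for GenRet on MS MARCO 300K

saving_mb = gtr_mb - genret_mb          # absolute saving in MB
relative_saving = saving_mb / gtr_mb    # fraction of GTR's memory saved
```

In this setting GenRet uses roughly 40% less memory than GTR.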
[Figure: bar chart of the number of papers per year, 2021 to 2024 (bar labels: 4, 11, 18, 39).]
[Figure: pie chart of papers by venue: arXiv 27%; Workshop & findings & keynote 18%; SIGIR 13%; Others 11%; NeurIPS 7%; ACL 6%; CIKM 4%; EMNLP 3%; AAAI 2%; KDD 2%; WSDM 2%; TOIS 1%; WWW 1%; ICML 1%; ICLR 1%.]
Goals of the tutorial
Schedule
References
D. Metzler, Y. Tay, D. Bahri, and M. Najork. Rethinking search: Making domain experts out of
dilettantes. SIGIR Forum, 55(1):1–27, 2021.
M. Najork. Generative information retrieval (slides), 2023. URL
https://docs.google.com/presentation/d/19lAeVzPkh20Ly855tKDkz1uv-1pHV_9GxfntiTJPUug/.
W. Sun, L. Yan, Z. Chen, S. Wang, H. Zhu, P. Ren, Z. Chen, D. Yin, M. de Rijke, and Z. Ren.
Learning to tokenize for generative retrieval. In Thirty-seventh Conference on Neural Information
Processing Systems, 2023.
W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al.
A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.