LectureLtR-neural IR 2
LEARNING TO RANK
NEURAL IR (2)
Static vs. Contextualized Word embeddings
▪ Static word embeddings (e.g., Word2vec)
▪ Global representations of the words, i.e., a single fixed
embedding for each term in the vocabulary
▪ Static embeddings fail with multi-sense terms (which are,
however, rare) and end up with a “sense-averaged” or
most-common-sense representation
▪ Contextualized word embeddings
▪ Local representations of words, depending on the others used
together within a “given instance”
▪ We’ll see that transformer deep learning architectures can
compute contextualized word embeddings
2
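To make the contrast concrete, here is a minimal sketch (assuming PyTorch and the Hugging Face transformers package; the checkpoint "bert-base-uncased" and the sentences are illustrative) comparing the single static vector of “bank” with its context-dependent BERT vectors:

    # Static lookup vs. contextualized output for the same word in two sentences.
    import torch
    from transformers import AutoTokenizer, AutoModel

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")

    sentences = ["I arrived at the bank after crossing the river",
                 "I deposited cash at the bank"]
    bank_id = tok.convert_tokens_to_ids("bank")

    with torch.no_grad():
        for s in sentences:
            enc = tok(s, return_tensors="pt")
            pos = enc["input_ids"][0].tolist().index(bank_id)
            static_vec = bert.embeddings.word_embeddings(enc["input_ids"])[0, pos]  # same in both sentences
            context_vec = bert(**enc).last_hidden_state[0, pos]                     # differs per sentence
            print(s, "->", context_vec[:4])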
Static vs. Contextualized Word embeddings
▪ Static word embeddings (e.g., Word2vec)
▪ Global representations of the words, i.e., a single fixed
embedding for each term in the vocabulary
▪ Note: the vocabularies used in real applications of Neural IR and NLP are
smaller in size, because tokenizers break words into sub-pieces: the
WordPiece tokenizer for BERT, Byte Pair Encoding (BPE) for LLMs and
some Neural IR systems
▪ Static embeddings fail with multi-sense terms (which are,
however, rare) and end up with a “sense-averaged” or
most-common-sense representation
▪ Contextualized word embeddings
▪ Local representations of words, depending on the others used
together in a given instance
▪ We’ll see that transformer deep learning architectures can
compute contextualized word embeddings
▪ Note: in GPT or BERT (transformer deep learning models) the token
embeddings are pre-trained end-to-end as part of the whole model
3
DNNs to Contextualize Word embeddings
▪ Starting from static representations of words (or even word pieces),
Deep NNs (DNNs) aggregate information from surrounding words to
contextualize their meaning within specific sentences
4
DNNs to Contextualize Word embeddings
▪ Before Transformers
▪ Recurrent neural networks (RNNs)
▪ used for language understanding tasks, such as language
modeling and machine translation (along with LSTMs and
GRUs)
▪ RNNs realize seq2seq networks
▪ given an input sequence of terms, produce an output
sequence of “translated” terms (sequentially, left to right)
5
Transformers
▪ In “Attention Is All You Need”, the Google team introduced the
Transformer deep NN architecture
▪ Novel neural network architecture based on a self-attention mechanism,
particularly well suited for language understanding (much better than RNNs)
▪ Like other RNN-based methods for language translation/understanding,
Transformers are based on Encoders & Decoders
▪ Still these DNNs transform an input sequence into an output sequence, but
focusing differently on different parts of the input due to self-attentions
▪ The Transformer performs a small, constant number of steps
▪ at each step, it applies a self-attention mechanism, which directly models
relationships between all words in a sentence, regardless of their respective
position
▪ Consider the earlier example “I arrived at the bank after crossing the river”
▪ the Transformer can learn to immediately pay attention to the word “river” (at
the end of the phrase), thus making the decision in a single step.
▪ No more left-to-right behavior
A. Vaswani, et al. Attention is all you need. NIPS 2017 6
Transformers
Sequence2sequence Transformer: stacks of encoders/decoders
The input is no longer processed sequentially, left to right
Credits: Jay Alammar
7
Transformers
▪ The main novelty of transformers is the:
▪ self-attention layer – a layer that permits the encoder to
look at other words in the input sentence as it encodes
a specific word.
▪ It solves the problems of RNN networks
▪ RNNs process the input sequence sequentially, one word at a
time
▪ the computation for time-step t cannot be carried out until the
computation for time-step t-1 has been completed
▪ In addition, RNNs & friends must maintain in a hidden state all
long-range dependencies between words spread far apart in a
long sentence (very challenging)
▪ This also slows down training and inference.
▪ To encode "it" in encoder #5 (the top encoder in the stack), part of the
self-attention mechanism is focusing on "The Animal”
▪ This entails inserting a part of the representation of “The Animal” into the encoding of “it”
11
How does Self-attention work
▪ Second, calculate a score (we’re calculating the self-attention
for the first word “Thinking” in this example), and normalize
the computed scores using softmax
▪ We have to compute the same for the other word/query “Machines”
14
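A minimal sketch of this scaled dot-product self-attention computation, with toy dimensions and random matrices standing in for the learned query/key/value projections:

    # Scaled dot-product self-attention for a toy 2-token sequence ("Thinking", "Machines").
    import torch
    import torch.nn.functional as F

    d_model, d_k = 8, 4
    x = torch.randn(2, d_model)                                  # one input embedding per token
    Wq, Wk, Wv = (torch.randn(d_model, d_k) for _ in range(3))   # learned in a real model
    Q, K, V = x @ Wq, x @ Wk, x @ Wv

    scores = Q @ K.T / d_k ** 0.5       # score of each token against every token
    weights = F.softmax(scores, dim=-1) # normalize the scores with softmax
    z = weights @ V                     # each output mixes information from all tokens
    print(weights[0])                   # how much "Thinking" attends to each token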
How does Self-attention work
This is important, because the whole sequence is passed to the NN
15
How does Self-attention work
• The decoder is much more complex
• It uses the output of the top encoder transformed into
a set of attention vectors
• These vectors are exploited by each decoder in its “encoder-
decoder attention” layer
• This helps the decoder to also focus on appropriate
places in the input sequence
16
How does Self-attention work
17
How does Self-attention work
▪ Linear layer
▪ Fully connected neural network
▪ Projects the vector produced by the stack of decoders, into a much larger
vector called a logits vector (of dimension |V|, the term vocabulary)
▪ Softmax layer
▪ Turns the logits scores into probabilities
▪ The cell with the highest probability is chosen, and the word associated
with it is produced as the output for this time step
▪ Vaswani et al. start with random initialization of model
weights
▪ Proceed to directly train on labeled data in a supervised manner
▪ i.e., using (input sequence, output sequence) pairs
18
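A small sketch of this linear + softmax step, with a toy vocabulary and random weights standing in for the trained decoder output and projection matrix:

    # Final linear + softmax step: project the top-decoder output onto the vocabulary.
    import torch
    import torch.nn.functional as F

    vocab = ["<pad>", "I", "am", "a", "student", "thanks"]   # toy vocabulary of size |V|
    d_model = 16
    decoder_out = torch.randn(d_model)                 # vector produced by the stack of decoders
    W = torch.randn(d_model, len(vocab))               # linear (fully connected) layer
    logits = decoder_out @ W                           # logits vector of dimension |V|
    probs = F.softmax(logits, dim=-1)                  # probabilities over the vocabulary
    print(vocab[int(probs.argmax())])                  # word emitted at this time step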
Pre-Trained Language Models
▪ The significant advance of BERT/GPT over the original seq2seq Transformer
formulation is the use of semi-supervision in pretraining + fine-tuning
▪ Semi-supervision means that the texts provide their own “labels”, and that the
loss can be computed from the text sequence itself (without needing any other
external annotations)
[Figure: pre-training and fine-tuning pipeline. A randomly initialized language model is
pre-trained on a huge general-purpose corpus (e.g., Wikipedia or Common Crawl) with
many TPUs over days/weeks of training; the pre-trained language model is then
fine-tuned on a small domain-specific corpus with few GPUs over hours/days of
additional training, yielding the fine-tuned language model]
Pre-Trained Language Models
▪ Pre-trained models, optimized without reference to the specific task we need
to solve, provide good starting points for further fine-tuning with task-specific
labeled data
▪ Special cases of fine-tuning are
▪ Few-shot learning, where the domain-specific corpus is composed of a
very limited number of training data
▪ Zero-shot learning, where a pre-trained language model is used on a
downstream task without fine-tuning
[Figure: the same pre-training → fine-tuning pipeline as in the previous slide]
Generative Pre-Training models (GPT)
▪ GPT can be viewed as a decoder-only transformer
▪ Unsupervised corpus of tokens
▪ The goal is to maximize this likelihood:
Context window
(prompt)
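For reference, the likelihood in question is the standard autoregressive objective of GPT (Radford et al., 2018): given an unsupervised corpus of tokens U = {u1, …, un}, a context window of size k and model parameters Θ, maximize

    L(U) = Σ_i log P(u_i | u_{i−k}, …, u_{i−1}; Θ)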
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional
transformers for language understanding. NA-ACL Conf. 2019. 23
Pre-training BERT
24
BERT: Transfer learning
• BERT and fine-tuning
• From a pre-trained language model based on transformer to a fine-tuned
classifier
25
Pre-training BERT
26
Tokenizers
▪ Input sequences to BERT are usually tokenized with the WordPiece
tokenizer [Wu et al., 2016], while BPE [Sennrich et al., 2016] is a
common alternative (used in GPT)
Y. Wu, et al. Google’s neural machine translation system: Bridging the gap between human and
machine translation. arXiv:1609.08144, 2016.
R. Sennrich, et al. Neural machine translation of rare words with subword units. In ACL, 2016.
27
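A small sketch (assuming the Hugging Face transformers package; checkpoints and the sample text are illustrative) contrasting the two tokenizers:

    # WordPiece (BERT) vs. byte-level BPE (GPT-2) sub-word tokenization of the same text.
    from transformers import AutoTokenizer

    wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")   # WordPiece
    bpe = AutoTokenizer.from_pretrained("gpt2")                      # byte-level BPE

    text = "Contextualized embeddings for retrieval"
    print(wordpiece.tokenize(text))   # WordPiece pieces; continuations are marked with '##'
    print(bpe.tokenize(text))         # BPE pieces; a leading space is marked with 'Ġ'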
Tokenizers
Y. Wu, et al. Google’s neural machine translation system: Bridging the gap between human and
machine translation. arXiv:1609.08144, 2016.
R. Sennrich, et al. Neural machine translation of rare words with subword units. In ACL, 2016.
28
High-Level Overview of BERT
▪ At its core, BERT is a neural network model for generating contextual
embeddings for input sequences (up to 512 tokens)
[Figure: illustration of BERT, showing the composition of input embeddings; redrawn from
Devlin et al. (NAACL 2019). Input vectors comprise the element-wise summation of token
embeddings ([CLS] t1 t2 t3 [SEP] t4 t5 t6 t7 [SEP], with segments A and B), segment
embeddings (EA/EB) and position embeddings (P0 … P8). The output of BERT is a
contextual embedding (T[CLS], T1, …, T[SEP]) for each input token; the contextual
embedding of the [CLS] token is typically taken as an aggregate representation of the
entire sequence for classification-based downstream tasks. Credit: Jimmy Lin
([email protected]), CC BY 4.0: https://creativecommons.org/licenses/by/4.0/]
▪ [CLS] (classification): at the beginning of the input sequence. Its final hidden state is
often used as aggregate representation of the entire sequence for classification tasks
(and others)
▪ [SEP] (separator): added at the end of a single sentence, or to separate two sentences
in tasks like next sentence prediction or question answering
▪ [PAD] (padding): used to make sequences of different lengths the same length within a
batch. These tokens are masked so they don't contribute to the model's learning
29
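A sketch of how the three input embedding tables in the figure are combined (toy token ids; in BERT the sum is additionally followed by layer normalization and dropout):

    # BERT input vectors = token embeddings + segment embeddings + position embeddings.
    import torch
    import torch.nn as nn

    vocab_size, max_pos, hidden = 30522, 512, 768     # sizes of bert-base-uncased
    tok_emb = nn.Embedding(vocab_size, hidden)
    seg_emb = nn.Embedding(2, hidden)                 # segment A / segment B
    pos_emb = nn.Embedding(max_pos, hidden)

    input_ids = torch.tensor([[101, 7592, 2088, 102]])   # [CLS] ... [SEP] (illustrative ids)
    segment_ids = torch.zeros_like(input_ids)            # here every token belongs to segment A
    positions = torch.arange(input_ids.size(1)).unsqueeze(0)

    input_vectors = tok_emb(input_ids) + seg_emb(segment_ids) + pos_emb(positions)
    print(input_vectors.shape)                           # (1, 4, 768): one input vector per token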
High-Level Overview of BERT
30
Typical BERT tasks
▪ As we said before, BERT primarily converts a sequence of input
embeddings into a sequence of corresponding contextual
embeddings
▪ Typical application tasks (besides IR ranking) obtained by fine-
tuning BERT:
▪ Single-input classification tasks (sentiment analysis on a single segment of
text)
▪ Two-input classification tasks (two sentences are paraphrases)
▪ Single-input token labeling tasks, like named-entity recognition (each token
in the input is assigned a label, as opposed to single-input classification,
where the label is assigned to the entire sequence)
▪ ….
31
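A sketch (Hugging Face transformers; model name and label counts are illustrative) of how the same pre-trained BERT is fine-tuned with different heads for these task types:

    # Same pre-trained encoder, different task-specific heads to fine-tune.
    from transformers import (AutoModelForSequenceClassification,
                              AutoModelForTokenClassification)

    name = "bert-base-uncased"
    # Single-input (or two-input) classification: one label for the whole sequence,
    # predicted from the [CLS] contextual embedding (sentiment, paraphrase detection, ...).
    clf = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
    # Token labeling (e.g., named-entity recognition): one label per input token.
    ner = AutoModelForTokenClassification.from_pretrained(name, num_labels=9)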
Typical BERT tasks
32
Fine-tuning BERT as a cross-encoder for IR
for “re-ranking”, 2nd stage
▪ Given a query-document pair, both texts are tokenized into token sequences;
the tokens are concatenated with BERT special tokens to form the input:
[CLS] q [SEP] d [SEP]
[Figure: the encoder model takes the concatenated q and d as input; a linear layer W2 on
top of the [CLS] output produces the relevance score s(q,d), with a softmax over the
non-relevant and relevant classes]
▪ Cross-encoder = Interaction-based model
33
Fine-tuning BERT as a cross-encoder for IR
for “re-ranking”, 2nd stage
▪ Given an input query-document pair (q, d), the cross-encoder M(θ), also called
monoBERT, parametrized by θ, computes sθ(q, d) ∈ R
▪ We are supposed to predict y ∈ {+, −} (relevant or non-relevant)
▪ assigning a high score to a relevant document and a low score to a non-relevant document
▪ The fine-tuning of monoBERT starts with a pre-trained BERT model, which can be
downloaded from the Hugging Face Transformers library
[Figure: re-ranking pipeline. The retriever produces a candidates list from the document
collection for the query; the neural re-ranker, using the learned query-document
representation η(q,d), re-ranks the candidates into the results list]
Z. Dai and J. Callan. Deeper Text Understanding for IR with Contextual Neural Language
Modeling. In Proc. SIGIR 2019.
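A minimal monoBERT-style sketch (assuming transformers and PyTorch; the checkpoint is illustrative and its classification head is not yet fine-tuned on relevance labels):

    # Cross-encoder (monoBERT-style) scoring of a query-document pair.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    name = "bert-base-uncased"          # in practice, a checkpoint fine-tuned on relevance data
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

    q = "effects of caffeine"
    d = "Caffeine is a central nervous system stimulant found in coffee and tea."

    enc = tok(q, d, truncation=True, max_length=512, return_tensors="pt")   # [CLS] q [SEP] d [SEP]
    with torch.no_grad():
        logits = model(**enc).logits                       # scores over {non-relevant, relevant}
        s_qd = torch.softmax(logits, dim=-1)[0, 1].item()  # probability of the relevant class
    print(s_qd)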
Fine-tuning BERT as a bi-encoder
for “dense retrieval”, used as first stage retrieval
▪ The same architecture is used to compute both the query and the document
representations, so the model is called a dual encoder or bi-encoder [Bromley et al.
1993]. A bi-encoder maps queries and documents into the same vector space R^ℓ,
in such a way that the representations can be mathematically manipulated
▪ The representation functions φ and ψ are computed through fine-tuning BERT
▪ Usually, the output embedding corresponding to the [CLS] token is assumed to be
the representation of a given input text. Using these representations, the score
aggregation function is the dot/inner product:
s(q, d) = φ[CLS] · ψ[CLS]
[Figure: representation-focused system. Two encoders with their tokenizers map the
query q to the output embeddings φ[CLS], φ1, …, φ|q| and the document d to
ψ[CLS], ψ1, …, ψ|d|]
Fine-tuning BERT as a bi-encoder
for “dense retrieval”, used as first stage retrieval
[Figure: representation-focused system; the output [CLS] embedding is used as the
document embedding]
Representation-based model
41
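A bi-encoder sketch under the same assumptions, using the [CLS] output embeddings only to illustrate the s(q, d) = φ[CLS] · ψ[CLS] aggregation (a real dense retriever would use a fine-tuned encoder and precompute ψ(d) offline):

    # Bi-encoder scoring: encode q and d independently, then take the dot product.
    import torch
    from transformers import AutoTokenizer, AutoModel

    name = "bert-base-uncased"                  # in practice, a fine-tuned dense-retrieval encoder
    tok = AutoTokenizer.from_pretrained(name)
    encoder = AutoModel.from_pretrained(name)   # the same encoder plays both roles here

    def cls_embedding(text):
        enc = tok(text, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            return encoder(**enc).last_hidden_state[0, 0]   # output embedding of the [CLS] token

    phi_q = cls_embedding("effects of caffeine")                        # φ[CLS]
    psi_d = cls_embedding("Caffeine is a stimulant found in coffee.")   # ψ[CLS], precomputed offline
    print(torch.dot(phi_q, psi_d).item())                               # s(q, d) = φ[CLS] · ψ[CLS]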
Multiple representations
▪ ColBERT [Khattab and Zaharia 2020] does not limit the number of
embeddings used to represent a document to a fixed m.
▪ it uses all the 1 + |d| output embeddings to represent a document d,
including the [CLS] special token
▪ it uses all the 1 + |q| output embeddings to represent a query q,
including the [CLS] special token
▪ Late interaction:
Sum of Maxsim
(all-to-all computation)
O. Khattab and M. Zaharia. ColBERT: Efficient and Effective Passage Search via
Contextualized Late Interaction over BERT. In Proc. SIGIR 2020. 42
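A sketch of the late-interaction (sum of MaxSim) score, assuming the per-token query and document embeddings have already been produced by the fine-tuned encoders:

    # ColBERT-style late interaction: for every query embedding take the maximum
    # similarity over all document embeddings, then sum these maxima.
    import torch
    import torch.nn.functional as F

    def maxsim_score(q_embs, d_embs):
        # q_embs: (1 + |q|, dim), d_embs: (1 + |d|, dim), rows L2-normalized
        sim = q_embs @ d_embs.T                    # all-to-all similarity matrix
        return sim.max(dim=1).values.sum().item()  # MaxSim per query token, then sum

    q_embs = F.normalize(torch.randn(8, 128), dim=-1)    # toy query token embeddings
    d_embs = F.normalize(torch.randn(60, 128), dim=-1)   # toy document token embeddings
    print(maxsim_score(q_embs, d_embs))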
Dense retrieval architecture
Representation-based neural IR systems
[Figure: dense retrieval architecture for representation-focused neural IR systems.
Offline, the learned document representation encoder ψ(d) encodes the document
collection into a document embeddings index. Online, the learned query representation
encoder computes φ(q); an efficient approximate kNN search over the index feeds the
neural ranker, which returns the results list. The documents are ranked according to the
inner product of their representation with the query embedding, and the top k documents
with the largest inner product are returned to the user]
▪ The goal of the maximum inner product (MIP) search is to find the document
embedding such that the inner product (similarity) with the query embedding is maximum
▪ The pre-computed document embeddings are stored in a special data structure called
index
▪ The naive document embedding index is the flat index, which requires an exhaustive
search to find the most similar documents
▪ The most recent search methods have shifted to Approximate k-Nearest Neighbor
(A-kNN) search, e.g., using the k-means clustering algorithm to partition the index
▪ In the IR research community, Meta’s FAISS is the most widely adopted framework
for embedding indexes [Johnson et al. 2021]
Dense retrieval architecture
Representation-based neural IR systems
[Figure: the same dense retrieval architecture as in the previous slide, with the offline
learned document representation encoder ψ(d), the document embeddings index, the
online learned query representation encoder φ(q), the efficient approximate kNN search
and the neural ranker]
▪ Faiss is a library developed by Facebook AI that enables efficient similarity/distance
search
▪ Inner Product (IP) is a similarity metric, where larger values indicate greater
similarity
▪ Euclidean Distance (ED) is a distance metric, where smaller values indicate greater
similarity
▪ Given a set of dense vectors (representing documents), we can index them using Faiss
▪ Using another dense vector (representing the query), we search for the most similar
(the closest) vectors within the index
▪ With flat indexes, given a query vector xq, we compare it against every other xd vector
in the index, calculating the distance to each, then selecting the exact top-k closest ones
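A minimal Faiss sketch (assuming the faiss-cpu package; random vectors stand in for learned embeddings) of exact maximum inner product search with a flat index:

    # Exhaustive (flat) maximum inner product search with Faiss.
    import faiss
    import numpy as np

    dim = 128
    doc_embs = np.random.rand(10000, dim).astype("float32")   # ψ(d), computed offline
    q_emb = np.random.rand(1, dim).astype("float32")          # φ(q), computed at query time

    index = faiss.IndexFlatIP(dim)              # flat index, inner-product similarity
    index.add(doc_embs)                         # store every document embedding
    scores, doc_ids = index.search(q_emb, 10)   # exact top-10 by inner product
    print(doc_ids[0], scores[0])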
Strategies for Approximate k-NN Retrieval
▪ Long-range edges aim to transform the incomplete graph into a small world one,
with short average path length
▪ any two nodes are connected by surprisingly few hops (small degree of separation)
▪ Example:
▪ starting from an entry node, navigate the graph to return the 1NN result set
HNSW & Greedy Search
▪ Hierarchical Navigable Small World (HNSW) graph
▪ The HNSW index stores the input data into multiple NSW graphs
▪ The search procedure starts with the top layer graph.
▪ At each layer, the greedy heuristic searches for the closest node, then the next layer
is searched, starting from the node corresponding to the closest node identified in
the preceding graph
▪ At the bottom layer, the greedy heuristic searches for the k closest nodes to be
returned starting from the node identified navigating the previous layers
The number of nodes in the upper-layer graphs decreases exponentially at each layer
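A sketch of approximate k-NN search with an HNSW index in Faiss (same assumptions as the flat-index sketch; parameter values are illustrative):

    # Approximate k-NN search with a Hierarchical Navigable Small World (HNSW) index.
    import faiss
    import numpy as np

    dim = 128
    doc_embs = np.random.rand(100000, dim).astype("float32")
    q_emb = np.random.rand(1, dim).astype("float32")

    index = faiss.IndexHNSWFlat(dim, 32)      # 32 = max neighbors (edges) per node
    index.hnsw.efConstruction = 200           # beam width while building the layered graphs
    index.add(doc_embs)                       # nodes are inserted layer by layer

    index.hnsw.efSearch = 64                  # beam width of the greedy search
    dists, doc_ids = index.search(q_emb, 10)  # approximate top-10 (L2 metric by default)
    print(doc_ids[0])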