
Relevance Feedback & Query Expansion

LEARNING TO RANK
NEURAL IR (2)
Static vs. Contextualized Word embeddings
▪ Static word embeddings (e.g., Word2vec)
▪ Global representations of the words, i.e., a single fixed
embedding for each term in the vocabulary
▪ Static embeddings fail with multi-sense terms (which are
however rare) and end up with a “sense averaged” or most
common-sense representation
▪ Contextualized word embeddings
▪ Local representations of words, depending on the others used
together within a “given instance”
▪ We’ll see that transformer deep learning architectures can compute contextualized word embeddings

2
Static vs. Contextualized Word embeddings
▪ Static word embeddings (e.g., Word2vec)
▪ Global representations of the words, i.e., a single fixed embedding for each term in the vocabulary
▪ Static embeddings fail with multi-sense terms (which are however rare) and end up with a “sense averaged” or most common-sense representation
▪ Contextualized word embeddings
▪ Local representations of words, depending on the others used together on a given instance
▪ We’ll see that transformer deep learning architectures can compute contextualized word embeddings

▪ Indeed, the term vocabularies used in real applications of Neural IR and Neural NLP are smaller in size, and words are broken into sub-pieces:
▪ WordPiece tokenizer for BERT
▪ Byte pair encoding (BPE) for LLMs and some Neural IR systems
▪ However, in GPT or BERT (transformer deep learning) the token embeddings are pre-trained end-to-end as part of the whole model
3
DNNs to Contextualize Word embeddings
▪ Starting from static representations of words (even word pieces),
Deep NNs (DNNs) aggregate information from surrounding words to
contextualize the meaning of specific sentences

▪ For example, deciding on the most likely meaning and appropriate representation of the word “bank” in the sentence:
“I arrived at the bank after crossing the…”
requires knowing if the sentence ends in “... road.” or “... river.”
▪ bank may refer to either the shore of a river or to a financial institution

4
DNNs to Contextualize Word embeddings
▪ Before Transformers
▪ Recurrent neural networks (RNNs)
▪ used for language understanding tasks, such as language
modeling and machine translation (along with LSTMs and
GRUs)
▪ RNNs realize seq2seq networks
▪ given an input sequence of terms, produce an output
sequence of “translated” terms (sequentially, left to right)

▪ ISSUE: strict sequential left-to-right behavior in input and output at both learning and inference time (see the previous example of the “bank” term)

5
Transformers
▪ In “Attention Is All You Need”, the Google team introduced the Transformer deep NN architecture
▪ Novel neural network architecture based on a self-attention mechanism, particularly well suited for language understanding (much better than RNNs)
▪ Like other RNN methods for language translation/understanding, Transformers are based on Encoders & Decoders
▪ These DNNs still transform an input sequence into an output sequence, but focus differently on different parts of the input thanks to self-attention
▪ The Transformer performs a small, constant number of steps
▪ at each step, it applies a self-attention mechanism, which directly models relationships between all words in a sentence, regardless of their respective position
▪ Consider the earlier example “I arrived at the bank after crossing the river”
▪ the Transformer can learn to pay attention immediately to the word “river” (at the end of the phrase), thus making the decision in a single step
▪ No more left-to-right behavior
A. Vaswani, et al. Attention is all you need. NIPS 2017 6
Transformers
Sequence2sequence tasks: Machine Translation, Text Summarization, Question Answering, Named Entity Recognition, Speech Recognition

Transformer: stacks of encoders/decoders
The input is no more processed sequentially left-to-right
Credits: Jay Alammar 7
Transformers
▪ The main novelty of transformers is:
▪ the self-attention layer – a layer that permits the encoder to look at other words in the input sentence as it encodes a specific word
▪ It solves the problem of RNN networks
▪ RNNs process the input sequence sequentially, one word at a time
▪ the computation for time-step t cannot be carried out until the computation for time-step t-1 has been completed
▪ In addition, RNNs & friends must maintain in a hidden state all long-range dependencies between words spread far apart in a long sentence (very challenging)
▪ This also slows down training and inference

A. Vaswani, et al. Attention is all you need. NIPS 2017 8


Transformers and Self-attention

▪ To encode "it" in encoder #5 (the top encoder in the stack), part of the
self-attention mechanism is focusing on "The Animal”
▪ This entails inserting a part of the representation "The Animal” into the encoding of "it"

A. Vaswani, et al. Attention is all you need. NIPS 2017 9


Multi-head Self-attention
▪ A way to expand the model’s ability to focus on different
positions for encoding words (multi-head Self-attention)

▪ To encode "it", the orange attention head is focusing most on "the animal", while the green one focuses most on "tired"
▪ This entails mainly inserting a part of the representations of "The Animal" and "tired" into the encoding of "it"
10
How does Self-attention work
▪ Given the current word encodings/embeddings, first create
three vectors from each of the encoder’s input vectors
▪ Query vector, Key vector, and Value vector
▪ … by multiplying the embedding by three matrices (to be trained)
▪ In the paper, their dimensionality is 64, while the dimensionality of
embedding and encoder input/output vectors is 512

11
How does Self-attention work
▪ Second, calculate a score (here we’re calculating the self-attention for the first word “Thinking”), and normalize the computed scores using softmax
▪ We have to compute the same for the other word/query “Machines”
▪ The softmax scores show the usefulness of attending to the other words (keys) relevant to “Thinking” (query) 12
How does Self-attention work
▪ Third, multiply each value vector vi by the associated softmax score, and then sum them up (weighted sum) to obtain the vector zi
▪ The intuition here is to disempower irrelevant words (by multiplying them by tiny numbers)
▪ Finally, the zi vector is sent to a feed-forward layer to produce the output of the encoder (a small numerical sketch follows this slide)
13
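For concreteness, here is a minimal NumPy sketch of the three steps just described, for a single attention head; the weight matrices are random stand-ins for the learned ones, and the 512/64 dimensions follow the slides:

```python
# Minimal sketch of (single-head) scaled dot-product self-attention with NumPy.
# W_q, W_k, W_v are random here; in a real Transformer they are learned.
import numpy as np

d_model, d_k = 512, 64
n_tokens = 2                                # e.g., the sequence "Thinking Machines"

rng = np.random.default_rng(0)
X = rng.normal(size=(n_tokens, d_model))    # input embeddings, one row per token
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v         # step 1: query, key, value vectors

scores = Q @ K.T / np.sqrt(d_k)             # step 2: score each token against every other token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys

Z = weights @ V                             # step 3: weighted sum of value vectors -> one z_i per token
print(Z.shape)                              # (2, 64): contextualized output, sent to the feed-forward layer
```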
How does Self-attention work
Multi-head self attention

14
How does Self-attention work

This is important, because the whole sequence is passed to the NN
15
How does Self-attention work
• The decoder is much more complex
• It uses the output of the top encoder transformed into a set of attention vectors
• These vectors are exploited by each decoder in its “encoder-decoder attention” layer
• This helps the decoder to focus on appropriate places in the input sequence
• The decoder works sequentially
• At each step, it guesses a term/token of the output sequence
16
How does Self-attention work

17
How does Self-attention work
▪ Linear layer
▪ Fully connected neural network
▪ Projects the vector produced by the stack of decoders into a much larger vector called a logits vector (of dimension |V|, the size of the term vocabulary)
▪ Softmax layer
▪ Turns the logits scores into probabilities
▪ The cell with the highest probability is chosen, and the word associated
with it is produced as the output for this time step
▪ Vaswani et al. start with random initialization of model
weights
▪ Proceed to directly train on labeled data in a supervised manner
▪ i.e., using (input sequence, output sequence) pairs

18
Pre-Trained Language Models
▪ The significant advance of BERT/GPT over the original seq2seq transformer formulation is the use of semi-supervision in training: pretraining + fine tuning
▪ Semi-supervision means that the texts provide their own “labels”, and that the loss can be computed from the text sequence itself (without needing any other external annotations)

Excerpt: “Fine-tuning is the procedure to update the parameters of a pre-trained language model for the domain data and target task. Pre-training typically requires a huge general-purpose training corpus, such as Wikipedia or Common Crawl web pages, expensive computation resources and long training times, spanning several days or weeks. On the other side, fine-tuning requires a small domain-specific corpus focused on the downstream task, affordable computational resources and few hours or days of additional training. Special cases of fine-tuning are few-shot learning, where the domain-specific corpus is composed of a very limited number of training data, and zero-shot learning, where a pre-trained language model is used on a downstream task that it was not fine-tuned on.”

Figure 3: Transfer learning of a pre-trained language model to a fine-tuned language model. A randomly initialized Language Model is pre-trained on a huge corpus with many TPUs over days/weeks of training, and then fine-tuned on a small corpus with few GPUs over hours/days of training.
19
Pre-Trained Language Models
▪ Pre-trained models, optimized without reference to the specific task we need to solve, provide good starting points for further fine-tuning with task-specific labeled data
▪ Special cases of fine-tuning are
▪ Few-shot learning, where the domain-specific corpus is composed of a very limited number of training data
▪ Zero-shot learning, where a pre-trained language model is used on a downstream task without fine-tuning

(Same excerpt and Figure 3 as the previous slide: transfer learning of a pre-trained language model to a fine-tuned language model.)
20
Generative Pre-Training models (GPT)
▪ GPT can be viewed as a decoder-only transformer
▪ Unsupervised corpus of tokens U = {u1, …, un}
▪ The goal is to maximize the likelihood
L(U) = Σi log P(ui | ui−k, …, ui−1; Θ)
where k is the size of the context window (the prompt), and the conditional probability P is modeled using a neural network with parameters Θ.
▪ Therefore, GPT uses a standard language modeling objective (i.e., the prediction of the next term)
▪ Note that GPT's goal is text generation
▪ it starts from a prompt and keeps predicting the next token, one after the other, without the need for a separate encoder to summarize an input
▪ The goal is the same at training and inference time (see the sketch after this slide)
A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever.
Improving Language Understanding by Generative Pre-Training. OpenAI Report, 2018. 21
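A minimal sketch of this objective, assuming the Hugging Face Transformers library and the public gpt2 checkpoint (any decoder-only model would do): passing the inputs as labels yields the next-token cross-entropy, and generate applies the same objective autoregressively at inference time.

```python
# Minimal sketch of GPT's language-modeling objective with Hugging Face Transformers.
# Passing labels=input_ids makes the model compute the next-token cross-entropy,
# i.e., the negative of sum_i log P(u_i | u_{i-k}, ..., u_{i-1}; Theta) averaged over tokens.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("I arrived at the bank after crossing the", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, labels=inputs["input_ids"])
print(float(out.loss))                       # average next-token negative log-likelihood

# At inference time the same objective drives generation: keep predicting the next token.
generated = model.generate(inputs["input_ids"], max_new_tokens=5, do_sample=False)
print(tokenizer.decode(generated[0]))
```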
Bidirectional Encoder Representations
from Transformers (BERT)
▪ BERT, an invention of Google, is the best-known example of
transformer for IR
▪ BERT can be viewed as an encoder-only transformer
▪ The pre-training of BERT is not carried out using a standard language modeling objective, unlike GPT
▪ Google blog
▪ Google confirmed that the company had improved search “by applying
BERT models to both ranking and featured snippets”
▪ Also used for query understanding and generative-AI: “AI overview”
▪ https://blog.google/products/search/search-language-understanding-
bert/
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional
transformers for language understanding. NA-ACL Conf. 2019. 22
Bidirectional Encoder Representations
from Transformers (BERT)
▪ BERT, an invention of Google, is the best-known example
of transformer for IR
▪ BERT can be viewed as an encoder-only transformer
▪ The pre-training of BERT is not carried out using a standard language modeling objective, unlike GPT
▪ Google blog
▪ Google confirmed that the company had improved search “by
applying BERT models to both ranking and featured snippets”.
▪ https://blog.google/products/search/search-language-
understanding-bert/

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional
transformers for language understanding. NA-ACL Conf. 2019. 23
Pre-training BERT

▪ “Masked Language Model” (MLM) pretraining objective
▪ Randomly “cover up” (more formally, “mask”) a token from the input sequence
▪ Ask the model to “guess” (i.e., predict) it, training with a cross-entropy loss
▪ a.k.a. cloze task (see the sketch after this slide)
▪ B = bidirectional
▪ The MLM objective well explains the “B” in BERT, as the model can use both left and right contexts (preceding and succeeding contexts) of a masked token to make predictions
▪ In contrast, GPT uses a language modeling objective for pre-training, thus exploiting only the preceding tokens in the left context (formally, it is an “autoregressive” model)

24
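A minimal illustration of the cloze task, assuming the Hugging Face Transformers library and the bert-base-uncased checkpoint: the model predicts the masked token using both its left and right context.

```python
# Minimal sketch of the "masked language model" (cloze) objective with a pre-trained BERT:
# mask one token and let the model predict it from both the left and the right context.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = f"I arrived at the {tokenizer.mask_token} after crossing the river."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
top_ids = logits[0, mask_pos].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))   # likely candidates for the masked word
```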
BERT: Transfer learning
• BERT and fine-tuning
• From a pre-trained language model based on transformer to a fine-tuned
classifier

25
Pre-training BERT

▪ Next sentence prediction (NSP) is the other training goal (besides MLM)
▪ Pairs of sentences as input
▪ Learn to predict if the second sentence in the pair is the subsequent sentence in the original document
▪ During training, 50% of the inputs are a pair in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence.

26
Tokenizers
▪ Input sequences to BERT are usually tokenized with the WordPiece
tokenizer [Wu et al., 2016], while BPE [Sennrich et al., 2016] is a
common alternative (used in GPT)

▪ The aim of these tokenizers is to reduce the vocabulary space by splitting words into “subwords”, usually in an unsupervised manner
▪ The embeddings of BERT tokens are generated/learned end-to-end during pre-training as part of the whole model (they are not word2vec)

Y. Wu, et al. Google’s neural machine translation system: Bridging the gap between human and
machine translation. arXiv:1609.08144, 2016.
27
R. Sennrich, et al. Neural machine translation of rare words with subword units. In ACL, 2016.
Tokenizers

▪ WordPiece is the tokenizer used by BERT
▪ “scrolling” becomes “scroll” + “##ing”
▪ note the convention of prepending two hashes (##) to a subword, indicating that it is connected to the previous subword (no space between the current subword and the previous one)
▪ “walking” and “talking” are not split into subwords
▪ “biking” is split into “bi” + “##king”
▪ “biostatistics” is split into “bio” + “##sta” + “##tist” + “##ics”
▪ Finally, we obtain a small vocabulary (e.g., 30,522 wordpieces) sufficient to model large, naturally-occurring corpora that may have millions of unique tokens if tokenized by a simple method like whitespace splitting (a tokenization sketch follows this slide)
▪ RoBERTa is a BERT-based DNN that uses a byte-level BPE tokenizer

Y. Wu, et al. Google’s neural machine translation system: Bridging the gap between human and
machine translation. arXiv:1609.08144, 2016.
28
R. Sennrich, et al. Neural machine translation of rare words with subword units. In ACL, 2016.
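A minimal tokenization sketch, assuming the Hugging Face Transformers library and the bert-base-uncased vocabulary; the exact subword splits depend on the checkpoint, so they may differ slightly from the examples on the slide.

```python
# Minimal sketch of WordPiece subword tokenization with the BERT tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["scrolling", "walking", "talking", "biking", "biostatistics"]:
    print(word, "->", tokenizer.tokenize(word))

print("vocabulary size:", tokenizer.vocab_size)   # 30,522 wordpieces for bert-base-uncased
```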
High-Level Overview of BERT
▪ At its core, BERT is a neural network model for generating contextual embeddings for input sequences (up to 512 tokens)

Figure 4 (redrawn from Devlin et al., NAACL 2019): input vectors comprise the element-wise summation of token embeddings, segment embeddings, and position embeddings; the output of BERT is a contextual embedding for each input token. The contextual embedding of the [CLS] token is typically taken as an aggregate representation of the entire sequence for classification-based downstream tasks.
(Illustration by Jimmy Lin ([email protected]), released under Creative Commons Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/)

▪ [CLS] (classification): at the beginning of the input sequence. Its final hidden state is often used as an aggregate representation of the entire sequence for classification tasks (and others)
▪ [SEP] (separator): added at the end of a single sentence, or to separate two sentences in tasks like next sentence prediction or question answering
▪ [PAD] (padding): used to make sequences of different lengths the same length within a batch. These tokens are masked so they don't contribute to the model's learning
29
High-Level Overview of BERT

Hyperparameter settings of various pretrained BERT configurations.


Devlin et al. [2019] presented BERT-Base and BERT-Large, the two most commonly used configurations today

30
Typical BERT tasks
▪ As we said before, BERT primarily converts a sequence of input embeddings into a sequence of corresponding contextual embeddings
▪ Typical application tasks (besides IR ranking) obtained by fine-
tuning BERT:
▪ Single-input classification tasks (sentiment analysis on a single segment of
text)
▪ Two-input classification tasks (two sentences are paraphrases)
▪ Single-input token labeling tasks, like named-entity recognition (each token
in the input is assigned a label, as opposed to single-input classification,
where the label is assigned to the entire sequence)
▪ ….

31
Typical BERT tasks

32
Fine-tuning BERT as a cross-encoder for IR
for “re-ranking”, 2nd stage
▪ Given a query-document pair, both texts are tokenized into token sequences, and the tokens are concatenated with BERT special tokens to form the input:
[CLS] q1 … qm [SEP] d1 … dn [SEP]
▪ The output embedding corresponding to the input [CLS] token is basically a contextual representation of the query-document pair as a whole
▪ Finally, fine-tune BERT on a binary classification task to compute the query-document relevance score η (a pointwise LtR model)
▪ A softmax produces a probability distribution (p0, p1) over the non-relevant and relevant classes
▪ Cross-encoder = interaction-based model

(Figure: the query q and document d are tokenized and concatenated as [CLS] q1 … qm [SEP] d1 … dn [SEP]; the encoder model’s [CLS] output embedding is fed to a linear layer W2 and a softmax to produce s(q,d).)
33
Fine-tuning BERT as a cross-encoder for IR
for “re-ranking”, 2nd stage
▪ Given an input query-document pair (q, d), the cross-encoder M(θ), also called monoBERT, parametrized by θ, computes sθ(q, d) ∈ R
▪ We are supposed to predict y ∈ {+, −} (relevant or non-relevant)
▪ assigning a high score to a relevant document and a low score to a non-relevant document
▪ The fine-tuning of monoBERT starts with a pretrained BERT model, which can be downloaded from the Hugging Face Transformers library
▪ Minimize the (binary) cross entropy ℓCE
▪ A dataset T available for fine-tuning pre-trained language models for relevance scoring is composed of a list of triples (q, d+, d−), aka docpairs
▪ The expected ℓCE is approximated by the sum computed over the triples (see the sketch after this slide)
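A minimal fine-tuning sketch for one (q, d+, d−) docpair, assuming a BERT model with a two-class classification head from the Hugging Face Transformers library; the example texts are made up, and a real setup would loop over MS MARCO triples with an optimizer.

```python
# Minimal sketch of cross-encoder (monoBERT-style) fine-tuning on a single (q, d+, d-) triple.
# Each (query, document) pair becomes [CLS] q [SEP] d [SEP]; label 1 = relevant, 0 = non-relevant.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

query = "what is a contextualized embedding"
d_pos = "Contextualized embeddings depend on the surrounding words of a given instance."
d_neg = "The bank of the river was muddy after the rain."

batch = tokenizer([query, query], [d_pos, d_neg],
                  padding=True, truncation=True, max_length=512, return_tensors="pt")
labels = torch.tensor([1, 0])

out = model(**batch, labels=labels)   # cross-entropy loss over the relevant/non-relevant classes
out.loss.backward()                   # one fine-tuning step would follow (optimizer.step())
print(float(out.loss))
```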
MS MARCO: used for fine-tuning
▪ MS MARCO is a collection of datasets focused on deep learning in
search (paper released at NIPS 2016)
▪ Used to fine tune BERT for IR purposes
▪ MS MARCO Passage Ranking is a large dataset to train models for
information retrieval.
▪ Real search queries from Bing search engine with the relevant
text passage that answers the query.
▪ <query, positive passage, negative passage>
▪ Look at the Colab notebook:
https://colab.research.google.com/drive/1C4JtN_OvR2OpRF5jtN90BjDrXR
HKrkuq
Specifically, docpairs in msmarco-passage/train/judged for fine-
tuning 35
Re-ranking pipeline architecture
Interaction-based neural IR systems

Figure 7: Re-ranking pipeline architecture for interaction-focused neural IR systems. The query is run against the document collection by a first-stage retriever, producing a candidates list; a neural re-ranker, based on a learned query-document representation η(q,d), re-orders the candidates to produce the final results list.

▪ A cheap first-stage retriever selects candidate documents, which are then re-ranked with a more expensive neural re-ranking system, such as the cross-encoders described above
▪ The most important benefit of bi-encoders (next slides) is the possibility to pre-compute and cache the representations of a large corpus of documents with the learned document representation encoder ψ(d)
Dealing with long documents
▪ BERT has an input size limited to 512 tokens, including the special ones
▪ Long documents cannot be fed completely into a transformer model
▪ We need to split them into smaller texts, in a procedure referred to as passaging
▪ Proposal by Dai and Callan [2019]
▪ Split a long document into overlapping shorter passages, processed independently together with the same query by a cross-encoder
▪ During training, if a long document is relevant, all its passages are relevant, and vice-versa
▪ At processing time, the relevance scores of the composing passages are aggregated back into a single score for the long document
▪ Possible aggregations: FirstP, MaxP, and SumP (where P stands for Passage); see the sketch after this slide

Z. Dai and J. Callan. Deeper Text Understanding for IR with Contextual Neural Language
Modeling. In Proc. SIGIR 2019.
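A minimal sketch of passaging with FirstP/MaxP/SumP aggregation; score_passage is a hypothetical placeholder for any query-passage relevance model (e.g., a cross-encoder), and the passage size/stride values are illustrative only.

```python
# Minimal sketch of the passaging strategy: split a long document into overlapping
# passages, score each passage independently, then aggregate the passage scores.
from typing import Callable, List

def split_into_passages(tokens: List[str], size: int = 150, stride: int = 75) -> List[List[str]]:
    """Overlapping windows of `size` tokens, moving `stride` tokens at a time."""
    passages = []
    for start in range(0, max(len(tokens) - size, 0) + 1, stride):
        passages.append(tokens[start:start + size])
    return passages

def score_document(query: str, doc_tokens: List[str],
                   score_passage: Callable[[str, str], float],
                   aggregation: str = "MaxP") -> float:
    scores = [score_passage(query, " ".join(p)) for p in split_into_passages(doc_tokens)]
    if aggregation == "FirstP":
        return scores[0]
    if aggregation == "SumP":
        return sum(scores)
    return max(scores)            # MaxP: the best passage represents the document
```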
Fine-tuning BERT as a bi-encoder
for “dense retrieval”, used as first stage retrieval

▪ BERT can also be used for a representation-based model
▪ Goal: build up independent query and document representations
▪ Document representations can be pre-computed and stored in advance
▪ At query processing time, only the query representation is computed
▪ The top documents are searched through the stored document representations
▪ Representation-based systems can identify the relevant documents among all documents in a collection
▪ Not only re-ranking among a query-dependent sample
▪ This introduces a new type of retrieval system called dense retrieval, in contrast to the classic BOW sparse retrieval model
▪ A representation-based model is also called a dual encoder or bi-encoder
Fine-tuning BERT as a bi-encoder
for “dense retrieval”, used as first stage retrieval
▪ Siamese (or twin) network structure
▪ Two identical BERT encoders, sharing the same weights, process each input independently
▪ Their resulting embeddings are compared to learn a meaningful similarity space
▪ The representation functions φ and ψ are computed through fine-tuning BERT
▪ Usually, the output embedding corresponding to the [CLS] token is assumed to be the representation of a given input text: φ[CLS] = φ0 for the query q, and ψ[CLS] = ψ0 for the document d
▪ Using these representations, the score aggregation function is the dot/inner product:
s(q, d) = φ[CLS] · ψ[CLS]

Figure 6: Representation-focused system. The query q and the document d are tokenized and fed to two encoder models; their output embeddings φ0 φ1 … φ|q| and ψ0 ψ1 … ψ|d| (with φ[CLS] = φ0 and ψ[CLS] = ψ0) are combined by a score aggregation function to produce s(q,d).

(Such a model is called a dual encoder or bi-encoder [Bromley et al. 1993]: it maps queries and documents into the same vector space Rℓ, in such a way that the representations can be mathematically manipulated.)
Fine-tuning BERT as a bi-encoder
for “dense retrieval”, used as first stage retrieval

▪ For fine-tuning BERT used as a bi-encoder, use a contrastive loss
▪ Pull together matching query-document pairs (positive pairs)
▪ Push apart non-matching ones (negative pairs)
▪ In the training batch you have:
▪ A query q
▪ Its correct (positive) document d+
▪ Several incorrect (negative) documents d−
▪ Encode q, d+, and the d− into vectors
▪ Then use the dot product or cosine similarity to score them: s(q, d) = φ[CLS] · ψ[CLS]
▪ Minimize a contrastive loss that pushes up the similarity of the positive pair and pushes down the scores of the negative pairs (a minimal sketch follows this slide)

(Different single-representation systems have been proposed, e.g., ANCE [Xiong et al. 2021] and STAR [Zhan et al. 2021]; the main difference among these systems is how the training of the model is carried out.)
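A minimal sketch of such a contrastive loss (the negative log-likelihood of the positive document under a softmax over dot-product scores, as commonly used for dense retrievers); the embeddings here are random stand-ins for the bi-encoder outputs φ[CLS] and ψ[CLS].

```python
# Minimal sketch of a bi-encoder contrastive loss for one query with one positive
# and several negative documents.
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb: torch.Tensor,          # (dim,)      query embedding  phi[CLS]
                     pos_emb: torch.Tensor,        # (dim,)      positive document psi[CLS]
                     neg_embs: torch.Tensor) -> torch.Tensor:   # (n_neg, dim) negative documents
    docs = torch.cat([pos_emb.unsqueeze(0), neg_embs], dim=0)   # positive is row 0
    scores = docs @ q_emb                                       # dot-product similarities
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([0]))  # -log softmax of the positive

# Example with random embeddings (in practice they come from the two BERT encoders):
q, d_pos, d_negs = torch.randn(128), torch.randn(128), torch.randn(4, 128)
print(float(contrastive_loss(q, d_pos, d_negs)))
```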
Representation-based model
Multiple representations
▪ Rather than using just the first output embedding ψ[CLS] = ψ0
to encode a document d, poly-encoders exploit the first m
output embeddings ψ0, ψ1, . . . , ψm−1
▪ A query q is still represented with the single embedding
φ[CLS] = φ0
(Figure: similarity values between the query embedding and the first m contextualized output embeddings of the document.)

41
Representation-based model
Multiple representations
▪ ColBERT [Khattab and Zaharia 2020] does not limit to m the
number of embeddings used to represent a document.
▪ it uses all the 1 + |d| output embeddings to represent a document d,
including the [CLS] special token
▪ it uses all the 1 + |q| output embeddings to represent a query q,
including the [CLS] special token
▪ Late interaction: sum of MaxSim (all-to-all computation); see the sketch after this slide

O. Khattab and M. Zaharia. ColBERT: Efficient and Effective Passage Search via
Contextualized Late Interaction over BERT. In Proc. SIGIR 2020. 42
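A minimal sketch of the late-interaction (sum of MaxSim) scoring, with random per-token embeddings standing in for the BERT outputs.

```python
# Minimal sketch of ColBERT-style late interaction: every query token embedding is
# matched against its most similar document token embedding (MaxSim), and the
# per-query-token maxima are summed into the relevance score.
import torch

def late_interaction_score(Q: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
    """Q: (n_query_tokens, dim), D: (n_doc_tokens, dim) -> scalar relevance score."""
    sim = Q @ D.T                      # all-to-all token similarities
    return sim.max(dim=1).values.sum() # MaxSim per query token, then sum

Q = torch.nn.functional.normalize(torch.randn(5, 128), dim=-1)
D = torch.nn.functional.normalize(torch.randn(80, 128), dim=-1)
print(float(late_interaction_score(Q, D)))
```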
Dense retrieval architecture
Representation-based neural IR systems

Figure 8: Dense retrieval architecture for representation-focused neural IR systems. Offline, a learned document representation encoder ψ(d) encodes the document collection into a document embeddings index. Online, a learned query representation encoder computes φ(q), and an efficient approximate kNN search over the index feeds the neural ranker that returns the results list.

▪ At query processing time, the learned query representation encoder must compute only the query representation φ(q); the documents are ranked according to the inner product of their representation with the query embedding, and the top k documents whose embeddings have the largest inner product w.r.t. the query embedding are returned to the user
▪ The goal of maximum inner product (MIP) search is to find the document embeddings whose inner product (similarity) with the query embedding is maximum
▪ The naive document embedding index is the flat index, which requires an exhaustive search to find the most similar documents
▪ The most recent search methods have shifted to Approximate k-Nearest Neighbor (A-kNN) search, e.g., using the k-means clustering algorithm to partition the index
▪ The pre-computed document embeddings are stored in a special data structure called index; in its simplest form, this index must store the document embeddings and provide a search algorithm that, given a query embedding, efficiently finds the closest document embeddings
▪ In the IR research community, Meta’s FAISS is the most widely adopted framework for embedding indexes [Johnson et al. 2021]
Dense retrieval architecture
Representation-based neural IR systems
(Same Figure 8 as the previous slide.)
▪ Faiss is a library, developed by Facebook AI, that enables efficient similarity/distance search
▪ Inner Product (IP) is a similarity metric, where larger values indicate greater similarity; Euclidean Distance (ED) is a distance metric, where smaller values indicate greater similarity
▪ Given a set of dense vectors (representing documents), we can index them using Faiss
▪ Using another dense vector (representing the query), we search for the most similar (the closest) vectors within the index, i.e., the embeddings with the largest dot product or, more in general, with the maximum inner product
▪ With flat indexes, given a query vector xq, we compare it against every other xd vector in the index, calculating the distance to each, then selecting the exact top-k closest ones (see the sketch after this slide)
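A minimal sketch of exact (flat-index) maximum inner product search with Faiss, assuming faiss-cpu is installed; the random vectors stand in for φ(q) and ψ(d).

```python
# Minimal sketch of exact maximum inner product search with a Faiss flat index.
import numpy as np
import faiss

dim, n_docs = 128, 10_000
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(n_docs, dim)).astype("float32")   # psi(d) for each document

index = faiss.IndexFlatIP(dim)        # flat index, inner-product similarity (exhaustive search)
index.add(doc_embeddings)

query_embedding = rng.normal(size=(1, dim)).astype("float32")       # phi(q)
scores, doc_ids = index.search(query_embedding, 10)                 # exact top-10 by inner product
print(doc_ids[0], scores[0])
```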
Strategies for Approximate k-NN Retrieval
▪ Tree-based solutions to generate the index for A-kNN, e.g., ANNOY by Spotify
Strategies for Approximate k-NN Retrieval
▪ Clustering-based solutions, e.g., IVF (Inverted File System) in FAISS by Facebook
▪ For each cluster (or its centroid), an "inverted list" is created, containing the IDs of all the data points of that cluster
▪ Limit the search to the n probed clusters whose centroids are closest to the query (see the sketch after this slide)
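A minimal sketch of clustering-based A-kNN with a Faiss IVF index: k-means partitions the document embeddings into nlist clusters, and only the nprobe clusters whose centroids are closest to the query are scanned (all sizes below are illustrative).

```python
# Minimal sketch of an IVF (inverted file) index in Faiss.
import numpy as np
import faiss

dim, n_docs, nlist = 128, 20_000, 256
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(n_docs, dim)).astype("float32")

quantizer = faiss.IndexFlatIP(dim)                      # used to assign vectors to centroids
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(doc_embeddings)                             # runs k-means on the document vectors
index.add(doc_embeddings)

index.nprobe = 8                                        # how many inverted lists to probe per query
query = rng.normal(size=(1, dim)).astype("float32")
scores, doc_ids = index.search(query, 10)               # approximate top-10
print(doc_ids[0])
```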
Strategies for Approximate k-NN Retrieval
▪ Hashing-based solutions, e.g., LSH (Locality Sensitive Hashing) from theoretical papers
▪ close embeddings fall with high probability into the same hash bucket
Strategies for Approximate k-NN Retrieval
▪ Proximity graphs, e.g., HNSW (used by everybody right now)
▪ Hierarchical Navigable Small World graph-based index
Graph-based A-kNN
▪ Each graph node corresponds to a distinct document vector, while
each edge stores the distance/similarity between vectors

▪ Computing exact kNN with a graph requires computing and storing O(n²) similarities! UNFEASIBLE
▪ Graph-based A-kNN navigates an incomplete graph
▪ Starting from a predefined entry node, the graph is visited one node at a time, repeatedly moving to the closest nodes to the query vector among the unvisited neighbour nodes
▪ The search terminates when there is no improvement in the current kNN candidate set (a min-heap)
▪ This simple technique is however inefficient due to the long paths potentially required to navigate the graph to identify the nodes/documents closest to the query
49
NSW & Greedy Search
▪ Navigable small world (NSW) graph
▪ Instead of storing only short-range edges, i.e., edges connecting two close nodes, the kNN graph can be enriched with randomly generated long-range edges
▪ Long-range edges aim to transform the incomplete graph into a small world one, with short average path length
▪ any two nodes are connected by surprisingly few hops (small degree of separation)
▪ Example: starting from an entry node, navigate the graph to return the 1NN result set for query xq
HNSW & Greedy Search
▪ Hierarchical Navigable Small World (HNSW) graph
▪ The HNSW index stores the input data into multiple NSW graphs
▪ The search procedure starts with the top layer graph.
▪ At each layer, the greedy heuristic searches for the closest node, then the next layer
is searched, starting from the node corresponding to the closest node identified in
the preceding graph
▪ At the bottom layer, the greedy heuristic searches for the k closest nodes to be
returned starting from the node identified navigating the previous layers
▪ The quantity of nodes in the other graphs decreases exponentially at each layer
▪ The bottom layer contains a node for each document/element
(A Faiss HNSW sketch follows this slide.)
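A minimal sketch of a graph-based A-kNN index using Faiss's HNSW implementation; M, efConstruction, and efSearch are illustrative values that trade accuracy for speed.

```python
# Minimal sketch of an HNSW index in Faiss.
import numpy as np
import faiss

dim, n_docs = 128, 20_000
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(n_docs, dim)).astype("float32")

index = faiss.IndexHNSWFlat(dim, 32)        # HNSW graph with M = 32 links per node
index.hnsw.efConstruction = 200             # graph quality at build time
index.add(doc_embeddings)                   # no training step needed

index.hnsw.efSearch = 64                    # size of the candidate set during greedy search
query = rng.normal(size=(1, dim)).astype("float32")
distances, doc_ids = index.search(query, 10)
print(doc_ids[0])
```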
Learned Sparse Retrieval
▪ Novel Neural IR-based proposals
▪ Incorporate the effectiveness improvements of neural networks into inverted indexes (with their efficient query processing algorithms) through learned sparse retrieval approaches
▪ Still a representation-based (sparse) approach, based on a small lexicon (WordPiece or BPE)
▪ Document/query expansion learning
▪ seq2seq neural models to modify the actual content of documents/queries
▪ Impact score learning
▪ the output embeddings of documents’ terms are further transformed with neural networks to generate a single real value (the relevance contribution of the term in the document)
▪ Sparse representation learning
▪ instead of independently learning to expand the documents/queries and then learning the impact scores of the terms in the expanded documents, sparse representation learning aims at learning both at the same time
▪ Notable example: SPLADE ver. 1, 2, etc. (a minimal scoring sketch follows this slide)
52
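A minimal sketch of how learned term impacts plug into an ordinary inverted index; the toy weights below are made-up stand-ins for the output of a model such as SPLADE.

```python
# Minimal sketch of learned sparse retrieval scoring: each document is a sparse mapping
# from (sub)word terms to learned impact scores, stored in an inverted index; a query is
# scored by summing the impacts of its (possibly expanded) terms.
from collections import defaultdict

docs = {                                     # docid -> {term: learned impact score}
    "d1": {"bank": 2.1, "river": 1.7, "shore": 0.9},     # expansion added "shore"
    "d2": {"bank": 1.9, "loan": 2.4, "finance": 1.1},
}

inverted_index = defaultdict(list)           # term -> postings [(docid, impact), ...]
for doc_id, weights in docs.items():
    for term, impact in weights.items():
        inverted_index[term].append((doc_id, impact))

def score(query_weights: dict) -> dict:
    """Dot product between sparse query and document vectors via the inverted index."""
    scores = defaultdict(float)
    for term, q_w in query_weights.items():
        for doc_id, d_w in inverted_index.get(term, []):
            scores[doc_id] += q_w * d_w
    return dict(scores)

print(score({"bank": 1.0, "river": 0.8}))    # d1 should win for the "river bank" sense
```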
