Introduction to Information Retrieval

Introduction to Information Retrieval
Distributed Word Representations for Information Retrieval
Introduction to Information Retrieval Sec. 9.2.2

How can we more robustly match a user's search intent?
We want to understand a query, not just do String equals()
▪ If user searches for [Dell notebook battery size], we would like
to match documents discussing “Dell laptop battery capacity”
▪ If user searches for [Seattle motel], we would like to match
documents containing “Seattle hotel”

A pure keyword-matching IR system does nothing to help….


Simple facilities that we have already discussed do a bit to help
▪ Spelling correction
▪ Stemming / case folding
But we'd like to better understand when a query and a document match
Introduction to Information Retrieval Sec. 9.2.2

How can we more robustly match a user's search intent?
Query expansion:
▪ Relevance feedback could allow us to capture this if we get
near enough to matching documents with these words
▪ We can also use information on word similarities:
▪ A manual thesaurus of synonyms for query expansion
▪ A measure of word similarity
▪ Calculated from a big document collection
▪ Calculated by query log mining (common on the web)
Document expansion:
▪ Use of anchor text may solve this by providing human
authored synonyms, but not for new or less popular web
pages, or non-hyperlinked collections
Introduction to Information Retrieval Sec. 9.2.2

Example of manual thesaurus


Introduction to Information Retrieval

Search log query expansion


▪ Context-free query expansion ends up problematic
▪ [wet ground] ≈ [wet earth]
▪ So expand [ground] ⇒ [ground earth]
▪ But [ground coffee] ≠ [earth coffee]
▪ You can learn query context-specific rewritings from
search logs by attempting to identify the same user
making a second attempt at the same user need
▪ [Hinton word vector]
▪ [Hinton word embedding]
▪ In this context, [vector] ≈ [embedding]
Introduction to Information Retrieval Sec. 9.2.3

Automatic Thesaurus Generation


▪ Attempt to generate a thesaurus automatically by
analyzing a collection of documents
▪ Fundamental notion: similarity between two words
▪ Definition 1: Two words are similar if they co-occur with
similar words.
▪ Definition 2: Two words are similar if they occur in a
given grammatical relation with the same words.
▪ You can harvest, peel, eat, prepare, etc. apples and
pears, so apples and pears must be similar.
▪ Co-occurrence based is more robust, grammatical
relations are more accurate. Why?
Introduction to Information Retrieval Sec. 9.2.3

Simple Co-occurrence Thesaurus


▪ Simplest way to compute one is based on term-term similarities in C = AAᵀ, where A is the term-document matrix
▪ wi,j = (normalized) weight for (ti, dj)
[Diagram: A is the M × N term-document matrix, with one row per term ti and one column per document dj]
▪ What does C contain if A is a term-document incidence (0/1) matrix?
▪ For each ti, pick terms with high values in C (see the sketch below)
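A minimal sketch of this computation (a toy numpy example; the term-document matrix and term list here are illustrative, not from the lecture):

```python
import numpy as np

# Toy term-document count matrix A: one row per term, one column per document
terms = ["apple", "pear", "harvest", "coffee"]
A = np.array([
    [2, 0, 1, 0],   # apple
    [1, 0, 2, 0],   # pear
    [1, 1, 1, 0],   # harvest
    [0, 3, 0, 2],   # coffee
], dtype=float)

# Length-normalize each term row so that C holds cosine-style similarities
A_norm = A / np.linalg.norm(A, axis=1, keepdims=True)

# Term-term similarity matrix C = A A^T
C = A_norm @ A_norm.T

# For each term, pick the other terms with the highest values in C
for i, t in enumerate(terms):
    ranked = np.argsort(-C[i])
    print(t, "->", [terms[j] for j in ranked if j != i][:2])
```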
Introduction to Information Retrieval

Automatic thesaurus generation


example … sort of works
Word Nearest neighbors
absolutely absurd, whatsoever, totally, exactly, nothing
bottomed dip, copper, drops, topped, slide, trimmed
captivating shimmer, stunningly, superbly, plucky, witty
doghouse dog, porch, crawling, beside, downstairs
makeup repellent, lotion, glossy, sunscreen, skin, gel
mediating reconciliation, negotiate, cease, conciliation
keeping hoping, bring, wiping, could, some, would
lithographs drawings, Picasso, Dali, sculptures, Gauguin
pathogens toxins, bacteria, organisms, bacterial, parasites
senses grasp, psyche, truly, clumsy, naïve, innate

Too little data (10s of millions of words) treated by too sparse method.
100,000 words ⇒ 10^10 entries in C.
Introduction to Information Retrieval Sec. 9.2.2

How can we represent term relations?


▪ With the standard symbolic encoding of terms, each term is a
dimension
▪ Different terms have no inherent similarity
▪ motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]ᵀ
  hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]ᵀ   ⇒   motelᵀ hotel = 0
▪ If query on hotel and document has motel, then our query
and document vectors are orthogonal
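A toy numpy illustration of this orthogonality (the vocabulary positions are arbitrary):

```python
import numpy as np

# One-hot vectors over a 15-word vocabulary (the positions are arbitrary)
motel = np.zeros(15)
motel[10] = 1
hotel = np.zeros(15)
hotel[7] = 1

# The query and document term vectors are orthogonal: no match at all
print(motel @ hotel)   # 0.0
```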
Introduction to Information Retrieval

Can you directly learn term relations?


▪ Basic IR is scoring on qᵀd
▪ No treatment of synonyms; no machine learning
▪ Can we learn parameters W to rank via qᵀWd? (sketch below)

▪ Cf. query translation models: Berger and Lafferty (1999)

▪ Problem is again sparsity – W is huge: > 10^10 entries
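A small numpy sketch of the two scoring forms (the vocabulary size, weights, and the entry of W are illustrative assumptions):

```python
import numpy as np

V = 5                                        # toy vocabulary size
q = np.array([0, 1, 0, 0, 0], dtype=float)   # sparse query term weights
d = np.array([0, 0, 1, 0, 2], dtype=float)   # sparse document term weights

# Plain keyword scoring q^T d: zero here, since no terms are shared
print(q @ d)

# Bilinear scoring q^T W d: W[i, j] says how much query term i should count
# for document term j.  For a real vocabulary W has |V|^2 (> 10^10) entries.
W = np.eye(V)
W[1, 2] = 0.8                                # e.g., let term 1 partially match term 2
print(q @ W @ d)                             # 0.8
```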
Introduction to Information Retrieval

Is there a better way?


▪ Idea:
▪ Can we learn a dense low-dimensional representation of a word in ℝᵈ such that dot products uᵀv express word similarity?
▪ We could still, if we want, include a “translation” matrix between vocabularies (e.g., cross-language): uᵀWv
▪ But now W is small!
▪ Supervised Semantic Indexing (Bai et al. Journal of
Information Retrieval 2009) shows successful use of
learning W for information retrieval

▪ But we’ll develop direct similarity in this class


Introduction to Information Retrieval

Distributional similarity based representations
▪ You can get a lot of value by representing a word by
means of its neighbors
▪ “You shall know a word by the company it keeps”
▪ (J. R. Firth 1957: 11)

▪ One of the most successful ideas of modern


statistical NLP
…government debt problems turning into banking crises as happened in 2009…
…saying that Europe needs unified banking regulation to replace the hodgepodge…
…India has just given its banking system a shot in the arm…

🡽 These words will represent banking 🡽

13
Introduction to Information Retrieval

Solution: Low dimensional vectors


▪ The number of topics that people talk about is small
(in some sense)
▪ Clothes, movies, politics, …
• Idea: store “most” of the important information in a
fixed, small number of dimensions: a dense vector
• Usually 25 – 1000 dimensions

• How to reduce the dimensionality?


• Go from big, sparse co-occurrence count vector to low
dimensional “word embedding”

14
Introduction to Information Retrieval Sec. 18.2

Traditional Way:
Latent Semantic Indexing/Analysis
▪ Use Singular Value Decomposition (SVD) – kind of like Principal Components Analysis (PCA) for an arbitrary rectangular matrix – or just random projection to find a low-dimensional basis of orthogonal vectors
▪ Theory is that similarity is preserved as much as possible
▪ You can actually gain in IR (slightly) by doing LSA, as “noise”
of term variation gets replaced by semantic “concepts”
▪ Somewhat popular in the 1990s [Deerwester et al. 1990, etc.]
▪ But results were always somewhat iffy (… it worked sometimes)
▪ Hard to implement efficiently in an IR system (dense vectors!)
▪ Discussed in IIR chapter 18, but not discussed further here
▪ Not on the exam (!!!)
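For concreteness, a minimal LSA-style sketch using a truncated SVD (matrix sizes, the number of dimensions k, and the random data are illustrative; a real system would use tf-idf weights from the collection):

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents);
# a real system would fill this with tf-idf weights from the collection
rng = np.random.default_rng(0)
A = rng.random((1000, 200))

k = 50                                    # number of latent "concept" dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)

term_vecs = U[:, :k] * s[:k]              # dense k-dim term representations
doc_vecs = Vt[:k, :].T * s[:k]            # dense k-dim document representations

# Score a sparse query against all documents in the latent space
q = np.zeros(1000)
q[[3, 17]] = 1.0                          # query contains terms 3 and 17
scores = doc_vecs @ (U[:, :k].T @ q)      # ≈ A.T @ q, smoothed through k concepts
print(scores.argmax())                    # index of the best-matching document
```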
Introduction to Information Retrieval

“NEURAL EMBEDDINGS”
Introduction to Information Retrieval Sec. 18.2

Benefit of Distributed Representations in Information Retrieval
▪ Semantic Similarity

▪ Handling Synonymy and Polysemy

▪ Out-of-Vocabulary (OOV) Handling

▪ Dimensionality Reduction

▪ Improved Query Understanding

▪ Contextual Information
Introduction to Information Retrieval

Word meaning is defined in terms of vectors
▪ We will build a dense vector for each word type,
chosen so that it is good at predicting other words
appearing in its context
… those other words also being represented by vectors … it all gets a bit
recursive

banking = [ 0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271 ]
Introduction to Information Retrieval

Neural word embeddings - visualization

19
Introduction to Information Retrieval

Basic idea of learning neural network word embeddings
▪ We define a model that aims to predict between a
center word wt and context words in terms of word
vectors
▪ p(context | wt) = …
▪ which has a loss function, e.g.,
▪ J = 1 − p(w−t | wt)   (where w−t denotes the context words around wt)
▪ We look at many positions t in a big language corpus
▪ We keep adjusting the vector representations of
words to minimize this loss
Introduction to Information Retrieval

Idea: Directly learn low-dimensional word vectors based on ability to predict
• Old idea: Learning representations by back-propagating errors (Rumelhart et al., 1986)
• A neural probabilistic language model (Bengio et al., 2003) – non-linear and slow
• NLP (almost) from Scratch (Collobert & Weston, 2008) – non-linear and slow
• A simpler and faster model: word2vec (Mikolov et al. 2013) 🡪 intro now – fast bilinear models
• The GloVe model from Stanford (Pennington, Socher, and Manning 2014) connects back to matrix factorization – fast bilinear models
• Per-token representations: deep contextual word representations: ELMo, ULMfit, BERT, XLM, GPT – current state of the art

21
Introduction to Information Retrieval

Word2vec is a family of algorithms


[Mikolov et al. 2013]

Predict between every word and its context words!

Two algorithms
1. Skip-grams (SG)
Predict context words given target (position independent)
2. Continuous Bag of Words (CBOW)
Predict target word from bag-of-words context

Two (moderately efficient) training methods
3. Hierarchical softmax
4. Negative sampling
(Naïve softmax is simpler, but it normalizes over the whole vocabulary, so it is less efficient)
Introduction to Information Retrieval

Word2Vec Skip-gram Overview


Example: … problems turning into [banking] crises as …
▪ center word at position t: banking
▪ outside context words in a window of size 2: turning, into (left) and crises, as (right)
23
Introduction to Information Retrieval

Word2vec: objective function

For each position t = 1, …, T, predict context words within a window of fixed size m, given center word wt:

Likelihood = L(θ) = ∏_{t=1..T} ∏_{−m ≤ j ≤ m, j ≠ 0} p(w_{t+j} | w_t ; θ)

The objective J(θ) (sometimes called the cost or loss function) is the average negative log-likelihood (next slide).

24
Introduction to Information Retrieval

Word2vec: objective function

25
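In the standard word2vec skip-gram formulation that the lecture follows, the objective on this slide is the average negative log-likelihood:

```latex
J(\theta) = -\frac{1}{T}\log L(\theta)
          = -\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-m \le j \le m \\ j \neq 0}} \log p\left(w_{t+j} \mid w_{t};\, \theta\right)
```

Minimizing J(θ) is the same as maximizing the model's ability to predict context words.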
Introduction to Information Retrieval

Word2vec: prediction function


For a center word c and a context (“outside”) word o:

p(o | c) = exp(u_oᵀ v_c) / Σ_{w ∈ V} exp(u_wᵀ v_c)

▪ Exponentiation makes anything positive
▪ Normalize over the entire vocabulary to give a probability distribution
26
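A small numpy sketch of this prediction step (toy vocabulary size and dimensionality; variable names are illustrative):

```python
import numpy as np

V, d = 10, 4                     # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))      # context ("outside") vectors u_w, one row per word
v_c = rng.normal(size=d)         # center word vector v_c

scores = U @ v_c                              # dot product u_w . v_c for every word w
exp_scores = np.exp(scores - scores.max())    # exponentiation makes everything positive
p = exp_scores / exp_scores.sum()             # normalize over the vocabulary

print(p.sum())      # 1.0 -- a probability distribution p(o | c)
print(p.argmax())   # index of the most probable context word
```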
Introduction to Information Retrieval

Word2vec: 2 matrices of parameters


▪ Center word embeddings, stored as the rows of one matrix
▪ Context word embeddings, stored as the columns of the other matrix (i.e., transposed)
Introduction to Information Retrieval

To learn good word vectors: compute all vector gradients!

▪ We often define the set of all parameters in a model in terms of one long vector θ
▪ In our case, with d-dimensional vectors and V words: θ ∈ ℝ^{2dV} (all center-word vectors v and all context-word vectors u stacked into one long vector)
▪ We then optimize these parameters

Note: Every word has two vectors! Makes it simpler!


Introduction to Information Retrieval

Intuition of how to minimize loss for a simple function over two parameters
We start at a random point and walk in the direction of steepest descent, which is given by the negative of the derivative (gradient) of the function

(Contour lines show points of equal value of the objective function.)
Introduction to Information Retrieval

Descending by using derivatives


We will minimize a cost function by
gradient descent

Trivial example: (from Wikipedia)


Find a local minimum of the function
f(x) = x⁴ − 3x³ + 2,
with derivative f′(x) = 4x³ − 9x²

Subtracting a fraction
of the gradient moves
you towards the
minimum!
Introduction to Information Retrieval

Vanilla Gradient Descent Code
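A minimal sketch of the kind of vanilla gradient descent loop shown on this slide, applied to the trivial example from the previous slide (the step size and iteration count follow the standard Wikipedia example):

```python
# Gradient descent on the trivial example: f(x) = x**4 - 3*x**3 + 2
# with derivative f'(x) = 4*x**3 - 9*x**2

def grad(x):
    return 4 * x**3 - 9 * x**2

x = 6.0            # starting point
alpha = 0.01       # step size (fraction of the gradient to subtract)

for _ in range(10000):
    x = x - alpha * grad(x)    # walk downhill

print(x)           # converges to the local minimum at x = 2.25
```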


Introduction to Information Retrieval

Stochastic Gradient Descent


▪ But a corpus may have 40B tokens and windows
▪ You would wait a very long time before making a single
update!
▪ Very bad idea for pretty much all neural nets!
▪ Instead: We update parameters after each window t
🡪 Stochastic gradient descent (SGD)
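A minimal sketch of the SGD loop (grad_for_window is a hypothetical placeholder for the gradient of the loss on a single window; this is not the lecture's code):

```python
import random

def sgd(windows, theta, grad_for_window, alpha=0.05, epochs=1):
    """Stochastic gradient descent: update theta (e.g., a numpy parameter
    vector) after each window, instead of one update per full pass over a
    40B-token corpus.  grad_for_window(window, theta) is a placeholder for
    the gradient of the loss on a single (center word, context) window."""
    for _ in range(epochs):
        random.shuffle(windows)
        for window in windows:
            theta = theta - alpha * grad_for_window(window, theta)  # cheap, noisy step
    return theta
```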
Introduction to Information Retrieval

Working out how to optimize a neural network is really all the chain rule!

Chain rule! If y = f(u) and u = g(x), i.e. y = f(g(x)), then:
dy/dx = (dy/du) · (du/dx)

Simple example:
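One representative worked example of the chain rule (the specific functions are illustrative, chosen to match the style of the slide):

```latex
y = f(u) = u^5, \qquad u = g(x) = x^3 + 7
\qquad\Longrightarrow\qquad
\frac{dy}{dx} = \frac{dy}{du}\cdot\frac{du}{dx} = 5u^4 \cdot 3x^2 = 15\,x^2\,(x^3 + 7)^4
```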
Introduction to Information Retrieval

Linear Relationships in word2vec


These representations are very good at encoding
similarity and dimensions of similarity!
▪ Analogies testing dimensions of similarity can be
solved quite well just by doing vector subtraction in
the embedding space
Syntactically:
▪ x_apple − x_apples ≈ x_car − x_cars ≈ x_family − x_families
▪ Similarly for verb and adjective morphological forms
Semantically (SemEval 2012 Task 2):
▪ x_shirt − x_clothing ≈ x_chair − x_furniture
▪ x_king − x_man ≈ x_queen − x_woman
39
Introduction to Information Retrieval

Word Analogies
Test for linear relationships, examined by Mikolov et al.

a : b :: c : ?
man : woman :: king : ?

   king    [ 0.30  0.70 ]
 − man     [ 0.20  0.20 ]
 + woman   [ 0.60  0.30 ]
 = queen   [ 0.70  0.80 ]
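A tiny numpy version of this analogy arithmetic, using the 2-D toy vectors from the slide:

```python
import numpy as np

vec = {
    "king":  np.array([0.30, 0.70]),
    "man":   np.array([0.20, 0.20]),
    "woman": np.array([0.60, 0.30]),
    "queen": np.array([0.70, 0.80]),
}

# man : woman :: king : ?   ->   king - man + woman
target = vec["king"] - vec["man"] + vec["woman"]   # = [0.70, 0.80]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Pick the nearest word in the (tiny) vocabulary, excluding the three inputs
best = max((w for w in vec if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vec[w], target))
print(best)   # queen
```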


Introduction to Information Retrieval

GloVe Visualizations

http://nlp.stanford.edu/projects/glove/
41
Introduction to Information Retrieval

GloVe Visualizations: Company - CEO

42
Introduction to Information Retrieval

GloVe Visualizations: Superlatives

43
Introduction to Information Retrieval

Application to Information Retrieval


Application is just beginning – we’re “at the end of the early years”
▪ Google’s RankBrain – little is publicly known
▪ Bloomberg article by Jack Clark (Oct 26, 2015):
http://www.bloomberg.com/news/articles/2015-10-26/google-turning-its-lucrative-web-search-over-to-ai-machines
▪ A result reranking system. “3rd most valuable ranking signal”
▪ But note: more of the potential value is in the tail?
▪ New SIGIR Neu-IR workshop series (2016 on)
Introduction to Information Retrieval

An application to information retrieval


Nalisnick, Mitra, Craswell & Caruana. 2016. Improving Document
Ranking with Dual Word Embeddings. WWW 2016 Companion.
http://research.microsoft.com/pubs/260867/pp1291-Nalisnick.pdf
Mitra, Nalisnick, Craswell & Caruana. 2016. A Dual Embedding
Space Model for Document Ranking. arXiv:1602.01137 [cs.IR]

Builds on BM25 model idea of “aboutness”


▪ Not just term repetition indicating aboutness
▪ Relationship between query terms and all terms in the
document indicates aboutness (BM25 uses only query terms)
Makes clever argument for different use of word and context
vectors in word2vec’s CBOW/SGNS or GloVe
Introduction to Information Retrieval

Modeling document aboutness: results from a search for Albuquerque

[Figure: two example result documents, d1 and d2]
Introduction to Information Retrieval

Using 2 word embeddings


▪ word2vec model with 1 word of context
▪ WIN: embeddings for focus (input) words
▪ WOUT: embeddings for context (output) words
▪ We can gain by using these two embeddings differently
Introduction to Information Retrieval

Dual Embedding Space Model (DESM)


▪ Simple model
▪ A document is represented by the centroid of its
word vectors

▪ Query-document similarity is the average, over query words, of the cosine similarity to the document centroid
Introduction to Information Retrieval

Dual Embedding Space Model (DESM)


▪ What works best is to use the OUT vectors for the
document and the IN vectors for the query

▪ This way similarity measures aboutness – words that appear with this word – which is more useful in this context than (distributional) semantic similarity
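A minimal sketch of DESM(IN-OUT) scoring as described above (the embedding lookups and tokenization are assumed inputs; this illustrates the formula, it is not the authors' code):

```python
import numpy as np

def desm_in_out(query_terms, doc_terms, emb_in, emb_out):
    """DESM(IN-OUT): average cosine between each query term's IN vector and
    the centroid of the document's normalized OUT vectors.
    emb_in / emb_out: dicts mapping a term to its IN / OUT embedding."""
    def unit(v):
        return v / np.linalg.norm(v)

    # Document represented by the centroid of its (normalized) OUT word vectors
    doc_centroid = unit(np.mean([unit(emb_out[t]) for t in doc_terms], axis=0))

    # Query-document similarity: average over query words of cosine similarity
    return float(np.mean([unit(emb_in[q]) @ doc_centroid for q in query_terms]))
```

For the mixture approach mentioned in the experiments, this score would be linearly interpolated with BM25, e.g. λ·DESM + (1 − λ)·BM25.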
Introduction to Information Retrieval

Experiments
▪ Train word2vec from either
▪ 600 million Bing queries
▪ 342 million web document sentences
▪ Test on 7,741 randomly sampled Bing queries
▪ 5 level eval (Perfect, Excellent, Good, Fair, Bad)
▪ Two approaches
1. Use DESM model to rerank top results from BM25
2. Use DESM alone or a mixture model of it and BM25
Introduction to Information Retrieval

Results – reranking k-best list

Pretty decent gains – e.g., 2% for NDCG@3


Gains are bigger for model trained on queries than docs
Introduction to Information Retrieval

Results – whole ranking system


Introduction to Information Retrieval

A possible explanation

IN-OUT has some ability to prefer relevant documents to close-by (judged) non-relevant ones, but its scores introduce too much noise relative to BM25 to be usable alone
Introduction to Information Retrieval

DESM conclusions
▪ DESM is a weak ranker but effective at finding subtler
similarities/aboutness
▪ It is effective at, but only at, reranking at least
somewhat relevant documents

▪ For example, DESM can confuse Oxford and Cambridge


▪ Bing rarely makes an Oxford/Cambridge mistake!
Introduction to Information Retrieval

What else can neural nets do in IR?


▪ Use a neural network as a supervised
reranker
▪ Assume a query and document
embedding network (as we have
discussed)
▪ Assume you have (q,d,rel) relevance
data
▪ Learn a neural network (with
supervised learning) to predict
relevance of (q,d) pair
▪ An example of “machine-learned
relevance”, which we’ll talk about
more next lecture
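A minimal sketch of such a supervised reranker, written as a small PyTorch network over pre-computed query and document embeddings (the architecture, sizes, and names are illustrative assumptions, not a specific published system):

```python
import torch
import torch.nn as nn

class Reranker(nn.Module):
    """Predict the relevance of a (query, document) pair from their embeddings."""
    def __init__(self, emb_dim=300, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, q_emb, d_emb):
        x = torch.cat([q_emb, d_emb], dim=-1)   # joint (q, d) representation
        return self.net(x).squeeze(-1)          # relevance score

# Supervised learning from (q, d, rel) data:
model = Reranker()
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

q, d = torch.randn(32, 300), torch.randn(32, 300)   # batch of query/doc embeddings
rel = torch.randint(0, 2, (32,)).float()            # binary relevance labels
loss = loss_fn(model(q, d), rel)
opt.zero_grad(); loss.backward(); opt.step()
```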
Introduction to Information Retrieval

What else can neural nets do in IR?


▪ BERT: Devlin, Chang, Lee, Toutanova (2018)
▪ A deep transformer-based neural network
▪ Builds per-token (in context) representations
▪ Produces a query/document
representation as well
▪ Or jointly embed query and
document and ask for a
retrieval score
▪ Incredibly effective!
▪ https://arxiv.org/abs/1810.04805
Introduction to Information Retrieval

Summary: Embed all the things!


Word embeddings are the hot new technology (again!)

Lots of applications wherever knowing word context or


similarity helps prediction:
▪ Synonym handling in search
▪ Document aboutness
▪ Ad serving
▪ Language models: from spelling correction to email response
▪ Machine translation
▪ Sentiment analysis
▪ …
Introduction to Information Retrieval

Global vs. local embedding [Diaz 2016]


Introduction to Information Retrieval

Global vs. local embedding [Diaz 2016]

▪ Train w2v on documents from the first round of retrieval

▪ Fine-grained word sense disambiguation
Introduction to Information Retrieval

Ad-hoc retrieval using local and distributed representations [Mitra et al. 2017]
▪ Argues both “lexical” and
“semantic” matching is
important for document
ranking
▪ Duet model is a linear
combination of two DNNs
using local and distributed
representations of query/
document as inputs, and
jointly trained on labelled data
Introduction to Information Retrieval

Latest Papers (WWW and SIGIR 2023)


Title: Eligibility Mechanisms: Auctions Meet Information Retrieval
Authors: Gagan Goel, Renato Paes Leme, Jon Schneider, David Thompson and Hanrui Zhang
https://www.andrew.cmu.edu/user/hanruiz1/papers/stochastic_probing.pdf

Title: Multivariate Representation Learning for Information Retrieval
Authors: Hamed Zamani, Michael Bendersky
https://arxiv.org/abs/2304.14522

Title: One Blade for One Purpose: Advancing Math Information Retrieval using Hybrid Search
Authors: Wei Zhong, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin
https://dl.acm.org/doi/10.1145/3539618.3591746

Title: MSQ-BioBERT: Ambiguity Resolution to Enhance BioBERT Medical Question-Answering
Authors: Muzhe Guo, Muhao Guo, Edward T. Dougherty, Fang Jin
https://dl.acm.org/doi/10.1145/3543507.3583878
