Lecture14 Distributed Representations.pptx
Introduction to Information Retrieval
Distributed Word Representations for Information Retrieval
Introduction to Information Retrieval Sec. 9.2.2
▪ For each term ti, pick the terms with the highest values in the term-term correlation matrix C
Introduction to Information Retrieval Sec. 9.2.3
Too little data (tens of millions of words) treated by too sparse a method:
with a vocabulary of 100,000 words, C has 100,000² = 10^10 entries.
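To make the sparsity problem concrete, here is a toy sketch (my construction, not from the slides) of a co-occurrence-based term-term matrix C, stored sparsely; a dense C would need V² cells:

```python
from collections import Counter
from itertools import combinations

# Toy corpus: each document is a list of terms (invented for illustration).
docs = [
    ["cheap", "flights", "tickets"],
    ["cheap", "tickets", "hotels"],
    ["flights", "hotels"],
]

# C[(ti, tj)] counts how often ti and tj co-occur in a document.
# A dense C for a 100,000-term vocabulary would have 10^10 entries,
# almost all of them zero -- hence "too sparse a method".
C = Counter()
for doc in docs:
    for ti, tj in combinations(sorted(set(doc)), 2):
        C[(ti, tj)] += 1

# For each term, pick the terms with the highest values in C.
def related(term, k=2):
    scores = Counter()
    for (ti, tj), n in C.items():
        if ti == term:
            scores[tj] += n
        elif tj == term:
            scores[ti] += n
    return [t for t, _ in scores.most_common(k)]

print(related("cheap"))  # "tickets" co-occurs with "cheap" most often
```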
Introduction to Information Retrieval Sec. 18.2
Traditional Way:
Latent Semantic Indexing/Analysis
▪ Use Singular Value Decomposition (SVD) – kind of like
Principal Components Analysis (PCA) for an arbitrary
rectangular matrix – or just random projection to find a
low-dimensional basis or orthogonal vectors
▪ Theory is that similarity is preserved as much as possible
▪ You can actually gain in IR (slightly) by doing LSA, as “noise”
of term variation gets replaced by semantic “concepts”
▪ Somewhat popular in the 1990s [Deerwester et al. 1990, etc.]
▪ But results were always somewhat iffy (… it worked sometimes)
▪ Hard to implement efficiently in an IR system (dense vectors!)
▪ Discussed in IIR chapter 18, but not discussed further here
▪ Not on the exam (!!!)
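The SVD step at the heart of LSA can be sketched with NumPy on a toy term-document matrix (the counts are invented for illustration):

```python
import numpy as np

# Toy term-document count matrix A (terms x documents); values illustrative.
A = np.array([
    [1, 1, 0, 0],   # "ship"
    [0, 1, 1, 0],   # "boat"
    [1, 0, 0, 1],   # "ocean"
    [0, 0, 1, 1],   # "voyage"
], dtype=float)

# Singular Value Decomposition: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values: a rank-k approximation in
# which "noise" of term variation collapses into k latent concepts.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_k is the best rank-k approximation of A in the Frobenius norm,
# i.e. similarity is preserved as much as possible in k dimensions.
err = np.linalg.norm(A - A_k)
print(round(float(err), 3))
```

Note the efficiency objection from the slide: unlike the sparse original matrix, U, s, and Vt are dense, which is what makes LSA hard to use inside an inverted-index IR system.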
“NEURAL EMBEDDINGS”
▪ Dimensionality Reduction
▪ Contextual Information
banking = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]
Two algorithms
1. Skip-grams (SG)
Predict context words given target (position independent)
2. Continuous Bag of Words (CBOW)
Predict target word from bag-of-words context
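A minimal sketch of how the two algorithms slice the same text into training examples (the sentence and window size are illustrative):

```python
# Generate training examples for a window of size m around each position.
sentence = ["problems", "turning", "into", "banking", "crises", "as"]
m = 2  # context window size (illustrative)

skipgram_pairs = []   # SG: (target, context) -- predict each context word from the target
cbow_examples = []    # CBOW: (context_bag, target) -- predict the target from its context

for t, target in enumerate(sentence):
    # Context words within m positions of t (position independent).
    context = [sentence[t + j] for j in range(-m, m + 1)
               if j != 0 and 0 <= t + j < len(sentence)]
    for c in context:
        skipgram_pairs.append((target, c))   # SG: one example per context word
    cbow_examples.append((context, target))  # CBOW: one example per position

print(skipgram_pairs[:4])
print(cbow_examples[3])  # context bag around "banking"
```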
Likelihood = L(θ) = ∏(t=1..T) ∏(−m ≤ j ≤ m, j ≠ 0) P(w(t+j) | w(t); θ)
where θ is all the parameters to be optimized (the word vectors) and m is the window size
▪ We then optimize these parameters
▪ Subtracting a fraction of the gradient moves you towards the minimum!
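That update rule is θ_new = θ_old − α ∇θ J(θ). A minimal sketch on a toy 1-D objective (the quadratic and step size are illustrative, not the word2vec loss):

```python
# Toy objective J(theta) = (theta - 3)^2, minimum at theta = 3.
def grad(theta):
    return 2.0 * (theta - 3.0)   # dJ/dtheta

theta = 0.0     # initial guess
alpha = 0.1     # learning rate: the "fraction of the gradient"
for _ in range(100):
    theta -= alpha * grad(theta)  # step downhill

print(round(theta, 4))  # converges toward 3.0
```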
Word Analogies
Test for linear relationships, examined by Mikolov et al.
a:b :: c:?
man:woman :: king:?
GloVe Visualizations
http://nlp.stanford.edu/projects/glove/
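The analogy a:b :: c:? is typically answered by a nearest-neighbor search around vec(b) − vec(a) + vec(c). A toy sketch with invented 2-d vectors (real systems use 100-300-dimensional GloVe or word2vec vectors):

```python
import math

# Toy embeddings; the values are invented purely for illustration.
vec = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [3.0, 0.2],
    "queen": [3.0, 1.2],
    "apple": [0.0, 3.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    # a:b :: c:?  ->  word whose vector is nearest to vec(b) - vec(a) + vec(c)
    target = [vb - va + vc for va, vb, vc in zip(vec[a], vec[b], vec[c])]
    candidates = [w for w in vec if w not in (a, b, c)]  # exclude the inputs
    return max(candidates, key=lambda w: cosine(vec[w], target))

print(analogy("man", "woman", "king"))  # -> queen
```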
W_IN: embeddings for focus words
W_OUT: embeddings for context words
(the focus word is looked up in W_IN, the context word in W_OUT)
Experiments
▪ Train word2vec from either
▪ 600 million Bing queries
▪ 342 million web document sentences
▪ Test on 7,741 randomly sampled Bing queries
▪ 5 level eval (Perfect, Excellent, Good, Fair, Bad)
▪ Two approaches
1. Use DESM model to rerank top results from BM25
2. Use DESM alone or a mixture model of it and BM25
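A sketch of the DESM scoring idea, in the spirit of the dual-embedding setup above: each query term's IN embedding is compared against the centroid of the document terms' OUT embeddings, and the cosines are averaged (toy vectors; this follows the Dual Embedding Space Model only in outline):

```python
import math

def _unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def desm_score(query_in_vecs, doc_out_vecs):
    # Centroid of the (normalized) OUT vectors of the document's terms.
    normed = [_unit(v) for v in doc_out_vecs]
    dim = len(normed[0])
    centroid = _unit([sum(v[i] for v in normed) / len(normed)
                      for i in range(dim)])
    # Average cosine between each query term's IN vector and the centroid.
    sims = [sum(a * b for a, b in zip(_unit(q), centroid))
            for q in query_in_vecs]
    return sum(sims) / len(sims)

# Invented 2-d embeddings, purely for illustration.
query_in = [[1.0, 0.2]]             # IN vectors of the query terms
doc_out = [[0.9, 0.1], [1.0, 0.3]]  # OUT vectors of the document terms
print(round(desm_score(query_in, doc_out), 3))
```

In approach 2 above, this score would be combined linearly with BM25 in a mixture model rather than used alone.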
DESM conclusions
▪ DESM is a weak ranker on its own, but effective at finding subtler
similarities/aboutness
▪ It is effective at, but only at, reranking documents that are already at
least somewhat relevant