Ranked Retrieval
Nisheeth
Ch. 6
Ranked retrieval
• Thus far, our queries have all been Boolean.
– Documents either match or don’t.
• Good for expert users with precise
understanding of their needs and the
collection.
– Also good for applications: Applications can easily
consume 1000s of results.
• Not good for the majority of users.
– Writing Boolean queries is hard
Jaccard coefficient
• jaccard(A,B) = |A ∩ B| / |A ∪ B|
• jaccard(A,A) = 1
• jaccard(A,B) = 0 if A ∩ B = ∅
• Always assigns a number between 0 and 1.
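The coefficient is straightforward to compute over term sets; a minimal sketch in Python (the function name is illustrative):

```python
def jaccard(a, b):
    """Jaccard coefficient of two term sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention for two empty sets
    return len(a & b) / len(a | b)

# Two queries sharing 2 of 4 distinct terms:
print(jaccard({"ides", "of", "march"}, {"caesar", "of", "march"}))  # 0.5
```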
Term-document incidence matrix (1 if the play contains the word):

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony               1                  1             0          0       0        1
Brutus               1                  1             0          1       0        0
Caesar               1                  1             0          1       1        1
Calpurnia            0                  1             0          0       0        0
Cleopatra            1                  0             0          0       0        0
mercy                1                  0             1          1       1        1
worser               1                  0             1          1       1        0
Log-frequency weighting
• The log frequency weight of term t in d is

  w_{t,d} = 1 + log10 tf_{t,d},  if tf_{t,d} > 0
          = 0,                   otherwise
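This piecewise weight can be sketched directly (function name is illustrative):

```python
import math

def log_tf_weight(tf):
    """w_{t,d} = 1 + log10(tf_{t,d}) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

# tf: 0 -> 0, 1 -> 1, 10 -> 2, 1000 -> 4: damps raw counts
```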
tf-idf weighting

  w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)
• Best known weighting scheme in information retrieval
• Increases with the number of occurrences within a
document
• Increases with the rarity of the term in the collection
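The two factors above combine into one weight; a minimal sketch (illustrative helper, base-10 logs as in the formula):

```python
import math

def tf_idf(tf, df, N):
    """(1 + log10(tf)) * log10(N / df); 0 when the term is absent.

    tf: term count in the document, df: number of documents
    containing the term, N: collection size."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# Rarer terms (smaller df) get larger weights for the same tf:
# tf_idf(5, 2, 1000) > tf_idf(5, 500, 1000)
```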
Effect of idf on ranking
  Score(q,d) = ∑_{t ∈ q∩d} tf.idf_{t,d}
Documents as vectors
Queries as vectors
• Key idea 1: Do the same for queries: represent them
as vectors in the space
• Key idea 2: Rank documents according to their
proximity to the query in this space
• proximity = similarity of vectors
• proximity ≈ inverse of distance
• We do this because we want to get away from the
you’re-either-in-or-out Boolean model.
• Instead: rank more relevant documents higher than
less relevant documents
Euclidean distance is a bad idea
• The Euclidean distance between q and d2 is large even though the
distribution of terms in the query q and the distribution of terms
in the document d2 are very similar.
Sec. 6.3
cosine(query,document)

  cos(q,d) = q · d = ∑_i q_i d_i

for q, d length-normalized.
Term frequencies (counts):

            SaS   PaP   WH
affection   115    58   20
jealous      10     7   11
gossip        2     0    6
wuthering     0     0   38

• SaS: Sense and Sensibility, PaP: Pride and Prejudice, WH: Wuthering Heights

After log-frequency weighting and length normalization:

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69

Why do we have cos(SaS,PaP) > cos(SaS,WH)?
Computing cosine scores
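The standard term-at-a-time scheme can be sketched as follows; the index layout (term → list of (doc_id, tf) postings) and the precomputed document vector lengths are assumptions for this sketch:

```python
import math
from collections import defaultdict

def cosine_scores(query_terms, index, doc_lengths, N):
    """Term-at-a-time scoring: accumulate tf-idf contributions per
    document, then normalize by document vector length.

    index: term -> list of (doc_id, tf) postings (assumed layout)
    doc_lengths: doc_id -> length of the document's weight vector
    N: number of documents in the collection"""
    scores = defaultdict(float)
    for t in query_terms:
        postings = index.get(t, [])
        if not postings:
            continue
        idf = math.log10(N / len(postings))  # df = postings list length
        for doc_id, tf in postings:
            scores[doc_id] += (1 + math.log10(tf)) * idf
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]
    return sorted(scores.items(), key=lambda x: -x[1])
```

A document matching more (and rarer) query terms rises to the top of the returned list.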
Summary – vector space models
LANGUAGE MODELS
Trouble with frequency-based models
• Too literal
• Can’t deal with misspellings, synonyms, etc.
• Natural language queries are hard to handle unless
these difficulties are addressed
Language Model
• Unigram language model
– probability distribution over the words in a
language
– generation of text consists of pulling words out of
a “bucket” according to the probability
distribution and replacing them
• N-gram language model
– some applications use bigram and trigram
language models where probabilities depend on
previous words
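The maximum-likelihood unigram model and the "bucket" sampling described above can be sketched as (function names are illustrative):

```python
import random
from collections import Counter

def unigram_model(text):
    """Maximum-likelihood unigram model: P(w) = count(w) / total."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def sample(model, n, rng=random):
    """Pull n words out of the 'bucket' according to the distribution,
    replacing them each time (sampling with replacement)."""
    words = list(model)
    weights = [model[w] for w in words]
    return [rng.choices(words, weights)[0] for _ in range(n)]
```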
Semantic distance
Sample topic
Language Model
• A topic in a document or query can be
represented as a language model
– i.e., words that tend to occur often when discussing a
topic will have high probabilities in the corresponding
language model
– The basic assumption is that words cluster in semantic
space
• Multinomial distribution over words
– text is modeled as a finite sequence of words, where
there are t possible words at each point in the
sequence
– commonly used, but not only possibility
– doesn’t model burstiness
Has interesting applications
LMs for Retrieval
• 3 possibilities:
– probability of generating the query text from a
document language model
– probability of generating the document text from
a query language model
– comparing the language models representing the
query and document topics
• Models of topical relevance
Query-Likelihood Model
• Rank documents by the probability that the
query could be generated by the document
model (i.e. same topic)
• Given query Q, start with P(D|Q)
• Using Bayes’ Rule:

  P(D|Q) = P(Q|D) P(D) / P(Q)

• Ranking score: P(Q) is the same for every document, and assuming
a uniform prior P(D), documents can be ranked by the query likelihood

  P(Q|D) = ∏_i P(q_i | M_D)

where M_D is the language model estimated from document D
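A query-likelihood scorer can be sketched as below; the Jelinek-Mercer smoothing against a collection model, and the `lam` default, are assumptions added for the sketch (an unsmoothed model would assign zero probability to any query term absent from the document):

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """log P(Q|D) under a unigram document model, mixed with the
    collection model for smoothing (Jelinek-Mercer, weight lam)."""
    d = Counter(doc)
    c = Counter(collection)
    dlen, clen = sum(d.values()), sum(c.values())
    score = 0.0
    for q in query:
        p = lam * d[q] / dlen + (1 - lam) * c[q] / clen
        if p == 0:
            return float("-inf")  # term unseen even in the collection
        score += math.log(p)
    return score
```

Documents whose model makes the query probable (i.e. on the same topic) score higher.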