Ranked Retrieval

Ranked retrieval aims to return documents in order of their relevance to a query, rather than simply returning documents that either match or don't, as in Boolean retrieval. It assigns a score, typically between 0 and 1, to each document based on how well the document matches the query. The score is based on factors such as term frequency (how many times a term appears in a document) and inverse document frequency (how common or rare a term is across all documents). A popular weighting scheme is tf-idf, which multiplies term frequency by inverse document frequency, giving higher weight to rare terms that appear many times in a document. Documents and queries can then be represented as vectors in a high-dimensional space, with a weight along each term dimension.


Ranked retrieval

Nisheeth
Ch. 6

Ranked retrieval
• Thus far, our queries have all been Boolean.
– Documents either match or don’t.
• Good for expert users with precise
understanding of their needs and the
collection.
– Also good for applications: Applications can easily
consume 1000s of results.
• Not good for the majority of users.
– Writing Boolean queries is hard

Problem with Boolean search: feast or famine
• Boolean queries often result in either too few
(=0) or too many (1000s) results.
• Query 1: “standard user dlink 650” → 200,000
hits
• Query 2: “standard user dlink 650 no card
found”: 0 hits
• It takes a lot of skill to come up with a query
that produces a manageable number of hits.
– AND gives too few; OR gives too many
Ranked retrieval models
• Rather than a set of documents satisfying a query expression,
in ranked retrieval, the system returns an ordering over the
(top) documents in the collection for a query
• Free text queries: Rather than a query language of operators
and expressions, the user’s query is just one or more words in
a human language
• Ranked list of results: No more feast or famine


Scoring as the basis of ranked retrieval
• We wish to return in order the documents
most likely to be useful to the searcher
• How can we rank-order the documents in the
collection with respect to a query?
• Assign a score – say in [0, 1] – to each
document
• This score measures how well document and
query “match”.

Query-document matching scores

• We need a way of assigning a score to a query/document pair
• If the query term does not occur in the
document: score should be 0
• The more frequent the query term in the
document, the higher the score (should be)

Jaccard coefficient

• jaccard(A,B) = |A ∩ B| / |A ∪ B|
• jaccard(A,A) = 1
• jaccard(A,B) = 0 if A ∩ B = ∅
• Always assigns a number between 0 and 1.
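
As a small illustration of the definition above, here is a minimal Python sketch of Jaccard scoring; the whitespace tokenization and the toy query/documents are assumptions for illustration only.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

# Toy example: score a query against two documents by term overlap alone.
query = set("ides of march".split())
doc1 = set("caesar died in march".split())
doc2 = set("the long march".split())
print(jaccard(query, doc1))  # 1/6 ≈ 0.17
print(jaccard(query, doc2))  # 1/5 = 0.20
```

Note that the longer document scores lower purely because its union with the query is larger, which is exactly the length issue raised on the next slide.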

Issues with Jaccard for scoring

• Privileges shorter documents
– we need a more sophisticated way of normalizing for length, e.g. |A ∩ B| / √|A ∪ B|
• It doesn’t consider term frequency
– how many times a term occurs in a document
• Does not account for term informativeness
– how important the term is in the collection (rare terms are more informative than frequent ones)

Accounting for term frequency

(columns: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth)
Antony      1  1  0  0  0  1
Brutus      1  1  0  1  0  0
Caesar      1  1  0  1  1  1
Calpurnia   0  1  0  0  0  0
Cleopatra   1  0  0  0  0  0
mercy       1  0  1  1  1  1
worser      1  0  1  1  1  0

Each document is represented by a binary vector ∈ {0,1}^|V|


Term frequency tf

• The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d.
• We want to use tf when computing query-
document match scores. But how?
• Raw term frequency is not what we want:
– A document with 10 occurrences of the term is
more relevant than a document with 1 occurrence
of the term.
– But not 10 times more relevant.

Log-frequency weighting
• The log-frequency weight of term t in d is:
    w_t,d = 1 + log10( tf_t,d )   if tf_t,d > 0
    w_t,d = 0                     otherwise
• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
• Score for a document-query pair: sum over terms t in both q and d:
    score(q,d) = Σ_{t ∈ q∩d} ( 1 + log10 tf_t,d )
• The score is 0 if none of the query terms is present in the document.
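
A minimal Python sketch of this overlap score; representing a document as a dictionary of raw term frequencies is an assumption made for illustration.

```python
import math

def log_tf_score(query_terms, doc_tf):
    """score(q, d) = sum over t in q ∩ d of (1 + log10 tf_{t,d})."""
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf > 0:  # terms absent from the document contribute nothing
            score += 1 + math.log10(tf)
    return score

doc = {"antony": 2, "brutus": 1, "caesar": 10}
print(log_tf_score(["brutus", "caesar"], doc))  # 1 + (1 + log10 10) = 3.0
```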

Document frequency

• Rare terms are more informative than frequent terms
– Recall stop words
• Consider a term in the query that is rare in the
collection (e.g., arachnocentric)
• A document containing this term is very likely to
be relevant to the query arachnocentric
• → We want a high weight for rare terms like
arachnocentric.

Document frequency, continued

• Frequent terms are less informative than rare terms
• Consider a query term that is frequent in the
collection (e.g., high, increase, line)
• A document containing such a term is more
likely to be relevant than a document that
doesn’t
• But it’s not a sure indicator of relevance.
– How/when will it break?
idf weight

• df_t is the document frequency of t: the number of documents that contain t
– df_t is an inverse measure of the informativeness of t
– df_t ≤ N
• We define the idf (inverse document frequency) of t by
    idf_t = log10( N/df_t )
– We use log10( N/df_t ) instead of N/df_t to “dampen” the effect of idf.
idf example, suppose N = 1 million

term        df_t         idf_t
calpurnia   1            6
animal      100          4
sunday      1,000        3
fly         10,000       2
under       100,000      1
the         1,000,000    0

idf_t = log10( N/df_t )


There is one idf value for each term t in a collection.
tf.idf weighting
• The tf.idf weight of a term is the product of its tf
weight and its idf weight.

    w_t,d = log( 1 + tf_t,d ) × log10( N/df_t )
• Best known weighting scheme in information retrieval
• Increases with the number of occurrences within a
document
• Increases with the rarity of the term in the collection
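
The sketch below computes the tf.idf weight from this slide. The base of the first log is not specified on the slide, so taking it as base 10 is an assumption, and the toy tf/df/N values are chosen to echo the idf example above.

```python
import math

def tfidf_weight(tf, df, N):
    """w_{t,d} = log10(1 + tf_{t,d}) * log10(N / df_t).

    The earlier slide's variant uses (1 + log10 tf) as the tf factor instead;
    either way the weight grows with tf and with the rarity of the term.
    """
    if tf == 0 or df == 0:
        return 0.0
    return math.log10(1 + tf) * math.log10(N / df)

N = 1_000_000  # assumed collection size, as in the idf example
print(tfidf_weight(tf=3, df=100, N=N))      # rare term (idf = 4): higher weight
print(tfidf_weight(tf=3, df=100_000, N=N))  # common term (idf = 1): lower weight
```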
Effect of idf on ranking

• Does idf have an effect on ranking for one-term queries, like
– iPhone
• idf has no effect on ranking one term queries
– idf affects the ranking of documents for queries
with at least two terms
– For the query capricious person, idf weighting
makes occurrences of capricious count for much
more in the final document ranking than
occurrences of person.
Score for a document given a query

    Score(q,d) = Σ_{t ∈ q∩d} tf.idf_t,d

• There are many variants


– How “tf” is computed (with/without logs)
– Whether the terms in the query are also weighted
–…
tf-idf weighting has many variants

[The SMART table of tf-idf weighting variants is not reproduced here; columns headed ‘n’, etc. are acronyms for the weight schemes.]


Weighting may differ in queries vs
documents
• Many search engines allow for different
weightings for queries vs. documents
• SMART Notation: denotes the combination in
use in an engine, with the notation ddd.qqq,
using the acronyms from the previous table
– A very standard weighting scheme is: lnc.ltc
• Document: logarithmic tf (l as first character), no idf
and cosine normalization
• Query: logarithmic tf (l in leftmost column), idf (t in
second column), no normalization …
tf-idf example: lnc.ltc
Document: car insurance auto insurance
Query: best car insurance
Query (ltc weighting):
term        tf-raw   tf-wt   df      idf   wt    n’lize
auto        0        0       5000    2.3   0     0
best        1        1       50000   1.3   1.3   0.34
car         1        1       10000   2.0   2.0   0.52
insurance   1        1       1000    3.0   3.0   0.78

Document (lnc weighting):
term        tf-raw   tf-wt   wt    n’lize
auto        1        1       1     0.52
best        0        0       0     0
car         1        1       1     0.52
insurance   2        1.3     1.3   0.68

Products (query n’lize × document n’lize): auto 0, best 0, car 0.27, insurance 0.53

Exercise: what is N, the number of docs?

Doc length = √(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8
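
The worked example above can be reproduced with a short script. This is a minimal sketch; it assumes N = 1,000,000 (which is consistent with the idf column and answers the exercise) and cosine-normalizes both sides, as ltc/lnc prescribe.

```python
import math

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

N = 1_000_000  # assumed; it reproduces the idf values above
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"car": 1, "auto": 1, "insurance": 2}

# ltc query: logarithmic tf, idf, cosine normalization
q_wt = {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in query_tf.items()}
q_len = math.sqrt(sum(w * w for w in q_wt.values()))

# lnc document: logarithmic tf, no idf, cosine normalization
d_wt = {t: log_tf(tf) for t, tf in doc_tf.items()}
d_len = math.sqrt(sum(w * w for w in d_wt.values()))

score = sum((q_wt[t] / q_len) * (d_wt[t] / d_len) for t in q_wt if t in d_wt)
print(round(d_len, 2), round(score, 2))  # 1.92 0.8, matching the slide
```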
Binary → count → weight matrix

(columns: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth)
Antony      5.25   3.18   0      0      0      0.35
Brutus      1.21   6.1    0      1      0      0
Caesar      8.59   2.54   0      1.51   0.25   0
Calpurnia   0      1.54   0      0      0      0
Cleopatra   2.85   0      0      0      0      0
mercy       1.51   0      1.9    0.12   5.25   0.88
worser      1.37   0      0.11   4.15   0.25   1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|

Documents as vectors

• So we have a |V|-dimensional vector space


• Terms are axes of the space
• Documents are points or vectors in this space
• Very high-dimensional: tens of millions of
dimensions when you apply this to a web
search engine
• These are very sparse vectors - most entries
are zero.

Queries as vectors
• Key idea 1: Do the same for queries: represent them
as vectors in the space
• Key idea 2: Rank documents according to their
proximity to the query in this space
• proximity = similarity of vectors
• proximity ≈ inverse of distance
• We do this because we want to get away from the
you’re-either-in-or-out Boolean model.
• Instead: rank more relevant documents higher than
less relevant documents
Euclidean distance is a bad idea

• The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

cosine(query,document)

    cos(q, d) = (q · d) / (|q| |d|) = (q/|q|) · (d/|d|)
              = ( Σ_{i=1}^{|V|} q_i d_i ) / ( √(Σ_{i=1}^{|V|} q_i²) · √(Σ_{i=1}^{|V|} d_i²) )

The numerator is the dot product of q and d; dividing by the lengths |q| and |d| turns q and d into unit vectors.

q_i is the tf-idf weight of term i in the query
d_i is the tf-idf weight of term i in the document

cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.
Length normalization
• A vector can be (length-) normalized by dividing each
of its components by its length – for this we use the
L2 norm:
    ||x||_2 = √( Σ_i x_i² )
• Dividing a vector by its L2 norm makes it a unit
(length) vector (on surface of unit hypersphere)
• Effect on the two documents d and d′ (d appended
to itself) from earlier slide: they have identical
vectors after length-normalization.
– Long and short documents now have comparable
weights
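
A minimal sketch of L2 length-normalization; the dictionary-of-weights representation and the toy vector are assumptions. It also checks the claim above that a document appended to itself ends up with the same unit vector.

```python
import math

def l2_normalize(vec):
    """Divide every component by the vector's L2 norm, yielding a unit vector."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

d = {"jealous": 10, "gossip": 2}
dd = {t: 2 * w for t, w in d.items()}  # d appended to itself: every count doubles

unit_d = {t: round(w, 6) for t, w in l2_normalize(d).items()}
unit_dd = {t: round(w, 6) for t, w in l2_normalize(dd).items()}
print(unit_d == unit_dd)  # True: identical vectors after length-normalization
```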
Cosine for length-normalized
vectors
• For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
    cos(q, d) = q · d = Σ_{i=1}^{|V|} q_i d_i
  for q, d length-normalized.

Cosine similarity amongst 3 documents

• How similar are these novels?
– SaS: Sense and Sensibility
– PaP: Pride and Prejudice
– WH: Wuthering Heights

Term frequencies (counts):
term        SaS   PaP   WH
affection   115   58    20
jealous     10    7     11
gossip      2     0     6
wuthering   0     0     38

Note: To simplify this example, we don’t do idf weighting.


3 documents example contd.
Log frequency weighting:
term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

After length normalization:
term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588

cos(SaS,PaP) ≈ 0.789×0.832 + 0.515×0.555 + 0.335×0.0 + 0.0×0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69

Why do we have cos(SaS,PaP) > cos(SaS,WH)?
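
The numbers above can be checked with a few lines of Python; this is a minimal sketch (log tf weighting, length normalization, no idf, as the example specifies).

```python
import math

def log_tf_vector(counts):
    return {t: 1 + math.log10(c) for t, c in counts.items() if c > 0}

def normalize(vec):
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

def cosine(u, v):
    return sum(u[t] * v[t] for t in u if t in v)  # dot product of unit vectors

counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2},
    "PaP": {"affection": 58, "jealous": 7},
    "WH": {"affection": 20, "jealous": 11, "gossip": 6, "wuthering": 38},
}
vecs = {name: normalize(log_tf_vector(c)) for name, c in counts.items()}
print(round(cosine(vecs["SaS"], vecs["PaP"]), 2))  # ≈ 0.94
print(round(cosine(vecs["SaS"], vecs["WH"]), 2))   # ≈ 0.79
print(round(cosine(vecs["PaP"], vecs["WH"]), 2))   # ≈ 0.69
```

SaS and PaP concentrate their weight on the same terms (affection, jealous), while WH spends much of its weight on wuthering and gossip, which is why cos(SaS,PaP) comes out highest.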
Computing cosine scores
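
The algorithm figure that accompanied this slide is not reproduced here; below is a minimal Python sketch of term-at-a-time cosine scoring over an inverted index. The index layout (postings as lists of (docID, weight) pairs) and the doc_lengths mapping are illustrative assumptions, not a specific system's API.

```python
import heapq

def cosine_score(query_weights, postings, doc_lengths, k=10):
    """Term-at-a-time cosine scoring.

    query_weights: {term: weight of the term in the query}
    postings:      {term: [(doc_id, weight of the term in that doc), ...]}
    doc_lengths:   {doc_id: Euclidean length of the document vector}
    Returns the top-k (score, doc_id) pairs.
    """
    scores = {}
    for term, w_q in query_weights.items():
        for doc_id, w_d in postings.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_q * w_d
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]  # length-normalize each document
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))
```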
Summary – vector space models

• Represent the query as a weighted tf-idf vector


• Represent each document as a weighted tf-idf
vector
• Compute the cosine similarity score for the
query vector and each document vector
• Rank documents with respect to the query by
score
• Return the top K (e.g., K = 10) to the user
Ranked retrieval

LANGUAGE MODELS
Trouble with frequency-based models
• Too literal
• Can’t deal with misspellings, synonyms etc.
• Natural language queries are hard to deal with
if you don’t address these difficulties
Language Model
• Unigram language model
– probability distribution over the words in a
language
– generation of text consists of pulling words out of
a “bucket” according to the probability
distribution and replacing them
• N-gram language model
– some applications use bigram and trigram
language models where probabilities depend on
previous words
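
A tiny sketch of the “bucket” picture for a unigram model: words are drawn independently, with replacement, according to a fixed distribution. The toy probabilities below are made-up numbers for illustration.

```python
import random

# Toy unigram language model: P(word) over a four-word vocabulary (assumed values).
unigram = {"the": 0.4, "said": 0.3, "president": 0.2, "lincoln": 0.1}

def generate(model, n):
    """Draw n words with replacement according to the unigram distribution."""
    words, probs = zip(*model.items())
    return random.choices(words, weights=probs, k=n)

print(" ".join(generate(unigram, 8)))
```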
Semantic distance
Sample topic
Language Model
• A topic in a document or query can be
represented as a language model
– i.e., words that tend to occur often when discussing a
topic will have high probabilities in the corresponding
language model
– The basic assumption is that words cluster in semantic
space
• Multinomial distribution over words
– text is modeled as a finite sequence of words, where
there are t possible words at each point in the
sequence
– commonly used, but not only possibility
– doesn’t model burstiness
Has interesting applications
LMs for Retrieval
• 3 possibilities:
– probability of generating the query text from a
document language model
– probability of generating the document text from
a query language model
– comparing the language models representing the
query and document topics
• Models of topical relevance
Query-Likelihood Model
• Rank documents by the probability that the
query could be generated by the document
model (i.e. same topic)
• Given a query, start with P(D|Q)
• Using Bayes’ Rule (P(Q) is the same for every document, so it can be ignored for ranking):
    P(D|Q) ∝ P(Q|D) P(D)
• Assuming the prior P(D) is uniform, and a unigram model:
    P(Q|D) = Π_{i=1..n} P(q_i|D)    for query words q_1 … q_n
Other query constructions
Estimating Probabilities
• Obvious estimate for unigram probabilities is
    P(q_i|D) = f_{q_i,D} / |D|
– f_{q_i,D} is the number of times q_i occurs in D; |D| is the number of words in D
• Maximum likelihood estimate
– makes the observed value of f_{q_i,D} most likely
• If query words are missing from the document, the score will be zero
– Missing 1 out of 4 query words is the same as missing 3 out of 4
Smoothing
• Document texts are a sample from the
language model
– Missing words should not have zero probability of
occurring
• Smoothing is a technique for estimating
probabilities for missing (or unseen) words
– lower (or discount) the probability estimates for
words that are seen in the document text
– assign that “left-over” probability to the estimates
for the words that are not seen in the text
Estimating Probabilities
• Estimate for unseen words is α_D P(q_i|C)
– P(q_i|C) is the probability for query word i in the collection language model for collection C (background probability)
– α_D is a parameter
• Estimate for words that occur is
    (1 − α_D) P(q_i|D) + α_D P(q_i|C)
• Different forms of estimation come from different α_D
Jelinek-Mercer Smoothing
• α_D is a constant, λ
• Gives estimate of
    p(q_i|D) = (1 − λ) f_{q_i,D} / |D| + λ c_{q_i} / |C|
– c_{q_i} is the number of times q_i occurs in the collection; |C| is the total number of word occurrences in the collection
• Ranking score
    P(Q|D) = Π_i [ (1 − λ) f_{q_i,D} / |D| + λ c_{q_i} / |C| ]
• Use logs for convenience
    log P(Q|D) = Σ_i log[ (1 − λ) f_{q_i,D} / |D| + λ c_{q_i} / |C| ]
– avoids accuracy problems from multiplying many small numbers
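
A minimal sketch of the Jelinek-Mercer score above; the document and collection statistics passed in are assumed toy values, and λ = 0.5 is just a placeholder setting.

```python
import math

def jm_score(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """log P(Q|D) = sum_i log[(1 - lam) * f_{q_i,D}/|D| + lam * c_{q_i}/|C|]."""
    score = 0.0
    for q in query:
        p_doc = doc_tf.get(q, 0) / doc_len
        p_coll = coll_tf.get(q, 0) / coll_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0:
            return float("-inf")  # word unseen even in the collection
        score += math.log(p)
    return score

# Toy usage: "lincoln" is missing from the document but smoothing keeps p > 0.
doc = {"president": 15}
coll = {"president": 160_000, "lincoln": 2_400}
print(jm_score(["president", "lincoln"], doc, 1_800, coll, 1_000_000_000))
```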
Compare with tf.idf

– the term’s contribution is proportional to its frequency in the document and inversely proportional to its frequency in the collection
Dirichlet Smoothing
• α_D depends on document length:
    α_D = μ / (|D| + μ)
– μ is a parameter
• Gives probability estimation of
    p(q_i|D) = ( f_{q_i,D} + μ c_{q_i} / |C| ) / ( |D| + μ )
• and document score
    log P(Q|D) = Σ_i log[ ( f_{q_i,D} + μ c_{q_i} / |C| ) / ( |D| + μ ) ]
Query Likelihood Example
• For the term “president”
– f_{q_i,D} = 15, c_{q_i} = 160,000
• For the term “lincoln”
– f_{q_i,D} = 25, c_{q_i} = 2,400
• Document length |D| is assumed to be 1,800 words
• Collection length |C| is assumed to be 10⁹ words
– 500,000 documents times an average of 2,000 words
• μ = 2,000
Query Likelihood Example (contd.)
• The resulting scores are negative numbers because we are summing logs of small probabilities.
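
Plugging the numbers above into the Dirichlet-smoothed score gives the (negative) log-likelihood. This sketch uses natural logarithms, which is an assumption; the choice of base only rescales the scores and leaves the ranking unchanged.

```python
import math

mu, doc_len, coll_len = 2_000, 1_800, 1_000_000_000

def dirichlet_p(f_qD, c_q):
    """p(q_i|D) = (f_{q_i,D} + mu * c_{q_i}/|C|) / (|D| + mu)."""
    return (f_qD + mu * c_q / coll_len) / (doc_len + mu)

p_president = dirichlet_p(15, 160_000)  # ≈ 15.32 / 3800
p_lincoln = dirichlet_p(25, 2_400)      # ≈ 25.005 / 3800
score = math.log(p_president) + math.log(p_lincoln)
print(round(score, 2))  # ≈ -10.54: a negative number, as noted above
```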
Going beyond tf.idf
