IR - Unit 2
Topics
▪ Parametric and Zone indexes
▪ Ranked retrieval
▪ Scoring documents
▪ Term frequency and Weighting
▪ Collection statistics
▪ Weighting schemes
▪ Vector space model for scoring
▪ Variants of tf-idf functions
▪ Components of an “Information Retrieval System”
Ranked retrieval
▪ Thus far, our queries have all been Boolean.
▪ Documents either match or don’t.
▪ Good for expert users with precise understanding of their
needs and the collection
▪ Not good for the majority of users.
▪ Most users are incapable of writing Boolean queries (or
they are capable, but think it’s too much work).
▪ Most users don’t want to wade through 1000s of
results.
▪ This is particularly true of web search.
Jaccard coefficient
▪ A commonly used measure of overlap of two sets A and B
▪ jaccard(A,B) = |A ∩ B| / |A ∪ B|
▪ jaccard(A,A) = 1
▪ jaccard(A,B) = 0 if A ∩ B = ∅
▪ A and B don’t have to be the same size.
▪ Always assigns a number between 0 and 1.
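As an illustration, here is a minimal Python sketch of the Jaccard coefficient applied to a query and documents treated as sets of terms (the function name and the sample texts are hypothetical):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0  # convention for two empty sets
    return len(a & b) / len(a | b)

# Query and documents reduced to sets of terms (hypothetical example)
query = set("ides of march".split())
doc1 = set("caesar died in march".split())
doc2 = set("the long march".split())

print(jaccard(query, doc1))  # 1/6 ≈ 0.167
print(jaccard(query, doc2))  # 1/5 = 0.2
```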
Binary term-document incidence matrix (1 if the play contains the term, 0 otherwise):

Term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony               1                   1              0           0         0         1
Brutus               1                   1              0           1         0         0
Caesar               1                   1              0           1         1         1
Calpurnia            0                   1              0           0         0         0
Cleopatra            1                   0              0           0         0         0
mercy                1                   0              1           1         1         1
worser               1                   0              1           1         1         0
Term-document count matrix (number of occurrences of the term in each play):

Term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony              157                  73              0           0         0         0
Brutus                4                 157              0           1         0         0
Caesar              232                 227              0           2         1         1
Calpurnia             0                  10              0           0         0         0
Cleopatra            57                   0              0           0         0         0
mercy                 2                   0              3           5         5         1
worser                2                   0              1           1         1         0
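A minimal Python sketch of how such a term-document count matrix can be built from raw text (the two toy documents and variable names are hypothetical):

```python
from collections import Counter

# Toy collection (hypothetical): document name -> raw text
docs = {
    "d1": "antony loved cleopatra and caesar",
    "d2": "brutus killed caesar caesar",
}

# Count term occurrences per document
counts = {name: Counter(text.split()) for name, text in docs.items()}
vocabulary = sorted({term for c in counts.values() for term in c})

# Print the term-document count matrix
print("term".ljust(12) + "".join(name.rjust(6) for name in docs))
for term in vocabulary:
    print(term.ljust(12) + "".join(str(counts[d][term]).rjust(6) for d in docs))
```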
Term frequency tf
▪ The term frequency tf_{t,d} of term t in document d is defined
as the number of times that t occurs in d.
▪ We want to use tf when computing query-document match
scores. But how?
▪ Raw term frequency is not what we want:
▪ A document with 10 occurrences of the term is more
relevant than a document with 1 occurrence of the term.
▪ But not 10 times more relevant.
▪ Relevance does not increase proportionally with term
frequency.
Log-frequency weighting
▪ The log-frequency weight of term t in d is

      w_{t,d} = 1 + log10(tf_{t,d})   if tf_{t,d} > 0
      w_{t,d} = 0                     otherwise

▪ 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
▪ Score for a document-query pair: sum over terms t appearing in both q and d:

      score(q,d) = Σ_{t ∈ q∩d} (1 + log10 tf_{t,d})
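A minimal Python sketch of the log-frequency weight and the resulting overlap score, assuming the document is already tokenized into a list of terms (function names are hypothetical):

```python
import math
from collections import Counter

def log_tf_weight(tf: int) -> float:
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def overlap_score(query_terms: list[str], doc_terms: list[str]) -> float:
    """Sum log-frequency weights over terms that occur in both query and document."""
    tf = Counter(doc_terms)
    return sum(log_tf_weight(tf[t]) for t in set(query_terms) if tf[t] > 0)

doc = "to be or not to be".split()
print(log_tf_weight(10))                       # 2.0
print(overlap_score(["be", "question"], doc))  # 1 + log10(2) ≈ 1.301
```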
Document frequency
▪ Rare terms are more informative than frequent terms
▪ Recall stop words
▪ Consider a term in the query that is rare in the collection
(e.g., arachnocentric)
▪ A document containing this term is very likely to be relevant
to the query arachnocentric
▪ → We want a high weight for rare terms like arachnocentric.
idf weight
▪ df_t is the document frequency of t: the number of
documents that contain t
▪ df_t is an inverse measure of the informativeness of t
▪ df_t ≤ N, where N is the total number of documents in the collection
▪ We define the idf (inverse document frequency) of t by

      idf_t = log10(N/df_t)

▪ We use log10(N/df_t) instead of N/df_t to “dampen” the effect
of idf.
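A minimal Python sketch of idf, assuming a hypothetical collection of N = 1,000,000 documents and illustrative df values:

```python
import math

def idf(df: int, n_docs: int) -> float:
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(n_docs / df)

N = 1_000_000  # hypothetical collection size
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(term, idf(df, N))
# calpurnia 6.0, animal 4.0, sunday 3.0, fly 2.0, under 1.0, the 0.0
```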
tf-idf weighting
▪ The tf-idf weight of a term is the product of its tf weight and
its idf weight.
      w_{t,d} = log(1 + tf_{t,d}) × log10(N/df_t)
▪ Best known weighting scheme in information retrieval
▪ Note: the “-” in tf-idf is a hyphen, not a minus sign!
▪ Alternative names: tf.idf, tf x idf
▪ Increases with the number of occurrences of term within a
document
▪ Increases with the rarity of the term in the collection
Score for a document given a query:

      Score(q,d) = Σ_{t ∈ q∩d} tf-idf_{t,d}
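A minimal Python sketch that combines tf and idf to score documents for a query under this scheme (the toy corpus and function names are hypothetical):

```python
import math
from collections import Counter

# Hypothetical toy corpus: document id -> tokenized text
corpus = {
    "d1": "caesar and brutus in rome".split(),
    "d2": "brutus killed caesar caesar".split(),
    "d3": "rome was not built in a day".split(),
}
N = len(corpus)
df = Counter(term for terms in corpus.values() for term in set(terms))

def tf_idf(term: str, doc_terms: list[str]) -> float:
    """tf-idf weight: log10(1 + tf) * log10(N / df)."""
    tf = doc_terms.count(term)
    if tf == 0 or df[term] == 0:
        return 0.0
    return math.log10(1 + tf) * math.log10(N / df[term])

def score(query: str, doc_terms: list[str]) -> float:
    """Sum tf-idf weights of query terms that appear in the document."""
    return sum(tf_idf(t, doc_terms) for t in set(query.split()))

for doc_id, terms in corpus.items():
    print(doc_id, round(score("brutus caesar", terms), 3))
```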
Documents as vectors
▪ Each document is now represented as a real-valued vector of tf-idf weights, so we have a |V|-dimensional vector space
▪ Terms are axes of the space
▪ Documents are points or vectors in this space
Queries as vectors
▪ Key idea 1: Do the same for queries: represent queries as
vectors in the space
▪ Key idea 2: Rank documents according to their proximity
to the query in this space
▪ proximity = similarity of vectors
▪ proximity ≈ inverse of distance
▪ Recall: We do this because we want to get away from the
you’re-either-in-or-out Boolean model.
▪ Instead: rank more relevant documents higher than less
relevant documents
cosine(query, document)
▪ Dot product of unit vectors:

      cos(q,d) = (q • d) / (|q| |d|)
               = Σ_{i=1}^{|V|} q_i d_i / ( √(Σ_{i=1}^{|V|} q_i²) × √(Σ_{i=1}^{|V|} d_i²) )

▪ q_i is the tf-idf weight of term i in the query
▪ d_i is the tf-idf weight of term i in the document
▪ cos(q,d) is the cosine similarity of q and d; for q, d length-normalized it is simply the dot product q • d.
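A minimal Python sketch of cosine similarity between two weight vectors over the same vocabulary (the weight values are hypothetical):

```python
import math

def cosine(q: list[float], d: list[float]) -> float:
    """Cosine similarity: dot(q, d) / (|q| * |d|)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

# Hypothetical tf-idf weight vectors over a 4-term vocabulary
query = [1.0, 0.0, 1.0, 0.0]
doc   = [0.8, 0.3, 0.6, 0.0]
print(round(cosine(query, doc), 3))  # ≈ 0.948
```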
▪ Example: similarity of three novels, SaS (Sense and Sensibility), PaP (Pride and Prejudice) and WH (Wuthering Heights), using length-normalized log-frequency weights of the terms affection, jealous, gossip and wuthering:

      cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
      cos(SaS,WH) ≈ 0.79
      cos(PaP,WH) ≈ 0.69

▪ Why do we have cos(SaS,PaP) > cos(SaS,WH)?
Points to note
▪ A document may have a high cosine similarity score for a
query, even if it does not contain all terms in the query
▪ How can we speed up vector space retrieval?
▪ Store the inverse document frequency (e.g., N/df_t) at the head of the postings list for term t
▪ Store the term frequency tf_{t,d} in each entry of the postings list for term t
▪ For a multi-word query, the postings lists of the various query terms can even be traversed concurrently (a sketch of this layout follows below)
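As a rough sketch of the layout described above, the following hypothetical in-memory index keeps an idf value with each term and a tf in every postings entry, and scores a query term-at-a-time (all names and the toy index are assumptions for illustration):

```python
import math
from collections import defaultdict

# Hypothetical index: term -> (idf, postings), postings = list of (doc_id, tf)
index = {
    "brutus": (math.log10(3 / 2), [("d1", 1), ("d2", 1)]),
    "caesar": (math.log10(3 / 2), [("d1", 1), ("d2", 2)]),
    "rome":   (math.log10(3 / 2), [("d1", 1), ("d3", 1)]),
}

def retrieve(query_terms):
    """Term-at-a-time scoring: accumulate tf-idf contributions per document."""
    scores = defaultdict(float)
    for term in query_terms:
        if term not in index:
            continue
        idf, postings = index[term]      # idf read once, at the head of the list
        for doc_id, tf in postings:      # tf stored in each postings entry
            scores[doc_id] += (1 + math.log10(tf)) * idf
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

print(retrieve(["brutus", "caesar"]))  # ranks d2 above d1
```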