IR Presentation 2
Information Retrieval
Faster postings merges:
Skip pointers/Skip lists
Limitations of Boolean Retrieval Model
• Thus far, our queries have all been Boolean.
• Documents either match or don’t.
• Good for expert users with a precise understanding of their needs
and the collection.
• Also good for applications: Applications can easily consume 1000s of
results.
• Not good for the majority of users.
• Writing Boolean queries is hard
Limitations of Boolean Retrieval Model
• Exact matching may retrieve too few (∼ 0) or too many documents (∼
1000)
• When the system produces a ranked result set, a large result set
is not an issue: the best-matching documents appear first.
• Ranking requires assigning a score to each document.
• This score measures how well the document and query “match”.
Scoring as Basis of Ranked Retrieval
• If the query term does not occur in the document: score should be
0
• The more frequent the query term in the document, the higher the
score (should be)
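As a minimal sketch of a score with these two properties, the Python below sums the raw counts of the query terms in a document. The function name `tf_score` and the whitespace tokenization are assumptions made for illustration; the slides do not prescribe this exact formula.

```python
from collections import Counter


def tf_score(query, document):
    """Sum of raw counts of each distinct query term in the document.

    Illustrative only: lowercases and splits on whitespace, which is an
    assumption, not a tokenizer defined in the slides.
    """
    counts = Counter(document.lower().split())
    # Counter returns 0 for a missing term, so absent query terms add nothing.
    return sum(counts[term] for term in set(query.lower().split()))


print(tf_score("ides", "the long march"))          # term absent
print(tf_score("march", "the long march"))         # term occurs once
print(tf_score("march", "march comes after march"))  # term occurs twice
```

The score is 0 when the query term is absent and grows with each additional occurrence, which is all the two bullets above ask for.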
Jaccard coefficient
• jaccard(A,B) = |A ∩ B| / |A ∪ B|
• jaccard(A,A) = 1
• jaccard(A,B) = 0 if A ∩ B = ∅
• Always assigns a number between 0
and 1.
Jaccard coefficient: Scoring Example
Query: ides of march (3 terms)
D1: Caesar died in March (4 terms)
D2: the long March (3 terms)
J(q,d1) = 1/6 ≈ 0.17
J(q,d2) = 1/5 = 0.2
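The example can be checked with a short Python sketch that treats the query and each document as a set of lowercased terms. The helper names `tokenize` and `jaccard`, the whitespace tokenization, and the choice to return 0 for two empty sets are assumptions made for illustration.

```python
def tokenize(text):
    """Lowercase and split on whitespace; returns the set of distinct terms."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    if not union:
        # Assumption: define the coefficient of two empty sets as 0
        # to avoid dividing by zero.
        return 0.0
    return len(a & b) / len(union)


query = tokenize("ides of march")
d1 = tokenize("Caesar died in March")
d2 = tokenize("the long March")

# Only "march" is shared; the unions have 6 and 5 distinct terms.
print(round(jaccard(query, d1), 2))  # 0.17
print(round(jaccard(query, d2), 2))  # 0.2
```

D2 scores higher than D1 only because it is shorter: both documents share exactly one term with the query.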
Issues with Jaccard for scoring