IR Presentation 2

Uploaded by Jawad Abid

Introduction to Information Retrieval
Faster postings merges:
Skip pointers/Skip lists
Limitations of Boolean Retrieval Model
• Thus far, our queries have all been Boolean: documents either match or don't.
• Good for expert users with a precise understanding of their needs and the collection.
• Also good for applications: applications can easily consume 1000s of results.
• Not good for the majority of users: writing Boolean queries is hard.
Limitations of Boolean Retrieval Model
• Exact matching may retrieve too few (∼ 0) or too many (∼ 1000) documents.
• Query 1: "standard user dlink 650" → 200,000 hits
• Query 2: "standard user dlink 650 no card found" → 0 hits
• It takes a lot of skill to come up with a query that produces a manageable number of hits: AND gives too few, OR gives too many.
Ranked Retrieval

• Ranks documents by relevance to the query.
• Returns the top k documents by relevance.
• Allows free text queries: rather than a query language of operators and expressions, queries are just words of a human language.
Ranked Retrieval
Very large/small: No issue in Ranked Retrieval

• When the system produces a ranked result set, the size of the result set is not an issue.
• Just show the top k (k ∼ 10, 20, 100) results.
• Don't overwhelm the user.
• Premise: the ranking algorithm works.


Scoring as Basis of Ranked Retrieval

• We wish to return, in order, the documents most likely to be useful to the searcher.
• How do we rank-order the documents in a collection with respect to a query?
• Assign a score – say in [0, 1] – to each document.
• This score measures how well the document and query "match".
Scoring as Basis of Ranked Retrieval

• We need a way of assigning a score to a query/document pair.
• If the query term does not occur in the document, the score should be 0.
• The more frequent the query term is in the document, the higher the score should be.
Jaccard coefficient

• jaccard(A,B) = |A ∩ B| / |A ∪ B|
• jaccard(A,A) = 1
• jaccard(A,B) = 0 if A ∩ B = ∅
• Always assigns a number between 0 and 1.
Jaccard coefficient: Scoring Example

• What is the query–document match score that the Jaccard coefficient computes for each of the two documents below?

• Query: "ides of march" (3 terms)
• D1: "Caesar died in March" (4 terms)
• D2: "the long March" (3 terms)

• J(q, d1) = 1/6 ≈ 0.17
• J(q, d2) = 1/5 = 0.2
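The two scores above can be reproduced in a few lines of Python (a sketch; treating query and documents as sets of lowercased, whitespace-separated words is an assumed tokenization):

```python
# Jaccard coefficient between two token sets:
# |A ∩ B| / |A ∪ B|, always in [0, 1].
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

q  = "ides of march".lower().split()         # {ides, of, march}
d1 = "Caesar died in March".lower().split()  # {caesar, died, in, march}
d2 = "the long March".lower().split()        # {the, long, march}

print(jaccard(q, d1))  # 1/6: only "march" is shared, union has 6 terms
print(jaccard(q, d2))  # 0.2: one shared term out of a 5-term union
```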
Issues with Jaccard for scoring

• It privileges shorter documents; we need a more sophisticated way of normalizing for length, e.g. |A ∩ B| / √|A ∪ B|.
• It doesn't consider term frequency – how many times a term occurs in a document.
• Rare terms in a collection are more informative than frequent terms; Jaccard doesn't consider this information.


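The length-normalization variant mentioned above can be sketched as follows (dividing the overlap by the square root of the union size, so large unions from long documents are penalized less harshly than under plain Jaccard):

```python
import math

# Length-normalized overlap from the slide: |A ∩ B| / sqrt(|A ∪ B|).
def jaccard_sqrt(a, b):
    a, b = set(a), set(b)
    return len(a & b) / math.sqrt(len(a | b))

# One shared term out of a 4-term union:
print(jaccard_sqrt(["x"], ["x", "y", "z", "w"]))  # 0.5 (plain Jaccard: 0.25)
```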
Binary Term Incidence Matrix

• Each document is represented by a binary vector ∈ {0,1}^|V|.

Count Matrix

• Each document can likewise be represented by a count vector, recording how often each term occurs in it.
Bag of Words
• We do not consider the order of words in a document.
• Documents containing the same words in a different order are represented the same way.
• This is called the bag of words model.
• In a sense, this is a step back: the positional index was able to distinguish such documents.
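The bag-of-words representation can be sketched with Python's `collections.Counter`; the two example sentences below are illustrative (the slide's own example is not preserved in this text):

```python
from collections import Counter

# Bag of words: word order is discarded, only per-term counts remain,
# so two documents with the same words in different order are identical.
def bag_of_words(text):
    return Counter(text.lower().split())

d1 = "John is quicker than Mary"
d2 = "Mary is quicker than John"
print(bag_of_words(d1) == bag_of_words(d2))  # True
```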
Term frequency tf
• The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
• We want to use tf when computing query–document match scores. But how?
• A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
• But it is not 10 times more relevant: relevance does not increase proportionally with term frequency.
• So raw term frequency is not what we want.
Instead of raw frequency: Log frequency weighting
Desired weight for rare terms
Document frequency
idf weight
idf weight: Example
Effect of idf on ranking
Collection frequency vs. Document frequency
tf-idf weighting

• Combine the term frequency and


document frequency to produce a
composite weight for each term in
each document
Binary → Count → Weight Matrix
tf-idf weighting

• Increases with the number of occurrences in the document.

• Increases with the rarity of the term in the collection.

• Lowest when the term occurs in virtually all docs


Score of a Document

• Score(q, d) = Σ_{t ∈ q ∩ d} tf-idf_{t,d}
• The score of document d is the sum, over all query terms that occur in d, of the tf-idf weight of each of those terms in d.
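A minimal sketch of this score, assuming the log-tf × idf weighting scheme from the earlier slides; the collection statistics below (N, df, term counts) are hypothetical, for illustration only:

```python
import math

def tf_idf(tf, df, N):
    # tf-idf weight = (1 + log10(tf)) × log10(N / df); 0 if the term is absent.
    w_tf = 1 + math.log10(tf) if tf > 0 else 0.0
    return w_tf * math.log10(N / df)

def score(query_terms, doc_tf, df, N):
    # Score(q, d): sum of tf-idf weights over query terms that occur in d.
    return sum(tf_idf(doc_tf[t], df[t], N)
               for t in query_terms if t in doc_tf)

N  = 1000                           # documents in the (toy) collection
df = {"caesar": 10, "march": 100}   # document frequencies
doc_tf = {"caesar": 2, "march": 1}  # term counts in document d
print(score(["caesar", "march"], doc_tf, df, N))
```

Terms absent from the document contribute 0, so summing over q ∩ d and summing over all of q with a zero weight for missing terms give the same result.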
