Chapter 5 IR
Chapter Five
IR models
Target Group: IT 3rd year students
Injibara, Ethiopia
IR Models - Basic Concepts
Word evidence:
IR systems usually adopt index terms to index and retrieve
documents
Each document is represented by a set of representative keywords
or index terms (called Bag of Words)
Not all terms are equally useful for representing the document
contents:
less frequent terms allow identifying a narrower set of documents
       T1    T2    …    TN
D1     w11   w12   …    w1N
D2     w21   w22   …    w2N
:      :     :          :
DM     wM1   wM2   …    wMN
- Note that no weights between 0 and 1 are assigned; each weight is either 0 or 1.
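The 0/1 weighting in the matrix above can be sketched in Python. This is only an illustration: the three toy documents and the helper name `binary_matrix` are hypothetical and do not come from the slides. Each weight w_ij is 1 if term Tj occurs in document Di and 0 otherwise.

```python
# Toy collection (hypothetical documents, for illustration only).
docs = {
    "D1": "information retrieval system",
    "D2": "database system",
    "D3": "information system design",
}

def binary_matrix(docs):
    """Return (terms, matrix); matrix[name][j] is 1 if terms[j] occurs in that document, else 0."""
    # The index terms are all distinct words in the collection, in sorted order.
    terms = sorted({word for text in docs.values() for word in text.split()})
    matrix = {}
    for name, text in docs.items():
        words = set(text.split())
        matrix[name] = [1 if term in words else 0 for term in terms]
    return terms, matrix

terms, matrix = binary_matrix(docs)
print(terms)
for name, row in matrix.items():
    print(name, row)
```

Each row is the bag-of-words representation of one document reduced to presence or absence, which is all the Boolean model uses.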
The Boolean Model:
A Boolean query is an expression of keywords connected by AND, OR, and
NOT, with brackets used to indicate scope.
The Boolean Model: Example
Given the following, determine the documents retrieved by a
Boolean-model-based IR system
• Index Terms: K1, …,K8.
• Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5})
= {D1, D2, D6}
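The example can be checked with Python sets. This sketch reads the query as K1 AND (K2 OR NOT K3), which is the reading consistent with the three answer sets listed above. The helper name `having` is hypothetical.

```python
# Documents from the example, each represented as a set of index terms.
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}

def having(term):
    """Names of the documents that contain the given index term."""
    return {name for name, terms in docs.items() if term in terms}

all_docs = set(docs)

# AND is set intersection, OR is set union, NOT is the complement within the collection.
result = having("K1") & (having("K2") | (all_docs - having("K3")))
print(sorted(result))  # ['D1', 'D2', 'D6']
```

D4 contains K1 but is dropped because it has K3 and lacks K2, so it satisfies neither side of the OR.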
Exercise
Given the following four documents with the following contents:
– D2 = “computer retrieval”
– D3 = “information”
– D4 = “computer information”
– Q1 = “information ∧ retrieval”
– Q2 = “information ∧ ¬computer”
Drawbacks of the Boolean Model
• Exact-match only, no partial matches
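Similarity between document d_j and query q (cosine of the angle between their weight vectors):
$$\mathrm{sim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{i,q}^{2}}}$$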
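A minimal sketch of the similarity computation sim(d_j, q), assuming it is the usual cosine measure over term weights (dot product divided by the product of the vector lengths). The weight vectors and the function name `cosine_sim` are hypothetical.

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # convention chosen in this sketch for an all-zero vector
    return dot / (norm_d * norm_q)

# Hypothetical weights over three index terms.
d1 = [1.0, 1.0, 0.0]
d2 = [0.0, 0.0, 1.0]
q = [1.0, 0.0, 0.0]

print(cosine_sim(d1, q))  # partial match: shares one term with the query
print(cosine_sim(d2, q))  # no shared term
```

Unlike the Boolean model, a document that matches only part of the query still receives a non-zero score, so documents can be ranked.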
• Disadvantages:
• assumes independence of index terms (??)
Probabilistic Model
• IR is an uncertain process
– Mapping an information need to a query is not perfect
– Mapping documents to index terms gives only a logical representation
– Query terms and index terms often do not match
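Bayes' rule for the probability of relevance R given document D, where \bar{R} denotes non-relevance:
$$p(R \mid D) = \frac{p(R)\,p(D \mid R)}{p(D)} = \frac{p(R)\,p(D \mid R)}{p(R)\,p(D \mid R) + p(\bar{R})\,p(D \mid \bar{R})}$$
$$\log O(R \mid D) = \log \frac{p(R \mid D)}{p(\bar{R} \mid D)} = \log \frac{p(R)\,p(D \mid R)}{p(\bar{R})\,p(D \mid \bar{R})}$$
$$p(\bar{R} \mid D) = 1 - p(R \mid D)$$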
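A small numeric sketch, assuming the expressions above are Bayes' rule for p(R | D) and the corresponding log-odds of relevance. The probabilities used are hypothetical, and the function names are chosen here for illustration.

```python
import math

def posterior_relevance(p_r, p_d_given_r, p_d_given_nr):
    """p(R | D) by Bayes' rule, using p(not R) = 1 - p(R)."""
    num = p_r * p_d_given_r
    den = num + (1.0 - p_r) * p_d_given_nr
    return num / den

def log_odds(p_r, p_d_given_r, p_d_given_nr):
    """log O(R | D): log of p(R)p(D|R) over p(not R)p(D|not R)."""
    return math.log((p_r * p_d_given_r) / ((1.0 - p_r) * p_d_given_nr))

# Hypothetical values: prior p(R) = 0.2, p(D | R) = 0.5, p(D | not R) = 0.1.
p = posterior_relevance(0.2, 0.5, 0.1)
print(p)
print(log_odds(0.2, 0.5, 0.1))
```

The log-odds equals log(p / (1 - p)) for the posterior p, and it is positive exactly when the document is more likely relevant than not.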
Probabilistic Models
Most probabilistic models are based on combining the probabilities of
relevance and non-relevance of individual terms
– Probability that a term will appear in a relevant document
– Probability that the term will not appear in a non‐relevant
document
These probabilities are estimated based on counting term
appearances in document descriptions
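Estimating these probabilities by counting can be sketched as follows. The judged toy collection and the function name `term_probabilities` are hypothetical, and the sketch applies no smoothing, so an estimate can come out as exactly 0 or 1.

```python
# Hypothetical judged collection: (set of terms in the document, is_relevant).
judged = [
    ({"apple", "banana"}, True),
    ({"apple", "cherry"}, True),
    ({"banana", "cherry"}, False),
    ({"cherry"}, False),
    ({"apple"}, False),
]

def term_probabilities(judged, term):
    """Return (P(term present | relevant), P(term absent | non-relevant)) from raw counts."""
    relevant = [terms for terms, is_rel in judged if is_rel]
    non_relevant = [terms for terms, is_rel in judged if not is_rel]
    p_present_rel = sum(term in terms for terms in relevant) / len(relevant)
    p_absent_nonrel = sum(term not in terms for terms in non_relevant) / len(non_relevant)
    return p_present_rel, p_absent_nonrel

print(term_probabilities(judged, "apple"))
print(term_probabilities(judged, "cherry"))
```

A term scores well as evidence of relevance when both numbers are high: it tends to appear in relevant documents and tends to be missing from non-relevant ones.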
• Retrieval Status Value (RSV)
– D is a vector of binary term occurrences
– We assume that terms occur independently of each other
Principles surrounding weights
• Independence Assumptions
– R is completely unknown
• q1 = eat
• q2 = porridge
• q3 = hot porridge
• q4 = eat nine day old porridge
Improving the Ranking
• Now, suppose
– we have shown the initial ranking to the user
• We now have
– N documents in the collection, of which R are known to be relevant
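$$w = \log \frac{(r + 0.5)\,(N - n - R + r + 0.5)}{(n - r + 0.5)\,(R - r + 0.5)}$$
where n is the number of documents containing the term and r is the number of known relevant documents containing it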
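A sketch of the relevance weight, assuming the expression above is the Robertson/Sparck Jones weight with 0.5 smoothing, in which (n - r + 0.5)(R - r + 0.5) is the denominator. The meanings given to n and r and the counts used below are assumptions made for illustration.

```python
import math

def relevance_weight(N, n, R, r):
    """Term weight with 0.5 smoothing.

    N: documents in the collection
    n: documents containing the term (assumed meaning)
    R: known relevant documents
    r: known relevant documents containing the term (assumed meaning)
    """
    num = (r + 0.5) * (N - n - R + r + 0.5)
    den = (n - r + 0.5) * (R - r + 0.5)
    return math.log(num / den)

# Hypothetical counts: 5 documents, the term occurs in 2, 1 known relevant document contains it.
print(relevance_weight(N=5, n=2, R=1, r=1))  # log(7)

# With no relevance information yet (R = r = 0) the weight depends only on N and n.
print(relevance_weight(N=5, n=2, R=0, r=0))  # log(3.5 / 2.5)
```

The 0.5 terms keep the weight finite when a count is zero, which is the usual situation before any relevance feedback has been collected.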
Relevance weighted Example
• q3 = hot porridge
• Document 2 is relevant
Probabilistic Retrieval Example
• D1: “Cost of paper is up.” (relevant)