4 IR Models
The Boolean Model: Example
• Given the following, determine the documents retrieved by a
Boolean-model-based IR system
• Index Terms: K1, …, K8.
• Documents:
Overlap Coefficient:
$$\frac{|Q \cap D|}{\min(|Q|,\ |D|)}$$
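As a rough illustration of the overlap coefficient, the sketch below treats the query and the document as sets of terms. The function name, the sample term sets, and the choice to return 0 for an empty set are assumptions made for this example only.

```python
def overlap_coefficient(query_terms, doc_terms):
    """Return |Q ∩ D| / min(|Q|, |D|) for two collections of terms."""
    q, d = set(query_terms), set(doc_terms)
    if not q or not d:
        # Assumption: an empty query or document has no overlap.
        return 0.0
    return len(q & d) / min(len(q), len(d))


# Hypothetical term sets, used only to exercise the function.
query = {"gold", "silver", "truck"}
document = {"shipment", "of", "gold", "arrived", "in", "a", "truck"}

# Two shared terms (gold, truck); the smaller set has three terms.
print(overlap_coefficient(query, document))
```

Because the denominator is the size of the smaller set, the score reaches 1 whenever one set is fully contained in the other.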
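Similarity Measure
• Sim(q, dj) = cos(θ), where θ is the angle between the document vector dj and the query vector q
$$\mathrm{sim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\ \sqrt{\sum_{i=1}^{n} w_{i,q}^{2}}}$$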
Terms      Counts (tf)        df   idf     Weights (tf × idf)
           Q   D1  D2  D3                  Q      D1     D2     D3
a          0   1   1   1      3    0       0      0      0      0
arrived    0   0   1   1      2    0.176   0      0      0.176  0.176
damaged    0   1   0   0      1    0.477   0      0.477  0      0
delivery   0   0   1   0      1    0.477   0      0      0.477  0
fire       0   1   0   0      1    0.477   0      0.477  0      0
gold       1   1   0   1      2    0.176   0.176  0.176  0      0.176
in         0   1   1   1      3    0       0      0      0      0
of         0   1   1   1      3    0       0      0      0      0
silver     1   0   2   0      1    0.477   0.477  0      0.954  0
shipment   0   1   0   1      2    0.176   0      0.176  0      0.176
Vector-Space Model
Terms      Q      D1     D2     D3
a          0      0      0      0
arrived    0      0      0.176  0.176
damaged    0      0.477  0      0
delivery   0      0      0.477  0
fire       0      0.477  0      0
gold       0.176  0.176  0      0.176
in         0      0      0      0
of         0      0      0      0
silver     0.477  0      0.954  0
shipment   0      0.176  0      0.176
truck      0.176  0      0.176  0.176
Vector-Space Model: Example
• Compute similarity using cosine Sim(q, d1)
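A minimal sketch of the cosine computation, using the weights from the Terms / Q / D1 / D2 / D3 table. The names `TERMS`, `WEIGHTS`, and `cosine` are chosen for this example; the weights are copied from the table as rounded to three decimals, so the scores are approximate.

```python
import math

TERMS = ["a", "arrived", "damaged", "delivery", "fire", "gold",
         "in", "of", "silver", "shipment", "truck"]

# One weight per entry of TERMS, in the same order as the table rows.
WEIGHTS = {
    "Q":  [0, 0, 0, 0, 0, 0.176, 0, 0, 0.477, 0, 0.176],
    "D1": [0, 0, 0.477, 0, 0.477, 0.176, 0, 0, 0, 0.176, 0],
    "D2": [0, 0.176, 0, 0.477, 0, 0, 0, 0, 0.954, 0, 0.176],
    "D3": [0, 0.176, 0, 0, 0, 0.176, 0, 0, 0, 0.176, 0.176],
}


def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


scores = {name: cosine(WEIGHTS["Q"], WEIGHTS[name]) for name in ("D1", "D2", "D3")}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"Sim(q, {name}) = {scores[name]:.4f}")
```

With these rounded weights the scores come out at roughly 0.08 for D1, 0.82 for D2, and 0.33 for D3, so the ranking is D2, D3, D1.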
• Disadvantages:
• assumes independence of index terms (??)
Exercise 1
Suppose the database collection consists of the following
documents.
c1: Human machine interface for Lab ABC computer
applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measure
Query:
Find documents relevant to "human computer
Exercise 2
• Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
– Draw the term-document incidence matrix for this document
collection.
– Draw the inverted index representation for this collection.
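One way to check a hand-drawn answer is to build both structures programmatically. The sketch below assumes plain whitespace tokenization with no stop-word removal or stemming; the variable names are chosen for this example only.

```python
docs = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}

# Assumption: split on whitespace only; every word is an index term.
tokens = {doc_id: text.split() for doc_id, text in docs.items()}
doc_ids = sorted(docs)
vocabulary = sorted({term for words in tokens.values() for term in words})

# Term-document incidence matrix: one row per term, one column per document,
# 1 if the term occurs in the document and 0 otherwise.
incidence = {
    term: [1 if term in tokens[doc_id] else 0 for doc_id in doc_ids]
    for term in vocabulary
}

# Inverted index: each term maps to the sorted list of documents containing it.
inverted_index = {
    term: [doc_id for doc_id in doc_ids if term in tokens[doc_id]]
    for term in vocabulary
}

for term in vocabulary:
    print(f"{term:<14}{incidence[term]}  ->  {inverted_index[term]}")
```

The two structures carry the same information; the inverted index simply stores the positions of the 1s in each row, which is far more compact when the matrix is sparse.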
$$\log O(R \mid D) = \log \frac{p(R \mid D)}{p(\bar{R} \mid D)} = \log \frac{p(R)\, p(D \mid R)}{p(\bar{R})\, p(D \mid \bar{R})}$$
where $p(\bar{R} \mid D) = 1 - p(R \mid D)$.
Probabilistic Models
• Most probabilistic models based on combining probabilities of
relevance and non-relevance of individual terms
– Probability that a term will appear in a relevant document
– Probability that the term will not appear in a non‐relevant
document
• These probabilities are estimated based on counting term
appearances in document descriptions
• Retrieval Status Value (rsv)
– D is a vector of binary term occurrences
– We assume that terms occur independently of each other
Principles surrounding weights
• Independence Assumptions
– I1: The distribution of terms in relevant documents is
independent and their distribution in all documents is
independent.
– I2: The distribution of terms in relevant documents is
independent and their distribution in non-relevant documents
is independent.
• Ordering Principles
– O1: Probable relevance is based only on the presence of search
terms in the documents.
– O2: Probable relevance is based on both the presence of
search terms in documents and their absence from documents.
Computing term probabilities
• Initially, there are no retrieved documents
– R is completely unknown
– Assume P(ti|R) is constant (usually 0.5)
– Assume P(ti|NR) is approximated by the distribution of ti
across the collection, which gives an IDF-like weight
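The sketch below shows why these initial assumptions lead to an IDF-like weight. It uses the standard binary independence log-odds term weight, log[p(1 - u) / (u(1 - p))], which is not spelled out on the slide and is assumed here; the function name and the collection sizes in the usage lines are hypothetical.

```python
import math


def initial_term_weight(n_i, N, p_rel=0.5):
    """Log-odds weight of term ti before any relevance information exists.

    n_i   : number of documents containing ti
    N     : number of documents in the collection
    p_rel : assumed constant P(ti|R), 0.5 by default
    """
    if not 0 < n_i < N:
        raise ValueError("need 0 < n_i < N for a finite weight")
    u = n_i / N  # P(ti|NR) approximated by the collection frequency of ti
    return math.log10((p_rel * (1 - u)) / (u * (1 - p_rel)))


# Hypothetical collection of 1000 documents.
rare = initial_term_weight(10, 1000)     # equals log10(990 / 10)
common = initial_term_weight(500, 1000)  # equals log10(500 / 500) = 0
print(rare, common, math.log10(1000 / 10))
```

With P(ti|R) = 0.5 the weight reduces to log((N - n_i) / n_i), which is close to the usual IDF log(N / n_i) whenever the term is rare.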