
IR models

IR Models - Basic Concepts

• Word evidence:
 IR systems usually adopt index terms to index and
retrieve documents
 Each document is represented by a set of representative
keywords or index terms (called Bag of Words)
• An index term is a document word useful for remembering the document's main themes
• Not all terms are equally useful for representing
the document contents:
 less frequent terms allow identifying a narrower set of
documents
• But no ordering information is attached to the Bag of Words
identified from the document collection.
IR Models - Basic Concepts
•One central problem regarding IR systems is
the issue of predicting the degree of relevance
of documents for a given query
•Such a decision is usually dependent on a
ranking algorithm which attempts to
establish a simple ordering of the
documents retrieved
•Documents appearing at the top of this
ordering are considered to be more likely to
be relevant
•Thus ranking algorithms are at the core of IR
systems
•The IR models determine the predictions of what is relevant and what is not, based on the notion of relevance implemented by the system
IR Models - Basic Concepts
• After preprocessing, N distinct terms (Bag of words) remain
which are unique terms that form the VOCABULARY
• Let
– ki be an index term i & dj be a document j
– K = (k1, k2, …, kN) is the set of all index terms
• Each term, i, in a document or query j, is given a real-valued
weight, wij.
– wij is a weight associated with (ki,dj). If wij = 0 , it
indicates that term does not belong to document dj
• The weight wij quantifies the importance of the index term
for describing the document contents
• vec(dj) = (w1j, w2j, …, wNj) is a weighted vector associated with the document dj
Mapping Documents &
Queries
• Represent both documents and queries as N-dimensional
vectors in a term-document matrix, which shows
occurrence of terms in the document collection or query
– E.g. dj = (t1,j, t2,j, …, tN,j); qk = (t1,k, t2,k, …, tN,k)
• An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term doesn’t exist in the document.

       T1    T2    …   TN
 D1    w11   w12   …   w1N
 D2    w21   w22   …   w2N
 :     :     :         :
 DM    wM1   wM2   …   wMN
 Qi    wi1   wi2   …   wiN

– Document collection is mapped to a term-by-document matrix
– View each document (and query) as a vector in multidimensional space
• Nearby vectors are related
– Normalize for vector length to avoid the effect of document length
Weighting Terms in Vector
Space
• The importance of the index terms is represented by
weights associated to them
• Problem: to show the importance of the index term for
describing the document/query contents, what weight we
can assign?
• Solution 1: Binary weights: wij = 1 if term i is present in document j, 0 otherwise
– Similarity: number of terms in common
• Problem: Not all terms equally interesting
– E.g. the vs. dog vs. cat
• Solution: Replace binary weights with non-binary weights
dj = (w1,j, w2,j, …, wN,j); qk = (w1,k, w2,k, …, wN,k)
How to evaluate Models?
• We need to investigate what procedures they follow and what techniques they use:
– Are they using binary or non-binary weighting for measuring the importance of terms in documents?
– Are they using similarity measurements?
– Are they applying partial matching?
– Are they performing Exact matching or Best
matching for document retrieval?
– Any Ranking mechanism?
The Boolean Model
•Boolean model is a simple model based on
set theory
•The Boolean model imposes a binary
criterion for deciding relevance
•Terms are either present or absent. Thus, wij ∈ {0,1}
•sim(q,dj) = 1 if the document satisfies the boolean query, 0 otherwise
– Note that no weights are assigned in-between 0 and 1; only the values 0 or 1 are used

       T1    T2    …   TN
 D1    w11   w12   …   w1N
 D2    w21   w22   …   w2N
 :     :     :         :
 DM    wM1   wM2   …   wMN
The Boolean Model: Example
•Generate the relevant documents
retrieved by the Boolean model for the
query :
q = k1  (k2  k3)
k2
k1
d7
d2 d6
d4 d5
d3
d1

k3
The Boolean Model: Example
• Given the following determine documents retrieved by the
Boolean model based IR system
• Index Terms: K1, …,K8.
• Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5}) = {D1, D2, D6}
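To make the set operations concrete, here is a minimal Python sketch of Boolean retrieval over this small collection (the dictionary layout and the helper function are illustrative, not part of the slides):

```python
# Minimal Boolean retrieval: each document is modelled as a set of index terms.
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}

def containing(term):
    """Documents whose bag of words contains `term`."""
    return {d for d, terms in docs.items() if term in terms}

all_docs = set(docs)

# Query: K1 AND (K2 OR NOT K3)
answer = containing("K1") & (containing("K2") | (all_docs - containing("K3")))
print(sorted(answer))  # ['D1', 'D2', 'D6']
```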
The Boolean Model: Further Example
Given the following three documents, Construct Term –
document matrix and find the relevant documents
retrieved by the Boolean model for given query
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
• Query: “gold silver truck”
Table below shows the document–term (ti) matrix (to be filled in):

       arrive  damage  deliver  fire  gold  silver  ship  truck
 D1
 D2
 D3

• Find the relevant documents for the queries:
(a) gold delivery
(b) ship gold
(c) silver truck
Exercise
Given the following four documents with the following
contents:
– D1 = “computer information retrieval”
– D2 = “computer retrieval”
– D3 = “information”
– D4 = “computer information”
• What are the relevant documents retrieved for the queries:
– Q1 = “information ∧ retrieval”
– Q2 = “information ∧ ¬computer”
Drawbacks of the Boolean Model
•Retrieval based on binary decision criteria with no
notion of partial matching
•No ranking of the documents is provided (absence of
a grading scale)
•Information need has to be translated into a Boolean
expression which most users find awkward
•The Boolean queries formulated by the users are most
often too simplistic
As a consequence, the Boolean model frequently
returns either too few or too many documents in
response to a user query
Vector-Space Model
• This is the most commonly used strategy for measuring
relevance of documents for a given query. This is
because,
• Use of binary weights is too limiting
• Non-binary weights provide consideration for partial
matches
• These term weights are used to compute a degree of
similarity between a query and each document
• Ranked set of documents provides for better matching
• The idea behind VSM is that
• the meaning of a document is conveyed by the words
used in that document
Vector-Space Model
To find relevant documents for a given query,
• First, map documents and queries into term-document vector space.
Note that queries are treated as short documents
• Second, in the vector space, queries and documents are represented as weighted vectors of terms, wij
There are different weighting techniques; the most widely used one is computing tf*idf for each term
• Third, a similarity measure is used to rank documents by the closeness of their vectors to the query.
Closeness is determined by a similarity score calculation
Term-document matrix.
• A collection of n documents and query can be represented
in the vector space model by a term-document matrix.
– An entry in the matrix corresponds to the “weight” of a term in
the document;
– zero means the term has no significance in the document or it simply doesn’t exist in the document; otherwise, wij > 0 whenever ki ∈ dj
       T1    T2    …   TN
 D1    w11   w12   …   w1N
 D2    w21   w22   …   w2N
 :     :     :         :
 DM    wM1   wM2   …   wMN
Computing weights
• How to compute weights for term i in document j and
query q; wij and wiq ?
• A good weight must take into account two effects:
– Quantification of intra-document contents (similarity)
• tf factor, the term frequency within a document
– Quantification of inter-documents separation (dissimilarity)
• idf factor, the inverse document frequency
• As a result, most IR systems use the tf*idf weighting technique:
wij = tf(i,j) * idf(i)
Computing weights
• Let,
 N be the total number of documents in the collection
 ni be the number of documents which contain ti
 freq(i,j) raw frequency of ti within dj
• A normalized tf(i,j) factor is given by
tf(i,j) = freq(i,j) / max(freq(k,j))
 where the maximum is computed over all terms which occur
within the document dj
• The idf factor is computed as
idf(i) = log (N/ni)
 The log is used to make the values of tf and idf
comparable. It can also be interpreted as the amount of
information associated with the term ti.
• A normalized tf*idf weight is given by:
wij = [freq(i,j) / maxk freq(k,j)] * log(N/ni)
Example: Computing weights
• A collection includes 10,000 documents
The term A appears 20 times in a particular
document
 The maximum appearance of any term in this
document is 50
 The term A appears in 2,000 of the collection
documents.
• Compute TF*IDF weight?
 tf(i,j) = freq(i,j) / max(freq(k,j)) = 20/50 = 0.4
 idf(i) = log(N/ni) = log2(10,000/2,000) = log2(5) = 2.32
 wij = tf(i,j) * idf(i) = 0.4 * 2.32 = 0.928
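A small Python sketch of this computation (using a base-2 logarithm, which reproduces the 2.32 figure above; the function name is illustrative):

```python
import math

def tf_idf(freq_ij, max_freq_j, N, n_i):
    """Normalized tf*idf weight: (freq / max_freq) * log2(N / n_i)."""
    tf = freq_ij / max_freq_j      # normalized term frequency
    idf = math.log2(N / n_i)       # inverse document frequency
    return tf * idf

# Term A: 20 occurrences, max term frequency 50, N = 10,000 docs, term in 2,000 docs
print(round(tf_idf(20, 50, 10_000, 2_000), 3))  # ≈ 0.929 (the slide rounds idf to 2.32)
```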
Similarity Measure
• A similarity measure is a function that computes the
degree of similarity between two vectors.
• Using a similarity measure between the query and each document:
– It is possible to rank the retrieved documents in
the order of presumed relevance.
– It is possible to enforce a certain threshold so that
we can control the size of the retrieved set of
documents.
Similarity Measures
|QD| Dot Product (Simple matching)
|QD|
2 Dice’s Coefficient
|Q|| D|
|QD|
Jaccard’s Coefficient
|QD|
|QD|
1 1 Cosine Coefficient
| Q | | D |
2 2

|QD|
min(| Q |, | D |) Overlap Coefficient
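A quick Python sketch of these set-based coefficients, treating the query and the document as sets of terms (the function name and example sets are illustrative):

```python
import math

def coefficients(Q, D):
    """Set-based similarity coefficients between query terms Q and document terms D."""
    common = len(Q & D)
    return {
        "dot":     common,                              # simple matching
        "dice":    2 * common / (len(Q) + len(D)),
        "jaccard": common / len(Q | D),
        "cosine":  common / math.sqrt(len(Q) * len(D)),
        "overlap": common / min(len(Q), len(D)),
    }

print(coefficients({"gold", "silver", "truck"},
                   {"shipment", "gold", "arrived", "truck"}))
```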
Similarity Measure
•Sim(q,dj) = cos() j
dj


q
 

n i
d j q wi , j wi ,q
sim(d j , q )     i 1

i 1 w i 1 i,q
n n
dj q 2
i, j w 2

•Since wij > 0 and wiq > 0, 0 <= sim(q,dj)


<=1
•A document is retrieved even if it matches the
query terms only partially
Vector Space with Term Weights and
Cosine Matching
Di = (di1, wdi1; di2, wdi2; …; dit, wdit)
Q = (qi1, wqi1; qi2, wqi2; …; qit, wqit)

sim(Q, Di) = Σj (wqj · wdij) / ( sqrt(Σj wqj²) · sqrt(Σj wdij²) )

(2-D example in the Term A / Term B plane)
Q = (0.4, 0.8)    D1 = (0.8, 0.3)    D2 = (0.2, 0.7)

sim(Q, D2) = (0.4·0.2 + 0.8·0.7) / sqrt[ (0.4² + 0.8²) · (0.2² + 0.7²) ]
           = 0.64 / sqrt(0.42) ≈ 0.98

sim(Q, D1) = (0.4·0.8 + 0.8·0.3) / sqrt[ (0.4² + 0.8²) · (0.8² + 0.3²) ]
           = 0.56 / sqrt(0.58) ≈ 0.74
Vector-Space Model: Example
• Suppose the user query is Q = “gold silver truck”. The database collection consists of three documents with the following content.
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
• Show retrieval results in ranked order?
1.Assume that full text terms are used during indexing,
without removing common terms, stop words, & also no
terms are stemmed.
2.Assume that content-bearing terms are selected during
indexing
3.Also compare your result with or without normalizing
term frequency
Vector-Space Model: Example
Counts (TF) and weights Wi = TF*IDF (IDF = log10(3/DF)):

Terms      Q  D1  D2  D3   DF   IDF     Q      D1     D2     D3
a          0  1   1   1    3    0       0      0      0      0
arrived    0  0   1   1    2    0.176   0      0      0.176  0.176
damaged    0  1   0   0    1    0.477   0      0.477  0      0
delivery   0  0   1   0    1    0.477   0      0      0.477  0
fire       0  1   0   0    1    0.477   0      0.477  0      0
gold       1  1   0   1    2    0.176   0.176  0.176  0      0.176
in         0  1   1   1    3    0       0      0      0      0
of         0  1   1   1    3    0       0      0      0      0
silver     1  0   2   0    1    0.477   0.477  0      0.954  0
shipment   0  1   0   1    2    0.176   0      0.176  0      0.176
truck      1  0   1   1    2    0.176   0.176  0      0.176  0.176
Vector-Space Model
Terms Q D1 D2 D3
a 0 0 0 0
arrived 0 0 0.176 0.176
damaged 0 0.477 0 0
delivery 0 0 0.477 0
fire 0 0.477 0 0
gold 0.176 0.176 0 0.176
in 0 0 0 0
of 0 0 0 0
silver 0.477 0 0.954 0
shipment 0 0.176 0 0.176
truck 0.176 0 0.176 0.176
Vector-Space Model: Example
•Compute similarity using cosine Sim(q,dj)
• First, for each document and the query, compute all vector lengths (zero terms ignored):
|d1| = sqrt(0.477² + 0.477² + 0.176² + 0.176²) = sqrt(0.517) = 0.719
|d2| = sqrt(0.176² + 0.477² + 0.954² + 0.176²) = sqrt(1.2001) = 1.095
|d3| = sqrt(0.176² + 0.176² + 0.176² + 0.176²) = sqrt(0.124) = 0.352
|q| = sqrt(0.176² + 0.477² + 0.176²) = sqrt(0.2896) = 0.538
• Next, compute dot products (zero products ignored)
Q*d1= 0.176*0.176 = 0.0310
Q*d2 = 0.954*0.477 + 0.176 *0.176 = 0.4862
Q*d3 = 0.176*0.176 + 0.176*0.176 = 0.0620
Vector-Space Model:
Example
Now, compute similarity score
Sim(q,d1) = 0.0310 / (0.538*0.719) = 0.0801
Sim(q,d2) = 0.4862 / (0.538*1.095) = 0.8246
Sim(q,d3) = 0.0620 / (0.538*0.352) = 0.3271
Finally, we sort and rank documents in descending
order according to the similarity scores
Rank 1: Doc 2 = 0.8246
Rank 2: Doc 3 = 0.3271
Rank 3: Doc 1 = 0.0801
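The whole pipeline above can be reproduced with a short Python sketch (the document strings, dictionary layout, and helper names are illustrative; IDF uses log10 as in the weight table):

```python
import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

# Raw term frequencies (no stop-word removal, stemming, or TF normalization)
tf = {d: Counter(text.split()) for d, text in docs.items()}
tf["Q"] = Counter(query.split())

vocab = set().union(*(set(text.split()) for text in docs.values()))
N = len(docs)
df = {t: sum(1 for d in docs if t in tf[d]) for t in vocab}
idf = {t: math.log10(N / df[t]) for t in vocab}

def weights(name):
    """TF*IDF weight vector for a document or the query."""
    return {t: tf[name][t] * idf[t] for t in vocab}

def cosine(w1, w2):
    dot = sum(w1[t] * w2[t] for t in vocab)
    norm1 = math.sqrt(sum(v * v for v in w1.values()))
    norm2 = math.sqrt(sum(v * v for v in w2.values()))
    return dot / (norm1 * norm2)

wq = weights("Q")
for d, s in sorted(((d, cosine(wq, weights(d))) for d in docs), key=lambda kv: -kv[1]):
    print(d, round(s, 2))  # D2 ≈ 0.82, D3 ≈ 0.33, D1 ≈ 0.08
```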
• Exercise: using normalized TF, rank the documents using the cosine similarity measure. Hint: normalize the TF of term i in doc j by the maximum term frequency in doc j
Vector-Space Model
• Advantages:
• term-weighting improves quality of the answer set
since it displays in ranked order
• partial matching allows retrieval of documents that
approximate the query conditions
• cosine ranking formula sorts documents according
to degree of similarity to the query
• Disadvantages:
• assumes independence of index terms (??)
Exercise 1
Suppose the database collection consists of the following
documents.
c1: Human machine interface for Lab ABC computer
applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measure
Query:
Find documents relevant to "human computer interaction"
Exercise 2
• Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
–Draw the term-document incidence matrix for this document
collection.
–Draw the inverted index representation for this collection.
• For the document collection shown above, what are the returned results for the queries:
–schizophrenia AND drug
–for AND NOT(drug OR approach)
Probabilistic model
• Asks the question: what is the probability that the user will see relevant information if they read this document?
– P(rel | di): probability of relevance after reading di
– How likely is the user to get relevant information from reading this document?
– A high probability means the user is more likely to get relevant information.
• A Probabilistic retrieval model
– Rank documents in decreasing order of probability of relevance to the user's information need
– Calculate P(rel|di) for each document and rank
Probabilistic Model definitions
• Let D be a document in the collection.
– dj = (t1,j, t2,j, …, tt,j), ti,j ∈ {0,1}
• terms occurrences are boolean (not counts)
• query q is represented similarly
• Let R represent the set of relevant documents with respect to a given
query and let NR represent the set of irrelevant documents.
– P(R | dj) is probability that dj is relevant,
– P(NR | dj) is probability that dj is irrelevant
• Need to find p(R| D) - probability that a retrieved document D is
relevant.
• Similarity function (Bayes’ rule):
p(R | D) = p(D | R) p(R) / p(D)
p(NR | D) = p(D | NR) p(NR) / p(D)
– Rank by the ratio of the probability of relevance to the probability of non-relevance
– If p(R | D) > p(NR | D) then D is relevant, otherwise D is not relevant
Bayes’ Theorem: Application in IR
• Goal: want to estimate the probability that a document D
is relevant to a given query.
p(R | D) = p(R) p(D | R) / p(D) = p(R) p(D | R) / [ p(R) p(D | R) + p(NR) p(D | NR) ]

• It is easier to estimate the log odds of the probability of relevance:

log O(R | D) = log [ p(R | D) / p(NR | D) ] = log [ p(R) p(D | R) / ( p(NR) p(D | NR) ) ]

where p(NR | D) = 1 − p(R | D)
Probabilistic Models
• Most probabilistic models based on combining probabilities of
relevance and non-relevance of individual terms
– Probability that a term will appear in a relevant document
– Probability that the term will not appear in a non‐relevant
document
• These probabilities are estimated based on counting term
appearances in document descriptions
• Retrieval Status Value (rsv)
D is a vector of binary term
occurrences
We assume that terms occur
independently of each other
Principles surrounding weights
• Independence Assumptions
– I1: The distribution of terms in relevant documents is
independent and their distribution in all documents is
independent.
– I2: The distribution of terms in relevant documents is
independent and their distribution in non-relevant documents
is independent.
• Ordering Principles
– O1: Probable relevance is based only on the presence of search
terms in the documents.
– O2: Probable relevance is based on both the presence of
search terms in documents and their absence from documents.
Computing term probabilities
• Initially, there are no retrieved documents
– R is completely unknown
– Assume P(ti|R) is constant (usually 0.5)
– Assume P(ti|NR) approximated by distribution of ti
across collection – IDF
• This can be used to compute an initial ranking using IDF as the basic term weight
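Under these initial assumptions (P(ti|R) = 0.5 and P(ti|NR) ≈ ni/N), the term weight reduces to an IDF-like quantity; a minimal sketch of that estimate (the function name is illustrative):

```python
import math

def initial_term_weight(N, n_i):
    """Initial probabilistic term weight with no relevance information.

    Assumes P(ti|R) = 0.5 and P(ti|NR) = n_i / N, so the relevance odds
    ratio reduces to log((N - n_i) / n_i), an IDF-like weight.
    """
    p_rel, p_nonrel = 0.5, n_i / N
    return math.log((p_rel * (1 - p_nonrel)) / (p_nonrel * (1 - p_rel)))

# e.g. a term appearing in 2,000 of 10,000 documents
print(round(initial_term_weight(10_000, 2_000), 3))  # log(4) ≈ 1.386
```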
Probabilistic Model Example
d Document vectors <tfd,t>
col day eat hot lot nin old pea por pot
1 1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1
6 1 1
wt 0.26 0.56 0.56 0.26 0.56 0.56 0.56 0.0 0.0 0.26
• q1 = eat
• q2 = porridge
• q3 = hot porridge
• q4 = eat nine day old porridge
Improving the Ranking
• Now, suppose
– we have shown the initial ranking to the user
– the user has labeled some of the documents as
relevant ("relevance feedback")
• We now have
– N documents in coll, R are known relevant
– ni documents containing ti, ri are relevant
Improving Term Weight Estimates
N=the total number of documents in the collection
n= the total number of documents that contain term ti
R=the total number of relevant documents retrieved
r=the total number of relevant documents retrieved that contain term t i
Document Relevance (contingency table for term ti)

                               Relevant docs   Non-relevant docs    Total
Docs including term ti         r               n − r                n
Docs excluding term ti         R − r           N − R − (n − r)      N − n
Total                          R               N − R                N
Compute Term Weight: Robertson-Sparck Jones Weights
• Retrospective formulation
– Ratio of the odds of a relevant document having the term (i.e., ratio of relevant documents having the term to not having the term) to the odds of a non-relevant document having the term (i.e., ratio of non-relevant documents having the term to not having the term):

w = log [ ( r / (R − r) ) / ( (n − r) / (N − n − R + r) ) ]

• Predictive formulation
– To guarantee that the denominator is never zero, add 0.5 to all numerators and denominators:

w(1) = log [ ( (r + 0.5) / (R − r + 0.5) ) / ( (n − r + 0.5) / (N − n − R + r + 0.5) ) ]
     = log [ (r + 0.5)(N − n − R + r + 0.5) / ( (n − r + 0.5)(R − r + 0.5) ) ]
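A minimal Python sketch of the predictive Robertson-Sparck Jones weight (the function name and example numbers are illustrative):

```python
import math

def rsj_weight(N, R, n, r):
    """Predictive Robertson-Sparck Jones term weight.

    N: documents in the collection, R: known relevant documents,
    n: documents containing the term, r: relevant documents containing the term.
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

# e.g. N=6 documents, R=1 known relevant, a term in n=2 docs, r=1 of them relevant
print(round(rsj_weight(N=6, R=1, n=2, r=1), 2))  # log(9) ≈ 2.2
```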
Relevance weighted Example
d Document vectors <tfd,t>
col day eat hot lot nin old pea por pot Relev
ance
1 1 1 1 1 NR
2 1 1 1 R
3 1 1 1 NR
4 1 1 1 NR
5 1 1 NR
6 1 1 NR
wt -0.33 0.00 0.00 -0.33 0.00 0.00 0.00 0.62 0.62 0.95
• q3 = hot porridge
• Document 2 is relevant
Probabilistic Retrieval Example
• D1: “Cost of paper is up.” (relevant)
• D2: “Cost of jellybeans is up.” (not relevant)
• D3: “Salaries of CEO’s are up.” (not relevant)
• D4: “Paper: CEO’s labor cost up.” (not relevant)
Probabilistic Retrieval Example
cost paper Jellybean salary CEO labor up
D1 1 1 0 0 0 0 1
D2 1 0 1 0 0 0 1
D3 0 0 0 1 1 0 1
D4 1 1 0 0 1 1 1
Wij 0.477 1.176 -0.477 -0.477 -0.477 0.222 -0.222
• D1 = 0.477 + 1.176 + (−0.222) = 1.431
• D2 = 0.477 + (−0.477) + (−0.222) = −0.222
• D3 = (−0.477) + (−0.477) + (−0.222) = −1.176
• D4 = 0.477 + 1.176 + (−0.477) + 0.222 + (−0.222) = 1.176
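This scoring step, summing the weights of the terms each document contains, can be sketched as follows (the weight values are taken from the table above; the dictionary layout is illustrative):

```python
# Score each document by summing the weights of the terms it contains.
weights = {"cost": 0.477, "paper": 1.176, "jellybean": -0.477,
           "salary": -0.477, "ceo": -0.477, "labor": 0.222, "up": -0.222}

docs = {
    "D1": {"cost", "paper", "up"},
    "D2": {"cost", "jellybean", "up"},
    "D3": {"salary", "ceo", "up"},
    "D4": {"cost", "paper", "ceo", "labor", "up"},
}

scores = {d: sum(weights[t] for t in terms) for d, terms in docs.items()}
for d, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(d, round(s, 3))  # D1 1.431, D4 1.176, D2 -0.222, D3 -1.176
```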
Exercise
• Consider the collection below. The collection has 5 documents and
each document is described by two terms. The initial guess of
relevance to a particular query Q is as given in the table below.
Assuming the query Q has a total of 2 relevant documents in this
collection solve the following questions
Document T1 T2 Relevance
D1 1 1 R
D2 0 1 NR
D3 1 0 NR
D4 1 0 R
D5 0 1 NR
• Using the probabilistic term weighting formula, calculate the new weight for each of the query terms in Q
• Rank the documents according to their probability of relevance
with the new query
Probabilistic model
• Probabilistic model uses probability theory to model the
uncertainty in the retrieval process
– Assumptions are made explicit
– Term weight without relevance information is IDF
• Relevance feedback can improve the ranking by giving better
term probability estimates
• Advantages of probabilistic model over vector ‐space
– Strong theoretical basis
– Since the base is probability theory, it is very well understood
– Easy to extend
• Disadvantages
– Models are often complicated
– No term frequency weighting
• Which is better: vector‐space or probabilistic?
– Both are approximately as good as each other
– Depends on collection, query, and other factors