
Information Retrieval and Storage

Chapter Five
IR models
Target Group – IT 3rd-year students

Injibara, Ethiopia
IR Models - Basic Concepts
Word evidence:
 IR systems usually adopt index terms to index and retrieve documents
 Each document is represented by a set of representative keywords or index terms (called a Bag of Words)
 An index term is a document word that is useful for recalling the document's main themes
Not all terms are equally useful for representing the document contents:
 Less frequent terms allow identifying a narrower set of documents
 But no ordering information is attached to the Bag of Words identified from the document collection
....continued
One central problem regarding IR systems is the issue of predicting
the degree of relevance of documents for a given query
 Such a decision is usually dependent on a ranking algorithm
which attempts to establish a simple ordering of the documents
retrieved
 Documents appearing at the top of this ordering are considered to be more likely to be relevant
Thus ranking algorithms are at the core of IR systems
 The IR models determine the predictions of what is relevant and
what is not, based on the notion of relevance implemented by
the system
....continued
After preprocessing, N distinct terms (Bag of words) remain which
are unique terms that form the VOCABULARY
• Let
– ki be an index term i & dj be a document j
– K = (k1, k2, …, kN) is the set of all index terms
• Each term, i, in a document or query j, is given a real-valued weight,
wij.
– wij is the weight associated with the pair (ki, dj). If wij = 0, the term ki does not belong to document dj
The weight wij quantifies the importance of the index term for describing the document contents
• vec(dj) = (w1j, w2j, …, wNj) is the weighted vector associated with the document dj
Mapping Documents & Queries
Represent both documents and queries as N-dimensional vectors in
a term-document matrix, which shows occurrence of terms in the
document collection or query
– E.g.  dj = (t1,j, t2,j, …, tN,j);  qk = (t1,k, t2,k, …, tN,k)
• An entry in the matrix corresponds to the “weight” of a term in the
document; zero means the term doesn’t exist in the document.

        T1   T2   ....  TN
  D1    w11  w12  ...   w1N
  D2    w21  w22  ...   w2N
  :      :    :          :
  DM    wM1  wM2  ...   wMN
  Qi    wi1  wi2  ...   wiN

 The document collection is mapped to a term-by-document matrix
 Each document (and the query) is viewed as a vector in a multidimensional space
 Nearby vectors are related
 Normalize for vector length to avoid the effect of document length
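As a rough illustration of this mapping, the sketch below builds a small term-by-document count matrix in Python; the toy documents, the query string, and the helper name to_vector are illustrative assumptions, not content from the chapter.

```python
from collections import Counter

# Toy collection; the document texts and names are illustrative only.
docs = {
    "D1": "information retrieval models rank documents",
    "D2": "boolean retrieval returns unranked documents",
}
query = "ranked retrieval"

# Vocabulary: the N distinct terms found in the collection and the query.
vocab = sorted({t for text in list(docs.values()) + [query] for t in text.split()})

def to_vector(text):
    """Map a document (or query) to an N-dimensional vector of raw term counts."""
    counts = Counter(text.split())
    return [counts.get(term, 0) for term in vocab]

matrix = {name: to_vector(text) for name, text in docs.items()}
matrix["Q"] = to_vector(query)          # the query is treated as a short document

print(vocab)
for name, vec in matrix.items():
    print(name, vec)
```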
Weighting Terms in Vector Space
The importance of the index terms is represented by weights
associated to them
Problem: what weight should we assign to show the importance of an index term for describing the document/query contents?
Solution 1: Binary weights: t=1 if presence, 0 otherwise
– Similarity: number of terms in common
Problem: Not all terms equally interesting
– E.g. the vs. dog vs. cat
Solution: Replace binary weights with non-binary weights

  dj = (w1,j, w2,j, …, wN,j);  qk = (w1,k, w2,k, …, wN,k)
How to evaluate Models?
We need to investigate what procedures they follow and what techniques they use:
 Are they using binary or non-binary weighting to measure the importance of terms in documents?
 Are they using similarity measurements?
 Are they applying partial matching?
 Are they performing exact matching or best matching for document retrieval?
 Do they provide any ranking mechanism?
The Boolean Model
Boolean model is a simple model based on set theory
• The Boolean model imposes a binary criterion for deciding
relevance
Terms are either present or absent. Thus, wij ∈ {0, 1}
 sim(q, dj) = 1 if the document satisfies the Boolean query, 0 otherwise

        T1   T2   ....  TN
  D1    w11  w12  ...   w1N
  D2    w21  w22  ...   w2N
  :      :    :          :
  DM    wM1  wM2  ...   wMN

Note that no weights are assigned in between 0 and 1; the only values used are 0 and 1.
The Boolean Model:
A Boolean query expression consists of keywords connected by AND, OR, and NOT, with brackets used to indicate scope.
Example

• Generate the relevant documents retrieved by the Boolean model for the query:

    q = k1 ∧ (k2 ∨ ¬k3)

[Venn diagram: three overlapping circles for k1, k2 and k3, with documents d1–d7 placed in the regions of the diagram]
The Boolean Model: Example
 Given the following determine documents retrieved by the
Boolean model based IR system
• Index Terms: K1, …,K8.
• Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)

• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5})
        = {D1, D2, D6}
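A minimal sketch of how this Boolean evaluation could be carried out with set operations in Python; the document sets are taken from the example above, while the helper name having is an illustrative assumption.

```python
# Documents from the example, each represented as its set of index terms.
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}
all_docs = set(docs)

def having(term):
    """Set of documents whose bag of words contains the given index term."""
    return {d for d, terms in docs.items() if term in terms}

# q = K1 AND (K2 OR NOT K3), evaluated as intersection, union and complement.
answer = having("K1") & (having("K2") | (all_docs - having("K3")))
print(sorted(answer))   # ['D1', 'D2', 'D6']
```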
Exercise
Given the following four documents with the following contents:

– D1 = “computer information retrieval”

– D2 = “computer retrieval”

– D3 = “information”

– D4 = “computer information”

• What are the relevant documents retrieved for the queries:

– Q1 = “information ∧ retrieval”

– Q2 = “information ∧ ¬computer”
Drawbacks of the Boolean Model
• Exact-match only, no partial matches

• Retrieved documents are not ranked

• All terms are equally important

• Boolean operator usage has much more influence on the result than a critical word
Vector-Space Model....
This is the most commonly used strategy for measuring relevance
of documents for a given query. This is because,
 Use of binary weights is too limiting
 Non-binary weights provide consideration for partial matches
These term weights are used to compute a degree of similarity
between a query and each document
 Ranked set of documents provides for better matching
The idea behind VSM is that
 the meaning of a document is conveyed by the words used in
that document.
Vector-Space Model
To find relevant documents for a given query:

• First, map documents and queries into the term-document vector space. Note that queries are treated as short documents.

• Second, represent queries and documents as weighted vectors, wij, in the vector space. There are different weighting techniques; the most widely used one computes tf*idf for each term.

• Third, use a similarity measure to rank documents by the closeness of their vectors to the query. Closeness is determined by a similarity score calculation.
Vector-Space Model.....
• A collection of n documents and query can be represented in the
vector space model by a term-document matrix.
–An entry in the matrix corresponds to the “weight” of a term in
the document;
–zero means the term has no significance in the document or it simply doesn’t exist in the document. Otherwise, wij > 0 whenever ki ∈ dj.

        T1   T2   ....  TN
  D1    w11  w12  ...   w1N
  D2    w21  w22  ...   w2N
  :      :    :          :
  DM    wM1  wM2  ...   wMN
Computing weights
How to compute weights for term i in document j and query q; wij
and wiq ?

A good weight must take into account two effects:


– Quantification of intra-document contents (similarity)
• tf factor, the term frequency within a document
– Quantification of inter-documents separation (dissimilarity)
• idf factor, the inverse document frequency
As a result of which most IR systems are using tf*idf weighting
technique:
wij = tf(i,j) * idf(i)
Computing weights....
Let,
 N be the total number of documents in the collection
 ni be the number of documents which contain ti
 freq(i,j) raw frequency of ti within dj
A normalized tf(i,j) factor is given by
tf(i,j) = freq(i,j) / max(freq(k,j))
 where the maximum is computed over all terms which occur
within the document dj
The idf factor is computed as
idf(i) = log (N/ni)
 The log is used to make the values of tf and idf comparable. It
can also be interpreted as the amount of information associated
with the term ti.
A normalized tf*idf weight is given by:
wij = [freq(i,j) / max(freq(k,j))] * log(N/ni)
Example: Computing weights
Query: a user's query is typically treated as a document and is also tf-idf weighted.
The vector space model is usually as good as the known
ranking alternatives. It is also simple and fast to compute.
A collection includes 10,000 documents
 The term A appears 20 times in a particular document
 The maximum appearance of any term in this document is 50
 The term A appears in 2,000 of the collection documents.
Compute TF*IDF weight?
 tf(i,j) = freq(i,j) / max(freq(k,j)) = 20/50 = 0.4
 idf(i) = log2(N/ni) = log2(10,000/2,000) = log2(5) = 2.32
 wij = tf(i,j) * idf(i) = 0.4 * 2.32 = 0.928
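A small sketch of this weighting in Python, assuming base-2 logarithms (which is what makes log(5) come out as 2.32 above); the function name tf_idf and its parameter names are illustrative.

```python
import math

def tf_idf(freq_ij, max_freq_j, N, n_i, log_base=2):
    """Normalized tf * idf weight; base-2 logs assumed, matching idf = log2(5) = 2.32."""
    tf = freq_ij / max_freq_j            # tf(i,j) = freq(i,j) / max_k freq(k,j)
    idf = math.log(N / n_i, log_base)    # idf(i) = log(N / n_i)
    return tf * idf

# The worked example above: term A, 20 occurrences, max frequency 50,
# 10,000 documents in the collection, 2,000 of them containing the term.
w = tf_idf(freq_ij=20, max_freq_j=50, N=10_000, n_i=2_000)
print(round(w, 3))   # ~0.929 (the slide rounds idf to 2.32 and reports 0.928)
```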
Similarity Measure

• A similarity measure is a function that computes the degree of similarity between two vectors.

• Using a similarity measure between the query and each document, it is possible to rank the retrieved documents in the order of presumed relevance.
Similarity Measures

 Dot Product (simple matching):  |Q ∩ D|

 Dice's Coefficient:             2 |Q ∩ D| / (|Q| + |D|)

 Jaccard's Coefficient:          |Q ∩ D| / |Q ∪ D|

 Cosine Coefficient:             |Q ∩ D| / (|Q|^(1/2) |D|^(1/2))

 Overlap Coefficient:            |Q ∩ D| / min(|Q|, |D|)
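These set-based coefficients are easy to express directly over term sets; the sketch below is a minimal Python illustration, and the example query/document term sets are assumptions for demonstration only.

```python
def dot(q, d):      return len(q & d)                                   # |Q ∩ D|
def dice(q, d):     return 2 * len(q & d) / (len(q) + len(d))
def jaccard(q, d):  return len(q & d) / len(q | d)
def cosine(q, d):   return len(q & d) / (len(q) ** 0.5 * len(d) ** 0.5)
def overlap(q, d):  return len(q & d) / min(len(q), len(d))

Q = {"gold", "silver", "truck"}                 # illustrative term sets
D = {"shipment", "gold", "arrived", "truck"}
for name, fn in [("dot", dot), ("dice", dice), ("jaccard", jaccard),
                 ("cosine", cosine), ("overlap", overlap)]:
    print(name, round(fn(Q, D), 3))
```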
Similarity Measure
• sim(q, dj) = cos(θ), where θ is the angle between the document vector dj and the query vector q

    sim(dj, q) = (dj · q) / (|dj| |q|)
               = Σi=1..n (wi,j * wi,q) / ( sqrt(Σi=1..n wi,j^2) * sqrt(Σi=1..n wi,q^2) )

• Since wij ≥ 0 and wiq ≥ 0, 0 ≤ sim(q, dj) ≤ 1

• A document is retrieved even if it matches the query terms only partially
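A minimal Python sketch of this cosine formula over weighted vectors; the two-term vectors used to exercise it are the Q, D1 and D2 values from the next slide.

```python
import math

def cosine_sim(d, q):
    """sim(d, q) = (d · q) / (|d| |q|) over weighted term vectors of equal length."""
    num = sum(wd * wq for wd, wq in zip(d, q))
    den = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return num / den if den else 0.0

# Two-term example from the following slide: Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)
Q, D1, D2 = [0.4, 0.8], [0.8, 0.3], [0.2, 0.7]
print(round(cosine_sim(D1, Q), 2))   # ~0.73
print(round(cosine_sim(D2, Q), 2))   # ~0.98
```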
Vector Space with Term Weights and Cosine
Matching

Di = (di1, wdi1; di2, wdi2; …; dit, wdit)
Q  = (qi1, wqi1; qi2, wqi2; …; qit, wqit)

    sim(Q, Di) = Σj=1..t (wqj * wdij) / sqrt( Σj=1..t (wqj)^2 * Σj=1..t (wdij)^2 )

[Figure: Q = (0.4, 0.8), D1 = (0.8, 0.3) and D2 = (0.2, 0.7) plotted in a two-dimensional space with axes Term A and Term B]

    sim(Q, D2) = [(0.4 × 0.2) + (0.8 × 0.7)] / sqrt{ [(0.4)^2 + (0.8)^2] × [(0.2)^2 + (0.7)^2] }
               = 0.64 / 0.65 ≈ 0.98

    sim(Q, D1) = [(0.4 × 0.8) + (0.8 × 0.3)] / sqrt{ [(0.4)^2 + (0.8)^2] × [(0.8)^2 + (0.3)^2] }
               = 0.56 / 0.76 ≈ 0.73
Vector-Space Model: Example
• Suppose user query for: Q = “gold silver truck”. The database
collection consists of three documents with the following content.
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
• Show retrieval results in ranked order?
1. Assume that full text terms are used during indexing, without
removing common terms, stop words, & also no terms are stemmed.
2. Assume that content-bearing terms are selected during indexing
3. Also compare your result with or without normalizing term
frequency
Vector-Space Model: Example
Terms | Counts (tf): Q D1 D2 D3 | DF | IDF | Wi = tf*idf: Q D1 D2 D3
a 0 1 1 1 3 0 0 0 0 0
arrived 0 0 1 1 2 0.176 0 0 0.176 0.176
damaged 0 1 0 0 1 0.477 0 0.477 0 0
delivery 0 0 1 0 1 0.477 0 0 0.477 0
fire 0 1 0 0 1 0.477 0 0.477 0 0
gold 1 1 0 1 2 0.176 0.176 0.176 0 0.176
in 0 1 1 1 3 0 0 0 0 0
of 0 1 1 1 3 0 0 0 0 0
silver 1 0 2 0 1 0.477 0.477 0 0.954 0
shipment 0 1 0 1 2 0.176 0 0.176 0 0.176
truck 1 0 1 1 2 0.176 0.176 0 0.176 0.176
Vector-Space Model
Terms Q D1 D2 D3
a 0 0 0 0
arrived 0 0 0.176 0.176
damaged 0 0.477 0 0
delivery 0 0 0.477 0
fire 0 0.477 0 0
gold 0.176 0.176 0 0.176
in 0 0 0 0
of 0 0 0 0
silver 0.477 0 0.954 0
shipment 0 0.176 0 0.176
truck 0.176 0 0.176 0.176
Vector-Space Model: Example
• Compute the similarity using the cosine measure, sim(q, di)
• First, for each document and the query, compute all vector lengths (zero terms ignored):

  |d1| = sqrt(0.477^2 + 0.477^2 + 0.176^2 + 0.176^2) = sqrt(0.517)  = 0.719
  |d2| = sqrt(0.176^2 + 0.477^2 + 0.954^2 + 0.176^2) = sqrt(1.2001) = 1.095
  |d3| = sqrt(0.176^2 + 0.176^2 + 0.176^2 + 0.176^2) = sqrt(0.124)  = 0.352
  |q|  = sqrt(0.176^2 + 0.477^2 + 0.176^2)           = sqrt(0.2896) = 0.538

• Next, compute the dot products (zero products ignored):

  q · d1 = 0.176*0.176 = 0.0310
  q · d2 = 0.954*0.477 + 0.176*0.176 = 0.4862
  q · d3 = 0.176*0.176 + 0.176*0.176 = 0.0620
Vector-Space Model: Example
Now, compute the similarity scores:
 Sim(q,d1) = (0.0310) / (0.538*0.719) = 0.0801
 Sim(q,d2) = (0.4862) / (0.538*1.095) = 0.8246
 Sim(q,d3) = (0.0620) / (0.538*0.352) = 0.3271

Finally, we sort and rank the documents in descending order of similarity score:
 Rank 1: Doc 2 = 0.8246
 Rank 2: Doc 3 = 0.3271
 Rank 3: Doc 1 = 0.0801
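The whole worked example can be reproduced end to end; the sketch below is a rough Python version under the same assumptions as the tables above (raw term frequency times a base-10 idf, no stop-word removal or stemming). Helper names such as weights and cosine are illustrative.

```python
import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

N = len(docs)
counts = {d: Counter(text.split()) for d, text in docs.items()}
q_counts = Counter(query.split())
vocab = sorted(set(q_counts) | {t for c in counts.values() for t in c})

# idf(i) = log10(N / df_i); weights are raw tf * idf, as in the tables above.
df = {t: sum(1 for c in counts.values() if t in c) for t in vocab}
idf = {t: math.log10(N / df[t]) if df[t] else 0.0 for t in vocab}

def weights(c):
    """Weighted vector for a document or query given its term counts."""
    return {t: c.get(t, 0) * idf[t] for t in vocab}

def cosine(a, b):
    num = sum(a[t] * b[t] for t in vocab)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

qw = weights(q_counts)
for d, c in sorted(counts.items(), key=lambda kv: -cosine(weights(kv[1]), qw)):
    print(d, round(cosine(weights(c), qw), 4))   # D2 ~0.82, D3 ~0.33, D1 ~0.08
```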
Vector-Space Model
• Advantages:
• term-weighting improves the quality of the answer set, since results are presented in ranked order
• partial matching allows retrieval of documents that
approximate the query conditions
• cosine ranking formula sorts documents according to
degree of similarity to the query

• Disadvantages:
• assumes independence of index terms (??)
Probabilistic Model
• IR is an uncertain process
–Mapping Information need to Query is not perfect
–Mapping Documents to index terms is a logical representation
–Query terms and index terms mostly mismatch

• This situation leads to several statistical approaches: probability theory,


fuzzy logic, theory of evidence, etc.
• The probabilistic retrieval model is a rigorous formal model that attempts to
predict the probability that a given document will be relevant to a given
query, P(R | q, di)
– Use probability to estimate the “odds” of relevance of a query to a
document.
– It relies on accurate estimates of probabilities
Probabilistic model…
Asks the question: what is the probability that the user will see relevant
information if they read this document?
– P(rel | di): the probability of relevance after reading di
– How likely is the user to get relevant information from reading this document?
– A high probability means the user is more likely to get relevant information
A probabilistic retrieval model
– Ranks documents in decreasing order of probability of relevance to the
  user's information need
– Calculates P(rel | di) for each document and ranks accordingly
Probability Ranking Principle
You have a collection of Documents
– User issues a query
– A Set of documents needs to be returned
– Intuitively, want the “best” document to be first, second best -
second, etc…
– We need a formal way to judge the “goodness” of documents with
respect to queries.
Probability ranking principle: if a reference retrieval system's
response to each request is a ranking of the documents in the
collection in order of decreasing probability of relevance… the
overall effectiveness of the system to its user will be the best that is
obtainable.
Difficulties
Evidence is based on a lossy representation
– Evaluate probability of relevance based on occurrence of terms
in query and documents

– Start with an initial estimate, refine through relevance feedback
Computing the probabilities exactly according to the model is intractable
– Make some simplifying assumptions
Probabilistic Model definitions
• Let D be a document in the collection.
– dj = (t1,j, t2,j, …, tt,j), ti,j ∈ {0, 1}
• terms occurrences are boolean (not counts)
• query q is represented similarly
• Let R represent the set of relevant documents with respect to a given
query and let NR represent the set of irrelevant documents.
– P(R | dj) is probability that dj is relevant,
– P(NR | dj) is probability that dj is irrelevant
• Need to find p(R| D) - probability that a retrieved document D is
relevant.
• Similarity function – the ratio of the probability of relevance to the probability of non-relevance:

    p(R | D)  = p(D | R) p(R) / p(D)
    p(NR | D) = p(D | NR) p(NR) / p(D)

  If p(R | D) > p(NR | D) then D is relevant, otherwise D is not relevant
Bayes’ Theorem: Application in IR
• Goal: want to estimate the probability that a document D is
relevant to a given query.

    p(R | D) = p(R) p(D | R) / p(D)
             = p(R) p(D | R) / [ p(R) p(D | R) + p(NR) p(D | NR) ]

• It is easier to estimate the log odds of the probability of relevance:

    log O(R | D) = log [ p(R | D) / p(NR | D) ]
                 = log [ p(R) p(D | R) / ( p(NR) p(D | NR) ) ]

    where p(NR | D) = 1 - p(R | D)
Probabilistic Models
Most probabilistic models based on combining probabilities of
relevance and non-relevance of individual terms
– Probability that a term will appear in a relevant document
– Probability that the term will not appear in a non‐relevant
document
These probabilities are estimated based on counting term
appearances in document descriptions
• Retrieval Status Value (RSV)
 – D is a vector of binary term occurrences
 – We assume that terms occur independently of each other
Principles surrounding weights
• Independence Assumptions

– I1: The distribution of terms in relevant documents is independent


and their distribution in all documents is independent.
– I2: The distribution of terms in relevant documents is independent
and their distribution in non-relevant documents is independent.
• Ordering Principles
– O1: Probable relevance is based only on the presence of search
terms in the documents.
– O2: Probable relevance is based on both the presence of search
terms in documents and their absence from documents.
Computing term probabilities
• Initially, there are no retrieved documents
 – R is completely unknown
 – Assume P(ti | R) is constant (usually 0.5)
 – Assume P(ti | NR) is approximated by the distribution of ti across the collection, i.e. IDF

• This can be used to compute an initial ranking, using IDF as the basic term weight
Probabilistic Model Example
d Document vectors <tfd,t>
col day eat hot lot nin old pea por pot
1 1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1
6 1 1
wt 0.26 0.56 0.56 0.26 0.56 0.56 0.56 0.0 0.0 0.26

• q1 = eat
• q2 = porridge
• q3 = hot porridge
• q4 = eat nine day old porridge
Improving the Ranking
• Now, suppose
 – we have shown the initial ranking to the user
 – the user has labeled some of the documents as relevant ("relevance feedback")

• We now have
 – N documents in the collection, R of which are known to be relevant
 – ni documents containing ti, ri of which are relevant
Improving Term Weight Estimates
N=the total number of documents in the collection
n= the total number of documents that contain term ti
R=the total number of relevant documents retrieved
r=the total number of relevant documents retrieved that contain term ti
Document relevance contingency table for term ti:

                              Relevant docs   Non-relevant docs   Total
  Docs containing ti          r               n - r               n
  Docs not containing ti      R - r           N - R - (n - r)     N - n
  Total                       R               N - R               N
Compute Term Weight: Robertson-Sparck Jones Weights
• Retrospective formulation
  – the ratio of the odds of a relevant document having the term (i.e., the ratio of relevant documents having the term to those not having it) to the odds of a non-relevant document having the term:

    w = log [ ( r / (R - r) ) / ( (n - r) / (N - n - R + r) ) ]

• Predictive formulation
  – to guarantee that the denominator is never zero, 0.5 is added to all numerators and denominators:

    w(1) = log [ ( (r + 0.5) / (R - r + 0.5) ) / ( (n - r + 0.5) / (N - n - R + r + 0.5) ) ]
         = log [ (r + 0.5)(N - n - R + r + 0.5) / ( (n - r + 0.5)(R - r + 0.5) ) ]
Relevance weighted Example

d    Document vectors <tfd,t>
     col day eat hot lot nin old pea por pot   Relevance
1 1 1 1 1 NR
2 1 1 1 R
3 1 1 1 NR
4 1 1 1 NR
5 1 1 NR
6 1 1 NR
wt   -0.33 0.00 0.00 -0.33 0.00 0.00 0.00 0.62 0.62 0.95

• q3 = hot porridge
• Document 2 is relevant
Probabilistic Retrieval Example
• D1: “Cost of paper is up.” (relevant)

• D2: “Cost of jellybeans is up.” (not relevant)

• D3: “Salaries of CEO’s are up.” (not relevant)

• D4: “Paper: CEO’s labor cost up.” (relevance unknown, to be scored)


Probabilistic Retrieval Example
cost paper Jellybean salary CEO labor up
D1 1 1 0 0 0 0 1
D2 1 0 1 0 0 0 1
D3 0 0 0 1 1 0 1
D4 1 1 0 0 1 1 1
Wij 0.477 1.176 -0.477 -0.477 -0.477 0.222 -0.222
• D1 = 0.477 + 1.176 - 0.222 = 1.431
• D2 = 0.477 - 0.477 - 0.222 = -0.222
• D3 = -0.477 - 0.477 - 0.222 = -1.176
• D4 = 0.477 + 1.176 - 0.477 + 0.222 - 0.222 = 1.176
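A rough Python check of this example, assuming base-10 logarithms and treating D1-D3 as the judged set (N = 3, R = 1); it reproduces the Wij row above and then scores each document, including the unjudged D4, by summing the weights of its terms (its retrieval status value).

```python
import math

# Judged documents with their terms and relevance (D1 relevant; D2, D3 not relevant).
train = {
    "D1": ({"cost", "paper", "up"}, True),
    "D2": ({"cost", "jellybean", "up"}, False),
    "D3": ({"salary", "ceo", "up"}, False),
}
d4 = {"cost", "paper", "ceo", "labor", "up"}            # the document to be scored
terms = ["cost", "paper", "jellybean", "salary", "ceo", "labor", "up"]

N = len(train)                                          # 3 judged documents
R = sum(1 for _, rel in train.values() if rel)          # 1 of them relevant

def weight(t):
    """Robertson-Sparck Jones weight of term t, estimated from the judged documents."""
    n = sum(1 for doc_terms, _ in train.values() if t in doc_terms)
    r = sum(1 for doc_terms, rel in train.values() if rel and t in doc_terms)
    num = (r + 0.5) * (N - n - R + r + 0.5)
    den = (n - r + 0.5) * (R - r + 0.5)
    return math.log10(num / den)

w = {t: round(weight(t), 3) for t in terms}
print(w)                       # cost 0.477, paper 1.176, jellybean -0.477, ..., up -0.222

def rsv(doc_terms):
    """Retrieval status value: sum of the weights of the terms present in the document."""
    return round(sum(w[t] for t in terms if t in doc_terms), 3)

for name, (doc_terms, _) in train.items():
    print(name, rsv(doc_terms))
print("D4", rsv(d4))           # 1.176: ranked below D1 but well above D2 and D3
```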
Probabilistic model
• Probabilistic model uses probability theory to model the uncertainty
in the retrieval process
– Assumptions are made explicit
– Term weight without relevance information is IDF
• Relevance feedback can improve the ranking by giving better term
probability estimates
• Advantages of probabilistic model over vector‐space
– Strong theoretical basis
– Since the base is probability theory, it is very well understood
– Easy to extend
• Disadvantages
– Models are often complicated
– No term frequency weighting
• Which is better: vector‐space or probabilistic?
– Both are approximately as good as each other
– Depends on collection, query, and other factors