IR Models: - Why IR Models? - Boolean IR Model - Vector Space IR Model - Probabilistic IR Model

The document discusses information retrieval (IR) models. It begins by introducing three common IR models: the Boolean model, vector space model, and probabilistic model. It then provides details on the Boolean model, which uses a simple binary approach to determine if a document is relevant based on whether terms are present or absent. The weighting scheme in the Boolean model is binary, assigning weights of either 0 or 1.


IR models

• Why IR models?
• Boolean IR Model
• Vector space IR model
• Probabilistic IR model
What is Information Retrieval?
• Information retrieval is the process of searching a large, unstructured corpus for documents that satisfy a user's information need.
– It is a tool that finds and selects, from a collection of items, a subset that serves the user's purpose.
• Much IR research focuses more specifically on text retrieval. But
there are many other interesting areas:
 Cross-language vs. multilingual information retrieval,
 Multimedia (audio, video & image) information retrieval (QBIC, WebSeek,
SaFe)
 Question-answering (AskJeeves, Answerbus).
 Digital and virtual libraries
Assignment 1 (Due: __ days)
Compare local vs. global research works on the following topic & submit
the assessment result. Your report should show the state-of-the-art
(including overview of the concept, its significance, major tasks,
architecture, approaches, concluding remarks with future research
direction & references). Share the soft-copy of your report & slides to all
the classmates, and Cc to me. There is a 10-minute presentation by
each group, which will start on April 09, 2012 (Monday).
1. Amharic IR system (Kifle & Martha)
2. Stemming and Thesaurus construction (Demewoz & Sintayehu)
3. IR Models (Daniel)
4. Query Expansion (Abdulkerim & Zealem)
5. Document Image Retrieval (Betsegaw & Tsegaye S.)
6. Cross Language IR (Ibsa & Eyob)
7. Multimedia IR (Besufekad, Tamirat & Kibrom)
8. Question Answering (Alemayehu & Getachew)
9. Recommender Systems (Mulalem & Brook)
10. Document Summarization (Tsegaye M. & Adey)
11. Information Extraction (Kibreab & Tesfaye)
12. Text Classification (Berihu & Yibeltal)
13. Information Filtering (Mengistu & Esubalew)
• Other topics: Web IR; Document provenance; Intelligent IR
Information Retrieval serves as a Bridge
• An Information Retrieval system serves as a bridge between the world of authors and the world of readers/users.
– That is, writers present a set of ideas in a document using a set of concepts; users then ask the IR system for relevant documents that satisfy their information need.

[Figure: the IR system as a "black box" between the user and the documents]
Typical IR System Architecture

[Figure: a query string from the user and the document corpus feed into the IR system, which returns a ranked list of relevant documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]
Our focus during IR system design
• In improving Effectiveness of the system
–The concern here is retrieving more relevant documents as per the
user's query
–Effectiveness of the system is measured in terms of precision,
recall, …
–Main emphasis: text operations (such as stemming, stopwords
removal, normalization, etc.), weighting schemes, matching
algorithms, …
• In improving Efficiency of the system
–The concern here is
• enhancing searching time, indexing time, access time…
• reducing storage space requirement of the system
• space – time tradeoffs
–Main emphasis:
• Compression
• Index terms selection (free text or content-bearing terms)
• indexing structures
Subsystems of IR system
The two subsystems of an IR system: Indexing and
Searching
–Indexing:
• is an offline process of organizing documents
using keywords extracted from the collection
• Indexing is used to speed up access to desired
information from the document collection as per the
user's query

–Searching
• Is an online process that scans the document corpus to find
relevant documents that match the user's query
Indexing Subsystem
The indexing pipeline transforms raw documents into an index file:
1. Documents → assign document identifier → document IDs
2. Tokenization → tokens
3. Stopword removal → non-stoplist tokens
4. Stemming & normalization → stemmed terms
5. Term weighting → weighted index terms
6. Weighted index terms are stored in the index file
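The pipeline above can be sketched in a few lines of Python. The stoplist and the crude suffix-stripping stemmer are illustrative stand-ins (a real system would use a curated stoplist and a proper stemmer such as Porter's):

```python
from collections import Counter, defaultdict

STOPWORDS = {"of", "in", "a", "the", "is", "for"}   # toy stoplist (assumption)

def tokenize(text):
    # Lowercase and split on whitespace; real systems also strip punctuation.
    return text.lower().replace(".", "").split()

def stem(token):
    # Crude suffix stripping as a stand-in for a real stemmer.
    for suffix in ("ment", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_documents(docs):
    # docs: {doc_id: text}; returns {term: {doc_id: raw term frequency}}
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = [stem(t) for t in tokenize(text) if t not in STOPWORDS]
        for term, freq in Counter(tokens).items():
            index[term][doc_id] = freq
    return dict(index)

index = index_documents({
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
})
# e.g. index["silver"] == {"D2": 2}; "of", "in", "a" never reach the index
```

Term weighting (e.g. TF*IDF, covered later) would then be computed from these raw frequencies before writing the index file.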
Searching Subsystem
The searching pipeline mirrors the indexing steps on the query side:
1. Query → parse query → query tokens
2. Stopword removal → non-stoplist tokens
3. Stemming & normalization → stemmed terms
4. Term weighting → query terms
5. Similarity measure between query terms and index terms from the index
6. Ranking → ranked document set drawn from the relevant document set
IR Models - Basic Concepts
 IR systems usually adopt index terms to
index and retrieve documents
Each document is represented by a set of
representative keywords or index terms (called
Bag of Words)
• An index term is a word useful for remembering the
document main themes
• Not all terms are equally useful for
representing the document contents:
less frequent terms allow identifying a narrower
set of documents
• But no ordering information is attached to the Bag of
Words identified from the document collection.
IR Models - Basic Concepts
•One central problem regarding IR systems is
the issue of predicting the degree of relevance
of documents for a given query
 Such a decision is usually dependent on a
ranking algorithm which attempts to
establish a simple ordering of the
documents retrieved
 Documents appearing at the top of this ordering
are considered to be more likely to be relevant
•Thus ranking algorithms are at the core of IR
systems
 The IR model adopted determines the prediction of
what is relevant and what is not, based on its
notion of relevance (e.g., probabilistic relevance)
How to find relevant documents for a query?
• Step 1: Map documents & queries into the term-document vector space. Note that queries are treated as short documents.
– Represent both documents & queries as N-dimensional vectors in a term-document matrix, which shows the occurrence of terms in the document collection or query:

  d_j = (t_{1,j}, t_{2,j}, ..., t_{N,j});   q_k = (t_{1,k}, t_{2,k}, ..., t_{N,k})

– The document collection is mapped to a term-by-document matrix with one row per document (D1 ... DM, plus the query Qi) and one column per term (T1 ... TN).
– Each document is viewed as a vector in multidimensional space; nearby vectors are related.
How to find relevant documents for a query?
• Step 2: Queries and documents are represented as weighted vectors, w_ij.
 Why do we need weighting techniques? To capture the importance of a term in describing the content of a given document.
 There are binary and non-binary weighting techniques. Any difference between the two?
 What method would you recommend to compute the weights for term i in document j and in query q (w_ij and w_iq)?
• An entry in the matrix corresponds to the "weight" of a term in the document (rows D1 ... DM and Qi, columns T1 ... TN, entries w_11 ... w_MN); zero means the term does not occur in the document.
• Normalize for vector length to avoid the effect of document length.
How to find relevant documents for a query?
• Step 3: Rank documents (in decreasing order of score) based on their closeness to the query.
 Documents are ranked by the degree of their closeness to the query.
 How is the closeness of a document to the query measured? It is determined by a similarity/dissimilarity score, for example the cosine similarity:

  sim(d_j, q) = (d_j · q) / (|d_j| |q|)
              = Σ_{i=1..n} (w_{i,j} × w_{i,q}) / ( sqrt(Σ_{i=1..n} w_{i,j}²) × sqrt(Σ_{i=1..n} w_{i,q}²) )

 How many matching (similarity/dissimilarity) measures do you know? Which one is best for IR?
How to evaluate Models?
• We need to investigate what procedures the IR Models
follow and what techniques they use:
– What is the weighting technique used by the IR Models for
measuring importance of terms in documents?
• Are they using binary or non-binary weight?
– What is the matching technique used by the IR models?
• Are they measuring similarity or dissimilarity?
– Are they applying exact matching or partial matching in the
course of finding relevant documents for a given query?
– Are they applying best matching principle to measure the
degree of relevance of documents to display in ranked-order?
• Is there any Ranking mechanism applied before displaying
relevant documents for the users?
The Boolean Model
• The Boolean model is a simple model based on set theory.
 It imposes a binary criterion for deciding relevance.
• Terms are either present or absent; thus w_ij ∈ {0, 1}.
• sim(q, d_j) = 1 if the document satisfies the Boolean query, and 0 otherwise.
– Note that no weights are assigned between 0 and 1; the term-document matrix (rows D1 ... DM, columns T1 ... TN) contains only the values 0 or 1.
The Boolean Model: Example
Given the following three documents, construct the term-document matrix and find the relevant documents retrieved by the Boolean model for the query "gold silver truck":
• D1: "Shipment of gold damaged in a fire"
• D2: "Delivery of silver arrived in a silver truck"
• D3: "Shipment of gold arrived in a truck"

The table below shows the document-term (t_i) matrix:

        arrive  damage  deliver  fire  gold  silver  ship  truck
D1        0       1       0       1     1      0      1     0
D2        1       0       1       0     0      1      0     1
D3        1       0       0       0     1      0      1     1
query     0       0       0       0     1      1      0     1

Also find the documents relevant for the queries:
(a) gold delivery; (b) ship gold; (c) silver truck
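A minimal sketch of Boolean (exact-match) retrieval over this collection, treating the space between query words as AND; the per-document term sets mirror the matrix above:

```python
# Term-document incidence for the three example documents
# ("shipment"/"damaged"/"delivery"/"arrived" reduced to their stems).
docs = {
    "D1": {"ship", "gold", "damage", "fire"},
    "D2": {"deliver", "silver", "arrive", "truck"},
    "D3": {"ship", "gold", "arrive", "truck"},
}

def boolean_and(query_terms):
    # A document is relevant only if it contains every query term (w in {0,1}).
    return sorted(d for d, terms in docs.items() if set(query_terms) <= terms)

print(boolean_and(["gold", "silver", "truck"]))  # [] : no document has all three
print(boolean_and(["silver", "truck"]))          # ['D2']
print(boolean_and(["ship", "gold"]))             # ['D1', 'D3']
```

Under a strict AND interpretation the query "gold silver truck" retrieves nothing, which illustrates the model's all-or-nothing behavior; an OR interpretation would instead retrieve all three documents with no ranking between them.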
The Boolean Model: Further Example
Given the following, determine the documents retrieved by a Boolean-model-based IR system:
• Index terms: K1, ..., K8
• Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5})
= {D1, D2, D6}
Exercise
Given the following four documents with the following
contents:
– D1 = “computer information retrieval”
– D2 = “computer retrieval”
– D3 = “information”
– D4 = “computer information”
• What are the relevant documents retrieved for the queries:
– Q1 = "information ∧ retrieval"
– Q2 = "information ∧ ¬computer"
Drawbacks of the Boolean Model
•Retrieval based on binary decision criteria
with no notion of partial matching
•No ranking of the documents is provided
(absence of a grading scale)
•Information need has to be translated into a
Boolean expression which most users find
awkward
•The Boolean queries formulated by the users
are most often too simplistic
 As a consequence, the Boolean model
frequently returns either too few or too
many documents in response to a user
query
Vector-Space Model
• This is the most commonly used strategy for measuring
relevance of documents for a given query. This is
because,
 Use of binary weights is too limiting
 Non-binary weights provide consideration for partial
matches
• These term weights are used to compute a degree of
similarity between a query and each document
 Ranked set of documents provides for better
matching
• The idea behind VSM is that
 the meaning of a document is conveyed by the words
used in that document
Vector-Space Model
To find relevant documents for a given query,
• First, map documents and queries into term-document
vector space.
Note that queries are considered as short document
• Second, in the vector space, queries and documents are
represented as weighted vectors, wij
There are different weighting techniques; the most widely used
one is the TF*IDF weight for each term
• Third, similarity measurement is used to rank documents
by the closeness of their vectors to the query.
To measure closeness of documents to the query cosine
similarity score is used by most search engines
Computing weights
• The vector space model with TF*IDF weights is a good ranking strategy for general collections.
• For index terms, a normalized TF*IDF weight is given by:

  w_ij = ( freq(i,j) / max_k freq(k,j) ) × log(N / n_i)

• The user's query is typically treated as a short document and is also TF*IDF weighted. For the query term weights, a suggestion is:

  w_iq = ( 0.5 + 0.5 × freq(i,q) / max_k freq(k,q) ) × log(N / n_i)

• The vector space model is usually as good as the known ranking alternatives. It is also simple and fast to compute.
Example: Computing weights
• A collection includes 10,000 documents.
 The term tA appears 20 times in a particular document j.
 The maximum frequency of any term tk in document j is 50.
 The term tA appears in 2,000 of the documents in the collection.
• Compute the TF*IDF weight of term A (using a base-2 logarithm):
 tf(A,j) = freq(A,j) / max_k freq(k,j) = 20/50 = 0.4
 idf(A) = log2(N/DF_A) = log2(10,000/2,000) = log2(5) = 2.32
 w_Aj = tf(A,j) × idf(A) = 0.4 × 2.32 = 0.928
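The worked example can be checked with a few lines of Python; note the base-2 logarithm, which is what makes log(5) ≈ 2.32:

```python
import math

def tfidf(freq_ij, max_freq_j, N, df_i):
    # Normalized term frequency times inverse document frequency
    # (base-2 log, matching the example above).
    tf = freq_ij / max_freq_j
    idf = math.log2(N / df_i)
    return tf * idf

w = tfidf(freq_ij=20, max_freq_j=50, N=10_000, df_i=2_000)
print(round(w, 3))  # 0.929 (the slide rounds idf to 2.32 first, giving 0.928)
```

The logarithm base only rescales all weights by a constant factor, so it does not affect the relative ranking of documents.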
Similarity Measure
• A similarity measure is a function that computes the degree of similarity/dissimilarity between document j and the user's query:

  sim(d_j, q) = (d_j · q) / (|d_j| |q|)
              = Σ_{i=1..n} (w_{i,j} × w_{i,q}) / ( sqrt(Σ_{i=1..n} w_{i,j}²) × sqrt(Σ_{i=1..n} w_{i,q}²) )

• Using a similarity score between the query and each document:
– It is possible to apply best matching, so that documents are ranked for retrieval in order of presumed relevance.
– It is possible to enforce a certain threshold so that we can control the size of the retrieved set of documents.
Vector Space with Term Weights and Cosine Similarity Measure
D_i = (d_{1i}, w_{1di}; d_{2i}, w_{2di}; ...; d_{ti}, w_{tdi})
Q = (q_{1i}, w_{1qi}; q_{2i}, w_{2qi}; ...; q_{ti}, w_{tqi})

  sim(Q, D_i) = Σ_{j=1..t} (w_{jq} × w_{jdi}) / ( sqrt(Σ_{j=1..t} w_{jq}²) × sqrt(Σ_{j=1..t} w_{jdi}²) )

[Figure: two-dimensional term space (Term A on the x-axis, Term B on the y-axis) showing the vectors Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)]

sim(Q, D2) = (0.4 × 0.2 + 0.8 × 0.7) / ( sqrt(0.4² + 0.8²) × sqrt(0.2² + 0.7²) )
           = 0.64 / (0.894 × 0.728) ≈ 0.98

sim(Q, D1) = (0.4 × 0.8 + 0.8 × 0.3) / ( sqrt(0.4² + 0.8²) × sqrt(0.8² + 0.3²) )
           = 0.56 / (0.894 × 0.854) ≈ 0.73
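The two-dimensional example can be reproduced directly:

```python
import math

def cosine(u, v):
    # Cosine of the angle between two term-weight vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine(Q, D2), 2))  # 0.98 -> D2 ranks above D1
print(round(cosine(Q, D1), 2))  # 0.73
```

D2 points in nearly the same direction as Q, so it gets the higher cosine score even though D1 has the larger first component.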
Example: Vector-Space Model
• Suppose a user queries for Q = "gold silver truck". The database collection consists of three documents with the following content:
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"
• Show the retrieval results in ranked order:
1. Assume that full-text terms are used during indexing, without removing common terms/stop words, and that no terms are stemmed.
2. Assume that content-bearing terms are selected during indexing.
3. Also compare your results with and without normalizing term frequency.
Example VSM: Weighting Terms

Terms    | Counts (TF)    | DF | IDF   | W_i = TF×IDF
         | Q  D1  D2  D3  |    |       | Q      D1     D2     D3
arrive   | 0  0   1   1   | 2  | 0.176 | 0      0      0.176  0.176
damage   | 0  1   0   0   | 1  | 0.477 | 0      0.477  0      0
deliver  | 0  0   1   0   | 1  | 0.477 | 0      0      0.477  0
fire     | 0  1   0   0   | 1  | 0.477 | 0      0.477  0      0
gold     | 1  1   0   1   | 2  | 0.176 | 0.176  0.176  0      0.176
silver   | 1  0   2   0   | 1  | 0.477 | 0.477  0      0.954  0
ship     | 0  1   0   1   | 2  | 0.176 | 0      0.176  0      0.176
truck    | 1  0   1   1   | 2  | 0.176 | 0.176  0      0.176  0.176

(IDF = log10(3/DF); W = raw TF × IDF, so silver in D2 is 2 × 0.477 = 0.954.)
Example VSM: Weighting Terms
Terms Q D1 D2 D3
arrive 0 0 0.176 0.176
damage 0 0.477 0 0
deliver 0 0 0.477 0
fire 0 0.477 0 0
gold 0.176 0.176 0 0.176
silver 0.477 0 0.954 0
ship 0 0.176 0 0.176
truck 0.176 0 0.176 0.176
Example VSM: Similarity Measure
• Compute similarity using the cosine measure, sim(q, d_j).
• First, for each document and the query, compute all vector lengths (zero terms ignored):
  |d1| = sqrt(0.477² + 0.477² + 0.176² + 0.176²) = sqrt(0.517) = 0.719
  |d2| = sqrt(0.176² + 0.477² + 0.954² + 0.176²) = sqrt(1.1996) = 1.095
  |d3| = sqrt(0.176² + 0.176² + 0.176² + 0.176²) = sqrt(0.124) = 0.352
  |q| = sqrt(0.176² + 0.477² + 0.176²) = sqrt(0.2896) = 0.538
• Next, compute the dot products (zero products ignored):
  q · d1 = 0.176 × 0.176 = 0.0310 (gold)
  q · d2 = 0.477 × 0.954 + 0.176 × 0.176 = 0.4862 (silver, truck)
  q · d3 = 0.176 × 0.176 + 0.176 × 0.176 = 0.0620 (gold, truck)
Example VSM: Ranking
Now, compute the similarity scores:
  sim(q, d1) = 0.0310 / (0.538 × 0.719) = 0.0801
  sim(q, d2) = 0.4862 / (0.538 × 1.095) = 0.8246
  sim(q, d3) = 0.0620 / (0.538 × 0.352) = 0.3271
Finally, we sort and rank the documents in descending order of similarity score:
  Rank 1: Doc 2 = 0.8246
  Rank 2: Doc 3 = 0.3271
  Rank 3: Doc 1 = 0.0801
• Exercise: using normalized TF, rank the documents using the cosine similarity measure. Hint: normalize the TF of term i in doc j by the maximum term frequency in doc j.
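The whole worked example fits in a short script; the weight dictionaries copy the TF*IDF table above (IDF = log10(3/DF), raw term counts as TF):

```python
import math

# TF*IDF weights for the query and the three documents, from the table.
weights = {
    "q":  {"gold": 0.176, "silver": 0.477, "truck": 0.176},
    "d1": {"damage": 0.477, "fire": 0.477, "gold": 0.176, "ship": 0.176},
    "d2": {"arrive": 0.176, "deliver": 0.477, "silver": 0.954, "truck": 0.176},
    "d3": {"arrive": 0.176, "gold": 0.176, "ship": 0.176, "truck": 0.176},
}

def cosine(a, b):
    # Dot product over shared terms, divided by the two vector lengths.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = lambda v: math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm(a) * norm(b))

ranking = sorted(("d1", "d2", "d3"),
                 key=lambda d: cosine(weights["q"], weights[d]), reverse=True)
print(ranking)  # ['d2', 'd3', 'd1']
```

The scores come out near 0.82, 0.33, and 0.08 (tiny differences from the slide's 0.8246 etc. are due to the slide's intermediate rounding), reproducing the ranking Doc 2 > Doc 3 > Doc 1.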
Vector-Space Model
• Advantages:
• Term-weighting improves quality of the answer set since
it helps to display relevant documents in ranked order
• Partial matching allows retrieval of documents that
approximate the query conditions
• Cosine ranking formula sorts documents according to
degree of similarity to the query

• Disadvantages:
• Assumes independence of index terms. It doesn’t relate
one term with another term
• Computationally expensive since it measures the
similarity between each document and the query
Exercise 1
Suppose the database collection consists of the following
documents.
c1: Human machine interface for Lab ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measure
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
Query:
Find documents relevant to "human computer
interaction"
Exercise 2
• Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
–Draw the term-document incidence matrix for this document
collection.
–Draw the inverted index representation for this collection.
• For the document collection shown above, what are the returned results for the queries:
–schizophrenia AND drug
–for AND NOT(drug OR approach)
Probabilistic Model
• IR is an uncertain process
–Mapping Information need to Query is not perfect
–Mapping Documents to index terms is a logical representation
–Query terms and index terms mostly mismatch
• This situation leads to several statistical approaches:
probability theory, fuzzy logic, theory of evidence,
language modeling, etc.
• The probabilistic retrieval model is a rigorous formal model that
attempts to predict the probability that a given document
will be relevant to a given query, i.e. Prob(R | q, di)
–Use probability to estimate the “odds” of relevance of a query to
a document.
–It relies on accurate estimates of probabilities
Probability Ranking
Principle
• The relevance of a given document to a user's query can be
determined by its probability score
–A high probability prob(R | di, q) means users are more likely to
get relevant information by reading document di.
• A Probabilistic retrieval model follows Probability ranking
principle
–You have a collection of Documents
• A set of relevant documents needs to be returned for
queries issued by users
• Intuitively, want the “best” document to be first, second
best - second, etc…
–According to probability ranking principle, documents are
ranked in decreasing order of probability of relevance to users
information need
Term Existence in Relevant Documents
N = the total number of documents in the collection
n = the total number of documents that contain term ti
R = the total number of relevant documents retrieved
r = the total number of relevant documents retrieved that contain term ti

Document relevance contingency table for term ti:

                         | Relevant docs | Non-relevant docs | Total
Docs including term ti   | r             | n - r             | n
Docs excluding term ti   | R - r         | N - R - (n - r)   | N - n
Total                    | R             | N - R             | N

The probabilistic term weight is then:

  w_i = log [ (r + 0.5)(N - n - R + r + 0.5) / ( (n - r + 0.5)(R - r + 0.5) ) ]
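The weight formula is easy to compute directly. The slide does not state the logarithm base; base 10 is assumed here because it reproduces the weights shown in the relevance-feedback example later in the slides (N=6, R=1, with the term "pot" in 2 documents, 1 of them relevant):

```python
import math

def term_weight(r, n, R, N):
    # Probabilistic term weight with 0.5 smoothing to avoid division by zero
    # (Robertson / Sparck Jones style), base-10 log assumed.
    return math.log10(((r + 0.5) * (N - n - R + r + 0.5)) /
                      ((n - r + 0.5) * (R - r + 0.5)))

print(round(term_weight(r=1, n=2, R=1, N=6), 2))  # 0.95  (term in the relevant doc)
print(round(term_weight(r=0, n=2, R=1, N=6), 2))  # -0.33 (term only in non-relevant docs)
```

Terms concentrated in relevant documents get positive weights; terms that appear only in non-relevant documents are penalized with negative weights.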
Computing term
probabilities
• Three cases: Relevance of documents for a given query
may be known, partially known or unknown
– Initially, there are no retrieved documents
– R is completely unknown
– Assume P(ti|R) is constant (usually 0.5)
– Assume P(ti|NR) approximated by distribution of ti
across collection – IDF
• This can be used to compute an initial ranking using IDF as the basic term weight
Probabilistic Model Example

[Table: binary document vectors <t_{d,t}> for six documents over the terms cold, day, eat, hot, lot, nine, old, pea, pizza, pot]

Initial term weights:
wt: cold 0.26, day 0.56, eat 0.56, hot 0.26, lot 0.56, nine 0.56, old 0.56, pea 0.0, pizza 0.0, pot 0.26

• q1 = eat
• q2 = eat pizza
• q4 = eat hot pizza
Improving the Ranking
• Now, suppose
– we have shown the initial ranking to the user
– the user has labeled some of the documents as
relevant ("relevance feedback")
• We now have
– N documents in collection, R are known relevant
documents
– ni documents containing ti, out of which ri are
relevant
Relevance Weighted Example

[Table: the same six binary document vectors over cold, day, eat, hot, lot, nine, old, pea, pizza, pot, now with relevance judgments: document 2 is relevant (R), documents 1, 3, 4, 5 and 6 are non-relevant (NR)]

Re-estimated term weights:
wt: cold -0.33, day 0.00, eat 0.00, hot -0.33, lot 0.00, nine 0.00, old 0.00, pea 0.62, pizza 0.62, pot 0.95

• query = hot pizza
• Document 2 is relevant
Probabilistic Retrieval Example
• D1: "Cost of paper is up." (relevant)
• D2: "Cost of jellybeans is up." (not relevant)
• D3: "Salaries of CEO's are up." (not relevant)
• D4: "Paper: CEO's labor cost up." (????)

        cost    paper   jellybean  salary  CEO     labor   up      Relevance
D1      1       1       0          0       0       0       1       R
D2      1       0       1          0       0       0       1       NR
D3      0       0       0          1       1       0       1       NR
D4      1       1       0          0       1       1       1       ??
W_ij    0.477   1.176   -0.477     -0.477  -0.477  0.222   -0.222

Document scores (sum of the weights of the terms each document contains):
• D1 = 0.477 + 1.176 - 0.222 = 1.431
• D2 = 0.477 - 0.477 - 0.222 = -0.222
• D3 = -0.477 - 0.477 - 0.222 = -1.176
• D4 = 0.477 + 1.176 - 0.477 + 0.222 - 0.222 = 1.176
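Scoring is then just the sum of the weights of the terms a document contains; a short sketch reproducing the sums above:

```python
# Term weights from the table above.
weights = {"cost": 0.477, "paper": 1.176, "jellybean": -0.477,
           "salary": -0.477, "ceo": -0.477, "labor": 0.222, "up": -0.222}

docs = {
    "D1": ["cost", "paper", "up"],
    "D2": ["cost", "jellybean", "up"],
    "D3": ["salary", "ceo", "up"],
    "D4": ["cost", "paper", "ceo", "labor", "up"],
}

# Score each document by summing the weights of its terms.
scores = {d: round(sum(weights[t] for t in terms), 3) for d, terms in docs.items()}
print(scores)  # {'D1': 1.431, 'D2': -0.222, 'D3': -1.176, 'D4': 1.176}
```

The unjudged document D4 scores almost as high as the known-relevant D1, so the model would present it near the top of the ranking.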
Exercise
• Consider the collection below. The collection has 5 documents and
each document is described by two terms. The initial guess of
relevance to a particular query Q is as given in the table below.
Assuming the query Q has a total of 2 relevant documents in this
collection solve the following questions
Document T1 T2 Relevance
D1 1 1 R
D2 0 1 NR
D3 1 0 NR
D4 1 0 R
D5 0 1 NR
• Using the probabilistic term-weighting formula, calculate the new weight for each of the query terms in Q
• Rank the documents according to their probability of relevance
with the new query
Probabilistic model
• Probabilistic model uses probability theory to model the
uncertainty in the retrieval process
– Assumptions are made explicit
– Term weight without relevance information is IDF
• Relevance feedback can improve the ranking by giving better
term probability estimates
• Advantages of probabilistic model over vector‐space
– Strong theoretical basis
– Since the base is probability theory, it is very well understood
– Easy to extend
• Disadvantages
– Models are often complicated
– No term frequency weighting
• Which is better: vector‐space or probabilistic?
– Both are approximately as good as each other
– Depends on collection, query, and other factors