IR Models: - Why IR Models? - Boolean IR Model - Vector Space IR Model - Probabilistic IR Model
• Why IR models?
• Boolean IR Model
• Vector space IR model
• Probabilistic IR model
What is Information Retrieval?
• Information retrieval is the process of searching a large, unstructured corpus for relevant documents that satisfy a user's information need.
– It is a tool that finds and selects, from a collection of items, a subset that serves the user's purpose.
• Much IR research focuses more specifically on text retrieval, but there are many other interesting areas:
– Cross-language vs. multilingual information retrieval
– Multimedia (audio, video & image) information retrieval (QBIC, WebSeek, SaFe)
– Question answering (AskJeeves, Answerbus)
– Digital and virtual libraries
Assignment 1 (Due: __ days)
Compare local vs. global research works on the following topics & submit
the assessment result. Your report should show the state of the art
(including an overview of the concept, its significance, major tasks,
architecture, approaches, concluding remarks with future research
directions & references). Share the soft copy of your report & slides with all
the classmates, and Cc me. There is a 10-minute presentation by
each group, starting on April 09, 2012 (Monday).
1. Amharic IR system (Kifle & Martha)
2. Stemming and Thesaurus construction (Demewoz & Sintayehu)
3. IR Models (Daniel)
4. Query Expansion (Abdulkerim & Zealem)
5. Document Image Retrieval (Betsegaw & Tsegaye S.)
6. Cross Language IR (Ibsa & Eyob)
7. Multimedia IR (Besufekad, Tamirat & Kibrom)
8. Question Answering (Alemayehu & Getachew)
9. Recommender Systems (Mulalem & Brook)
10. Document Summarization (Tsegaye M. & Adey)
11. Information Extraction (Kibreab & Tesfaye)
12. Text Classification (Berihu & Yibeltal)
13. Information Filtering (Mengistu & Esubalew)
• Web IR; Document provenance; Intelligent IR
Information Retrieval serves as a
Bridge
• An Information Retrieval system serves as a bridge
between the world of authors and the world of
readers/users.
– That is, writers present a set of ideas in a document using
a set of concepts; users then query the IR system for
relevant documents that satisfy their information need.
[Figure: the user reaches the documents through the IR system as a "black box": User → Black box → Documents]
Typical IR System Architecture
[Figure: a query string is submitted to the IR System, which searches the document corpus and returns a ranked list of relevant documents: 1. Doc1, 2. Doc2, 3. Doc3, …]
Our focus during IR system design
• In improving Effectiveness of the system
–The concern here is retrieving more relevant documents as per the
user's query
–Effectiveness of the system is measured in terms of precision,
recall, …
–Main emphasis: text operations (such as stemming, stopword
removal, normalization, etc.), weighting schemes, matching
algorithms, …
• In improving Efficiency of the system
–The concern here is:
• reducing searching time, indexing time, access time, …
• reducing the storage space requirement of the system
• managing space–time tradeoffs
–Main emphasis:
• compression
• index-term selection (free text or content-bearing terms)
• indexing structures
Subsystems of an IR system
The two subsystems of an IR system: Indexing and
Searching
–Indexing:
• is an offline process of organizing documents
using keywords extracted from the collection
• indexing is used to speed up access to desired
information from the document collection as per the
user's query
–Searching:
• is an online process that scans the document corpus to find
relevant documents that match the user's query
Indexing Subsystem
[Flow diagram:]
documents
→ Assign document identifier → document IDs
→ Tokenization → tokens
→ Stopword removal → non-stoplist tokens
→ Stemming & Normalization → stemmed terms
→ Term weighting → weighted index terms
→ Index File
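The indexing pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not a production indexer: the stopword list and the suffix-stripping "stemmer" are toy stand-ins (a real system would use, e.g., a Porter stemmer), and term weighting is reduced to raw term frequencies stored in an inverted index.

```python
# Toy stopword list (a real system would use a fuller one).
STOPWORDS = {"a", "an", "the", "of", "in", "for", "to"}

def tokenize(text):
    # Normalization + tokenization: lowercase, strip simple punctuation,
    # split on whitespace.
    return text.lower().replace(",", " ").replace(".", " ").split()

def stem(token):
    # Toy suffix-stripping stemmer (stands in for Porter stemming).
    for suffix in ("ment", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def build_index(docs):
    # docs: mapping of document id -> raw text.
    # Returns an inverted index: term -> {doc_id: term frequency}.
    index = {}
    for doc_id, text in docs.items():
        for token in tokenize(text):
            if token in STOPWORDS:
                continue
            term = stem(token)
            index.setdefault(term, {}).setdefault(doc_id, 0)
            index[term][doc_id] += 1
    return index
```

The inverted index built offline here is exactly what the searching subsystem later probes at query time.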
Searching Subsystem
[Flow diagram:]
query
→ Parse query → query tokens
→ Stopword removal → non-stoplist tokens
→ Stemming & Normalization → stemmed terms
→ Term weighting → query terms
→ Similarity measure (query terms vs. index terms from the Index file)
→ Ranking → ranked relevant document set
IR Models - Basic Concepts
IR systems usually adopt index terms to
index and retrieve documents
Each document is represented by a set of
representative keywords or index terms (called a
Bag of Words)
• An index term is a word useful for remembering the
document's main themes
• Not all terms are equally useful for
representing the document contents:
less frequent terms allow identifying a narrower
set of documents
• But no ordering information is attached to the Bag of
Words identified from the document collection.
IR Models - Basic Concepts
•One central problem regarding IR systems is
predicting the degree of relevance
of documents for a given query
Such a decision usually depends on a
ranking algorithm, which attempts to
establish a simple ordering of the
documents retrieved
Documents appearing at the top of this
ordering are considered more likely
to be relevant
•Thus ranking algorithms are at the core of IR
systems
The IR model determines the predictions of
what is relevant and what is not, based on
the notion of relevance it adopts (e.g.,
probabilistic relevance)
How to find relevant documents for a
query?
• Step 1: Map documents & queries into a term-document vector
space. Note that queries are treated as short documents.
– Represent both documents & queries as N-dimensional vectors in
a term-document matrix, which shows the occurrence of terms in the
document collection or query:
dj = (t1,j, t2,j, …, tN,j);  qk = (t1,k, t2,k, …, tN,k)
– The document collection is mapped to a term-by-document matrix
with one row per document D1 … DM (plus the query Qi) and one
column per term T1 … TN
– View each document as a vector in multidimensional space
• Nearby vectors are related
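Step 1 can be sketched directly: build the vocabulary, then map each document and the query to an N-dimensional count vector. The documents below are the gold/silver/truck collection used in the worked examples later in these slides.

```python
docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

# Vocabulary: every distinct term in the collection, in sorted order,
# so each term gets a fixed dimension.
vocab = sorted({t for text in docs.values() for t in text.split()})

def to_vector(text):
    # N-dimensional vector of raw term counts; the query is treated
    # as a short document, as the slide notes.
    tokens = text.split()
    return [tokens.count(term) for term in vocab]

matrix = {doc_id: to_vector(text) for doc_id, text in docs.items()}
q_vec = to_vector(query)
```

Each row of `matrix` is one row of the term-by-document matrix; for instance, the dimension for "silver" in D2 holds the count 2.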
How to find relevant documents for a
query?
• Step 2: Queries and documents are represented as
weighted vectors, wij
Why do we need weighting techniques?
To quantify the importance of a term in describing the content
of a given document.
There are binary weights & non-binary weighting techniques.
Any difference between the two?
What method would you recommend to compute weights for term i in
document j and in query q; wij and wiq?
• An entry in the matrix corresponds to the "weight" of a term in the
document; zero means the term doesn't exist in the document:
     T1   T2   …  TN
D1   w11  w12  …  w1N
D2   w21  w22  …  w2N
:    :    :       :
DM   wM1  wM2  …  wMN
Qi   wi1  wi2  …  wiN
• Normalize for vector length to avoid the effect of document length
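The two weighting families mentioned above can be sketched side by side. This assumes raw term counts are already available; the tf-idf variant shown uses idf = log10(N/df), matching the worked example later in these slides, but other idf formulations exist.

```python
import math

def binary_weight(tf):
    # Binary weighting: a term is either present (1) or absent (0).
    return 1 if tf > 0 else 0

def tfidf_weight(tf, df, n_docs):
    # Non-binary (tf-idf) weighting.
    # tf: count of the term in this document; df: number of documents
    # containing the term; n_docs: collection size. Terms occurring in
    # every document get idf = 0, hence weight 0.
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log10(n_docs / df)
```

The difference the slide asks about: binary weights lose all frequency information, while tf-idf rewards terms that are frequent in a document but rare in the collection.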
How to find relevant documents for a
query?
• Step 3: Rank documents (in increasing or decreasing
order) based on their closeness to the query.
Documents are ranked by the degree of their closeness to
the query.
How is the closeness of a document to the query measured?
It is determined by a similarity/dissimilarity score.
How many matching (similarity/dissimilarity)
measures do you know? Which one is best for IR?
The cosine similarity between document dj and query q is:

sim(dj, q) = (dj · q) / (|dj| |q|)
           = Σi=1..n (wi,j × wi,q) / [ sqrt(Σi=1..n wi,j²) × sqrt(Σi=1..n wi,q²) ]
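The cosine similarity measure just described is a short function: the dot product of the two weight vectors divided by the product of their Euclidean lengths.

```python
import math

def cosine_sim(d, q):
    # d, q: equal-length weight vectors for a document and a query.
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        # An empty vector matches nothing.
        return 0.0
    return dot / (norm_d * norm_q)
```

Because both vectors are length-normalized, the score depends only on the angle between them, which is what removes the effect of document length.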
How to evaluate Models?
• We need to investigate what procedures the IR Models
follow and what techniques they use:
– What is the weighting technique used by the IR Models for
measuring importance of terms in documents?
• Are they using binary or non-binary weight?
– What is the matching technique used by the IR models?
• Are they measuring similarity or dissimilarity?
– Are they applying exact matching or partial matching in the
course of finding relevant documents for a given query?
– Are they applying best matching principle to measure the
degree of relevance of documents to display in ranked-order?
• Is there any Ranking mechanism applied before displaying
relevant documents for the users?
The Boolean Model
•The Boolean model is a simple model based on
set theory
The Boolean model imposes a binary
criterion for deciding relevance
•Terms are either present or absent. Thus,
wij ∈ {0,1}
•sim(q,dj) = 1 if the document satisfies the
Boolean query
          = 0 otherwise
–Note that no weights are assigned in between 0
and 1; every entry wij of the term-document
matrix is just the value 0 or 1
The Boolean Model: Example
Given the following three documents, construct the term-
document matrix and find the relevant documents
retrieved by the Boolean model for the query "gold
silver truck"
• D1: "Shipment of gold damaged in a fire"
• D2: "Delivery of silver arrived in a silver truck"
• D3: "Shipment of gold arrived in a truck"
The table below shows the document–term (ti) matrix:

       arrive damage deliver fire gold silver ship truck
D1       0      1      0      1    1     0     1    0
D2       1      0      1      0    0     1     0    1
D3       1      0      0      0    1     0     1    1
query    0      0      0      0    1     1     0    1
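A sketch of Boolean retrieval over these three documents. The query "gold silver truck" is ambiguous as written, so both readings are shown: as a conjunction (gold AND silver AND truck) and as a disjunction (gold OR silver OR truck). The contrast illustrates the model's binary, all-or-nothing behavior.

```python
docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
# Binary representation: each document is just the set of its terms.
doc_terms = {doc_id: set(text.split()) for doc_id, text in docs.items()}
query_terms = {"gold", "silver", "truck"}

# AND reading: documents containing every query term.
and_hits = [d for d, terms in doc_terms.items() if query_terms <= terms]
# OR reading: documents containing at least one query term.
or_hits = [d for d, terms in doc_terms.items() if query_terms & terms]
```

Under AND no document qualifies (none contains all three terms), while under OR all three documents are returned unranked; neither answer distinguishes D2, which a human would likely judge most relevant.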
[Figure: query and document vectors plotted in a two-dimensional term space (axes Term A and Term B, ticks from 0 to 1.0)]
Worked similarity computation for the two-term case:

sim(Q, D2) = Σj (wjq × wjdi) / sqrt(Σj wjq² × Σj wjdi²)
           = (0.4 × 0.2 + 0.8 × 0.7) / sqrt[(0.4² + 0.8²) × (0.2² + 0.7²)]
           = 0.64 / 0.651 ≈ 0.98

sim(Q, D1) ≈ 0.74
Example: Vector-Space
Model
• Suppose a user queries for: Q = "gold silver truck". The
database collection consists of three documents with the
following content.
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"
• Show the retrieval results in ranked order.
1. Assume that full-text terms are used during indexing,
without removing common terms or stop words, & also no
terms are stemmed.
2. Assume that content-bearing terms are selected during
indexing.
3. Also compare your results with and without normalizing
term frequency.
Example VSM: Weighting
Terms
Term counts (TF), document frequencies (DF), IDF = log10(N/DF) with N = 3,
and weights W = TF × IDF:

Terms     TF: Q D1 D2 D3   DF  IDF    W: Q     D1     D2     D3
a             0  1  1  1    3  0         0     0      0      0
arrived       0  0  1  1    2  0.176     0     0      0.176  0.176
damaged       0  1  0  0    1  0.477     0     0.477  0      0
delivery      0  0  1  0    1  0.477     0     0      0.477  0
fire          0  1  0  0    1  0.477     0     0.477  0      0
gold          1  1  0  1    2  0.176     0.176 0.176  0      0.176
in            0  1  1  1    3  0         0     0      0      0
of            0  1  1  1    3  0         0     0      0      0
shipment      0  1  0  1    2  0.176     0     0.176  0      0.176
silver        1  0  2  0    1  0.477     0.477 0      0.954  0
truck         1  0  1  1    2  0.176     0.176 0      0.176  0.176

Vector lengths: |q| = sqrt(0.176² + 0.477² + 0.176²) ≈ 0.538;
|d1| ≈ 0.719; |d2| ≈ 1.095; |d3| ≈ 0.352
• Next, compute dot products (zero products ignored):
Q·D1 = 0.176 × 0.176 ≈ 0.0310
Q·D2 = 0.477 × 0.954 + 0.176 × 0.176 ≈ 0.4862
Q·D3 = 0.176 × 0.176 + 0.176 × 0.176 ≈ 0.0620
Example VSM: Ranking
Now, compute the similarity scores:
Sim(q,d1) = 0.0310 / (0.538 × 0.719) = 0.0801
Sim(q,d2) = 0.4862 / (0.538 × 1.095) = 0.8246
Sim(q,d3) = 0.0620 / (0.538 × 0.352) = 0.3271
Finally, we sort and rank the documents in descending
order of similarity score:
Rank 1: Doc 2 = 0.8246
Rank 2: Doc 3 = 0.3271
Rank 3: Doc 1 = 0.0801
• Exercise: using normalized TF, rank the documents
using the cosine similarity measure. Hint: normalize the
TF of term i in doc j by the maximum term frequency in doc j.
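The worked tf-idf/cosine ranking above can be reproduced end to end in a short script, under the same assumptions: raw term frequency, idf = log10(N/df), no stopword removal or stemming.

```python
import math

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

vocab = sorted({t for text in docs.values() for t in text.split()})
# Document frequency and idf for each term.
df = {t: sum(1 for text in docs.values() if t in text.split()) for t in vocab}
idf = {t: math.log10(len(docs) / df[t]) for t in vocab}

def weights(text):
    # tf-idf weight vector over the vocabulary.
    tokens = text.split()
    return [tokens.count(t) * idf[t] for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

q = weights(query)
scores = {d: cosine(weights(text), q) for d, text in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
# ranking comes out D2, D3, D1, matching the hand computation.
```

Running this yields scores of about 0.82, 0.33 and 0.08 for D2, D3 and D1 respectively, agreeing with the figures derived by hand.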
Vector-Space Model
• Advantages:
• Term-weighting improves quality of the answer set since
it helps to display relevant documents in ranked order
• Partial matching allows retrieval of documents that
approximate the query conditions
• Cosine ranking formula sorts documents according to
degree of similarity to the query
• Disadvantages:
• Assumes independence of index terms. It doesn’t relate
one term with another term
• Computationally expensive since it measures the
similarity between each document and the query
Exercise 1
Suppose the database collection consists of the following
documents.
c1: Human machine interface for Lab ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measure
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
Query:
Find documents relevant to "human computer
interaction"
Exercise 2
• Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
–Draw the term-document incidence matrix for this document
collection.
–Draw the inverted index representation for this collection.
Ranked querying example, over the terms cold, day, eat, hot, lot, nine,
old, pea, pizza, pot, with initial term weights:
wt  0.26 0.56 0.56 0.26 0.56 0.56 0.56 0.0 0.0 0.26
Queries:
• q1 = eat
• q2 = eat pizza
• q4 = eat hot pizza
Improving the Ranking
• Now, suppose
– we have shown the initial ranking to the user
– the user has labeled some of the documents as
relevant ("relevance feedback")
• We now have
– N documents in collection, R are known relevant
documents
– ni documents containing ti, out of which ri are
relevant
Relevance weighting:
Example
Document vectors <td,t> for six documents over the terms cold, day,
eat, hot, lot, nine, old, pea, pizza, pot:
[Table: binary document–term incidence matrix with relevance
judgments: document 2 is relevant (R); documents 1, 3, 4, 5, 6 are
non-relevant (NR)]
Relevance-weighted term weights (same term order):
wt  -0.33 0.00 0.00 -0.33 0.00 0.00 0.00 0.62 0.62 0.95
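The wt row above is consistent with the classic relevance weight (Robertson/Sparck Jones style) using base-10 logarithms, N = 6 documents and R = 1 known relevant document; the function below is a sketch of that formula, with the 0.5 terms smoothing the estimates so no log of zero occurs.

```python
import math

def relevance_weight(N, R, n, r):
    # N: collection size; R: number of known relevant documents;
    # n: documents containing the term; r: relevant documents
    # containing the term. Returns the smoothed relevance weight.
    return math.log10(((r + 0.5) * (N - n - R + r + 0.5)) /
                      ((n - r + 0.5) * (R - r + 0.5)))
```

With R = 0 and r = 0 (no feedback yet) the same formula also reproduces the earlier initial weights such as 0.56 for terms occurring in one document; after feedback, a term in the relevant document and only one other (n = 2, r = 1) rises to about 0.95, while a term absent from the relevant document (n = 2, r = 0) drops to about -0.33.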