5 IR Models
IR Models - Basic Concepts
Word evidence: Bag of words
IR systems usually adopt index terms to index and retrieve
documents
Each document is represented by a set of representative
keywords or index terms (called Bag of Words)
An index term is a word from a document that is useful for
remembering the document's main themes
Not all terms are equally useful for representing the document
contents:
less frequent terms allow identifying a narrower set of
documents
But no ordering information is attached to the bag of words
identified from the document collection (see the sketch below).
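To make this concrete, here is a minimal Python sketch of a bag-of-words representation; the small stopword list is an illustrative assumption, and the sample text is document c2 from the example collection later in this section.

```python
# Minimal bag-of-words sketch: a document becomes an unordered set of index terms.
# The small stopword list below is an illustrative assumption.
STOPWORDS = {"the", "of", "a", "to", "and", "for"}

def bag_of_words(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords.
    Returning a set discards all ordering information."""
    return {token.strip(".,:").lower() for token in text.split()} - STOPWORDS

doc = "A survey of user opinion of computer system response time"
print(bag_of_words(doc))
# e.g. {'survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'}
```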
IR Models - Basic Concepts
• One central problem regarding IR systems is the
issue of predicting which documents are relevant and
which are not
• Such a decision is usually dependent on a ranking
algorithm which attempts to establish a simple
ordering of the documents retrieved
• Documents appearing at the top of this ordering
are considered more likely to be relevant
• Thus ranking algorithms are at the core of IR systems
• The IR models determine the predictions of what is
relevant and what is not, based on the notion of
relevance implemented by the system
IR Models - Basic Concepts
• After preprocessing, N distinct terms remain; these
unique terms form the VOCABULARY
• Let
– ki be index term i and dj be document j
– K = {k1, k2, …, kN} be the set of all index terms
• Each term, i, in a document or query j, is given a real-
valued weight, wij.
– wij is a weight associated with (ki, dj). If wij = 0, it
indicates that the term does not belong to document dj
• The weight wij quantifies the importance of the index term
for describing the document contents
• vec(dj) = (w1j, w2j, …, wNj) is a weighted vector
associated with the document dj
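As a minimal sketch of this representation, assuming an illustrative five-term vocabulary and hand-picked weights (neither comes from the slides):

```python
# A document d_j represented as a weighted vector over the vocabulary K = (k1, ..., kN).
# Vocabulary and weights below are illustrative assumptions.
vocabulary = ["human", "interface", "computer", "user", "system"]

# vec(d_j) = (w_1j, w_2j, ..., w_Nj); w_ij = 0 means term k_i is absent from d_j.
d_j = [0.0, 0.5, 0.8, 0.0, 0.3]

for k_i, w_ij in zip(vocabulary, d_j):
    print(f"w({k_i}, d_j) = {w_ij}")
```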
Mapping Documents & Queries
Represent both documents and queries as N-dimensional
vectors in a term-document matrix, which shows the
occurrence of terms in the document collection or query:
dj = (t1,j, t2,j, …, tN,j);   qk = (t1,k, t2,k, …, tN,k)
An entry in the matrix corresponds to the “weight” of a
term in the document;
– The document collection is mapped to a term-by-document matrix
– The documents are viewed as vectors in multidimensional space
  • “Nearby” vectors are related
– Normalize the weight as usual for vector length to avoid the effect of document length

       T1    T2    …    TN
D1     w11   w12   …    w1N
D2     w21   w22   …    w2N
:      :     :          :
DM     wM1   wM2   …    wMN
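The sketch below builds such a term-by-document matrix for a two-document toy collection (documents c1 and c2 from the example later in this section) and length-normalizes each document vector; using raw term counts as weights is an assumption made only for illustration.

```python
import math

# Toy collection; raw term counts serve as weights purely for illustration.
docs = {
    "D1": "human machine interface for lab abc computer applications",
    "D2": "a survey of user opinion of computer system response time",
}

# Vocabulary: the N distinct terms of the collection.
vocabulary = sorted({t for text in docs.values() for t in text.split()})

def doc_vector(text):
    """Term-count vector over the vocabulary, normalized to unit length
    so that document length does not dominate the comparison."""
    tokens = text.split()
    counts = [tokens.count(term) for term in vocabulary]
    length = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / length for c in counts]

matrix = {name: doc_vector(text) for name, text in docs.items()}
for name, vec in matrix.items():
    print(name, [round(w, 2) for w in vec])
```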
Weighting Terms in Vector Space
The importance of the index terms is represented by
weights associated with them
Problem: to show the importance of the index term for
describing the document/query contents, what weight can
we assign?
Solution 1: Binary weights: t = 1 if the term is present, 0 otherwise
Similarity: number of terms in common
Problem: Not all terms are equally interesting
E.g. the vs. dog vs. cat
Solution: Replace binary weights with non-binary weights
dj = (w1,j, w2,j, …, wN,j);   qk = (w1,k, w2,k, …, wN,k)
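A small sketch contrasting the two weighting schemes; the example document, query, and the non-binary weights are assumptions used only to illustrate the point.

```python
# Binary weights: a term is either present (1) or absent (0), so the
# similarity reduces to the number of index terms shared by d_j and q.
def binary_overlap(doc_terms, query_terms):
    return len(set(doc_terms) & set(query_terms))

doc   = ["computer", "system", "response", "time"]
query = ["computer", "response"]
print(binary_overlap(doc, query))   # 2 terms in common

# With binary weights "the" counts as much as "dog" or "cat".
# Non-binary weights let uninformative terms contribute less (values are illustrative).
weights = {"the": 0.01, "dog": 0.90, "cat": 0.85}
print(weights)
```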
The Boolean Model
• Boolean model is a simple model based on set theory
• The Boolean model imposes a binary criterion for
deciding relevance
• Terms are either present or absent. Thus,
wij ∈ {0, 1}
• sim(q, dj) = 1 if the document satisfies the Boolean query, 0 otherwise
- Note that no weights in between 0 and 1 are assigned; only the values 0 or 1 can be used

       T1    T2    …    TN
D1     w11   w12   …    w1N
D2     w21   w22   …    w2N
:      :     :          :
DM     wM1   wM2   …    wMN
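A minimal sketch of Boolean matching, assuming a hypothetical binary incidence matrix and an example query; sim(q, dj) is 1 exactly when the document satisfies the query.

```python
# Boolean model: w_ij is restricted to {0, 1}.
# The incidence rows and the example query below are hypothetical.
docs = {
    "D1": {"k1": 1, "k2": 0, "k3": 1},
    "D2": {"k1": 1, "k2": 1, "k3": 0},
    "D3": {"k1": 0, "k2": 1, "k3": 1},
}

def sim(query, d):
    """Returns 1 if the document's binary term row satisfies the query, else 0."""
    return 1 if query(d) else 0

# Example Boolean query: k2 AND NOT k3
query = lambda d: d["k2"] and not d["k3"]

for name, d in docs.items():
    print(name, sim(query, d))
```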
The Boolean Model: Example
• Generate the relevant documents retrieved by
the Boolean model for the query:
q = k1 ∧ (k2 ∨ k3)
[Venn diagram: documents d1–d7 distributed over the three index-term sets k1, k2, and k3]
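Because the Boolean model is set-theoretic, the same query can be answered with set operations over each term's set of documents; the document-to-term assignments below are hypothetical and not read off the diagram.

```python
# Posting sets: which of d1..d7 contain each index term (hypothetical assignments).
postings = {
    "k1": {"d2", "d4", "d5", "d6"},
    "k2": {"d1", "d2", "d6", "d7"},
    "k3": {"d3", "d4", "d5", "d6", "d7"},
}

# q = k1 AND (k2 OR k3): intersect k1's documents with the union of k2's and k3's.
answer = postings["k1"] & (postings["k2"] | postings["k3"])
print(sorted(answer))   # documents satisfying the query
```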
The Boolean Model: Example
• Given the following, determine the documents retrieved by a
Boolean-model-based IR system
• Index terms: k1, …, k8
• Documents:
• Cosine similarity between the weighted document vector dj and the query vector q:
  sim(dj, q) = cos(θ) = (dj · q) / (|dj| × |q|)
             = Σi=1..n wi,j · wi,k / ( √(Σi=1..n wi,j²) × √(Σi=1..n wi,k²) )
• Disadvantages:
  – assumes independence of index terms (??)
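A short sketch of this cosine computation in plain Python; the example weight vectors are assumptions.

```python
import math

def cosine(d, q):
    """Cosine of the angle between a document weight vector d and a query vector q."""
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

d_j = [0.5, 0.8, 0.3]   # illustrative document weights w_i,j
q_k = [0.0, 1.0, 1.0]   # illustrative query weights w_i,k
print(round(cosine(d_j, q_k), 3))
```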
Suppose the document collection consists of the following
documents.
c1: Human machine interface for Lab ABC computer
applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measure
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
Query:
Find documents relevant to "human computer"
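One way to run this example, assuming the query is interpreted as the Boolean conjunction human AND computer over the nine titles above (a sketch, not an official answer):

```python
# Boolean retrieval for the query "human computer", read as: human AND computer.
collection = {
    "c1": "Human machine interface for Lab ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user-perceived response time to error measure",
    "m1": "The generation of random, binary, unordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}

def index_terms(text):
    return {t.strip(".,:").lower() for t in text.split()}

hits = [doc for doc, text in collection.items()
        if {"human", "computer"} <= index_terms(text)]
print(hits)   # documents containing both query terms
```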
Exercises
Given the following documents, rank the documents
according to their relevance to the query using the
cosine similarity, Euclidean distance, and inner
product measures.
docID words in document
1 Taipei Taiwan
2 Macao Taiwan Shanghai
3 Japan Sapporo
4 Sapporo Osaka Taiwan
Query: Taiwan Sapporo
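A sketch for working through the exercise, assuming simple binary term weights over the vocabulary of these four documents; it prints all three measures per document rather than a fixed ranking.

```python
import math

docs = {
    1: "Taipei Taiwan",
    2: "Macao Taiwan Shanghai",
    3: "Japan Sapporo",
    4: "Sapporo Osaka Taiwan",
}
query = "Taiwan Sapporo"

vocab = sorted({t for text in docs.values() for t in text.split()})

def binary_vector(text):
    """Binary weights: 1 if the term occurs in the text, 0 otherwise."""
    present = set(text.split())
    return [1 if term in present else 0 for term in vocab]

q = binary_vector(query)
for doc_id, text in docs.items():
    d = binary_vector(text)
    inner = sum(wd * wq for wd, wq in zip(d, q))
    euclid = math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(d, q)))
    cos = inner / (math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q)))
    print(doc_id, {"inner": inner, "euclidean": round(euclid, 3), "cosine": round(cos, 3)})
```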