Chapter 4 IR Models
1
IR Models - Basic Concepts
Word evidence: Bag of words
• IR systems usually adopt index terms to index and retrieve
documents
• Each document is represented by a set of representative
keywords or index terms (called Bag of Words)
2
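As a small illustration (a hedged sketch, not part of the original slides), the Python snippet below turns a raw document into its bag-of-words representation; the example sentence is taken from a later slide, and the tiny stop-word list is an assumption for illustration only.

```python
from collections import Counter

def bag_of_words(text, stopwords={"a", "of", "in", "the"}):
    """Lowercase, split on whitespace, drop stop words, count occurrences."""
    tokens = [t.strip(".,").lower() for t in text.split()]
    return Counter(t for t in tokens if t not in stopwords)

print(bag_of_words("Shipment of gold damaged in a fire"))
# -> Counter({'shipment': 1, 'gold': 1, 'damaged': 1, 'fire': 1})
```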
IR Models - Basic Concepts
• One central problem regarding IR systems is the
issue of predicting which documents are relevant
and which are not
• Such a decision is usually dependent on a ranking
algorithm which attempts to establish a simple
ordering of the documents retrieved
• Documents appearing at the top of this ordering
are considered to be more likely to be relevant
• Thus ranking algorithms are at the core of IR systems
• The IR models determine the predictions of what is
relevant and what is not, based on the notion of
relevance implemented by the system
3
IR Models - Basic Concepts
• After preprocessing, N distinct terms remain; these
unique terms form the VOCABULARY
• Let
– ki be an index term i & dj be a document j
– K = (k1, k2, …, kN) is the set of all index terms
• Each term, i, in a document or query j, is given a real-
valued weight, wij.
– wij is a weight associated with the pair (ki, dj). If wij = 0, it
indicates that the term does not belong to document dj
• The weight wij quantifies the importance of the index term
for describing the document contents
• Vec(dj) = (w1j, w2j, …, wNj) is a term-weighted vector
associated with the document dj
4
Mapping Documents & Queries
Represent both documents and queries as N-dimensional
vectors in a term-document matrix, which shows occurrence
of terms in the document collection or query
E.g. dj = (t1,j, t2,j, …, tN,j); qk = (t1,q, t2,q, …, tN,q)
An entry in the matrix corresponds to the “weight” of a
term in the document
– Document collection is mapped to a term-by-document matrix
– View each row as a vector in multidimensional space
• Nearby vectors are related
– Normalize for vector length to avoid the effect of document length

      T1    T2    …    TN
D1    w11   w12   …    w1N
D2    w21   w22   …    w2N
:     :     :          :
DM    wM1   wM2   …    wMN
5
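The short Python sketch below (an illustration, not part of the original slides) builds such a term-by-document matrix of raw term counts for a tiny assumed toy collection; rows are documents and columns are the vocabulary terms.

```python
from collections import Counter

# Toy documents, assumed for illustration only
docs = ["gold silver truck",
        "shipment of gold",
        "delivery of silver"]

# Build the vocabulary (the N distinct terms) and the M x N matrix of counts
counts = [Counter(d.lower().split()) for d in docs]
vocab = sorted(set(t for c in counts for t in c))
matrix = [[c[t] for t in vocab] for c in counts]

print(vocab)             # column labels T1 .. TN
for i, row in enumerate(matrix, 1):
    print(f"D{i}", row)  # one weight vector per document
```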
Weighting Terms in Vector Space
The importance of the index terms is represented by weights
associated to them
Problem: what weight should we assign to show the importance
of an index term for describing the document/query contents?
Solution 1: Binary weights: wij = 1 if the term is present, 0 otherwise
Similarity: number of terms in common between the
document and the query
Problem: Not all terms are equally interesting
E.g. the vs. dog vs. cat
Solution: Replace binary weights with non-binary weights
dj = (w1,j, w2,j, …, wN,j); qk = (w1,k, w2,k, …, wN,k)
6
The Boolean Model
• Boolean model is a simple model based on set theory
• Boolean model imposes a binary criterion
for deciding relevance
• Terms are either present or absent. Thus,
wij ∈ {0,1}
• sim(q,dj) = 1, if document dj satisfies the boolean query
              0, otherwise
- Note that no weights are assigned in-between 0 and 1;
the only possible values are 0 or 1

      T1    T2    …    TN
D1    w11   w12   …    w1N
D2    w21   w22   …    w2N
:     :     :          :
DM    wM1   wM2   …    wMN
7
The Boolean Model: Example
• Generate the relevant documents retrieved by
the Boolean model for the query :
q = k1 ∧ (k2 ∨ ¬k3)
[Venn diagram: documents d1–d7 distributed over the regions of the index terms k1, k2 and k3]
8
The Boolean Model: Example
• Given the following determine documents retrieved by the
Boolean model based IR system
• Index Terms: K1, …,K8.
• Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5})
= {D1, D2, D6}
9
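This answer can be checked mechanically with set operations; the short Python sketch below (illustrative, with the document sets copied from the slide) evaluates K1 ∧ (K2 ∨ ¬K3).

```python
# Documents as sets of index terms (from the slide)
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}

def posting(term):
    """Set of documents that contain the given index term."""
    return {d for d, terms in docs.items() if term in terms}

all_docs = set(docs)
# K1 AND (K2 OR NOT K3)
answer = posting("K1") & (posting("K2") | (all_docs - posting("K3")))
print(sorted(answer))   # ['D1', 'D2', 'D6']
```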
The Boolean Model: Further Example
Given the following three documents, construct the term-document
matrix and find the relevant documents retrieved by the Boolean
model for the given query.
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
• Query: “gold silver truck”
Also find the relevant documents for the queries:
• (a) “gold delivery”
• (b) “ship gold”
• (c) “silver truck”
The document-term (ti) matrix records, for each term ti, whether it
occurs in each document (a construction sketch follows this slide).
11
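A minimal sketch of how that binary document-term matrix could be built (illustrative only; it assumes lower-casing, a small assumed stop-word list, and terms ordered by first appearance, since the original slide's exact term order is not reproduced here):

```python
docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
stopwords = {"a", "in", "of"}          # assumed preprocessing

def terms(text):
    return [t.lower() for t in text.split() if t.lower() not in stopwords]

# Vocabulary in order of first appearance across the collection
vocab = []
for text in docs.values():
    for t in terms(text):
        if t not in vocab:
            vocab.append(t)

# Binary (0/1) document-term matrix for the Boolean model
print(vocab)
for name, text in docs.items():
    present = set(terms(text))
    print(name, [1 if t in present else 0 for t in vocab])

query = "gold silver truck"
print("Q ", [1 if t in set(terms(query)) else 0 for t in vocab])
```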
Exercise
Given the following four documents with the following contents:
D1 = “computer information retrieval”
D2 = “computer retrieval”
D3 = “information”
D4 = “computer information”
16
Vector-Space Model
• This is the most commonly used strategy for measuring the
relevance of documents to a given query. This is
because:
• Use of binary weights is too limiting
• Non-binary weights provide consideration for partial
matches
• The term weights are used to compute a degree of
similarity between a query and each document
• Ranked set of documents provides for better matching
• The idea behind VSM is that
• the meaning of a document is conveyed by the words
used in that document and the weights they carry
17
Vector-Space Model
To find relevant documents for a given query:
• First, documents and queries are mapped into term
vector space.
• Note that queries are treated as short documents
• A short document means one with few words
• Second, in the vector space, queries and documents are
represented as weighted vectors
• There are different weighting techniques; the most
widely used one is computing tf*idf for each term
• Third, a similarity measure is used to rank documents by
the closeness of their vectors to the query vector.
• Documents are ranked by closeness to the query; closeness
is determined by a similarity score calculation
18
Term-document matrix.
A collection of n documents and a query can be represented
in the vector space model by a term-document matrix.
An entry in the matrix corresponds to the “weight” of a term in
the document;
zero means the term has no significance in the document or
it simply doesn’t exist in the document. Otherwise, wij > 0
whenever ki ∈ dj

      T1    T2    …    TN
D1    w11   w12   …    w1N
D2    w21   w22   …    w2N
:     :     :          :
DM    wM1   wM2   …    wMN
19
Computing weights
• How to compute weight for term i in document j (wij ) and
weight for term i in query q (wiq)?
• A good weight must take into account two effects:
– Quantification of intra-document contents (similarity)
• tf factor, the term frequency within a document
– Quantification of inter-documents separation
(dissimilarity)
• idf factor, the inverse document frequency across
documents
– As a result, most IR systems use the tf*idf
weighting technique:
wij = tf(i,j) * idf(i)
20
Computing weights
• Let:
• N be the total number of documents in the collection
• ni be the number of documents which contain ki
• freq(i,j) be the raw frequency of ki within dj (the number
of times ki occurs in dj)
• A normalized tf factor is given by
• tf(i,j) = freq(i,j) / maxl freq(l,j)
• where the maximum is computed over all terms l which
occur within the document dj
• The idf factor is computed as
• idf(i) = log (N/ni)
• the log is used to make the values of tf and idf
comparable. It can also be interpreted as the amount of
information associated with the term ki.
21
Computing weights
• The best term-weighting schemes use tf*idf weights
which are given by
wij = tf(i,j) * log(N/ni)
22
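A small Python sketch of this weighting scheme (illustrative only; the toy collection is an assumption, and a base-10 logarithm is assumed since the slides do not fix the base):

```python
import math
from collections import Counter

def tfidf_weights(doc_tokens, all_docs_tokens):
    """wij = (freq(i,j) / max_l freq(l,j)) * log10(N / ni) for one document."""
    N = len(all_docs_tokens)
    freq = Counter(doc_tokens)
    max_freq = max(freq.values())
    weights = {}
    for term, f in freq.items():
        ni = sum(1 for d in all_docs_tokens if term in d)   # document frequency
        weights[term] = (f / max_freq) * math.log10(N / ni)
    return weights

# Hypothetical toy collection of tokenized documents
collection = [["gold", "silver", "truck"],
              ["gold", "truck"],
              ["silver", "silver", "truck"]]
print(tfidf_weights(collection[2], collection))
```

Note that a term occurring in every document gets idf = log(N/N) = 0, so it contributes nothing to the document representation.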
Example: Computing weights
• A collection includes 10,000 documents
• The term A appears 20 times in a particular
document
• The maximum appearance of any term in this
document is 50
• The term A appears in 2,000 of the collection
documents.
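Plugging these numbers into the tf*idf formula above (assuming a base-10 logarithm, which the slides do not state explicitly): tf = 20/50 = 0.4, idf = log10(10,000/2,000) = log10(5) ≈ 0.699, so wA ≈ 0.4 × 0.699 ≈ 0.28. A one-line check in Python:

```python
import math

tf = 20 / 50                       # freq(A, d) / max term frequency in d
idf = math.log10(10_000 / 2_000)   # log(N / n_A)
print(tf * idf)                    # ≈ 0.2796
```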
• Sim(q,dj) = cos(θ), the cosine of the angle between the document
vector dj and the query vector q:

sim(dj,q) = (dj • q) / (|dj| × |q|)
          = Σi=1..n (wi,j × qi,k) / ( sqrt(Σi=1..n wi,j²) × sqrt(Σi=1..n qi,k²) )
• Disadvantages:
• assumes independence of index terms (??)
30
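A minimal NumPy sketch of this cosine measure (illustrative; the two example weight vectors are assumptions, not data from the slides):

```python
import numpy as np

def cosine_sim(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    d, q = np.asarray(d, dtype=float), np.asarray(q, dtype=float)
    return float(d @ q / (np.linalg.norm(d) * np.linalg.norm(q)))

# Hypothetical tf*idf weight vectors for one document and one query
print(cosine_sim([0.4, 0.0, 0.7], [0.2, 0.5, 0.1]))
```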
More Example
Suppose the database collection consists of the following documents.
c1: Human machine interface for Lab ABC computer
applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measure
M1: The generation of random, binary, unordered trees
M2: The intersection graph of paths in trees
M3: Graph minors: Widths of trees and well-quasi-ordering
M4: Graph minors: A survey
Query:
Find documents relevant to "human computer interaction”
31
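One way to rank this collection against the query with off-the-shelf tooling is scikit-learn's TfidfVectorizer, sketched below. Note that scikit-learn's tf*idf variant uses smoothed idf and L2 normalization, so its weights differ slightly from the formula on the earlier slides; the ranking idea is the same.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Human machine interface for Lab ABC computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user-perceived response time to error measure",
    "The generation of random, binary, unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors: Widths of trees and well-quasi-ordering",
    "Graph minors: A survey",
]
query = "human computer interaction"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)     # documents -> tf*idf vectors
query_vector = vectorizer.transform([query])     # query mapped into the same space

scores = cosine_similarity(query_vector, doc_vectors)[0]
labels = ["c1", "c2", "c3", "c4", "c5", "M1", "M2", "M3", "M4"]
for label, score in sorted(zip(labels, scores), key=lambda p: -p[1]):
    print(label, round(score, 3))                # documents ranked by cosine score
```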
Exercises
Given the following documents, rank the documents according to
their relevance to the query using the Cosine similarity, Euclidean
distance and Inner product measures.
32
End of Chapter 4
33
Test
• Given the following Term-Document matrix and Query, perform:
1. Euclidean Distance between the Query and Each Document
2. Inner Product between the Query and Each Document
economy develop country
D1 1 3 2
D2 3 2 1
D3 2 1 0
Q 1 1 0
34
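The two measures can be computed with the short NumPy sketch below (an illustration of the formulas, using the term-document matrix and query from this slide; working the numbers out by hand is the intent of the test):

```python
import numpy as np

# Rows: D1, D2, D3 over the terms (economy, develop, country); q is the query
D = np.array([[1, 3, 2],
              [3, 2, 1],
              [2, 1, 0]], dtype=float)
q = np.array([1, 1, 0], dtype=float)

euclidean = np.linalg.norm(D - q, axis=1)   # smaller distance = more similar
inner = D @ q                               # larger inner product = more similar

for i in range(3):
    print(f"D{i+1}: distance={euclidean[i]:.3f}, inner product={inner[i]:.1f}")
```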