4 IRModels

The document discusses information retrieval (IR) models. It defines a model as a simplified representation of a complex reality that captures the essential aspects needed to understand that reality. An IR model includes the retrieval mechanism, how queries are formulated, and the human-computer interaction process. Some important classic IR models discussed are the Boolean model and vector space model. The Boolean model uses binary weighting and retrieves all documents that satisfy a Boolean query. The vector space model assigns non-binary term weights and ranks documents based on similarity to the query vector.


Chapter Four

IR models

02/29/24 1
What do we mean by model?

• Assumptions
• You will say
– It is smaller than the real thing
– It looks the same as the real thing
– It looks different, made of different stuff
– A representation
– It does some of the same things
– It is for understanding something
• Each answer is correct

• A model is an idealization or abstraction of the actual
  process (here, retrieval)
• It represents something that exists or is planned in
  the real world and that is in some way too complex or
  large for us to understand as it stands
• A model is in some way simplified, or reduced in size,
  scope or scale
• It helps us to understand the system better
• Modeling is the scientific way to study reality
• Thus, a model is a simplified representation of a
complex reality, usually for the purpose of
understanding that reality, and having all the features
of that reality necessary for the current task or
problem

What is a retrieval model?
• Models that describe the computational process
  – e.g. how documents are ranked (i.e. how documents or
    indexes are stored and processed)
• Models that attempt to describe the human process
  – e.g. the information need, interaction
  – Few do so meaningfully
• Retrieval variables: queries, documents, terms,
relevance judgments, users, information needs, …
• Retrieval models have an explicit or implicit
definition of relevance

What an IR model includes?
• The retrieval mechanism
– used to match query with a set of documents
• The ways in which the user’s information need can
be formulated as a query
– that can be searched by that mechanism
• The human computer interaction
– that needs to take place to ensure the most appropriate
processing of the query
• The social and cognitive environments
– in which that interaction takes place
IR Models

• A number of IR models have been proposed over the
  years
• Among them are 15 important IR models
• These are grouped into 3 major categories
– Classic model
– Structured model
– Browsing

IR Models - Basic Concepts
• Word evidence:
 IR systems usually adopt index terms to index and retrieve
documents
 Each document is represented by a set of representative
keywords or index terms (called Bag of Words)
• An index term is a word useful for remembering the
document's main themes
• Not all terms are equally useful for representing the
document contents:
 less frequent terms allow identifying a narrower set of
documents
• But no ordering information is attached to the Bag of
Words identified from the document collection.
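The Bag of Words idea above can be sketched in a few lines of Python (a minimal illustration; tokenization here is plain whitespace splitting, which is an assumption):

```python
from collections import Counter

# Bag of Words: a document is reduced to the multiset of its index
# terms and their counts; word-order information is discarded.
doc = "shipment of gold arrived in a gold truck"
bag = Counter(doc.split())

print(bag["gold"])  # 2

# Two documents with the same words in a different order get the
# same bag - this is exactly the "no ordering information" point.
assert bag == Counter("a gold truck arrived in shipment of gold".split())
```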
IR Models - Basic Concepts
• One central problem regarding IR systems is the
issue of predicting the degree of relevance of
documents for a given query
 Such a decision is usually dependent on a ranking
algorithm which attempts to establish a simple
ordering of the documents retrieved
 Documents appearing at the top of this ordering
are considered to be more likely to be relevant
• Thus ranking algorithms are at the core of IR systems
 The IR models determine the predictions of what
is relevant and what is not, based on the notion of
relevance implemented by the system
General Procedures Followed
To find relevant documents for a given query:
• First, map documents and queries into a term-document
vector space.
 Note that queries are treated as short documents
• Second, queries and documents are represented as
weighted vectors, wij
 There are binary and non-binary weighting techniques
• Third, rank documents by the closeness of their vectors to
the query.
 Closeness is determined by a similarity score calculation
Mapping Documents & Queries
• Represent both documents and queries as N-dimensional
vectors in a term-document matrix, which shows the
occurrence of terms in the document collection or query

 dj = (t1,j, t2,j, ..., tN,j);  qk = (t1,k, t2,k, ..., tN,k)

• An entry in the matrix corresponds to the “weight” of a
term in the document; zero means the term doesn’t exist in
the document.

        T1   T2   ...  TN
   D1   w11  w12  ...  w1N
   D2   w21  w22  ...  w2N
   :    :    :         :
   DM   wM1  wM2  ...  wMN

– Document collection is mapped to a term-by-document matrix
– View each document as a vector in a multidimensional space
 • Nearby vectors are related
– Normalize for vector length to avoid the effect of document length
The Boolean Model
• The Boolean model is a simple model based on set theory
 The Boolean model imposes a binary criterion
 for deciding relevance
• Terms are either present or absent. Thus,
 wij ∈ {0,1}
• sim(q,dj) = 1 if the document satisfies the boolean query,
             0 otherwise

        T1   T2   ...  TN
   D1   w11  w12  ...  w1N
   D2   w21  w22  ...  w2N
   :    :    :         :
   DM   wM1  wM2  ...  wMN

– Note that no weights are assigned in between 0 and 1;
  only the values 0 and 1 are used
The Boolean Model: Example
Given the following three documents, construct the term-document
matrix and find the relevant documents retrieved by the
Boolean model for the query “gold silver truck”
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
The table below shows the document-term (ti) matrix
Arrive damage deliver fire gold silver ship truck
D1 0 1 0 1 1 0 1 0
D2 1 0 1 0 0 1 0 1
D3 1 0 0 0 1 0 1 1
query 0 0 0 0 1 1 0 1

Also find the documents relevant for the queries:
(a) gold delivery; (b) ship gold; (c) silver truck
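This retrieval can be sketched in Python with an inverted index of sets (a minimal illustration; the word-form normalization table is hand-written to match the column headings above, and each query is read as a conjunction of its terms, which is an assumption since the slide gives no operators):

```python
# Boolean retrieval over the three example documents.
docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}

# Map inflected forms to the term headings used in the matrix above
# (a hand-written stand-in for a stemmer, for this example only).
stem = {"shipment": "ship", "damaged": "damage",
        "delivery": "deliver", "arrived": "arrive"}

# Build an inverted index: term -> set of documents containing it.
index = {}
for name, text in docs.items():
    for word in text.lower().split():
        term = stem.get(word, word)
        index.setdefault(term, set()).add(name)

def AND(*terms):
    """Documents containing every term (conjunctive query)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

print(AND("ship", "gold"))     # (b) ship gold
print(AND("silver", "truck"))  # (c) silver truck
print(AND("gold", "deliver"))  # (a) gold delivery
```

Under this conjunctive reading, query (a) retrieves nothing at all — an instance of the Boolean model's all-or-nothing matching returning too few documents.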
The Boolean Model: Further Example
• Given the following, determine the documents retrieved by a
Boolean model based IR system
• Index Terms: K1, ..., K8
• Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5})
= {D1, D2, D6}
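The same evaluation can be written directly with Python set operations (a sketch; the garbled query symbols are read as K1 ∧ (K2 ∨ ¬K3), the reading that reproduces the intermediate sets in the answer):

```python
# Boolean query evaluation as set algebra over the six documents.
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}

def containing(term):
    """Set of documents that contain the given index term."""
    return {d for d, terms in docs.items() if term in terms}

all_docs = set(docs)
# K1 AND (K2 OR NOT K3): AND -> intersection, OR -> union,
# NOT -> complement with respect to the whole collection.
result = containing("K1") & (containing("K2") | (all_docs - containing("K3")))
print(sorted(result))  # ['D1', 'D2', 'D6']
```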
Exercise
Given the following four documents with the following
contents:
– D1 = “computer information retrieval”
– D2 = “computer retrieval”
– D3 = “information”
– D4 = “computer information”

• What are the relevant documents retrieved for the
queries:
– Q1 = “information ∧ retrieval”
– Q2 = “information ∧ ¬computer”
Drawbacks of the Boolean Model
• Retrieval based on binary decision criteria with no
notion of partial matching
• No ranking of the documents is provided (absence of
a grading scale)
• Information need has to be translated into a Boolean
expression which most users find awkward
• The Boolean queries formulated by the users are
most often too simplistic
 As a consequence, the Boolean model frequently
returns either too few or too many documents in
response to a user query

Vector-Space Model
• This is the most commonly used strategy for measuring
relevance of documents for a given query. This is
because,
 Use of binary weights is too limiting
 Non-binary weights provide consideration for partial
matches
• These term weights are used to compute a degree of
similarity between a query and each document
 Ranked set of documents provides for better
matching
• The idea behind VSM is that
 the meaning of a document is conveyed by the words
 used in that document
Vector-Space Model
To find relevant documents for a given query:
• First, map documents and queries into a term-document vector
space.
 Note that queries are treated as short documents
• Second, in the vector space, queries and documents are
represented as weighted vectors, wij
 There are different weighting techniques; the most widely used
 one is computing the TF*IDF weight for each term
• Third, a similarity measurement is used to rank documents by
the closeness of their vectors to the query.
 To measure closeness of documents to the query, a cosine
 similarity score is used by most search engines
Term-document matrix.
• A collection of n documents and a query can be represented
in the vector space model by a term-document matrix.
– An entry in the matrix corresponds to the “weight” of a term in
the document;
– zero means the term has no significance in the document or
it simply doesn’t exist in the document. Otherwise, wij > 0
whenever ki ∈ dj

        T1   T2   ...  TN
   D1   w11  w12  ...  w1N
   D2   w21  w22  ...  w2N
   :    :    :         :
   DM   wM1  wM2  ...  wMN

• How are the weights for term i in document j and in
query q (wij and wiq) computed?
Example: Computing weights
• A collection includes 10,000 documents
 The term A appears 20 times in a particular document j
 The maximum frequency of any term in document j is 50
 The term A appears in 2,000 of the collection documents
• Compute the TF*IDF weight of term A in document j:
 tf(A,j) = freq(A,j) / max(freq(k,j)) = 20/50 = 0.4
 idf(A) = log2(N/DFA) = log2(10,000/2,000) = 2.32
 wAj = tf(A,j) * idf(A) = 0.4 * 2.32 = 0.928
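The computation above can be checked in a few lines of Python (a sketch; as on the slide, tf is the raw frequency normalized by the maximum term frequency in the document, and idf uses a base-2 logarithm):

```python
import math

# TF*IDF weight for the worked example: term A appears 20 times in
# document j, the most frequent term in j appears 50 times, and A
# occurs in 2,000 of the 10,000 collection documents.
def tfidf(freq, max_freq, N, df):
    tf = freq / max_freq       # maximum-frequency normalized tf
    idf = math.log2(N / df)    # inverse document frequency
    return tf * idf

w = tfidf(freq=20, max_freq=50, N=10_000, df=2_000)
print(round(w, 3))  # 0.929 (the slide truncates this to 0.928)
```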
Similarity Measure
• A similarity measure is a function that computes the
degree of similarity between document dj and the user’s
query q:

 sim(dj, q) = (dj · q) / (|dj| |q|)
            = Σi=1..n (wi,j wi,q) / ( √(Σi=1..n wi,j²) √(Σi=1..n wi,q²) )
• Using a similarity score between the query and each
document:
– It is possible to rank the retrieved documents in the
order of presumed relevance.
– It is possible to enforce a certain threshold so that we
can control the size of the retrieved set of documents.
Vector Space with Term Weights and
Cosine Matching

 Di = (d1i, w1di; d2i, w2di; ...; dti, wtdi)
 Q  = (q1i, w1qi; q2i, w2qi; ...; qti, wtqi)

 sim(Q, Di) = Σj=1..t (wjq wjdi) / √( Σj=1..t (wjq)² × Σj=1..t (wjdi)² )

Example in a two-term space (Term A, Term B):
 Q  = (0.4, 0.8)
 D1 = (0.8, 0.3)
 D2 = (0.2, 0.7)

 sim(Q, D2) = (0.4×0.2 + 0.8×0.7) / √( [(0.4)² + (0.8)²] × [(0.2)² + (0.7)²] )
            = 0.64 / 0.651 ≈ 0.98

 sim(Q, D1) = 0.56 / 0.764 ≈ 0.73
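The two-term example can be verified with a minimal sketch of the cosine formula:

```python
import math

# Cosine similarity between weighted term vectors:
# dot product divided by the product of the vector lengths.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine(Q, D2), 2))  # 0.98
print(round(cosine(Q, D1), 2))  # 0.73
```

Computed exactly, sim(Q, D1) is 0.733; where the slide shows 0.74, that is a rounding slip.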
Vector-Space Model: Example
• Suppose a user queries for: Q = “gold silver truck”. The
collection consists of three documents with the
following content:
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
• Show the retrieval results in ranked order.
1. Assume that full-text terms are used during indexing,
without removing common terms or stop words, and with
no stemming.
2. Assume that content-bearing terms are selected during
indexing.
3. Also compare your results with and without normalizing
term frequency.
Vector-Space Model: Example
           Counts (TF)                    Wi = TF*IDF
Terms      Q  D1 D2 D3   DF  IDF      Q      D1     D2     D3
a          0  1  1  1    3   0        0      0      0      0
arrived    0  0  1  1    2   0.176    0      0      0.176  0.176
damaged    0  1  0  0    1   0.477    0      0.477  0      0
delivery   0  0  1  0    1   0.477    0      0      0.477  0
fire       0  1  0  0    1   0.477    0      0.477  0      0
gold       1  1  0  1    2   0.176    0.176  0.176  0      0.176
in         0  1  1  1    3   0        0      0      0      0
of         0  1  1  1    3   0        0      0      0      0
silver     1  0  2  0    1   0.477    0.477  0      0.954  0
shipment   0  1  0  1    2   0.176    0      0.176  0      0.176
truck      1  0  1  1    2   0.176    0.176  0      0.176  0.176
(IDF = log10(N/DF), with N = 3 documents)
Vector-Space Model
Terms Q D1 D2 D3
a 0 0 0 0
arrived 0 0 0.176 0.176
damaged 0 0.477 0 0
delivery 0 0 0.477 0
fire 0 0.477 0 0
gold 0.176 0.176 0 0.176
in 0 0 0 0
of 0 0 0 0
silver 0.477 0 0.954 0
shipment 0 0.176 0 0.176
truck 0.176 0 0.176 0.176
Vector-Space Model: Example
• Compute similarity using the cosine measure
• First, for each document and the query, compute all vector
lengths (zero terms ignored):
 |d1| = √(0.477² + 0.477² + 0.176² + 0.176²) = √0.517 = 0.719
 |d2| = √(0.176² + 0.477² + 0.954² + 0.176²) = √1.1996 = 1.095
 |d3| = √(0.176² + 0.176² + 0.176² + 0.176²) = √0.124 = 0.352
 |q|  = √(0.176² + 0.477² + 0.176²) = √0.2896 = 0.538
• Next, compute the dot products (zero products ignored):
 Q·d1 = 0.176*0.176 = 0.0310
 Q·d2 = 0.954*0.477 + 0.176*0.176 = 0.4862
 Q·d3 = 0.176*0.176 + 0.176*0.176 = 0.0620
Vector-Space Model: Example
Now, compute the similarity scores:
Sim(q,d1) = 0.0310 / (0.538*0.719) = 0.0801
Sim(q,d2) = 0.4862 / (0.538*1.095) = 0.8246
Sim(q,d3) = 0.0620 / (0.538*0.352) = 0.3271
Finally, we sort and rank the documents in descending
order of similarity score:
Rank 1: Doc 2 = 0.8246
Rank 2: Doc 3 = 0.3271
Rank 3: Doc 1 = 0.0801
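The whole worked example can be reproduced end to end (a sketch under assumption 1 of the example: whitespace tokenization, no stop-word removal or stemming, raw-count TF, and IDF = log10(N/DF)):

```python
import math

# TF*IDF weighting and cosine ranking for Q = "gold silver truck".
docs = {
    "d1": "shipment of gold damaged in a fire",
    "d2": "delivery of silver arrived in a silver truck",
    "d3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

vocab = sorted({w for text in docs.values() for w in text.split()})
N = len(docs)
df = {t: sum(t in text.split() for text in docs.values()) for t in vocab}
idf = {t: math.log10(N / df[t]) for t in vocab}

def weights(text):
    """Raw-count TF times IDF, one entry per vocabulary term."""
    words = text.split()
    return [words.count(t) * idf[t] for t in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

q = weights(query)
scores = {name: cosine(weights(text), q) for name, text in docs.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(s, 4))
```

The ranking comes out d2 > d3 > d1, and the scores agree with 0.8246, 0.3271 and 0.0801 up to the rounding of the table's three-decimal weights.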

Vector-Space Model
• Advantages:
• Term-weighting improves quality of the answer set
since it helps to display relevant documents in ranked
order
• Partial matching allows retrieval of documents that
approximate the query conditions
• Cosine ranking formula sorts documents according to
degree of similarity to the query
• Disadvantages:
• Assumes independence of index terms. It doesn’t
relate one term with another term
• Computationally expensive since it measures the
similarity between each document and the query
Probabilistic Model
• IR is an uncertain process
–Mapping Information need to Query is not perfect
–Mapping Documents to index terms is a logical representation
–Query terms and index terms mostly mismatch
• This situation leads to several statistical approaches:
probability theory, fuzzy logic, theory of evidence,
language modeling, etc.
• The probabilistic retrieval model is a formal model that
attempts to predict the probability that a given document
will be relevant to a given query; i.e. Prob(R|(q,di))
–Use probability to estimate the “odds” of relevance of a query to
a document.
–It relies on accurate estimates of probabilities

Terms Existence in Relevant Documents
N = the total number of documents in the collection
n = the total number of documents that contain term ti
R = the total number of relevant documents retrieved
r = the total number of relevant documents retrieved that contain term ti

 wi = log [ (r + 0.5)(N - n - R + r + 0.5) / ((n - r + 0.5)(R - r + 0.5)) ]

Probabilistic model
• Probabilistic model uses probability theory to model the
uncertainty in the retrieval process
– Relevance feedback can improve the ranking by giving better
term probability estimates
• Advantages of probabilistic model over vector‐space
– Strong theoretical basis
– Since the base is probability theory, it is very well understood
– Easy to extend
• Disadvantages
– Models are often complicated
– No term frequency weighting

?

Thank you!
