4 IRModels
02/29/24 1
What do we mean by a model?
• Assumptions
• You will say
– It is smaller than the real thing
– It looks the same as the real thing
– It looks different, made of different stuff
– A representation
– It does some of the same things
– It is for understanding something
• Each answer is correct
• A model is an idealization or abstraction of an actual
process (here, retrieval)
• It represents something that exists or is planned in
the real world and that is in some way too complex or
large for us to understand as it stands
• A model is, in some way, simplified or reduced in size,
scope, or scale
• It helps us understand the system better
• It is the best, scientific way to study reality
• Thus, a model is a simplified representation of a
complex reality, usually built for the purpose of
understanding that reality, and having all the features
of that reality necessary for the current task or
problem
What is a retrieval model?
• Models that describe the computational process
– e.g. how documents are ranked (how documents or
indexes are stored is an implementation matter)
• Models that attempt to describe the human
process
– e.g. the information need, interaction
– Few do so meaningfully
• Retrieval variables: queries, documents, terms,
relevance judgments, users, information needs, …
• Retrieval models have an explicit or implicit
definition of relevance
What does an IR model include?
• The retrieval mechanism
– used to match query with a set of documents
• The ways in which the user’s information need can
be formulated as a query
– that can be searched by that mechanism
• The human-computer interaction
– that needs to take place to ensure the most appropriate
processing of the query
• The social and cognitive environments
– in which that interaction takes place
IR Models - Basic Concepts
• Word evidence:
– IR systems usually adopt index terms to index and retrieve
documents
– Each document is represented by a set of representative
keywords or index terms (called a bag of words)
• An index term is a word that helps recall the
document's main themes
• Not all terms are equally useful for representing the
document contents:
– less frequent terms allow identifying a narrower set of
documents
• No ordering information is attached to the bag of
words identified from the document collection
IR Models - Basic Concepts
• One central problem regarding IR systems is the
issue of predicting the degree of relevance of
documents for a given query
– Such a decision usually depends on a ranking
algorithm, which attempts to establish a simple
ordering of the documents retrieved
– Documents appearing at the top of this ordering
are considered more likely to be relevant
• Thus, ranking algorithms are at the core of IR systems
– The IR model determines the predictions of what
is relevant and what is not, based on the notion of
relevance implemented by the system
General Procedures Followed
To find relevant documents for a given query:
• First, map documents and queries into the term-document
vector space.
– Note that queries are treated as short documents
• Second, represent queries and documents as
weighted vectors, wij
– Weights may be binary or non-binary
• Third, rank documents by the closeness of their vectors to
the query.
– Closeness is determined by a similarity score calculation
Mapping Documents & Queries
• Represent both documents and queries as N-dimensional
vectors in a term-document matrix, which shows the
occurrence of terms in the document collection or query

$$d_j = (t_{1,j}, t_{2,j}, \ldots, t_{N,j}); \quad q_k = (t_{1,k}, t_{2,k}, \ldots, t_{N,k})$$

• An entry in the matrix corresponds to the “weight” of a
term in the document; zero means the term doesn’t exist in
the document.

      T1    T2    …    TN
D1    w11   w12   …    w1N
D2    w21   w22   …    w2N
:     :     :          :
DM    wM1   wM2   …    wMN

– The document collection is mapped to a term-by-document matrix
– View each document as a vector in multidimensional space
• Nearby vectors are related
– Normalize for vector length to avoid the effect of document length
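The mapping above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the two toy documents, the whitespace tokenizer, the use of raw term counts as weights, and the helper names `build_matrix` and `normalize` are all hypothetical choices made for the example.

```python
import math

def build_matrix(docs):
    """Map each document to a row of raw term counts over a shared vocabulary."""
    vocab = sorted({term for doc in docs for term in doc.lower().split()})
    column = {term: i for i, term in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0.0] * len(vocab)
        for term in doc.lower().split():
            row[column[term]] += 1.0  # weight = raw count (an assumption)
        matrix.append(row)
    return vocab, matrix

def normalize(row):
    """Divide a row by its Euclidean length so document length has no effect."""
    length = math.sqrt(sum(w * w for w in row))
    return [w / length for w in row] if length > 0 else row

# Hypothetical toy collection, used only for illustration.
toy_docs = ["information retrieval models", "retrieval retrieval ranking"]
vocab, matrix = build_matrix(toy_docs)
unit_rows = [normalize(row) for row in matrix]

print(vocab)
for row in unit_rows:
    print([round(w, 3) for w in row])
```

After normalization every row has length 1, so a long document that repeats a term many times is not favored just because of its size.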
The Boolean Model
• The Boolean model is a simple model based on set theory
– It imposes a binary criterion for deciding relevance
• Terms are either present or absent. Thus,
wij ∈ {0,1}
• sim(q,dj) = 1 if the document satisfies the Boolean query,
0 otherwise
– Note that no weights between 0 and 1 are assigned;
the only values are 0 or 1

      T1    T2    …    TN
D1    w11   w12   …    w1N
D2    w21   w22   …    w2N
:     :     :          :
DM    wM1   wM2   …    wMN
The Boolean Model: Example
Given the following three documents, construct the term-document
matrix and find the relevant documents retrieved by the
Boolean model for the query “gold silver truck”
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
The table below shows the document-term (ti) matrix

        arrive  damage  deliver  fire  gold  silver  ship  truck
D1      0       1       0        1     1     0       1     0
D2      1       0       1        0     0     1       0     1
D3      1       0       0        0     1     0       1     1
query   0       0       0        0     1     1       0     1
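The example can be checked with a short Python sketch that hard-codes the binary matrix from the table. The slide does not say which Boolean operators join the three query terms, so the sketch evaluates both a conjunctive (AND) and a disjunctive (OR) reading; that choice, and the helper names `has`, `boolean_and` and `boolean_or`, are assumptions made for illustration.

```python
# Binary document-term matrix copied from the table above.
TERMS = ["arrive", "damage", "deliver", "fire", "gold", "silver", "ship", "truck"]
MATRIX = {
    "D1": [0, 1, 0, 1, 1, 0, 1, 0],
    "D2": [1, 0, 1, 0, 0, 1, 0, 1],
    "D3": [1, 0, 0, 0, 1, 0, 1, 1],
}

def has(doc, term):
    """True if the term is present (weight 1) in the document."""
    return MATRIX[doc][TERMS.index(term)] == 1

def boolean_and(terms):
    """Documents that contain every query term."""
    return [doc for doc in MATRIX if all(has(doc, t) for t in terms)]

def boolean_or(terms):
    """Documents that contain at least one query term."""
    return [doc for doc in MATRIX if any(has(doc, t) for t in terms)]

query = ["gold", "silver", "truck"]
and_hits = boolean_and(query)
or_hits = boolean_or(query)
gold_and_truck = boolean_and(["gold", "truck"])

print("gold AND silver AND truck:", and_hits)
print("gold OR silver OR truck:  ", or_hits)
print("gold AND truck:           ", gold_and_truck)
```

Under the AND reading no document qualifies, because none contains all three terms; under the OR reading all three documents are returned, with no ordering among them. This all-or-nothing behavior is what the weighted models that follow try to improve on.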
Vector-Space Model
• This is the most commonly used strategy for measuring
the relevance of documents for a given query. This is
because:
– The use of binary weights is too limiting
– Non-binary weights allow partial matches to be
taken into account
• These term weights are used to compute a degree of
similarity between a query and each document
– A ranked set of documents provides better
matching
• The idea behind VSM is that
the meaning of a document is conveyed by the words
used in that document
Vector-Space Model
To find relevant documents for a given query:
• First, map documents and queries into the term-document vector
space.
– Note that queries are treated as short documents
• Second, in the vector space, queries and documents are
represented as weighted vectors, wij
– There are different weighting techniques; the most widely used one
is computing the TF*IDF weight for each term
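$$sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{n} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}} \; \sqrt{\sum_{i=1}^{n} w_{i,q}^{2}}}$$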
            Counts (TF)         DF   IDF      Weights (TF*IDF)
Terms       Q    D1   D2   D3                 Q       D1      D2      D3
a           0    1    1    1    3    0        0       0       0       0
arrived     0    0    1    1    2    0.176    0       0       0.176   0.176
damaged     0    1    0    0    1    0.477    0       0.477   0       0
delivery    0    0    1    0    1    0.477    0       0       0.477   0
fire        0    1    0    0    1    0.477    0       0.477   0       0
gold        1    1    0    1    2    0.176    0.176   0.176   0       0.176
in          0    1    1    1    3    0        0       0       0       0
of          0    1    1    1    3    0        0       0       0       0
silver      1    0    2    0    1    0.477    0.477   0       0.954   0
shipment    0    1    0    1    2    0.176    0       0.176   0       0.176
truck       1    0    1    1    2    0.176    0.176   0       0.176   0.176
Vector-Space Model
Terms Q D1 D2 D3
a 0 0 0 0
arrived 0 0 0.176 0.176
damaged 0 0.477 0 0
delivery 0 0 0.477 0
fire 0 0.477 0 0
gold 0.176 0.176 0 0.176
in 0 0 0 0
of 0 0 0 0
silver 0.477 0 0.954 0
shipment 0 0.176 0 0.176
truck 0.176 0 0.176 0.176
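The weights in this table can be reproduced with a short Python sketch. It assumes, based on the values in the table rather than on an explicit statement in the slides, that IDF is computed as log10(N/df) over the three documents and that each weight is the raw term count times IDF. The whitespace tokenizer and all variable names are illustrative choices.

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "gold silver truck"

def tokenize(text):
    """Lowercase and split on whitespace (no stemming or stop-word removal)."""
    return text.lower().split()

# Raw term counts (TF) for every document and for the query.
tf = {name: Counter(tokenize(text)) for name, text in docs.items()}
tf["Q"] = Counter(tokenize(query))

# Document frequency (DF): number of documents containing each term.
n_docs = len(docs)
df = Counter()
for name in docs:
    df.update(set(tf[name]))

# Assumed IDF definition: log10(N / df).
idf = {term: math.log10(n_docs / df[term]) for term in df}

# TF*IDF weight of every term in every vector (query included).
weights = {
    name: {term: counts[term] * idf[term] for term in idf}
    for name, counts in tf.items()
}

for term in sorted(idf):
    row = [round(weights[name][term], 3) for name in ("Q", "D1", "D2", "D3")]
    print(f"{term:10s}", row)
```

Terms that occur in all three documents (a, in, of) get IDF 0 and therefore drop out of every vector, while silver, which occurs twice in D2 and in no other document, receives the largest weight.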
Vector-Space Model: Example
• Compute similarity using cosine Sim(q,d1)
• First, for each document and query, compute all vector
lengths (zero terms ignored)

$$|d_1| = \sqrt{0.477^2 + 0.477^2 + 0.176^2 + 0.176^2} = \sqrt{0.517} = 0.719$$

$$|d_2| = \sqrt{0.176^2 + 0.477^2 + 0.954^2 + 0.176^2} = \sqrt{1.1996} = 1.095$$

$$|d_3| = \sqrt{0.176^2 + 0.176^2 + 0.176^2 + 0.176^2} = \sqrt{0.124} = 0.352$$
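The remaining steps (query length, dot products, cosine scores, ranking) can be sketched in Python as follows. The vectors are the rounded TF*IDF weights from the table, stored sparsely so that zero terms are ignored; the helper names `length` and `cosine` are illustrative, not taken from the slides.

```python
import math

# Non-zero TF*IDF weights, rounded as in the weight table.
Q = {"gold": 0.176, "silver": 0.477, "truck": 0.176}
DOCS = {
    "D1": {"damaged": 0.477, "fire": 0.477, "gold": 0.176, "shipment": 0.176},
    "D2": {"arrived": 0.176, "delivery": 0.477, "silver": 0.954, "truck": 0.176},
    "D3": {"arrived": 0.176, "gold": 0.176, "shipment": 0.176, "truck": 0.176},
}

def length(vec):
    """Euclidean length of a sparse vector."""
    return math.sqrt(sum(w * w for w in vec.values()))

def cosine(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    denom = length(a) * length(b)
    return dot / denom if denom else 0.0

scores = {name: cosine(Q, vec) for name, vec in DOCS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(length(DOCS[name]), 3), round(scores[name], 3))
```

With these rounded weights the query length is about 0.538 and the cosine scores come out at roughly 0.82 for D2, 0.33 for D3 and 0.08 for D1, so the documents are ranked D2, D3, D1.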
Vector-Space Model
• Advantages:
• Term-weighting improves quality of the answer set
since it helps to display relevant documents in ranked
order
• Partial matching allows retrieval of documents that
approximate the query conditions
• Cosine ranking formula sorts documents according to
degree of similarity to the query
• Disadvantages:
• Assumes independence of index terms. It doesn’t
relate one term with another term
• Computationally expensive since it measures the
similarity between each document and the query
Probabilistic Model
• IR is an uncertain process
– Mapping an information need to a query is not perfect
– Mapping documents to index terms gives only a logical representation
– Query terms and index terms often do not match
• This situation leads to several statistical approaches:
probability theory, fuzzy logic, theory of evidence,
language modeling, etc.
• The probabilistic retrieval model is a rigorous formal model that
attempts to predict the probability that a given document
will be relevant to a given query, i.e. Prob(R|(q,di))
– It uses probability to estimate the “odds” of relevance of a
document to a query
– It relies on accurate estimates of probabilities
Term Occurrence in Relevant Documents
N = the total number of documents in the collection
n = the number of documents that contain term ti
R = the total number of relevant documents retrieved
r = the number of relevant documents retrieved that contain term ti

$$w_i = \log \frac{(r + 0.5)(N - n - R + r + 0.5)}{(n - r + 0.5)(R - r + 0.5)}$$
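As a rough illustration, the term weight above can be computed with the Python sketch below. The slide does not state the base of the logarithm, so the natural log is assumed here; the function name `term_weight` and the example counts are hypothetical.

```python
import math

def term_weight(N, n, R, r):
    """Relevance weight of a term with 0.5 smoothing.

    N: documents in the collection
    n: documents containing the term
    R: known relevant documents
    r: known relevant documents containing the term
    The natural logarithm is an assumption; the slide does not give the base.
    """
    numerator = (r + 0.5) * (N - n - R + r + 0.5)
    denominator = (n - r + 0.5) * (R - r + 0.5)
    return math.log(numerator / denominator)

# Hypothetical counts: 100 documents, 20 contain the term,
# 10 are judged relevant, 5 of those contain the term.
w_mixed = term_weight(N=100, n=20, R=10, r=5)

# Same term, but now 9 of the 10 relevant documents contain it.
w_strong = term_weight(N=100, n=20, R=10, r=9)

# No relevance information yet (R = r = 0).
w_no_feedback = term_weight(N=100, n=20, R=0, r=0)

print(round(w_mixed, 3), round(w_strong, 3), round(w_no_feedback, 3))
```

The 0.5 terms keep the ratio finite when a count is zero. When no relevance judgments are available (R = r = 0) the expression reduces to log((N - n + 0.5) / (n + 0.5)), an IDF-like weight, and the weight rises as more of the relevant documents are found to contain the term.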
Probabilistic model
• Probabilistic model uses probability theory to model the
uncertainty in the retrieval process
– Relevance feedback can improve the ranking by giving better
term probability estimates
• Advantages of probabilistic model over vector‐space
– Strong theoretical basis
– Since the base is probability theory, it is very well understood
– Easy to extend
• Disadvantages
– Models are often complicated
– No term frequency weighting