4 IRModels

The document discusses information retrieval (IR) models. It defines a model as a simplified representation of a complex reality that captures the essential aspects needed to understand that reality. An IR model includes the retrieval mechanism, how queries are formulated, and the human-computer interaction process. Some important classic IR models discussed are the Boolean model and vector space model. The Boolean model uses binary weighting and retrieves all documents that satisfy a Boolean query. The vector space model assigns non-binary term weights and ranks documents based on similarity to the query vector.


Chapter Four

IR models

02/29/24 1
What do we mean by model?

• Assumptions
• You will say
– It is smaller than the real thing
– It looks the same as the real thing
– It looks different, made of different stuff
– A representation
– It does some of the same things
– It is for understanding something
• Each answer is correct

• A model is an idealization or abstraction of the actual
  process (here, retrieval)
• It represents something that exists or is planned in
  the real world and that is in some way too complex or
  large for us to understand as it stands
• A model is in some way simplified, or reduced in size,
  scope or scale
• It helps us to understand the system better
• Modeling is the scientific way to study reality
• Thus, a model is a simplified representation of a
complex reality, usually for the purpose of
understanding that reality, and having all the features
of that reality necessary for the current task or
problem

What is a retrieval model?
• Models that describe the computational process
  – e.g. how documents are ranked (i.e. how documents or
    indexes are stored and processed)
• Models that attempt to describe the human process
  – e.g. the information need, interaction
  – Few do so meaningfully
• Retrieval variables: queries, documents, terms,
relevance judgments, users, information needs, …
• Retrieval models have an explicit or implicit
definition of relevance

What an IR model includes?
• The retrieval mechanism
– used to match query with a set of documents
• The ways in which the user’s information need can
be formulated as a query
– that can be searched by that mechanism
• The human computer interaction
– that needs to take place to ensure the most appropriate
processing of the query
• The social and cognitive environments
– in which that interaction takes place
IR Models

• A number of IR models have been proposed over the
  years
• Among them are 15 important IR models
• These are grouped into 3 major categories
– Classic model
– Structured model
– Browsing

IR Models - Basic Concepts
• Word evidence:
 IR systems usually adopt index terms to index and retrieve
documents
 Each document is represented by a set of representative
keywords or index terms (called Bag of Words)
• An index term is a word useful for remembering the
document's main themes
• Not all terms are equally useful for representing the
document contents:
 less frequent terms allow identifying a narrower set of
documents
• But no ordering information is attached to the Bag of
Words identified from the document collection.
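The Bag of Words idea above can be sketched in a few lines of Python (a minimal illustration; tokenization here is plain whitespace splitting, which is an assumption):

```python
from collections import Counter

# Bag of Words: a document is reduced to the multiset of its index
# terms and their counts; word-order information is discarded.
doc = "shipment of gold arrived in a gold truck"
bag = Counter(doc.split())

print(bag["gold"])  # 2

# Two documents with the same words in a different order get the
# same bag - this is exactly the "no ordering information" point.
assert bag == Counter("a gold truck arrived in shipment of gold".split())
```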
IR Models - Basic Concepts
• One central problem regarding IR systems is the
issue of predicting the degree of relevance of
documents for a given query
 Such a decision is usually dependent on a ranking
algorithm which attempts to establish a simple
ordering of the documents retrieved
 Documents appearing at the top of this ordering
are considered to be more likely to be relevant
• Thus ranking algorithms are at the core of IR systems
 The IR models determine the predictions of what
is relevant and what is not, based on the notion of
relevance implemented by the system
General Procedures Followed
To find relevant documents for a given query:
• First, map documents and queries into a term-document
vector space.
 Note that queries are treated as short documents
• Second, queries and documents are represented as
weighted vectors, wij
 There are binary and non-binary weighting techniques
• Third, rank documents by the closeness of their vectors to
the query.
 Closeness is determined by a similarity score calculation
Mapping Documents & Queries
• Represent both documents and queries as N-dimensional
vectors in a term-document matrix, which shows the
occurrence of terms in the document collection or query

 dj = (t1,j, t2,j, ..., tN,j);  qk = (t1,k, t2,k, ..., tN,k)

• An entry in the matrix corresponds to the “weight” of a
term in the document; zero means the term doesn’t exist in
the document.

        T1   T2   ...  TN
   D1   w11  w12  ...  w1N
   D2   w21  w22  ...  w2N
   :    :    :         :
   DM   wM1  wM2  ...  wMN

– Document collection is mapped to a term-by-document matrix
– View each document as a vector in a multidimensional space
 • Nearby vectors are related
– Normalize for vector length to avoid the effect of document length
The Boolean Model
• The Boolean model is a simple model based on set theory
 The Boolean model imposes a binary criterion
 for deciding relevance
• Terms are either present or absent. Thus,
 wij ∈ {0,1}
• sim(q,dj) = 1 if the document satisfies the boolean query,
             0 otherwise

        T1   T2   ...  TN
   D1   w11  w12  ...  w1N
   D2   w21  w22  ...  w2N
   :    :    :         :
   DM   wM1  wM2  ...  wMN

– Note that no weights are assigned in between 0 and 1;
  only the values 0 and 1 are used
The Boolean Model: Example
Given the following three documents, construct the term-document
matrix and find the relevant documents retrieved by the
Boolean model for the query “gold silver truck”
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
The table below shows the document-term (ti) matrix
Arrive damage deliver fire gold silver ship truck
D1 0 1 0 1 1 0 1 0
D2 1 0 1 0 0 1 0 1
D3 1 0 0 0 1 0 1 1
query 0 0 0 0 1 1 0 1

Also find the documents relevant for the queries:
(a) gold delivery; (b) ship gold; (c) silver truck
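This retrieval can be sketched in Python with an inverted index of sets (a minimal illustration; the word-form normalization table is hand-written to match the column headings above, and each query is read as a conjunction of its terms, which is an assumption since the slide gives no operators):

```python
# Boolean retrieval over the three example documents.
docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}

# Map inflected forms to the term headings used in the matrix above
# (a hand-written stand-in for a stemmer, for this example only).
stem = {"shipment": "ship", "damaged": "damage",
        "delivery": "deliver", "arrived": "arrive"}

# Build an inverted index: term -> set of documents containing it.
index = {}
for name, text in docs.items():
    for word in text.lower().split():
        term = stem.get(word, word)
        index.setdefault(term, set()).add(name)

def AND(*terms):
    """Documents containing every term (conjunctive query)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

print(AND("ship", "gold"))     # (b) ship gold
print(AND("silver", "truck"))  # (c) silver truck
print(AND("gold", "deliver"))  # (a) gold delivery
```

Under this conjunctive reading, query (a) retrieves nothing at all — an instance of the Boolean model's all-or-nothing matching returning too few documents.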
The Boolean Model: Further Example
• Given the following, determine the documents retrieved by a
Boolean model based IR system
• Index Terms: K1, ..., K8
• Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5})
= {D1, D2, D6}
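The same evaluation can be written directly with Python set operations (a sketch; the garbled query symbols are read as K1 ∧ (K2 ∨ ¬K3), the reading that reproduces the intermediate sets in the answer):

```python
# Boolean query evaluation as set algebra over the six documents.
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}

def containing(term):
    """Set of documents that contain the given index term."""
    return {d for d, terms in docs.items() if term in terms}

all_docs = set(docs)
# K1 AND (K2 OR NOT K3): AND -> intersection, OR -> union,
# NOT -> complement with respect to the whole collection.
result = containing("K1") & (containing("K2") | (all_docs - containing("K3")))
print(sorted(result))  # ['D1', 'D2', 'D6']
```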
Exercise
Given the following four documents with the following
contents:
– D1 = “computer information retrieval”
– D2 = “computer retrieval”
– D3 = “information”
– D4 = “computer information”

• What are the relevant documents retrieved for the
queries:
– Q1 = “information ∧ retrieval”
– Q2 = “information ∧ ¬computer”
Drawbacks of the Boolean Model
• Retrieval based on binary decision criteria with no
notion of partial matching
• No ranking of the documents is provided (absence of
a grading scale)
• Information need has to be translated into a Boolean
expression which most users find awkward
• The Boolean queries formulated by the users are
most often too simplistic
 As a consequence, the Boolean model frequently
returns either too few or too many documents in
response to a user query

Vector-Space Model
• This is the most commonly used strategy for measuring
relevance of documents for a given query. This is
because,
 Use of binary weights is too limiting
 Non-binary weights provide consideration for partial
matches
• These term weights are used to compute a degree of
similarity between a query and each document
 Ranked set of documents provides for better
matching
• The idea behind VSM is that
 the meaning of a document is conveyed by the words
 used in that document
Vector-Space Model
To find relevant documents for a given query:
• First, map documents and queries into a term-document vector
space.
 Note that queries are treated as short documents
• Second, in the vector space, queries and documents are
represented as weighted vectors, wij
 There are different weighting techniques; the most widely used
 one is computing the TF*IDF weight for each term
• Third, a similarity measurement is used to rank documents by
the closeness of their vectors to the query.
 To measure closeness of documents to the query, a cosine
 similarity score is used by most search engines
Term-document matrix.
• A collection of n documents and a query can be represented
in the vector space model by a term-document matrix.
– An entry in the matrix corresponds to the “weight” of a term in
the document;
– zero means the term has no significance in the document or
it simply doesn’t exist in the document. Otherwise, wij > 0
whenever ki ∈ dj

        T1   T2   ...  TN
   D1   w11  w12  ...  w1N
   D2   w21  w22  ...  w2N
   :    :    :         :
   DM   wM1  wM2  ...  wMN

• How are the weights for term i in document j and in
query q (wij and wiq) computed?
Example: Computing weights
• A collection includes 10,000 documents
 The term A appears 20 times in a particular document j
 The maximum frequency of any term in document j is 50
 The term A appears in 2,000 of the collection documents
• Compute the TF*IDF weight of term A in document j:
 tf(A,j) = freq(A,j) / max(freq(k,j)) = 20/50 = 0.4
 idf(A) = log2(N/DFA) = log2(10,000/2,000) = 2.32
 wAj = tf(A,j) * idf(A) = 0.4 * 2.32 = 0.928
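The computation above can be checked in a few lines of Python (a sketch; as on the slide, tf is the raw frequency normalized by the maximum term frequency in the document, and idf uses a base-2 logarithm):

```python
import math

# TF*IDF weight for the worked example: term A appears 20 times in
# document j, the most frequent term in j appears 50 times, and A
# occurs in 2,000 of the 10,000 collection documents.
def tfidf(freq, max_freq, N, df):
    tf = freq / max_freq       # maximum-frequency normalized tf
    idf = math.log2(N / df)    # inverse document frequency
    return tf * idf

w = tfidf(freq=20, max_freq=50, N=10_000, df=2_000)
print(round(w, 3))  # 0.929 (the slide truncates this to 0.928)
```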
Similarity Measure
• A similarity measure is a function that computes the
degree of similarity between document dj and the user’s
query q:

 sim(dj, q) = (dj · q) / (|dj| |q|)
            = Σi=1..n (wi,j wi,q) / ( √(Σi=1..n wi,j²) √(Σi=1..n wi,q²) )
• Using a similarity score between the query and each
document:
– It is possible to rank the retrieved documents in the
order of presumed relevance.
– It is possible to enforce a certain threshold so that we
can control the size of the retrieved set of documents.
Vector Space with Term Weights and
Cosine Matching

 Di = (d1i, w1di; d2i, w2di; ...; dti, wtdi)
 Q  = (q1i, w1qi; q2i, w2qi; ...; qti, wtqi)

 sim(Q, Di) = Σj=1..t (wjq wjdi) / √( Σj=1..t (wjq)² × Σj=1..t (wjdi)² )

Example in a two-term space (Term A, Term B):
 Q  = (0.4, 0.8)
 D1 = (0.8, 0.3)
 D2 = (0.2, 0.7)

 sim(Q, D2) = (0.4×0.2 + 0.8×0.7) / √( [(0.4)² + (0.8)²] × [(0.2)² + (0.7)²] )
            = 0.64 / 0.651 ≈ 0.98

 sim(Q, D1) = 0.56 / 0.764 ≈ 0.73
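The two-term example can be verified with a minimal sketch of the cosine formula:

```python
import math

# Cosine similarity between weighted term vectors:
# dot product divided by the product of the vector lengths.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine(Q, D2), 2))  # 0.98
print(round(cosine(Q, D1), 2))  # 0.73
```

Computed exactly, sim(Q, D1) is 0.733; where the slide shows 0.74, that is a rounding slip.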
Vector-Space Model: Example
• Suppose a user queries for: Q = “gold silver truck”. The
collection consists of three documents with the
following content:
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
• Show the retrieval results in ranked order.
1. Assume that full-text terms are used during indexing,
without removing common terms or stop words, and with
no stemming.
2. Assume that content-bearing terms are selected during
indexing.
3. Also compare your results with and without normalizing
term frequency.
Vector-Space Model: Example
           Counts (TF)                    Wi = TF*IDF
Terms      Q  D1 D2 D3   DF  IDF      Q      D1     D2     D3
a          0  1  1  1    3   0        0      0      0      0
arrived    0  0  1  1    2   0.176    0      0      0.176  0.176
damaged    0  1  0  0    1   0.477    0      0.477  0      0
delivery   0  0  1  0    1   0.477    0      0      0.477  0
fire       0  1  0  0    1   0.477    0      0.477  0      0
gold       1  1  0  1    2   0.176    0.176  0.176  0      0.176
in         0  1  1  1    3   0        0      0      0      0
of         0  1  1  1    3   0        0      0      0      0
silver     1  0  2  0    1   0.477    0.477  0      0.954  0
shipment   0  1  0  1    2   0.176    0      0.176  0      0.176
truck      1  0  1  1    2   0.176    0.176  0      0.176  0.176
(IDF = log10(N/DF), with N = 3 documents)
Vector-Space Model
Terms Q D1 D2 D3
a 0 0 0 0
arrived 0 0 0.176 0.176
damaged 0 0.477 0 0
delivery 0 0 0.477 0
fire 0 0.477 0 0
gold 0.176 0.176 0 0.176
in 0 0 0 0
of 0 0 0 0
silver 0.477 0 0.954 0
shipment 0 0.176 0 0.176
truck 0.176 0 0.176 0.176
Vector-Space Model: Example
• Compute similarity using the cosine measure
• First, for each document and the query, compute all vector
lengths (zero terms ignored):
 |d1| = √(0.477² + 0.477² + 0.176² + 0.176²) = √0.517 = 0.719
 |d2| = √(0.176² + 0.477² + 0.954² + 0.176²) = √1.1996 = 1.095
 |d3| = √(0.176² + 0.176² + 0.176² + 0.176²) = √0.124 = 0.352
 |q|  = √(0.176² + 0.477² + 0.176²) = √0.2896 = 0.538
• Next, compute the dot products (zero products ignored):
 Q·d1 = 0.176*0.176 = 0.0310
 Q·d2 = 0.954*0.477 + 0.176*0.176 = 0.4862
 Q·d3 = 0.176*0.176 + 0.176*0.176 = 0.0620
Vector-Space Model: Example
Now, compute the similarity scores:
Sim(q,d1) = 0.0310 / (0.538*0.719) = 0.0801
Sim(q,d2) = 0.4862 / (0.538*1.095) = 0.8246
Sim(q,d3) = 0.0620 / (0.538*0.352) = 0.3271
Finally, we sort and rank the documents in descending
order of similarity score:
Rank 1: Doc 2 = 0.8246
Rank 2: Doc 3 = 0.3271
Rank 3: Doc 1 = 0.0801
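The whole worked example can be reproduced end to end (a sketch under assumption 1 of the example: whitespace tokenization, no stop-word removal or stemming, raw-count TF, and IDF = log10(N/DF)):

```python
import math

# TF*IDF weighting and cosine ranking for Q = "gold silver truck".
docs = {
    "d1": "shipment of gold damaged in a fire",
    "d2": "delivery of silver arrived in a silver truck",
    "d3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

vocab = sorted({w for text in docs.values() for w in text.split()})
N = len(docs)
df = {t: sum(t in text.split() for text in docs.values()) for t in vocab}
idf = {t: math.log10(N / df[t]) for t in vocab}

def weights(text):
    """Raw-count TF times IDF, one entry per vocabulary term."""
    words = text.split()
    return [words.count(t) * idf[t] for t in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

q = weights(query)
scores = {name: cosine(weights(text), q) for name, text in docs.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(s, 4))
```

The ranking comes out d2 > d3 > d1, and the scores agree with 0.8246, 0.3271 and 0.0801 up to the rounding of the table's three-decimal weights.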

Vector-Space Model
• Advantages:
• Term-weighting improves quality of the answer set
since it helps to display relevant documents in ranked
order
• Partial matching allows retrieval of documents that
approximate the query conditions
• Cosine ranking formula sorts documents according to
degree of similarity to the query
• Disadvantages:
• Assumes independence of index terms. It doesn’t
relate one term with another term
• Computationally expensive since it measures the
similarity between each document and the query
Probabilistic Model
• IR is an uncertain process
–Mapping Information need to Query is not perfect
–Mapping Documents to index terms is a logical representation
–Query terms and index terms mostly mismatch
• This situation leads to several statistical approaches:
probability theory, fuzzy logic, theory of evidence,
language modeling, etc.
• The probabilistic retrieval model is a formal model that
attempts to predict the probability that a given document
will be relevant to a given query; i.e. Prob(R|(q,di))
–Use probability to estimate the “odds” of relevance of a query to
a document.
–It relies on accurate estimates of probabilities

Terms Existence in Relevant Documents
N = the total number of documents in the collection
n = the total number of documents that contain term ti
R = the total number of relevant documents retrieved
r = the total number of relevant documents retrieved that contain term ti

 wi = log [ (r + 0.5)(N - n - R + r + 0.5) / ((n - r + 0.5)(R - r + 0.5)) ]

Probabilistic model
• Probabilistic model uses probability theory to model the
uncertainty in the retrieval process
– Relevance feedback can improve the ranking by giving better
term probability estimates
• Advantages of probabilistic model over vector‐space
– Strong theoretical basis
– Since the base is probability theory, it is very well understood
– Easy to extend
• Disadvantages
– Models are often complicated
– No term frequency weighting

?

Thank you!
