
IR Models

CHAPTER FIVE

IR Models - Basic Concepts
Word evidence: Bag of Words
IR systems usually adopt index terms to index and retrieve documents.
Each document is represented by a set of representative keywords or index terms (called a Bag of Words).
An index term is a word from a document that is useful for remembering the document's main themes.
Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents.
But no ordering information is attached to the Bag of Words identified from the document collection.

IR Models - Basic Concepts
• One central problem regarding IR systems is the issue of
predicting which documents are relevant and which are
not:
• Such a decision is usually dependent on a ranking
algorithm which attempts to establish a simple ordering
of the documents retrieved.
• Documents appearing at the top of this ordering are
considered to be more likely to be relevant.
• Thus ranking algorithms are at the core of IR systems.
• The IR models determine the predictions of what is
relevant and what is not, based on the notion of
relevance implemented by the system.

IR Models - Basic Concepts
• After preprocessing, N distinct terms remain; these unique terms form the VOCABULARY
• Let
– ki be an index term i & dj be a document j
– K = (k1, k2, …, kN) is the set of all index terms
• Each term, i, in a document or query j, is given a real-
valued weight, wij.
– wij is a weight associated with the pair (ki, dj). If wij = 0, the term ki does not belong to document dj
• The weight wij quantifies the importance of the index term
for describing the document contents
• vec(dj) = (w1j, w2j, …, wNj) is a weighted vector associated with the document dj
Mapping Documents & Queries
Represent both documents and queries as N-dimensional
vectors in a term-document matrix, which shows
occurrence of terms in the document collection or query
An entry in the matrix corresponds to the “weight” of a
term in the document;
dj = (t1,j, t2,j, …, tN,j);   qk = (t1,k, t2,k, …, tN,k)

      T1   T2   …   TN
D1   w11  w12   …  w1N
D2   w21  w22   …  w2N
 :     :    :        :
DM   wM1  wM2   …  wMN

– The document collection is mapped to a term-by-document matrix.
– The documents are viewed as vectors in multidimensional space; "nearby" vectors are related.
– Normalize the weights, as usual, for vector length to avoid the effect of document length.
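As an illustration (not part of the original slides), here is a minimal Python sketch of mapping a collection to a term-by-document matrix; the two toy documents are hypothetical, and raw term counts stand in for the weights:

    from collections import Counter

    def term_document_matrix(docs):
        # Vocabulary: the distinct terms that occur in the collection.
        vocab = sorted({term for doc in docs for term in doc.split()})
        counts = [Counter(doc.split()) for doc in docs]
        # One row per term, one column per document; entry = raw count.
        return vocab, [[c[term] for c in counts] for term in vocab]

    docs = ["gold silver truck", "shipment of gold"]  # hypothetical collection
    vocab, matrix = term_document_matrix(docs)
    for term, row in zip(vocab, matrix):
        print(term, row)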
Weighting Terms in Vector Space
The importance of the index terms is represented by
weights associated to them
Problem: to show the importance of the index term for
describing the document/query contents, what weight can
we assign?
Solution 1: Binary weights: wij = 1 if the term is present, 0 otherwise
Similarity: number of terms in common
Problem: Not all terms equally interesting
E.g. the vs. dog vs. cat
Solution: Replace binary weights with non-binary weights
dj = (w1,j, w2,j, …, wN,j);   qk = (w1,k, w2,k, …, wN,k)
The Boolean Model
• Boolean model is a simple model based on set
theory
• The Boolean model imposes a binary criterion
for deciding relevance.
• Terms are either present or absent. Thus,
wij ∈ {0, 1}
• sim(q, dj) = 1 if the document satisfies the Boolean query, 0 otherwise

      T1   T2   …   TN
D1   w11  w12   …  w1N
D2   w21  w22   …  w2N
 :     :    :        :
DM   wM1  wM2   …  wMN

– Note that no weights are assigned in between 0 and 1; only the values 0 or 1 can be assigned.
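A minimal sketch of Boolean retrieval (an illustration, not from the slides), assuming each document is stored as a set of index terms and the Boolean query is supplied as a predicate over that set:

    def boolean_retrieve(docs, predicate):
        # docs: mapping of document name -> set of index terms
        # (wij = 1 exactly when term ki is in the set of dj).
        # predicate: the Boolean query as a function over a term set.
        return [name for name, terms in docs.items() if predicate(terms)]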
The Boolean Model: Example
• Generate the relevant documents retrieved by
the Boolean model for the query :
q = k1 ∧ (k2 ∨ ¬k3)

[Venn diagram: documents d1–d7 distributed over the overlapping regions of the term sets k1, k2, and k3]
The Boolean Model: Example
• Given the following determine documents retrieved by the
Boolean model based IR system
• Index Terms: K1, …,K8.
• Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5}) = {D1, D2, D6}
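Using the boolean_retrieve sketch from above, this example can be checked directly; the query K1 ∧ (K2 ∨ ¬K3) becomes a lambda over the term sets:

    docs = {
        "D1": {"K1", "K2", "K3", "K4", "K5"},
        "D2": {"K1", "K2", "K3", "K4"},
        "D3": {"K2", "K4", "K6", "K8"},
        "D4": {"K1", "K3", "K5", "K7"},
        "D5": {"K4", "K5", "K6", "K7", "K8"},
        "D6": {"K1", "K2", "K3", "K4"},
    }
    # Query: K1 AND (K2 OR NOT K3)
    hits = boolean_retrieve(docs, lambda t: "K1" in t and ("K2" in t or "K3" not in t))
    print(hits)  # ['D1', 'D2', 'D6']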
Exercise 1
Given the following three documents, construct the term-document matrix and find the relevant documents retrieved by the Boolean model for the given query.
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
• Query: “gold silver truck”
The table below shows the document-term (ti) matrix (to be filled in):

        arrive  damage  deliver  fire  gold  silver  ship  truck
D1
D2
D3
query

Also find the relevant documents for the queries:
(a) "gold delivery"
(b) "ship gold"
(c) "silver truck"
Exercise 2
Given the following three documents with the following
contents:
D1 = “computer information retrieval”
D2 = “computer retrieval”
D3 = “information”
D4 = “computer information”
What are the relevant documents retrieved for the queries:
Q1 = "information ∧ retrieval"
Q2 = "information ∧ ¬computer"
Drawbacks of the Boolean Model
• Retrieval based on binary decision criteria with no
notion of partial matching.
• No ranking of the documents is provided (absence
of a grading scale)
• Information need has to be translated into a
Boolean expression which most users find
awkward.
• The Boolean queries formulated by the users are
most often too simplistic.
• As a consequence, the Boolean model frequently
returns either too few or too many documents in
response to a user query.
• Just changing a Boolean operator from "AND" to "OR" changes the result from intersection to union, as the sketch below illustrates.
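A quick illustration (not from the slides) using the document sets from the earlier Boolean example, where {D1, D2, D4, D6} contain K1 and {D1, D2, D3, D6} contain K2:

    docs_with_k1 = {"D1", "D2", "D4", "D6"}
    docs_with_k2 = {"D1", "D2", "D3", "D6"}
    print(docs_with_k1 & docs_with_k2)  # K1 AND K2 -> intersection
    print(docs_with_k1 | docs_with_k2)  # K1 OR K2  -> union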
Vector-Space Model (VSM)
• This is the most commonly used strategy for measuring relevance of documents for a given query.
• This is because:
– Use of binary weights is too limiting; non-binary weights provide consideration for partial matches.
– These term weights are used to compute a degree of similarity between a query and each document.
– A ranked set of documents provides for better matching.
• The idea behind the VSM is that the meaning of a document is represented by the words used in that document.
Vector-Space Model
To find relevant documents for a given query:
First, documents and queries are mapped into term vector space.
• Note that queries are treated as short documents.
Second, in the vector space, queries and documents are represented as weighted vectors.
• There are different weighting techniques; the most widely used one computes tf*idf for each term.
Third, a similarity measure is used to rank documents by the closeness of their vectors to the query.
• Documents are ranked by closeness to the query, where closeness is determined by a similarity score calculation.
Term-document matrix.
A collection of n documents and query can be
represented in the vector space model by a term-
document matrix.
An entry in the matrix corresponds to the “weight” of a term in
the document;
zero means the term has no significance in the document
or it simply doesn't exist in the document; otherwise, wij > 0 whenever ki ∈ dj
      T1   T2   …   TN
D1   w11  w12   …  w1N
D2   w21  w22   …  w2N
 :     :    :        :
DM   wM1  wM2   …  wMN
Computing weights
• How do we compute weights for term i in document j and
query q; wij and wiq ?
• A good weight must take into account two effects:
– Quantification of intra-document contents (similarity):
• the tf factor, the term frequency within a document
– Quantification of inter-document separation (dissimilarity):
• the idf factor, the inverse document frequency
– As a result, most IR systems use the tf*idf weighting technique:
wij = tf(i,j) * idf(i)
Computing Weights
• Let,
• N be the total number of documents in the collection
• ni be the number of documents which contain ki
• freq(i,j) raw frequency of ki within dj
• A normalized tf factor is given by
• f(i,j) = freq(i,j) / max(freq(l,j))
• where the maximum is computed over all terms which
occur within the document dj
• The idf factor is computed as
• idf(i) = log (N/ni)
• The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
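The two formulas translate directly into code; a minimal sketch (an illustration, not from the slides), with the log base left as a parameter since the slides use base 2 in the next example and base 10 in the later one:

    import math

    def tf_idf_weight(freq_ij, max_freq_j, N, n_i, base=2):
        # Normalized term frequency: raw frequency of term i in doc j
        # over the max raw frequency of any term in doc j.
        tf = freq_ij / max_freq_j
        # Inverse document frequency: log(N / ni) in the chosen base.
        idf = math.log(N / n_i, base)
        return tf * idf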
Example: Computing weights
• A collection includes 10,000 documents
• The term A appears 20 times in a particular
document
• The maximum appearance of any term in this
document is 50
• The term A appears in 2,000 of the collection
documents.
• Compute TF*IDF weight?
• f(i,j) = freq(i,j) / max(freq(l,j)) = 20/50 = 0.4
• idf(i) = log2(N/ni) = log2(10,000/2,000) = log2(5) = 2.32
• wij = f(i,j) * log(N/ni) = 0.4 * 2.32 = 0.928
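The tf_idf_weight sketch above reproduces this arithmetic (the slide gets 0.928 by rounding the idf to 2.32 before multiplying):

    w = tf_idf_weight(freq_ij=20, max_freq_j=50, N=10_000, n_i=2_000, base=2)
    print(round(w, 3))  # 0.929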
Similarity Measure
[Figure: document vector dj and query vector q in the term space, separated by the angle θ]

• Sim(q, dj) = cos(θ)

sim(dj, q) = (dj · q) / (|dj| × |q|) = Σi=1..N (wi,j × wi,q) / (√(Σi=1..N wi,j²) × √(Σi=1..N wi,q²))
• Since wij ≥ 0 and wiq ≥ 0, 0 ≤ sim(q, dj) ≤ 1.
• A document is retrieved even if it matches the query terms only partially.
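A minimal cosine-similarity sketch (an illustration, not from the slides), assuming the document and query weight vectors are plain lists over the same vocabulary:

    import math

    def cosine_sim(d, q):
        # Dot product over matched dimensions.
        dot = sum(wd * wq for wd, wq in zip(d, q))
        # Product of the two vector lengths.
        norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
        return dot / norm if norm else 0.0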
Vector-Space Model: Example
• Suppose we pose the query Q: "gold silver truck" against a collection consisting of the following three documents:
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
• Assume that all terms are used, including common terms and stop words, and that no terms are reduced to root terms.
• Show the retrieval results in ranked order.
Vector-Space Model: Example
Term       Q   D1  D2  D3   DF   IDF     W(Q)    W(D1)   W(D2)   W(D3)
a          0   1   1   1    3    0       0       0       0       0
arrived    0   0   1   1    2    0.176   0       0       0.176   0.176
damaged    0   1   0   0    1    0.477   0       0.477   0       0
delivery   0   0   1   0    1    0.477   0       0       0.477   0
fire       0   1   0   0    1    0.477   0       0.477   0       0
gold       1   1   0   1    2    0.176   0.176   0.176   0       0.176
in         0   1   1   1    3    0       0       0       0       0
of         0   1   1   1    3    0       0       0       0       0
silver     1   0   2   0    1    0.477   0.477   0       0.954   0
shipment   0   1   0   1    2    0.176   0       0.176   0       0.176
truck      1   0   1   1    2    0.176   0.176   0       0.176   0.176

(The counts under Q, D1, D2, D3 are raw term frequencies, used here as TF; DF is document frequency; IDF = log10(3/DF); Wi = TF × IDF.)
Vector-Space Model
Terms Q D1 D2 D3
a 0 0 0 0
arrived 0 0 0.176 0.176
damaged 0 0.477 0 0
delivery 0 0 0.477 0
fire 0 0.477 0 0
gold 0.176 0.176 0 0.176
in 0 0 0 0
of 0 0 0 0
silver 0.477 0 0.954 0
shipment 0 0.176 0 0.176
truck 0.176 0 0.176 0.176
Vector-Space Model: Example
• Compute similarity using the cosine measure.
• First, for each document and the query, compute all vector lengths (zero terms ignored):
|d1| = √(0.477² + 0.477² + 0.176² + 0.176²) = √0.517 = 0.719
|d2| = √(0.176² + 0.477² + 0.954² + 0.176²) = √1.2001 = 1.095
|d3| = √(0.176² + 0.176² + 0.176² + 0.176²) = √0.124 = 0.352
|q| = √(0.176² + 0.477² + 0.176²) = √0.2896 = 0.538
• Next, compute the dot products (zero products ignored):
Q·d1 = 0.176 × 0.176 = 0.0310
Q·d2 = 0.954 × 0.477 + 0.176 × 0.176 = 0.4862
Q·d3 = 0.176 × 0.176 + 0.176 × 0.176 = 0.0620
Vector-Space Model: Example
Now, compute the similarity scores:
Sim(q, d1) = 0.0310 / (0.538 × 0.719) = 0.0801
Sim(q, d2) = 0.4862 / (0.538 × 1.095) = 0.8246
Sim(q, d3) = 0.0620 / (0.538 × 0.352) = 0.3271
Finally, we sort and rank the documents in descending order of similarity score:
Rank 1: Doc 2 = 0.8246
Rank 2: Doc 3 = 0.3271
Rank 3: Doc 1 = 0.0801

• Exercise: using normalized TF, rank the documents with the cosine similarity measure. Hint: normalize the TF of term i in doc j by the maximum frequency of any term in doc j.
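The whole ranking can be reproduced with the cosine_sim sketch above; the weight vectors below are read off the table on the previous slides, in the order a, arrived, damaged, delivery, fire, gold, in, of, silver, shipment, truck (small differences from the slide values come from rounding):

    q  = [0, 0,     0,     0,     0,     0.176, 0, 0, 0.477, 0,     0.176]
    d1 = [0, 0,     0.477, 0,     0.477, 0.176, 0, 0, 0,     0.176, 0]
    d2 = [0, 0.176, 0,     0.477, 0,     0,     0, 0, 0.954, 0,     0.176]
    d3 = [0, 0.176, 0,     0,     0,     0.176, 0, 0, 0,     0.176, 0.176]

    scores = {name: cosine_sim(d, q) for name, d in [("D1", d1), ("D2", d2), ("D3", d3)]}
    for name, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        print(name, round(s, 4))  # D2 0.8248, D3 0.3271, D1 0.0801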
Vector-Space Model
• Advantages:
• term-weighting improves quality of the answer set
since it displays in ranked order
• partial matching allows retrieval of documents that
approximate the query conditions
• cosine ranking formula sorts documents according
to degree of similarity to the query

• Disadvantages:
• assumes independence of index terms (??)

Thank you
