
Chapter 4

IR Models
Introduction to Information Retrieval Models
At the end of this chapter every student must be able to:
 Define what a model is
 Describe why a model is needed in information retrieval
 Differentiate the different types of information retrieval models:
 Boolean model
 Vector space model
 Probabilistic model
 Know how to calculate the similarity of documents to a given query
 Identify term frequency, document frequency, inverse document
frequency, term weight and similarity measurements
What is a model?
• A model is an idealization or abstraction of actual processes (i.e.,
things that happen in the real world)
There are two good reasons for having models of IR:

1. Models guide research and provide the means for academic
discussion
2. Models can serve as a blueprint for implementing an actual
retrieval system
IR Models
• In IR, mathematical models are used to understand and
reason about some behavior or phenomenon in the real world
• A model of information retrieval predicts and explains what
a user will find relevant given the user's query
Retrieval model
• Thus, retrieval models are models that can describe the
computational processes (here, retrieval)
– e.g., how documents are ranked
– e.g., how similarities are measured
• They are models that attempt to describe the human process
– e.g., the information need, interaction
• They are models that specify the details of
– Document representation
– Query representation
– Retrieval function (matching function)
– Ranking
Retrieval Models
• A number of IR models have been proposed over the years to
retrieve information
• The following are the major models developed to retrieve
information
– Boolean model
• Exact match model
– Statistical models
• Vector space and probabilistic models are the major
statistical models
• Are “best match” models
– Linguistic and knowledge based models
• Are “best match models”
What is the difference between best match and exact match?
Types of models
• The three classic information retrieval models are:
– Boolean retrieval models
– Vector space models
– Probabilistic models
1. Boolean model
 A document either matches a query, or it does not.
 The Boolean retrieval model is a model for information
retrieval in which we can pose (create) any query in
the form of a Boolean expression of terms, that is,
one in which terms are combined with the operators AND, OR,
and NOT.
…..cont
 The first model of information retrieval
 The most criticized model
 Based on Boolean algebra, developed by George Boole
• Boole defined 3 basic operators:
 AND
 OR
 NOT
……cont
• Boolean relevance prediction (R)
– A document is predicted as relevant to a query iff it satisfies the
query expression
– Each query term specifies the set of documents containing it
• AND (∧): the intersection of two sets
• OR (∨): the union of two sets
• NOT (¬): set inverse, or really set difference
– A query thus searches a set of documents to determine their content
– The search engine retrieves those documents satisfying the logical
constraints of the query
……cont
• There is an assumption about document representation before retrieval
– Documents and queries are represented as sets of index terms
• Basis for the majority of DBMSs and conventional IR systems
– Database systems use Boolean logic for searching
• Document (how a document is viewed in the Boolean model)
– An object: a set consisting of terms
– That is, documents are sets of terms
• Term (how a term is viewed in the Boolean model)
– Terms are the things we use to describe concepts in a particular
domain
– The vocabulary grows as new terms are introduced
….cont

• Most queries search for more than one term
– Information need: find all documents
containing “information” and “retrieval”
Answer: Only documents containing both “information” and
“retrieval” satisfy this query
– Information need: find all documents
containing “information” or “retrieval”
Answer: Satisfied by a document that contains either of the
two words, or both
Boolean model

• Consider a set of five docs and assume that they contain the
terms shown in the table

Doc.  Terms
D1    algorithm, information, retrieval
D2    retrieval, science
D3    algorithm, information, science
D4    pattern, retrieval, science
D5    science, algorithm

Find the documents retrieved by the following expressions:
a. information AND retrieval
   Answer: {d1, d3} ∩ {d1, d2, d4} = {d1}
b. information OR retrieval
   Answer: {d1, d3} ∪ {d1, d2, d4} = {d1, d2, d3, d4}
Examples
• Example 2
– Information need
Question: Find all documents containing “information” and
“retrieval”, or not containing “retrieval” but containing “science”
– Query/Boolean expression
• (information AND retrieval) OR (NOT retrieval AND science)
• Parentheses avoid ambiguity

Exercise

• If the following are our documents, which documents can be
retrieved for the given queries? The documents are:
Doc1: Computer Information Retrieval
Doc2: Computer Retrieval
Doc3: Information
Doc4: Computer Information
Query1: Information AND Retrieval
Query2: (Information AND Retrieval) OR (NOT Computer)
Advantages and disadvantages of the Boolean model
• Advantages of the Boolean model
A very simple model based on sets (easy for experts)
Computationally efficient
Expressiveness and clarity
Still a dominant model in commercial database systems
Disadvantages of Boolean model
• Disadvantages
Users need to be trained
Very limited in expressing the user's information need in detail
No weighting of index or query terms
Based on exact matching: a relevant document that only
partially matches the query will not be retrieved
Vector Space Model
Suggested by Peter Luhn and Gerard Salton
A classic model of document retrieval based on representing
documents and queries as vectors
 Partial matching is possible
Retrieval is based on the similarity between the query vector
and the document vectors
The output documents are ranked according to this similarity
Example
• Document vector and query vector
….cont
• The similarity is based on the occurrences of the keywords in
the query and the document
• The angle between the query and a document is measured using
cosine similarity, since both documents and the user's query
are represented as vectors in VSM
…cont
• VSM assumes that if document vector V1 is closer to the query
than another document vector V2, then:
 The document represented by V1 is more relevant
than the one represented by V2
In VSM, to decide the similarity of a document to the given
query, term weighting (tf*idf) is used. To calculate tf*idf, we
first have to calculate the following:
1. Term frequency (tf)
• Term frequency is the number of times a given term
appears in a document
tf = frequency of the term / maximum term frequency
(within a single document)
2. Inverse document frequency (idf)
• IDF is used to measure whether a term is common or
rare across all documents
• idf = log2(N/df), where
N = total number of documents
df = document frequency (number of documents
containing the given term)
3. Term weighting (tf*idf)
• Term weight: w(i,j) = tf(i,j) * idf(i) = tf(i,j) * log2(N/df(i))
4. Document length
After calculating the term weights we have to
calculate the length of each document
Document length = the square root of the sum of the squared
term weights: |d| = sqrt(Σ w(i,j)²)
5. Similarity
At the end we need to calculate the similarity of the
documents to the query.
The widely used measure of similarity in the vector space
model is the cosine similarity.
The cosine similarity between the vectors dj (the document
vector) and q (the query vector) is given by:

sim(dj, q) = (dj · q) / (|dj| × |q|)

• That is, the numerator is the dot product of the two weight
vectors, and the denominator is the length of the document
times the length of the query.
Example 1
• Example 1: If the following three documents are given with
one query, which document must be ranked first?
Doc1: new york times
Doc2: new york post
Doc3: los angeles times
Query: new new times
Solution
(Steps 1–5 on the slides: term frequencies, idf values, tf*idf
weights, document lengths, and the final cosine similarities)
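Because the worked steps above are slide images, here is a minimal sketch (an illustrative reconstruction, not the slides' own code) that follows the recipe from the previous slides — tf = frequency / maximum frequency, idf = log2(N/df), weight = tf·idf, then cosine similarity; applying the same tf normalization to the query is an assumption:

```python
import math
from collections import Counter

docs = {
    "Doc1": "new york times",
    "Doc2": "new york post",
    "Doc3": "los angeles times",
}
query = "new new times"
N = len(docs)

# df: number of documents containing each term.
df = Counter()
for text in docs.values():
    df.update(set(text.split()))

def tfidf(text):
    """Weight vector per the slides: tf = freq / max freq in the text,
    idf = log2(N / df), weight = tf * idf."""
    freqs = Counter(text.split())
    max_f = max(freqs.values())
    return {t: (f / max_f) * math.log2(N / df[t])
            for t, f in freqs.items() if t in df}

q = tfidf(query)
q_len = math.sqrt(sum(w * w for w in q.values()))       # query length

for name, text in docs.items():
    d = tfidf(text)
    d_len = math.sqrt(sum(w * w for w in d.values()))   # document length
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())  # numerator
    print(name, round(dot / (d_len * q_len), 3))
# Doc1 0.775, Doc2 0.293, Doc3 0.113 -> Doc1 is ranked first
```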
Exercise 1
Suppose the database collection consists of three documents
(N = 3) with the following content. Rank the documents by
their similarity to the query. Ignore stop words.
Doc1: "Shipment of gold damaged in a fire"
Doc2: "Delivery of silver arrived in a silver truck"
Doc3: "Shipment of gold arrived in a truck"
Query: "gold silver truck"
Exercise 2
 Which document must be ranked first for the following?
Doc1: Breakthrough drug for health
Doc2: New health drug
Doc3: New approach for treatment of health
Doc4: New hopes for health patients
Query: Treatment of health patients
Exercise 3
 From the following documents, which one must be
ranked first? Remove stopwords.
Doc1: The health observances for march
Doc2: The health oriented calendar
Doc3: The awareness news for march awareness
Query: March health awareness
Advantages and Disadvantages of VSM
Latent semantic indexing
Latent Semantic Indexing (LSI) is an extension of the vector
space retrieval method (Deerwester et al., 1990).
LSI can retrieve relevant documents even when they do not
share any words with the query:
 if only a synonym of a query keyword is present in a document,
the document will still be found relevant.
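A minimal numerical sketch of this idea (illustrative only: the toy term-document counts and the query are made up, not from the slides). It takes a truncated SVD of a term-document matrix and compares the query to the documents in the reduced latent space:

```python
import numpy as np

# Rows = terms, columns = three toy documents (counts are invented).
# Doc1 and Doc2 are "financial", Doc3 is about fishing.
terms = ["bank", "mortgage", "loans", "rates", "fish", "stream", "river"]
A = np.array([
    [1, 0, 0],   # bank     (only in Doc1)
    [1, 1, 0],   # mortgage
    [0, 1, 0],   # loans
    [1, 1, 0],   # rates
    [0, 0, 1],   # fish
    [0, 0, 1],   # stream
    [0, 0, 1],   # river
], dtype=float)

k = 2                                        # latent dimensions kept
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Dk = U[:, :k], s[:k], Vt[:k, :].T    # Dk rows = docs in latent space

# Fold the query into the latent space: q_hat = q^T * U_k * S_k^(-1)
q = np.zeros(len(terms))
q[terms.index("bank")] = 1.0                 # query: "bank"
q_hat = (q @ Uk) / sk

for j, d_hat in enumerate(Dk, start=1):
    cos = (q_hat @ d_hat) / (np.linalg.norm(q_hat) * np.linalg.norm(d_hat))
    print(f"Doc{j}: {cos:.2f}")
# Doc1: 1.00, Doc2: 1.00, Doc3: 0.00 -- Doc2 contains no query term at
# all, yet lands on the same latent "finance" direction as Doc1, while
# plain term matching would have scored it zero.
```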
Example
Cont..
• For example, the word bank when used together with
mortgage, loans, and rates probably means a financial
institution.

• However, the word bank when used together with lures,
casting, and fish probably means a stream or river bank.
Probabilistic Model
Given a user information need (represented as a query) and a
collection of documents (transformed into document
representations), a system must determine how well the
documents satisfy the query.

An IR system has an uncertain understanding of the user
query, and makes an uncertain guess of whether a document
satisfies the query.
Cont….
• Probability theory provides a principled foundation for such
reasoning under uncertainty

• Probabilistic models exploit this foundation to estimate how
likely it is that a document is relevant to a query.
Cont..

• A probabilistic formula is used to calculate P(D|R), in place of
the vector similarity formula (e.g., cosine similarity) used to
calculate the relevance ranking in the vector space model.

• The probability formula depends on the specific model used,
and also on the assumptions made about the distribution of
terms, e.g., how terms are distributed over documents in the
set of relevant documents and in the set of non-relevant
documents.
Cont..
In the case of P(R), the sample space might be {relevant,
irrelevant}, and we might define the random variable R to
take the values {0, 1}, where 0 = irrelevant and 1 = relevant.
If we know the number of relevant documents in the
collection, say 100 documents are relevant, and we know the
total number of documents in the collection, say 1 million,
then the quotient of the two defines the probability of
relevance: P(R=1) = 100/1,000,000 = 0.0001.
IR as Classification
Why probabilities in IR?
BIM (Binary Independence Model)
• We assume here that the relevance of each document is
independent of the relevance of other documents.

• Under the BIM, we model the probability P(R|d, q) that a
document is relevant via the probability in terms of term
incidence vectors, P(R|x, q).
Cont…

Many assumptions are made in the Binary Independence Model.
Let us summarize them here:
 The documents are independent of each other.
 The terms in a document are independent of each other.
 The terms not present in the query are equally likely to occur
in any document, i.e., they do not affect the retrieval process.
Formula (Bayes' rule): P(NR|y) · P(y) = P(y|NR) · P(NR), so
P(NR|y) = P(y|NR) · P(NR) / P(y)
• Where:
P(NR|y) = probability that document y is non-relevant
P(y|NR) = probability that, if a non-relevant document is
retrieved, it is y
P(NR) = probability of non-relevant documents in the collection
P(y) = probability that y is in the collection
BIM (Binary Independence) model
 Assume that document y is in the collection
 The probability that, if a non-relevant document is retrieved,
it is y, is P(y|NR) = 0.2
 The probability of non-relevant documents in the collection is
P(NR) = 0.6
 The probability of y in the collection is P(y) = 0.4
a. What is the probability that y is a non-relevant document?
b. Is the document relevant or non-relevant?
Answer
P(NR|y) = P(y|NR) · P(NR) / P(y)
Cont..
a. P(NR|y) = P(y|NR) · P(NR) / P(y)
   = 0.2 × 0.6 / 0.4 = 0.3
b. P(NR|y) + P(R|y) = 1
   P(R|y) = 1 − P(NR|y) = 1 − 0.3 = 0.7
So the document is relevant.
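The same arithmetic as a quick check in Python (numbers taken from the slide):

```python
p_y_given_nr = 0.2                        # P(y|NR)
p_nr = 0.6                                # P(NR)
p_y = 0.4                                 # P(y)

p_nr_given_y = p_y_given_nr * p_nr / p_y  # Bayes' rule: 0.3
p_r_given_y = 1 - p_nr_given_y            # complement: 0.7 -> relevant
print(round(p_nr_given_y, 2), round(p_r_given_y, 2))   # 0.3 0.7
```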
Question
• If the probability that a document is relevant equals the
probability that it is non-relevant, how can we decide whether
the document is relevant or not?
Exercise

 The TREC (Text Retrieval Conference) collection has a total
of 1.8 million documents for experimentation. Assume that a
document called ‘m’ is found in this collection.
 The probability that document ‘m’ is found in the collection is
0.2; the probability of relevant documents in the collection is
0.4; the probability that a retrieved non-relevant document is
‘m’ is 0.3.
 What is the probability that ‘m’ is a non-relevant document?
Is document ‘m’ relevant or not?
BM25 Okapi weighting scheme

• According to Vu & Huy (2015), BM25 (where BM stands for
Best Match) is a ranking function that sorts documents
according to their relevance to a specific query; it was first
implemented in the Okapi information retrieval system.
• It uses two free parameters, k1 and b, that distinguish it from
other ranking methods.
• In information retrieval, Okapi BM25 is a ranking function
used by search engines to rank matching documents according
to their relevance to a given search query.
Cont…
• It was first implemented in the “Okapi information retrieval
system” at City University, London, in the 1980s and 1990s.
• It is based on the probabilistic retrieval model discussed above
but pays attention to the term frequencies and document
length.
Contingency Table
This gives the scoring function:
F4 (BM25(D,Q)) = a best-match scoring function that computes
RSJ (Robertson–Spärck Jones) relevance weights
Definition

 r = number of relevant documents that contain the term.
 n – r = number of non-relevant documents that contain the term.
n = number of documents that contain the term.
R – r = number of relevant documents that do not contain the
term.
N – n – R + r = number of non-relevant documents that do not
contain the term.
N - n = number of documents that do not contain the term.
R = number of relevant documents.
N – R = number of non-relevant documents.
N = number of documents in the collection.
k = a smoothing correction usually set to 0.5
Cont…

 k1, k2 and K are parameters whose values are set empirically
 dl is the document length, avdl the average document length
 K = k1 · ((1 − b) + b · dl/avdl)
 Typical TREC values: k1 = 1.2, k2 from 0 to 1000, b = 0.75
BM25 Example
 Query with two terms, “president lincoln” (qf = 1 for each)
 No relevance information (r and R are zero)
 N = 500,000 documents
 “president” occurs in 40,000 documents (n1 = 40,000)
 “lincoln” occurs in 300 documents (n2 = 300)
 “president” occurs 15 times in the doc (f1 = 15)
 “lincoln” occurs 25 times (f2 = 25)
 document length is 90% of the average length (dl/avdl = 0.9)
 k1 = 1.2, b = 0.75, and k2 = 100
 K = 1.2 · (0.25 + 0.75 · 0.9) = 1.11
Answer
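The answer slide is an image; below is a minimal sketch (an illustrative reconstruction, not the slides' code) that plugs the numbers above into the BM25 form built from the RSJ weight with r = R = 0, using the natural logarithm as is common for this example:

```python
import math

N = 500_000                    # documents in the collection
k1, k2, b = 1.2, 100, 0.75
dl_over_avdl = 0.9
K = k1 * ((1 - b) + b * dl_over_avdl)   # 1.2 * (0.25 + 0.675) = 1.11

def bm25_term(n, f, qf, r=0, R=0):
    """Contribution of one query term: RSJ weight times tf and qf factors.
    n: docs containing the term, f: its frequency in the document,
    qf: its frequency in the query."""
    rsj = math.log(((r + 0.5) / (R - r + 0.5)) /
                   ((n - r + 0.5) / (N - n - R + r + 0.5)))
    return rsj * ((k1 + 1) * f / (K + f)) * ((k2 + 1) * qf / (k2 + qf))

score = (bm25_term(n=40_000, f=15, qf=1)    # "president": ~5.00
         + bm25_term(n=300, f=25, qf=1))    # "lincoln":   ~15.62
print(round(score, 2))                      # ~20.63
```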


Comparison of IR models

• Compare and contrast IR models.
