
Chapter 4

IR Models
Introduction to Information Retrieval Models
At the end of this chapter every student must be able to:
 Define what a model is
 Describe why a model is needed in information retrieval
 Differentiate the different types of information retrieval models:
 Boolean model
 Vector space model
 Probabilistic model
 Know how to calculate the similarity of documents to a given query
 Identify term frequency, document frequency, inverse document
frequency, term weight and similarity measurements
What is a model?
• A model is an idealization or abstraction of actual processes (i.e.,
things that happen in the real world)
There are two good reasons for having models of IR:

1. Models guide research and provide the means for academic
discussion
2. Models can serve as a blueprint for implementing an actual
retrieval system
IR Models
• In IR, mathematical models are used to understand and
reason about some behavior or phenomenon in the real world
• A model of information retrieval predicts and explains what
a user will find relevant given the user's query
Retrieval model
• Thus, retrieval models are models that can describe the
computational processes (here, retrieval)
– e.g., how documents are ranked
– e.g., how similarities are measured
• They are models that attempt to describe the human process
– e.g., the information need, interaction
• They are models that specify the details of
– Document representation
– Query representation
– Retrieval function (matching function)
– Ranking
Retrieval Models
• A number of IR models have been proposed over the years to
retrieve information
• The following are the major models developed to retrieve
information
– Boolean model
• Exact match model
– Statistical models
• Vector space and probabilistic models are the major
statistical models
• Are “best match” models
– Linguistic and knowledge based models
• Are “best match models”
What is the difference between best match and exact match?
Types of models
• The three classic information retrieval models are:
– Boolean retrieval models
– Vector space models
– Probabilistic models
1. Boolean model
 A document either matches a query, or it does not.
 The Boolean retrieval model is a model for information
retrieval in which we can pose (create) any query in
the form of a Boolean expression of terms, that is,
one in which terms are combined with the operators AND, OR,
and NOT.
…..cont
 The first model of information retrieval
 The most criticized model
 Based on Boolean algebra, developed by George Boole
• Boole defined 3 basic operators:
 AND
 OR
 NOT
……cont
• Boolean relevance prediction (R)
– A document is predicted as relevant to a query iff it satisfies the
query expression
– Each query term specifies the set of documents containing it
• AND (∧): the intersection of two sets
• OR (∨): the union of two sets
• NOT (¬): set inverse, or really set difference
– A query thus searches a set of documents to determine their content
– The search engine retrieves those documents satisfying the logical
constraints of the query
……cont
• There is an assumption about document representation before retrieval
– Documents and queries are represented as sets of index terms
• Basis for the majority of DBMSs and conventional IR systems
– Database systems use Boolean logic for searching
• Document (how a document is viewed in the Boolean model)
– An object: a set consisting of terms
– That is, documents are sets of terms
• Term (how a term is viewed in the Boolean model)
– Terms are the things we use to describe concepts in a particular
domain
– The vocabulary grows as new terms are introduced
….cont

• Most queries search for more than one term
– Information need: find all documents
containing “information” and “retrieval”
Answer: Only documents containing both “information” and
“retrieval” satisfy this query
– Information need: find all documents
containing “information” or “retrieval”
Answer: Satisfied by a document that contains either of the
two words, or both
Boolean model

• Consider a set of five docs and assume that they contain the
terms shown in the table

Doc.  Terms
D1    algorithm, information, retrieval
D2    retrieval, science
D3    algorithm, information, science
D4    pattern, retrieval, science
D5    science, algorithm

Find the documents retrieved by the following expressions:
a. information AND retrieval
   Answer: {d1, d3} ∩ {d1, d2, d4} = {d1}
b. information OR retrieval
   Answer: {d1, d3} ∪ {d1, d2, d4} = {d1, d2, d3, d4}
Examples
• Example 2
– Information need
Question: Find all documents containing “information” and
“retrieval”, or not containing “retrieval” but containing “science”
– Query/Boolean expression
• (information AND retrieval) OR (NOT retrieval AND science)
• Parentheses avoid ambiguity

Exercise

• If the following are our documents, which documents can be
retrieved for the given queries? The documents are:
Doc1: Computer Information Retrieval
Doc2: Computer Retrieval
Doc3: Information
Doc4: Computer Information
Query1: Information AND Retrieval
Query2: (Information AND Retrieval) OR (NOT Computer)
Advantages and disadvantages of the Boolean model
• Advantages of the Boolean model
A very simple model based on sets (easy for experts)
Computationally efficient
Expressiveness and clarity
Still a dominant model in commercial database systems
Disadvantages of Boolean model
• Disadvantages
Users need to be trained
Very limited in expressing the user's information need in detail
No weighting of index or query terms
Based on exact matching: a relevant document that only
partially matches the query will not be retrieved
Vector Space Model
Suggested by Peter Luhn and Gerard Salton
A classic model of document retrieval based on representing
documents and queries as vectors
 Partial matching is possible
Retrieval is based on the similarity between the query vector
and the document vectors
The output documents are ranked according to this similarity
Example
• Document vector and query vector
….cont
• The similarity is based on the occurrences of the keywords in
the query and the document
• The angle between the query and a document is measured using
cosine similarity, since both documents and the user's query
are represented as vectors in VSM
…cont
• VSM assumes that if document vector V1 is closer to the query
than another document vector V2, then:
 The document represented by V1 is more relevant
than the one represented by V2
In VSM, to decide the similarity of a document to the given
query, term weighting (tf*idf) is used. To calculate tf*idf, we
first have to calculate the following:
1. Term frequency (tf)
• Term frequency is the number of times a given term
appears in a document
tf = frequency of the term / maximum term frequency
(within a single document)
2. Inverse document frequency (idf)
• IDF is used to measure whether a term is common or
rare across all documents
• idf = log2(N/df), where
N = total number of documents
df = document frequency (number of documents
containing the given term)
3. Term weighting (tf*idf)
• Term weight: w(i,j) = tf(i,j) * idf(i) = tf(i,j) * log2(N/df(i))
4. Document length
After calculating the term weights we have to
calculate the length of each document
Document length = the square root of the sum of the squared
term weights: |d| = sqrt(Σ w(i,j)²)
5. Similarity
At the end we need to calculate the similarity of the
documents to the query.
The widely used measure of similarity in the vector space
model is the cosine similarity.
The cosine similarity between the vectors dj (the document
vector) and q (the query vector) is given by:

sim(dj, q) = (dj · q) / (|dj| × |q|)

• That is, the numerator is the dot product of the two weight
vectors, and the denominator is the length of the document
times the length of the query.
Example 1
• Example 1: If the following three documents are given with
one query, which document must be ranked first?
Doc1: new york times
Doc2: new york post
Doc3: los angeles times
Query: new new times
Solution
(Steps 1–5 on the slides: term frequencies, idf values, tf*idf
weights, document lengths, and the final cosine similarities)
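Because the worked steps above are slide images, here is a minimal sketch (an illustrative reconstruction, not the slides' own code) that follows the recipe from the previous slides — tf = frequency / maximum frequency, idf = log2(N/df), weight = tf·idf, then cosine similarity; applying the same tf normalization to the query is an assumption:

```python
import math
from collections import Counter

docs = {
    "Doc1": "new york times",
    "Doc2": "new york post",
    "Doc3": "los angeles times",
}
query = "new new times"
N = len(docs)

# df: number of documents containing each term.
df = Counter()
for text in docs.values():
    df.update(set(text.split()))

def tfidf(text):
    """Weight vector per the slides: tf = freq / max freq in the text,
    idf = log2(N / df), weight = tf * idf."""
    freqs = Counter(text.split())
    max_f = max(freqs.values())
    return {t: (f / max_f) * math.log2(N / df[t])
            for t, f in freqs.items() if t in df}

q = tfidf(query)
q_len = math.sqrt(sum(w * w for w in q.values()))       # query length

for name, text in docs.items():
    d = tfidf(text)
    d_len = math.sqrt(sum(w * w for w in d.values()))   # document length
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())  # numerator
    print(name, round(dot / (d_len * q_len), 3))
# Doc1 0.775, Doc2 0.293, Doc3 0.113 -> Doc1 is ranked first
```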
Exercise 1
Suppose the database collection consists of three documents
(N = 3) with the following content. Rank the documents by
their similarity to the query. Ignore stop words.
Doc1: "Shipment of gold damaged in a fire"
Doc2: "Delivery of silver arrived in a silver truck"
Doc3: "Shipment of gold arrived in a truck"
Query: "gold silver truck"
Exercise 2
 Which document must be ranked first for the following?
Doc1: Breakthrough drug for health
Doc2: New health drug
Doc3: New approach for treatment of health
Doc4: New hopes for health patients
Query: Treatment of health patients
Exercise 3
 From the following documents, which one must be
ranked first? Remove stopwords.
Doc1: The health observances for march
Doc2: The health oriented calendar
Doc3: The awareness news for march awareness
Query: March health awareness
Advantages and Disadvantages of VSM
Latent semantic indexing
Latent Semantic Indexing (LSI) is an extension of the vector
space retrieval method (Deerwester et al., 1990).
LSI can retrieve relevant documents even when they do not
share any words with the query:
 if only a synonym of a query keyword is present in a document,
the document will still be found relevant.
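A minimal numerical sketch of this idea (illustrative only: the toy term-document counts and the query are made up, not from the slides). It takes a truncated SVD of a term-document matrix and compares the query to the documents in the reduced latent space:

```python
import numpy as np

# Rows = terms, columns = three toy documents (counts are invented).
# Doc1 and Doc2 are "financial", Doc3 is about fishing.
terms = ["bank", "mortgage", "loans", "rates", "fish", "stream", "river"]
A = np.array([
    [1, 0, 0],   # bank     (only in Doc1)
    [1, 1, 0],   # mortgage
    [0, 1, 0],   # loans
    [1, 1, 0],   # rates
    [0, 0, 1],   # fish
    [0, 0, 1],   # stream
    [0, 0, 1],   # river
], dtype=float)

k = 2                                        # latent dimensions kept
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Dk = U[:, :k], s[:k], Vt[:k, :].T    # Dk rows = docs in latent space

# Fold the query into the latent space: q_hat = q^T * U_k * S_k^(-1)
q = np.zeros(len(terms))
q[terms.index("bank")] = 1.0                 # query: "bank"
q_hat = (q @ Uk) / sk

for j, d_hat in enumerate(Dk, start=1):
    cos = (q_hat @ d_hat) / (np.linalg.norm(q_hat) * np.linalg.norm(d_hat))
    print(f"Doc{j}: {cos:.2f}")
# Doc1: 1.00, Doc2: 1.00, Doc3: 0.00 -- Doc2 contains no query term at
# all, yet lands on the same latent "finance" direction as Doc1, while
# plain term matching would have scored it zero.
```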
Example
Cont..
• For example, the word bank when used together with
mortgage, loans, and rates probably means a financial
institution.

• However, the word bank when used together with lures,
casting, and fish probably means a stream or river bank.
Probabilistic Model
Given a user information need (represented as a query) and a
collection of documents (transformed into document
representations), a system must determine how well the
documents satisfy the query.

An IR system has an uncertain understanding of the user
query, and makes an uncertain guess of whether a document
satisfies the query.
Cont….
• Probability theory provides a principled foundation for such
reasoning under uncertainty

• Probabilistic models exploit this foundation to estimate how
likely it is that a document is relevant to a query.
Cont..

• A probabilistic formula is used to calculate P(D|R), in place of
the vector similarity formula (e.g., cosine similarity) used to
calculate the relevance ranking in the vector space model.

• The probability formula depends on the specific model used,
and also on the assumptions made about the distribution of
terms, e.g., how terms are distributed over documents in the
set of relevant documents and in the set of non-relevant
documents.
Cont..
In the case of P(R), the sample space might be {relevant,
irrelevant}, and we might define the random variable R to
take the values {0, 1}, where 0 = irrelevant and 1 = relevant.
If we know the number of relevant documents in the
collection, say 100 documents are relevant, and we know the
total number of documents in the collection, say 1 million,
then the quotient of the two defines the probability of
relevance: P(R=1) = 100/1,000,000 = 0.0001.
IR as Classification
Why probabilities in IR?
BIM (Binary Independence Model)
• We assume here that the relevance of each document is
independent of the relevance of other documents.

• Under the BIM, we model the probability P(R|d, q) that a
document is relevant via the probability in terms of term
incidence vectors, P(R|x, q).
Cont…

Many assumptions are made in the Binary Independence Model.
Let us summarize them here:
 The documents are independent of each other.
 The terms in a document are independent of each other.
 The terms not present in the query are equally likely to occur
in any document, i.e., they do not affect the retrieval process.
Formula (Bayes' rule): P(NR|y) · P(y) = P(y|NR) · P(NR), so
P(NR|y) = P(y|NR) · P(NR) / P(y)
• Where:
P(NR|y) = probability that document y is non-relevant
P(y|NR) = probability that, if a non-relevant document is
retrieved, it is y
P(NR) = probability of non-relevant documents in the collection
P(y) = probability that y is in the collection
BIM (Binary Independence) model
 Assume that document y is in the collection
 The probability that, if a non-relevant document is retrieved,
it is y, is P(y|NR) = 0.2
 The probability of non-relevant documents in the collection is
P(NR) = 0.6
 The probability of y in the collection is P(y) = 0.4
a. What is the probability that y is a non-relevant document?
b. Is the document relevant or non-relevant?
Answer
P(NR|y) = P(y|NR) · P(NR) / P(y)
Cont..
a. P(NR|y) = P(y|NR) · P(NR) / P(y)
   = 0.2 × 0.6 / 0.4 = 0.3
b. P(NR|y) + P(R|y) = 1
   P(R|y) = 1 − P(NR|y) = 1 − 0.3 = 0.7
So the document is relevant.
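The same arithmetic as a quick check in Python (numbers taken from the slide):

```python
p_y_given_nr = 0.2                        # P(y|NR)
p_nr = 0.6                                # P(NR)
p_y = 0.4                                 # P(y)

p_nr_given_y = p_y_given_nr * p_nr / p_y  # Bayes' rule: 0.3
p_r_given_y = 1 - p_nr_given_y            # complement: 0.7 -> relevant
print(round(p_nr_given_y, 2), round(p_r_given_y, 2))   # 0.3 0.7
```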
Question
• If the probability that a document is relevant equals the
probability that it is non-relevant, how can we decide whether
the document is relevant or not?
Exercise

 The TREC (Text Retrieval Conference) collection has a total
of 1.8 million documents for experimentation. Assume that a
document called ‘m’ is found in this collection.
 The probability that document ‘m’ is found in the collection is
0.2; the probability of relevant documents in the collection is
0.4; the probability that a retrieved non-relevant document is
‘m’ is 0.3.
 What is the probability that ‘m’ is a non-relevant document?
Is document ‘m’ relevant or not?
BM25 Okapi weighting scheme

• According to Vu & Huy (2015), BM25 (where BM stands for
Best Match) is a ranking function that sorts documents
according to their relevance to a specific query; it was first
implemented in the Okapi information retrieval system.
• It uses two free parameters, k1 and b, that distinguish it from
other ranking methods.
• In information retrieval, Okapi BM25 is a ranking function
used by search engines to rank matching documents according
to their relevance to a given search query.
Cont…
• It was first implemented in the “Okapi information retrieval
system” at City University, London, in the 1980s and 1990s.
• It is based on the probabilistic retrieval model discussed above
but pays attention to the term frequencies and document
length.
Contingency Table
This gives the scoring function:
F4 (BM25(D,Q)) = a best-match scoring function that computes
RSJ (Robertson–Spärck Jones) relevance weights
Definition

 r = number of relevant documents that contain the term.
 n – r = number of non-relevant documents that contain the term.
n = number of documents that contain the term.
R – r = number of relevant documents that do not contain the
term.
N – n – R + r = number of non-relevant documents that do not
contain the term.
N - n = number of documents that do not contain the term.
R = number of relevant documents.
N – R = number of non-relevant documents.
N = number of documents in the collection.
k = a smoothing correction usually set to 0.5
Cont…

 k1, k2 and K are parameters whose values are set empirically
 dl is the document length, avdl the average document length
 K = k1 · ((1 − b) + b · dl/avdl)
 Typical TREC values: k1 = 1.2, k2 from 0 to 1000, b = 0.75
BM25 Example
 Query with two terms, “president lincoln” (qf = 1 for each)
 No relevance information (r and R are zero)
 N = 500,000 documents
 “president” occurs in 40,000 documents (n1 = 40,000)
 “lincoln” occurs in 300 documents (n2 = 300)
 “president” occurs 15 times in the doc (f1 = 15)
 “lincoln” occurs 25 times (f2 = 25)
 document length is 90% of the average length (dl/avdl = 0.9)
 k1 = 1.2, b = 0.75, and k2 = 100
 K = 1.2 · (0.25 + 0.75 · 0.9) = 1.11
Answer
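The answer slide is an image; below is a minimal sketch (an illustrative reconstruction, not the slides' code) that plugs the numbers above into the BM25 form built from the RSJ weight with r = R = 0, using the natural logarithm as is common for this example:

```python
import math

N = 500_000                    # documents in the collection
k1, k2, b = 1.2, 100, 0.75
dl_over_avdl = 0.9
K = k1 * ((1 - b) + b * dl_over_avdl)   # 1.2 * (0.25 + 0.675) = 1.11

def bm25_term(n, f, qf, r=0, R=0):
    """Contribution of one query term: RSJ weight times tf and qf factors.
    n: docs containing the term, f: its frequency in the document,
    qf: its frequency in the query."""
    rsj = math.log(((r + 0.5) / (R - r + 0.5)) /
                   ((n - r + 0.5) / (N - n - R + r + 0.5)))
    return rsj * ((k1 + 1) * f / (K + f)) * ((k2 + 1) * qf / (k2 + qf))

score = (bm25_term(n=40_000, f=15, qf=1)    # "president": ~5.00
         + bm25_term(n=300, f=25, qf=1))    # "lincoln":   ~15.62
print(round(score, 2))                      # ~20.63
```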


Comparison of IR models

• Compare and contrast IR models.
