
UNIT II

MODELING AND RETRIEVAL EVALUATION


Basic IR Models - Boolean Model - TF-IDF (Term Frequency/Inverse Document
Frequency) Weighting - Vector Model – Probabilistic Model – Latent Semantic Indexing
Model – Neural Network Model – Retrieval Evaluation – Retrieval Metrics – Precision
and Recall – Reference Collection – User-based Evaluation – Relevance Feedback and
Query Expansion – Explicit Relevance Feedback.

Modeling
 Modeling in IR is a complex process aimed at producing a ranking function.
 Ranking function: a function that assigns scores to documents with regard to a given
query.
 This process consists of two main tasks:
 The conception of a logical framework for representing documents and queries
 The definition of a ranking function that quantifies the similarity between
documents and queries
 IR systems usually adopt index terms to index and retrieve documents
IR Model Definition:
An IR model can be defined as a quadruple [D, Q, F, R(qi, dj)], where D is a set of logical representations of the documents in the collection, Q is a set of logical representations of the user information needs (queries), F is a framework for modeling document representations, queries, and their relationships, and R(qi, dj) is a ranking function that associates a real number with a query qi and a document representation dj.
Types of Information Retrieval (IR) Model
 An information retrieval (IR) model can be classified into the following three
types:
 Classical IR Model
 Non-Classical IR Model
 Alternative IR Model

Classical IR Model
It is the simplest IR model and the easiest to implement. This model is based on mathematical
knowledge that is easily recognized and understood. Boolean, vector, and
probabilistic are the three classical IR models.

Non-Classical IR Model
It stands in contrast to the classical IR model. Such IR models are based on
principles other than similarity, probability, and Boolean operations. The information logic model,
the situation theory model, and interaction models are examples of non-classical IR models.

Alternative IR Model
It enhances the classical IR model with specific techniques drawn from
other fields. The cluster model, the fuzzy model, and latent semantic indexing (LSI) models
are examples of alternative IR models.

Classic IR model:
 Each document is described by a set of representative keywords called index terms.
 Numerical weights are assigned to index terms to reflect how relevant each term is to a document.
 Three classic models:
 Boolean,
 Vector,
 Probabilistic
BOOLEAN MODEL
The Boolean retrieval model is a model for information retrieval in which we
can pose any query which is in the form of a Boolean expression of terms, that is, in
which terms are combined with the operators AND, OR, and NOT. The model views each
document as just a set of words. Retrieval is based on a binary decision criterion (a document
is either relevant or not relevant) without any notion of a grading scale. Boolean expressions
have precise semantics.
It is the oldest information retrieval (IR) model. The model is based on set theory and
the Boolean algebra, where documents are sets of terms and queries are Boolean
expressions on terms. The Boolean model can be defined as −
 D − A set of words, i.e., the indexing terms present in a document. Here, each
term is either present (1) or absent (0).
 Q − A Boolean expression, where the terms are index terms and the operators are
the logical product (AND), the logical sum (OR), and the logical difference (NOT)
 F − Boolean algebra over sets of terms as well as over sets of documents
For relevance prediction in the Boolean IR model, the following can be defined −
 R − A document is predicted as relevant to the query expression if and only if it
satisfies the query expression, for example:
((text ∨ information) ∧ retrieval ∧ ¬ theory)
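As a minimal sketch (the helper name and term sets below are hypothetical, not from the text), this predicate can be evaluated directly over a document viewed as a set of index terms:

```python
# A sketch of the relevance predicate above: a document, viewed as a set of
# index terms, is relevant iff it satisfies
# ((text OR information) AND retrieval AND NOT theory).
def is_relevant(doc_terms: set) -> bool:
    return (("text" in doc_terms or "information" in doc_terms)
            and "retrieval" in doc_terms
            and "theory" not in doc_terms)

print(is_relevant({"information", "retrieval", "models"}))  # True
print(is_relevant({"information", "retrieval", "theory"}))  # False
```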
In this model, a query term is an unambiguous definition of a set of
documents. For example, the query term “economic” defines the set of documents that are
indexed with the term “economic”.
Now, what would be the result of combining terms with the Boolean AND
operator? It will define a document set that is smaller than or equal to the document set
of any of the single terms. For example, a query with the terms “social” and
“economic” will produce the set of documents that are indexed with both
terms. In other words, it is the intersection of both sets.
Now, what would be the result of combining terms with the Boolean OR operator?
It will define a document set that is bigger than or equal to the document set of any of
the single terms. For example, a query with the terms “social” or “economic”
will produce the set of documents that are indexed with either the term
“social” or the term “economic”. In other words, it is the union of both sets.
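A minimal sketch of these set semantics, with made-up document identifiers:

```python
# Each term maps to the set of documents indexed with it (a postings set);
# AND is set intersection, OR is set union, NOT is the complement against
# the whole collection.
postings = {
    "social":   {1, 2, 5},
    "economic": {2, 3, 5, 8},
}
all_docs = {1, 2, 3, 4, 5, 6, 7, 8}

print(postings["social"] & postings["economic"])  # AND -> {2, 5}
print(postings["social"] | postings["economic"])  # OR  -> {1, 2, 3, 5, 8}
print(all_docs - postings["social"])              # NOT -> {3, 4, 6, 7, 8}
```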
A way to avoid linearly scanning the texts for each query is to index the
documents in advance. Let us stick with Shakespeare’s Collected Works and use it to
introduce the basics of the Boolean retrieval model. Suppose we record for each document –
here a play of Shakespeare’s – whether it contains each word out of all the words
Shakespeare used (Shakespeare used about 32,000 different words). The result is a binary
term-document incidence matrix, as in the figure below. Terms are the indexed units; they are usually
words, and for the moment you can think of them as words.

Figure: A term-document incidence matrix. Matrix element (t, d) is 1 if the play in column d contains the word in row t, and is 0 otherwise.
To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the
vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND:
110100 AND 110111 AND 101111 = 100100
Answer: The documents matching this query are thus Antony and Cleopatra and Hamlet.
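The same bitwise evaluation can be sketched in a few lines of Python; the play order and incidence rows below reproduce the worked example above, not an actual index:

```python
# Bitwise evaluation of Brutus AND Caesar AND NOT Calpurnia over a binary
# term-document incidence matrix, one bit per play (leftmost bit = first play).
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]          # assumed column order

incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,   # complementing this row gives 101111
}

mask = (1 << len(plays)) - 1  # keep the complement within six bits

result = (incidence["Brutus"] & incidence["Caesar"]
          & (~incidence["Calpurnia"] & mask))

# Decode the answer vector back into play titles.
answer = [p for i, p in enumerate(plays)
          if (result >> (len(plays) - 1 - i)) & 1]
print(f"{result:06b}", answer)  # 100100 ['Antony and Cleopatra', 'Hamlet']
```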
Let us now consider a more realistic scenario, simultaneously using the opportunity to introduce some
terminology and notation. Suppose we have N = 1 million documents. By documents we
mean whatever units we have decided to build a retrieval system over. They might be
individual memos or chapters of a book. We will refer to the group of documents over
which we perform retrieval as the COLLECTION. It is sometimes also referred to as a
corpus.
If each document is about 1,000 words long and we assume an average of 6 bytes per word
including spaces and punctuation, then this is a document collection about 6 GB in size.
Typically, there might be about M = 500,000 distinct terms in these documents. There is
nothing special about the numbers we have chosen, and they might vary by an order of
magnitude or more, but they give us some idea of the dimensions of the kinds of problems
we need to handle.

Advantages of the Boolean Model

The advantages of the Boolean model are as follows −
 The simplest model, which is based on sets.
 Easy to understand and implement.
 It retrieves only exact matches, so the results are easy to explain.
 It gives the user a sense of control over the system.

Disadvantages of the Boolean Model


The disadvantages of the Boolean model are as follows −
 The model’s similarity function is Boolean, so there are no partial matches. This
can be frustrating for users.
 The choice of Boolean operators has much more influence on the result than any
individual query word.
 The query language is expressive, but it is also complicated.
 No ranking of retrieved documents.

TF-IDF (TERM FREQUENCY/INVERSE DOCUMENT FREQUENCY) WEIGHTING

Term Frequency (tfij)
It may be defined as the number of occurrences of the word wi in the document dj. The
information captured by term frequency is how salient a word is within the given document:
the higher the term frequency, the better that word describes the content of the document.
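As a minimal sketch, raw term frequency is just a per-document word count (the whitespace tokenization here is a simplifying assumption):

```python
from collections import Counter

# Raw term frequency: the number of occurrences of each word in one document.
def term_frequencies(doc_tokens):
    return Counter(doc_tokens)

print(term_frequencies("to be or not to be".split()))
# Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```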

Document Frequency (dfi)


It may be defined as the total number of documents in the collection in which wi
occurs. It is an indicator of the informativeness of a term: semantically focused words occur
several times in a document, unlike semantically unfocused words.

We assign to each term in a document a weight that depends on the
number of occurrences of the term in the document. We would like to compute a score
between a query term t and a document d, based on the weight of t in d. The simplest
approach is to assign the weight to be equal to the number of occurrences of term t in
document d. This weighting scheme is referred to as term frequency and is denoted tft,d,
with the subscripts denoting the term and the document, in order.

Inverse document frequency


This weighting is often called idf
weighting or inverse document frequency weighting. The key idea of idf weighting
is that a term’s scarcity across the collection is a measure of its importance, and
importance is inversely proportional to frequency of occurrence.
Raw term frequency as above suffers from a critical problem: all terms are
considered equally important when it comes to assessing relevance to a query. In fact,
certain terms have little or no discriminating power in determining relevance. For
instance, a collection of documents on the auto industry is likely to
have the term auto in almost every document. We therefore need a mechanism for attenuating
the effect of terms that occur too often in the collection to be meaningful for relevance
determination. An immediate idea is to scale down the term weights of terms with high
collection frequency, defined to be the total number of occurrences of a term in the
collection. The idea would be to reduce the tf weight of a term by a factor that grows
with its collection frequency.
Mathematically,

idft = log(N / nt)

Here,
N = total number of documents in the collection
nt = number of documents containing term t

Tf-idf weighting
We now combine the definitions of term frequency and inverse document frequency,
to produce a composite weight for each term in each document.
The tf-idf weighting scheme assigns to term t a weight in document d given by

tf-idft,d = tft,d × idft
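A minimal sketch of this weighting over a toy collection follows; the whitespace tokenization and the raw-count tf are simplifying assumptions:

```python
import math
from collections import Counter

docs = ["the auto industry builds autos",
        "auto insurance and car insurance",
        "best car on the market"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# nt: the number of documents in which term t occurs (document frequency).
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term)                        # tf_{t,d}
    idf = math.log(N / df[term]) if df[term] else 0.0  # idf_t = log(N / n_t)
    return tf * idf

print(tf_idf("insurance", tokenized[1]))  # 2 * log(3/1) ≈ 2.197
print(tf_idf("the", tokenized[0]))        # 1 * log(3/2) ≈ 0.405 (common term)
```

Note how the common term "the" receives a much lower weight than the rarer, more discriminating term "insurance", which is exactly the attenuation described above.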
VECTOR MODEL
The vector model assigns non-binary weights to index terms in queries and in documents
and computes the similarity between each document and the query. It is more precise than
the Boolean model.
Due to the disadvantages of the Boolean model, Gerard Salton and his
colleagues suggested a model, which is based on Luhn’s similarity criterion. The similarity
criterion formulated by Luhn states, “the more two representations agreed in given
elements and their distribution, the higher would be the probability of their representing
similar information.”
Consider the following important points to understand more about the Vector Space Model

 The index representations (documents) and the queries are considered as vectors
embedded in a high dimensional Euclidean space.
 The similarity measure of a document vector to a query vector is usually the cosine
of the angle between them.
Cosine Similarity Measure Formula
Cosine is a normalized dot product, which can be calculated with the help of the following
formula −

cos(q, d) = (q · d) / (|q| |d|)

where q · d is the dot product of the query and document weight vectors and |q|, |d| are their Euclidean lengths.
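A minimal sketch of this measure for two term-weight vectors of equal length:

```python
import math

# Cosine similarity between two term-weight vectors, given as equal-length
# lists of floats (one weight per index term).
def cosine(q, d):
    dot = sum(a * b for a, b in zip(q, d))
    nq = math.sqrt(sum(a * a for a in q))
    nd = math.sqrt(sum(b * b for b in d))
    return dot / (nq * nd) if nq and nd else 0.0
```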

Vector Space Representation with Query and Document


In this example, the query and the documents are represented in a two-dimensional vector
space whose axes are the terms car and insurance. There is one query and there are three
documents in the vector space.

The top-ranked document in response to the terms car and insurance will be
document d2, because the angle between q and d2 is the smallest. The reason is
that both concepts, car and insurance, are salient in d2 and hence have high weights.
Although d1 and d3 also mention both terms, in each case one of them is
not a centrally important term in the document.
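A short sketch of this ranking, using made-up weights on the axes (car, insurance) chosen so that both terms are salient in d2, as in the discussion above:

```python
import math

def cosine(q, d):  # as in the sketch earlier
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (math.sqrt(sum(a * a for a in q))
                  * math.sqrt(sum(b * b for b in d)))

q = [0.7, 0.7]                 # query: car insurance
docs = {"d1": [0.9, 0.1],      # car dominant, insurance marginal
        "d2": [0.6, 0.7],      # both terms salient
        "d3": [0.1, 0.9]}      # insurance dominant, car marginal

ranked = sorted(docs, key=lambda name: cosine(q, docs[name]), reverse=True)
print(ranked)  # ['d2', 'd1', 'd3'] -- d2 has the smallest angle to q
```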
