
UNIT II

MODELING AND RETRIEVAL EVALUATION


Basic IR Models - Boolean Model - TF-IDF (Term Frequency/Inverse Document
Frequency) Weighting - Vector Model – Probabilistic Model – Latent Semantic Indexing
Model – Neural Network Model – Retrieval Evaluation – Retrieval Metrics – Precision
and Recall – Reference Collection – User-based Evaluation – Relevance Feedback and
Query Expansion – Explicit Relevance Feedback.

Modeling
 Modeling in IR is a complex process aimed at producing a ranking function.
 Ranking function: a function that assigns scores to documents with regard to a given
query.
 This process consists of two main tasks:
 The conception of a logical framework for representing documents and queries
 The definition of a ranking function that quantifies the similarity between
documents and queries
 IR systems usually adopt index terms to index and retrieve documents
IR Model Definition:
An IR model can be defined as a quadruple [D, Q, F, R(qi, dj)], where D is a set of logical representations of the documents in the collection, Q is a set of logical representations of the user information needs (queries), F is a framework for modeling document representations, queries, and their relationships, and R(qi, dj) is a ranking function that associates a real number with a query qi and a document representation dj.
Types of Information Retrieval (IR) Model
 An information retrieval (IR) model can be classified into the following three
types:
 Classical IR Model
 Non-Classical IR Model
 Alternative IR Model

Classical IR Model
It is the simplest IR model and the easiest to implement. This model is based on mathematical
knowledge that is easily recognized and understood. Boolean, vector, and
probabilistic are the three classical IR models.

Non-Classical IR Model
It stands in contrast to the classical IR model. Such IR models are based on
principles other than similarity, probability, and Boolean operations. The information logic model,
the situation theory model, and interaction models are examples of non-classical IR models.

Alternative IR Model
It enhances the classical IR model with specific techniques drawn from
other fields. The cluster model, the fuzzy model, and latent semantic indexing (LSI) models
are examples of alternative IR models.

Classic IR model:
 Each document is described by a set of representative keywords called index terms.
 Numerical weights are assigned to index terms to reflect how relevant each term is to a document.
 Three classic models:
 Boolean,
 Vector,
 Probabilistic
BOOLEAN MODEL
The Boolean retrieval model is a model for information retrieval in which we
can pose any query which is in the form of a Boolean expression of terms, that is, in
which terms are combined with the operators AND, OR, and NOT. The model views each
document as just a set of words. Retrieval is based on a binary decision criterion (a document
is either relevant or not relevant) without any notion of a grading scale. Boolean expressions
have precise semantics.
It is the oldest information retrieval (IR) model. The model is based on set theory and
the Boolean algebra, where documents are sets of terms and queries are Boolean
expressions on terms. The Boolean model can be defined as −
 D − A set of words, i.e., the indexing terms present in a document. Here, each
term is either present (1) or absent (0).
 Q − A Boolean expression, where the terms are index terms and the operators are
the logical product (AND), the logical sum (OR), and the logical difference (NOT)
 F − Boolean algebra over sets of terms as well as over sets of documents
For relevance prediction in the Boolean IR model, the following can be defined −
 R − A document is predicted as relevant to the query expression if and only if it
satisfies the query expression, for example:
((text ∨ information) ∧ retrieval ∧ ¬ theory)
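As a minimal sketch (the helper name and term sets below are hypothetical, not from the text), this predicate can be evaluated directly over a document viewed as a set of index terms:

```python
# A sketch of the relevance predicate above: a document, viewed as a set of
# index terms, is relevant iff it satisfies
# ((text OR information) AND retrieval AND NOT theory).
def is_relevant(doc_terms: set) -> bool:
    return (("text" in doc_terms or "information" in doc_terms)
            and "retrieval" in doc_terms
            and "theory" not in doc_terms)

print(is_relevant({"information", "retrieval", "models"}))  # True
print(is_relevant({"information", "retrieval", "theory"}))  # False
```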
In this model, a query term is an unambiguous definition of a set of
documents. For example, the query term “economic” defines the set of documents that are
indexed with the term “economic”.
Now, what would be the result of combining terms with the Boolean AND
operator? It will define a document set that is smaller than or equal to the document set
of any of the single terms. For example, a query with the terms “social” and
“economic” will produce the set of documents that are indexed with both
terms. In other words, it is the intersection of both sets.
Now, what would be the result of combining terms with the Boolean OR operator?
It will define a document set that is bigger than or equal to the document set of any of
the single terms. For example, a query with the terms “social” or “economic”
will produce the set of documents that are indexed with either the term
“social” or the term “economic”. In other words, it is the union of both sets.
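A minimal sketch of these set semantics, with made-up document identifiers:

```python
# Each term maps to the set of documents indexed with it (a postings set);
# AND is set intersection, OR is set union, NOT is the complement against
# the whole collection.
postings = {
    "social":   {1, 2, 5},
    "economic": {2, 3, 5, 8},
}
all_docs = {1, 2, 3, 4, 5, 6, 7, 8}

print(postings["social"] & postings["economic"])  # AND -> {2, 5}
print(postings["social"] | postings["economic"])  # OR  -> {1, 2, 3, 5, 8}
print(all_docs - postings["social"])              # NOT -> {3, 4, 6, 7, 8}
```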
A way to avoid linearly scanning the texts for each query is to index the
documents in advance. Let us stick with Shakespeare’s Collected Works and use it to
introduce the basics of the Boolean retrieval model. Suppose we record for each document –
here a play of Shakespeare’s – whether it contains each word out of all the words
Shakespeare used (Shakespeare used about 32,000 different words). The result is a binary
term-document incidence matrix, as in the figure below. Terms are the indexed units; they are usually
words, and for the moment you can think of them as words.

Figure: A term-document incidence matrix. Matrix element (t, d) is 1 if the play in column d contains the word in row t, and is 0 otherwise.
To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the
vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND:
110100 AND 110111 AND 101111 = 100100
Answer: The documents matching this query are thus Antony and Cleopatra and Hamlet.
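The same bitwise evaluation can be sketched in a few lines of Python; the play order and incidence rows below reproduce the worked example above, not an actual index:

```python
# Bitwise evaluation of Brutus AND Caesar AND NOT Calpurnia over a binary
# term-document incidence matrix, one bit per play (leftmost bit = first play).
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]          # assumed column order

incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,   # complementing this row gives 101111
}

mask = (1 << len(plays)) - 1  # keep the complement within six bits

result = (incidence["Brutus"] & incidence["Caesar"]
          & (~incidence["Calpurnia"] & mask))

# Decode the answer vector back into play titles.
answer = [p for i, p in enumerate(plays)
          if (result >> (len(plays) - 1 - i)) & 1]
print(f"{result:06b}", answer)  # 100100 ['Antony and Cleopatra', 'Hamlet']
```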
Let us now consider a more realistic scenario, simultaneously using the opportunity to introduce some
terminology and notation. Suppose we have N = 1 million documents. By documents we
mean whatever units we have decided to build a retrieval system over. They might be
individual memos or chapters of a book. We will refer to the group of documents over
which we perform retrieval as the COLLECTION. It is sometimes also referred to as a
corpus.
If each document is about 1,000 words long and we assume an average of 6 bytes per word
including spaces and punctuation, then this is a document collection about 6 GB in size.
Typically, there might be about M = 500,000 distinct terms in these documents. There is
nothing special about the numbers we have chosen, and they might vary by an order of
magnitude or more, but they give us some idea of the dimensions of the kinds of problems
we need to handle.

Advantages of the Boolean Model

The advantages of the Boolean model are as follows −
 The simplest model, which is based on sets.
 Easy to understand and implement.
 It retrieves only exact matches, so the results are easy to explain.
 It gives the user a sense of control over the system.

Disadvantages of the Boolean Model


The disadvantages of the Boolean model are as follows −
 The model’s similarity function is Boolean, so there are no partial matches. This
can be frustrating for users.
 The choice of Boolean operators has much more influence on the result than any
individual query word.
 The query language is expressive, but it is also complicated.
 No ranking of retrieved documents.

TF-IDF (TERM FREQUENCY/INVERSE DOCUMENT FREQUENCY) WEIGHTING

Term Frequency (tfij)
It may be defined as the number of occurrences of the word wi in the document dj. The
information captured by term frequency is how salient a word is within the given document:
the higher the term frequency, the better that word describes the content of the document.
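As a minimal sketch, raw term frequency is just a per-document word count (the whitespace tokenization here is a simplifying assumption):

```python
from collections import Counter

# Raw term frequency: the number of occurrences of each word in one document.
def term_frequencies(doc_tokens):
    return Counter(doc_tokens)

print(term_frequencies("to be or not to be".split()))
# Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```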

Document Frequency (dfi)


It may be defined as the total number of documents in the collection in which wi
occurs. It is an indicator of the informativeness of a term: semantically focused words occur
several times in a document, unlike semantically unfocused words.

We assign to each term in a document a weight that depends on the
number of occurrences of the term in the document. We would like to compute a score
between a query term t and a document d, based on the weight of t in d. The simplest
approach is to assign the weight to be equal to the number of occurrences of term t in
document d. This weighting scheme is referred to as term frequency and is denoted tft,d,
with the subscripts denoting the term and the document, in order.

Inverse document frequency


This weighting is often called idf
weighting or inverse document frequency weighting. The key idea of idf weighting
is that a term’s scarcity across the collection is a measure of its importance, and
importance is inversely proportional to frequency of occurrence.
Raw term frequency as above suffers from a critical problem: all terms are
considered equally important when it comes to assessing relevance to a query. In fact,
certain terms have little or no discriminating power in determining relevance. For
instance, a collection of documents on the auto industry is likely to
have the term auto in almost every document. We therefore need a mechanism for attenuating
the effect of terms that occur too often in the collection to be meaningful for relevance
determination. An immediate idea is to scale down the term weights of terms with high
collection frequency, defined to be the total number of occurrences of a term in the
collection. The idea would be to reduce the tf weight of a term by a factor that grows
with its collection frequency.
Mathematically,

idft = log(N / nt)

Here,
N = total number of documents in the collection
nt = number of documents containing term t

Tf-idf weighting
We now combine the definitions of term frequency and inverse document frequency,
to produce a composite weight for each term in each document.
The tf-idf weighting scheme assigns to term t a weight in document d given by

tf-idft,d = tft,d × idft
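A minimal sketch of this weighting over a toy collection follows; the whitespace tokenization and the raw-count tf are simplifying assumptions:

```python
import math
from collections import Counter

docs = ["the auto industry builds autos",
        "auto insurance and car insurance",
        "best car on the market"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# nt: the number of documents in which term t occurs (document frequency).
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term)                        # tf_{t,d}
    idf = math.log(N / df[term]) if df[term] else 0.0  # idf_t = log(N / n_t)
    return tf * idf

print(tf_idf("insurance", tokenized[1]))  # 2 * log(3/1) ≈ 2.197
print(tf_idf("the", tokenized[0]))        # 1 * log(3/2) ≈ 0.405 (common term)
```

Note how the common term "the" receives a much lower weight than the rarer, more discriminating term "insurance", which is exactly the attenuation described above.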
VECTOR MODEL
The vector model assigns non-binary weights to index terms in queries and in documents
and computes the similarity between each document and the query. It is more precise than
the Boolean model.
Due to the disadvantages of the Boolean model, Gerard Salton and his
colleagues suggested a model, which is based on Luhn’s similarity criterion. The similarity
criterion formulated by Luhn states, “the more two representations agreed in given
elements and their distribution, the higher would be the probability of their representing
similar information.”
Consider the following important points to understand more about the Vector Space Model

 The index representations (documents) and the queries are considered as vectors
embedded in a high dimensional Euclidean space.
 The similarity measure of a document vector to a query vector is usually the cosine
of the angle between them.
Cosine Similarity Measure Formula
Cosine is a normalized dot product, which can be calculated with the help of the following
formula −

cos(q, d) = (q · d) / (|q| |d|)

where q · d is the dot product of the query and document weight vectors and |q|, |d| are their Euclidean lengths.
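A minimal sketch of this measure for two term-weight vectors of equal length:

```python
import math

# Cosine similarity between two term-weight vectors, given as equal-length
# lists of floats (one weight per index term).
def cosine(q, d):
    dot = sum(a * b for a, b in zip(q, d))
    nq = math.sqrt(sum(a * a for a in q))
    nd = math.sqrt(sum(b * b for b in d))
    return dot / (nq * nd) if nq and nd else 0.0
```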

Vector Space Representation with Query and Document


In this example, the query and the documents are represented in a two-dimensional vector
space whose axes are the terms car and insurance. There is one query and there are three
documents in the vector space.

The top-ranked document in response to the terms car and insurance will be
document d2, because the angle between q and d2 is the smallest. The reason is
that both concepts, car and insurance, are salient in d2 and hence have high weights.
Although d1 and d3 also mention both terms, in each case one of them is
not a centrally important term in the document.
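A short sketch of this ranking, using made-up weights on the axes (car, insurance) chosen so that both terms are salient in d2, as in the discussion above:

```python
import math

def cosine(q, d):  # as in the sketch earlier
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (math.sqrt(sum(a * a for a in q))
                  * math.sqrt(sum(b * b for b in d)))

q = [0.7, 0.7]                 # query: car insurance
docs = {"d1": [0.9, 0.1],      # car dominant, insurance marginal
        "d2": [0.6, 0.7],      # both terms salient
        "d3": [0.1, 0.9]}      # insurance dominant, car marginal

ranked = sorted(docs, key=lambda name: cosine(q, docs[name]), reverse=True)
print(ranked)  # ['d2', 'd1', 'd3'] -- d2 has the smallest angle to q
```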
