Irt-23 Unit 2
Modeling
Modeling in IR is a complex process aimed at producing a ranking function.
Ranking function: a function that assigns scores to documents with regard to a given
query.
This process consists of two main tasks:
The conception of a logical framework for representing documents and queries
The definition of a ranking function that quantifies the similarity
between documents and queries
IR systems usually adopt index terms to index and retrieve documents
Types of Information Retrieval (IR) Model
An information retrieval (IR) model can be classified into one of the following three
types:
Classical IR Model
Non-Classical IR Model
Alternative IR Model
Classical IR Model
It is the simplest IR model and the easiest to implement, as it is based on
well-understood mathematical foundations. The Boolean, vector, and
probabilistic models are the three classical IR models.
Non-Classical IR Model
It is the opposite of the classical IR model: such models are based on
principles other than similarity, probability, or Boolean operations. The information logic model,
the situation theory model, and interaction models are examples of non-classical IR models.
Alternative IR Model
It enhances the classical IR model with specific techniques borrowed from
other fields. The cluster model, the fuzzy model, and the latent semantic indexing (LSI) model
are examples of alternative IR models.
Classic IR model:
Each document is described by a set of representative keywords called index terms.
Numerical weights are assigned to index terms to capture their relevance to each document.
Three classic models:
Boolean,
Vector,
Probabilistic
BOOLEAN MODEL
The Boolean retrieval model is a model for information retrieval in which we
can pose any query in the form of a Boolean expression of terms, that is, one in
which terms are combined with the operators AND, OR, and NOT. The model views each
document as just a set of words. It is based on a binary decision criterion (relevant or
not relevant) with no notion of a grading scale. Boolean expressions have precise semantics.
It is the oldest information retrieval (IR) model. The model is based on set theory and
Boolean algebra: documents are sets of terms, and queries are Boolean
expressions over terms. The Boolean model can be defined as −
D − A set of words, i.e., the indexing terms present in a document. Here, each
term is either present (1) or absent (0).
Q − A Boolean expression, where terms are the index terms and operators are
logical products − AND, logical sum − OR and logical difference − NOT
F − Boolean algebra over sets of terms as well as over sets of documents
In the Boolean IR model, relevance prediction can be defined as follows −
R − A document is predicted as relevant to the query expression if and only if it
satisfies the query expression, for example −
((text ∨ information) ∧ retrieval ∧ ¬theory)
We can explain this model by a query term as an unambiguous definition of a set of
documents. For example, the query term “economic” defines the set of documents that are
indexed with the term “economic”.
Now, what would be the result of combining terms with the Boolean AND
operator? It defines a document set that is smaller than or equal to the document set
of any of the single terms. For example, the query with terms “social” AND
“economic” produces the set of documents that are indexed with both
terms. In other words, it is the intersection of the two sets.
Now, what would be the result of combining terms with the Boolean OR operator?
It defines a document set that is larger than or equal to the document set of any of
the single terms. For example, the query with terms “social” OR “economic”
produces the set of documents indexed with either the term
“social” or the term “economic”. In other words, it is the union of the two sets.
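The set operations above can be sketched with a small inverted index over toy documents (the documents and terms here are illustrative, not from the text):

```python
# A minimal sketch of Boolean retrieval using Python sets.
docs = {
    1: "social and economic policy",
    2: "economic theory of retrieval",
    3: "social networks",
}

# Build an inverted index: term -> set of IDs of documents containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# AND is set intersection; OR is set union; NOT is set difference.
all_ids = set(docs)
social_and_economic = index["social"] & index["economic"]  # intersection
social_or_economic = index["social"] | index["economic"]   # union
not_theory = all_ids - index["theory"]                     # complement
```

With these documents, “social AND economic” matches only document 1, while “social OR economic” matches all three.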
One way to avoid linearly scanning the texts for each query is to index the
documents in advance. Let us stick with Shakespeare’s Collected Works, and use it to
introduce the basics of the Boolean retrieval model. Suppose we record for each document –
here a play of Shakespeare’s – whether it contains each word out of all the words
Shakespeare used (Shakespeare used about 32,000 different words). The result is a binary
term-document incidence matrix, as in the figure. Terms are the indexed units; they are usually
words, and for the moment you can think of them as words.
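Building such an incidence matrix can be sketched as follows (toy documents stand in for the Shakespeare plays):

```python
# Sketch: a binary term-document incidence matrix.
docs = {
    "Hamlet": "to be or not to be",
    "Macbeth": "fair is foul and foul is fair",
}

# The vocabulary is every distinct word used across all documents.
vocab = sorted({t for text in docs.values() for t in text.split()})

# incidence[term][doc] is 1 if the term occurs in the document, else 0.
incidence = {
    term: {name: int(term in text.split()) for name, text in docs.items()}
    for term in vocab
}
```

Each row of the matrix is a binary vector over documents; answering a Boolean query amounts to combining these rows with bitwise operations.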
Assign to each term in a document a weight that depends on the
number of occurrences of the term in the document. We would like to compute a score
between a query term t and a document d, based on the weight of t in d. The simplest
approach is to set the weight equal to the number of occurrences of term t in
document d. This weighting scheme is referred to as term frequency and is denoted tf_t,d,
with the subscripts denoting the term and the document, in that order.
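Counting term frequencies for a single (toy) document can be sketched as:

```python
from collections import Counter

# tf_{t,d}: the raw count of term t in document d.
doc = "car insurance car repair"
tf = Counter(doc.split())
# tf["car"] is 2, tf["insurance"] is 1
```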
The inverse document frequency of a term t is defined as
idf_t = log(N / n_t)
Here,
N = number of documents in the collection
n_t = number of documents containing term t
Tf-idf weighting
We now combine the definitions of term frequency and inverse document frequency,
to produce a composite weight for each term in each document.
The tf-idf weighting scheme assigns to term t a weight in document d given by
tf-idf_t,d = tf_t,d × idf_t
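The combined weighting can be sketched over a toy collection (documents and terms are illustrative; log base 10 is one common choice):

```python
import math
from collections import Counter

docs = {
    "d1": "car repair manual",
    "d2": "car insurance claim",
    "d3": "insurance policy terms",
}
N = len(docs)

# df[t] = n_t: the number of documents containing term t.
df = Counter()
for text in docs.values():
    df.update(set(text.split()))

def tf_idf(term, text):
    # tf-idf_{t,d} = tf_{t,d} * log(N / n_t)
    tf = Counter(text.split())[term]
    return tf * math.log10(N / df[term])
```

A term like “repair” that occurs in only one document receives a higher idf than “car”, which occurs in two, so it contributes more weight where it does appear.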
VECTOR MODEL
Assign non-binary weights to index terms in queries and in documents, and compute the
similarity between documents and the query. This is more precise than the Boolean model.
Due to the disadvantages of the Boolean model, Gerard Salton and his
colleagues suggested a model, which is based on Luhn’s similarity criterion. The similarity
criterion formulated by Luhn states, “the more two representations agreed in given
elements and their distribution, the higher would be the probability of their representing
similar information.”
Consider the following important points to understand more about the Vector Space Model
−
The index representations (documents) and the queries are considered as vectors
embedded in a high dimensional Euclidean space.
The similarity measure of a document vector to a query vector is usually the cosine
of the angle between them.
Cosine Similarity Measure Formula
Cosine is a normalized dot product, which can be calculated with the help of the following
formula −
cos(q, d) = (q · d) / (|q| |d|) = Σ_i q_i d_i / (√(Σ_i q_i²) √(Σ_i d_i²))
The top-ranked document in response to the query terms car and insurance will be
document d2, because the angle between q and d2 is the smallest. The reason is
that both concepts, car and insurance, are salient in d2 and hence have high weights.
On the other hand, d1 and d3 also mention both terms, but in each case one of them is
not a centrally important term in the document.
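This ranking behavior can be sketched with hypothetical weight vectors over the two terms (car, insurance); the weights below are illustrative, not taken from the text:

```python
import math

def cosine(q, d):
    # cos(q, d) = (q . d) / (|q| |d|), for weight vectors over the same terms
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(qi * qi for qi in q)) * math.sqrt(sum(di * di for di in d))
    return dot / norm

q = [1.0, 1.0]                 # query: car, insurance
vectors = {
    "d1": [0.9, 0.1],          # car salient, insurance marginal
    "d2": [0.8, 0.7],          # both terms salient
    "d3": [0.1, 0.9],          # insurance salient, car marginal
}

# Rank documents by cosine similarity to the query.
best = max(vectors, key=lambda name: cosine(q, vectors[name]))
```

Here d2 wins because both query terms carry high weight in it, even though d1 and d3 each mention both terms.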