Unit 2 IRT
ii) Filtering
In filtering, the queries remain relatively static while new documents come into the system.
Classic IR model:
Each document is described by a set of representative keywords called index terms. Numerical weights are assigned to index terms to capture their relevance to the documents and queries.
Three classic models: Boolean, vector, probabilistic
Boolean Model:
The Boolean retrieval model is a model for information retrieval in which we can pose any query in the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT. The model views each document as just a set of words. Retrieval is based on a binary decision criterion (a document either matches or it does not) without any notion of a grading scale. Boolean expressions have precise semantics.
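The Boolean model can be sketched with set operations over an inverted index; the tiny collection and query below are illustrative, not from the notes:

```python
# Minimal sketch of Boolean retrieval: each document is a set of words,
# and a query combines term posting sets with AND / OR / NOT.
docs = {
    1: "information retrieval is the task of finding documents",
    2: "boolean retrieval uses and or not operators",
    3: "the vector model ranks documents by similarity",
}

# Build an inverted index: term -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        index.setdefault(term, set()).add(doc_id)

all_ids = set(docs)

# Query: retrieval AND documents AND NOT vector
result = (index.get("retrieval", set())
          & index.get("documents", set())
          & (all_ids - index.get("vector", set())))
print(sorted(result))  # -> [1]
```

AND maps to set intersection, OR to union, and NOT to complement against the full collection, which is why Boolean queries have precise semantics.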
Vector Model
Assigns non-binary weights to index terms in queries and in documents, and computes the similarity between each document and the query. More precise than the Boolean model.
Probabilistic Model
The probabilistic model tries to estimate the probability that the user will find document dj relevant, ranking documents by the odds
P(dj relevant to q) / P(dj nonrelevant to q)
Given a user query q and the ideal answer set R of the relevant documents, the problem is to specify the properties of this set. Assumption (probabilistic principle): the probability of relevance depends only on the query and document representations; the ideal answer set R should maximize the overall probability of relevance.
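The odds-based ranking can be sketched in a binary-independence style: score a document by summed log odds of its query terms. The term probabilities below are hypothetical estimates, chosen only to illustrate the idea:

```python
import math

# Sketch of probabilistic ranking: score a document by the log of the
# odds P(relevant)/P(nonrelevant), summed over the query terms it
# contains. p and u are hypothetical probability estimates that a term
# appears in relevant / nonrelevant documents, respectively.
p = {"gold": 0.8, "silver": 0.6, "truck": 0.3}   # P(t | relevant)
u = {"gold": 0.1, "silver": 0.2, "truck": 0.5}   # P(t | nonrelevant)

def log_odds(doc_terms):
    # Binary-independence style: sum log odds for query terms present.
    return sum(math.log((p[t] * (1 - u[t])) / (u[t] * (1 - p[t])))
               for t in doc_terms if t in p)

# A document containing terms likely in relevant docs scores higher.
print(log_odds({"gold", "silver"}) > log_odds({"truck"}))  # -> True
```

In practice these probabilities are unknown and must be estimated, which is where the relevance-feedback cycle described later comes in.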
Basic Concepts
Each document is represented by a set of representative
keywords or index terms
Index term:
In a restricted sense: it is a keyword that has
some meaning on its own; usually plays the role of
a noun
In a more general form: it is any word that appears in
a document
Let t be the number of index terms in the document collection and ki be a generic index term. Then the vocabulary V = {k1, . . . , kt} is the set of all distinct index terms in the collection.
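The definition of V can be computed directly; the tiny collection below is illustrative:

```python
# Building the vocabulary V = {k1, ..., kt}: the set of all distinct
# index terms in the collection (here, simply all distinct words).
collection = [
    "to do is to be",
    "to be is to do",
    "do be do be do",
]

vocabulary = sorted({term for doc in collection for term in doc.split()})
print(vocabulary)        # -> ['be', 'do', 'is', 'to']
print(len(vocabulary))   # t = 4
```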
In matrix form, this can be written as a term-document matrix, with one row per index term and one column per document.
Example:
A fat book which many people own is Shakespeare's Collected Works. A term-document incidence matrix records, for each index term, whether it occurs (1) or not (0) in each work. For example, for the terms mercy and worser across six plays (Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth):
mercy   1 0 1 1 1 1
worser  1 0 1 1 1 0
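A Boolean query can be answered directly on these incidence rows with bitwise operations. The sketch below assumes the columns are the six plays of the standard textbook example, in the order listed:

```python
# Answering a Boolean query on the incidence matrix by bitwise
# operations on the term rows (column order assumed as in the example).
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
mercy  = [1, 0, 1, 1, 1, 1]
worser = [1, 0, 1, 1, 1, 0]

# Query: mercy AND NOT worser
answer = [m & (1 - w) for m, w in zip(mercy, worser)]
print([p for p, bit in zip(plays, answer) if bit])  # -> ['Macbeth']
```

Each query term contributes its row as a bit vector; AND, OR, and NOT become bitwise operations over those vectors.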
Term Frequency tf
One weighting scheme is term frequency, denoted tf(t, d), with the arguments denoting the term and the document, in that order. The term frequency TF(t, d) of term t in document d is the number of times that t occurs in d.
Ex: Term-Document Count Matrix
A binary incidence matrix is too coarse: we would like to give more weight to documents that contain a term several times than to ones that contain it only once. To do this we need term frequency information, the number of times a term occurs in a document, and assign a score that reflects the number of occurrences.
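Raw term frequencies are simple word counts per document; the toy document below is illustrative:

```python
from collections import Counter

# Computing raw term frequency TF(t, d): the number of times term t
# occurs in document d.
doc = "caesar was killed brutus killed caesar caesar"
tf = Counter(doc.split())
print(tf["caesar"])  # -> 3
print(tf["killed"])  # -> 2
```

One such Counter per document gives the columns of the term-document count matrix.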
Log-Frequency Weighting
The log-frequency weight of term t in document d is calculated as
w(t, d) = 1 + log10 TF(t, d) if TF(t, d) > 0, and w(t, d) = 0 otherwise.
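The log-frequency weight, under the standard 1 + log10 form given above, can be sketched as:

```python
import math

# Log-frequency weight: w = 1 + log10(tf) when tf > 0, else 0,
# so repeated occurrences add diminishing weight.
def log_tf_weight(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

print(log_tf_weight(0))     # -> 0.0
print(log_tf_weight(1))     # -> 1.0
print(log_tf_weight(1000))  # -> 4.0
```

Note the damping: a thousand occurrences weigh only four times as much as one occurrence, not a thousand times.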
TF-IDF Weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight. It is the best-known weighting scheme in information retrieval. TF(t, d) measures the importance of term t in document d, while IDF(t) measures the importance of term t in the whole collection of documents; with N documents and df(t) of them containing t, a common form is IDF(t) = log10(N / df(t)). TF-IDF weighting puts TF and IDF together:
TF-IDF(t, d) = TF(t, d) × IDF(t)
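A minimal sketch of this weighting, using the log10(N / df) form of idf with illustrative collection statistics:

```python
import math

# TF-IDF sketch: idf(t) = log10(N / df_t), where N is the number of
# documents and df_t the number of documents containing t; the tf-idf
# weight is the product of the tf count and idf. Statistics are made up.
N = 1_000_000
df = {"the": 900_000, "insurance": 1_000}

def idf(term):
    return math.log10(N / df[term])

def tf_idf(tf, term):
    return tf * idf(term)

# A frequent term gets low idf; a rare term gets high idf.
print(round(idf("the"), 3))        # -> 0.046
print(idf("insurance"))            # -> 3.0
print(tf_idf(3, "insurance"))      # -> 9.0
```

This is why stopwords like "the" contribute almost nothing to a document's score even at high tf, while rare content words dominate.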
2.4 The Vector Model
Boolean matching and binary weights are too limiting.
The vector model proposes a framework in which partial
matching is possible
This is accomplished by assigning non-binary weights to index
terms in queries and in documents
Term weights are used to compute a degree of similarity
between a query and each document
The documents are ranked in decreasing order of their degree of similarity.
Weights in the Vector model are basically tf-idf weights
These equations should only be applied for values of term
frequency greater than zero
If the term frequency is zero, the respective weight is also zero.
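The ranking step of the vector model can be sketched with cosine similarity over tf-idf-style weight vectors; the vectors and document names below are illustrative:

```python
import math

# Vector-model ranking sketch: documents and the query are weight
# vectors; rank documents by cosine similarity with the query.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

query = [1.0, 1.0, 0.0]
docs = {"d1": [2.0, 2.0, 0.0], "d2": [0.0, 1.0, 3.0], "d3": [1.0, 0.0, 0.0]}

ranking = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranking)  # -> ['d1', 'd3', 'd2']
```

Dividing by the vector norms is exactly the built-in length normalization: d1 ranks first because its direction matches the query, not because its weights are large.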
After length normalization
Example 2:
N=1000000
Since in the above example only one document is given, only one score is calculated. If n documents are given, n scores will be calculated and the documents ranked in decreasing order of score.
Advantages:
Term-weighting improves the quality of the answer set.
Partial matching allows retrieval of documents that approximate the query conditions.
The cosine ranking formula sorts documents according to their degree of similarity to the query.
Document length normalization is naturally built into the ranking.
Disadvantages:
It assumes independence of index terms.
An initial set of documents is retrieved somehow, and the user inspects these documents looking for the relevant ones (in practice, only the top 10-20 need to be inspected). The IR system uses this information to refine the description of the ideal answer set. By repeating this process, it is expected that the description of the ideal answer set will improve.
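One standard way to implement this refinement cycle in a vector setting is the Rocchio update (the notes describe the cycle but do not name a formula): move the query vector toward the centroid of the user-marked relevant documents and away from the nonrelevant ones. All vectors and weights below are illustrative:

```python
# Rocchio-style query refinement from user relevance feedback.
def centroid(vecs, dims):
    """Component-wise mean of a list of vectors (zero vector if empty)."""
    if not vecs:
        return [0.0] * dims
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dims)]

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    dims = len(query)
    rel_c = centroid(relevant, dims)
    non_c = centroid(nonrelevant, dims)
    # New query = original, pulled toward relevant docs, pushed away
    # from nonrelevant docs.
    return [alpha * query[i] + beta * rel_c[i] - gamma * non_c[i]
            for i in range(dims)]

q = [1.0, 0.0]
new_q = rocchio(q, relevant=[[0.0, 2.0]], nonrelevant=[[2.0, 0.0]])
print([round(x, 2) for x in new_q])  # -> [0.7, 1.5]
```

Re-running retrieval with new_q and repeating the feedback loop is what gradually improves the description of the ideal answer set.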
The Ranking
Documents are ranked in decreasing order of probability of relevance to the information need: P(R | q, di).
1. Find measurable statistics (tf, df, document length) that affect judgments about document relevance
2. Combine these statistics to estimate the probability of
document relevance
3. Order documents by decreasing estimated probability of
relevance P(R|d, q)
4. Assume that the relevance of each document is
independent of the relevance of other documents
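One well-known way to combine exactly these statistics (tf, df, document length) into an estimated relevance score is the BM25 term score, shown here as an illustration; the notes do not name a specific formula, and the numbers below are made up:

```python
import math

# BM25-style per-term score: combines term frequency (tf), document
# frequency (df), and document length into one relevance contribution.
def bm25_term(tf, df, N, doc_len, avg_len, k1=1.5, b=0.75):
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1)       # rarity of the term
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * norm                                      # saturated tf x idf

# A rarer term (smaller df) contributes more for the same tf and length.
common = bm25_term(tf=2, df=500, N=1000, doc_len=100, avg_len=100)
rare = bm25_term(tf=2, df=5, N=1000, doc_len=100, avg_len=100)
print(rare > common)  # -> True
```

Summing such per-term scores over the query terms and sorting documents by the total implements steps 2 and 3; step 4's independence assumption is what lets the per-term contributions simply add.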