
i) Ad Hoc Retrieval:
The documents in the collection remain relatively static while
new queries are submitted to the system.

ii) Filtering:
The queries remain relatively static while new documents come
into the system.

Classic IR model:
Each document is described by a set of representative
keywords called index terms. Numerical weights are assigned
to index terms to capture how important each term is for
describing the document's content.
Three classic models: Boolean, vector, probabilistic.
Boolean Model:
The Boolean retrieval model is a model for information
retrieval in which we can pose any query which is in
the form of a Boolean expression of terms, that is, in
which terms are combined with the operators AND,
OR, and NOT. The model views each document as just
a set of words. Based on a binary decision criterion
without any notion of a grading scale. Boolean
expressions have precise semantics.
Vector Model
Assign non-binary weights to index terms in queries
and in documents. Compute the similarity between
documents and query. More precise than Boolean
model.
Probabilistic Model
The probabilistic model tries to estimate the probability
that the user will find the document dj relevant, using the
ratio
P(dj relevant to q) / P(dj nonrelevant to q)
Given a user query q, and the ideal answer set R of the
relevant documents, the problem is to specify the
properties of this set. Assumption (probabilistic
principle): the probability of relevance depends on the
query and document representations only; the ideal answer
set R should maximize the overall probability of
relevance.

Basic Concepts
 Each document is represented by a set of representative
keywords or index terms
 Index term:
In a restricted sense: it is a keyword that has
some meaning on its own; usually plays the role of
a noun
In a more general form: it is any word that appears in
a document
 Let t be the number of index terms in the document
collection and ki be a generic index term. Then:
 The vocabulary V = {k1, . . . , kt} is the set of all
distinct index terms in the collection.

The Term-Document Matrix


 The occurrence of a term ti in a document dj
establishes a relation between ti and dj
 A term-document relation between ti and dj can be
quantified by the frequency of the term in the
document
 In matrix form, this can be written as M = (fi,j),
where each element fi,j stands for the frequency of
term ti in document dj
 Logical view of a document: from full text to a set of
index terms

Logical view of a document
The document is represented not by its full text but by a
set of index terms.

• Documents in a collection are frequently represented through a
set of index terms or keywords. Keywords are extracted from the
document; they may be derived automatically or generated by a
specialist, and they provide a logical view of the document.
• This can be accomplished through the elimination of stop words
(articles and connectives), stemming (reducing distinct words
to their common grammatical root), and identification of noun
groups (which eliminates adjectives, adverbs and verbs).
Further compression might be employed. These operations are
called text operations.
• Stop words
– Removing stop words reduces the set of representative
keywords drawn from a large collection.
– Some examples of stop words are: "a", "and", "but", "how",
"or", and "what."
– For example, if you were to search for "What is a
motherboard?" on Computer Hope, the search engine would
only look for the term "motherboard". The removal of stop
words usually improves IR effectiveness.
• Noun groups
– Identify the noun groups in the text
– This eliminates the adjectives, adverbs and verbs
• Stemming
– Reduces distinct words to their common grammatical root
– Achieved by removing some word endings
A stemmer for English, for example, should identify the string "cats"
(and possibly "catlike", "catty", etc.) as based on the root "cat", and
"stems", "stemmer", "stemming", "stemmed" as based on "stem". A
stemming algorithm reduces the words "fishing", "fished", and
"fisher" to the root word "fish". On the other hand, "argue", "argued",
"argues", "arguing", and "argus" reduce to the stem "argu", while
"argument" and "arguments" reduce to the stem "argument".

• Reason for stemming
– Different word forms may bear similar meaning (e.g.
search, searching): stemming creates a "standard"
representation for them.

• Finally, further compression may be employed. These "text
operations" are used to extract the index terms; they reduce the
complexity of the document representation and allow moving the
logical view from that of full text to that of a set of index
terms, as the sketch below illustrates.
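To make the idea concrete, here is a minimal Python sketch of these text
operations. The stop-word list is the small example set above, and
simple_stem is a toy suffix stripper standing in for a real stemmer such
as Porter's; neither is meant as a production pipeline.

import re

# Illustrative stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "but", "how", "is", "or", "the", "what"}

# Toy suffix-stripping rules standing in for a real stemmer (e.g. Porter).
SUFFIXES = ["ing", "ed", "es", "s"]

def simple_stem(word):
    """Strip the first matching suffix; a crude stand-in for stemming."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Apply text operations: tokenize, drop stop words, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [simple_stem(t) for t in tokens if t not in STOP_WORDS]

print(index_terms("What is a motherboard?"))    # ['motherboard']
print(index_terms("fishing fished and fisher")) # ['fish', 'fish', 'fisher']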

2.2 Boolean Retrieval Models


Definition : The Boolean retrieval model is a model for
information retrieval in which the query is in the form
of a Boolean expression of terms, combined with the
operators AND, OR, and NOT. The model views each
document as just a set of words.
 Simple model based on set theory and Boolean algebra
 The Boolean model predicts that each document is
either relevant or non-relevant

Example:
A fat book which many people own is Shakespeare's Collected
Works.

Problem: To determine which plays of Shakespeare contain
the words Brutus AND Caesar AND NOT Calpurnia.

Method1 : Using Grep


The simplest form of document retrieval is for a computer to
do the linear scan through documents. This process is
commonly referred to as grepping through text, after the
Unix command grep. Grepping through text can be a very
effective process, especially given the speed of modern
computers, and often allows useful possibilities for wildcard
pattern matching through the use of regular expressions.

To go beyond simple querying of modest collections, we need:


1. To process large document collections quickly.
2. To allow more flexible matching operations.
For example, it is impractical to perform the query
"Romans NEAR countrymen" with grep, where NEAR
might be defined as "within 5 words" or "within the
same sentence".
3. To allow ranked retrieval: in many cases we want the
best answer to an information need among many documents
that contain certain words. The way to avoid linearly scanning
the texts for each query is to index documents in advance.

Method2: Using Boolean Retrieval Model


 The Boolean retrieval model is a model for information
retrieval in which we can pose any query which is in the
form of a Boolean expression of terms, that is, in which
terms are combined with the operators AND, OR, and
NOT. The model views each document as just a set of
words.
Terms are the indexed units. We have a vector for each term,
which shows the documents it appears in, or a vector for each
document, showing the terms that occur in it. The result is a
binary term-document incidence matrix, as shown below.

            Antony and   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra    Caesar   Tempest
Antony          1           1        0         0        0         1
Brutus          1           1        0         1        0         0
Caesar          1           1        0         1        1         1
Calpurnia       0           1        0         0        0         0
Cleopatra       1           0        0         0        0         0
Mercy           1           0        1         1        1         1
Worser          1           0        1         1        1         0

A term-document incidence matrix. Matrix element (t,


d) is 1 if the play in column d contains the word in row
t, and is 0 otherwise.

 To answer the query Brutus AND Caesar AND NOT


Calpurnia, we take the vectors for Brutus, Caesar and
Calpurnia, complement the last, and then do a
bitwise AND:
110100 AND 110111 AND 101111 = 100100

Solution: Antony and Cleopatra and Hamlet

Results from Shakespeare for the query Brutus AND Caesar


AND NOT Calpurnia.
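The same query can be answered with a few lines of Python over the
incidence vectors above; v_and and v_not are our own helper names for this
sketch, not functions from any IR library.

# Binary incidence vectors from the matrix above, one bit per play, in the
# order: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

INCIDENCE = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def v_and(a, b):
    return [x & y for x, y in zip(a, b)]

def v_not(a):
    return [1 - x for x in a]

# Brutus AND Caesar AND NOT Calpurnia
result = v_and(v_and(INCIDENCE["Brutus"], INCIDENCE["Caesar"]),
               v_not(INCIDENCE["Calpurnia"]))

print(result)  # [1, 0, 0, 1, 0, 0]  -> 100100
print([play for play, bit in zip(PLAYS, result) if bit])
# ['Antony and Cleopatra', 'Hamlet']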

Drawbacks of the Boolean Model


 Retrieval based on binary decision criteria with no notion of
partial matching
 No ranking of the documents is provided (absence of a
grading scale)
 Information need has to be translated into a Boolean
expression, which most users find awkward
 The Boolean queries formulated by the users are most often
too simplistic
 The model frequently returns either too few or too many
documents in response to a user query

2.3 Term weighting


A search engine should return, in order, the documents most
likely to be useful to the searcher. Ordering documents with
respect to a query is called ranking.

Term-Document Incidence Matrix

A Boolean model only records term presence or absence. Instead,
we assign a score – say in [0, 1] – to each document, measuring
how well the document and query "match". For the one-term query
BRUTUS, the score is 1 if the term is present in the document
and 0 otherwise; more appearances of the term in the document
should give a higher score.
            Antony and   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra    Caesar   Tempest
Antony          1           1        0         0        0         1
Brutus          1           1        0         1        0         0
Caesar          1           1        0         1        1         1
Calpurnia       0           1        0         0        0         0
Cleopatra       1           0        0         0        0         0
Mercy           1           0        1         1        1         1
Worser          1           0        1         1        1         0

Each document is represented by a binary vector in {0,1}^|V|.

Term Frequency tf
One of the weighting schemes is term frequency, denoted tft,d,
with the subscripts denoting the term and the document in order.
The term frequency TF(t, d) of term t in document d is the number
of times that t occurs in d.
Example: a term-document count matrix. Unlike the binary
incidence matrix, we would like to give more weight to documents
that contain a term several times as opposed to ones that contain
it only once. To do this we need term frequency information: the
number of times a term occurs in a document. We assign a score to
represent the number of occurrences, as sketched below.
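A minimal sketch of computing raw term frequencies with Python's standard
library; the whitespace tokenization is a simplifying assumption.

from collections import Counter

def term_frequencies(document_text):
    """Raw term frequency TF(t, d): count of each term in the document."""
    tokens = document_text.lower().split()
    return Counter(tokens)

tf = term_frequencies("car insurance auto insurance")
print(tf["insurance"])  # 2
print(tf["car"])        # 1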

How to use tf for query-document match scores?

Raw term frequency is not what we want. A document with 10
occurrences of the term is more relevant than a document with 1
occurrence of the term, but not 10 times more relevant. We
therefore use log-frequency weighting.

Log-Frequency Weighting
The log-frequency weight of term t in document d is calculated as

    wt,d = 1 + log10(tft,d)   if tft,d > 0
    wt,d = 0                  otherwise

tft,d → wt,d : 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
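The weighting can be written as a one-line Python function (base-10
logarithm, as in the mapping above):

import math

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 1))
# 0 0, 1 1.0, 2 1.3, 10 2.0, 1000 4.0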
Document Frequency & Collection Frequency
Document frequency DF(t): the number of documents
in the collection that contain a term t
Collection frequency CF(t): the total number of
occurrences of a term t in the collection

Rare terms are more informative than frequent terms; to capture
this we use document frequency (df).

Example: rare word ARACHNOCENTRIC
A document containing this term is very likely to be relevant to
the query ARACHNOCENTRIC, so we want a high weight for rare terms
like ARACHNOCENTRIC.

Example: common word THE
A document containing this term can be about anything, so we want
a very low weight for common terms like THE.

Example: "insurance" vs "try"
Document frequency is the more meaningful measure: the few
documents that contain "insurance" should get a higher boost for
a query on "insurance" than the many documents containing "try"
get for a query on "try".
Inverse Document Frequency (idf Weight)
It estimates the rarity of a term in the whole document
collection. idft is an inverse measure of the informativeness of
t. dft is the document frequency of t: the number of documents
that contain t, with dft <= N. The informativeness idf (inverse
document frequency) of t is:

    idft = log10(N / dft)

log(N/dft) is used instead of N/dft to dampen the effect of idf.

N: the total number of documents in the collection (for
example, 806,791 documents)
• IDF(t) is high if t is a rare term
• IDF(t) is likely low if t is a frequent term
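A corresponding sketch for idf; the df values and collection size below
are illustrative, not taken from a real collection.

import math

def idf(df_t, n_docs):
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(n_docs / df_t)

N = 1_000_000
print(round(idf(1_000, N), 1))    # 3.0  (rare term, high weight)
print(round(idf(100_000, N), 1))  # 1.0  (frequent term, lower weight)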

Effect of idf on Ranking


Does idf have an effect on ranking for one-term queries, like
IPHONE? No: idf affects ranking only for queries with more than
one term. For the query CAPRICIOUS PERSON, idf puts more weight
on CAPRICIOUS than on PERSON, since CAPRICIOUS is the rarer term.

TF-IDF Weighting
The tf-idf weight of a term is the product of its tf weight and
its idf weight; it is the best known weighting scheme in
information retrieval. TF(t, d) measures the importance of a term
t in document d, and IDF(t) measures the importance of a term t
in the whole collection of documents.
TF-IDF weighting: putting TF and IDF together

    TFIDF(t, d) = TF(t, d) x IDF(t)

or, if log tf is used,

    wt,d = (1 + log10 tft,d) x log10(N / dft)

 High if t occurs many times in a small number of
documents, i.e., highly discriminative in those documents
 Not high if t appears infrequently in a document, or is
frequent in many documents, i.e., not discriminative
 Low if t occurs in almost all documents, i.e., no
discrimination at all
Simple Query-Document Score
 Scoring finds whether or not a query term is present in a
zone (zones are document features whose content can be
arbitrary free text; examples: title, abstract) within a
document.
 If the query contains one or more terms, the score for a
document-query pair is the sum over terms t appearing in
both q and d:

    Score(q, d) = sum over terms t in both q and d of tf-idft,d

 The score is 0 if none of the query terms is present in the
document. A sketch of this scoring appears below.
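Putting tf, idf and the overlap score together, a rough sketch (the df
dictionary and N below are illustrative assumptions, and tfidf_score is
our own helper name, not a standard API):

import math
from collections import Counter

def tfidf_score(query, document, df, n_docs):
    """Score(q, d) = sum over shared terms of (1 + log10 tf) * log10(N / df)."""
    q_terms = set(query.lower().split())
    tf = Counter(document.lower().split())
    score = 0.0
    for t in q_terms:
        if tf[t] > 0 and t in df:
            w_tf = 1 + math.log10(tf[t])
            w_idf = math.log10(n_docs / df[t])
            score += w_tf * w_idf
    return score

# Hypothetical document frequencies for a 1,000,000-document collection.
df = {"best": 50_000, "car": 10_000, "insurance": 1_000, "auto": 5_000}
print(round(tfidf_score("best car insurance",
                        "car insurance auto insurance", df, 1_000_000), 2))
# 5.9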

Document Length Normalization


 Document sizes might vary widely
 This is a problem because longer documents are more likely to
be retrieved by a given query
 To compensate for this undesired effect, we can divide the
rank of each document by its length
 This procedure consistently leads to better ranking, and it is
called document length normalization

2.4 The Vector Model
 Boolean matching and binary weights is too limiting
 The vector model proposes a framework in which partial
matching is possible
 This is accomplished by assigning non-binary weights to index
terms in queries and in documents
 Term weights are used to compute a degree of similarity
between a query and each document
 The documents are ranked in decreasing order of their degree of
similarity

 The weight wi,j associated with a pair (ki, dj) is positive and
non-binary
 The index terms are assumed to be all mutually independent
 They are represented as unit vectors of a t-dimensional space (t
is the total number of index terms)
 The representations of document dj and query q are
t-dimensional vectors given by

    dj = (w1,j, w2,j, ..., wt,j)
    q  = (w1,q, w2,q, ..., wt,q)

The similarity between a document dj and a query q is the cosine
of the angle between their vectors:

    sim(dj, q) = (dj . q) / (|dj| x |q|)
               = sum_i (wi,j x wi,q) / (sqrt(sum_i wi,j^2) x sqrt(sum_i wi,q^2))

 Weights in the Vector model are basically tf-idf weights
 These equations should only be applied for values of term
frequency greater than zero
 If the term frequency is zero, the respective weight is also zero.

Cosine Similarity: measure the similarity between the document
and the query using the cosine of their vector representations,
as in the short sketch below.
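A short sketch of cosine similarity between two weight vectors (plain
Python; in the vector model these would be tf-idf weight vectors over the
vocabulary):

import math

def cosine_similarity(v, w):
    """Cosine of the angle between two weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0 or norm_w == 0:
        return 0.0
    return dot / (norm_v * norm_w)

print(cosine_similarity([1.0, 2.0, 0.0], [1.0, 0.0, 3.0]))  # 0.1414...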

Example 1: To find the similarity between 3 novels

Three novels are taken for discussion, namely
o SaS: Sense and Sensibility
o PaP: Pride and Prejudice
o WH: Wuthering Heights
For each novel, raw term frequencies tft are computed, then
converted to log-frequency weights, and finally length-normalized.
Using the length-normalized weights:
cos(SaS,PaP) ≈ 0.789*0.832 + 0.515*0.555 + 0.335*0 + 0*0 ≈ 0.94
cos(SaS,WH) ≈ 0.79    cos(PaP,WH) ≈ 0.69
So cos(SaS,PaP) > cos(*,WH): Sense and Sensibility and Pride and
Prejudice are the most similar pair.
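Because the vectors are already length-normalized, the cosine reduces to a
dot product; a quick check using the SaS and PaP weights quoted above
(each of the four components corresponds to one of the example's terms):

# Length-normalized weight vectors for the two novels, taken from the
# computation above (one component per term in the example).
sas = [0.789, 0.515, 0.335, 0.0]
pap = [0.832, 0.555, 0.0,   0.0]

# For unit-length vectors, cosine similarity reduces to the dot product.
cos_sas_pap = sum(a * b for a, b in zip(sas, pap))
print(round(cos_sas_pap, 2))  # 0.94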

Example 2:

We often use different weightings for queries (q) and
documents (d).
Notation: ddd.qqq
Example: lnc.ltn
Document: l - logarithmic tf,
          n - no idf weighting,
          c - cosine normalization
Query:    l - logarithmic tf,
          t - idf,
          n - no cosine normalization

Example query: "best car insurance"
Example document: "car insurance auto insurance"
N = 10,000,000, with the df column given in the question itself.

Final similarity score between query and document:
sum of wq,i * wd,i = 0 + 0 + 1.04 + 2.04 = 3.08

Example 3: standard weighting scheme (lnc.ltc)

Document: l - logarithmic tf,
          n - no idf weighting,
          c - cosine normalization
Query:    l - logarithmic tf,
          t - idf,
          c - cosine normalization

Document: car insurance auto insurance
Query: best car insurance
N = 1,000,000

Term      |             Query                      |       Document           | Prod
          | tf-raw  tf-wt   df     idf  wt   n'lize| tf-raw  tf-wt  wt  n'lize|
auto      |   0      0     5000    2.3  0     0    |   1      1     1    0.52 |  0
best      |   1      1    50000    1.3  1.3   0.34 |   0      0     0    0    |  0
car       |   1      1    10000    2.0  2.0   0.52 |   1      1     1    0.52 |  0.27
insurance |   1      1     1000    3.0  3.0   0.78 |   2      1.3   1.3  0.68 |  0.53

(Document tf-wt is the log weight 1 + log10 tf.)

Score = 0+0+0.27+0.53 = 0.8

Since only one document is given in the above example, only one score is
calculated. If n documents were given, n scores would be calculated and
the documents ranked in decreasing order of score, as in the sketch below.
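The lnc.ltc computation above can be reproduced in a short Python sketch;
the document, query, df values and N are exactly those given in the
example, while the helper names are ours.

import math
from collections import Counter

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(weights):
    """Cosine (length) normalization of a term -> weight mapping."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: (w / norm if norm else 0.0) for t, w in weights.items()}

def lnc_ltc_score(query, document, df, n_docs):
    """lnc.ltc: document uses log tf + cosine norm; query adds idf."""
    d_tf = Counter(document.lower().split())
    q_tf = Counter(query.lower().split())
    d_w = normalize({t: log_tf(c) for t, c in d_tf.items()})
    q_w = normalize({t: log_tf(c) * math.log10(n_docs / df[t])
                     for t, c in q_tf.items()})
    return sum(q_w[t] * d_w.get(t, 0.0) for t in q_w)

df = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}
score = lnc_ltc_score("best car insurance",
                      "car insurance auto insurance", df, 1_000_000)
print(round(score, 1))  # 0.8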

Advantages:
 term-weighting improves quality of the answer set
 partial matching allows retrieval of docs that approximate the
query conditions
 cosine ranking formula sorts documents according to a degree of
similarity to the query
 document length normalization is naturally built into the ranking
Disadvantages:
It assumes independence of index terms

2.5 Probabilistic Model

The probabilistic model captures the IR problem using a
probabilistic framework.
Given a user query, there is an ideal answer set for this
query. Given a description of this ideal answer set, we
could retrieve the relevant documents. Querying is thus seen
as a specification of the properties of this ideal answer
set.

An initial set of documents is retrieved somehow. The
user inspects these documents looking for the relevant ones (in
practice, only the top 10-20 need to be inspected). The IR
system uses this information to refine the description of
the ideal answer set. By repeating this process, it is
expected that the description of the ideal answer set will
improve.

Probabilistic Ranking Principle

The probabilistic model tries to estimate the probability
that a document will be relevant to a user query. It
assumes that this probability depends on the query and
document representations only. The ideal answer set,
referred to as R, should maximize the probability of
relevance.

Idea: Given a user query q, and the ideal answer set R


of the relevant documents, the problem is to specify
the properties for this set
– Assumption (probabilistic principle): the
probability of relevance depends on the query and
document representations only; ideal answer set R
should maximize the overall probability of
relevance
– The probabilistic model tries to estimate the
probability that the user will find the document
dj relevant with ratio
P(dj relevant to q)/P(dj nonrelevant to q)

Given a query q, there exists a subset R of the documents
which are relevant to q.
But membership of R is uncertain. A probabilistic retrieval
model ranks documents in decreasing order of their probability
of relevance to the information need: P(R | q, di)

Why probabilities in IR?


Users come with information needs, which they translate
into query representations. Similarly, there are
documents, which are converted into document
representations. Given only a query, an IR system has
an uncertain understanding of the information need. So
IR is an uncertain process, because:
 an information need must be translated into a query representation
 documents must be converted into index terms
 query terms and index terms may mismatch
Probability theory provides a principled foundation for
such reasoning under uncertainty. This model estimates
how likely a document is to be relevant to an information
need.

The Ranking

Probabilistic IR - Need to Estimate

1. Find measurable statistics (tf, df ,document length) that
affect judgments about document relevance
2. Combine these statistics to estimate the probability of
document relevance
3. Order documents by decreasing estimated probability of
relevance P(R|d, q)
4. Assume that the relevance of each document is
independent of the relevance of other documents

Term Incidence Contingency Table

Let pt = P(xt = 1 | R = 1, q) be the probability of a term
appearing in a document relevant to the query, and
let ut = P(xt = 1 | R = 0, q) be the probability of a term
appearing in a non-relevant document.
Documents are ranked using the log odds ratio ct for each term
in the query:

    ct = log [ (pt / (1 - pt)) / (ut / (1 - ut)) ]

pt/(1 - pt) → odds of the term appearing if the document
is relevant
ut/(1 - ut) → odds of the term appearing if the document
is irrelevant
ct = 0 → the term has equal odds of appearing in relevant and
irrelevant documents
ct positive → higher odds of appearing in relevant documents
ct negative → higher odds of appearing in irrelevant documents
ct functions as a term weight.
Retrieval status value for document d (summing over the query
terms that appear in d):

    RSVd = sum over terms t in both q and d of ct
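A small sketch of this ranking; the pt and ut values below are made-up
illustrative estimates, and base-10 logarithms are used (any base gives
the same ranking).

import math

def term_weight(p_t, u_t):
    """BIM term weight c_t = log odds ratio of the term for relevance."""
    return math.log10((p_t / (1 - p_t)) / (u_t / (1 - u_t)))

def rsv(query_terms, doc_terms, p, u):
    """Retrieval status value: sum of c_t over query terms present in the doc."""
    return sum(term_weight(p[t], u[t])
               for t in query_terms if t in doc_terms)

# Illustrative probability estimates for two query terms.
p = {"capricious": 0.8, "person": 0.5}   # P(term | relevant)
u = {"capricious": 0.1, "person": 0.4}   # P(term | non-relevant)

score = rsv({"capricious", "person"}, {"capricious", "person", "the"}, p, u)
print(round(score, 2))  # 1.73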

Example for the BIR model

• Assume a query q containing two terms, x1 and x2. Table