Unit 2
Basic IR Models – Boolean Model – TF-IDF (Term Frequency/Inverse Document Frequency) Weighting –
Vector Model – Probabilistic Model – Latent Semantic Indexing Model – Neural Network Model –Retrieval
Evaluation – Retrieval Metrics – Precision and Recall – Reference Collection –User-based Evaluation – Relevance
Feedback and Query Expansion – Explicit Relevance Feedback.
2.1 Introduction
Modeling
Modeling in IR is a complex process aimed at producing a ranking function
Ranking function: a function that assigns scores to documents with regard to a given query.
This process consists of two main tasks:
• The conception of a logical framework for representing documents and queries
• The definition of a ranking function that allows quantifying the similarities among documents and
queries
IR systems usually adopt index terms to index and retrieve documents.
IR Model Definition: an IR model can be characterized as a quadruple [D, Q, F, R(q, d)], where D is the set of document representations, Q is the set of query representations, F is the framework for modeling documents, queries, and their relationships, and R(q, d) is a ranking function that associates a real number with each query-document pair.
A Taxonomy of IR Models
Retrieval models are most frequently associated with distinct combinations of a document logical view and a user task. The user's task includes retrieval and browsing. Retrieval can be either:
i) Ad Hoc Retrieval:
The documents in the collection remain relatively static while new queries are submitted to the system.
ii) Filtering
The queries remain relatively static while new documents come into the system
Classic IR model:
Each document is described by a set of representative keywords called index terms. Numerical weights are assigned to index terms to capture how relevant each term is to the document.
Three classic models: Boolean, vector, probabilistic
Boolean Model :
The Boolean retrieval model is a model for information retrieval in which we can pose any query
which is in the form of a Boolean expression of terms, that is, in which terms are combined with the
operators AND, OR, and NOT. The model views each document as just a set of words. Based on a
binary decision criterion without any notion of a grading scale. Boolean expressions have precise
semantics.
Vector Model
Assign non-binary weights to index terms in queries and in documents. Compute the similarity
between documents and query. More precise than Boolean model.
Probabilistic Model
The probabilistic model tries to estimate the probability that the user will find document dj relevant, and ranks documents by the odds ratio
P(dj relevant to q) / P(dj nonrelevant to q)
Given a user query q, and the ideal answer set R of the relevant documents, the problem is to specify
the properties for this set. Assumption (probabilistic principle): the probability of relevance depends
on the query and document representations only; ideal answer set R should maximize the overall
probability of relevance
Basic Concepts
• Each document is represented by a set of representative keywords or index terms
• Index term:
In a restricted sense: it is a keyword that has some meaning on its own; usually plays
the role of a noun
In a more general form: it is any word that appears in a document
• Let t be the number of index terms in the document collection and ki be a generic index term. Then:
• The vocabulary V = {k1, . . . , kt} is the set of all distinct index terms in the collection
• The collection can be represented as a term-document frequency matrix, where each element fi,j stands for the frequency of term ki in document dj
• Logical view of a document: from full text to a set of index terms
Example :
A fat book which many people own is Shakespeare's Collected Works.
Problem: To determine which plays of Shakespeare contain the words Brutus AND Caesar AND NOT Calpurnia.
A simple linear scan through the full text of the documents can answer such queries, given the speed of modern computers, and often allows useful possibilities for wildcard pattern matching through the use of regular expressions. A better approach is to index the documents in advance, e.g., as a binary term-document incidence matrix with one row (vector) per term and one column per play.
• To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus,
Caesar and Calpurnia, complement the last, and then do a bitwise AND:
110100 AND 110111 AND 101111 = 100100
Results from Shakespeare for the query Brutus AND Caesar AND NOT Calpurnia.
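A minimal sketch of this bitwise AND over term-document incidence vectors, in Python (the 0/1 vectors are the ones from the example above; the ordering of the six plays is assumed):

# Boolean query Brutus AND Caesar AND NOT Calpurnia over incidence vectors.
incidence = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def AND(v1, v2):
    return [a & b for a, b in zip(v1, v2)]

def NOT(v):
    return [1 - a for a in v]

result = AND(AND(incidence["Brutus"], incidence["Caesar"]),
             NOT(incidence["Calpurnia"]))
print(result)  # [1, 0, 0, 1, 0, 0] -> documents 1 and 4 match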
Consider N = 10^6 documents, each with about 1,000 tokens ⇒ a total of 10^9 tokens.
At an average of 6 bytes per token, including spaces and punctuation ⇒ the size of the document collection is about 6 × 10^9 bytes = 6 GB.
Assume there are M = 500,000 distinct terms in the collection.
A term-document incidence matrix would then have M × N = 500,000 × 10^6 entries = half a trillion 0s and 1s.
But the matrix has no more than one billion 1s, so it is extremely sparse. What is a better representation? We only record the 1s: the inverted index.
i) Dictionary / vocabulary / lexicon: we use dictionary for the data structure and vocabulary for the set of terms. The dictionary in the figure has been sorted alphabetically.
ii) Postings: for each term, we keep a list of the IDs of the documents in which the term appears. The list is called a postings list (or inverted list), and each postings list is sorted by document ID.
These are the two parts of an inverted index. The dictionary is commonly kept in memory, with pointers to each postings list, which is stored on disk.
...
2. Tokenize the text, turning each document into a list of tokens:
...
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms:
...
4. Index the documents that each term occurs in by creating an inverted index,
consisting of a dictionary and postings.
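A small sketch of this construction in Python, under the simplifying assumption that tokenization and normalization amount to lowercasing and whitespace splitting (the example documents are hypothetical):

from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping docID -> document text."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():        # crude tokenization/normalization
            index[token].add(doc_id)
    # dictionary sorted alphabetically, postings sorted by docID
    return {term: sorted(postings) for term, postings in sorted(index.items())}

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}
index = build_inverted_index(docs)
print(index["home"])   # [1, 2, 3]
print(index["july"])   # [2, 3]  (document frequency = length of the postings list)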
DocID :
Each document has a unique serial number, known as the document identifier ( docID ).
During index construction, simply assign successive integers to each new document when it
is first encountered.
Building the dictionary and postings:
The input to indexing is a list of normalized tokens for each document, which we can equally think of as a list of (term, docID) pairs. The core indexing step is sorting this list so that instances of the same term are grouped together. Multiple occurrences of the same term from the same document are then merged, and the result is split into a dictionary and postings.
Document Frequency :
The dictionary records some statistics, such as the number of documents which contain each
term . This information is not vital for a basic Boolean search engine, but it allows us to
improve the efficiency of the search engine at query time, and it is a statistic later used in
many ranked retrieval models. The postings are secondarily sorted by docID. This provides
the basis for efficient query processing.
Storage (dictionary & postings lists) :
1. A fixed length array would be wasteful as some words occur in many documents, and others
in very few.
2. For an in-memory postings list - two good alternatives
a. singly linked lists : Singly linked lists allow cheap insertion of documents into
postings lists, and naturally extend to more advanced indexing strategies such as skip
lists, which require additional pointers.
b. Variable length arrays : win in space requirements by avoiding the overhead for
pointers and in time requirements because their use of contiguous memory increases
speed on modern processors with memory caches. Variable length arrays will be more
compact and faster to traverse.
3. A hybrid scheme with a linked list of fixed length arrays for each term. When postings lists
are stored on disk, they are stored (perhaps compressed) as a contiguous run of postings
without explicit pointers, so as to minimize the size of the postings list and the number of
disk seeks to read a postings list into memory.
To process a query using an inverted index and the basic Boolean retrieval model, consider the simple conjunctive query Brutus AND Calpurnia over the inverted index partially shown in the figure.
Steps :
1. Locate Brutus in the Dictionary
2. Retrieve its postings
3. Locate Calpurnia in the Dictionary
4. Retrieve its postings
5. Intersect the two postings lists, as shown in Figure
Intersecting the postings lists for Brutus and Calpurnia Algorithm for the
intersection of two postings lists P1 and P2.
There is a simple and effective method of intersecting postings lists using the merge algorithm. we maintain
pointers into both lists and walk through the two postings lists simultaneously, in time linear in the total
number of postings entries. At each step, we compare the docID pointed to by both pointers. If they are the
same, we put that docID in the results list, and advance both pointers. Otherwise we advance the pointer
pointing to the smaller docID. To use this algorithm, the postings lists must be sorted by a single global ordering. Using a numeric sort by docID is one simple way to achieve this.
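A sketch of this merge-based intersection in Python (the docIDs in the sample postings lists are illustrative):

def intersect(p1, p2):
    """Linear-time intersection of two sorted postings lists."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # same docID: add to result, advance both pointers
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

# Postings for Brutus and Calpurnia (docIDs assumed for illustration).
print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))  # [2, 31]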
Query optimization
Case1:
Consider a query that is an AND of n terms, n > 2. For each of the terms, get its postings list, then AND them together. Start with the shortest postings list, then keep cutting the candidate set further. In the example, process first CAESAR, then CALPURNIA, then BRUTUS.
Case2:
Example query: (MADDING OR CROWD) AND (IGNOBLE OR STRIFE). Get frequencies for all terms, estimate the size of each OR by the sum of its frequencies (a conservative estimate), and process in increasing order of OR sizes.
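A sketch of this optimization in Python, reusing the intersect() function from the earlier sketch and assuming the same dictionary-of-postings index structure:

def intersect_many(terms, index):
    """AND query over several terms, processed in order of increasing df."""
    postings = sorted((index[t] for t in terms), key=len)  # shortest postings first
    result = postings[0]
    for p in postings[1:]:
        if not result:            # early termination: intersection already empty
            break
        result = intersect(result, p)
    return result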
2.4 Term weighting
A search engine should return the documents most likely to be useful to the searcher, in order. Ordering documents with respect to a query is called ranking.
Bag of Words Model: a document is represented by a count vector in N^|V| (one count per vocabulary term).
The exact ordering of the terms in a document is ignored, but the number of occurrences of each term matters.
Example: two documents with similar bag-of-words representations are treated as similar in content, so "Mary is quicker than John" has exactly the same representation as "John is quicker than Mary".
This is called the bag of words model. In a sense this is a step back: a positional index would be able to distinguish these two documents.
Raw term frequency is not what we want: a document with 10 occurrences of a term is more relevant than a document with 1 occurrence, but not 10 times more relevant. We therefore use log-frequency weighting.
Log-Frequency Weighting
The log-frequency weight of term t in document d is w(t,d) = 1 + log10(tf(t,d)) if tf(t,d) > 0, and 0 otherwise.
Inverse Document Frequency
Rare terms are more informative than frequent terms; to capture this we use the document frequency df(t).
Example: the rare word ARACHNOCENTRIC. A document containing this term is very likely to be relevant to the query ARACHNOCENTRIC, so we want a high weight for rare terms like ARACHNOCENTRIC.
Example: compare the document frequencies of the terms "insurance" and "try" (the original table of counts is not reproduced here).
Document frequency is more meaningful: as the example above suggests, the few documents that contain "insurance" should get a higher boost for a query on "insurance" than the many documents containing "try" get from a query on "try".
N: the total number of documents in the collection (for example :806,791 documents)
• IDF(t) is high if t is a rare term
• IDF(t) is likely low if t is a frequent term
idf_t = log10 (N / df_t); for example, with N = 1,000,000, idf_t = log10 (1,000,000 / df_t).
The combined tf-idf weight of a term is:
• High if t occurs many times in a small number of documents, i.e., it is highly discriminative for those documents
• Lower if t occurs few times in a document, or is frequent in many documents, i.e., it is less discriminative
• Lowest if t occurs in almost all documents, i.e., it provides no discrimination at all
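Putting the two components together, a small Python sketch of the resulting tf-idf weight (the numbers used are illustrative):

import math

def tf_idf(tf, df, N):
    """tf-idf weight: log-frequency weight of tf times the idf of the term."""
    if tf == 0 or df == 0:
        return 0.0
    w_tf = 1 + math.log10(tf)        # log-frequency weight
    idf = math.log10(N / df)         # inverse document frequency
    return w_tf * idf

print(tf_idf(tf=10, df=50, N=1_000_000))   # 2 * 4.30 ≈ 8.6 (frequent in doc, rare in collection)
print(tf_idf(tf=1,  df=1,  N=1_000_000))   # 1 * 6.0 = 6.0 (very rare term)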
Simple Query-Document Score
• Scoring checks whether or not a query term is present in a zone (zones are document features whose content can be arbitrary free text, for example titles and abstracts) within a document.
• When the query contains more than one term, the score for a document-query pair is the sum over the terms t that appear in both q and d:
Score(q, d) = sum over t in (q ∩ d) of tf-idf(t, d)
• Binary values are first replaced by count values, and these are later replaced by weights, giving a weight matrix.
Document represented by tf-idf weight vector
Queries as Vectors
Key idea 1: Represent queries as vectors in same space
Key idea 2: Rank documents according to their proximity to the query in this space, where proximity = similarity of vectors ≈ inverse of distance.
This gets us away from the Boolean model and lets us rank more relevant documents higher than less relevant documents.
Euclidean distance is a bad idea, because Euclidean distance is large for vectors of different lengths.
The Euclidean distance of q and d2 is large although the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
Cosine is a monotonically decreasing function of the angle for the interval [0◦, 180◦]
Dividing a vector by its L2 norm makes it a unit (length) vector (on surface of unit hypersphere). This maps
vectors onto the unit sphere . . .
As a result, longer documents and shorter documents have weights of the same order of magnitude. Effect
on the two documents d and d′ (d appended to itself) from earlier example : they have identical vectors
after length-normalization.
qi is the tf-idf weight of term i in the query;
di is the tf-idf weight of term i in the document;
cos(q, d) = (q · d) / (|q| |d|) = (sum of qi di) / (sqrt(sum of qi^2) × sqrt(sum of di^2)) is the cosine similarity of q and d = the cosine of the angle between q and d.
In reality:
- Length-normalize the document vector when the document is added to the index;
- Length-normalize the query vector;
For normalized vectors, the cosine is equivalent to the dot product or scalar product: cos(q, d) = q · d = sum of qi di.
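A minimal Python sketch of length normalization and cosine similarity (the weight vectors are hypothetical tf-idf values):

import math

def normalize(v):
    """L2-normalize a vector so that cosine reduces to a dot product."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v

def cosine(q, d):
    """Cosine similarity of a query vector and a document vector."""
    return sum(qi * di for qi, di in zip(normalize(q), normalize(d)))

q = [1.0, 0.0, 2.0]   # hypothetical tf-idf weights for the query terms
d = [2.0, 1.0, 4.0]   # hypothetical tf-idf weights for a document
print(round(cosine(q, d), 3))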
Log frequency weights: (the worked examples are not reproduced here)
We often use different weightings for queries and documents.
Notation: ddd.qqq (three letters for the document weighting, three for the query weighting).
Example: lnc.ltn
Document: logarithmic tf, no df weighting, cosine normalization.
Query: logarithmic tf, idf weighting, no normalization.
Example query: “best car insurance”
Example document: “car insurance auto insurance”
N=10,000,000
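A sketch of lnc.ltn scoring for this example in Python. Note that the document-frequency values below are assumptions made purely for illustration, since the original worked table is not reproduced here:

import math

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def lnc_ltn_score(query_tf, doc_tf, df, N):
    """lnc.ltn: document = log tf, no idf, cosine-normalized;
    query = log tf, idf, no normalization."""
    # document vector: lnc
    d = {t: log_tf(tf) for t, tf in doc_tf.items()}
    norm = math.sqrt(sum(w * w for w in d.values()))
    d = {t: w / norm for t, w in d.items()}
    # query vector: ltn
    q = {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in query_tf.items()}
    # dot product over terms shared by query and document
    return sum(q[t] * d.get(t, 0.0) for t in q)

N = 10_000_000
df = {"best": 50_000, "car": 10_000, "insurance": 1_000, "auto": 5_000}  # assumed df values
query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"car": 1, "insurance": 2, "auto": 1}
print(round(lnc_ltn_score(query_tf, doc_tf, df, N), 2))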
We need to deal with the format and language of each document. What format is it in? (pdf, word, excel, html, etc.) What language is it in? What character set is in use? Each of these is a classification problem.
Format/Language: Complications
▪ A single index usually contains terms of several languages.
▪ Sometimes a document or its components contain multiple
languages/formats.
▪ French email with Spanish pdf attachment
▪ What is the document unit for indexing?
▪ A file?
▪ An email?
▪ An email with 5 attachments?
▪ A group of files (ppt or latex in HTML)?
▪ Upshot: Answering the question “what is a document?” is not trivial and requires some
design decisions.
Determining the vocabulary of terms
1) Tokenization:
The task of splitting the document into pieces called tokens.
Some inputs are hard to tokenize consistently, for example:
▪ 3/20/91 date
▪ 20/3/91
▪ Mar 20, 1991
▪ B-52
▪ 100.2.86.144 IP address
▪ (800) 234-2333 Phone Number
▪ 800.234.2333
2) Normalization
▪ Need to “normalize” terms in indexed text as well as query terms into the same form.
▪ Example: We want to match U.S.A. and USA
▪ We most commonly implicitly define equivalence classes of terms.
▪ Alternatively: do asymmetric expansion
▪ window → window, windows
▪ windows → Windows, windows
▪ Windows (no expansion)
▪ More powerful, but less efficient
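A small Python sketch of crude tokenization plus normalization (the splitting and case-folding rules here are simplifying assumptions; real tokenizers treat dates, IP addresses and phone numbers as single tokens):

import re

def tokenize(text):
    """Crude tokenizer: split on runs of characters that are neither
    word characters nor periods."""
    return [t for t in re.split(r"[^\w.]+", text) if t]

def normalize(token):
    """Very simple normalization: case folding and dropping periods,
    so that U.S.A. and USA fall into the same equivalence class."""
    return token.lower().replace(".", "")

print([normalize(t) for t in tokenize("U.S.A. and USA, Mar 20, 1991")])
# ['usa', 'and', 'usa', 'mar', '20', '1991']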
Case Folding
Reduce all letters to lower case.
Possible exceptions: capitalized words in mid-sentence, e.g. MIT vs. mit, Fed vs. fed.
It's often best to lowercase everything, since users will use lowercase regardless of correct capitalization.
▪ Example where case matters: He got his PhD from MIT. → MIT ≠ mit
3) Stop words
▪ stop words = extremely common words which would appear to be of little value in helping
select documents matching a user need
▪ Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to,
was, were, will, with
▪ Stop word elimination used to be standard in older IR systems.
▪ But you need stop words for phrase queries, e.g. “King of Denmark”
▪ Most web search engines index stop words
o Definition of stemming: Crude heuristic process that chops off the ends of words in the hope
of achieving what “principled” lemmatization attempts to do with a lot of linguistic
knowledge.
o Language dependent
o Example : automate, automatic, automation all reduce to automat
Porter algorithm
▪ Most common algorithm for stemming English
▪ Results suggest that it is at least as good as other stemming options
▪ Conventions + 5 phases of reductions
▪ Phases are applied sequentially
▪ Each phase consists of a set of commands.
▪ Sample command: Delete final ement if what remains is longer than 1 character
▪ replacement → replac
▪ cement → cement
▪ Sample convention: Of the rules in a compound command, select the one that applies to the
longest suffix.
▪ Porter stemmer: a few rules
Rule          Example
SSES → SS     caresses → caress
IES → I       ponies → poni
SS → SS       caress → caress
S → (drop)    cats → cat
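A quick way to experiment with these rules is NLTK's implementation of the Porter stemmer (this assumes the nltk package is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "caress", "cats", "replacement", "cement"]:
    print(word, "->", stemmer.stem(word))
# e.g. caresses -> caress, ponies -> poni, cats -> cat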
Other stemmers: the Lovins stemmer and the Paice stemmer are common alternatives to Porter; the three are usually compared by applying them to the same sample text (the sample outputs are not reproduced here).
Lexical analysis
Objective: Determine the words of the document. Lexical analysis separates the input alphabet into:
• Word characters (e.g., the letters a-z)
• Word separators (e.g., space, newline, tab)
The following decisions may have an impact on retrieval:
• Digits: Used to be ignored, but the trend now is to identify numbers (e.g., telephone
numbers) and mixed strings as words.
• Punctuation marks: Usually treated as word separators.
• Hyphens: Should we interpret “pre-processing” as “pre processing” or as
“preprocessing”?
• Letter case: Often ignored, but then a search for “First Bank” (a specific bank) would retrieve a
document with the phrase “Bank of America was the first bank to offer its customers…”
Stopword Elimination
Objective: Filter out words that occur in most of the documents.
Such words have no value for retrieval purposes; they are referred to as stopwords.
They include
• Articles (a, an, the, …)
• Prepositions (in, on, of, …)
• Conjunctions (and, or, but, if, …)
• Pronouns (I, you, them, it…)
• Possibly some verbs, nouns, adverbs, adjectives (make, thing, similar, …)
• A typical stopword list may include several hundred words.
The 100 most frequent words add up to about 50% of the words in a document. Hence, stopword elimination significantly reduces the size of the indexing structures.
Stemming
Objective: Replace all the variants of a word with the single stem of the word. Variants include plurals,
gerund forms (ing-form), third person suffixes, past tense suffixes, etc.
Example: connect: connects, connected, connecting, connection,… All have
similar semantics and relate to a single concept.
In parallel, stemming must be performed on the user query.
Stemming improves storage and search efficiency: fewer terms are stored. Recall:
without stemming a query about “connection”, matches only documents that have “connection”.
With stemming, the query is about “connect” and matches in addition documents that originally had
“connects” , “connected” , “connecting”, etc.
However, stemming may hurt precision, because users can no longer target just a particular form.
Stemming may be performed using
o Algorithms that strip off suffixes according to substitution rules.
o Large dictionaries that provide the stem of each word.
Related issues: Index the title and abstract only, or the entire document? Should index terms be weighted?
Reducing the size of the index: Recall that articles, prepositions, conjunctions, pronouns have already been
removed through a stopword list. Recall that the 100 most frequent words account for 50% of all word
occurrences. Words that are very infrequent (occur only a few times in a collection) are often removed,
under the assumption that they would probably not be in the user's vocabulary. Reduction not based on
probabilistic arguments: Nouns are often preferred over verbs, adjectives, or adverbs.
Thesauri
Objective: Standardize the index terms that were selected. In its simplest form a thesaurus is A list of
“important” words (concepts). For each word, an associated list of synonyms. A thesaurus may be generic
(cover all of English) or concentrate on a particular domain of knowledge. The role of a thesaurus in
information retrieval is to
• Provide a standard vocabulary for indexing.
• Help users locate proper query terms.
• Provide hierarchies for automatic broadening or narrowing of queries.
Here, our interest is to provide a standard vocabulary (a controlled vocabulary). This is final stage, where
each indexing term is replaced by the concept that defines its thesaurus class.
A language model is a probability distribution over sequences of words. Given such a sequence, say of
length m, it assigns a probability to the whole sequence. These distributions can be
used to predict the likelihood that the next token in the sequence is a given word . These probability
distributions are called language models. It is useful in many natural language processing applications.
• Ex: part-of-speech tagging, speech recognition, machine translation, and information retrieval
In speech recognition, the computer tries to match sounds with word sequences. The language model
provides context to distinguish between words and phrases that sound similar. For example, in American
English, the phrases "recognize speech" and "wreck a nice beach" are pronounced almost the same but mean
very different things. These ambiguities are easier to resolve when evidence from the language model is
incorporated with the pronunciation model and the acoustic model.
This diagram shows a simple finite automaton and some of the strings in the language it generates.
→ shows the start state of the automaton and a double circle indicates a (possible) finishing state. For
example, the finite automaton
shown can generate strings that include the examples shown. The full set of strings that can be
generated is called the language of the automaton.
Probability that some text (e.g. a query) was generated by the model:
P(frog said that toad likes frog) = 0.01 x 0.03 x 0.04 x 0.01 x 0.02 x 0.01
(We ignore continue/stop probabilities assuming they are fixed for all queries)
A one-state finite automaton that acts as a unigram language model. We show a partial specification
of the state emission probabilities. If instead each node has a probability distribution over generating
different terms, we have a language model. The notion of a language model is inherently
probabilistic: it places a probability distribution over any sequence of words.
P(q | M2) = 0.0002 x 0.04 x 0.0001
P(q|M1) > P(q|M2)
=> M1 is more likely to generate query q
One simple language model is equivalent to a probabilistic finite automaton consisting of just a single node with a single probability distribution over producing different terms, so that the probabilities of all terms in the vocabulary sum to 1. After generating each word, we decide whether to stop or to loop around and then produce another word, and so the model also requires a probability of stopping in the finishing state.
A unigram model used in information retrieval can be treated as the combination of several one-
state finite automata.
It splits the probabilities of different terms in a context. In this model, the probability of each word depends only on that word itself, so the model is a product of one-state finite automata. For each automaton, there is only one way to reach its single state, with one associated probability, and over the whole model the probabilities of all words sum to 1. What follows is an illustration of a unigram model of a document.
For example, consider two language models M1 and M2 with their word emission probabilities:
Partial specification of two unigram language models
To find the probability of a word sequence (given a model), multiply the probability of each word by the probability of continuing or stopping after producing each word.
For example, the probability of a particular string/document is usually a very small number. In the example we stopped after generating frog the second time.
The first line of numbers are the term emission probabilities,
the second line gives the probability of continuing (0.8) or stopping (0.2) after generating each word.
To compare two models for a data set, we can calculate their likelihood ratio , which results from
simply dividing the probability of the data according to one model by the probability of the data
according to the other model.
In information retrieval contexts, unigram language models are often smoothed to avoid instances where
P(term) = 0. A common approach is to generate a maximum-likelihood model for the entire collection and
linearly interpolate the collection model with a maximum-likelihood model for each document to create a
smoothed document model.
Ex: In a bigram (n = 2) language model, the probability of the sentence "I saw the red house" is approximated as
P(I saw the red house) ≈ P(I | <s>) P(saw | I) P(the | saw) P(red | the) P(house | red) P(</s> | house)
where <s> and </s> mark the beginning and end of the sentence.
Most language-modeling work in IR has used unigram language models. IR is not the place where most
immediately need complex language models, since IR does not directly depend on the structure of sentences
to the extent that other tasks like speech recognition do. Unigram models are often sufficient to judge the
topic of a text.
Algorithm (query likelihood model):
1. Infer a language model Md for each document d
2. Estimate P(q | Md), the probability of generating the query according to each document model
3. Rank the documents by these probabilities
Smoothing
• Decreasing the estimated probability of seen events and increasing the probability of unseen events
is referred to as smoothing
• The role of smoothing in this model is not only to avoid zero probabilities. The smoothing of
terms actually implements major parts of the term weighting component. Thus, we need to
smooth probabilities in our document language models:
to discount non-zero probabilities
to give some probability mass to unseen words.
• The probability of a non occurring term should be close to its probability to occur in the collection
P(t|Mc) = cf(t)/T
cf(t) = #occurrences of term t in the collection
T – length of the collection = sum of all document lengths
The general approach is that a non-occurring term should still be possible in a query, but its probability should be close to, and no higher than, what would be expected by chance from the whole collection.
Smoothing Methods
Linear interpolation (mixture model):
P(t | d) = λ P(t | Md) + (1 − λ) P(t | Mc)
▪ Mixes the probability from the document with the general collection frequency of the word.
▪ High value of λ: "conjunctive-like" search – tends to retrieve documents containing all query words.
▪ Low value of λ: more disjunctive, suitable for long queries.
▪ Correctly setting λ is very important for good performance.
Bayesian (Dirichlet) smoothing:
P(t | d) = (tf(t,d) + μ P(t | Mc)) / (Ld + μ), where Ld is the document length and μ is a parameter.
Summary, with linear interpolation: P(q | d) is proportional to the product over query terms t of [λ P(t | Md) + (1 − λ) P(t | Mc)].
In practice, the log is taken of both sides of the equation to avoid multiplying many small numbers.
Example 1:
Question: Suppose the document collection contains two documents:
d1: Xerox reports a profit but revenue is down
d2: Lucent narrows quarter loss but revenue decreases further
A user submits the query: "revenue down".
Rank d1 and d2 using an MLE unigram model and linear interpolation smoothing with λ = 0.5.
Solution: the collection has 16 tokens in total, with cf(revenue) = 2 and cf(down) = 1.
P(q | d1) = [0.5 · (1/8) + 0.5 · (2/16)] · [0.5 · (1/8) + 0.5 · (1/16)] = (1/8) · (3/32) = 3/256 ≈ 0.0117
P(q | d2) = [0.5 · (1/8) + 0.5 · (2/16)] · [0.5 · (0/8) + 0.5 · (1/16)] = (1/8) · (1/32) = 1/256 ≈ 0.0039
Ranking: d1 > d2.
Example2:
Collection: d1 and d2
d1 : Jackson was one of the most talented entertainers of all time
d2: Michael Jackson anointed himself King of Pop Query q:
Michael Jackson
Use mixture model with λ = 1/2
P(q|d1) = [(0/11 + 1/18)/2] · [(1/11 + 2/18)/2] ≈ 0.003
P(q|d2) = [(1/7 + 1/18)/2] · [(1/7 + 2/18)/2] ≈ 0.013
Ranking: d2 > d1
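A short Python sketch of query-likelihood ranking with this mixture (linear interpolation) smoothing; it reproduces the numbers of Example 2 above:

def p_query(query, doc_tokens, collection_tokens, lam=0.5):
    """Probability of the query under a smoothed document language model."""
    p = 1.0
    for term in query:
        p_doc = doc_tokens.count(term) / len(doc_tokens)                 # P(t | Md), MLE
        p_col = collection_tokens.count(term) / len(collection_tokens)   # P(t | Mc)
        p *= lam * p_doc + (1 - lam) * p_col
    return p

d1 = "Jackson was one of the most talented entertainers of all time".lower().split()
d2 = "Michael Jackson anointed himself King of Pop".lower().split()
collection = d1 + d2
query = "Michael Jackson".lower().split()

print(round(p_query(query, d1, collection), 3))   # ≈ 0.003
print(round(p_query(query, d2, collection), 3))   # ≈ 0.013 -> d2 ranked above d1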
2) Document likelihood model
Here we build a language model Mq from the query and rank by P(d | Mq), the probability of the query language model generating the document. The problem is that queries are short, so there is much less text available to estimate the language model; the model is therefore estimated worse and depends more heavily on smoothing with some other language model. Zhai and Lafferty (2001) suggest expanding the query with terms taken from relevant documents in the usual way, and then updating the query language model.
3) Model comparison
Make a language model from both the query and the document, and measure how different these two LMs are from each other, using KL divergence.
KL divergence (Kullback–Leibler (KL) divergence)
An asymmetric divergence measure from information theory, which measures how bad the probability distribution Mq is at modeling Md.
Rank by KL divergence: the closer to 0, the higher the rank.
Given a query q, there exists a subset of the documents R which are relevant to q
But membership of R is uncertain. A probabilistic retrieval model ranks documents in decreasing order of probability of relevance to the information need: P(R | q, di).
Users begin with information needs, which they translate into query representations. Similarly, there are documents, which are converted into document representations. Given only a query, an IR system has an uncertain understanding of the information need. So IR is an uncertain process, because:
• Information need to query
• Documents to index terms
• Query terms and index terms mismatch
Probability theory provides a principled foundation for such reasoning under uncertainty. This model
provides how likely a document is relevant to an information need.
Probabilistic methods are one of the oldest but also one of the currently hottest topics in IR .
Probabilistic IR Models :
For an event A, the probability of the event lies in the range 0 ≤ P(A) ≤ 1. For two events A and B:
▪ Partition rule: if B can be divided into an exhaustive set of disjoint sub cases, then P(B) is the sum
of the probabilities of the sub cases. A special case of this rule gives:
▪ Bayes' Rule for inverting conditional probabilities: P(A | B) = P(B | A) P(A) / P(B)
In Ranked retrieval setup, for a given collection of documents, the user issues a query, and an ordered
list of documents is returned. Assume a binary notion of relevance: Rd,q is a random dichotomous variable (a categorical variable that can take on exactly two values is termed a binary or dichotomous variable), such that Rd,q = 1 if document d is relevant to query q, and Rd,q = 0 otherwise.
Probabilistic ranking orders documents decreasingly by their estimated probability of relevance w.r.t. query: P(R
= 1|d, q)
PRP in brief
If the retrieved documents (w.r.t a query) are ranked decreasingly on their probability of relevance,
then the effectiveness of the system will be the best that is obtainable
PRP in full
If [the IR] system's response to each [query] is a ranking of the documents [...] in order of decreasing
probability of relevance to the [query], where the probabilities are estimated as accurately as possible on
the basis of whatever data have been made available to the system for this purpose, the overall
effectiveness of the system to its user will be the best that is obtainable on the basis of those data.
1/0 loss :
• Either returning a nonrelevant document or failing to return a relevant document is called a 1/0 loss.
• The goal is to return the best possible results as the top k documents, for any value of k the user
chooses to examine.
• The PRP then says to simply rank all documents in decreasing order of P(R = 1 | d, q). If a set of retrieval results is to be returned, rather than an ordering, the Bayes Optimal Decision Rule, the decision which minimizes the risk of loss, is to simply return the documents that are more likely relevant than nonrelevant:
retrieve d iff P(R = 1 | d, q) > P(R = 0 | d, q)
Such a model gives a formal framework where we can model differential costs of false positives and false
negatives and even system performance issues at the modeling stage, rather than simply at the evaluation
stage.
To make a probabilistic retrieval strategy precise, need to estimate how terms in documents contribute
to relevance
1) Find measurable statistics (term frequency, document frequency, document length) that affect
judgments about document relevance
2) Combine these statistics to estimate the probability of document relevance
3) Order documents by decreasing estimated probability of relevance P(R|d, q)
4) Assume that the relevance of each document is independent of the relevance of other documents (not
true, in practice allows duplicate results)
Binary Independence Model (BIM)
Given a query q, ranking documents by P(R = 1 | d, q) is modeled under BIM as ranking them by P(R = 1 | x, q), where x is the binary term-incidence vector representing d.
It is at this point that we make the Naive Bayes conditional independence assumption that the presence or absence of a word in a document is independent of the presence or absence of any other word (given the query), so the probability of x factors into a product of per-term probabilities. So:
Let pt = P(xt = 1 | R = 1, q) be the probability of a term appearing in a document relevant to the query, and let ut = P(xt = 1 | R = 0, q) be the probability of a term appearing in a nonrelevant document. These quantities can be displayed as a contingency table:
Additional simplifying assumption: terms not occurring in the query are equally likely to occur in relevant
and irrelevant documents, If qt = 0, then pt = ut
Now we need only to consider terms in the products that appear in the query:
The left product is over query terms found in the document and the right product is over query terms not found in
the document.
Including the query terms found in the document into the right product, but
simultaneously dividing through by them in the left product, so the value is unchanged
The left product is still over query terms found in the document, but the right product is now over all query
terms, hence constant for a particular query and can be ignored.
→ The only quantity that needs to be estimated to rank documents w.r.t a query is the left product, Hence
the Retrieval Status Value (RSV) in this model:
Equivalent: rank documents using the log odds ratios ct for the terms in the query:
ct = log [ pt (1 − ut) / ( ut (1 − pt) ) ]
The odds ratio is the ratio of two odds: (i) the odds of the term appearing if the document is relevant (pt/(1 − pt)), and (ii) the odds of the term appearing if the document is nonrelevant (ut/(1 − ut)).
• ct = 0: the term has equal odds of appearing in relevant and nonrelevant docs
• ct positive: higher odds to appear in relevant documents
• So BIM and vector space model are identical on an operational level , except that the term weights
are different. In particular: we can use the same data structures (inverted index etc) for the two
models.
For each term t in a query, estimate ct in the whole collection using a contingency table of counts of
documents in the collection, where df t is the number of documents that contain term t:
Avoiding Zeroes:
If any of the counts is zero, then the term weight is not well-defined: maximum likelihood estimates do not work for rare events. To avoid zeros, add 0.5 to each count (expected likelihood estimation = ELE).
Assuming that relevant documents are a very small percentage of the collection, we can approximate the statistics for nonrelevant documents by statistics from the whole collection. Hence, ut (the probability of term occurrence in nonrelevant documents for a query) is dft/N, and
log [(1 − ut)/ut] = log [(N − dft)/dft] ≈ log (N/dft),
which is the familiar idf weight.
So the probabilistic model and the vector space model are not that different: in either case you build an information retrieval scheme in essentially the same way.
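A sketch of the resulting term weight ct in Python, using the add-0.5 (ELE) estimates described above; the counts passed in are illustrative:

import math

def rsj_weight(N, df_t, S=0, s=0):
    """BIM term weight c_t with add-0.5 (ELE) smoothing.
    N: documents in the collection; df_t: documents containing term t;
    S: known relevant documents; s: known relevant documents containing t."""
    p_t = (s + 0.5) / (S + 1.0)                  # estimate of P(term present | relevant)
    u_t = (df_t - s + 0.5) / (N - S + 1.0)       # estimate of P(term present | nonrelevant)
    return math.log10(p_t / (1 - p_t)) + math.log10((1 - u_t) / u_t)

# With no relevance information the weight behaves like an idf:
print(round(rsj_weight(N=1_000_000, df_t=1_000), 2))              # ≈ 3.0 ≈ log10(N/df_t)
# With feedback (say 15 of 20 known relevant docs contain t) the weight increases:
print(round(rsj_weight(N=1_000_000, df_t=1_000, S=20, s=15), 2))  # ≈ 3.46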
For probabilistic IR, in the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory. The next step is to add term frequency and document length normalization to the probabilistic model.
Definition : Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical
technique called singular value decomposition (SVD) to identify patterns in the relationships between the
terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words
that are used in the same contexts tend to have similar meanings.
In LSI
▪ Map documents (and terms) to a low-dimensional representation.
▪ Design a mapping such that the low-dimensional space reflects semantic associations
(latent semantic space).
▪ Compute document similarity based on the inner product in this latent semantic space
▪ We will decompose the term-document matrix into a product of matrices.
▪ The particular decomposition we'll use: singular value decomposition (SVD).
▪ The term matrix U – consists of one (row) vector for each term
▪ The document matrix VT – consists of one (column) vector for each document
▪ The singular value matrix Σ – diagonal matrix with singular values, reflecting
importance of each dimension
We then use the SVD to compute a new, improved term-document matrix C′ of reduced rank, which gives better similarity values than C.
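A minimal sketch of this truncated-SVD step in Python/NumPy, using a small toy Boolean term-document matrix as a stand-in for C:

import numpy as np

C = np.array([[1, 0, 1, 0, 0, 0],      # rows = terms, columns = documents
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 2
Sk = np.diag(s[:k])                    # keep only the k most important dimensions
C2 = U[:, :k] @ Sk @ Vt[:k, :]         # rank-k approximation of C
docs_k = Sk @ Vt[:k, :]                # k-dimensional document representations

# cosine similarity of documents 1 and 2 in the latent semantic space
d1, d2 = docs_k[:, 0], docs_k[:, 1]
print(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))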
Example:
The matrix U
One row per term, one column per min(M, N), where M is the number of terms and N is the number of documents. This is an orthonormal matrix: (i) row vectors have unit length; (ii) any two distinct row vectors are orthogonal to each other. Think of the dimensions as "semantic" dimensions that capture distinct topics like politics, sports, economics. Each number uij in the matrix indicates how strongly related term i is to the topic represented by semantic dimension j.
The matrix C
This is a standard term-document matrix. Actually, we use a non-weighted matrix here to simplify the example.
The matrix Σ
consists of the singular values of C. The magnitude of a singular value measures the importance of the corresponding semantic dimension. We'll make use of this by omitting unimportant dimensions.
The matrix VT
One column per document, one row per min(M,N) where M is the number of terms and N is the number
of documents. Again: This is an orthonormal matrix: (i) Column vectors have unit length. (ii) Any two
distinct column vectors are orthogonal to each other. These are again the semantic dimensions from the
term matrix U that capture distinct topics like politics, sports, economics. Each number vij in the matrix
indicates how strongly related document i is to the topic represented by semantic dimension j .
Actually, we only zero out singular values in Σ. This has the effect of setting the corresponding dimensions in U and VT to zero when computing the product C = UΣVT.
We can view C2 as a two-dimensional representation of the matrix. We have performed a dimensionality
reduction to two dimensions.
0.52 × 0.28 + 0.36 × 0.16 + 0.72 × 0.36 + 0.12 × 0.20 + (−0.39) × (−0.08) ≈ 0.52
LSI takes documents that are semantically similar (= talk about the same topics), but are not similar
in the vector space (because they use different words) and re- represents them in a reduced vector
space in which they have higher similarity.
Thus, LSI addresses the problems of synonymy and semantic relatedness. In the standard vector space, synonyms contribute nothing to document similarity.
How LSI addresses synonymy and semantic relatedness
▪ Relevance feedback and query expansion are used to increase recall in information retrieval
– if query and documents have (in the extreme case) no terms in common.
▪ LSI increases recall and hurts precision.
▪ Thus, it addresses the same problems as (pseudo) relevance feedback and query expansion
...
Interactive relevance feedback: improve initial retrieval results by telling the IR system which docs are
relevant / nonrelevant . Best known relevance feedback method: Rocchio feedback
Query expansion: improve retrieval results by adding synonyms / related terms to the query. Sources for
related terms: Manual thesauri, automatic thesauri, query logs.Two ways of improving recall: relevance
feedback and query expansion
Synonymy : In most collections, the same concept may be referred to using different words. This is
called as synonymy ,
If the query and a document d share no terms, a pure term-matching model assigns them similarity zero, even if d is the most relevant document for q! We want to change this:
Return relevant documents even if there is no term match with the (original) query.
Recall:
▪ Here "recall" is used loosely, as "increasing the number of relevant documents returned to the user".
▪ Query expansion may actually decrease recall on some measures, e.g., when expanding "jaguar" with "panthera" . . .
▪ . . . which eliminates some relevant documents, but increases the relevant documents returned on the top pages.
2) Local methods adjust a query relative to the documents that initially appear to match the query. The
basic methods are:
• Relevance feedback
• Pseudo relevance feedback, also known as Blind relevance feedback
• (Global) indirect relevance feedback
The idea of relevance feedback (RF) is to involve the user in the retrieval process so as to improve the final result set. In particular, the user gives feedback on the relevance of documents in an initial set of results. The basic procedure is:
• The user issues a (short, simple) query.
• The system returns an initial set of retrieval results.
• The user marks some returned documents as relevant or nonrelevant.
• The system computes a better representation of the information need based on this feedback.
• The system displays a revised set of retrieval results.
We can iterate this: several rounds of relevance feedback. We will use the term ad hoc retrieval to refer to regular retrieval without relevance feedback. We will now look at two different examples of relevance feedback that highlight different aspects of the process.
Example1 : Image search engine http://nayana.ece.ucsb.edu/imsearch/imsearch.html
Example 2: A real (non-image) example - shows a textual IR example where the user wishes to find out
about new applications of space satellites.
Initial query: [new space satellite applications] Results for initial query: (r = rank)
3 0.528 Science Panel Backs NASA Satellite Plan, But Urges Launches of Smaller
Probes
5 0.525 Scientist Who Exposed Global Warming Proposes Satellites for Climate
Research
6 0.524 Report Provides Support for the Critics Of Using Big Satellites to Study
Climate
Expanded query after relevance feedback (only the first terms and their weights are shown): 2.074 new, 15.106 space, ...
The centroid is the center of mass of a set of points. Recall that we represent documents as points in a high-dimensional space. Thus, we can compute centroids of documents.
Definition: the centroid of a set of documents D is μ(D) = (1/|D|) × the sum over d in D of the vector of d.
Example: (the figure is not reproduced here)
• The problem is that we don't know the truly relevant docs. The (theoretically optimal) query is obtained by moving the centroid of the relevant documents by the difference between the two centroids.
Rocchio 1971 algorithm
▪ We can modify the query based on relevance feedback and apply standard vector space
model. Use only the docs that were marked. Relevance feedback can improve recall and
precision. Relevance feedback is most useful for increasing recall in situations where recall is important. Users can be expected to review results and to take time to iterate.
qm = α q0 + β (1/|Dr|) × (sum of dj over dj in Dr) − γ (1/|Dnr|) × (sum of dj over dj in Dnr)
qm: modified query vector; q0: original query vector; Dr and Dnr: sets of known relevant and nonrelevant documents respectively; α, β, and γ: weights.
▪ New query moves towards relevant documents and away from nonrelevant documents.
▪ Tradeoff α vs. β/γ: If we have a lot of judged documents, we want a higher β/γ.
▪ A "negative weight" for a term doesn't make sense in the vector space model, so negative components of qm are set to 0.
For example, set β = 0.75, γ = 0.25 to give higher weight to positive feedback. Many systems only allow positive feedback (γ = 0).
To compute the Rocchio vector (summarizing the original sequence of figures): the centroid of the relevant documents alone does not separate relevant from nonrelevant documents; we also compute the centroid of the nonrelevant documents and take the difference vector between the two centroids; adding (a multiple of) this difference vector to the query yields a vector that separates relevant and nonrelevant documents perfectly.
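A compact Python sketch of the Rocchio update (the query and document vectors are hypothetical tf-idf vectors; clipping negative weights to zero follows the remark above):

import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """q0: original query vector; relevant/nonrelevant: lists of judged document vectors."""
    qm = alpha * q0
    if relevant:
        qm = qm + beta * np.mean(relevant, axis=0)       # move towards relevant centroid
    if nonrelevant:
        qm = qm - gamma * np.mean(nonrelevant, axis=0)   # move away from nonrelevant centroid
    return np.maximum(qm, 0.0)   # negative term weights make no sense: clip to 0

q0 = np.array([1.0, 1.0, 0.0, 0.0])
rel = [np.array([1.0, 0.5, 1.0, 0.0])]
nonrel = [np.array([0.0, 1.0, 0.0, 1.0])]
print(rocchio(q0, rel, nonrel))   # [1.75, 1.125, 0.75, 0.0]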
Pseudo-relevance feedback algorithm:
• Retrieve a ranked list of documents for the user's query in the normal way.
• Assume that the top k retrieved documents are relevant.
• Do relevance feedback (e.g., Rocchio) with these pseudo-relevant documents.
Works very well on average, But can go horribly wrong for some queries. Several iterations can cause
query drift.
Cornell SMART system: number of relevant documents found in the top 100 results, summed over 50 queries (so the maximum possible is 5000):
lnc.ltc          3210
lnc.ltc-PsRF     3634
Lnu.ltu          3709
Lnu.ltu-PsRF     4350
The results contrast two length normalization schemes (L vs. l) and pseudo-relevance feedback (PsRF). The pseudo-relevance feedback method used added only 20 terms to the query (Rocchio would add many more). This demonstrates that pseudo-relevance feedback is effective on average.
Query Expansion
▪ Query expansion is another method for increasing recall. We use “global query expansion” to refer to
“global methods for query reformulation”. In global query expansion, the query is modified based on
some global resource, i.e. a resource that is not query-dependent. Main information we use: (near-
)synonymy. A publication or database that collects (near-)synonyms is called a thesaurus.
▪ There are two types of thesauri:
▪ manually created
▪ automatically created.
Example
Types of query expansion
For each term t in the query, expand the query with words the thesaurus lists as semantically related
with t.
Generally increases recall; may significantly decrease precision, particularly with ambiguous terms.
Widely used in specialized search engines for science and engineering. It's very expensive to create a manual thesaurus and to maintain it over time. A manual thesaurus has an effect roughly equivalent to annotation with a controlled vocabulary.
▪ Definition 1: Two words are similar if they co-occur with similar words.
▪ Definition 2: Two words are similar if they occur in a given grammatical relation with the same words.
▪ You can harvest, peel, eat, prepare, etc. apples and pears, so apples and pears must be similar.
▪ Quality of associations is usually a problem. Term ambiguity may introduce irrelevant statistically
correlated terms.
▪ "Apple computer" → "Apple red fruit computer"
▪ Problems:
▪ False positives: Words deemed similar that are not
▪ False negatives: Words deemed dissimilar that are similar
▪ Since terms are highly correlated anyway, expansion may not retrieve many additional documents.
▪ Co-occurrence is more robust, grammatical relations are more accurate.
A co-occurrence thesaurus can be computed from the term-term correlation matrix C = A Aᵀ, where A is the term-document matrix and wi,j is the (normalized) weight for the pair (ti, dj).
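A small NumPy sketch of this construction (the weight matrix A is hypothetical):

import numpy as np

A = np.array([[1.0, 1.0, 0.0],     # rows = terms, columns = documents
              [1.0, 0.0, 1.0],     # hypothetical weights w_ij
              [0.0, 1.0, 1.0]])

# normalize rows so that each term vector has unit length
A = A / np.linalg.norm(A, axis=1, keepdims=True)

C = A @ A.T                        # C[u, v] = similarity between terms u and v
print(np.round(C, 2))              # diagonal = 1; off-diagonal = term-term similarity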