IR Unit 2
Boolean and vector-space retrieval models – Term weighting – TF-IDF weighting – Cosine
similarity – Preprocessing – Inverted indices – Efficient processing with sparse vectors –
Language-model-based IR – Probabilistic IR – Latent semantic indexing – Relevance
feedback and query expansion.
2.1 Introduction
Modeling
Modeling in IR is a complex process aimed at producing a ranking function
Ranking function: a function that assigns scores to documents with regard to a given
query.
This process consists of two main tasks:
The conception of a logical framework for representing documents and queries
The definition of a ranking function that allows quantifying the similarities among
documents and queries
IR systems usually adopt index terms to index and retrieve documents
IR Model Definition:
An IR model is a quadruple [D, Q, F, R(qi, dj)], where D is a set of document representations, Q is a set of query representations, F is a framework for modeling documents, queries, and their relationships, and R(qi, dj) is a ranking function that associates a real number with a query qi and a document representation dj.
A Taxonomy of IR Models
Retrieval models are most frequently associated with distinct combinations of a document
logical view and a user task. The user's tasks include retrieval and browsing. Retrieval can take two forms:
i) Ad Hoc Retrieval:
The documents in the collection remain relatively static while new queries are submitted
to the system.
ii) Filtering
The queries remain relatively static while new documents come into the system
Classic IR model:
Each document is described by a set of representative keywords called index
terms. Numerical weights are assigned to index terms to capture how relevant each term is to the document.
Three classic models: Boolean, vector, probabilistic
Boolean Model :
The Boolean retrieval model is a model for information retrieval in which we can
pose any query which is in the form of a Boolean expression of terms, that is, in
which terms are combined with the operators AND, OR, and NOT. The model views
each document as just a set of words. Based on a binary decision criterion without
any notion of a grading scale. Boolean expressions have precise semantics.
Vector Model
Assign non-binary weights to index terms in queries and in documents. Compute
the similarity between documents and query. More precise than Boolean model.
Probabilistic Model
The probabilistic model tries to estimate the probability that the user will find the
document dj relevant, ranking documents by the odds
P(dj relevant to q) / P(dj nonrelevant to q)
Given a user query q, and the ideal answer set R of the relevant documents, the
problem is to specify the properties for this set. Assumption (probabilistic
principle): the probability of relevance depends on the query and document
representations only; ideal answer set R should maximize the overall probability of
relevance
Basic Concepts
Each document is represented by a set of representative keywords or index terms
Index term:
In a restricted sense: it is a keyword that has some meaning on its own;
usually plays the role of a noun
In a more general form: it is any word that appears in a document
Let t be the number of index terms in the document collection and ki be a generic index term.
Then the vocabulary V = {k1, . . . , kt} is the set of all distinct index terms in the collection.
The collection can be summarized by a term-document frequency matrix, where each element fi,j stands for the frequency of term ki in document dj.
Logical view of a document: from full text to a set of index terms
Example :
A fat book which many people own is Shakespeare's Collected Works.
Problem : To determine which plays of Shakespeare contain the words Brutus AND
Caesar AND NOT Calpurnia.
The simplest approach is to scan the documents linearly for each query (grepping through the text); this works given the speed of modern computers, and often allows useful possibilities for wildcard pattern matching through the use of regular expressions. Beyond the simplest needs, however, we instead build a term-document incidence matrix, with one 0/1 vector per term indicating which documents contain it.
To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors
for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise
AND:
110100 AND 110111 AND 101111 = 100100
Results from Shakespeare for the query Brutus AND Caesar AND NOT
Calpurnia.
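As a small illustration (not from the text), the bitwise evaluation above can be sketched in Python; here Calpurnia's incidence vector is 010000, whose complement 101111 appears in the computation shown:

```python
# Sketch: Boolean query evaluation over a term-document incidence matrix,
# using the illustrative 6-document term vectors from the Shakespeare example.
incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,   # its complement is the 101111 used above
}

def and_and_not(term1, term2, negated, matrix, n_docs=6):
    """Evaluate term1 AND term2 AND NOT negated with bitwise operations."""
    mask = (1 << n_docs) - 1                        # all-ones mask for n_docs bits
    result = matrix[term1] & matrix[term2] & (~matrix[negated] & mask)
    return format(result, f"0{n_docs}b")

print(and_and_not("Brutus", "Caesar", "Calpurnia", incidence))   # -> 100100
```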
Consider N = 10^6 documents, each with about 1000 tokens ⇒ a total of 10^9 tokens.
At an average of 6 bytes per token (including spaces and punctuation) ⇒ the size of the document
collection is about 6 · 10^9 bytes = 6 GB.
Assume there are M = 500,000 distinct terms in the collection.
The term-document incidence matrix then has M × N = 500,000 × 10^6 = half a trillion 0s and 1s.
But the matrix has no more than one billion 1s, so it is extremely sparse. What is a better
representation? We record only the 1s: an inverted index.
i) Dictionary / vocabulary / lexicon: we use "dictionary" for the data structure and
"vocabulary" for the set of terms. The dictionary in the figure has been sorted alphabetically.
ii) Postings: for each term, we keep a list of the IDs of the documents in which the term appears.
This list is called a postings list (or inverted list), and each postings list is sorted by
document ID.
These are the two parts of an inverted index. The dictionary is commonly kept in memory, with
pointers to each postings list, which is stored on disk.
To build the index, the major steps are:
1. Collect the documents to be indexed.
2. Tokenize the text, turning each document into a list of tokens.
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms.
4. Index the documents that each term occurs in, creating an inverted index consisting of a dictionary and postings.
DocID :
Each document has a unique serial number, known as the document
identifier ( docID ). During index construction, simply assign successive
integers to each new document when it is first encountered.
Input Dictionary & posting :
The input to indexing is a list of normalized tokens for each document,
which we can equally think of as a list of (term, docID) pairs. The core
indexing step is sorting this list. Multiple occurrences of the same term
from the same document are then merged, instances of the same term are
grouped, and the result is split into a dictionary and postings.
Document Frequency :
The dictionary records some statistics, such as the number of documents
that contain each term (its document frequency). This information is not vital for a basic Boolean
search engine, but it allows us to improve the efficiency of the search engine
at query time, and it is a statistic later used in many ranked retrieval
models. The postings are secondarily sorted by docID. This provides the
basis for efficient query processing.
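A minimal sketch (in Python, with illustrative names) of this sort-based construction: collect (term, docID) pairs, sort them, merge duplicates, and split the result into a dictionary with document frequencies and sorted postings lists:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping docID -> list of normalized tokens.
    Returns dict: term -> (document frequency, sorted postings list)."""
    # 1. Collect (term, docID) pairs.
    pairs = [(term, doc_id) for doc_id, tokens in docs.items() for term in tokens]
    # 2. Sort by term, then by docID.
    pairs.sort()
    # 3. Merge duplicates: group docIDs per term, skipping repeated pairs.
    postings = defaultdict(list)
    for term, doc_id in pairs:
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    # 4. Split into dictionary (term -> df) and postings lists.
    return {term: (len(plist), plist) for term, plist in postings.items()}

docs = {1: ["i", "did", "enact", "julius", "caesar"],
        2: ["so", "let", "it", "be", "with", "caesar"]}
index = build_inverted_index(docs)
print(index["caesar"])   # (2, [1, 2])
```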
Storage (dictionary & postings lists) :
1. A fixed length array would be wasteful as some words occur in many
documents, and others in very few.
2. For an in-memory postings list - two good alternatives
a. singly linked lists : Singly linked lists allow cheap insertion of
documents into postings lists, and naturally extend to more advanced
indexing strategies such as skip lists, which require additional
pointers.
b. Variable-length arrays: these win in space requirements by avoiding the
overhead of pointers, and in time requirements because their use of
contiguous memory increases speed on modern processors with
memory caches. Variable-length arrays are more compact and
faster to traverse.
3. A hybrid scheme with a linked list of fixed length arrays for each term. When
postings lists are stored on disk, they are stored (perhaps compressed) as a
contiguous run of postings without explicit pointers, so as to minimize the
size of the postings list and the number of disk seeks to read a postings list
into memory.
To process a query using an inverted index and the basic Boolean retrieval model,
consider the simple conjunctive query Brutus AND Calpurnia over the
inverted index partially shown in the figure.
Steps :
1. Locate Brutus in the Dictionary
2. Retrieve its postings
3. Locate Calpurnia in the Dictionary
4. Retrieve its postings
5. Intersect the two postings lists, as shown in Figure
There is a simple and effective method of intersecting postings lists using the merge
algorithm: we maintain pointers into both lists and walk through the two postings lists
simultaneously, in time linear in the total number of postings entries. At each step, we
compare the docID pointed to by both pointers. If they are the same, we put that docID
in the results list and advance both pointers. Otherwise we advance the pointer pointing
to the smaller docID. To use this algorithm, postings must be sorted by a single global
ordering; a numeric sort by docID is one simple way to achieve this.
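A sketch of this merge-based intersection in Python; the two postings lists below are illustrative, sorted by docID:

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # same docID: keep it, advance both pointers
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

# e.g. Brutus AND Calpurnia over two example postings lists:
print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))   # [2, 31]
```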
Query optimization
Case 1:
Consider a query that is an AND of n terms, n > 2.
For each of the terms, get its postings list, then AND them together.
Start with the shortest postings list, then keep cutting further.
In the example of the referenced figure: first CAESAR, then CALPURNIA, then BRUTUS.
Case 2:
Example query: (MADDING OR CROWD) AND (IGNOBLE OR STRIFE).
Get frequencies for all terms, estimate the size of each OR by the sum of its frequencies
(a conservative estimate), and process the ORs in increasing order of their estimated sizes.
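A sketch of Case 1 in Python, reusing the intersect function from the previous sketch; the `index` structure (term -> (document frequency, postings list)) is assumed, not defined in the text:

```python
def process_and_query(terms, index):
    """Process an AND query over n terms, rarest (shortest postings) term first."""
    # Sort query terms by document frequency (size of their postings lists).
    terms = sorted(terms, key=lambda t: index[t][0])
    result = index[terms[0]][1]
    for t in terms[1:]:
        if not result:            # early termination: intersection already empty
            break
        result = intersect(result, index[t][1])
    return result
```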
2.4 Term weighting
A search engine should return, in order, the documents most likely to be useful to
the searcher. Achieving this means ordering documents with respect to a query;
this is called ranking.
Each document is represented by a count vector ∈ N^|V|, with one component per vocabulary term.
Bag of Words Model
The exact ordering of the terms in a document is ignored but the number of occurrences
of each term is important.
Example: two documents with the same bag-of-words representation are treated as having the same
content:
“Mary is quicker than John” and “John is quicker than Mary”
This is called the bag of words model. In a sense this is a step back: a positional index is
able to distinguish these two documents.
Raw term frequency is not what we want: a document with 10 occurrences of a term is
more relevant than a document with 1 occurrence of that term, but not 10 times more
relevant. We therefore use log-frequency weighting.
Log-Frequency Weighting
The log-frequency weight of term t in document d is w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, and 0 otherwise.
Document frequency
Rare terms are more informative than frequent terms; to capture this we use
document frequency (df).
Example: the rare word ARACHNOCENTRIC. A document containing this term is very likely to be
relevant to the query ARACHNOCENTRIC, so we want a high weight for rare terms like
ARACHNOCENTRIC.
Example: in a collection, the words “insurance” and “try” may have nearly the same collection frequency (total number of occurrences), yet “insurance” occurs in far fewer documents than “try”.
Document frequency is the more meaningful statistic: as the example above shows, the
few documents that contain “insurance” should get a higher boost for a query on “insurance”
than the many documents containing “try” get for a query on “try”.
idf_t = log10(N / df_t)
For example, with N = 1,000,000 documents, a term that occurs in a single document has idf_t = 6, a term that occurs in 1,000 documents has idf_t = 3, and a term that occurs in every document has idf_t = 0.
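A small sketch putting the two weights together in Python (base-10 logarithms, as above); names like log_tf are illustrative:

```python
import math

def log_tf(tf):
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf(df, N):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

def tf_idf(tf, df, N):
    return log_tf(tf) * idf(df, N)

# With N = 1,000,000 documents:
print(idf(1, 1_000_000))            # 6.0  (very rare term)
print(idf(1_000_000, 1_000_000))    # 0.0  (term in every document)
print(tf_idf(10, 1000, 1_000_000))  # (1 + 1) * 3 = 6.0
```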
Simple Query-Document Score
Zone scoring finds whether or not a query term is present in a zone within a document
(zones are document features whose content can be arbitrary free text, e.g., title, abstract).
If the query contains more than one term, the score for a document-query pair is the sum over
the terms t that appear in both q and d:
Score(q, d) = Σ_{t ∈ q ∩ d} w_{t,d}
Binary values are then replaced by count values, which are later represented as a weight matrix.
Each document is represented by a tf-idf weight vector, with w_{t,d} = (1 + log10 tf_{t,d}) · log10(N / df_t).
Queries as Vectors
Key idea 1: Represent queries as vectors in same space
Key idea 2: Rank documents according to proximity to query in this space
proximity = similarity of vectors
proximity ≈ inverse of distance
Get away from Boolean model , Rank more relevant documents higher than less relevant
documents
Euclidean distance is a bad idea . . . because Euclidean distance is large for vectors of
different lengths
The Euclidean distance between q and d2 is large even though the distribution of terms in the
query q and the distribution of terms in the document d2 are very similar.
Cosine is a monotonically decreasing function of the angle for the interval [0◦, 180◦]
Dividing a vector by its L2 norm makes it a unit (length) vector (on surface of unit
hypersphere). This maps vectors onto the unit sphere . . .
As a result, longer documents and shorter documents have weights of the same order of
magnitude. Effect on the two documents d and d′ (d appended to itself) from earlier
example : they have identical vectors after length-normalization.
cos(q, d) = (q · d) / (|q| |d|) = Σ_i q_i d_i / ( √(Σ_i q_i²) · √(Σ_i d_i²) )
where q_i is the tf-idf weight of term i in the query, d_i is the tf-idf weight of term i in the
document, and cos(q, d) is the cosine similarity of q and d, i.e. the cosine of the angle between q and d.
In reality:
- Length-normalize each document vector when it is added to the index.
- Length-normalize the query vector.
For length-normalized vectors, the cosine is equivalent to the dot product (scalar product):
cos(q, d) = q · d = Σ_i q_i d_i.
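A sketch of cosine scoring with length normalization in Python; the sparse query and document vectors here are illustrative tf-idf weights, not values from the text:

```python
import math

def normalize(vec):
    """Divide a sparse vector (dict term -> weight) by its L2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec

def cosine(q, d):
    """Cosine similarity: dot product of the length-normalized vectors."""
    qn, dn = normalize(q), normalize(d)
    return sum(w * dn.get(t, 0.0) for t, w in qn.items())

q = {"best": 1.3, "car": 2.0, "insurance": 3.0}       # illustrative weights
d = {"car": 1.0, "insurance": 2.0, "auto": 2.3}
print(round(cosine(q, d), 3))
```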
We often use different weightings for queries and documents.
Notation : ddd.qqq
Example : lnc.ltn
Document : logarithmic tf, no df weighting, cosine normalization
Query : logarithmic tf, idf, no normalization
Example query: “best car insurance”
Example document: “car insurance auto insurance”
N=10,000,000
Format/Language: Complications
A single index usually contains terms of several languages.
Sometimes a document or its components contain multiple
languages/formats.
French email with Spanish pdf attachment
What is the document unit for indexing?
A file?
An email?
An email with 5 attachments?
A group of files (ppt or latex in HTML)?
Upshot: Answering the question “what is a document?” is not trivial and
requires some design decisions.
Determining the vocabulary of terms
1) Tokenization:
Task of splitting the document into pieces called tokens.
Input: the raw text of a document. Output: a sequence of tokens.
Examples of tokens that are hard to handle:
3/20/91, 20/3/91, Mar 20, 1991 (dates)
B-52 (hyphenated / alphanumeric terms)
100.2.86.144 (IP address)
(800) 234-2333, 800.234.2333 (phone numbers)
2) Normalization
Need to “normalize” terms in indexed text as well as query terms into the
same form.
Example: We want to match U.S.A. and USA
We most commonly implicitly define equivalence classes of terms.
Alternatively: do asymmetric expansion
window → window, windows
windows → Windows, windows
Windows (no expansion)
More powerful, but less efficient
Case folding
Reduce all letters to lower case.
Possible exceptions: capitalized words in mid-sentence
MIT vs. mit
Fed vs. fed
It's often best to lowercase everything, since users will use lowercase
regardless of correct capitalization.
3) Stemming
o Definition of stemming: Crude heuristic process that chops off the ends of
words in the hope of achieving what “principled” lemmatization attempts to
do with a lot of linguistic knowledge.
o Language dependent
o Example : automate, automatic, automation all reduce to automat
Porter algorithm
Most common algorithm for stemming English
Results suggest that it is at least as good as other stemming options
Conventions + 5 phases of reductions
Phases are applied sequentially
Each phase consists of a set of commands.
Sample command: Delete final ement if what remains is longer than 1
character
replacement → replac
cement → cement
Sample convention: Of the rules in a compound command, select the one
that applies to the longest suffix.
Porter stemmer: a few rules
Rule          Example
SSES → SS     caresses → caress
IES → I       ponies → poni
SS → SS       caress → caress
S → (drop)    cats → cat
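A toy sketch of just these phase-1a rules in Python, choosing the rule that matches the longest suffix; this is only a tiny fragment of the full Porter algorithm:

```python
def porter_phase1a(word):
    """Apply the Porter phase-1a suffix rules shown in the table above."""
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:          # longest matching suffix wins
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_phase1a(w))
# caresses -> caress, ponies -> poni, caress -> caress, cats -> cat
```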
Three stemmers (a comparison): applied to the same sample text, the Porter, Lovins, and Paice stemmers each produce a different output.
Lexical analysis
Objective: Determine the words of the document. Lexical analysis separates the input
alphabet into
Word characters (e.g., the letters a-z)
Word separators (e.g., space, newline, tab)
The following decisions may have impact on retrieval
Digits: Used to be ignored, but the trend now is to identify numbers (e.g.,
telephone numbers) and mixed strings as words.
Punctuation marks: Usually treated as word separators.
Hyphens: Should we interpret “pre-processing” as “pre processing” or as
“preprocessing”?
Letter case: Often ignored, but then a search for “First Bank” (a specific bank)
would retrieve a document with the phrase “Bank of America was the first bank to
offer its customers…”
Stopword Elimination
Objective: Filter out words that occur in most of the documents.
Such words have no value for retrieval purposes , These words are referred to as
stopwords.
They include
Articles (a, an, the, …)
Prepositions (in, on, of, …)
Conjunctions (and, or, but, if, …)
Pronouns (I, you, them, it…)
Possibly some verbs, nouns, adverbs, adjectives (make, thing, similar, …)
A typical stopword list may include several hundred words.
The 100 most frequent words add up to about 50% of the words in a document; hence,
stopword elimination substantially reduces the size of the indexing structures.
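A minimal sketch of stopword elimination in Python; the stopword set below is a tiny illustrative sample of the several-hundred-word lists mentioned above:

```python
# Tiny illustrative stopword list (real lists contain several hundred entries).
STOPWORDS = {"a", "an", "the", "in", "on", "of", "and", "or", "but", "if",
             "i", "you", "them", "it", "is", "was"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list (case-folded)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("the bank of america was the first bank".split()))
# ['bank', 'america', 'first', 'bank']
```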
Stemming
Objective: Replace all the variants of a word with the single stem of the word. Variants
include plurals, gerund forms (ing-form), third person suffixes, past tense suffixes, etc.
Example: connect: connects, connected, connecting, connection,…
All have similar semantics and relate to a single concept.
In parallel, stemming must be performed on the user query.
Stemming improves storage and search efficiency: fewer terms are stored.
Recall:
without stemming a query about “connection”, matches only documents that have
“connection”.
With stemming, the query is about “connect” and matches in addition documents
that originally had “connects” , “connected” , “connecting”, etc.
However, stemming may hurt precision, because users can no longer target just a
particular form.
Stemming may be performed using
o Algorithms that strip off suffixes according to substitution rules.
o Large dictionaries, that provide the stem of each word.
Related issues: Should we index the title and abstract only, or the entire document? Should
index terms be weighted?
Reducing the size of the index: Recall that articles, prepositions, conjunctions, and
pronouns have already been removed through a stopword list, and that the 100 most
frequent words account for about 50% of all word occurrences. Words that are very infrequent
(occurring only a few times in a collection) are often removed as well, under the assumption that
they would probably not be in the user's vocabulary. There are also reductions not based on
probabilistic arguments: nouns are often preferred over verbs, adjectives, or adverbs.
Thesauri
Objective: Standardize the index terms that were selected. In its simplest form a
thesaurus is A list of “important” words (concepts). For each word, an associated list of
synonyms. A thesaurus may be generic (cover all of English) or concentrate on a
particular domain of knowledge. The role of a thesaurus in information retrieval is to
Provide a standard vocabulary for indexing.
Help users locate proper query terms.
Provide hierarchies for automatic broadening or narrowing of queries.
Here, our interest is to provide a standard vocabulary (a controlled vocabulary). This is the
final stage, where each indexing term is replaced by the concept that defines its
thesaurus class.
A language model is a probability distribution over sequences of words. Given such a
sequence, say of length m, it assigns a probability to the whole
sequence. These distributions can be used to predict the likelihood that the next token
in the sequence is a given word . These probability distributions are called language
models. It is useful in many natural language processing applications.
Ex: part-of-speech tagging, speech recognition, machine translation, and
information retrieval
In speech recognition, the computer tries to match sounds with word sequences. The
language model provides context to distinguish between words and phrases that sound
similar. For example, in American English, the phrases "recognize speech" and "wreck a
nice beach" are pronounced almost the same but mean very different things. These
ambiguities are easier to resolve when evidence from the language model is incorporated
with the pronunciation model and the acoustic model.
This diagram shows a simple finite automaton and some of the strings in the
language it generates. → shows the start state of the automaton and a double
circle indicates a (possible) finishing state. For example, the finite automaton
shown can generate strings that include the examples shown. The full set of
strings that can be generated is called the language of the automaton.
Probability that some text (e.g. a query) was generated by the model:
P(frog said that toad likes frog) = 0.01 x 0.03 x 0.04 x 0.01 x 0.02 x 0.01
(We ignore continue/stop probabilities assuming they are fixed for all queries)
Partial specification of two unigram language models.
To find the probability of a word sequence under a model, we multiply the probability of each
word (given by the model) by the probability of continuing or stopping after producing each word.
To compare two models for a data set, we can calculate their likelihood ratio , which
results from simply dividing the probability of the data according to one model by the
probability of the data according to the other model.
In information retrieval contexts, unigram language models are often smoothed to avoid
instances where P(term) = 0. A common approach is to generate a maximum-likelihood
model for the entire collection and linearly interpolate the collection model with a
maximum-likelihood model for each document to create a smoothed document model.
Ex: In a bigram (n = 2) language model, the probability of the sentence “I saw the red
house” is approximated as
P(I saw the red house) ≈ P(I | <s>) · P(saw | I) · P(the | saw) · P(red | the) · P(house | red) · P(</s> | house)
where <s> and </s> mark the start and end of the sentence.
Most language-modeling work in IR has used unigram language models. IR is not the
place where most immediately need complex language models, since IR does not directly
depend on the structure of sentences to the extent that other tasks like speech
recognition do. Unigram models are often sufficient to judge the topic of a text.
Algorithm (query likelihood model):
1. Infer a language model Md for each document d.
2. Estimate P(q | Md), the probability of generating the query according to each document model.
3. Rank the documents by these probabilities.
Smoothing
Decreasing the estimated probability of seen events and increasing the probability
of unseen events is referred to as smoothing
The role of smoothing in this model is not only to avoid zero probabilities. The
smoothing of terms actually implements major parts of the term weighting
component. Thus, we need to smooth probabilities in our document language
models:
to discount non-zero probabilities
to give some probability mass to unseen words.
The probability of a term that does not occur in the document should be close to its probability
of occurring in the collection:
P(t | Mc) = cf(t) / T
where cf(t) is the number of occurrences of term t in the collection and T is the length of the
collection (the sum of all document lengths).
Smoothing Methods
Linear interpolation (mixture model): mixes the probability from the document with the general
collection frequency of the word,
P(t | d) = λ P(t | Md) + (1 − λ) P(t | Mc)
A high value of λ gives a “conjunctive-like” search that tends to retrieve documents containing
all query words; a low value of λ is more disjunctive and suitable for long queries.
Correctly setting λ is very important for good performance.
Bayesian smoothing (Dirichlet): P(t | d) = ( tf_{t,d} + α P(t | Mc) ) / ( L_d + α ), where L_d is the
length of document d and α is a parameter.
Summary, with linear interpolation:
P(q | d) ∝ ∏_{t ∈ q} ( λ P(t | Md) + (1 − λ) P(t | Mc) )
In practice, the log is taken of both sides of the equation to avoid multiplying many small numbers.
Example1:
Question: Suppose the document collection contains two documents:
d1: Xerox reports a profit but revenue is down
d2: Lucent narrows quarter loss but revenue decreases further
A user submitted the query: “revenue down”
Rank D1 and D2 - Use an MLE unigram model and a linear interpolation
smoothing with lambda parameter 0.5
Solution: Each document has 8 tokens, so the collection has 16 tokens in total.
P(q | d1) = [(1/8 + 2/16)/2] · [(1/8 + 1/16)/2] = 0.125 · 0.09375 ≈ 0.0117
P(q | d2) = [(1/8 + 2/16)/2] · [(0/8 + 1/16)/2] = 0.125 · 0.03125 ≈ 0.0039
Ranking: d1 > d2.
Example2:
Collection: d1 and d2
d1 : Jackson was one of the most talented entertainers of all time
d2: Michael Jackson anointed himself King of Pop
Query q: Michael Jackson
Use mixture model with λ = 1/2
P(q|d1) = [(0/11 + 1/18)/2] · [(1/11 + 2/18)/2] ≈ 0.003
P(q|d2) = [(1/7 + 1/18)/2] · [(1/7 + 2/18)/2] ≈ 0.013
Ranking: d2 > d1
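A sketch in Python of query-likelihood scoring with linear interpolation (mixture) smoothing that reproduces Example 2; the function name and structure are illustrative:

```python
def lm_score(query, doc, collection, lam=0.5):
    """P(q|d) under an MLE unigram model interpolated with the collection model."""
    doc_tokens = doc.lower().split()
    coll_tokens = collection.lower().split()
    score = 1.0
    for term in query.lower().split():
        p_doc = doc_tokens.count(term) / len(doc_tokens)
        p_coll = coll_tokens.count(term) / len(coll_tokens)
        score *= lam * p_doc + (1 - lam) * p_coll
    return score

d1 = "Jackson was one of the most talented entertainers of all time"
d2 = "Michael Jackson anointed himself King of Pop"
collection = d1 + " " + d2
for name, d in [("d1", d1), ("d2", d2)]:
    print(name, round(lm_score("Michael Jackson", d, collection), 3))
# d1 ≈ 0.003, d2 ≈ 0.013  ->  ranking d2 > d1
```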
There is much less text available to estimate a language model based on the query text, so the
model will be more poorly estimated and will depend more heavily on smoothing with some other
language model.
3) Model comparison
Build a language model from both the query and the document, and measure how different these
two language models are from each other, using KL divergence.
KL divergence (Kullback–Leibler divergence): an asymmetric divergence measure from information
theory that measures how bad the probability distribution Mq is at modeling Md:
R(d; q) = KL(Md ‖ Mq) = Σ_t P(t | Md) log( P(t | Md) / P(t | Mq) )
Rank by KLD - the closer to 0 the higher is the rank
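A sketch of the KL-divergence computation in Python; the two smoothed unigram models below are illustrative and assumed to share the same vocabulary with no zero probabilities:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_t p(t) * log(p(t) / q(t)); closer to 0 means more similar.
    Assumes both distributions are smoothed (no zero probabilities)."""
    return sum(p_t * math.log(p_t / q[t]) for t, p_t in p.items() if p_t > 0)

# Illustrative smoothed unigram models over a tiny shared vocabulary:
Mq = {"revenue": 0.5, "down": 0.4, "xerox": 0.1}
Md = {"revenue": 0.3, "down": 0.2, "xerox": 0.5}
print(round(kl_divergence(Md, Mq), 3))   # rank documents by how close this is to 0
```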
Given a query q, there exists a subset R of the documents which are relevant to q, but membership
of R is uncertain. A probabilistic retrieval model ranks documents in decreasing order of their
probability of relevance to the information need: P(R | q, di).
Users have information needs, which they translate into query representations.
Similarly, there are documents, which are converted into document representations.
Given only a query, an IR system has an uncertain understanding of the information
need, so IR is an uncertain process, because of the translations involved:
Information need to query
Documents to index terms
Query terms and index terms mismatch
Probability theory provides a principled foundation for such reasoning under
uncertainty. This model provides how likely a document is relevant to an information
need.
Probabilistic methods are one of the oldest but also one of the currently hottest topics in
IR .
Probabilistic IR Models:
For an event A, the probability lies between 0 and 1: 0 ≤ P(A) ≤ 1. For two events A and B, the
joint probability satisfies the chain rule:
P(A, B) = P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A)
Partition rule: if B can be divided into an exhaustive set of disjoint sub-cases,
then P(B) is the sum of the probabilities of the sub-cases. A special case of this
rule gives:
P(B) = P(A, B) + P(Ā, B)
Bayes' rule for inverting conditional probabilities:
P(A | B) = P(B | A) P(A) / P(B)
In the ranked retrieval setup, for a given collection of documents the user issues a
query and an ordered list of documents is returned. Assume a binary notion of
relevance: Rd,q is a random dichotomous variable (a categorical variable that can
take on exactly two values is termed a binary or dichotomous variable), such
that Rd,q = 1 if document d is relevant to query q and Rd,q = 0 otherwise.
PRP in brief
If the retrieved documents (w.r.t a query) are ranked decreasingly on their
probability of relevance, then the effectiveness of the system will be the best that is
obtainable
PRP in full
If [the IR] system's response to each [query] is a ranking of the documents [...] in
order of decreasing probability of relevance to the [query], where the probabilities are
estimated as accurately as possible on the basis of whatever data have been made
available to the system for this purpose, the overall effectiveness of the system to its
user will be the best that is obtainable on the basis of those data.
1/0 loss:
Either returning a nonrelevant document or failing to return a relevant document
incurs a loss of 1; this is called 1/0 loss.
The goal is to return the best possible results as the top k documents, for any
value of k the user chooses to examine.
The PRP then says to simply rank all documents in decreasing order of P(R = 1 | d, q).
If a set of retrieval results is to be returned, rather than an ordering, the
Bayes optimal decision rule (the decision that minimizes the risk of loss) is
to simply return the documents that are more likely relevant than nonrelevant:
d is relevant iff P(R = 1 | d, q) > P(R = 0 | d, q)
Such a model gives a formal framework where we can model differential costs of false
positives and false negatives and even system performance issues at the modeling stage,
rather than simply at the evaluation stage.
It is at this point that we make the Naive Bayes conditional independence assumption:
the presence or absence of a word in a document is independent of the presence or
absence of any other word (given the query). So the odds of relevance factorize over terms:
O(R | q, x) = O(R | q) · ∏_t [ P(x_t | R = 1, q) / P(x_t | R = 0, q) ]
Let pt = P(xt = 1 | R = 1, q) be the probability of a term appearing in a document relevant to
the query, and let ut = P(xt = 1 | R = 0, q) be the probability of a term appearing in a nonrelevant
document. These quantities can be displayed in a contingency table:
An additional simplifying assumption: terms not occurring in the query are equally likely to
occur in relevant and irrelevant documents, i.e., if qt = 0 then pt = ut.
Now we need only to consider terms in the products that appear in the query:
The left product is over query terms found in the document and the right product is over
query terms not found in the document.
We include the query terms found in the document in the right product, but
simultaneously divide through by them in the left product, so the value is unchanged.
The left product is still over query terms found in the document, but the right product is
now over all query terms, hence constant for a particular query and can be ignored.
→ The only quantity that needs to be estimated to rank documents with respect to a query is the
left product. Hence the retrieval status value (RSV) in this model is
RSVd = log ∏_{t ∈ q, present in d} [ pt (1 − ut) / ( ut (1 − pt) ) ] = Σ_{t ∈ q ∩ d} ct
Equivalently, rank documents using the log odds ratios ct for the terms in the query:
ct = log [ pt / (1 − pt) ] + log [ (1 − ut) / ut ]
The odds ratio is the ratio of two odds: (i) the odds of the term appearing if the document
is relevant, pt/(1 − pt), and (ii) the odds of the term appearing if the document is
irrelevant, ut/(1 − ut).
ct = 0: term has equal odds of appearing in relevant and irrelevant docs
ct positive: higher odds to appear in relevant documents
ct negative: higher odds to appear in irrelevant documents
ct functions as a term weight. The retrieval status value for document d is RSVd = Σ_{t ∈ q ∩ d} ct.
So BIM and vector space model are identical on an operational level , except that
the term weights are different. In particular: we can use the same data structures
(inverted index etc) for the two models.
For each term t in a query, estimate ct in the whole collection using a contingency table
of counts of documents in the collection, where df t is the number of documents that
contain term t:
Avoiding Zeroes :
If any of the counts is a zero, then the term weight is not well-defined. Maximum
likelihood estimates do not work for rare events.
To avoid zeros: add 0.5 to each count (expected likelihood estimation = ELE)
Assuming that relevant documents are a very small percentage of the collection,
approximate statistics for irrelevant documents by statistics from the whole collection
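A sketch in Python of estimating the term weight ct from the contingency-table counts with add-0.5 (ELE) smoothing, using the whole collection to approximate the nonrelevant statistics as described above; the parameter names (N documents, df_t, S relevant documents of which s contain t) are conventional, not taken verbatim from the text:

```python
import math

def bim_term_weight(N, df_t, S, s):
    """c_t = log[ p_t(1-u_t) / (u_t(1-p_t)) ] with 0.5 added to each count (ELE)."""
    p_t = (s + 0.5) / (S + 1.0)               # P(term present | relevant)
    u_t = (df_t - s + 0.5) / (N - S + 1.0)    # P(term present | nonrelevant)
    return math.log((p_t * (1 - u_t)) / (u_t * (1 - p_t)))

# With no relevance information (S = s = 0), u_t is estimated from the whole
# collection and c_t behaves like an idf-style weight:
print(round(bim_term_weight(N=1_000_000, df_t=1000, S=0, s=0), 2))
```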
The probabilistic model and the vector space model are not that different: in either case you
build an information retrieval scheme in essentially the same way.
For probabilistic IR, in the end you score queries not by cosine similarity and tf-idf in a vector
space, but by a slightly different formula motivated by probability theory. The next step is to
add term frequency and document length normalization to the probabilistic model.
Definition : Latent semantic indexing (LSI) is an indexing and retrieval method that
uses a mathematical technique called singular value decomposition (SVD) to identify
patterns in the relationships between the terms and concepts contained in an
unstructured collection of text. LSI is based on the principle that words that are used in
the same contexts tend to have similar meanings.
In LSI
Using the SVD for this purpose is called latent semantic indexing (LSI).
Steps:
1. Decompose the term-document matrix C into the product of three matrices, C = U Σ V^T, where:
   - the term matrix U consists of one (row) vector for each term,
   - the document matrix V^T consists of one (column) vector for each document,
   - the singular value matrix Σ is a diagonal matrix of singular values, reflecting the
     importance of each dimension.
2. Use the SVD to compute a new, improved term-document matrix C′ by reducing the dimensionality
   of the space, which gives better similarity values compared to C.
The term matrix U: one row per term, one column per min(M, N), where M is the number of terms and N is
the number of documents. This is an orthonormal matrix:
(i) row vectors have unit length; (ii) any two distinct row vectors are orthogonal to each
other. Think of the dimensions as “semantic” dimensions that capture distinct topics like
politics, sports, economics. Each number uij in the matrix indicates how strongly related
term i is to the topic represented by semantic dimension j.
The matrix Σ
consists of the singular values of C. The magnitude of a singular value measures the
importance of the corresponding semantic dimension. We will make use of this by omitting
unimportant dimensions.
The matrix VT
One column per document, one row per min(M,N) where M is the number of terms and
N is the number of documents. Again: This is an orthonormal matrix: (i) Column
vectors have unit length. (ii) Any two distinct column vectors are orthogonal to each
other. These are again the semantic dimensions from the term matrix U that capture
distinct topics like politics, sports, economics. Each number vij in the matrix indicates
how strongly related document i is to the topic represented by semantic dimension j .
Actually, we only zero out singular values in Σ. This has the effect of setting the
corresponding dimensions in U and V^T to zero when computing the product C2 = U Σ2 V^T.
We can view C2 as a two-dimensional representation of the matrix. We have performed
a dimensionality reduction to two dimensions.
Example of a dot product in the reduced space: 0.52 · 0.28 + 0.36 · 0.16 + 0.72 · 0.36 + 0.12 · 0.20 + (−0.39) · (−0.08) ≈ 0.52
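A sketch of the LSI reduction using numpy's SVD: decompose C, zero out all but the k largest singular values, and recombine; the small term-document matrix below is illustrative:

```python
import numpy as np

# Illustrative term-document count matrix C (rows = terms, columns = documents).
C = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)

U, sigma, Vt = np.linalg.svd(C, full_matrices=False)   # C = U diag(sigma) Vt

k = 2                                    # keep only the 2 largest singular values
sigma_k = np.diag(np.concatenate([sigma[:k], np.zeros(len(sigma) - k)]))
C_k = U @ sigma_k @ Vt                   # rank-k approximation C2 of C

# Document representations in the reduced k-dimensional space:
docs_reduced = (sigma_k @ Vt)[:k]
print(np.round(C_k, 2))
print(np.round(docs_reduced.T @ docs_reduced, 2))   # document-document similarities
```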
LSI takes documents that are semantically similar (i.e., they talk about the same topics) but are
not similar in the vector space (because they use different words), and re-represents them in a
reduced vector space in which they have higher similarity. Thus LSI addresses the problems of
synonymy and semantic relatedness: in the standard vector space, synonyms contribute nothing to
document similarity, whereas in the reduced space they do.
Interactive relevance feedback: improve the initial retrieval results by telling the IR system
which documents are relevant / nonrelevant. The best-known relevance feedback method is Rocchio
feedback.
Query expansion: improve retrieval results by adding synonyms / related terms to the query.
Sources for related terms: manual thesauri, automatic thesauri, query logs.
These are the two main ways of improving recall: relevance feedback and query expansion.
Synonymy: in most collections, the same concept may be referred to using different words;
this is called synonymy.
Recall is used here in the loose sense of “increasing the number of relevant documents returned
to the user”. Some techniques may actually decrease recall on strict measures; e.g., expanding
“jaguar” with “panthera” eliminates some relevant documents, but increases the number of relevant
documents returned on the top pages.
2) Local methods adjust a query relative to the documents that initially appear to
match the query. The basic methods are:
Relevance feedback
Pseudo relevance feedback, also known as Blind relevance feedback
(Global) indirect relevance feedback
The idea of relevance feedback (RF) is to involve the user in the retrieval process so as
to improve the final result set. In particular, the user gives feedback on the relevance of
documents in an initial set of results. The basic procedure is:
1. The user issues a (short, simple) query.
2. The system returns an initial set of retrieval results.
3. The user marks some returned documents as relevant or nonrelevant.
4. The system computes a better representation of the information need based on this feedback.
5. The system displays a revised set of retrieval results.
We can iterate this: several rounds of relevance feedback. We will use the term ad
hoc retrieval to refer to regular retrieval without relevance feedback. We will now
look at two different examples of relevance feedback that highlight different
aspects of the process.
Example1 : Image search engine http://nayana.ece.ucsb.edu/imsearch/imsearch.html
After Relevance Feedback
Example 2: A real (non-image) example - shows a textual IR example where the user
wishes to find out about new applications of space satellites.
Initial query: [new space satellite applications]. Results for the initial query (r = rank; partial list):
r = 3 (0.528) Science Panel Backs NASA Satellite Plan, But Urges Launches of Smaller Probes
r = 6 (0.524) Report Provides Support for the Critics Of Using Big Satellites to Study Climate
Expanded query terms and weights after relevance feedback:
2.074 new            15.106 space
30.816 satellite     5.660 application
5.991 nasa           5.196 eos
4.196 launch         3.972 aster
3.516 instrument     3.446 arianespace
3.004 bundespost     2.806 ss
2.790 rocket         2.053 scientist
2.003 broadcast      1.172 earth
0.836 oil            0.646 measure
The centroid is the center of mass of a set of points. Recall that we represent
documents as points in a high-dimensional space. Thus: we can compute
centroids of documents.
Definition: the centroid of a set of documents D is
μ(D) = (1 / |D|) Σ_{d ∈ D} v(d)
where v(d) is the vector representation of document d.
Example: the ideal query vector separates relevant from nonrelevant documents; we move from the
centroid of the relevant documents by the difference between the two centroids. The problem is
that we do not know the truly relevant documents.
In practice, we modify the query based on relevance feedback and apply the standard vector space
model, using only the documents that the user marked. Relevance feedback can improve recall and
precision, and it is most useful for increasing recall in situations where recall is important,
since users can be expected to review results and to take time to iterate.
The Rocchio algorithm computes the modified query
qm = α q0 + β (1/|Dr|) Σ_{dj ∈ Dr} dj − γ (1/|Dnr|) Σ_{dj ∈ Dnr} dj
where qm is the modified query vector, q0 the original query vector, Dr and Dnr the sets of known
relevant and nonrelevant documents respectively, and α, β, γ are weights.
The new query moves towards the relevant documents and away from the nonrelevant documents.
Tradeoff α vs. β/γ: if we have a lot of judged documents, we want a higher β/γ.
A “negative weight” for a term does not make sense in the vector space model, so negative
components are set to 0.
For example, set β = 0.75, γ = 0.25 to give higher weight to positive feedback.
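A sketch of the Rocchio update in Python with the weights suggested above (α = 1, β = 0.75, γ = 0.25); vectors are sparse dicts of term weights, the example vectors are illustrative, and negative components are dropped as the text recommends:

```python
def centroid(doc_vectors):
    """Component-wise mean of a list of sparse vectors (dicts)."""
    terms = {t for d in doc_vectors for t in d}
    return {t: sum(d.get(t, 0.0) for d in doc_vectors) / len(doc_vectors)
            for t in terms}

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """q_m = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr), clipped at 0."""
    cr = centroid(relevant) if relevant else {}
    cn = centroid(nonrelevant) if nonrelevant else {}
    terms = set(q0) | set(cr) | set(cn)
    qm = {t: alpha * q0.get(t, 0.0) + beta * cr.get(t, 0.0) - gamma * cn.get(t, 0.0)
          for t in terms}
    return {t: w for t, w in qm.items() if w > 0}   # drop negative weights

q0 = {"space": 1.0, "satellite": 1.0}
relevant = [{"space": 2.0, "nasa": 1.0}, {"satellite": 1.0, "launch": 1.0}]
nonrelevant = [{"oil": 2.0, "satellite": 0.5}]
print(rocchio(q0, relevant, nonrelevant))
```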
To compute the Rocchio vector (illustration): the centroid of the relevant documents by itself
does not separate relevant from nonrelevant documents; we also compute the centroid of the
nonrelevant documents and take the difference vector between the two centroids; adding this
difference to the query yields a vector that separates relevant from nonrelevant documents perfectly.
Pseudo-relevance feedback algorithm:
1. Retrieve a ranked list of documents for the query.
2. Assume that the top k documents are relevant.
3. Do relevance feedback (e.g., Rocchio) with these pseudo-relevant documents.
Example results (weighting scheme, without and with pseudo-relevance feedback):
lnc.ltc          3210
lnc.ltc-PsRF     3634
Lnu.ltu          3709
Lnu.ltu-PsRF     4350
Query Expansion
Query expansion is another method for increasing recall. We use “global query
expansion” to refer to “global methods for query reformulation”. In global query
expansion, the query is modified based on some global resource, i.e. a resource
that is not query-dependent. Main information we use: (near-)synonymy. A
publication or database that collects (near-)synonyms is called a thesaurus.
There are two types of thesauri:
manually created
automatically created.
Types of query expansion
1) Manual thesaurus: for each term t in the query, expand the query with words the thesaurus
lists as semantically related to t. This is widely used in specialized search engines for science
and engineering. It is very expensive to create a manual thesaurus and to maintain it over time;
a manual thesaurus has an effect roughly equivalent to annotation with a controlled vocabulary.
53
2) Automatic thesaurus generation: derive related terms automatically by analyzing word distributions.
Definition 1: Two words are similar if they co-occur with similar words.
Definition 2: Two words are similar if they occur in a given grammatical relation with the same words.
For example, you can harvest, peel, eat, prepare, etc. both apples and pears, so apples and
pears must be similar.
Quality of associations is usually a problem. Term ambiguity may introduce
irrelevant statistically correlated terms.
“Apple computer” → “Apple red fruit computer”
Problems:
False positives: Words deemed similar that are not
False negatives: Words deemed dissimilar that are similar
Since terms are highly correlated anyway, expansion may not retrieve many
additional documents.
Co-occurrence is more robust, grammatical relations are more accurate.
The simplest way to compute such a co-occurrence thesaurus is via a term-term similarity matrix
C = A Aᵀ, where A is the term-document matrix and wi,j is the (normalized) weight for term ti in
document dj.
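A sketch of this co-occurrence computation with numpy; the small normalized term-document matrix A is illustrative, and high entries of C = A Aᵀ suggest related terms:

```python
import numpy as np

# Illustrative (normalized) term-document weight matrix A: rows = terms.
terms = ["apple", "pear", "computer"]
A = np.array([[0.7, 0.6, 0.0, 0.3],
              [0.6, 0.7, 0.0, 0.0],
              [0.1, 0.0, 0.9, 0.8]])

C = A @ A.T          # term-term co-occurrence / similarity matrix

# Most similar term to "apple" (excluding itself):
i = terms.index("apple")
sims = [(terms[j], C[i, j]) for j in range(len(terms)) if j != i]
print(max(sims, key=lambda x: x[1]))   # ('pear', ...)
```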