IR Unit 2 Final
Introduction
IR Model
Basic Concepts
Each document is represented by a set of representative keywords or index terms
Index term:
In a restricted sense: it is a keyword that has some meaning on its
own; usually plays the role of a noun
In a more general form: it is any word that appears in a document
Let t be the number of index terms in the collection and ki a generic index term. Then the vocabulary V = {k1, . . . , kt} is the set of all distinct index terms in the collection.
The collection can be represented by a term-document matrix, where each element fi,j stands for the frequency of term ki in document dj.
Logical view of a document: from full text to a set of index terms
________________________________________________________________________________________
To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the
vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a
bitwise AND:
110100 AND 110111 AND 101111 = 100100
Solution: the plays Antony and Cleopatra and Hamlet are the results from Shakespeare's collected works for the query Brutus AND Caesar AND NOT Calpurnia.
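A minimal Python sketch of this Boolean retrieval over the term-document incidence matrix (the 0/1 vectors follow the worked example above; the play order is assumed from the textbook's incidence-matrix figure):

    # Toy incidence matrix: one 0/1 vector per term, one bit per play.
    incidence = {
        "Brutus":    [1, 1, 0, 1, 0, 0],
        "Caesar":    [1, 1, 0, 1, 1, 1],
        "Calpurnia": [0, 1, 0, 0, 0, 0],
    }
    plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
             "Hamlet", "Othello", "Macbeth"]

    # Brutus AND Caesar AND NOT Calpurnia: complement the last vector,
    # then AND the three bit vectors position by position.
    result = [b & c & (1 - p) for b, c, p in zip(incidence["Brutus"],
                                                 incidence["Caesar"],
                                                 incidence["Calpurnia"])]
    print(result)                                    # [1, 0, 0, 1, 0, 0]
    print([plays[i] for i, bit in enumerate(result) if bit])
    # ['Antony and Cleopatra', 'Hamlet']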
Inverted index / inverted file
It is the most efficient structure for supporting ad hoc text search. For each term t, we store a list of all documents that contain t. An inverted index has two parts: the dictionary, which is commonly kept in memory with pointers to each postings list, and the postings lists themselves, which are stored on disk.
i) Dictionary / vocabulary / lexicon: we use "dictionary" for the data structure and "vocabulary" for the set of terms. The dictionary in the figure has been sorted alphabetically.
ii) Postings: for each term, we keep a list of the IDs of the documents in which the term appears. This list is called a postings list (or inverted list), and each postings list is sorted by document ID.
Building an inverted index:
1. Collect the documents to be indexed.
2. Tokenize the text, turning each document into a list of tokens.
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms.
4. Index the documents that each term occurs in by creating an inverted index.
DocID:
Each document has a unique serial number, known as the document identifier (docID).
Input Dictionary & posting :
The input to indexing is a list of normalized tokens for each document,
which we can equally think of as a list of pairs of term and docID . The
core indexing step is sorting this list , Multiple occurrences of the same
term from the same document are then merged. Instances of the same
term are then grouped, and the result is split into a dictionary and postings
Document frequency:
The dictionary also records some statistics, such as the number of documents that contain each term (its document frequency). This provides the basis for efficient query processing.
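As a rough sketch of the sort-then-merge construction just described (the dictionary here also records each term's document frequency; the three toy documents are invented for illustration):

    # Sort-based index construction: collect (term, docID) pairs,
    # sort them, then merge duplicates into dictionary + postings.
    docs = {1: "new home sales top forecasts",
            2: "home sales rise in july",
            3: "increase in home sales in july"}

    pairs = sorted((term, doc_id)
                   for doc_id, text in docs.items()
                   for term in text.split())

    index = {}                       # term -> postings list sorted by docID
    for term, doc_id in pairs:
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:   # merge duplicates
            postings.append(doc_id)

    for term, postings in sorted(index.items()):
        print(term, "df=%d" % len(postings), postings)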
Storage (dictionary & postings lists):
1. A fixed-length array would be wasteful, since some words occur in many documents and others in very few.
2. For an in-memory postings list there are two good alternatives:
a. singly linked lists
b. variable-length arrays
3. A hybrid scheme uses a linked list of fixed-length arrays for each term.
Processing Boolean queries
To process a query using an inverted index and the basic Boolean retrieval model, consider the simple conjunctive query Brutus AND Calpurnia over the inverted index partially shown in the figure.
Steps :
1. Locate Brutus in the Dictionary
2. Retrieve its postings
3. Locate Calpurnia in the Dictionary
4. Retrieve its postings
5. Intersect the two postings lists, as shown in the figure and sketched below.
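The intersection in step 5 is a linear merge of the two sorted lists; a sketch, with toy docID lists invented for illustration:

    def intersect(p1, p2):
        # Merge two postings lists sorted by docID in O(len(p1) + len(p2)).
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i]); i += 1; j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    print(intersect([1, 2, 4, 11, 31, 45, 173], [2, 31, 54, 101]))  # [2, 31]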
Bag of Words Model
A document is represented by a count vector ∈ N^|V|.
The exact ordering of the terms in a document is ignored, but the number of occurrences of each term matters. A consequence: two documents with the same bag-of-words representation are treated as identical in content, even when they differ. For example, "Mary is quicker than John" and "John is quicker than Mary" have exactly the same count vector.
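A quick sketch of such count vectors (the vocabulary order is chosen arbitrarily here):

    from collections import Counter

    vocab = ["John", "Mary", "is", "quicker", "than"]

    def count_vector(text):
        counts = Counter(text.split())
        return [counts[w] for w in vocab]   # term order is ignored

    print(count_vector("Mary is quicker than John"))  # [1, 1, 1, 1, 1]
    print(count_vector("John is quicker than Mary"))  # [1, 1, 1, 1, 1]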
Log-Frequency Weighting
The log-frequency weight of term t in document d is w(t,d) = 1 + log10 tf(t,d) if tf(t,d) > 0, and 0 otherwise.
Inverse Document Frequency (IDF)
Rare terms are more informative than frequent terms; to capture this we use document frequency dft, the number of documents that contain t.
Example: the common word THE. A document containing this term can be about anything, so we want a very low weight for common terms like THE.
N: the total number of documents in the collection (for example, 806,791 documents).
idft = log10(N / dft)
• IDF(t) is high if t is a rare term.
• IDF(t) is low if t is a frequent term.
Example with N = 1,000,000: idft = log10(1,000,000 / dft), so a term occurring in a single document gets idf 6, and a term occurring in every document gets idf 0.
TF-IDF Weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight; it is the best-known weighting scheme in information retrieval. TF(t, d) measures the importance of term t in document d; IDF(t) measures the importance of term t in the whole collection of documents:
tf-idf(t,d) = (1 + log10 tf(t,d)) × log10(N / dft)
If the query contains more than one term, the score for a document-query pair is the sum over terms t occurring in both q and d:
Score(q, d) = Σ over t in q ∩ d of tf-idf(t,d)
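A small sketch that puts the tf and idf pieces together (the documents and query are invented for illustration):

    import math

    docs = {1: "car insurance auto insurance",
            2: "best car deals",
            3: "auto repair"}
    N = len(docs)

    tf = {d: {} for d in docs}            # term frequencies per document
    df = {}                               # document frequencies
    for d, text in docs.items():
        for term in text.split():
            tf[d][term] = tf[d].get(term, 0) + 1
        for term in set(text.split()):
            df[term] = df.get(term, 0) + 1

    def score(query, d):
        # Sum of tf-idf weights over query terms present in document d.
        s = 0.0
        for t in query.split():
            if t in tf[d]:
                w_tf = 1 + math.log10(tf[d][t])
                w_idf = math.log10(N / df[t])
                s += w_tf * w_idf
        return s

    for d in docs:
        print(d, round(score("car insurance", d), 3))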
Queries as Vectors
Key idea 1: represent queries as vectors in the same space.
Key idea 2: rank documents according to their proximity to the query in this space, where
proximity = similarity of vectors
proximity ≈ inverse of distance
This gets us away from the Boolean model and ranks more relevant documents higher than less relevant ones.
The Euclidean distance can be large even when the distribution of terms in the query q and the distribution of terms in the document d2 are very similar (for example, when d2 is much longer than q). The fix is length normalization: dividing a vector by its L2 norm ||x||2 = sqrt(Σi xi²) makes it a unit (length) vector, i.e., maps it onto the surface of the unit hypersphere. As a result, longer documents and shorter documents have weights of the same order of magnitude. Effect on the two documents d and d′ (d appended to itself) from the earlier example: they have identical vectors after length-normalization.
Cosine Similarity
Measure the similarity between the document and the query using the cosine of their vector representations:
cos(q, d) = (q · d) / (|q| |d|) = Σi qi di / ( sqrt(Σi qi²) sqrt(Σi di²) )
This is the cosine similarity of q and d, i.e., the cosine of the angle between q and d.
In practice:
- length-normalize the document vector when the document is added to the index;
- length-normalize the query vector;
then cos(q, d) = q · d (if q and d are length-normalized).
(The slides work an example of cosine similarity between documents using log-frequency weights.)
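A minimal numpy sketch of length-normalization followed by cosine scoring (the vectors are invented for illustration):

    import numpy as np

    def normalize(v):
        # Divide by the L2 norm, mapping v onto the unit hypersphere.
        return v / np.linalg.norm(v)

    q = normalize(np.array([0.0, 1.0, 1.0]))   # query vector
    d = normalize(np.array([1.0, 2.0, 2.0]))   # document vector

    # After normalization the cosine is just the dot product.
    print(float(np.dot(q, d)))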
Preprocessing
Format/Language: Complications
A single index usually contains terms of several languages, and sometimes a document or its components contain multiple languages or formats: for example, a French email with a Spanish PDF attachment.
What is the document unit for indexing?
A file?
An email?
An email with 5 attachments?
A group of files (e.g., PPT or LaTeX converted to HTML)?
Upshot: answering the question "what is a document?" is not trivial and requires some design decisions.
Determining the vocabulary of terms
1) Tokenization:
The task of splitting the document into pieces called tokens.
Input: "Friends, Romans, Countrymen, lend me your ears;"
Output: Friends | Romans | Countrymen | lend | me | your | ears
Hard cases for tokenization:
- Hyphenation: Hewlett-Packard, state-of-the-art, co-education, the hold-him-back-and-drag-him-away maneuver
- Dates: 3/20/91 vs. 20/3/91 vs. Mar 20, 1991
- Names and codes: B-52
- IP addresses: 100.2.86.144
- Phone numbers: (800) 234-2333 vs. 800.234.2333
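Why these cases are hard is easiest to see with a naive tokenizer. The sketch below uses a single regex invented for illustration: it keeps internal hyphens, dots, and slashes together, which handles some of the cases above and still breaks others (the spaced phone number, for instance).

    import re

    # Naive rule: a token is a run of letters/digits, optionally joined by
    # internal hyphens, dots, or slashes (keeps 3/20/91 and 100.2.86.144 whole).
    TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-./][A-Za-z0-9]+)*")

    text = "Hewlett-Packard shipped the B-52 docs on 3/20/91 to 100.2.86.144."
    print(TOKEN.findall(text))
    # ['Hewlett-Packard', 'shipped', 'the', 'B-52', 'docs', 'on',
    #  '3/20/91', 'to', '100.2.86.144']
    print(TOKEN.findall("(800) 234-2333"))   # ['800', '234-2333'] - still split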
2) Normalization
Need to “normalize” terms in indexed text as well as query terms
into the same form.
Example: We want to match U.S.A. and USA
We most commonly implicitly define equivalence classes of terms.
Alternatively: do asymmetric expansion
window → window, windows
windows → Windows, windows
Windows (no expansion)
More powerful, but less efficient
Case Folding
Reduce all letters to lower case
Possible exceptions: capitalized words in mid-sentence
MIT vs. mit
Fed vs. fed
It's often best to lowercase everything, since users will use lowercase regardless of correct capitalization.
3) Stop words
Extremely common words that carry little content and are usually excluded from the index; see the "Stopword Elimination" section below.
o Definition of stemming: Crude heuristic process that chops off the
ends of words in the hope of achieving what “principled”
lemmatization attempts to do with a lot of linguistic knowledge.
o Language dependent
o Example : automate, automatic, automation all reduce to automat
Porter algorithm
Most common algorithm for stemming English
Results suggest that it is at least as good as other stemming options
Conventions + 5 phases of reductions
Phases are applied sequentially
Each phase consists of a set of commands.
Sample command: Delete final ement if what remains is longer than
1 character
replacement → replac
cement → cement
Sample convention: Of the rules in a compound command, select the one
that applies to the longest suffix.
Porter stemmer: a few rules
Rule           Example
SSES → SS      caresses → caress
IES  → I       ponies   → poni
SS   → SS      caress   → caress
S    → (drop)  cats     → cat
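As a sketch, here are just these four sample rules with the longest-matching-suffix convention (this is not the full five-phase Porter algorithm):

    # Sample rules, ordered so the longest matching suffix wins.
    RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

    def stem_step(word):
        for suffix, repl in RULES:
            if word.endswith(suffix):
                return word[:len(word) - len(suffix)] + repl
        return word

    for w in ["caresses", "ponies", "caress", "cats"]:
        print(w, "->", stem_step(w))
    # caresses -> caress, ponies -> poni, caress -> caress, cats -> cat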
Other stemmers (compared on a sample text in the slides):
1) Sample text
2) Lovins stemmer
3) Paice stemmer
Lexical analysis
Objective: Determine the words of the document.
Lexical analysis separates the input alphabet into
Word characters (e.g., the letters a-z)
Word separators (e.g., space, newline, tab)
The following decisions may have an impact on retrieval:
Digits: used to be ignored, but the trend now is to identify numbers (e.g., telephone numbers) and mixed strings as words.
Punctuation marks: Usually treated as word separators.
Hyphens: Should we interpret “pre-processing” as “pre processing” or
as “preprocessing”?
Letter case: Often ignored, but then a search for “First Bank” (a specific bank)
would retrieve a document with the phrase “Bank of America was the first
bank to offer its customers…”
Stopword Elimination
Objective: Filter out words that occur in most of the documents.
Such words have no value for retrieval purposes; these words are referred to as stopwords.
They include
Articles (a, an, the, …)
Prepositions (in, on, of, …)
Conjunctions (and, or, but, if, …)
Pronouns (I, you, them, it…)
Possibly some verbs, nouns, adverbs, adjectives (make, thing, similar, …)
A typical stopword list may include several hundred words.
The 100 most frequent words add up to about 50% of the words in a document. Hence, stopword elimination significantly reduces the size of the indexing structures.
Stemming
Objective: Replace all the variants of a word with the single stem of the word.
Variants include plurals, gerund forms (ing-form), third person suffixes, past tense
suffixes, etc.
Example: connect: connects, connected, connecting, connection,…
All have similar semantics and relate to a single concept.
Stemming must also be performed on the user query, in parallel.
Relative advantages of manual indexing:
1. Ability to perform abstractions (conclude what the subject is) and determine
additional related terms,
2. Ability to judge the value of concepts.
Reducing the size of the index: recall that articles, prepositions, conjunctions, and pronouns have already been removed through a stopword list, and that the 100 most frequent words account for about 50% of all word occurrences. Words that are very infrequent (occur only a few times in a collection) are often removed as well, under the assumption that they would probably not be in the user's vocabulary. Reduction not based on probabilistic arguments: nouns are often preferred over verbs, adjectives, or adverbs.
Thesauri
Objective: standardize the index terms that were selected. In its simplest form, a thesaurus is a list of "important" words (concepts) and, for each word, an associated list of synonyms. A thesaurus may be generic (covering all of English) or concentrate on a particular domain of knowledge. The role of a thesaurus in information retrieval is to:
Provide a standard vocabulary for indexing.
Help users locate proper query terms.
Provide hierarchies for automatic broadening or narrowing of queries.
Here, our interest is to provide a standard vocabulary (a controlled vocabulary). This is the final stage, in which each indexing term is replaced by the concept that defines its thesaurus class.
_____________________________________________________________________________________________
Language models
A document is a good match to a query if the document model is likely to generate the query, which happens if the document contains the query words often.
Ex: a simple finite automaton and some of the strings in the language it generates (figure). The full set of strings that can be generated is called the language of the automaton.
Probability that some text (e.g., a query) was generated by the model: under a one-state unigram model, P(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4).
Partial specification of two unigram language models (figure).
To find the probability of a word sequence under such a model, multiply the probability the model gives to each word by the probability of continuing (or stopping) after producing each word.
Ex: in a bigram (n = 2) language model, the probability of the sentence "I saw the red house" is approximated as (<s> marks the sentence start and </s> the stop):
P(I saw the red house) ≈ P(I | <s>) P(saw | I) P(the | saw) P(red | the) P(house | red) P(</s> | house)
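A tiny sketch of this bigram estimate, using maximum-likelihood counts from a made-up two-sentence corpus:

    from collections import Counter

    corpus = [["<s>", "I", "saw", "the", "red", "house", "</s>"],
              ["<s>", "I", "saw", "the", "dog", "</s>"]]

    bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
    unigrams = Counter(w for s in corpus for w in s)

    def p(w, prev):
        # MLE bigram probability P(w | prev).
        return bigrams[(prev, w)] / unigrams[prev]

    sent = ["<s>", "I", "saw", "the", "red", "house", "</s>"]
    prob = 1.0
    for prev, w in zip(sent, sent[1:]):
        prob *= p(w, prev)
    print(prob)   # P(I|<s>) * P(saw|I) * ... * P(</s>|house) = 0.5 here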
1) The query likelihood model in IR
The original and basic method for using language models in IR is the query
likelihood model .
Goal: construct from each document d in the collection a language model Md, and rank documents by P(d|q), where the probability of a document is interpreted as the likelihood that it is relevant to the query. By Bayes' rule, P(d|q) = P(q|d) P(d) / P(q); P(q) is the same for all documents, and with a uniform prior P(d) this amounts to ranking by the query likelihood
P(q|Md) = Π over t in q of P(t|Md)
E.g., P(q|Md3) > P(q|Md1) > P(q|Md2): d3 is ranked first, d1 second, d2 third.
Smoothing
Decreasing the estimated probability of seen events and increasing the
probability of unseen events is referred to as smoothing
The role of smoothing in this model is not only to avoid zero probabilities.
we need to smooth probabilities in our document language models:
to discount non-zero probabilities
to give some probability mass to unseen words.
(Mixture model) Mix the probability from the document with the general collection frequency of the word:
P(t|d) = λ P(t|Md) + (1 - λ) P(t|Mc), with 0 < λ < 1,
where Mc is a language model built from the entire collection.
High value of λ: “conjunctive-like” search – tends to
retrieve documents containing all query words.
Low value of λ: more disjunctive, suitable for long queries
Correctly setting λ is very important for good performance.
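A compact sketch of query likelihood with this linear interpolation (the documents and λ = 0.5 are invented for illustration):

    from collections import Counter

    docs = {"d1": "click go the shears boys click click click",
            "d2": "click click", "d3": "metal here",
            "d4": "metal shears click here"}
    lam = 0.5

    models = {d: Counter(text.split()) for d, text in docs.items()}
    collection = Counter(t for text in docs.values() for t in text.split())
    total = sum(collection.values())

    def p_query(query, d):
        # P(q|d) with Jelinek-Mercer smoothing against the collection model.
        m, length = models[d], sum(models[d].values())
        prob = 1.0
        for t in query.split():
            prob *= lam * m[t] / length + (1 - lam) * collection[t] / total
        return prob

    ranking = sorted(docs, key=lambda d: p_query("click shears", d), reverse=True)
    print(ranking)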
Bayesian smoothing:
P(t|d) = ( tf(t,d) + α P(t|Mc) ) / ( Ld + α )
where Ld is the length of d and α a smoothing parameter.
Summary, with linear interpolation:
P(q|d) ∝ Π over t in q of ( λ P(t|Md) + (1 - λ) P(t|Mc) )
3) Model comparison
Build an LM from both the query and the document, and measure how different these LMs are from each other, using KL divergence.
KL divergence (Kullback-Leibler divergence): an asymmetric divergence measure from information theory, which measures how bad the probability distribution Mq is at modeling Md:
R(d; q) = KL(Md || Mq) = Σ over t of P(t|Mq) log( P(t|Mq) / P(t|Md) )
Given a query q, there exists a subset R of the documents which are relevant to q, but membership of R is uncertain. A probabilistic retrieval model ranks documents in decreasing order of their probability of relevance to the information need: P(R | q, di).
Probabilistic methods are among the oldest, but also among the currently hottest, topics in IR.
Probabilistic IR Models :
Odds of an event (the ratio of the probability of an event to the probability of its complement) provide a kind of multiplier for how probabilities change:
O(A) = P(A) / P(not A) = P(A) / (1 - P(A))
In the ranked retrieval setup, for a given collection of documents, the user issues a query and an ordered list of documents is returned. Assume a binary notion of relevance: Rd,q is a random dichotomous variable (a categorical variable that can take on exactly two values is termed a binary or dichotomous variable), such that
Rd,q = 1 if document d is relevant to query q, and Rd,q = 0 otherwise.
PRP in brief
If the retrieved documents (w.r.t a query) are ranked decreasingly on their
probability of relevance, then the effectiveness of the system will be the best
that is obtainable
PRP in full
If [the IR] system's response to each [query] is a ranking of the documents [...] in
order of decreasing probability of relevance to the [query], where the probabilities
are estimated as accurately as possible on the basis of whatever data have been
made available to the system for this purpose, the overall effectiveness of the
system to its user will be the best that is obtainable on the basis of those data.
1/0 loss:
Either returning a nonrelevant document or failing to return a relevant document counts as a loss of 1 (correct decisions cost 0); this is called 1/0 loss.
The goal is to return the best possible results as the top k documents, for any value of k the user chooses to examine.
The PRP then says to simply rank all documents in decreasing order of P(R = 1 | d, q). If a set of retrieval results is to be returned, rather than an ordering, the Bayes Optimal Decision Rule, the decision which minimizes the risk of loss, is to simply return documents that are more likely relevant than nonrelevant:
d is relevant iff P(R = 1 | d, q) > P(R = 0 | d, q)
Binary Independence Model (BIM)
Documents (and queries) are represented as binary term incidence vectors x, with xt = 1 if term t is present. By Bayes' rule,
P(R = 1 | x, q) = P(x | R = 1, q) P(R = 1 | q) / P(x | q)
where P(x | R = 1, q) is the probability that, if a relevant document is retrieved, that representation is x.
Since a document is either relevant or nonrelevant to a query, we must have that:
P(R = 1 | x, q) + P(R = 0 | x, q) = 1
So we can equivalently rank by the odds:
O(R | x, q) = P(R = 1 | x, q) / P(R = 0 | x, q)
Let pt = P(xt = 1 | R = 1, q) be the probability of a term appearing in a document relevant to the query, and let ut = P(xt = 1 | R = 0, q) be the probability of a term appearing in a nonrelevant document. These can be displayed in a contingency table:

           relevant (R = 1)   nonrelevant (R = 0)
xt = 1     pt                 ut
xt = 0     1 - pt             1 - ut

Additional simplifying assumption: terms not occurring in the query are equally likely to occur in relevant and nonrelevant documents; if qt = 0, then pt = ut.
Now we need only consider terms in the products that appear in the query:
O(R | q, x) = O(R | q) · Π over t with xt=1, qt=1 of (pt / ut) · Π over t with xt=0, qt=1 of ( (1 - pt) / (1 - ut) )
The left product is over query terms found in the document and the right product is over query terms not found in the document.
Equivalently, rank documents using the log odds ratios ct for the terms in the query:
ct = log( (pt / (1 - pt)) / (ut / (1 - ut)) )
The odds ratio is the ratio of two odds: (i) the odds of the term appearing if the document is relevant (pt / (1 - pt)), and (ii) the odds of the term appearing if the document is nonrelevant (ut / (1 - ut)).
ct = 0: the term has equal odds of appearing in relevant and nonrelevant docs
ct positive: higher odds of appearing in relevant documents
ct negative: higher odds of appearing in nonrelevant documents
ct functions as a term weight. Retrieval status value for document d:
RSVd = Σ over t with xt=1, qt=1 of ct
So the BIM and the vector space model are identical on an operational level, except that the term weights are different. In particular, we can use the same data structures (inverted index, etc.) for the two models.
For each term t in a query, estimate ct in the whole collection using a contingency table of counts of documents, where dft is the number of documents that contain term t, S is the number of relevant documents, and s the number of relevant documents containing t:

           relevant   nonrelevant            total
xt = 1     s          dft - s                dft
xt = 0     S - s      (N - dft) - (S - s)    N - dft
total      S          N - S                  N

so pt = s / S and ut = (dft - s) / (N - S).
Avoiding zeroes:
If any of the counts is zero, the term weight is not well-defined; maximum likelihood estimates do not work for rare events. The standard fix is to add 0.5 to each count:
ct = log( ((s + 0.5) / (S - s + 0.5)) / ((dft - s + 0.5) / (N - dft - S + s + 0.5)) )
Assuming that relevant documents are a very small percentage of the collection, approximate statistics for nonrelevant documents by statistics from the whole collection: ut = dft / N, and then log( (1 - ut) / ut ) ≈ log( N / dft ), an IDF-like weight.
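A small sketch of this smoothed estimate (the counts are invented for illustration; natural log is used, since the base only scales all weights equally):

    import math

    def c_t(N, df_t, S, s):
        # Smoothed log odds-ratio weight: add 0.5 to each count.
        odds_rel = (s + 0.5) / (S - s + 0.5)
        odds_nonrel = (df_t - s + 0.5) / (N - df_t - S + s + 0.5)
        return math.log(odds_rel / odds_nonrel)

    print(round(c_t(N=1000, df_t=50, S=10, s=8), 3))  # term frequent in relevant docs
    print(round(c_t(N=1000, df_t=50, S=0, s=0), 3))   # no judgments: ~ log(N / df_t)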
The probabilistic and vector space approaches are not that different: in either case you build an information retrieval scheme in the exact same way.
For probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory. The next step is to add term frequency and length normalization to the probabilistic model.
Latent semantic indexing (LSI) is an indexing and retrieval method that uses a
mathematical technique called singular value decomposition (SVD) to identify
patterns in the relationships between the terms and concepts contained in an
unstructured collection of text. LSI is based on the principle that words that are used
in the same contexts tend to have similar meanings.
Using SVD for this purpose is called latent semantic indexing (LSI).
Steps:
1. Decompose the term-document matrix into a product of matrices. The particular decomposition used is the singular value decomposition (SVD):
SVD: C = U Σ V^T (where C = term-document matrix)
The term matrix U consists of one (row) vector for each term.
The document matrix V^T consists of one (column) vector for each document.
The singular value matrix Σ is a diagonal matrix whose singular values reflect the importance of each dimension.
2. Use the SVD to compute a new, improved term-document matrix C′ by reducing the dimensionality of the space; C′ gives better similarity values than C.
LSI takes documents that are semantically similar (i.e., talk about the same topics) but are not similar in the vector space (because they use different words) and re-represents them in a reduced vector space in which they have higher similarity. Thus, LSI addresses the problems of synonymy and semantic relatedness. In the standard vector space, synonyms contribute nothing to document similarity.
How LSI addresses synonymy and semantic relatedness
The matrix U consists of one (row) vector for each term: one row per term, one column per min(M, N), where M is the number of terms and N is the number of documents. It is an orthonormal matrix: (i) row vectors have unit length; (ii) any two distinct row vectors are orthogonal to each other. Think of the dimensions as "semantic" dimensions that capture distinct topics like politics, sports, economics. Each number uij in the matrix indicates how strongly related term i is to the topic represented by semantic dimension j.
The matrix Σ is a square, diagonal matrix of dimensionality min(M, N) × min(M, N); the diagonal consists of the singular values of C, in decreasing order of magnitude.
The matrix V^T has one column per document and one row per min(M, N), where M is the number of terms and N is the number of documents. Again, this is an orthonormal matrix: (i) column vectors have unit length; (ii) any two distinct column vectors are orthogonal to each other. Each number vij in the matrix indicates how strongly related document i is to the topic represented by semantic dimension j.
Reducing the dimensionality to 2
Actually, we only zero out singular values in Σ. This has the effect of setting the corresponding dimensions in U and V^T to zero when computing the product C2 = U Σ2 V^T.
We can view C2 as a two-dimensional representation of the matrix: we have performed a dimensionality reduction to two dimensions. Example check of one matrix entry as a dot product of a row and a column of the factors:
0.52 × 0.28 + 0.36 × 0.16 + 0.72 × 0.36 + 0.12 × 0.20 + (-0.39) × (-0.08) ≈ 0.52
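A minimal numpy sketch of LSI by truncated SVD (the tiny count matrix is invented; rank-2 truncation as in the discussion above):

    import numpy as np

    # Toy term-document count matrix C: rows = terms, columns = documents.
    C = np.array([[1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 1, 0, 0],
                  [0, 0, 1, 1]], dtype=float)

    U, s, Vt = np.linalg.svd(C, full_matrices=False)

    k = 2
    s2 = s.copy()
    s2[k:] = 0.0                  # zero out all but the k largest singular values
    C2 = U @ np.diag(s2) @ Vt     # the rank-k matrix C2 = U Sigma2 V^T

    print(np.round(C2, 2))
    # Columns of np.diag(s2[:k]) @ Vt[:k, :] give the k-dimensional
    # document representations used for similarity comparisons.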
_________________________________________________________
The idea of relevance feedback is to involve the user in the retrieval process so as to improve the final result set. In particular, the user gives feedback on the relevance of documents in an initial set of results. The basic procedure is:
1. The user issues a (short, simple) query.
2. The system returns an initial set of retrieval results.
3. The user marks some returned documents as relevant or nonrelevant.
4. The system computes a better representation of the information need based on this feedback.
5. The system displays a revised set of retrieval results.
We can iterate this over several rounds of relevance feedback. We will use the term ad hoc retrieval to refer to regular retrieval without relevance feedback.
Key concept for relevance feedback: the centroid
The centroid of a set of documents D is μ(D) = (1/|D|) Σ over d in D of v(d), where v(d) is the vector for document d.
The optimal query vector, for separating relevant from nonrelevant documents, is:
q_opt = μ(Cr) + [ μ(Cr) - μ(Cnr) ]
where Cr is the set of relevant and Cnr the set of nonrelevant documents. The problem is that we don't know the truly relevant docs. So we move the centroid of the known relevant documents by the difference between the two centroids.
The Rocchio algorithm implements this with the update:
qm = α q0 + β (1/|Dr|) Σ over dj in Dr of dj - γ (1/|Dnr|) Σ over dj in Dnr of dj
qm: modified query vector; q0: original query vector; Dr and Dnr: sets of known relevant and nonrelevant documents respectively; α, β, and γ: weights.
The new query moves towards relevant documents and away from nonrelevant documents.
Tradeoff α vs. β/γ: if we have a lot of judged documents, we want a higher β/γ.
A "negative weight" for a term doesn't make sense in the vector space model, so negative components of qm are reset to 0.
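A short numpy sketch of the Rocchio update (α, β, γ and all vectors are arbitrary illustration values):

    import numpy as np

    def rocchio(q0, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
        # Rocchio update; negative components are clipped to 0 afterwards.
        qm = (alpha * q0
              + beta * np.mean(rel, axis=0)
              - gamma * np.mean(nonrel, axis=0))
        return np.maximum(qm, 0.0)

    q0 = np.array([1.0, 0.0, 0.0])
    rel = np.array([[0.8, 0.4, 0.0], [0.6, 0.6, 0.0]])   # judged relevant
    nonrel = np.array([[0.0, 0.1, 0.9]])                 # judged nonrelevant
    print(rocchio(q0, rel, nonrel))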
(Figure: Rocchio illustrated. (3) The centroid of the relevant documents alone does not separate relevant from nonrelevant documents. (4) The centroid of the nonrelevant documents. (5)-(6) The difference vector between the two centroids. (7) Adding the difference vector to the centroid of the relevant documents (8) gives the modified query.)
Relevance feedback is expensive.
Relevance feedback creates long modified queries.
Long queries are expensive to process.
Users are reluctant to provide explicit feedback.
It's often hard to understand why a particular document was retrieved after applying relevance feedback.
The search engine Excite had full relevance feedback at one point, but
abandoned it later.
Pseudo-relevance feedback algorithm: do normal retrieval to find an initial set of results, assume that the top k ranked documents are relevant, and then run relevance feedback (e.g., Rocchio) under that assumption.
Query Expansion
For each term t in the query, expand the query with words the thesaurus lists as
semantically related with t.
It's very expensive to create a manual thesaurus and to maintain it over time. A manual thesaurus has an effect roughly equivalent to annotation with a controlled vocabulary.
Automatic thesaurus generation: compute term co-occurrence in C = A A^T, where A is the term-document matrix and wi,j is the (normalized) weight for (ti, dj); entry cu,v of C is then a similarity score between terms u and v, with higher values for terms that co-occur in many documents.
Example nearest neighbors from such an automatically generated thesaurus:
Word          Nearest neighbors
doghouse      dog, porch, crawling, beside, downstairs
makeup        repellent, lotion, glossy, sunscreen, skin, gel
mediating     reconciliation, negotiate, case, conciliation
keeping       hoping, bring, wiping, could, some, would
lithographs   drawings, Picasso, Dali, sculptures, Gauguin
pathogens     toxins, bacteria, organisms, bacterial, parasite
senses        grasp, psyche, truly, clumsy, naive, innate
WordSpace demo on web