Lecture 3 - Term Vocabulary and Posting Lists
Introduction to Information Retrieval
Document ingestion
Parsing a document
▪ What format is it in?
▪ pdf/word/excel/html?
▪ What language is it in?
▪ What character set is in use?
▪ (CP1252, UTF-8, …)
Complications: Format/language
▪ Documents being indexed can include docs from
many different languages
▪ A single index may contain terms from many languages.
▪ Sometimes a document or its components can
contain multiple languages/formats
▪ French email with a German pdf attachment.
Tokenization
▪ Input: “Friends, Romans and Countrymen”
▪ Output: Tokens
▪ Friends
▪ Romans
▪ Countrymen
▪ A token is an instance of a sequence of characters
▪ Each such token is now a candidate for an index
entry, after further processing
▪ But what are valid tokens to emit?
Tokenization
▪ Issues in tokenization:
▪ Finland’s capital →
Finland AND s? Finlands? Finland’s?
▪ Hewlett-Packard → Hewlett and Packard as two
tokens?
▪ state-of-the-art: break up hyphenated sequence.
▪ co-education
▪ lowercase, lower-case, lower case ?
▪ It can be effective to get the user to put in possible hyphens
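None of these issues has a single right answer; as a baseline, many systems simply emit maximal runs of alphanumeric characters. A minimal sketch (the regex and its behavior on the examples above are one possible choice, not a reference tokenizer):

```python
import re

def tokenize(text):
    """Emit maximal runs of letters/digits; apostrophes and hyphens
    act as separators, so "Finland's" splits into two tokens and
    "Hewlett-Packard" becomes two tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)

print(tokenize("Finland's capital"))  # ['Finland', 's', 'capital']
print(tokenize("Hewlett-Packard"))    # ['Hewlett', 'Packard']
```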
Numbers
▪ 3/20/91 Mar. 12, 1991 20/3/91
▪ 55 B.C.
▪ B-52
▪ My PGP key is 324a3df234cb23e
▪ (800) 234-2333
▪ Often have embedded spaces
▪ Older IR systems may not index numbers
▪ But often very useful: think about things like looking up error
codes/stacktraces on the web
▪ (One answer is using n-grams)
▪ Will often index “meta-data” separately
▪ Creation date, format, author etc.
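One way to make opaque strings such as error codes or hex keys searchable, as the n-gram answer above suggests, is to index their character n-grams; a sketch (the choice n = 3 is arbitrary):

```python
def char_ngrams(s, n=3):
    """All overlapping character n-grams of s."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(char_ngrams("324a3df"))  # ['324', '24a', '4a3', 'a3d', '3df']
```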
[Figure: arrows marking the reading direction and start position in mixed
right-to-left and left-to-right text]
▪ ‘Algeria achieved its independence in 1962 after 132 years of French
occupation.’
Terms
The things indexed in an IR system
Stop words
▪ With a stop list, you exclude the commonest words from the dictionary
entirely. Intuition:
▪ They have little semantic content: the, a, and, to, be
▪ There are a lot of them: ~30% of postings for top 30 words
▪ But the trend is away from doing this:
▪ Good compression techniques mean the space for including stop words
in a system is very small
▪ Good query optimization techniques mean you pay little at query time
for including stop words.
▪ You need them for:
▪ Phrase queries: “King of Denmark”
▪ Various song titles, etc.: “Let it be”, “To be or not to be”
▪ “Relational” queries: “flights to London”
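Filtering against a stop list is a one-line operation; the sketch below (with a tiny illustrative stop set) also shows why phrase queries such as “To be or not to be” break once stop words are gone:

```python
# A tiny illustrative stop set (real stop lists are longer).
STOP_WORDS = {"the", "a", "an", "and", "to", "be", "or", "not", "of"}

def remove_stop_words(tokens):
    """Drop every token found in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

# The whole phrase query vanishes:
print(remove_stop_words(["to", "be", "or", "not", "to", "be"]))  # []
print(remove_stop_words(["flights", "to", "London"]))  # ['flights', 'London']
```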
Normalization to terms
▪ Token normalization is the process of canonicalizing tokens so that
matches occur despite superficial differences in the character sequences
of the tokens
▪ We may need to “normalize” words in indexed text as well as query words
into the same form
▪ We want to match U.S.A. and USA
▪ Result is terms: a term is a (normalized) word type, which is an entry in our
IR system dictionary
▪ We most commonly implicitly define equivalence classes of terms by, e.g.,
▪ deleting periods to form a term
▪ U.S.A., USA → USA
▪ deleting hyphens to form a term
▪ anti-discriminatory, antidiscriminatory → antidiscriminatory
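The two equivalence-classing rules above can be sketched directly (deleting periods and hyphens; a real system would make such rules language- and token-type-dependent):

```python
def normalize(token):
    """Equivalence-class a token by deleting periods and hyphens."""
    return token.replace(".", "").replace("-", "")

# U.S.A. and USA now map to the same term:
assert normalize("U.S.A.") == normalize("USA") == "USA"
assert normalize("anti-discriminatory") == "antidiscriminatory"
```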
Case folding
▪ Reduce all letters to lower case
▪ exception: upper case in mid-sentence?
▪ e.g., General Motors
▪ Fed vs. fed
▪ SAIL vs. sail
▪ Often best to lower case everything, since users will use
lowercase regardless of ‘correct’ capitalization…
Stemming and Lemmatization
Lemmatization
▪ Lemmatization implies doing “proper” reduction to dictionary headword
form with the use of a vocabulary and morphological analysis of words
Stemming
▪ Reduce terms to their “roots” before indexing
Other stemmers
▪ Other stemmers exist:
▪ Lovins stemmer
▪ http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
▪ Single-pass, longest suffix removal (about 250 rules)
▪ Paice/Husk stemmer
▪ Snowball
▪ Rather than using a stemmer, one can use a lemmatizer, a tool
from NLP, that does full morphological analysis to accurately
identify the lemma for each word.
▪ At most modest benefits for retrieval
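As a toy illustration of single-pass, longest-suffix-first stripping (nothing like the roughly 250 rules of Lovins or the staged rules of Porter; the suffix list here is invented for the example):

```python
# Invented suffix list, ordered longest first.
SUFFIXES = ("ization", "ational", "ation", "ness", "ing", "ies", "ed", "es", "s")

def toy_stem(word):
    """Strip the first (longest) matching suffix, keeping a stem
    of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(toy_stem("automation"))  # autom
print(toy_stem("ponies"))      # pon
```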
Stemmer Examples
Sample text: Such an analysis can reveal features that are not easily visible from the
variations in the individual genes and can lead to a picture of expression that is
more biologically transparent and accessible to interpretation.
Output:
Lovins stemmer: such an analys can reve featur that ar not eas vis from th vari in th
individu gen and can lead to a pictur of expres that is mor biolog transpar and
acces to interpres
Porter stemmer: such an analysi can reveal featur that ar not easili visibl from the
variat in the individu gene and can lead to a pictur of express that is more biolog
transpar and access to interpret
Paice stemmer: such an analys can rev feat that are not easy vis from the vary in the
individ gen and can lead to a pict of express that is mor biolog transp and access to
interpret
Faster postings merges:
Skip pointers/Skip lists
Brutus → 2 → 4 → 8 → 41 → 48 → 64 → 128
Caesar → 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31
(skip pointers let the merge jump ahead, e.g. 2 → 41 → 128 on the Brutus
list and 1 → 11 → 31 on the Caesar list)
If the list lengths are m and n, the merge takes O(m+n)
operations.
Can we do better?
Yes (if the index isn’t changing too fast).
Placing skips
▪ Simple heuristic: for postings of length L, use √L evenly-spaced skip
pointers [Moffat and Zobel 1996]
▪ This definitely used to help; with modern hardware it may not unless
you’re memory-based [Bahle et al. 2002]
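A sketch of the skip-aware intersection (simplified: postings are plain sorted Python lists, and a skip of length ⌊√L⌋ is available from every position rather than only at the designated skip nodes):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted docID lists, jumping ahead by the skip
    length whenever the skip target does not overshoot the other
    list's current docID."""
    skip1 = max(1, math.isqrt(len(p1)))
    skip2 = max(1, math.isqrt(len(p2)))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1  # safe: every skipped docID is < p2[j]
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))  # [2, 8]
```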
1. Biword indexes
▪ One approach to handling phrases is to consider every pair of
consecutive terms in a document as a phrase.
For example, the text Friends, Romans, Countrymen would
generate the biwords:
friends romans
romans countrymen
▪ In this model, we treat each of these biwords as a vocabulary
term.
▪ The concept of a biword index can be extended to longer
sequences of words, and if the index includes variable length
word sequences, it is generally referred to as a phrase index.
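Generating the biword vocabulary terms for a token stream is a one-liner:

```python
def biwords(tokens):
    """Each pair of consecutive tokens becomes one vocabulary term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords(["friends", "romans", "countrymen"]))
# ['friends romans', 'romans countrymen']
```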
2. Positional indexes
▪ A biword index is not the standard solution. Rather, a positional
index is most commonly employed.
▪ Here, for each term in the vocabulary, we store postings of the form
docID: ⟨position1, position2, . . . ⟩ e.g.
to, 993427:
⟨1, 6: ⟨7, 18, 33, 72, 86, 231⟩;
2, 5: ⟨1, 17, 74, 222, 255⟩;
4, 5: ⟨8, 16, 190, 429, 433⟩;
5, 2: ⟨363, 367⟩;
7, 3: ⟨13, 23, 191⟩; . . . ⟩
be, 178239:
⟨1, 2: ⟨17, 25⟩;
4, 5: ⟨17, 191, 291, 430, 434⟩;
5, 3: ⟨14, 19, 101⟩; . . . ⟩
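Building such an index from tokenized documents can be sketched as a dictionary mapping each term to {docID: [positions]} (the document frequency shown next to each term above falls out as the number of docIDs per term):

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs maps docID -> list of tokens; returns a nested dict
    term -> {docID: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term][doc_id].append(pos)
    return index

idx = build_positional_index({1: ["to", "be", "or", "not", "to", "be"]})
print(dict(idx["to"]))  # {1: [0, 4]}
```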
2. Positional indexes
▪ To process a phrase query, we still need to access the inverted
index entries for each distinct term.
▪ As before, we would start with the least frequent term and
then work to further restrict the list of possible candidates.
to: ⟨. . . ; 4: ⟨. . . , 429, 433⟩; . . . ⟩
be: ⟨. . . ; 4: ⟨. . . , 430, 434⟩; . . . ⟩
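The final restriction step for a two-word phrase can be sketched as follows, assuming the positional postings are held in memory as term → {docID: [positions]} (a simplification of the on-disk layout shown earlier):

```python
def phrase_match(index, term1, term2):
    """DocIDs where term2 occurs at the position right after term1."""
    matches = []
    docs1, docs2 = index.get(term1, {}), index.get(term2, {})
    for doc_id in sorted(set(docs1) & set(docs2)):
        pos2 = set(docs2[doc_id])
        if any(p + 1 in pos2 for p in docs1[doc_id]):
            matches.append(doc_id)
    return matches

# Doc 4 from the example: "to" at 429/433 is followed by "be" at 430/434.
idx = {"to": {4: [8, 16, 190, 429, 433]},
       "be": {4: [17, 191, 291, 430, 434]}}
print(phrase_match(idx, "to", "be"))  # [4]
```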