6 The Term Vocabulary & Posting List
Lecture 10
Tokenization: Language Issues
• French
• L'ensemble: one token or two?
• L ? L’ ? Le ?
• Want l’ensemble to match with un ensemble
• Until at least 2003, it didn’t on Google
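One possible way to make l’ensemble match un ensemble is to split off the elided article during tokenization. The sketch below is only an illustration: the list of elided forms, the regular expression, and the function name `tokenize_french` are assumptions, not something the slides prescribe.

```python
import re

# Assumed (incomplete) list of French elided forms: l', d', j', m', n', s', t', c', qu'.
# Both the straight (') and the typographic (\u2019) apostrophe are accepted.
ELISION = re.compile("^(l|d|j|m|n|s|t|c|qu)['\u2019](?=\\w)", re.IGNORECASE)


def tokenize_french(text):
    """Split on whitespace, then split an elided article off the following word."""
    tokens = []
    for raw in text.lower().split():
        match = ELISION.match(raw)
        if match:
            tokens.append(raw[:match.end()])   # the clitic, e.g. "l'"
            tokens.append(raw[match.end():])   # the remaining word, e.g. "ensemble"
        else:
            tokens.append(raw)
    return tokens


print(tokenize_french("L'ensemble"))    # clitic and word become separate tokens
print(tokenize_french("un ensemble"))
```

With this split, both inputs yield the token "ensemble", so a query on one form can match documents containing the other.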
• Internationalization!
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
• Arabic (or Hebrew) is basically written right to left, but with certain items like
numbers written left to right
• Words are separated, but letter forms within a word form complex ligatures
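[Figure: Arabic example sentence with arrows marking the reading direction (← → ←→ ←): words run right to left, numbers left to right]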
• ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
• With Unicode, the surface presentation is complex, but the stored
form is straightforward
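A small sketch of what "stored form is straightforward" means in practice: the characters sit in memory in reading (logical) order, and each one carries a Unicode bidirectional class that a renderer uses to decide display direction. The sample string is a hypothetical example (an Arabic word followed by a year), not the sentence from the slide.

```python
import unicodedata

# Hypothetical sample in stored (logical) order: an Arabic word, a space, a year.
sample = "\u0627\u0633\u062A\u0642\u0644\u0627\u0644 1962"


def bidi_runs(text):
    """Group consecutive characters that share a Unicode bidirectional class."""
    runs = []
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if runs and runs[-1][0] == cls:
            runs[-1][1] += ch
        else:
            runs.append([cls, ch])
    return [(cls, chars) for cls, chars in runs]


for cls, chars in bidi_runs(sample):
    # AL = Arabic letter (right to left), WS = whitespace, EN = European number
    print(cls, len(chars))
```

The stored string is a plain left-to-right sequence of code points; only the display layer reorders the runs.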
Stop words
• With a stop list, you exclude from the dictionary entirely the commonest words.
Intuition:
• They have little semantic content: the, a, and, to, be
• There are a lot of them: ~30% of postings for top 30 words
• Drop common words (a, an, and, are, as, ...): they have little value in helping select documents
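• A general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times term t appears in the document collection) and take the most frequent terms as stop words
• Collection frequency is thus a third frequency measure, along with document frequency (the number of documents that contain t) and term frequency (the number of times t appears in a given document)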
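The three frequencies and the stop-list strategy can be sketched as follows. This is a minimal illustration that assumes whitespace tokenization and lowercasing; the function names and the toy documents are hypothetical.

```python
from collections import Counter


def collection_frequency(docs):
    """Total number of occurrences of each term across the whole collection."""
    cf = Counter()
    for doc in docs:
        cf.update(doc.lower().split())
    return cf


def document_frequency(docs):
    """Number of documents in which each term occurs at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def build_stop_list(docs, k):
    """Take the k terms with the highest collection frequency as stop words."""
    return [term for term, _ in collection_frequency(docs).most_common(k)]


docs = ["the cat sat on the mat", "the dog and the cat", "a dog"]
print(collection_frequency(docs)["the"])      # collection frequency of "the"
print(document_frequency(docs)["the"])        # document frequency of "the"
print(docs[0].lower().split().count("the"))   # term frequency of "the" in the first document
print(build_stop_list(docs, 1))
```

In practice the top of the sorted list is usually inspected by hand before it is adopted as a stop list.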
Normalization to terms
• We need to “normalize” words in indexed text as well as query words into the
same form
• We want to match U.S.A. and USA
• Result is terms: a term is a (normalized) word type, which is an entry in our IR
system dictionary
• We most commonly implicitly define equivalence classes of terms by, e.g.,
• deleting periods to form a term
• U.S.A., USA → USA
• deleting hyphens to form a term
• anti-discriminatory, antidiscriminatory → antidiscriminatory
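The two equivalence-classing rules above amount to a small normalization function applied to both indexed tokens and query tokens. A minimal sketch, with the function name `normalize` chosen here for illustration:

```python
def normalize(token):
    """Map a token to its term by deleting periods and hyphens."""
    return token.replace(".", "").replace("-", "")


for token in ["U.S.A.", "USA", "anti-discriminatory", "antidiscriminatory"]:
    print(token, "->", normalize(token))
```

Because the same function runs at indexing time and at query time, every member of an equivalence class lands on the same dictionary entry.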
Normalization: other languages
• Even in languages that standardly have accents, users often may not type them
• Often best to normalize to a de-accented term
• Tuebingen, Tübingen, Tubingen → Tubingen
Case folding
• Google example:
• Query C.A.T.
• #1 result is for “cat” (well, Lolcats), not Caterpillar Inc.
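De-accenting and case folding can be sketched with the standard library. The function names are illustrative, and the sketch assumes accents are removed by Unicode decomposition; it does not handle spelling variants such as Tuebingen, which would need a language-specific rule.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold(token):
    """Delete periods, remove accents, and fold case."""
    return strip_accents(token.replace(".", "")).casefold()


print(strip_accents("Tübingen"))
print(fold("C.A.T."))
```

Under this kind of normalization the query C.A.T. becomes indistinguishable from cat, which is consistent with the Google example above.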
FASTER POSTINGS MERGES: SKIP POINTERS/SKIP LISTS
Faster postings merges via Skip pointers/Skip lists
• Can we do better than a linear merge of the two postings lists?
• Yes (if the index isn’t changing too fast),
• i.e., entries are not frequently being added to or deleted from the postings lists
• Use skip lists: augment the postings lists with skip pointers (at indexing time)
• Let’s see how to add these skip pointers and how to use them to speed up search
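[Figure: two postings lists with skip pointers. Upper list: 2 4 8 41 48 64 128, with skip pointers to 41 and 128. Lower list: 1 2 3 8 11 17 21 31. Intersection: 2 8. Pointer p2 is shown at successive positions as the merge proceeds, and so on.]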
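The merge with skip pointers can be sketched as below, using the two lists from the figure. The representation is an assumption made for illustration: skip pointers are kept in a dictionary from source index to target index, and the skip length for the lower list (4) is hypothetical, since it cannot be read off the figure.

```python
def build_skips(postings, skip_len):
    """Map each skip source index to a target index skip_len positions ahead."""
    return {i: i + skip_len
            for i in range(0, len(postings) - skip_len, skip_len)}


def intersect_with_skips(p1, p2, skips1, skips2):
    """Intersect two sorted docID lists, following skip pointers when they do not overshoot."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i in skips1 and p1[skips1[i]] <= p2[j]:
                # Keep skipping while the skip target is still not past p2[j].
                while i in skips1 and p1[skips1[i]] <= p2[j]:
                    i = skips1[i]
            else:
                i += 1
        else:
            if j in skips2 and p2[skips2[j]] <= p1[i]:
                while j in skips2 and p2[skips2[j]] <= p1[i]:
                    j = skips2[j]
            else:
                j += 1
    return answer


upper = [2, 4, 8, 41, 48, 64, 128]
lower = [1, 2, 3, 8, 11, 17, 21, 31]
upper_skips = build_skips(upper, 3)   # 2 -> 41 -> 128, as in the figure
lower_skips = build_skips(lower, 4)   # assumed skip length
print(intersect_with_skips(upper, lower, upper_skips, lower_skips))
```

The skip pointers are built once at indexing time, so the extra cost at query time is only the comparison against the skip target.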
• Fewer skips → fewer pointer comparisons, but longer skip spans → few successful skips.
Placing Skips
[Figure: postings list of length L with evenly spaced skip pointers]
• Easy if the index is relatively static; harder if L keeps changing because of updates (deleting/inserting elements). A static index is best.
Important Points
• If the index is small, it fits entirely in memory (both the dictionary and the postings lists fit in main memory)
• If the corpus is large, the postings lists may have to be stored on disk, while the dictionary is kept in memory.
• Problem 1:
• We have a two-word query. For one term, the postings list consists of the following 16 entries:
[4, 6, 10, 12, 14, 16, 18, 20, 22, 32, 47, 81, 120, 122, 157, 180]
and for the other it is the one-entry postings list
[47]
Work out how many comparisons would be done to intersect the two postings lists with the following two strategies. Briefly justify your answer.
(a) Using standard postings lists.
(b) Using postings lists stored with skip pointers, with a skip length of √L
Problem 1 Solution
Problem 2
• We have a two-word query. For one term, the postings list consists of the following 16 entries:
[2, 4, 9, 12, 14, 16, 18, 20, 24, 32, 47, 81, 120, 125, 158, 180]
and for the other it is the one-entry postings list
[81]
Work out how many comparisons would be done to intersect the two postings lists with the following two strategies.
i. Using standard postings lists.
ii. Using postings lists stored with skip pointers, with the suggested skip length of √P
Problem 2 Solution
i. Using standard postings lists.
12 comparisons:
(2,81), (4,81), (9,81), (12,81), (14,81), (16,81), (18,81), (20,81), (24,81), (32,81), (47,81), (81,81)
ii. Using postings lists stored with skip pointers, with the suggested skip length of √P.
With P = 16 the skip length is √16 = 4, so skip pointers lead from 2 to 14, from 14 to 24, and from 24 to 120.
7 comparisons:
(2,81), (14,81), (24,81), (120,81), (32,81), (47,81), (81,81)
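The counts above can be checked mechanically. The sketch below makes two assumptions: skip pointers are placed every ⌊√P⌋ positions starting at the first posting, and each distinct pair of docIDs is counted once even if the merge loop looks at it again, which appears to be the convention used in the worked solution. All function names are hypothetical.

```python
import math


def skip_target(postings, i, k):
    """Index reached by the skip pointer at position i, or None if there is none."""
    if k > 1 and i % k == 0 and i + k < len(postings):
        return i + k
    return None


def advance(postings, i, k, target, note):
    """Move past postings[i] (known to be < target), skipping while that does not overshoot."""
    t = skip_target(postings, i, k)
    if t is not None:
        note(postings[t], target)
    if t is None or postings[t] > target:
        return i + 1  # no usable skip pointer: step to the next posting
    while t is not None and postings[t] <= target:
        i = t
        t = skip_target(postings, i, k)
        if t is not None:
            note(postings[t], target)
    return i


def intersect_counting(p1, p2, use_skips):
    """Return (intersection, distinct docID pairs compared, in order of first comparison)."""
    k1 = math.isqrt(len(p1)) if use_skips else 1
    k2 = math.isqrt(len(p2)) if use_skips else 1
    compared = []

    def note(a, b):
        if (a, b) not in compared:
            compared.append((a, b))

    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        note(p1[i], p2[j])
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            i = advance(p1, i, k1, p2[j], note)
        else:
            j = advance(p2, j, k2, p1[i], lambda b, a: note(a, b))
    return answer, compared


problem2 = [2, 4, 9, 12, 14, 16, 18, 20, 24, 32, 47, 81, 120, 125, 158, 180]
for use_skips in (False, True):
    answer, compared = intersect_counting(problem2, [81], use_skips)
    print(use_skips, answer, len(compared), compared)
```

Under these assumptions the sketch reproduces the 12 and 7 comparisons listed above, in the same order. Applied to the lists of Problem 1, the same convention gives 11 comparisons without skip pointers and 6 with them.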
Problem 3
Problem 3 Solution
Assignment - II
• Why are skip pointers not useful for queries of the form x OR y?
• Exercise 1.6, 1.11, 1.9, 2.2, 2.3