Algorithms For Information Retrieval: Index Construction


ALGORITHMS FOR INFORMATION RETRIEVAL
Index Construction

Bhaskarjyoti Das
Department of Computer Science Engineering
ALGORITHMS FOR INFORMATION RETRIEVAL

Blocked Sort Based Indexing Part 1

Bhaskarjyoti Das
Department of Computer Science Engineering
ALGORITHMS FOR INFORMATION RETRIEVAL
Recall Index Construction

1. Make a pass through the collection, assembling all term-docID pairs.
2. Sort the pairs with the term as the primary key and docID as the secondary key.
3. Organize the docIDs for each term into a postings list and compute
   statistics like term and document frequency.

Example (a code sketch of these steps follows this slide):

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Term-docID pairs in order of appearance:

Term        Doc #
I           1
did         1
enact       1
julius      1
caesar      1
I           1
was         1
killed      1
i'          1
the         1
capitol     1
brutus      1
killed      1
me          1
so          2
let         2
it          2
be          2
with        2
caesar      2
the         2
noble       2
brutus      2
hath        2
told        2
you         2
caesar      2
was         2
ambitious   2
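To make the three steps concrete, here is a minimal in-memory sketch in Python. It is not from the original slides; the tokenizer and the document dictionary are simplifying assumptions.

```python
from collections import defaultdict

def build_index(docs):
    """Toy in-memory index construction: collect term-docID pairs,
    sort them, and group docIDs into postings lists."""
    # Step 1: one pass over the collection, emitting (term, docID) pairs.
    pairs = []
    for doc_id, text in docs.items():
        for token in text.lower().split():
            pairs.append((token.strip(".,;"), doc_id))

    # Step 2: sort by term (primary key), then docID (secondary key).
    pairs.sort()

    # Step 3: group docIDs per term into postings lists and compute
    # document frequency (number of distinct docs containing the term).
    postings = defaultdict(list)
    for term, doc_id in pairs:
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return {term: (len(plist), plist) for term, plist in postings.items()}

# Usage with the two example documents from the slide.
docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}
index = build_index(docs)
print(index["brutus"])   # (2, [1, 2]) -> document frequency 2, postings [1, 2]
print(index["capitol"])  # (1, [1])
```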
ALGORITHMS FOR INFORMATION RETRIEVAL
Reuters RCV1 Statistics

Symbol   Statistic                                        Value
N        documents                                        800,000
L        avg. # tokens per document                       200
M        terms (= word types)                             400,000
         avg. # bytes per token (incl. spaces/punct.)     6
         avg. # bytes per token (without spaces/punct.)   4.5
         avg. # bytes per term                            7.5
T        non-positional postings                          100,000,000
ALGORITHMS FOR INFORMATION RETRIEVAL
Key Step: Sort

• After all documents have been parsed, we have to sort the 100M postings
  in RCV1.

• We focus on this sort step.


ALGORITHMS FOR INFORMATION RETRIEVAL
Challenge of In-Memory Sorting

• In-memory index construction does not scale.
• RCV1 has only 100 million term-docID pairs; typical collections are much
  larger!
• We cannot stuff the entire collection into memory, sort it, and then write
  it back.
• The earlier discussion assumed "internal sorting" (main-memory based),
  which is not going to work.
• We need an external sorting algorithm (one that works on data stored
  outside main memory).
ALGORITHMS FOR INFORMATION RETRIEVAL
Sort Using Disk As Memory?

• Can we use the same index construction algorithm, using disk instead of
  memory?
• Even if virtual memory is used, disk seeks come into play.
• No: sorting T = 100,000,000 records on disk is too slow – too many disk
  seeks.

• Example: if every comparison took 2 disk seeks, and N items could be
  sorted with N log2 N comparisons, how long would this take?
• Each disk seek takes on the order of milliseconds, so this translates to
  roughly 2 × 100M × log2(100M) seeks – on the order of a year or more of
  seek time alone, depending on the seek time (a rough estimate follows).
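A back-of-the-envelope check of that estimate, as a sketch; the 5 ms seek time is an assumption, not a figure from the slides:

```python
import math

N = 100_000_000           # postings to sort (T for RCV1)
seeks_per_comparison = 2  # from the slide's example
seek_time_s = 0.005       # assumed ~5 ms per disk seek

comparisons = N * math.log2(N)                             # ~2.7e9 comparisons
total_seconds = comparisons * seeks_per_comparison * seek_time_s
print(f"{total_seconds / 86400:.0f} days")                 # roughly 300+ days of seek time
```

With slower seeks (e.g. ~10 ms) the same arithmetic lands well past a year, which is why disk-based sorting with this algorithm is ruled out.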
ALGORITHMS FOR INFORMATION RETRIEVAL
Term vs. TermID in the Postings List
▪ To make index construction more efficient, we can represent each term
  by a termID, where each termID is a unique serial number.
▪ Savings: a term string takes more bytes on average than a fixed-width
  integer termID.
Creation of the term-termID dictionary:
▪ Two-pass approach
  ▪ compile the vocabulary in the first pass
  ▪ construct the inverted index in the second pass
▪ Multi-pass algorithms are preferable in certain applications, for example
  when disk space is scarce.
▪ Single-pass approach: build the mapping from term to termID on the fly
  while processing the collection (see the sketch after this slide).
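A minimal sketch of the single-pass, on-the-fly mapping; the class and method names are illustrative assumptions, not the slides' implementation:

```python
class TermIdMap:
    """Assigns a unique serial number (termID) to each new term on the fly."""

    def __init__(self):
        self._ids = {}

    def get_id(self, term):
        # Known terms keep their ID; unseen terms get the next serial number.
        if term not in self._ids:
            self._ids[term] = len(self._ids)
        return self._ids[term]

term_ids = TermIdMap()
print(term_ids.get_id("brutus"))   # 0 (first new term)
print(term_ids.get_id("caesar"))   # 1
print(term_ids.get_id("brutus"))   # 0 again -- already in the dictionary
```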
ALGORITHMS FOR INFORMATION RETRIEVAL
Challenge in Construction of the Postings List

• As we build the index, we parse docs one at a time.
• Assume 12-byte (4+4+4) records or postings entries (termID, docID, freq).
• The final postings list for any term is incomplete until the end of the
  pass (the term may occur in the last document!).
• At 12 bytes per non-positional postings entry (termID, docID, freq), this
  demands a lot of space for large collections.
• T = 100,000,000 in the case of RCV1.
• Solution: we need to store intermediate results on disk, as we do not have
  that much main memory!
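For concreteness, a 12-byte (termID, docID, freq) record can be packed as three 4-byte unsigned integers. This is a sketch; the field order and byte layout are assumptions for illustration:

```python
import struct

# Three 4-byte unsigned ints: termID, docID, term frequency -> 12 bytes total.
RECORD = struct.Struct("<III")

packed = RECORD.pack(42, 7, 3)   # termID=42, docID=7, freq=3
print(len(packed))               # 12
print(RECORD.unpack(packed))     # (42, 7, 3)
```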
ALGORITHMS FOR INFORMATION RETRIEVAL
Basic Idea of the BSBI Algorithm

• 12-byte (4+4+4) records (termID, docID, freq) are generated as we parse
  documents.
• We must now sort 100M such 12-byte records by termID for the RCV1 corpus.

Basic idea of the proposed algorithm:

1. Divide and conquer.
2. Define a block of ~10M such records.
3. We will have 10 such blocks to start with; we can easily fit a couple of
   them into memory.
4. Accumulate postings for each block, sort, and write the index for each
   block to disk.
5. Then merge the block indexes into one long index file (a minimal sketch
   of this loop follows).
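A compact sketch of that loop; the block reader, file naming, and merge step are simplified assumptions, not the slides' exact implementation:

```python
import heapq
import pickle

def bsbi_index(record_stream, block_size=10_000_000):
    """Blocked sort-based indexing: sort fixed-size blocks in memory,
    write each sorted run to disk, then merge the runs."""
    run_files = []
    block = []
    for record in record_stream:            # record = (termID, docID, freq)
        block.append(record)
        if len(block) == block_size:
            run_files.append(_write_run(block, len(run_files)))
            block = []
    if block:                                # flush the last partial block
        run_files.append(_write_run(block, len(run_files)))
    _merge_runs(run_files, "final_index.bin")

def _write_run(block, run_no):
    block.sort()                             # sort by termID, then docID
    path = f"run_{run_no}.bin"
    with open(path, "wb") as f:
        pickle.dump(block, f)
    return path

def _merge_runs(paths, out_path):
    # For simplicity this demo reloads whole runs into memory;
    # a real merge streams records from disk run by run.
    runs = []
    for p in paths:
        with open(p, "rb") as f:
            runs.append(pickle.load(f))
    with open(out_path, "wb") as f:
        # heapq.merge combines the sorted runs into one sorted sequence.
        pickle.dump(list(heapq.merge(*runs)), f)
```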
ALGORITHMS FOR INFORMATION RETRIEVAL
Sorting 10 Blocks of 10M Records Each

▪ First, read each block and sort within it:
  ▪ Quicksort is an in-place sorting algorithm.
  ▪ Quicksort takes 2N ln N expected steps.
  ▪ In our case, 2 × (10M ln 10M) steps per block.
▪ Doing this 10 times gives us 10 sorted runs of 10M records each.
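A quick check of that per-block figure, as a sketch; the step count is the textbook 2N ln N expectation, not a measured number:

```python
import math

N = 10_000_000                          # records per block
steps_per_block = 2 * N * math.log(N)   # expected quicksort comparisons
print(f"{steps_per_block:.2e}")         # ~3.2e8 steps per block
print(f"{10 * steps_per_block:.2e}")    # ~3.2e9 steps across all 10 blocks
```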
THANK YOU

Bhaskarjyoti Das
Department of Computer Science Engineering
