Algorithms For Information Retrieval: Index Construction


ALGORITHMS FOR INFORMATION RETRIEVAL
Index Construction

Bhaskarjyoti Das
Department of Computer Science Engineering
ALGORITHMS FOR INFORMATION RETRIEVAL

Blocked Sort Based Indexing Part 1

Bhaskarjyoti Das
Department of Computer Science Engineering
ALGORITHMS FOR INFORMATION RETRIEVAL
Recall Index Construction

1. Make a pass through the collection, assembling all term-docID pairs.
2. Sort the pairs with the term as the primary key and docID as the secondary key.
3. Organize the docIDs for each term into a postings list and compute
   statistics like term and document frequency.

Example (a code sketch of these steps follows this slide):

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Term-docID pairs in order of appearance:

Term        Doc #
I           1
did         1
enact       1
julius      1
caesar      1
I           1
was         1
killed      1
i'          1
the         1
capitol     1
brutus      1
killed      1
me          1
so          2
let         2
it          2
be          2
with        2
caesar      2
the         2
noble       2
brutus      2
hath        2
told        2
you         2
caesar      2
was         2
ambitious   2
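To make the three steps concrete, here is a minimal in-memory sketch in Python. It is not from the original slides; the tokenizer and the document dictionary are simplifying assumptions.

```python
from collections import defaultdict

def build_index(docs):
    """Toy in-memory index construction: collect term-docID pairs,
    sort them, and group docIDs into postings lists."""
    # Step 1: one pass over the collection, emitting (term, docID) pairs.
    pairs = []
    for doc_id, text in docs.items():
        for token in text.lower().split():
            pairs.append((token.strip(".,;"), doc_id))

    # Step 2: sort by term (primary key), then docID (secondary key).
    pairs.sort()

    # Step 3: group docIDs per term into postings lists and compute
    # document frequency (number of distinct docs containing the term).
    postings = defaultdict(list)
    for term, doc_id in pairs:
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return {term: (len(plist), plist) for term, plist in postings.items()}

# Usage with the two example documents from the slide.
docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}
index = build_index(docs)
print(index["brutus"])   # (2, [1, 2]) -> document frequency 2, postings [1, 2]
print(index["capitol"])  # (1, [1])
```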
ALGORITHMS FOR INFORMATION RETRIEVAL
Reuters RCV1 Statistics

Symbol   Statistic                                        Value
N        documents                                        800,000
L        avg. # tokens per document                       200
M        terms (= word types)                             400,000
         avg. # bytes per token (incl. spaces/punct.)     6
         avg. # bytes per token (without spaces/punct.)   4.5
         avg. # bytes per term                            7.5
T        non-positional postings                          100,000,000
ALGORITHMS FOR INFORMATION RETRIEVAL
Key Step: Sort

• After all documents have been parsed, we have to sort the 100M postings
  in RCV1.

• We focus on this sort step.


ALGORITHMS FOR INFORMATION RETRIEVAL
Challenge of In-Memory Sorting

• In-memory index construction does not scale.
• RCV1 has only 100 million term-docID pairs; typical collections are much
  larger!
• We cannot stuff the entire collection into memory, sort it, and then write
  it back.
• The earlier discussion assumed "internal sorting" (main-memory based),
  which is not going to work.
• We need an external sorting algorithm (one that works on data stored
  outside main memory).
ALGORITHMS FOR INFORMATION RETRIEVAL
Sort Using Disk As Memory?

• Can we use the same index construction algorithm, using disk instead of
  memory?
• Even if virtual memory is used, disk seeks come into play.
• No: sorting T = 100,000,000 records on disk is too slow – too many disk
  seeks.

• Example: if every comparison took 2 disk seeks, and N items could be
  sorted with N log2 N comparisons, how long would this take?
• Each disk seek takes on the order of milliseconds, so this translates to
  roughly 2 × 100M × log2(100M) seeks – on the order of a year or more of
  seek time alone, depending on the seek time (a rough estimate follows).
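A back-of-the-envelope check of that estimate, as a sketch; the 5 ms seek time is an assumption, not a figure from the slides:

```python
import math

N = 100_000_000           # postings to sort (T for RCV1)
seeks_per_comparison = 2  # from the slide's example
seek_time_s = 0.005       # assumed ~5 ms per disk seek

comparisons = N * math.log2(N)                             # ~2.7e9 comparisons
total_seconds = comparisons * seeks_per_comparison * seek_time_s
print(f"{total_seconds / 86400:.0f} days")                 # roughly 300+ days of seek time
```

With slower seeks (e.g. ~10 ms) the same arithmetic lands well past a year, which is why disk-based sorting with this algorithm is ruled out.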
ALGORITHMS FOR INFORMATION RETRIEVAL
Term vs. TermID in the Postings List
▪ To make index construction more efficient, we can represent each term
  by a termID, where each termID is a unique serial number.
▪ Savings: a term string takes more bytes on average than a fixed-width
  integer termID.
Creation of the term-termID dictionary:
▪ Two-pass approach
  ▪ compile the vocabulary in the first pass
  ▪ construct the inverted index in the second pass
▪ Multi-pass algorithms are preferable in certain applications, for example
  when disk space is scarce.
▪ Single-pass approach: build the mapping from term to termID on the fly
  while processing the collection (see the sketch after this slide).
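A minimal sketch of the single-pass, on-the-fly mapping; the class and method names are illustrative assumptions, not the slides' implementation:

```python
class TermIdMap:
    """Assigns a unique serial number (termID) to each new term on the fly."""

    def __init__(self):
        self._ids = {}

    def get_id(self, term):
        # Known terms keep their ID; unseen terms get the next serial number.
        if term not in self._ids:
            self._ids[term] = len(self._ids)
        return self._ids[term]

term_ids = TermIdMap()
print(term_ids.get_id("brutus"))   # 0 (first new term)
print(term_ids.get_id("caesar"))   # 1
print(term_ids.get_id("brutus"))   # 0 again -- already in the dictionary
```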
ALGORITHMS FOR INFORMATION RETRIEVAL
Challenge in Construction of the Postings List

• As we build the index, we parse docs one at a time.
• Assume 12-byte (4+4+4) records or postings entries (termID, docID, freq).
• The final postings list for any term is incomplete until the end of the
  pass (the term may occur in the last document!).
• At 12 bytes per non-positional postings entry (termID, docID, freq), this
  demands a lot of space for large collections.
• T = 100,000,000 in the case of RCV1.
• Solution: we need to store intermediate results on disk, as we do not have
  that much main memory!
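For concreteness, a 12-byte (termID, docID, freq) record can be packed as three 4-byte unsigned integers. This is a sketch; the field order and byte layout are assumptions for illustration:

```python
import struct

# Three 4-byte unsigned ints: termID, docID, term frequency -> 12 bytes total.
RECORD = struct.Struct("<III")

packed = RECORD.pack(42, 7, 3)   # termID=42, docID=7, freq=3
print(len(packed))               # 12
print(RECORD.unpack(packed))     # (42, 7, 3)
```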
ALGORITHMS FOR INFORMATION RETRIEVAL
Basic Idea of the BSBI Algorithm

• 12-byte (4+4+4) records (termID, docID, freq) are generated as we parse
  documents.
• We must now sort 100M such 12-byte records by termID for the RCV1 corpus.

Basic idea of the proposed algorithm:

1. Divide and conquer.
2. Define a block of ~10M such records.
3. We will have 10 such blocks to start with; we can easily fit a couple of
   them into memory.
4. Accumulate postings for each block, sort, and write the index for each
   block to disk.
5. Then merge the block indexes into one long index file (a minimal sketch
   of this loop follows).
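A compact sketch of that loop; the block reader, file naming, and merge step are simplified assumptions, not the slides' exact implementation:

```python
import heapq
import pickle

def bsbi_index(record_stream, block_size=10_000_000):
    """Blocked sort-based indexing: sort fixed-size blocks in memory,
    write each sorted run to disk, then merge the runs."""
    run_files = []
    block = []
    for record in record_stream:            # record = (termID, docID, freq)
        block.append(record)
        if len(block) == block_size:
            run_files.append(_write_run(block, len(run_files)))
            block = []
    if block:                                # flush the last partial block
        run_files.append(_write_run(block, len(run_files)))
    _merge_runs(run_files, "final_index.bin")

def _write_run(block, run_no):
    block.sort()                             # sort by termID, then docID
    path = f"run_{run_no}.bin"
    with open(path, "wb") as f:
        pickle.dump(block, f)
    return path

def _merge_runs(paths, out_path):
    # For simplicity this demo reloads whole runs into memory;
    # a real merge streams records from disk run by run.
    runs = []
    for p in paths:
        with open(p, "rb") as f:
            runs.append(pickle.load(f))
    with open(out_path, "wb") as f:
        # heapq.merge combines the sorted runs into one sorted sequence.
        pickle.dump(list(heapq.merge(*runs)), f)
```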
ALGORITHMS FOR INFORMATION RETRIEVAL
Sorting 10 Blocks of 10M Records Each

▪ First, read each block and sort within it:
  ▪ Quicksort is an in-place sorting algorithm.
  ▪ Quicksort takes 2N ln N expected steps.
  ▪ In our case, 2 × (10M ln 10M) steps per block.
▪ Doing this 10 times gives us 10 sorted runs of 10M records each.
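A quick check of that per-block figure, as a sketch; the step count is the textbook 2N ln N expectation, not a measured number:

```python
import math

N = 10_000_000                          # records per block
steps_per_block = 2 * N * math.log(N)   # expected quicksort comparisons
print(f"{steps_per_block:.2e}")         # ~3.2e8 steps per block
print(f"{10 * steps_per_block:.2e}")    # ~3.2e9 steps across all 10 blocks
```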
THANK YOU

Bhaskarjyoti Das
Department of Computer Science Engineering
