Algorithms For Information Retrieval: Index Construction
Bhaskarjyoti Das
Department of Computer Science Engineering
Recall Index Construction
1. Make a pass through the collection, assembling all term-docID pairs.
2. Sort the pairs with the term as the primary key and docID as the secondary key.
3. Organize the docIDs for each term into a postings list and compute statistics like term and document frequency.

Term-docID pairs assembled in step 1 (excerpt):

Term      Doc #
I         1
did       1
enact     1
julius    1
caesar    1
I         1
was       1
killed    1
i'        1
the       1
capitol   1
brutus    1
killed    1
me        1
so        2
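
The three steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a reference implementation: the names `tokenize` and `build_index` are hypothetical, the tokenizer simply lowercases and splits on whitespace, and the text of document 2 is an assumed placeholder (the table shows only its first term, "so").

```python
def tokenize(text):
    # Hypothetical tokenizer (an assumption): lowercase, split on whitespace.
    return text.lower().split()

def build_index(docs):
    """docs maps docID -> text. Returns term -> list of [docID, term frequency]."""
    # Step 1: one pass over the collection, assembling all term-docID pairs.
    pairs = [(term, doc_id) for doc_id, text in docs.items()
             for term in tokenize(text)]
    # Step 2: sort with the term as primary key and docID as secondary key.
    pairs.sort()
    # Step 3: group docIDs per term into postings lists, counting term frequency.
    index = {}
    for term, doc_id in pairs:
        postings = index.setdefault(term, [])
        if postings and postings[-1][0] == doc_id:
            postings[-1][1] += 1          # same document again: bump term frequency
        else:
            postings.append([doc_id, 1])  # first occurrence in this document
    return index

docs = {
    1: "I did enact julius caesar I was killed i' the capitol brutus killed me",
    2: "so let it be with caesar",  # assumed text, for illustration only
}
index = build_index(docs)
doc_freq = {term: len(postings) for term, postings in index.items()}
```

Here `index["caesar"]` holds one posting per document, and `doc_freq["caesar"]` is the document frequency of the term. Everything lives in memory, which is exactly the limitation the next bullets discuss.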
• Can we use the same index construction algorithm for larger collections, using disk instead of memory?
• Even if virtual memory is used, disk seeks still come into play.
• No: sorting T = 100,000,000 records on disk is too slow because it requires too many disk seeks.
• With 100 million postings, the postings list of any term remains incomplete until the end of the pass (the term may occur in the last document!).
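
To see why disk seeks dominate, a back-of-the-envelope estimate helps. Only T comes from the text above; the seek time (5 ms), the number of seeks per comparison (2), and the T log2 T comparison count are assumptions chosen for illustration, so the result indicates an order of magnitude only.

```python
import math

T = 100_000_000            # number of records, from the text
SEEK_TIME_S = 5e-3         # assumed average disk seek time: 5 ms
SEEKS_PER_COMPARISON = 2   # assumed: each comparison touches two records on disk

# A comparison sort needs on the order of T * log2(T) comparisons.
comparisons = T * math.log2(T)
total_seconds = comparisons * SEEKS_PER_COMPARISON * SEEK_TIME_S
total_days = total_seconds / 86_400

print(f"about {total_days:.0f} days of seeking")
```

Under these assumptions the sort spends roughly 300 days on seeks alone, which is why a disk-resident version of the in-memory algorithm is not practical.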