

MODERN INFORMATION RETRIEVAL

INDEX CONSTRUCTION

UNIT – 2

Prepared by: SRIDHAR UDAYAKUMAR


Outline
 Introduction
 BSBI algorithm
 SPIMI algorithm
Building an index

 Block-merge indexing
 Single-pass indexing
 Distributed indexing
Hardware basics

 Many design decisions in information retrieval are based on hardware constraints.
 We begin by reviewing the hardware basics that we'll need in this course.
Hardware basics (cont.)

 Access to data in memory is much faster than access to data on disk.
 Disk seeks: no data is transferred from disk while the disk head is being positioned.
 Therefore: transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
 Disk I/O is block-based: reading and writing of entire blocks (as opposed to smaller chunks).
 Block sizes: 8 KB to 256 KB.
Hardware basics (cont.)

 Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.
 Available disk space is several (2–3) orders of magnitude larger.
 Fault tolerance is very expensive: it's much cheaper to use many regular machines than one fault-tolerant machine.
RCV1: Our corpus for this lecture

 Shakespeare's collected works definitely aren't large enough.
 The corpus we'll use isn't really large enough either, but it's publicly available and is at least a more plausible example.
 As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection (approx. 1 GB).
 This is one year of Reuters newswire (part of 1996 and 1997).
A Reuters RCV1 document
[figure: a sample RCV1 newswire document]

Reuters RCV1 statistics
[table: RCV1 collection statistics]
Recall IIR 1 index construction

 Documents are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar I was
killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble
Brutus hath told you Caesar was ambitious

Term       Doc #
I          1
did        1
enact      1
julius     1
caesar     1
I          1
was        1
killed     1
i'         1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
the        2
noble      2
brutus     2
hath       2
told       2
you        2
caesar     2
was        2
ambitious  2
Key step

 After all documents have been parsed, the inverted file is sorted by terms.
 We focus on this sort step.
 We have 100M items to sort.

Term       Doc #        Term       Doc #
I          1            ambitious  2
did        1            be         2
enact      1            brutus     1
julius     1            brutus     2
caesar     1            capitol    1
I          1            caesar     1
was        1            caesar     2
killed     1            caesar     2
i'         1            did        1
the        1            enact      1
capitol    1            hath       2
brutus     1            I          1
killed     1            I          1
me         1            i'         1
so         2            it         2
let        2            julius     1
it         2            killed     1
be         2            killed     1
with       2            let        2
caesar     2            me         1
the        2            noble      2
noble      2            so         2
brutus     2            the        1
hath       2            the        2
told       2            told      2
you        2            was        1
caesar     2            was        2
was        2            with       2
ambitious  2            you        2
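The parse-then-sort step above can be sketched in Python on the two example documents. The whitespace tokenizer here is deliberately crude and only for illustration:

```python
# Two toy documents from the slide, keyed by document ID.
docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Parse: extract lowercased terms, each saved with its document ID.
pairs = []
for doc_id, text in docs.items():
    for token in text.replace(";", " ").replace(".", " ").split():
        pairs.append((token.lower(), doc_id))

# Key step: sort the (term, docID) pairs by term, then by docID.
pairs.sort()
```

After the sort, `pairs` begins with `("ambitious", 2)` and all postings for the same term are adjacent, which is exactly what makes building the inverted file easy.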
Scaling index construction

 In-memory index construction does not scale.
 How can we construct an index for very large collections?
 Taking into account the hardware constraints we just learned about …
 … memory, disk, speed, etc.
Sort-based Index construction

 As we build the index, we parse docs one at a time.
 While building the index, we cannot easily exploit compression tricks (you can, but it is much more complex).
 The final postings list for any term is incomplete until the end.
 At 8 bytes per postings entry, this demands a lot of space for large collections:
 T = 100,000,000 in the case of RCV1, i.e. about 800 MB of postings.
 So … we could do this in memory in 2008, but typical collections are much larger. E.g. the New York Times provides an index of more than 150 years of newswire.
 Thus: we need to store intermediate results on disk.
Same algorithm for disk?

 Can we use the same index construction algorithm (an internal sorting algorithm) for larger collections, but using disk instead of memory?
 No: sorting T = 100,000,000 records on disk is too slow – too many disk seeks.
 We need an external sorting algorithm.
BSBI algorithm
BSBI: Blocked sort-based indexing (sorting with fewer disk seeks)

 8-byte (4 + 4) records (termID, docID).
 These are generated as we parse docs.
 Must now sort 100M such 8-byte records by term.
 Define a block as ~10M such records.
 We can easily fit a couple of blocks into memory.
 We will have 10 such blocks to start with.

 Basic idea of the algorithm:
 Accumulate postings for each block, sort, and write to disk.
 Then merge the blocks into one long sorted order.
Merging two blocks
[figure: two sorted runs on disk merged into one]
Blocked Sort-Based Indexing

Block merge algorithm (from [Manning et al., 07]):

blockMerge(collection c)
  n <- 0
  while (c != [])
    n <- n + 1
    block <- parseNextBlock(c)
    invert(block)
    writeToDisk(block, f_n)
  endwhile
  return merge([f_1 .. f_n])

Note: merging needs the term–termID mapping.
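A runnable Python sketch of the same idea, with in-memory lists standing in for the on-disk files f_1 .. f_n; the block size here is tiny for illustration, whereas real blocks hold ~10M (termID, docID) records:

```python
import heapq

def bsbi(records, block_size):
    """Blocked sort-based indexing sketch over (termID, docID) records."""
    runs = []                            # stands in for on-disk files f_1 .. f_n
    block = []
    for record in records:
        block.append(record)
        if len(block) == block_size:     # "memory" is full: sort, write a run
            runs.append(sorted(block))
            block = []
    if block:                            # flush the final, partially full block
        runs.append(sorted(block))
    # Merge all sorted runs into one long sorted list of postings.
    return list(heapq.merge(*runs))
```

For example, `bsbi([(3, 1), (1, 2), (2, 1), (1, 1), (3, 2)], block_size=2)` produces the fully sorted postings `[(1, 1), (1, 2), (2, 1), (3, 1), (3, 2)]`, sorting only `block_size` records in memory at a time.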
How to merge the sorted runs?

 Can do binary merges, with a merge tree of ⌈log₂ 10⌉ = 4 layers.
 During each layer, read runs into memory in blocks of 10M, merge, and write back.

[figure: runs 1–4 on disk being merged pairwise into one merged run]
How to merge the sorted runs?

 But it is more efficient to do an n-way merge, where you are reading from all blocks simultaneously.
 Provided you read decent-sized chunks of each block into memory, you're not killed by disk seeks.
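Python's standard library `heapq.merge` performs exactly this kind of lazy n-way merge, consuming all runs simultaneously while holding only a small amount of each in memory:

```python
import heapq

# Three sorted runs of (term, docID) postings, as BSBI would write them.
run1 = [("brutus", 1), ("caesar", 1)]
run2 = [("brutus", 2), ("caesar", 2)]
run3 = [("ambitious", 2), ("caesar", 2)]

# n-way merge: yields postings in globally sorted order without
# concatenating and re-sorting the runs.
merged = list(heapq.merge(run1, run2, run3))
```

On disk, the same pattern applies with buffered readers over the run files: as long as each run is read in decent-sized chunks, seeks stay rare.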
SPIMI algorithm
Problem with sort-based algorithm

 Our assumption was: we can keep the dictionary in memory.
 We need the dictionary (which grows dynamically) to map a term to a termID.
 Actually, we could work with (term, docID) postings instead of (termID, docID) postings …
 … but then intermediate files become very large. (We would end up with a scalable, but very slow, index construction method.)
Single-pass in-memory indexing

 Key idea 1: generate separate dictionaries for each block – no need to maintain a term–termID mapping across blocks.
 Key idea 2: don't sort. Accumulate postings in postings lists as they occur.
 With these two ideas we can generate a complete inverted index for each block.
 These separate indexes can then be merged into one big index.
SPIMI-Invert
[figure: the SPIMI-Invert pseudocode]
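The core of SPIMI-Invert can be sketched as follows; this is a simplified in-memory version of one block (the full algorithm also detects when memory is exhausted and writes the block's index to disk):

```python
def spimi_invert(token_stream):
    """Build one block's inverted index from a stream of (term, docID) pairs."""
    dictionary = {}
    for term, doc_id in token_stream:
        # Key idea 2: no sorting of postings; just append to the term's
        # postings list in the order tokens occur.
        dictionary.setdefault(term, []).append(doc_id)
    # Terms are sorted only once, when the completed block is written out.
    return {term: dictionary[term] for term in sorted(dictionary)}
```

For example, `spimi_invert([("caesar", 1), ("brutus", 1), ("caesar", 2)])` yields `{"brutus": [1], "caesar": [1, 2]}` with terms in sorted order, ready to be merged with the indexes of other blocks.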