

MODERN INFORMATION RETRIEVAL

INDEX CONSTRUCTION

UNIT – 2

Prepared by: SRIDHAR UDAYAKUMAR


Outline
 Introduction
 BSBI algorithm
 SPIMI algorithm
Building an index

 Block-merge indexing
 Single-pass indexing
 Distributed indexing
Hardware basics

 Many design decisions in information retrieval are based on hardware constraints.
 We begin by reviewing the hardware basics that we'll need in this course.
Hardware basics (cont.)

 Access to data in memory is much faster than access to data on disk.
 Disk seeks: no data is transferred from disk while the disk head is being positioned.
 Therefore: transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
 Disk I/O is block-based: reading and writing of entire blocks (as opposed to smaller chunks).
 Block sizes: 8 KB to 256 KB.
Hardware basics (cont.)

 Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.
 Available disk space is several (2–3) orders of magnitude larger.
 Fault tolerance is very expensive: it's much cheaper to use many regular machines than one fault-tolerant machine.
RCV1: Our corpus for this lecture

 Shakespeare's collected works definitely aren't large enough.
 The corpus we'll use isn't really large enough either, but it's publicly available and is at least a more plausible example.
 As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection (approx. 1 GB).
 This is one year of Reuters newswire (part of 1996 and 1997).
A Reuters RCV1 document
[figure: a sample RCV1 newswire document]

Reuters RCV1 statistics
[table: RCV1 collection statistics]
Recall IIR 1 index construction

 Documents are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar I was
killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble
Brutus hath told you Caesar was ambitious

Term       Doc #
I          1
did        1
enact      1
julius     1
caesar     1
I          1
was        1
killed     1
i'         1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
the        2
noble      2
brutus     2
hath       2
told       2
you        2
caesar     2
was        2
ambitious  2
Key step

 After all documents have been parsed, the inverted file is sorted by terms.
 We focus on this sort step.
 We have 100M items to sort.

Term       Doc #        Term       Doc #
I          1            ambitious  2
did        1            be         2
enact      1            brutus     1
julius     1            brutus     2
caesar     1            capitol    1
I          1            caesar     1
was        1            caesar     2
killed     1            caesar     2
i'         1            did        1
the        1            enact      1
capitol    1            hath       2
brutus     1            I          1
killed     1            I          1
me         1            i'         1
so         2            it         2
let        2            julius     1
it         2            killed     1
be         2            killed     1
with       2            let        2
caesar     2            me         1
the        2            noble      2
noble      2            so         2
brutus     2            the        1
hath       2            the        2
told       2            told      2
you        2            was        1
caesar     2            was        2
was        2            with       2
ambitious  2            you        2
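The parse-then-sort step above can be sketched in Python on the two example documents. The whitespace tokenizer here is deliberately crude and only for illustration:

```python
# Two toy documents from the slide, keyed by document ID.
docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Parse: extract lowercased terms, each saved with its document ID.
pairs = []
for doc_id, text in docs.items():
    for token in text.replace(";", " ").replace(".", " ").split():
        pairs.append((token.lower(), doc_id))

# Key step: sort the (term, docID) pairs by term, then by docID.
pairs.sort()
```

After the sort, `pairs` begins with `("ambitious", 2)` and all postings for the same term are adjacent, which is exactly what makes building the inverted file easy.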
Scaling index construction

 In-memory index construction does not scale.
 How can we construct an index for very large collections?
 Taking into account the hardware constraints we just learned about …
 … memory, disk, speed, etc.
Sort-based Index construction

 As we build the index, we parse docs one at a time.
 While building the index, we cannot easily exploit compression tricks (you can, but it is much more complex).
 The final postings list for any term is incomplete until the end.
 At 8 bytes per postings entry, this demands a lot of space for large collections:
 T = 100,000,000 in the case of RCV1, i.e. about 800 MB of postings.
 So … we could do this in memory in 2008, but typical collections are much larger. E.g. the New York Times provides an index of more than 150 years of newswire.
 Thus: we need to store intermediate results on disk.
Same algorithm for disk?

 Can we use the same index construction algorithm (an internal sorting algorithm) for larger collections, but using disk instead of memory?
 No: sorting T = 100,000,000 records on disk is too slow – too many disk seeks.
 We need an external sorting algorithm.
BSBI algorithm
BSBI: Blocked sort-based indexing (sorting with fewer disk seeks)

 8-byte (4 + 4) records (termID, docID).
 These are generated as we parse docs.
 Must now sort 100M such 8-byte records by term.
 Define a block as ~10M such records.
 We can easily fit a couple of blocks into memory.
 We will have 10 such blocks to start with.

 Basic idea of the algorithm:
 Accumulate postings for each block, sort, and write to disk.
 Then merge the blocks into one long sorted order.
Merging two blocks
[figure: two sorted runs on disk merged into one]
Blocked Sort-Based Indexing

Block merge algorithm (from [Manning et al., 07]):

blockMerge(collection c)
  n <- 0
  while (c != [])
    n <- n + 1
    block <- parseNextBlock(c)
    invert(block)
    writeToDisk(block, f_n)
  endwhile
  return merge([f_1 .. f_n])

Note: merging needs the term–termID mapping.
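A runnable Python sketch of the same idea, with in-memory lists standing in for the on-disk files f_1 .. f_n; the block size here is tiny for illustration, whereas real blocks hold ~10M (termID, docID) records:

```python
import heapq

def bsbi(records, block_size):
    """Blocked sort-based indexing sketch over (termID, docID) records."""
    runs = []                            # stands in for on-disk files f_1 .. f_n
    block = []
    for record in records:
        block.append(record)
        if len(block) == block_size:     # "memory" is full: sort, write a run
            runs.append(sorted(block))
            block = []
    if block:                            # flush the final, partially full block
        runs.append(sorted(block))
    # Merge all sorted runs into one long sorted list of postings.
    return list(heapq.merge(*runs))
```

For example, `bsbi([(3, 1), (1, 2), (2, 1), (1, 1), (3, 2)], block_size=2)` produces the fully sorted postings `[(1, 1), (1, 2), (2, 1), (3, 1), (3, 2)]`, sorting only `block_size` records in memory at a time.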
How to merge the sorted runs?

 Can do binary merges, with a merge tree of ⌈log₂ 10⌉ = 4 layers.
 During each layer, read runs into memory in blocks of 10M, merge, and write back.

[figure: runs 1–4 on disk being merged pairwise into one merged run]
How to merge the sorted runs?

 But it is more efficient to do an n-way merge, where you are reading from all blocks simultaneously.
 Provided you read decent-sized chunks of each block into memory, you're not killed by disk seeks.
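Python's standard library `heapq.merge` performs exactly this kind of lazy n-way merge, consuming all runs simultaneously while holding only a small amount of each in memory:

```python
import heapq

# Three sorted runs of (term, docID) postings, as BSBI would write them.
run1 = [("brutus", 1), ("caesar", 1)]
run2 = [("brutus", 2), ("caesar", 2)]
run3 = [("ambitious", 2), ("caesar", 2)]

# n-way merge: yields postings in globally sorted order without
# concatenating and re-sorting the runs.
merged = list(heapq.merge(run1, run2, run3))
```

On disk, the same pattern applies with buffered readers over the run files: as long as each run is read in decent-sized chunks, seeks stay rare.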
SPIMI algorithm
Problem with sort-based algorithm

 Our assumption was: we can keep the dictionary in memory.
 We need the dictionary (which grows dynamically) to map a term to a termID.
 Actually, we could work with (term, docID) postings instead of (termID, docID) postings …
 … but then intermediate files become very large. (We would end up with a scalable, but very slow, index construction method.)
Single-pass in-memory indexing

 Key idea 1: generate separate dictionaries for each block – no need to maintain a term–termID mapping across blocks.
 Key idea 2: don't sort. Accumulate postings in postings lists as they occur.
 With these two ideas we can generate a complete inverted index for each block.
 These separate indexes can then be merged into one big index.
SPIMI-Invert
[figure: the SPIMI-Invert pseudocode]
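The core of SPIMI-Invert can be sketched as follows; this is a simplified in-memory version of one block (the full algorithm also detects when memory is exhausted and writes the block's index to disk):

```python
def spimi_invert(token_stream):
    """Build one block's inverted index from a stream of (term, docID) pairs."""
    dictionary = {}
    for term, doc_id in token_stream:
        # Key idea 2: no sorting of postings; just append to the term's
        # postings list in the order tokens occur.
        dictionary.setdefault(term, []).append(doc_id)
    # Terms are sorted only once, when the completed block is written out.
    return {term: dictionary[term] for term in sorted(dictionary)}
```

For example, `spimi_invert([("caesar", 1), ("brutus", 1), ("caesar", 2)])` yields `{"brutus": [1], "caesar": [1, 2]}` with terms in sorted order, ready to be merged with the indexes of other blocks.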