
5: Index Construction

Information Retrieval Techniques

Index construction

p  How do we construct an index?


p  What strategies can we use with limited main
memory?

2
Hardware basics

p  Many design decisions in information retrieval are


based on the characteristics of hardware
p  We begin by reviewing hardware basics

3
Hardware basics

p  Access to data in memory is much faster than


access to data on disk
p  Disk seeks: No data is transferred from disk
while the disk head is being positioned
p  Therefore transferring one large chunk of data
from disk to memory is faster than transferring
many small chunks
p  Disk I/O is block-based: Reading and writing of
entire blocks (as opposed to smaller chunks)
p  Block sizes: 8KB to 256 KB.

4
Hardware basics

p  Servers used in IR systems now typically have


several GB of main memory, sometimes tens of
GB
p  Available disk space is several (2–3) orders of
magnitude larger
p  Fault tolerance is very expensive: It’s much
cheaper to use many regular machines rather
than one fault tolerant machine.

5
Google Web Farm

p  The best guess is that Google now has more than
2 Million servers (8 Petabytes of RAM 8*106
Gigabytes)
p  Spread over at least 12 locations around the world
p  Connecting these centers is a high-capacity fiber optic
network that the company has assembled over the
last few years.

The Dalles, Oregon Dublin, Ireland 6


Hardware assumptions

p  symbol statistic value


p  s average seek time 5 ms = 5 x 10−3 s
p  b transfer time per byte 0.02 µs = 2 x 10−8 s/B
p  processor’s clock rate 109 s−1
p  p low-level operation 0.01 µs = 10−8 s
(e.g., compare & swap a word)
p  size of main memory several GB
p  size of disk space 1 TB or more

p  Example: Reading 1GB from disk


n  If stored in contiguous blocks: 2 x 10−8 s/B x 109 B = 20s
n  If stored in 1M chunks of 1KB: 20s + 106x 5 x 10−3s = 5020
s = 1.4 h 7
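A quick check of this arithmetic in Python (a minimal sketch; the constants come from the table above):

    b = 2e-8                    # transfer time per byte (s/B)
    s = 5e-3                    # average seek time (s)
    size = 10**9                # 1 GB in bytes

    contiguous = b * size                     # 20.0 s
    chunked = contiguous + 10**6 * s          # plus one seek per 1 KB chunk
    print(contiguous, chunked, chunked/3600)  # 20.0  5020.0  ~1.39 h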
A Reuters RCV1 document

Reuters RCV1 statistics

symbol statistic value

N documents 800,000
L avg. # tokens per doc 200
M terms (= word types) 400,000
avg. # bytes per token 6
(incl. spaces/punct.)
avg. # bytes per token 4.5
(without spaces/punct.)
avg. # bytes per term 7.5
T non-positional postings 100,000,000

• 4.5 bytes per word token vs. 7.5 bytes per word type: why?
• Why is T < N × L?
Recall IIR 1 index construction
p  Documents are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar. I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Term       Doc #
I          1
did        1
enact      1
julius     1
caesar     1
I          1
was        1
killed     1
i'         1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
the        2
noble      2
brutus     2
hath       2
told       2
you        2
caesar     2
was        2
ambitious  2
Sec. 4.2

Key step
p  After all documents have been parsed, the inverted file is sorted by terms.
p  We focus on this sort step. We have 100M items to sort for Reuters RCV1 (after having removed duplicate docIDs for each term).

Before sorting        After sorting
Term       Doc #      Term       Doc #
I          1          ambitious  2
did        1          be         2
enact      1          brutus     1
julius     1          brutus     2
caesar     1          capitol    1
I          1          caesar     1
was        1          caesar     2
killed     1          caesar     2
i'         1          did        1
the        1          enact      1
capitol    1          hath       2
brutus     1          I          1
killed     1          I          1
me         1          i'         1
so         2          it         2
let        2          julius     1
it         2          killed     1
be         2          killed     1
with       2          let        2
caesar     2          me         1
the        2          noble      2
noble      2          so         2
brutus     2          the        1
hath       2          the        2
told       2          told       2
you        2          you        2
caesar     2          was        1
was        2          was        2
ambitious  2          with       2
Sec. 4.2

Scaling index construction

p  In-memory index construction does not scale


p  How can we construct an index for very large
collections?
p  Taking into account the hardware constraints we
just learned about . . .
p  Memory, disk, speed, etc.

12
Sec. 4.2

Sort-based index construction

p  As we build the index, we parse docs one at a time


n  While building the index, we cannot easily exploit
compression tricks (you can, but much more
complex)
n  The final postings for any term are incomplete until
the end
p  At 12 bytes per non-positional postings entry (term,
doc, freq), demands a lot of space for large collections
p  T = 100,000,000 in the case of RCV1 – so 1.2GB
n  So … we can do this in memory in 2015, but typical
collections are much larger - e.g. the New York
Times provides an index of >150 years of newswire
p  Thus: We need to store intermediate results on disk.
13
Sec. 4.2

Use the same algorithm for disk?

p  Can we use the same index construction


algorithm for larger collections, but by using
disk instead of memory?
n  I.e. scan the documents, and for each term
write the corresponding posting (term, doc,
freq) on a file
n  Finally sort the postings and build the postings
lists for all the terms
p  No: Sorting T = 100,000,000 records (term, doc,
freq) on disk is too slow – too many disk seeks
n  See next slide
p  We need an external sorting algorithm.
14
Sec. 4.2

Bottleneck
p  Parse and build postings entries one doc at a time
p  Then sort postings entries by term (then by doc
within each term)
p  Doing this with random disk seeks would be too
slow – must sort T = 100M records

If every comparison took 2 disk seeks, and N items could be


sorted with N log2N comparisons, how long would this take?
p  symbol statistic value
p  s average seek time 5 ms = 5 x 10−3 s
p  b transfer time per byte 0.02 µs = 2 x 10−8 s
p  p low-level operation 0.01 µs = 10−8 s
(e.g., compare & swap a word) 15
Solution

(2 × seek-time + comparison-time) × N log₂ N seconds
= (2 × 5 × 10⁻³ + 10⁻⁸) × 10⁸ log₂ 10⁸
≈ (2 × 5 × 10⁻³) × 10⁸ log₂ 10⁸
  (the comparison time, like the time for transferring data into main memory, is negligible)
= 10⁶ × log₂ 10⁸ ≈ 10⁶ × 26.5 = 2.65 × 10⁷ s ≈ 307 days!
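The same estimate in a few lines of Python (a quick sketch; N and the seek time come from the hardware assumptions above):

    import math

    seek = 5e-3                          # average seek time (s)
    N = 10**8                            # postings records to sort
    comparisons = N * math.log2(N)       # ~2.66e9 comparisons
    seconds = 2 * seek * comparisons     # the 2 disk seeks per comparison dominate
    print(seconds / 86400)               # ~307 days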

p  What can we do? 16


Gaius Julius Caesar

Divide et Impera (divide and conquer)

Sec. 4.2

BSBI: Blocked sort-based Indexing


(Sorting with fewer disk seeks)
p  12-byte (4+4+4) records (term-id, doc-id, freq)
p  These are generated as we parse docs
p  Must now sort 100M such 12-byte records by
term
p  Define a Block ~ 10M such records
n  Can easily fit a couple into memory
n  Will have 10 such blocks to start with (RCV1)
p  Basic idea of algorithm:
n  Accumulate postings for each block (write on a
file), (read and) sort, write to disk
n  Then merge the sorted blocks into one long
18
sorted order.
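A minimal sketch of this block phase in Python (the function and file names are hypothetical, and runs are pickled for brevity; a real implementation writes fixed-width binary records):

    import os, pickle

    def make_sorted_runs(pair_stream, block_size=10_000_000, tmpdir="runs"):
        # Accumulate (term_id, doc_id) records; whenever a block fills up,
        # sort it in memory and spill it to disk as one sorted run.
        os.makedirs(tmpdir, exist_ok=True)
        runs, block = [], []
        for pair in pair_stream:              # pairs produced while parsing docs
            block.append(pair)
            if len(block) == block_size:
                runs.append(spill(block, tmpdir, len(runs)))
                block = []
        if block:                             # last, partially filled block
            runs.append(spill(block, tmpdir, len(runs)))
        return runs                           # paths of the sorted runs on disk

    def spill(block, tmpdir, i):
        block.sort()                          # sorts by term_id, then doc_id
        path = os.path.join(tmpdir, f"run{i}.bin")
        with open(path, "wb") as f:
            pickle.dump(block, f)
        return path

The merge of these runs into one sorted order is sketched a few slides below.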
Sec. 4.2
[Figure: sorted runs obtained by parsing different documents; the blocks contain term-ids instead of terms]
Sec. 4.2

Sorting 10 blocks of 10M records

p  First, read each block and sort (in memory)


within:
n  Quicksort takes 2N log2N expected steps
n  In our case 2 x (10M log210M) steps
p  Exercise: estimate total time to read each block
from disk and quicksort it
n  Approximately 7 s
p  10 times this estimate – gives us 10 sorted runs
of 10M records each
p  Done straightforwardly, need 2 copies of data on
disk
n  But can optimize this 20
Sec. 4.2

Blocked sort-based indexing

[Figure: the BSBI algorithm, keeping the dictionary in memory; n = number of generated blocks]
Sec. 4.2

How to merge the sorted runs?

p  Open all block files and maintain small read


buffers - and a write buffer for the final merged
index
p  In each iteration select the lowest termID that
has not been processed yet
p  All postings lists for this termID are read and
merged, and the merged list is written back to disk
p  Each read buffer is refilled from its file when
necessary
p  Providing you read decent-sized chunks of each
block into memory and then write out a decent-
sized output chunk, then you’re not killed by disk
22
seeks.
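A sketch of this merge using Python's heapq.merge for the k-way merge (run format as in the earlier BSBI sketch; for brevity each run is loaded whole, where a real system would refill small read buffers):

    import heapq, pickle
    from itertools import groupby

    def merge_runs(run_paths, out_path):
        streams = [read_run(p) for p in run_paths]
        with open(out_path, "w") as out:
            merged = heapq.merge(*streams)    # always yields the lowest (termID, docID) next
            for term_id, group in groupby(merged, key=lambda p: p[0]):
                postings = sorted({doc_id for _, doc_id in group})
                out.write(f"{term_id}\t{postings}\n")

    def read_run(path):
        with open(path, "rb") as f:           # whole run read at once for brevity;
            yield from pickle.load(f)         # a real system streams it in chunks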
Sec. 4.3
Remaining problem with the sort-based algorithm

p  Our assumption was: we can keep the dictionary


in memory
p  We need the dictionary (which grows
dynamically) in order to implement a term to
termID mapping
p  Actually, we could work with (term, docID)
postings instead of (termID, docID) postings . . .
p  . . . but then intermediate files become larger -
we would end up with a scalable, but slower
index construction method.
Why?
23
Sec. 4.3

SPIMI:
Single-pass in-memory indexing

p  Key idea 1: Generate separate dictionaries for


each block – no need to maintain term-termID
mapping across blocks
p  Key idea 2: Don’t sort the postings - accumulate
postings in postings lists as they occur
n  But at the end, before writing on disk, sort
the terms
p  With these two ideas we can generate a complete
inverted index for each block
p  These separate indexes can then be merged into
one big index (because terms are sorted).
24
Sec. 4.3

SPIMI-Invert

p  When memory has been exhausted, write the index of the block (dictionary, postings lists) to disk
p  The merging of blocks is then analogous to BSBI (plus dictionary merging); a sketch follows below.
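A minimal SPIMI-Invert sketch in Python (the memory test is simplified to a posting count, and terms themselves key the per-block dictionary, so no global term-termID map is needed):

    from collections import defaultdict

    def spimi_invert(token_stream, max_postings=10_000_000, block_id=0):
        dictionary = defaultdict(list)        # per-block: term -> postings list
        for n, (term, doc_id) in enumerate(token_stream, 1):
            dictionary[term].append(doc_id)   # postings accumulated, unsorted
            if n == max_postings:             # stand-in for "memory exhausted"
                break
        with open(f"block{block_id}.txt", "w") as f:
            for term in sorted(dictionary):   # sort the terms once, at write time
                f.write(f"{term}\t{dictionary[term]}\n")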


Sec. 4.3

SPIMI: Compression

p  Compression makes SPIMI even more efficient.


n  Compression of terms
n  Compression of postings

26
Sec. 4.4

Distributed indexing

p  For web-scale indexing (don’t try this at home!):


must use a distributed computing cluster
p  Individual machines are fault-prone
n  Can unpredictably slow down or fail
p  How do we exploit such a pool of machines?

27
Sec. 4.4

Google data centers

p  Google data centers mainly contain commodity


machines
p  Data centers are distributed around the world
p  Estimate: a total of 2 million servers
p  Estimate: Google installs 100,000 servers each
quarter
n  Based on expenditures of 200–250 million
dollars per year
p  This would be 10% of the computing capacity of
the world!?!

28
Google data centers

p  Consider a non-fault-tolerant system with


1000 nodes
p  Each node has 99.9% availability (probability to
be up in a time unit), what is the availability of
the system?
n  All of them should be simultaneously up
p  Answer: 37%
n  (p of staying up)# of server = (0.999)1000
p  Calculate the number of servers failing per
minute for an installation of 1 million servers.
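The availability calculation, plus one answer to the exercise (assuming the time unit is a minute, so each server independently has a 0.1% chance of being down in any given minute; the slide leaves the unit open):

    print(0.999 ** 1000)        # ~0.368: the system is up only ~37% of the time
    print(1_000_000 * 0.001)    # ~1000 expected server failures per minute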

Sec. 4.4

Distributed indexing

p  Maintain a master machine directing the


indexing job – considered “safe”
p  Break up indexing into sets of (parallel) tasks
p  Master machine assigns each task to an idle
machine from a pool.

30
Sec. 4.4

Parallel tasks

p  We will use two sets of parallel tasks


n  Parsers
n  Inverters
p  Break the input document collection into splits

Document collection

split split split split split

p  Each split is a subset of documents


(corresponding to blocks in BSBI/SPIMI)
31
Sec. 4.4

Parsers

p  Master assigns a split to an idle parser machine


p  Parser reads a document at a time and emits
(term-id, doc-id) pairs
p  Parser writes pairs into j partitions
p  Each partition is for a range of terms’ first letters
n  (e.g., a-f, g-p, q-z) – here j = 3.
p  Now to complete the index inversion …

split parser a-f g-p q-z


3 partitions 32
Sec. 4.4

Inverters

p  An inverter collects all (term-id, doc-id) pairs


(= postings) for one term-partition (from the
different segments produced by the parsers)
p  Sorts and writes to postings lists.

a-f

a-f inverter a-f


postings
...

a-f 33
Sec. 4.4

Data flow: MapReduce

[Figure: MapReduce data flow – the master assigns splits to parsers and partitions to inverters; in the map phase each parser writes segment files partitioned into a-f, g-p, q-z; in the reduce phase each inverter reads one term partition from all segment files and writes its postings]
Sec. 4.4

MapReduce

p  The index construction algorithm we just described is


an instance of MapReduce
p  MapReduce (Dean and Ghemawat 2004) is a robust
and conceptually simple framework for distributed
computing …
n  … without having to write code for the distribution
part
p  Solve large computing problems on cheap commodity
machines or nodes that are built from standard parts
(processor, memory, disk) as opposed to on a
supercomputer with specialized hardware
p  They describe the Google indexing system (ca. 2002)
as consisting of a number of phases, each
implemented in MapReduce. 35
Sec. 4.4

MapReduce

p  Index construction was just one phase


p  Another phase (not shown here): transforming a
term-partitioned index into a document-
partitioned index (for query processing)
n  Term-partitioned: one machine handles a
subrange of terms
n  Document-partitioned: one machine handles a
subrange of documents
p  As we will discuss in the web part of the course -
most search engines use a document-
partitioned index … better load balancing, etc.
36
Sec. 4.4
Schema for index construction in MapReduce

p  Schema of map and reduce functions


p  map: input → list(k, v) reduce: (k,list(v)) → output
p  Instantiation of the schema for index construction
p  map: web collection → list(termID, docID)
p  reduce: (<termID1, list(docID)>, <termID2, list(docID)>,
…) → (postings list1, postings list2, …)
p  Example for index construction
p  map: (d2 : "C died.", d1 : "C came, C c’ed.") → (<C, d2>,
<died,d2>, <C,d1>, <came,d1>, <C,d1>, <c’ed, d1>
p  reduce: (<C,(d2,d1,d1)>, <died,(d2)>, <came,(d1)>,
<c’ed,(d1)>) → (<C,(d1:2,d2:1)>, <died,(d2:1)>,
<came,(d1:1)>, <c’ed,(d1:1)>)

(For readability, terms are written instead of term-ids; a runnable toy version follows below.)
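A toy run of this schema in Python (map_fn/reduce_fn are illustrative names; naive whitespace tokenization keeps punctuation attached, and the shuffle step that groups values by key is written out explicitly):

    from collections import defaultdict

    def map_fn(doc_id, text):
        return [(term, doc_id) for term in text.split()]

    def reduce_fn(term, doc_ids):
        freq = defaultdict(int)
        for d in doc_ids:
            freq[d] += 1                      # term frequency per document
        return term, sorted(freq.items())

    pairs = map_fn("d2", "C died.") + map_fn("d1", "C came, C c'ed.")
    grouped = defaultdict(list)
    for term, d in pairs:                     # "shuffle": group values by key
        grouped[term].append(d)
    print([reduce_fn(t, ds) for t, ds in grouped.items()])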


Sec. 4.5

Dynamic indexing

p  Up to now, we have assumed that collections are


static
p  They rarely are:
n  Documents come in over time and need to be
inserted
n  Documents are deleted and modified
p  This means that the dictionary and postings
lists have to be modified:
n  Postings updates for terms already in
dictionary
n  New terms added to dictionary.
38
Sec. 4.5

Simplest approach

p  Maintain big main index


p  New docs go into small auxiliary index
p  Search across both, merge results
p  Deletions
n  Invalidation bit-vector for deleted docs
n  Filter docs output on a search result by this
invalidation bit-vector
p  Periodically, re-index into one main index.

Sec. 4.5

Issues with main and auxiliary indexes

p  Problem of frequent merges – you touch stuff a lot


p  Poor performance during merge
p  Actually:
n  Merging of the auxiliary index into the main index
is efficient if we keep a separate file for each
postings list
n  Merge is the same as a simple append
n  But then we would need a lot of files – inefficient
for the OS
p  Assumption for the rest of the lecture: The index is
one big file
p  In reality: use a scheme somewhere in between (e.g.,
split very large postings lists, collect postings lists of
length 1 in one file etc.).
40
Sec. 4.5

Logarithmic merge

p  Maintain a series of indexes, each twice as


large as the previous one
p  Keep smallest (Z0) in memory
p  Larger ones (I0, I1, …) on disk
p  If Z0 gets too big (> n), write to disk as I0
p  or merge with I0 (if I0 already exists) as Z1
p  Either write merge Z1 to disk as I1 (if no I1)
p  Or merge with I1 to form Z2
p  etc.

Sec. 4.5

[Figure: logarithmic merging – the permanent indexes I0, I1, ... already stored on disk]
Sec. 4.5

Logarithmic merge

p  Auxiliary and main index: index construction


time is O(T2) as each posting is touched in each
merge
p  Logarithmic merge: Each posting is merged
O(log T) times, so complexity is O(T log T)
p  So logarithmic merge is much more efficient for
index construction
p  But query processing now requires the merging of
O(log T) indexes
n  Whereas it is O(1) if you just have a main and
auxiliary index
43
Sec. 4.5

Further issues with multiple indexes

p  Collection-wide statistics are hard to maintain


p  E.g., when we spoke of spell-correction: which of
several corrected alternatives do we present to
the user?
n  We said, pick the one with the most hits
p  How do we maintain the top ones with multiple
indexes and invalidation bit vectors?
n  One possibility: ignore everything but the main
index for such ordering
p  Will see more such statistics used in results
ranking.
44
Sec. 4.5

Dynamic indexing at search engines

p  All the large search engines now do dynamic


indexing
p  Their indices have frequent incremental changes
n  News items, blogs, new topical web pages
p  Grillo, Crimea, …
p  But (sometimes/typically) they also periodically
reconstruct the index from scratch
n  Query processing is then switched to the new
index, and the old index is then deleted

45
Google trends

http://www.google.com/trends/

Sec. 4.5

Other sorts of indexes

p  Positional indexes


n  Same sort of sorting problem … just larger Why?
p  Building character n-gram indexes:
n  As text is parsed, enumerate n-grams
n  For each n-gram, need pointers to all dictionary terms
containing it – the “postings”
n  Note that the same “postings entry” (i.e., terms) will
arise repeatedly in parsing the docs – need efficient
hashing to keep track of this
p  E.g., that the trigram uou occurs in the term
deciduous will be discovered on each text
occurrence of deciduous
p  Only need to process each term once.