Lecture 3 Distributed and Dynamic Indexing
Sec. 4.4
Distributed indexing
• For web-scale indexing (don’t try this at home!),
we must use a distributed computing cluster
• Individual machines are fault-prone
– Can unpredictably slow down or fail
• How do we exploit such a pool of machines?
Distributed indexing
• Maintain a master machine directing the
indexing job.
• Break up indexing into sets of (parallel) tasks.
• Master machine assigns each task to an idle
machine from a pool.
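
The master/pool idea above can be sketched on a single machine. This is only an illustration under stated assumptions: threads stand in for the pool of machines, `run_tasks` and `count_tokens` are hypothetical names, and nothing here handles the slow or failed machines mentioned earlier.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tasks(tasks, worker, pool_size=3):
    """Master: hand each task to an idle worker and collect results in task order."""
    # Threads are an assumption standing in for the machines of a real cluster.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(worker, tasks))


def count_tokens(split):
    """Hypothetical task: count the tokens in one split of documents."""
    return sum(len(text.split()) for text in split)


splits = [["a b c", "d e"], ["f"], ["g h", "i j k l"]]
print(run_tasks(splits, count_tokens))  # [5, 1, 6]
```

A real master would also track which tasks are outstanding and reassign the work of a machine that stops responding.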
Parallel tasks
• We will use two sets of parallel tasks
– Parsers
– Inverters
• Break the input document collection into
splits
• Each split is a subset of documents
Parsers
• Master assigns a split to an idle parser
machine
• Parser reads a document at a time and emits
(term, doc) pairs
• Parser writes pairs into j partitions
• Each partition covers a range of terms’ first
letters
– e.g., a-f, g-p, q-z (here j = 3)
• Next step: complete the index inversion (the inverters' job)
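
A minimal sketch of the parser step, assuming documents arrive as (doc_id, text) tuples and that terms are lowercase runs of the letters a-z. The names `PARTITIONS`, `partition_of`, and `parse_split` are illustrative, not from the lecture.

```python
import re

# Assumed term ranges by first letter, matching the example above (j = 3).
PARTITIONS = ("a-f", "g-p", "q-z")


def partition_of(term):
    """Return the partition label whose letter range contains the term's first letter."""
    first = term[0]
    for label in PARTITIONS:
        low, high = label.split("-")
        if low <= first <= high:
            return label
    raise ValueError(f"no partition for term {term!r}")


def parse_split(split):
    """Parser: read one document at a time and write (term, doc_id) pairs into j partitions."""
    segments = {label: [] for label in PARTITIONS}
    for doc_id, text in split:
        for term in re.findall(r"[a-z]+", text.lower()):
            segments[partition_of(term)].append((term, doc_id))
    return segments


split = [(1, "Caesar was killed"), (2, "Brutus killed Caesar")]
segments = parse_split(split)
print(segments["a-f"])  # [('caesar', 1), ('brutus', 2), ('caesar', 2)]
```

In a cluster, each list in `segments` would be written to a local segment file rather than kept in memory.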
Inverters
• An inverter collects all (term, doc) pairs (=
postings) for one term partition.
• It sorts the pairs and writes them to postings lists
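
The inverter step can be sketched as follows, assuming the pairs of one term partition fit in memory. The function name `invert` and the sample pairs are illustrative.

```python
from collections import defaultdict


def invert(pairs):
    """Inverter: sort the (term, doc_id) pairs of one partition and build postings lists."""
    postings = defaultdict(list)
    # set() drops repeated pairs for the same term and document before sorting.
    for term, doc_id in sorted(set(pairs)):
        postings[term].append(doc_id)
    return dict(postings)


# Pairs for the a-f partition, as they might arrive from several parsers.
pairs = [("caesar", 2), ("brutus", 2), ("caesar", 1), ("caesar", 2)]
print(invert(pairs))  # {'brutus': [2], 'caesar': [1, 2]}
```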
Data flow
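[Figure: the master assigns splits to parsers (map phase). Each parser writes segment files partitioned by term range (a-f, g-p, q-z). The master assigns each term partition to an inverter (reduce phase), which writes the postings for that partition.]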
MapReduce
• The index construction algorithm we just described is an
instance of MapReduce.
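
To make the correspondence concrete, here is a toy, single-process sketch of the map / group-by-key / reduce pattern with index construction plugged in. The helper `map_reduce` and the two functions passed to it are hypothetical names, not the API of any real MapReduce framework.

```python
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Apply map_fn to each input, group emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


def map_doc(doc):
    """Map phase (parser): emit one (term, doc_id) pair per token."""
    doc_id, text = doc
    return [(term, doc_id) for term in text.lower().split()]


def reduce_postings(term, doc_ids):
    """Reduce phase (inverter): turn the doc IDs for one term into a sorted postings list."""
    return sorted(set(doc_ids))


docs = [(1, "caesar came"), (2, "caesar died")]
index = map_reduce(docs, map_doc, reduce_postings)
print(index)  # {'caesar': [1, 2], 'came': [1], 'died': [2]}
```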
Sec. 4.5
Dynamic indexing
• Up to now, we have assumed that collections are
static.
• They rarely are:
– Documents come in over time and need to be
inserted.
– Documents are deleted and modified.
• This means that the dictionary and postings lists
have to be modified:
– Postings updates for terms already in dictionary
– New terms added to dictionary
Simplest approach
• Maintain “big” main index
• New docs go into “small” auxiliary index
• Search across both, merge results
• Deletions
– Invalidation bit-vector for deleted docs
– Filter the docs returned by a search through this
invalidation bit-vector
• Periodically, re-index into one main index
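
The scheme above can be sketched in memory as follows. This is an illustration under assumptions: doc IDs are small non-negative integers that index the bit vector, the class name `DynamicIndex` and its methods are hypothetical, and a real system would keep the main index on disk.

```python
class DynamicIndex:
    """Big main index plus small auxiliary index, with an invalidation bit vector."""

    def __init__(self, main_postings, num_docs):
        self.main = main_postings          # term -> sorted list of doc IDs
        self.aux = {}                      # postings for newly added docs
        self.deleted = [False] * num_docs  # invalidation bit vector, indexed by doc ID

    def add(self, doc_id, text):
        """New documents go into the auxiliary index only."""
        while len(self.deleted) <= doc_id:
            self.deleted.append(False)
        for term in set(text.lower().split()):
            self.aux.setdefault(term, []).append(doc_id)

    def delete(self, doc_id):
        """Deletion only sets a bit; postings are untouched until the next merge."""
        self.deleted[doc_id] = True

    def search(self, term):
        """Search both indexes, merge the results, and filter out deleted docs."""
        hits = set(self.main.get(term, [])) | set(self.aux.get(term, []))
        return sorted(d for d in hits if not self.deleted[d])

    def merge(self):
        """Periodic re-index: fold the auxiliary index into one main index."""
        merged = {}
        for term in set(self.main) | set(self.aux):
            docs = set(self.main.get(term, [])) | set(self.aux.get(term, []))
            live = sorted(d for d in docs if not self.deleted[d])
            if live:
                merged[term] = live
        self.main, self.aux = merged, {}


index = DynamicIndex({"caesar": [0, 1], "brutus": [1]}, num_docs=2)
index.add(2, "Caesar returns")
index.delete(0)
print(index.search("caesar"))  # [1, 2]
index.merge()
print(index.main["caesar"], index.aux)  # [1, 2] {}
```

Modified documents can be handled in this sketch as a delete followed by an add under a new doc ID.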