Lecture 3 Distributed and Dynamic Indexing

The document discusses distributed indexing techniques used by large web search engines. It describes how a master machine assigns parsing and indexing tasks to many machines in parallel to process large document collections. The MapReduce framework is used to coordinate this distributed indexing work.
Distributed Index

Sec. 4.4

Distributed indexing
• For web-scale indexing (don’t try this at home!), we
must use a distributed computing cluster
• Individual machines are fault-prone
– Can unpredictably slow down or fail
• How do we exploit such a pool of machines?


Web search engine data centers


• Web search data centers (Google, Bing, Baidu)
mainly contain commodity machines.
• Data centers are distributed around the world.
• Estimate: Google ~1 million servers, 3 million
processors/cores (Gartner 2007)


Distributed indexing
• Maintain a master machine directing the
indexing job.
• Break up indexing into sets of (parallel) tasks.
• Master machine assigns each task to an idle
machine from a pool.
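
The master's role can be sketched as follows. This is a minimal single-process illustration, assuming worker machines are simulated by threads; `run_master` and `index_split` are hypothetical names, and a real cluster would also have to reassign tasks from machines that slow down or fail.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def index_split(split):
    """Hypothetical task: count the tokens in one split of documents."""
    return sum(len(text.split()) for text in split)

def run_master(splits, num_workers=3):
    """Hand each split to an idle worker and gather the results by split number."""
    results = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # submit() queues a task; the pool passes it to the next idle worker thread.
        futures = {pool.submit(index_split, s): i for i, s in enumerate(splits)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results

splits = [["to be or not to be"], ["new home sales", "home prices rise"]]
task_results = run_master(splits)
```

The thread pool stands in for the pool of machines: the master never picks a specific worker, it only hands out tasks as workers become free.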


Parallel tasks
• We will use two sets of parallel tasks
– Parsers
– Inverters
• Break the input document collection into
splits
• Each split is a subset of documents


Parsers
• Master assigns a split to an idle parser
machine
• Parser reads a document at a time and emits
(term, doc) pairs
• Parser writes pairs into j partitions
• Each partition is for a range of terms’ first
letters
– (e.g., a-f, g-p, q-z) – here j = 3.
• Now to complete the index inversion
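
A parser along these lines might look like the sketch below. It assumes a split is a list of (doc_id, text) pairs and uses a simple lowercase-letters tokenizer; `parse_split` and `partition_of` are hypothetical names, and the three term ranges are the ones from the example above.

```python
import re
from collections import defaultdict

# Term ranges from the slide example: a-f, g-p, q-z (j = 3).
PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]

def partition_of(term):
    """Return the index of the partition whose range covers the term's first letter."""
    for i, (lo, hi) in enumerate(PARTITIONS):
        if lo <= term[0] <= hi:
            return i
    raise ValueError(f"no partition for term {term!r}")

def parse_split(split):
    """Read one document at a time and write (term, doc_id) pairs into j partitions."""
    segments = defaultdict(list)
    for doc_id, text in split:
        # Assumed tokenizer: runs of ASCII letters, lowercased.
        for term in re.findall(r"[a-z]+", text.lower()):
            segments[partition_of(term)].append((term, doc_id))
    return segments

segments = parse_split([(1, "Caesar was killed"), (2, "Brutus killed Caesar")])
```

Each value in `segments` plays the role of one segment file, ready to be handed to the inverter for that term range.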

Inverters
• An inverter collects all (term,doc) pairs (=
postings) for one term-partition.
• Sorts and writes to postings lists
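
The inverter step can be sketched in a few lines. This is an in-memory illustration, assuming the pairs for one term partition fit in memory; `invert` is a hypothetical name, and removing duplicate pairs is an added assumption rather than something the slide states.

```python
from collections import defaultdict

def invert(pairs):
    """Sort the (term, doc_id) pairs of one term partition and build postings lists."""
    postings = defaultdict(list)
    # Sorting by (term, doc_id) groups each term's postings in doc_id order.
    for term, doc_id in sorted(set(pairs)):
        postings[term].append(doc_id)
    return dict(postings)

# Pairs for the a-f partition, as several parsers might have written them.
pairs_a_f = [("caesar", 1), ("brutus", 2), ("caesar", 2), ("caesar", 1)]
postings_a_f = invert(pairs_a_f)
```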


Data flow
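[Figure: the master assigns splits to parsers and term partitions to inverters. Map phase: each parser reads a split and writes segment files partitioned by term range (a-f, g-p, q-z). Reduce phase: one inverter per term range (a-f, g-p, q-z) collects its segment files and writes the postings.]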

MapReduce
• The index construction algorithm we just described is an
instance of MapReduce.

• MapReduce (Dean and Ghemawat 2004) is a robust and
conceptually simple framework for distributed computing …
without having to write code for the distribution part.

• They describe the Google indexing system (ca. 2002) as
consisting of a number of phases, each implemented in
MapReduce.
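
To illustrate the division of labour, here is a single-process sketch in which the user supplies only a map function and a reduce function. The names `map_fn`, `reduce_fn` and `map_reduce` are hypothetical and do not correspond to any real MapReduce API; the grouping loop stands in for the distribution work a real framework would do across machines.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map phase: turn one document into a list of (term, doc_id) pairs."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_fn(term, doc_ids):
    """Reduce phase: turn all doc_ids seen for one term into a sorted postings list."""
    return term, sorted(set(doc_ids))

def map_reduce(docs, map_fn, reduce_fn):
    """Stand-in for the framework: run map, group values by key, run reduce."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_fn(doc_id, text):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))

docs = {1: "caesar came", 2: "caesar conquered"}
index = map_reduce(docs, map_fn, reduce_fn)
```

In this sketch `map_fn` corresponds to the parser and `reduce_fn` to the inverter; everything inside `map_reduce` is what the framework would handle.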

Sec. 4.5

Dynamic indexing
• Up to now, we have assumed that collections are
static.
• They rarely are:
– Documents come in over time and need to be
inserted.
– Documents are deleted and modified.
• This means that the dictionary and postings lists
have to be modified:
– Postings updates for terms already in dictionary
– New terms added to dictionary


Simplest approach
• Maintain “big” main index
• New docs go into “small” auxiliary index
• Search across both, merge results
• Deletions
– Invalidation bit-vector for deleted docs
– Filter the docs returned by a search through this
invalidation bit-vector
• Periodically, re-index into one main index
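
The scheme above can be sketched in memory as follows. This is a toy illustration under several assumptions: doc IDs are assigned in increasing order, whitespace tokenization is used, and a `bytearray` stands in for the invalidation bit-vector. The class name `DynamicIndex` and its methods are hypothetical.

```python
class DynamicIndex:
    """Toy main index plus auxiliary index with an invalidation vector."""

    def __init__(self):
        self.main = {}               # term -> doc_ids in the "big" main index
        self.aux = {}                # term -> doc_ids in the "small" auxiliary index
        self.invalid = bytearray()   # invalid[d] == 1 marks doc d as deleted

    def add(self, text):
        """New docs go into the auxiliary index only."""
        doc_id = len(self.invalid)
        self.invalid.append(0)
        for term in set(text.lower().split()):
            self.aux.setdefault(term, []).append(doc_id)
        return doc_id

    def delete(self, doc_id):
        """Deletion only flips a bit; postings are left in place."""
        self.invalid[doc_id] = 1

    def search(self, term):
        """Search both indexes, merge the results, then filter out deleted docs."""
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return [d for d in hits if not self.invalid[d]]

    def reindex(self):
        """Rebuild one main index from both indexes, dropping deleted docs."""
        merged = {}
        for index in (self.main, self.aux):
            for term, doc_ids in index.items():
                live = [d for d in doc_ids if not self.invalid[d]]
                if live:
                    merged.setdefault(term, []).extend(live)
        self.main, self.aux = merged, {}

idx = DynamicIndex()
first = idx.add("new home sales")
second = idx.add("home prices rise")
before_delete = idx.search("home")
idx.delete(first)
after_delete = idx.search("home")
idx.reindex()
```

Because auxiliary docs always have larger IDs than docs already in the main index, concatenating the two postings lists keeps them in doc ID order.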

Merging main and auxiliary indexes


• Merging of the auxiliary index into the main index is efficient
if we keep a separate file for each postings list.
• Merge is the same as a simple append.
• But then we would need a lot of files – separate file for each
word, inefficient for OS.
• In reality: Use a scheme somewhere in between (e.g., split
very large postings lists, collect postings lists of length 1 in one
file etc.)
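
The one-file-per-postings-list case can be sketched as below. It assumes every term is a safe file name and that docs in the auxiliary index have larger IDs than all docs already in the main index, so appending preserves doc ID order. The function names and the `.postings` file suffix are hypothetical.

```python
import os
import tempfile

def append_postings(index_dir, aux_index):
    """Merge an auxiliary index into the main index by appending to per-term files."""
    for term, doc_ids in aux_index.items():
        path = os.path.join(index_dir, term + ".postings")
        # Mode "a" creates the file for a new term and appends for an existing one.
        with open(path, "a", encoding="utf-8") as f:
            for doc_id in doc_ids:
                f.write(f"{doc_id}\n")

def read_postings(index_dir, term):
    """Read one term's postings list back from its file."""
    path = os.path.join(index_dir, term + ".postings")
    with open(path, encoding="utf-8") as f:
        return [int(line) for line in f]

index_dir = tempfile.mkdtemp()
append_postings(index_dir, {"home": [1, 4], "sales": [1]})   # initial main index
append_postings(index_dir, {"home": [7], "prices": [7]})     # auxiliary index
home_postings = read_postings(index_dir, "home")
```

The cost of this layout shows up in the directory itself: it holds one file per term, which is the inefficiency the slide points out.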


Dynamic indexing at search engines


• All the large search engines now do dynamic
indexing
• Their indices have frequent incremental
changes
– News items, blogs, new topical web pages
• But (sometimes/typically) they also
periodically reconstruct the index from scratch
– Query processing is then switched to the new
index, and the old index is deleted
