Lecture 3 Distributed and Dynamic Indexing

The document discusses distributed indexing techniques used by large web search engines. It describes how a master machine assigns parsing and indexing tasks to many machines in parallel to process large document collections. The MapReduce framework is used to coordinate this distributed indexing work.

Uploaded by

Asma MSCS 2022 FAST NU LHR


Distributed Index

Sec. 4.4

Distributed indexing
• For web-scale indexing (don’t try this at home!): must use a distributed computing cluster
• Individual machines are fault-prone
– Can unpredictably slow down or fail
• How do we exploit such a pool of machines?


Web search engine data centers

• Web search data centers (Google, Bing, Baidu) mainly contain commodity machines.
• Data centers are distributed around the world.
• Estimate: Google ~1 million servers, 3 million processors/cores (Gartner 2007)


Distributed indexing
• Maintain a master machine directing the indexing job.
• Break up indexing into sets of (parallel) tasks.
• Master machine assigns each task to an idle machine from a pool.
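
The assignment step can be sketched on a single computer, with worker threads standing in for the pool of machines. This is only an illustration under that assumption: `run_master`, `worker_fn`, and `pool_size` are hypothetical names, and the sketch ignores the slow or failing machines that a real cluster must handle.

```python
from concurrent.futures import ThreadPoolExecutor


def run_master(tasks, worker_fn, pool_size=4):
    """Hand each task to the next idle worker and return results in task order.

    Threads stand in for machines here (an assumption of this sketch).
    """
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # map() queues every task; a worker picks up the next one when it is idle.
        return list(pool.map(worker_fn, tasks))


# Hypothetical splits: each task is a subset of document names.
splits = [["doc1", "doc2"], ["doc3"], ["doc4", "doc5", "doc6"]]
sizes = run_master(splits, len)
print(sizes)
```

A real master would also re-assign a task when its machine fails or stalls, which this sketch leaves out.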


Parallel tasks
• We will use two sets of parallel tasks
– Parsers
– Inverters
• Break the input document collection into splits
• Each split is a subset of documents


Parsers
• Master assigns a split to an idle parser machine
• Parser reads a document at a time and emits (term, doc) pairs
• Parser writes pairs into j partitions
• Each partition is for a range of terms’ first letters
– (e.g., a-f, g-p, q-z) – here j = 3.
• Now to complete the index inversion
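
A minimal sketch of one parser, assuming the three first-letter ranges from the example above and a very crude tokenizer (lowercase runs of a-z). The names `PARTITIONS`, `partition_of`, and `parse_split` are hypothetical, and in-memory lists stand in for the segment files a parser would write.

```python
import re
from collections import defaultdict

# First-letter ranges from the slide's example, so j = 3.
PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]


def partition_of(term):
    """Return the index of the partition whose letter range covers the term."""
    first = term[0]
    for i, (lo, hi) in enumerate(PARTITIONS):
        if lo <= first <= hi:
            return i
    raise ValueError(f"no partition for term {term!r}")


def parse_split(split):
    """Read (doc_id, text) pairs one document at a time and emit (term, doc)
    pairs, grouped by partition index."""
    segments = defaultdict(list)
    for doc_id, text in split:
        # Assumed tokenizer: lowercase alphabetic runs only.
        for term in re.findall(r"[a-z]+", text.lower()):
            segments[partition_of(term)].append((term, doc_id))
    return dict(segments)


split = [(1, "Caesar was killed"), (2, "Brutus killed Caesar")]
segments = parse_split(split)
print(segments[0])  # pairs for the a-f partition
```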

Inverters
• An inverter collects all (term, doc) pairs (= postings) for one term-partition.
• Sorts and writes to postings lists
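
A minimal sketch of one inverter, assuming the pairs for its term partition have already been gathered from all parsers into one list. The name `invert` is hypothetical, and a dictionary stands in for the postings lists written to disk.

```python
def invert(pairs):
    """Sort the (term, doc) pairs of one term partition and build postings lists.

    Returns {term: sorted list of doc ids}; duplicate pairs are collapsed.
    """
    postings = {}
    for term, doc_id in sorted(set(pairs)):
        postings.setdefault(term, []).append(doc_id)
    return postings


# Pairs for the a-f partition, as they might arrive from two parsers.
pairs_a_f = [("caesar", 1), ("brutus", 2), ("caesar", 2), ("caesar", 1)]
print(invert(pairs_a_f))
```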


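Data flow
[Figure: the master assigns splits to parsers and term partitions to inverters. Map phase: each parser writes segment files partitioned by term range (a-f, g-p, q-z). Reduce phase: one inverter per partition (a-f, g-p, q-z) reads its segment files and writes postings.]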

MapReduce
• The index construction algorithm we just described is an instance of MapReduce.
• MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing … without having to write code for the distribution part.
• They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce.
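
To make the correspondence concrete, here is a single-process sketch of the map/group/reduce pattern with index construction plugged in. It is not the framework's actual API: `map_reduce`, `map_doc`, and `reduce_postings` are hypothetical names, and a real framework would run the map and reduce calls on many machines.

```python
import re
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Run map_fn on every input, group emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


def map_doc(doc):
    """Map step (the parser's role): emit a (term, doc_id) pair per token."""
    doc_id, text = doc
    return [(term, doc_id) for term in re.findall(r"[a-z]+", text.lower())]


def reduce_postings(term, doc_ids):
    """Reduce step (the inverter's role): one sorted postings list per term."""
    return sorted(set(doc_ids))


docs = [(1, "new home sales"), (2, "home sales rise")]
index = map_reduce(docs, map_doc, reduce_postings)
print(index)
```

The user supplies only `map_doc` and `reduce_postings`; grouping by key is the part a framework takes over, which is the "distribution part" mentioned above.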

Sec. 4.5

Dynamic indexing
• Up to now, we have assumed that collections are static.
• They rarely are:
– Documents come in over time and need to be inserted.
– Documents are deleted and modified.
• This means that the dictionary and postings lists have to be modified:
– Postings updates for terms already in dictionary
– New terms added to dictionary


Simplest approach
• Maintain “big” main index
• New docs go into “small” auxiliary index
• Search across both, merge results
• Deletions
– Invalidation bit-vector for deleted docs
– Filter the docs returned by a search through this invalidation bit-vector
• Periodically, re-index into one main index
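
A minimal in-memory sketch of this scheme. The class and method names are hypothetical, dictionaries stand in for on-disk indexes, and a Python integer used as a bitmask stands in for the invalidation bit-vector.

```python
class DynamicIndex:
    """Big main index + small auxiliary index + invalidation bit-vector."""

    def __init__(self, main=None):
        self.main = {t: list(p) for t, p in (main or {}).items()}
        self.aux = {}
        self.invalid = 0  # bit d is set when document d has been deleted

    def add(self, doc_id, terms):
        """New documents go into the auxiliary index only."""
        for term in set(terms):
            self.aux.setdefault(term, []).append(doc_id)

    def delete(self, doc_id):
        """Mark the document as deleted; postings are left untouched."""
        self.invalid |= 1 << doc_id

    def _is_live(self, doc_id):
        return not (self.invalid >> doc_id) & 1

    def search(self, term):
        """Search both indexes, merge the results, filter out deleted docs."""
        hits = set(self.main.get(term, [])) | set(self.aux.get(term, []))
        return sorted(d for d in hits if self._is_live(d))

    def reindex(self):
        """Periodically fold everything into one main index."""
        merged = {}
        for index in (self.main, self.aux):
            for term, docs in index.items():
                merged.setdefault(term, set()).update(docs)
        self.main = {}
        for term, docs in merged.items():
            live = sorted(d for d in docs if self._is_live(d))
            if live:
                self.main[term] = live
        self.aux = {}
        self.invalid = 0  # deleted docs are physically gone now


idx = DynamicIndex({"home": [1, 2]})
idx.add(3, ["home", "sales"])
idx.delete(2)
print(idx.search("home"))
```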

Merging main and auxiliary indexes

• Merging of the auxiliary index into the main index is efficient if we keep a separate file for each postings list.
• Merge is the same as a simple append.
• But then we would need a lot of files – separate file for each word, inefficient for OS.
• In reality: Use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.)
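
The one-file-per-postings-list case can be sketched as follows. The assumptions are labeled: the file layout (`<term>.postings`, one doc id per line) and the function names are hypothetical, terms are assumed to be safe as file names, and new documents are assumed to receive larger doc ids than existing ones so that appending keeps each list sorted.

```python
import os
import tempfile


def append_postings(index_dir, aux_index):
    """Merge an auxiliary index into the main index by appending to per-term files."""
    for term, doc_ids in aux_index.items():
        path = os.path.join(index_dir, term + ".postings")
        # Opening in append mode is the whole merge: no existing data is rewritten.
        with open(path, "a", encoding="utf-8") as f:
            for doc_id in doc_ids:
                f.write(f"{doc_id}\n")


def read_postings(index_dir, term):
    """Read one term's postings list back from its file."""
    path = os.path.join(index_dir, term + ".postings")
    with open(path, encoding="utf-8") as f:
        return [int(line) for line in f]


with tempfile.TemporaryDirectory() as index_dir:
    append_postings(index_dir, {"home": [1, 2], "sales": [1]})  # main index
    append_postings(index_dir, {"home": [5], "rise": [5]})      # auxiliary index
    print(read_postings(index_dir, "home"))
```

Even this toy version creates one file per distinct term, which is the cost the last two bullets above point to.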


Dynamic indexing at search engines

• All the large search engines now do dynamic indexing
• Their indices have frequent incremental changes
– News items, blogs, new topical web pages
• But (sometimes/typically) they also periodically reconstruct the index from scratch
– Query processing is then switched to the new index, and the old index is deleted
