ESSIR MapReduce For Indexing
Enrique Alfonseca
Manager, Natural Language Understanding
Google Research Zurich
Tutorial Overview
MapReduce Execution Overview
[Figure: MapReduce execution overview — the user program forks (1) a master and worker processes; the master assigns map tasks (2) over input splits 0–4 and assigns reduce tasks; map workers read their splits (3) and write intermediate results to local disk (4); reduce workers read those results remotely (5) and write the final output files 0 and 1 (6).]
Unique implementations:
● Mapper: 395, 1958, 4083
● Reducer: 269, 1208, 2418
● Case study
○ Higher DRAM errors observed in a new GMail cluster
○ Similar servers running GMail elsewhere were not affected
■ Same version of the software, kernel, firmware, etc.
○ Bad DRAM is the initial culprit
■ ... but that same DRAM model was fairly healthy elsewhere
○ Actual problem: bad motherboard batch
■ Poor electrical margin in some memory bus signals
■ GMail got more than its fair share of the bad batch
■ Analysis of this batch allocated to other services confirmed the
theory
Solution
● Mapper:
○ For every word in a document output (word, "1")
● Reducer:
○ Sum all occurrences of words and output (word, total_count)
Word Count Solution
● Can we do better?
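As a concrete sketch (Python, with a tiny in-memory driver that only simulates the shuffle; this is not the tutorial's own code), the mapper and reducer below implement the solution above, and the optional combiner is one answer to "can we do better?": it pre-sums counts on the map side so fewer (word, 1) pairs are shuffled.

# Minimal word-count sketch: map, optional combine, reduce.
from collections import defaultdict

def word_count_map(doc_id, text):
    # For every word in the document, output (word, 1).
    for word in text.split():
        yield word, 1

def word_count_combine(word, counts):
    # Optional map-side aggregation: collapse many (word, 1) pairs
    # into a single partial sum before the shuffle.
    yield word, sum(counts)

def word_count_reduce(word, counts):
    # Sum all occurrences of the word and output (word, total_count).
    yield word, sum(counts)

def run(documents):
    # Simulated execution: combine per document, group by key, then reduce.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        local = defaultdict(list)
        for word, count in word_count_map(doc_id, text):
            local[word].append(count)
        for word, counts in local.items():
            for w, partial in word_count_combine(word, counts):
                grouped[w].append(partial)
    return dict(kv for word, counts in grouped.items()
                for kv in word_count_reduce(word, counts))

print(run({"d1": "I saw the cat on the mat",
           "d2": "I saw the dog on the mat"}))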
Reducer 1:
Input: SSN → {City, 2007 Income}
Output: (SSN, [City, 2007 Income])
Mapper 2:
Input: SSN → [City, 2007 Income]
Output: (City, 2007 Income)
Reducer 2:
Input: City → 2007 Incomes
Output: (City, AVG(2007 Incomes))
Average Income in a City: Basic Solution
Mapper 1a: Mapper 1b:
Input: SSN → Personal Information Input: SSN → Annual Incomes
Output: (SSN, City) Output: (SSN, 2007 Income)
Reducer 1:
Input: SSN → {City, 2007 Income}
Output: (SSN, [City, 2007 Income])
Our Inputs are sorted
Reducer 2:
Input: City → 2007 Incomes
Output: (City, AVG(2007 Incomes))
Average Income in a City: Joined Solution
Mapper:
Input: SSN → Personal Information and Incomes
Output: (City, 2007 Income)
Reducer:
Input: City → 2007 Incomes
Output: (City, AVG(2007 Incomes))
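A minimal sketch of the joined solution in Python (the field names and the in-memory driver are illustrative assumptions, not the tutorial's code): because the two inputs are already joined and sorted by SSN, a single mapper emits (City, 2007 Income) directly and one reducer averages per city, instead of chaining two MapReduces as in the basic solution.

# Single-pass average income per city over joined records.
from collections import defaultdict

def income_map(ssn, record):
    # record is assumed to carry both personal info and income,
    # e.g. {"city": "Zurich", "income_2007": 70000} (illustrative fields).
    yield record["city"], record["income_2007"]

def income_reduce(city, incomes):
    yield city, sum(incomes) / len(incomes)

def run(joined_records):
    # Simulated shuffle: group mapper output by city, then reduce.
    grouped = defaultdict(list)
    for ssn, record in joined_records.items():
        for city, income in income_map(ssn, record):
            grouped[city].append(income)
    return {city: avg for c, incomes in grouped.items()
            for city, avg in income_reduce(c, incomes)}

print(run({"123-45-6789": {"city": "Zurich", "income_2007": 70000},
           "987-65-4321": {"city": "Zurich", "income_2007": 90000}}))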
Application Examples
1. Split the whole territory into "tiles" with fixed location IDs
2. Split each source image according to the tiles it covers
3. Merge the image pieces that cover each tile into a single image per tile
4. Serve the merged imagery data for each tile, so it can be loaded into and served from an image server farm.
Using Protocol Buffers
to Encode Structured Data
● Open sourced by Google, among many other projects:
http://code.google.com/p/protobuf/
● It supports C++, Java and Python.
● A way of encoding structured data in an efficient yet extensible
format. For example, we can define:
message Tile {
  required int64 location_id = 1;
  optional group Coverage = 9 {
    required double latitude = 2;
    required double longitude = 3;
    required double width = 4;   // in km
    required double length = 5;  // in km
  }
  required bytes image_data = 6;  // Bitmap image data
  required int64 timestamp = 7;
  optional float resolution = 8 [default = 10];
  optional string debug_info = 10;
}
Google uses Protocol Buffers for almost all its internal RPC
protocols, file formats and of course in MapReduce.
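A usage sketch (assuming the message above lives in a file tile.proto and has been compiled with protoc --python_out=. into a tile_pb2 module; the byte string is a placeholder):

# Serializing and parsing a Tile message in Python.
import time
import tile_pb2  # assumed generated module name: protoc --python_out=. tile.proto

tile = tile_pb2.Tile()
tile.location_id = 42
tile.image_data = b"..."          # placeholder for real bitmap bytes
tile.timestamp = int(time.time())
# 'resolution' is optional and reads back as its default (10) if never set.

data = tile.SerializeToString()   # compact binary encoding

parsed = tile_pb2.Tile()
parsed.ParseFromString(data)
print(parsed.location_id, parsed.resolution)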
Stitch Imagery Data Solution: Mapper and Reducer
map(String key, String value):
  // key: image file name
  // value: image data
  Tile whole_image;
  switch (file_type(key)):
    FROM_PROVIDER_A: Convert_A(value, &whole_image);
    FROM_PROVIDER_B: Convert_B(value, &whole_image);
    ...
  // Split whole_image into the tiles it covers and emit each piece,
  // keyed by the tile's location_id.

reduce(String key, Iterator values):
  // key: location_id of a tile
  // values: image pieces covering this tile
  Tile merged_tile;
  for each v in values:
    overlay pixels in v onto merged_tile based on v.coverage();
  Emit(key, ProtobufToString(merged_tile));
Tutorial Overview
● Data organization
● Programming model
● Execution model
● Target applications
● Assumed computing environment
● Overall operating cost
My Basket of Fruit
● Programming model spectrum: MPI is procedural, DBMS/SQL is declarative, and MapReduce sits in between.
● Data to be manipulated:
  ○ MapReduce: any <key, value> pairs (strings / protocol messages)
  ○ DBMS/SQL: tables with rich types
● Key selling point:
  ○ MPI: flexible to accommodate various applications
  ○ MapReduce: plow through large amounts of data with commodity hardware
  ○ DBMS/SQL: interactive querying of the data; maintain a consistent view across clients
● Target applications
○ Complex operations run frequently vs. one-time plow
○ Off-line processing vs. real-time serving
Hadoop Map-Reduce
● Open Source!
● Plus the whole equivalent package, and more
○ HDFS, Map-Reduce, Pig, Zookeeper, HBase, Hive
● Used by Yahoo!, Facebook, Amazon and Google-IBM NSF cluster
Dryad
● Proprietary, based on Microsoft SQL servers
● Dryad(EuroSys'07), DryadLINQ(OSDI'08)
● Michael's Dryad TechTalk@Google (Nov.'07)
And others
Tutorial Overview
Documents:
  http://www.cat.com: "I saw the cat on the mat"
  http://www.dog.com: "I saw the dog on the mat"
Inverted index (word → [URL, position]):
  I → (http://www.cat.com, 0), (http://www.dog.com, 0)
  cat → (http://www.cat.com, 3)
Solution:
● Mapper:
○ For every word in a document output (word, [URL, position])
● Reducer:
○ Aggregate all the information that we have about each word.
Inverted Index Solution
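A minimal sketch of this inverted-index job in Python (the in-memory driver only simulates the shuffle; it is not the tutorial's code), using the two example documents above:

# Inverted index: mapper emits (word, (url, position)), reducer aggregates postings.
from collections import defaultdict

def index_map(url, text):
    for position, word in enumerate(text.split()):
        yield word, (url, position)

def index_reduce(word, postings):
    # Aggregate all the information we have about this word into one posting list.
    yield word, sorted(postings)

def run(documents):
    grouped = defaultdict(list)
    for url, text in documents.items():
        for word, posting in index_map(url, text):
            grouped[word].append(posting)
    return {w: plist for word, postings in grouped.items()
            for w, plist in index_reduce(word, postings)}

index = run({"http://www.cat.com": "I saw the cat on the mat",
             "http://www.dog.com": "I saw the dog on the mat"})
print(index["I"])    # [('http://www.cat.com', 0), ('http://www.dog.com', 0)]
print(index["cat"])  # [('http://www.cat.com', 3)]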
(Source: http://en.wikipedia.org/wiki/PageRank)
PageRank computation
Algorithm:
● N = total number of web pages
● Matrix M is defined as:
○ M[i][j] is 0 if the j-th page has no links to the i-th page.
○ M[i][j] is the probability to move from page j to page i,
assuming the same probability for all outgoing links.
● Iterative algorithm:
R = (1 - d) · M · R + d/N
where d is the decay term
PageRank computation
[Figure: worked example from the Wikipedia PageRank article — a link matrix M over eleven pages A–K and the PageRank vector PR, starting from a uniform value of 0.09 per page and updated over three successive iterations.]
PageRank computation
● Matrix M is sparse:
○ We can store one <key, value> pair per row.
○ key = URL, value = URLs of outgoing links.
● Vector R:
○ one <key, value> pair per element.
● Matrix multiplication:
○ Join both sets (aggregate by key).
○ Multiply to produce each new value of R’ in the reduce
step.
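A sketch of one such iteration in Python (illustrative only, with a simulated in-memory shuffle; the decay value d and the toy three-page graph are assumptions): the mapper distributes each page's current rank over its outgoing links and passes the link structure through, and the reducer combines the incoming contributions following R = (1 - d) · M · R + d/N.

# One PageRank iteration expressed as a map and a reduce step.
from collections import defaultdict

d = 0.15  # decay term (assumed value)

def pagerank_map(url, value):
    rank, outlinks = value          # joined input: current rank + outgoing links
    yield url, ("links", outlinks)  # pass the graph structure through
    for target in outlinks:
        # Each outgoing link receives an equal share of this page's rank.
        yield target, ("rank", rank / len(outlinks))

def pagerank_reduce(url, values, n_pages):
    outlinks, incoming = [], 0.0
    for kind, payload in values:
        if kind == "links":
            outlinks = payload
        else:
            incoming += payload
    new_rank = (1 - d) * incoming + d / n_pages
    yield url, (new_rank, outlinks)

def iterate(graph):
    # graph: url -> (rank, [outgoing urls]); simulated shuffle, then reduce.
    grouped = defaultdict(list)
    for url, value in graph.items():
        for key, v in pagerank_map(url, value):
            grouped[key].append(v)
    return {u: rv for key, vals in grouped.items()
            for u, rv in pagerank_reduce(key, vals, len(graph))}

g = {"A": (1 / 3, ["B", "C"]), "B": (1 / 3, ["C"]), "C": (1 / 3, ["A"])}
print(iterate(g))  # feed the result back in for the next iteration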
Assumed Computing Environment
● Data centers
○ Google-specific mechanical, thermal and electrical design
○ Highly-customized PC-class motherboards
○ Running Linux
○ In-house management & application software
Sharing is the Way of Life
+ Batch processing
(MapReduce, Sawzall)
Major Challenges
To organize the world’s information and make it universally
accessible and useful.
● Failure handling
○ Bad apples appear now and then
● Scalability
○ Fast growing dataset
○ Broad extension of Google services
● Performance and utilization
○ Minimizing run-time for individual jobs
○ Maximizing throughput across all services
● Usability
○ Troubleshooting
○ Performance tuning
○ Production monitoring
Failures in Literature
● Failure types
○ Permanent
○ Transient
Different Failures Require Different Actions
● Transient failures
○ You'd want your job to adjust and finish when
issues resolve
● "It's-Not-My-Fault" failures
MapReduce: Task Failure
[Figures: the MapReduce execution diagram repeated several times, annotated to illustrate task failure and related failure scenarios.]
● Files in GFS
○ Divided into chunks (default 64MB)
○ Stored with replication (typically r = 3)
○ Reading from local disk is much faster and cheaper
than reading from a remote server
Implications on scalability
● Master has to make O(M+R) decisions
● System has to keep O(M*R) metadata for distributing
map output to reducers
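For a rough sense of scale (the task counts below are illustrative assumptions for the arithmetic, not figures from this tutorial):

# Illustrative only: how the master's bookkeeping grows with the task counts.
M = 200_000  # number of map tasks (assumed)
R = 5_000    # number of reduce tasks (assumed)
print("scheduling decisions, O(M + R):", M + R)                  # 205,000
print("map-output/reducer metadata entries, O(M * R):", M * R)   # 1,000,000,000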
[Figure: map and reduce workers with heavily skewed reduce inputs — a few keys account for tens of gigabytes each, e.g. cgi.ebay.com 58.2 GB, profile.myspace.com 56.3 GB, yellowpages.superpages.com 49.6 GB, www.amazon.co.uk 41.7 GB.]
[Figure: map workers feeding a sorter and reducer, which writes output file 0; a second reduce worker writes output file 1.]
Dealing with Reduce Stragglers
Technique 1:
Create a backup instance as early as possible, whenever one becomes necessary.
[Figure: the original reducer (fed by the map workers and sorter) and a backup instance R' both working toward output file 0.]
Steal Reduce Input for Backups
Technique 2:
Retrieving map output and sorting are expensive, but we
can transport the sorted input to the backup reducer
[Figure: the sorted reduce input is handed over from the original reducer to the backup instance R', which writes output file 0.]
Reduce Task Splitting
Technique 3:
Divide a reduce task into smaller ones to take advantage of
more parallelism.
[Figure: a reduce task split into smaller backup tasks R', producing output files 0.0, 0.1 and 0.2 in parallel.]
Tutorial Overview
Examples:
● num_map_output_records == num_reduce_input_records
● CPU time spent in Map() and Reduce() functions
MapReduce Development inside Google