MapReduce Patterns, Algorithms, and Use Cases - Highly Scalable Blog
(Figure: https://highlyscalable.files.wordpress.com/2012/02/map-reduce.png)
MapReduce Framework
Counting and Summing
Problem Statement: There are a number of documents, where each document is a set of terms. It is required to calculate the total number
of occurrences of each term across all documents. Alternatively, it can be an arbitrary function over the terms. For instance, given a log file
where each record contains a response time, it is required to calculate the average response time.
Solution:
Let's start with something really simple. The code snippet below shows a Mapper that simply emits "1" for each term it processes and a
Reducer that goes through the lists of ones and sums them up:
class Mapper
   method Map(docid id, doc d)
      for all term t in doc d do
         Emit(term t, count 1)

class Reducer
   method Reduce(term t, counts [c1, c2, ...])
      sum = 0
      for all count c in counts [c1, c2, ...] do
         sum = sum + c
      Emit(term t, count sum)
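As a minimal sketch, this pattern can be simulated in plain Python; the `shuffle` function below stands in for the framework's shuffle phase, and all names are illustrative:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Mapper: emit a (term, 1) pair for every term occurrence."""
    for doc in documents:
        for term in doc.split():
            yield (term, 1)

def shuffle(pairs):
    """Stand-in for the framework's shuffle: group values by key."""
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    """Reducer: sum the list of ones for each term."""
    for term, counts in grouped:
        yield term, sum(counts)

docs = ["to be or not to be", "to do"]
counts = dict(reduce_phase(shuffle(map_phase(docs))))
# → {'be': 2, 'do': 1, 'not': 1, 'or': 1, 'to': 3}
```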
The obvious disadvantage of this approach is the high number of dummy counters emitted by the Mapper. The Mapper can reduce the
number of counters by summing the counters within each document:
class Mapper
   method Map(docid id, doc d)
      H = new AssociativeArray
      for all term t in doc d do
         H{t} = H{t} + 1
      for all term t in H do
         Emit(term t, count H{t})
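A sketch of this per-document aggregation in Python: the associative array H maps naturally onto `collections.Counter` (the function name is illustrative):

```python
from collections import Counter

def map_with_local_sum(doc):
    """Emit one (term, count) pair per distinct term in the document,
    instead of one (term, 1) pair per occurrence."""
    h = Counter(doc.split())  # plays the role of H{t} = H{t} + 1
    return list(h.items())

pairs = map_with_local_sum("to be or not to be")
# one pair per distinct term: to→2, be→2, or→1, not→1
```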
In order to accumulate counters not only for one document, but for all documents processed by one Mapper node, it is possible to leverage
Combiners:
class Mapper
   method Map(docid id, doc d)
      for all term t in doc d do
         Emit(term t, count 1)

class Combiner
   method Combine(term t, counts [c1, c2, ...])
      sum = 0
      for all count c in counts [c1, c2, ...] do
         sum = sum + c
      Emit(term t, count sum)

class Reducer
   method Reduce(term t, counts [c1, c2, ...])
      sum = 0
      for all count c in counts [c1, c2, ...] do
         sum = sum + c
      Emit(term t, count sum)
Applications:
Collating
Problem Statement: There is a set of items and some function of one item. It is required to save all items that have the same value of the
function into one file, or to perform some other computation that requires all such items to be processed as a group. The most typical
example is the building of inverted indexes.
Solution:
The solution is straightforward. The Mapper computes the given function for each item and emits the function's value as a key and the item
itself as a value. The Reducer obtains all items grouped by function value and processes or saves them. In the case of inverted indexes, the
items are terms (words) and the function value is the ID of the document where the term was found.
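A sketch of the inverted-index case in Python, with the grouping step again standing in for the shuffle phase (names are illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping of doc_id -> text. The Mapper emits (term, doc_id)
    pairs; grouping by term plays the role of the shuffle; the Reducer
    saves the sorted, de-duplicated posting list for each term."""
    emitted = []
    for doc_id, text in docs.items():      # Mapper
        for term in text.split():
            emitted.append((term, doc_id))
    index = defaultdict(set)               # shuffle + Reduce
    for term, doc_id in emitted:
        index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index({1: "apple banana", 2: "banana cherry"})
# → {'apple': [1], 'banana': [1, 2], 'cherry': [2]}
```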
Applications:
Filtering (Grepping), Parsing, and Validation
Problem Statement: There is a set of records, and it is required to collect all records that meet some condition or to transform each record
(independently of the other records) into another representation. The latter case includes such tasks as text parsing, value extraction, and
conversion from one format to another.
Solution: The solution is absolutely straightforward – the Mapper takes records one by one and emits the accepted items or their
transformed versions.
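Since this is a map-only job, a Python sketch needs no Reducer at all; the predicate and transform below are illustrative stand-ins:

```python
def filter_map(records, predicate, transform):
    """Map-only job: emit the transformed record when it passes the
    filter; no Reducer is needed (or an identity Reducer is used)."""
    for rec in records:
        if predicate(rec):
            yield transform(rec)

# keep error lines and extract just the message
logs = ["INFO ok", "ERROR disk full", "ERROR timeout"]
errors = list(filter_map(logs,
                         lambda r: r.startswith("ERROR"),
                         lambda r: r.split(" ", 1)[1]))
# → ['disk full', 'timeout']
```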
Applications:
Distributed Task Execution
Problem Statement: There is a large computational problem that can be divided into multiple parts, and the results from all parts can be
combined to obtain the final result.
Solution: The problem description is split into a set of specifications, and the specifications are stored as input data for the Mappers. Each
Mapper takes a specification, performs the corresponding computations, and emits results. The Reducer combines all emitted parts into the
final result.
As an example, consider a software simulator of a digital communication system, such as WiMAX, that passes some volume of random data
through the system model and computes the error probability. Each Mapper runs the simulation for a specified amount of data, which is
1/Nth of the required sample size, and emits its error rate. The Reducer computes the average error rate.
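A toy Python sketch of this scheme; the "simulation" here is only a stand-in (a coin flip with an assumed true error probability), and all names and parameters are illustrative:

```python
import random

def mapper_simulate(spec):
    """Each Mapper runs the simulation over its 1/N-th share of the
    samples and emits a single (key, error_rate) pair. The real system
    model is replaced by a Bernoulli trial with probability p_error."""
    rng = random.Random(spec["seed"])  # independent stream per Mapper
    errors = sum(rng.random() < spec["p_error"]
                 for _ in range(spec["samples"]))
    return ("error_rate", errors / spec["samples"])

def reducer_average(pairs):
    """Reducer: average the per-Mapper error rates."""
    rates = [rate for _, rate in pairs]
    return sum(rates) / len(rates)

# 8 Mappers, each simulating 1/8th of the 80,000 required samples
specs = [{"seed": s, "samples": 10_000, "p_error": 0.05} for s in range(8)]
estimate = reducer_average([mapper_simulate(s) for s in specs])
```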
Applications:
Sorting
Problem Statement: There is a set of records, and it is required to sort these records by some rule or to process them in a certain order.
Solution: Simple sorting is absolutely straightforward – Mappers just emit all items as values associated with sorting keys that are
assembled as a function of the items. Nevertheless, in practice sorting is often used in quite a tricky way, which is why it is said to be the
heart of MapReduce (and Hadoop). In particular, it is very common to use composite keys to achieve secondary sorting and grouping.
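The composite-key trick can be illustrated in Python; the single `sorted` call below stands in for the framework's sort phase, and the record layout is illustrative:

```python
from itertools import groupby
from operator import itemgetter

# Records: (user, timestamp, event). Sorting on the composite key
# (user, timestamp) orders events within each user; grouping is then
# done on the user alone, so every group arrives already ordered by
# time -- the essence of secondary sorting.
records = [("bob", 3, "logout"), ("ann", 1, "login"),
           ("bob", 1, "login"), ("ann", 2, "click")]

shuffled = sorted(records)  # sort by composite key (user, timestamp)
sessions = {user: [event for _, _, event in group]
            for user, group in groupby(shuffled, key=itemgetter(0))}
# → {'ann': ['login', 'click'], 'bob': ['login', 'logout']}
```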