MapReduce Patterns, Algorithms, and Use Cases - Highly Scalable Blog

The document discusses several common MapReduce patterns and algorithms, including counting and summing, collating, filtering, distributed task execution, and sorting. It provides code examples to demonstrate how to implement counting and summing of term frequencies in documents using MapReduce. It also lists some common applications of these patterns such as log analysis, data querying, and building inverted indexes.

Uploaded by

jefferyleclerc
Copyright
© All Rights Reserved

3/15/24, 12:47 AM MapReduce Patterns, Algorithms, and Use Cases – Highly Scalable Blog

Highly Scalable Blog


ARTICLES ON BIG DATA, NOSQL, AND HIGHLY SCALABLE SOFTWARE ENGINEERING

MapReduce Patterns, Algorithms, and Use Cases


In this article I have digested a number of MapReduce patterns and algorithms to give a systematic view of the different techniques that can be
found on the web or in scientific articles. Several practical case studies are also provided. All descriptions and code snippets use the standard
Hadoop MapReduce model with Mappers, Reducers, Combiners, Partitioners, and sorting. This framework is depicted in the figure below.

https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/


Figure: MapReduce Framework (https://highlyscalable.files.wordpress.com/2012/02/map-reduce.png)

Basic MapReduce Patterns

Counting and Summing

Problem Statement: There are a number of documents, each of which is a set of terms. The task is to calculate the total number
of occurrences of each term across all documents. Alternatively, the value to compute can be an arbitrary function of the terms. For instance,
there is a log file where each record contains a response time, and the task is to calculate the average response time.

Solution:

Let's start with something really simple. The code snippet below shows a Mapper that simply emits “1” for each term it processes and a Reducer
that goes through the lists of ones and sums them up:

class Mapper
   method Map(docid id, doc d)
      for all term t in doc d do
         Emit(term t, count 1)

class Reducer
   method Reduce(term t, counts [c1, c2,...])
      sum = 0
      for all count c in [c1, c2,...] do
         sum = sum + c
      Emit(term t, count sum)
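
To make the data flow concrete, the pseudocode above can be mimicked in a single Python process. This is only a sketch: `run_job` is a hypothetical stand-in for the framework's shuffle and sort, and documents are assumed to be whitespace-separated strings.

```python
from collections import defaultdict

def map_doc(doc_id, doc):
    """Mapper: emit (term, 1) for every term occurrence in the document."""
    for term in doc.split():
        yield term, 1

def reduce_counts(term, counts):
    """Reducer: sum the list of ones collected for one term."""
    return term, sum(counts)

def run_job(docs):
    """Hypothetical driver: groups mapper output by key, then reduces each group."""
    groups = defaultdict(list)
    for doc_id, doc in docs.items():
        for term, count in map_doc(doc_id, doc):
            groups[term].append(count)      # shuffle: collect values per key
    return dict(reduce_counts(t, cs) for t, cs in sorted(groups.items()))

print(run_job({"d1": "a b a", "d2": "b c"}))  # {'a': 2, 'b': 2, 'c': 1}
```

Note that the Mapper emits five pairs for five term occurrences here, which is exactly the overhead discussed next.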

The obvious disadvantage of this approach is the large number of dummy counters emitted by the Mapper. The Mapper can reduce the
number of counters by summing the counters for each document:

class Mapper
   method Map(docid id, doc d)
      H = new AssociativeArray
      for all term t in doc d do
         H{t} = H{t} + 1
      for all term t in H do
         Emit(term t, count H{t})
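
A rough Python equivalent of this per-document aggregation is sketched below. It assumes whitespace-separated documents, and `collections.Counter` plays the role of the associative array H. The helper `emitted_pairs` is hypothetical and exists only to show how many pairs leave the Mapper.

```python
from collections import Counter

def map_doc(doc_id, doc):
    """Mapper: count terms inside one document, then emit one pair per distinct term."""
    local_counts = Counter(doc.split())     # the associative array H
    for term, count in local_counts.items():
        yield term, count

def emitted_pairs(docs):
    """Hypothetical helper: everything the Mappers would send to the shuffle."""
    return [pair for doc_id, doc in docs.items() for pair in map_doc(doc_id, doc)]

docs = {"d1": "a b a a", "d2": "b b"}
pairs = emitted_pairs(docs)
print(len(pairs))  # 3 pairs instead of 6 term occurrences
```

The Reducer is unchanged, because it already sums arbitrary counts rather than assuming every value is 1.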

To accumulate counters not only for one document but for all documents processed by one Mapper node, it is possible to use
Combiners:

class Mapper
   method Map(docid id, doc d)
      for all term t in doc d do
         Emit(term t, count 1)

class Combiner
   method Combine(term t, [c1, c2,...])
      sum = 0
      for all count c in [c1, c2,...] do
         sum = sum + c
      Emit(term t, count sum)

class Reducer
   method Reduce(term t, counts [c1, c2,...])
      sum = 0
      for all count c in [c1, c2,...] do
         sum = sum + c
      Emit(term t, count sum)
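
The effect of a Combiner can be illustrated with a small single-process sketch. It assumes that each inner list in `splits` holds the documents handled by one Mapper node; `run_job` and `group` are hypothetical helpers, not Hadoop APIs. In Hadoop the Combiner is an optimization that may run zero or more times, so the same function must be safe to apply repeatedly. Summation qualifies because it is associative and commutative.

```python
from collections import defaultdict

def map_doc(doc):
    """Mapper: emit (term, 1) for every term occurrence."""
    for term in doc.split():
        yield term, 1

def sum_counts(term, counts):
    """Serves as both Combiner and Reducer: add up the counts for one term."""
    return term, sum(counts)

def group(pairs):
    """Group values by key, as the shuffle phase would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def run_job(splits):
    """splits: one list of documents per (hypothetical) Mapper node."""
    shuffled = []
    for node_docs in splits:
        node_out = [pair for doc in node_docs for pair in map_doc(doc)]
        # Combiner: pre-aggregate this node's output before it crosses the network.
        shuffled.extend(sum_counts(t, cs) for t, cs in group(node_out).items())
    result = dict(sum_counts(t, cs) for t, cs in group(shuffled).items())
    return result, len(shuffled)

result, pairs_sent = run_job([["a b a", "a c"], ["b b"]])
print(result, pairs_sent)
```

In this run the two nodes send 4 pairs to the Reducers instead of the 7 that the Mappers emitted.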

Applications:

Log Analysis, Data Querying
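
The average response time mentioned in the problem statement needs a little more care than a plain sum, because an average of averages is generally not the overall average. One common approach, sketched here under the assumption that each record is a dict with a hypothetical `response_ms` field, is to carry (sum, count) pairs through the Combiner and divide only in the Reducer.

```python
def map_record(record):
    """Mapper: turn one log record into a partial aggregate (sum, count)."""
    return (record["response_ms"], 1)

def combine(partials):
    """Combiner: add partial sums and counts; output has the same shape as input."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return (total, count)

def reduce_average(partials):
    """Reducer: merge all partial aggregates, then divide once."""
    total, count = combine(partials)
    return total / count

node1 = [{"response_ms": 100}, {"response_ms": 200}, {"response_ms": 300}]
node2 = [{"response_ms": 500}]
partials = [combine([map_record(r) for r in node]) for node in (node1, node2)]
print(reduce_average(partials))  # 275.0, whereas averaging the node averages gives 350.0
```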

Collating

Problem Statement: There is a set of items and some function of a single item. The task is to save all items that have the same function
value into one file, or to perform some other computation that requires all such items to be processed as a group. The most typical example
is the building of inverted indexes.

Solution:

The solution is straightforward. The Mapper computes the given function for each item and emits the value of the function as a key and the item itself as
a value. The Reducer obtains all items grouped by function value and processes or saves them. In the case of inverted indexes, each item is an occurrence
of a term (word) in a document, the function returns the term, and the Reducer receives the IDs of all documents where that term was found.
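
A minimal inverted-index sketch of this pattern follows. It assumes that the grouping key is the term and that the collated values are document IDs; `build_index` is a hypothetical single-process driver standing in for the shuffle.

```python
from collections import defaultdict

def map_doc(doc_id, text):
    """Mapper: emit (term, doc_id) once per distinct term in the document."""
    for term in set(text.split()):
        yield term, doc_id

def reduce_postings(term, doc_ids):
    """Reducer: all document IDs for one term arrive together; store them sorted."""
    return term, sorted(doc_ids)

def build_index(docs):
    """Hypothetical driver: group mapper output by term, then reduce each group."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for term, value in map_doc(doc_id, text):
            groups[term].append(value)
    return dict(reduce_postings(t, ids) for t, ids in groups.items())

print(build_index({"d1": "big data", "d2": "big cluster"}))
```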

Applications:

Inverted Indexes, ETL

Filtering (“Grepping”), Parsing, and Validation

Problem Statement: There is a set of records, and the task is to collect all records that meet some condition or to transform each record
(independently of other records) into another representation. The latter case includes such tasks as text parsing, value extraction, and
conversion from one format to another.

Solution: The solution is straightforward: the Mapper takes records one by one and emits the accepted items or their transformed
versions.
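
Since every record is handled independently, no Reducer logic is needed, and such jobs are typically run map-only. The sketch below combines validation, filtering, and parsing in one Mapper. The four-field log format (`METHOD PATH STATUS MILLIS`) and the "server errors only" condition are assumptions made purely for illustration.

```python
def map_line(line):
    """Mapper: validate and parse one log line; emit it only if it is a server error."""
    parts = line.split()
    # Validation: silently drop lines that do not match the assumed format.
    if len(parts) != 4 or not parts[2].isdigit() or not parts[3].isdigit():
        return
    status = int(parts[2])
    # Filtering ("grepping"): keep only responses with status 500 or above.
    if status >= 500:
        # Parsing: emit a structured record instead of the raw text.
        yield {"method": parts[0], "path": parts[1], "status": status, "ms": int(parts[3])}

def run_map_only(lines):
    """Hypothetical driver: with no Reducer, the job output is just the mapper output."""
    return [record for line in lines for record in map_line(line)]

lines = ["GET /a 200 12", "GET /b 503 40", "garbage", "POST /c 500 x"]
print(run_map_only(lines))  # only the "/b" record survives
```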

Applications:

Log Analysis, Data Querying, ETL, Data Validation

Distributed Task Execution

Problem Statement: There is a large computational problem that can be divided into multiple parts, and the results from all parts can be
combined to obtain a final result.

Solution: The problem description is split into a set of specifications, and the specifications are stored as input data for the Mappers. Each Mapper takes
a specification, performs the corresponding computations, and emits the results. The Reducer combines all emitted parts into the final result.

Case Study: Simulation of a Digital Communication System

There is a software simulator of a digital communication system such as WiMAX that passes some volume of random data through the system
model and computes the error probability of throughput. Each Mapper runs the simulation for a specified amount of data, which is 1/Nth of the
required sample, and emits the error rate. The Reducer computes the average error rate.
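
The case study can be sketched with a toy Monte Carlo simulation. The channel below flips each bit independently with a fixed probability. It is a deliberately simple stand-in, not a model of WiMAX, and every name and parameter is hypothetical. Because every specification covers the same number of bits, a plain mean of the per-Mapper error rates equals the overall error rate; with unequal parts the Mappers would need to emit error and bit counts instead.

```python
import random

def make_specs(total_bits, n_parts, flip_prob):
    """Split the job into n_parts equal specifications (total_bits must divide evenly)."""
    per_part = total_bits // n_parts
    return [{"seed": i, "bits": per_part, "flip_prob": flip_prob} for i in range(n_parts)]

def map_spec(spec):
    """Mapper: simulate one share of the data and emit its error rate."""
    rng = random.Random(spec["seed"])       # independent, reproducible stream per part
    errors = sum(1 for _ in range(spec["bits"]) if rng.random() < spec["flip_prob"])
    return errors / spec["bits"]

def reduce_rates(rates):
    """Reducer: average the error rates of the equally sized parts."""
    return sum(rates) / len(rates)

specs = make_specs(total_bits=200_000, n_parts=4, flip_prob=0.1)
print(reduce_rates([map_spec(s) for s in specs]))  # close to 0.1
```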

Applications:

Physical and Engineering Simulations, Numerical Analysis, Performance Testing

Sorting

Problem Statement: There is a set of records, and the task is to sort these records by some rule or to process them in a certain
order.

Solution: Simple sorting is straightforward: Mappers just emit all items as values associated with sorting keys that are
computed as a function of the items. Nevertheless, in practice sorting is often used in quite tricky ways, which is why it is said to be the heart of
MapReduce (and Hadoop). In particular, it is very common to use composite keys to achieve secondary sorting and grouping.
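
The composite-key idea can be sketched as follows, assuming events of the hypothetical form (user, timestamp, action). The Mapper emits a composite key (user, timestamp). Records are sorted by the full key but grouped by the user part only, so each group arrives already ordered by timestamp. In Hadoop this is usually arranged with a custom Partitioner and grouping comparator; here `sorted` and `itertools.groupby` merely imitate that behaviour in one process.

```python
from itertools import groupby

def map_event(event):
    """Mapper: emit a composite key (user, timestamp) with the action as the value."""
    user, timestamp, action = event
    return ((user, timestamp), action)

def secondary_sort(events):
    """Sort by the full composite key, then group by its first component only."""
    pairs = sorted((map_event(e) for e in events), key=lambda kv: kv[0])
    return {
        user: [action for _, action in group]
        for user, group in groupby(pairs, key=lambda kv: kv[0][0])
    }

events = [("bob", 3, "logout"), ("alice", 2, "click"), ("bob", 1, "login"), ("alice", 1, "login")]
print(secondary_sort(events))
# {'alice': ['login', 'click'], 'bob': ['login', 'logout']}
```

The Reducer never has to buffer and sort the values itself, which matters when a single key has more values than fit in memory.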
