MapReduce Patterns, Algorithms, and Use Cases - Highly Scalable Blog
(Figure: https://highlyscalable.files.wordpress.com/2012/02/map-reduce.png)
MapReduce Framework
Counting and Summing
Problem Statement: There are a number of documents, where each document is a set of terms. It is required to calculate the total number
of occurrences of each term across all documents. Alternatively, it can be an arbitrary function over the terms. For instance, given a log file
where each record contains a response time, it is required to calculate the average response time.
Solution:
Let's start with something really simple. The code snippet below shows a Mapper that simply emits "1" for each term it processes and a
Reducer that goes through the lists of ones and sums them up:
class Mapper
   method Map(docid id, doc d)
      for all term t in doc d do
         Emit(term t, count 1)

class Reducer
   method Reduce(term t, counts [c1, c2, ...])
      sum = 0
      for all count c in counts [c1, c2, ...] do
         sum = sum + c
      Emit(term t, count sum)
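As a minimal sketch, this pattern can be simulated in plain Python; the `shuffle` function below stands in for the framework's shuffle phase, and all names are illustrative:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Mapper: emit a (term, 1) pair for every term occurrence."""
    for doc in documents:
        for term in doc.split():
            yield (term, 1)

def shuffle(pairs):
    """Stand-in for the framework's shuffle: group values by key."""
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    """Reducer: sum the list of ones for each term."""
    for term, counts in grouped:
        yield term, sum(counts)

docs = ["to be or not to be", "to do"]
counts = dict(reduce_phase(shuffle(map_phase(docs))))
# → {'be': 2, 'do': 1, 'not': 1, 'or': 1, 'to': 3}
```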
The obvious disadvantage of this approach is the high number of dummy counters emitted by the Mapper. The Mapper can reduce the
number of counters by summing the counters within each document:
class Mapper
   method Map(docid id, doc d)
      H = new AssociativeArray
      for all term t in doc d do
         H{t} = H{t} + 1
      for all term t in H do
         Emit(term t, count H{t})
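A sketch of this per-document aggregation in Python: the associative array H maps naturally onto `collections.Counter` (the function name is illustrative):

```python
from collections import Counter

def map_with_local_sum(doc):
    """Emit one (term, count) pair per distinct term in the document,
    instead of one (term, 1) pair per occurrence."""
    h = Counter(doc.split())  # plays the role of H{t} = H{t} + 1
    return list(h.items())

pairs = map_with_local_sum("to be or not to be")
# one pair per distinct term: to→2, be→2, or→1, not→1
```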
In order to accumulate counters not only for one document, but for all documents processed by one Mapper node, it is possible to leverage
Combiners:
class Mapper
   method Map(docid id, doc d)
      for all term t in doc d do
         Emit(term t, count 1)

class Combiner
   method Combine(term t, counts [c1, c2, ...])
      sum = 0
      for all count c in counts [c1, c2, ...] do
         sum = sum + c
      Emit(term t, count sum)

class Reducer
   method Reduce(term t, counts [c1, c2, ...])
      sum = 0
      for all count c in counts [c1, c2, ...] do
         sum = sum + c
      Emit(term t, count sum)
Applications:
Collating
Problem Statement: There is a set of items and some function of one item. It is required to save all items that have the same value of the
function into one file, or to perform some other computation that requires all such items to be processed as a group. The most typical
example is the building of inverted indexes.
Solution:
The solution is straightforward. The Mapper computes the given function for each item and emits the function's value as a key and the item
itself as a value. The Reducer obtains all items grouped by function value and processes or saves them. In the case of inverted indexes, the
items are terms (words) and the function value is the ID of the document where the term was found.
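A sketch of the inverted-index case in Python, with the grouping step again standing in for the shuffle phase (names are illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping of doc_id -> text. The Mapper emits (term, doc_id)
    pairs; grouping by term plays the role of the shuffle; the Reducer
    saves the sorted, de-duplicated posting list for each term."""
    emitted = []
    for doc_id, text in docs.items():      # Mapper
        for term in text.split():
            emitted.append((term, doc_id))
    index = defaultdict(set)               # shuffle + Reduce
    for term, doc_id in emitted:
        index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index({1: "apple banana", 2: "banana cherry"})
# → {'apple': [1], 'banana': [1, 2], 'cherry': [2]}
```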
Applications:
Filtering (Grepping), Parsing, and Validation
Problem Statement: There is a set of records, and it is required to collect all records that meet some condition or to transform each record
(independently of the other records) into another representation. The latter case includes such tasks as text parsing, value extraction, and
conversion from one format to another.
Solution: The solution is absolutely straightforward – the Mapper takes records one by one and emits the accepted items or their
transformed versions.
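Since this is a map-only job, a Python sketch needs no Reducer at all; the predicate and transform below are illustrative stand-ins:

```python
def filter_map(records, predicate, transform):
    """Map-only job: emit the transformed record when it passes the
    filter; no Reducer is needed (or an identity Reducer is used)."""
    for rec in records:
        if predicate(rec):
            yield transform(rec)

# keep error lines and extract just the message
logs = ["INFO ok", "ERROR disk full", "ERROR timeout"]
errors = list(filter_map(logs,
                         lambda r: r.startswith("ERROR"),
                         lambda r: r.split(" ", 1)[1]))
# → ['disk full', 'timeout']
```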
Applications:
Distributed Task Execution
Problem Statement: There is a large computational problem that can be divided into multiple parts, and the results from all parts can be
combined to obtain the final result.
Solution: The problem description is split into a set of specifications, and the specifications are stored as input data for the Mappers. Each
Mapper takes a specification, performs the corresponding computations, and emits results. The Reducer combines all emitted parts into the
final result.
As an example, consider a software simulator of a digital communication system, such as WiMAX, that passes some volume of random data
through the system model and computes the error probability. Each Mapper runs the simulation for a specified amount of data, which is
1/Nth of the required sample size, and emits its error rate. The Reducer computes the average error rate.
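A toy Python sketch of this scheme; the "simulation" here is only a stand-in (a coin flip with an assumed true error probability), and all names and parameters are illustrative:

```python
import random

def mapper_simulate(spec):
    """Each Mapper runs the simulation over its 1/N-th share of the
    samples and emits a single (key, error_rate) pair. The real system
    model is replaced by a Bernoulli trial with probability p_error."""
    rng = random.Random(spec["seed"])  # independent stream per Mapper
    errors = sum(rng.random() < spec["p_error"]
                 for _ in range(spec["samples"]))
    return ("error_rate", errors / spec["samples"])

def reducer_average(pairs):
    """Reducer: average the per-Mapper error rates."""
    rates = [rate for _, rate in pairs]
    return sum(rates) / len(rates)

# 8 Mappers, each simulating 1/8th of the 80,000 required samples
specs = [{"seed": s, "samples": 10_000, "p_error": 0.05} for s in range(8)]
estimate = reducer_average([mapper_simulate(s) for s in specs])
```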
Applications:
Sorting
Problem Statement: There is a set of records, and it is required to sort these records by some rule or to process them in a certain order.
Solution: Simple sorting is absolutely straightforward – Mappers just emit all items as values associated with sorting keys that are
assembled as a function of the items. Nevertheless, in practice sorting is often used in quite a tricky way, which is why it is said to be the
heart of MapReduce (and Hadoop). In particular, it is very common to use composite keys to achieve secondary sorting and grouping.
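The composite-key trick can be illustrated in Python; the single `sorted` call below stands in for the framework's sort phase, and the record layout is illustrative:

```python
from itertools import groupby
from operator import itemgetter

# Records: (user, timestamp, event). Sorting on the composite key
# (user, timestamp) orders events within each user; grouping is then
# done on the user alone, so every group arrives already ordered by
# time -- the essence of secondary sorting.
records = [("bob", 3, "logout"), ("ann", 1, "login"),
           ("bob", 1, "login"), ("ann", 2, "click")]

shuffled = sorted(records)  # sort by composite key (user, timestamp)
sessions = {user: [event for _, _, event in group]
            for user, group in groupby(shuffled, key=itemgetter(0))}
# → {'ann': ['login', 'click'], 'bob': ['login', 'logout']}
```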