ECS765P W2
● Introduction to MapReduce
● Programming patterns
● Aggregate computations
Our first parallel program (Reminder from last week)
● Task: count the number of occurrences of each word in one document
● Input: text document
● Output: sequence of: word, count
The 56
School 23
Queen 10
● Collection Stage: Not applicable in this case
● Ingestion Stage: Move the file to the data lake with an applicable protocol, e.g. HTTP/FTP
● Preparation Stage: Remove characters that might confuse the algorithm, e.g. quotation marks (see the sketch below)
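As an illustration of the preparation stage, a minimal sketch in Python (assuming simple punctuation stripping is acceptable for the task; the name clean_text is illustrative):

import re

def clean_text(text):
    # Replace anything that is not a word character or whitespace
    # (quotation marks, commas, ...) with a space, so the counter
    # does not treat Mary. and Mary as different words
    return re.sub(r"[^\w\s]", " ", text)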
Program Input
QMUL has been ranked 9th among multi-faculty institutions in the UK, according to tables published today in the Times Higher
Education.
A total of 154 institutions were submitted for the exercise.
The 2008 RAE confirmed Queen Mary to be one of the rising stars of the UK research environment and the REF 2014 shows that this
upward trajectory has been maintained.
Professor Simon Gaskell, President and Principal of Queen Mary, said: “This is an outstanding result for Queen Mary. We have built
upon the progress that was evidenced by the last assessment exercise and have now clearly cemented our position as one of the UK’s
foremost research-led universities. This achievement is derived from the talent and hard work of our academic staff in all disciplines,
and the colleagues who support them.”
The Research Excellence Framework (REF) is the system for assessing the quality of research in UK higher education institutions.
Universities submit their work across 36 panels of assessment. Research is judged according to quality of output (65 per cent),
environment (15 per cent) and, for the first time, the impact of research (20 per cent).
How to solve the problem?
How to solve the problem on a single processor?
# input: text, a string with the complete text
words = text.split()
count = dict()
for word in words:
    if word in count:
        count[word] = count[word] + 1
    else:
        count[word] = 1
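An equivalent, more idiomatic version using Python's standard library:

from collections import Counter

count = Counter(text.split())  # word -> number of occurrences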
Parallelising the problem
Splitting the load into subtasks:
● Split sentences/lines into words
● Count all the occurrences of each word
…What do we do with the intermediate results?
● Merge them into a single collection
● Possibly requires parallelism too (see the sketch below)
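A sketch of this split-and-merge idea on a single machine, using Python's multiprocessing (the chunking here is illustrative; a real cluster framework would distribute the subtasks for us):

from multiprocessing import Pool
from collections import Counter

def count_chunk(lines):
    # Subtask: split each line into words and count locally
    c = Counter()
    for line in lines:
        c.update(line.split())
    return c

if __name__ == "__main__":
    # Hypothetical pre-split input: one list of lines per subtask
    chunks = [["the school the"], ["queen the school"]]
    with Pool() as pool:
        partials = pool.map(count_chunk, chunks)
    # Merge the intermediate results into a single collection
    total = sum(partials, Counter())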
MapReduce
“A simple and powerful interface that enables automatic parallelization and distribution of large-scale
computations, combined with an implementation of this interface that achieves high performance on
large clusters of commodity PCs.” (Dean and Ghemawat, “MapReduce: Simplified Data Processing on
Large Clusters”, Google Inc.)
More simply, MapReduce is:
● A parallel programming model and associated implementation.
MapReduce Pattern
MapReduce Programming Model
Data is processed with map() and reduce() functions
● The map() function is called on every item in the input and emits a series of intermediate key/value pairs
● All the emitted values for a given key are grouped together
● The reduce() function is called once for each unique key, with its collected values, and emits a partial
result that is added to the output
Example wordcount (pythonish pseudocode)
def mapper(_, text):
    words = text.split()
    for word in words:
        emit(word, 1)
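A matching reducer in the same pseudocode; the framework groups all the values emitted for each word before calling it:

def reducer(word, counts):
    # counts: all the 1s emitted for this word by every mapper
    emit(word, sum(counts))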
[Figure: map() tasks running in parallel over Data store 1 … Data store n]
● Introduction to MapReduce
● Programming patterns
● Aggregate computations
Shuffle and Sort steps
Parallelising Map and Reduce jobs allows algorithms to scale close to linearly
One potential bottleneck for MapReduce programs is the cost of the Shuffle and Sort operations
● Data has to be copied over the network
● All the key/value pairs emitted by the mappers must be transferred
● Sorting large numbers of elements can be costly
The Combiner is an optional additional step that is executed before these operations
The Combiner
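For wordcount, a combiner can be sketched in the same pythonish pseudocode; it runs over each mapper's local output before the shuffle, so one partial count per word crosses the network instead of one pair per occurrence:

def combiner(word, counts):
    # Pre-aggregate locally; safe because addition is associative
    # and commutative
    emit(word, sum(counts))

Note the combiner has the same signature and body as the reducer here; that is typical whenever the aggregation function is associative and commutative.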
● Introduction to MapReduce
● Programming patterns
● Aggregate computations
Top Ten Performance
How many Reducers?
● Performance issues?
What happens if we don’t use Combiners?
Performance depends greatly on the number of elements (and to a lesser extent on the size of the data)
Minimum requirement: the ranking data for a whole input split must fit into the memory of a single Mapper
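One common way to meet that requirement is to keep a small in-memory heap per mapper (a sketch in the same pythonish pseudocode; close() stands in for whatever end-of-split hook the framework provides):

import heapq

top = []  # this mapper's local candidates

def mapper(_, record):
    # record = (score, item); keep only the ten largest seen so far
    heapq.heappush(top, record)
    if len(top) > 10:
        heapq.heappop(top)  # drop the current smallest

def close():
    # At the end of the split, emit the local top ten under a single
    # shared key so that one reducer sees all the candidates
    for record in top:
        emit(None, record)

def reducer(_, records):
    # At most ten records arrive per mapper; pick the global top ten
    for record in heapq.nlargest(10, records):
        emit(record, None)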
Numerical Summarisation
Goal: Calculate aggregate statistical values over a dataset
Extract features from the dataset elements and compute the same function for each feature
Examples:
● Count occurrences
● Maximum / minimum values
● Average / median / standard deviation
Sample dataset: China’s Air Quality sensors
Sample numerical summarisation questions
● Compute the maximum PM2.5 value registered for each location in the dataset
● Return the average AQI registered each week
● Compute, for each day of the week, the number of locations where the PM2.5 index exceeded 150
Numerical Summarisation Structure
Numerical Summarisation Map and Reduce functions
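For the first question above (maximum PM2.5 per location), a sketch in the same pythonish pseudocode; the field names location and pm25 are assumptions about the record layout:

def mapper(_, record):
    # Feature extraction: key = location, value = one PM2.5 reading
    emit(record.location, record.pm25)

def reducer(location, readings):
    # The same aggregation function is computed for every key
    emit(location, max(readings))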
Numerical Summarisation Combiner?
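Because max is associative and commutative, the Reducer above can be reused unchanged as a Combiner: each Mapper then forwards a single local maximum per location instead of every reading. Averages are different, as the next slides show.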
Computing Averages
Combining Average
Average is NOT an associative operation
● Cannot be partially computed by the Combiners
Solution: change the Mapper results
● Emit aggregated quantities together with the number of elements
● Mapper: for mark values (100, 100, 20), emits (100,1), (100,1), (20,1)
● Combiner: adds aggregates and element counts; emits (220,3)
● Reducer: adds aggregates and element counts, then computes the average (see the sketch below)
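Putting the whole scheme together, a sketch in the same pythonish pseudocode:

def mapper(_, mark):
    emit("marks", (mark, 1))  # (aggregate, number of elements)

def combiner(key, pairs):
    total = sum(a for a, n in pairs)
    count = sum(n for a, n in pairs)
    emit(key, (total, count))  # (100,1),(100,1),(20,1) -> (220,3)

def reducer(key, pairs):
    total = sum(a for a, n in pairs)
    count = sum(n for a, n in pairs)
    emit(key, total / count)  # 220 / 3 = 73.33...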