
ECS640U/ECS765P
Big Data Processing

The MapReduce Programming Model
Lecturer: Joseph Doyle
School of Electronic Engineering and Computer Science
Contents

● Introduction to MapReduce
● Programming patterns
● Aggregate computations
Our first parallel program (Reminder from last week)
● Task: count the number of occurrences of each word in one document
● Input: text document
● Output: sequence of: word, count
The 56
School 23
Queen 10
● Collection Stage: Not applicable in this case
● Ingestion Stage: Move the file to the data lake with an applicable protocol, e.g. HTTP/FTP
● Preparation Stage: Remove characters which might confuse the algorithm, e.g. quotation marks
Program Input
QMUL has been ranked 9th among multi-faculty institutions in the UK, according to tables published today in the Times Higher
Education.
A total of 154 institutions were submitted for the exercise.
The 2008 RAE confirmed Queen Mary to be one of the rising stars of the UK research environment and the REF 2014 shows that this
upward trajectory has been maintained.
Professor Simon Gaskell, President and Principal of Queen Mary, said: “This is an outstanding result for Queen Mary. We have built
upon the progress that was evidenced by the last assessment exercise and have now clearly cemented our position as one of the UK’s
foremost research-led universities. This achievement is derived from the talent and hard work of our academic staff in all disciplines,
and the colleagues who support them.”
The Research Excellence Framework (REF) is the system for assessing the quality of research in UK higher education institutions.
Universities submit their work across 36 panels of assessment. Research is judged according to quality of output (65 per cent),
environment (15 per cent) and, for the first time, the impact of research (20 per cent).
How to solve the problem?
How to solve the problem on a single processor?
# input: text string with the complete text
words = text.split()
count = dict()
for word in words:
    if word in count:
        count[word] = count[word] + 1
    else:
        count[word] = 1
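For reference, the same single-processor count can be written more compactly with Python's standard collections.Counter; this is a sketch, with `text` standing in for the full document contents:

```python
from collections import Counter

text = "the school of the queen"   # stand-in for the full document text
count = Counter(text.split())      # counts every word in one pass

print(count["the"])   # 2
```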
Parallelising the problem
Splitting the load into subtasks:
● Split sentences/lines into words
● Count all the occurrences of each word
…What do we do with the intermediate results?
● Merge into single collection
● Possibly requires parallelism too
MapReduce
“A simple and powerful interface that enables automatic parallelization and distribution of large-scale
computations, combined with an implementation of this interface that achieves high performance on
large clusters of commodity PCs.” (Dean and Ghemawat, “MapReduce: Simplified Data Processing on
Large Clusters”, Google Inc.)
More simply, MapReduce is:
● A parallel programming model and associated implementation.
MapReduce Pattern
MapReduce Programming Model
Data is processed with map() and reduce() functions
● The map() function is called on every item in the input and emits a series of intermediate key/value pairs
● All the emitted values for a given key are grouped together
● The reduce() function is called on every unique key with its collected values, and emits a partial
result that is added to the output
Example wordcount (pythonish pseudocode)
def mapper(_, text):
    words = text.split()
    for word in words:
        emit(word, 1)

def reducer(key, values):
    emit(key, sum(values))
Example wordcount (javaish pseudocode)
public void Map(String filename, String text) {
    List<String> words = text.split();
    for (String word : words) {
        emit(word, 1);
    }
}

public void Reduce(String key, List<Integer> values) {
    int sum = 0;
    for (Integer count : values) {
        sum += count;
    }
    emit(key, sum);
}
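To see the whole flow end to end, here is a minimal single-machine simulation of the map, shuffle and reduce phases. The `emit` calls are modelled by returning lists, and `run_mapreduce` is an illustrative driver, not part of any framework:

```python
from collections import defaultdict

def mapper(_, text):
    # Map: emit (word, 1) for every word in the chunk
    return [(word, 1) for word in text.split()]

def reducer(key, values):
    # Reduce: sum all the counts collected for one word
    return (key, sum(values))

def run_mapreduce(inputs):
    # Map phase: one mapper call per input chunk
    intermediate = []
    for key, value in inputs:
        intermediate.extend(mapper(key, value))
    # Shuffle phase: group all emitted values by key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reducer call per distinct key
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))

result = run_mapreduce([(1, "queen mary queen"), (2, "mary school")])
print(result)   # {'mary': 2, 'queen': 2, 'school': 1}
```

In a real cluster, the map calls in the first loop and the reduce calls in the last line are what runs in parallel on different nodes; the grouping step in the middle is the synchronisation barrier.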
How MapReduce parallelises
Input data is partitioned into processable chunks
One Map job is executed per chunk
● All can be parallelised (depends on number of nodes)
One Reduce Job is executed for each distinct key emitted by the Mappers
● All can be parallelised (partitioned ‘evenly’ among nodes)
Computing nodes first work on Map jobs. After all have completed, a synchronization step occurs, and they
start running Reduce jobs
Word Count Example
MapReduce: A Brief History
Inspiration from functional programming (e.g., Lisp)
map() function
● Applies a function to each individual value of a sequence to create a new list of values
● Example: square x = x * x
map square [1,2,3,4,5] returns [1,4,9,16,25]
reduce() function
● Combines all elements of a sequence using an accumulator function
● Example: reduce (+) [1,2,3,4,5] returns 15 (the sum of the elements)
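The same functional-programming primitives exist in Python's standard library, which makes the lineage easy to demonstrate:

```python
from functools import reduce

# map: apply a function to each element of a sequence
squares = list(map(lambda x: x * x, [1, 2, 3, 4, 5]))

# reduce: fold the sequence into one value with an accumulator function
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4, 5])

print(squares)   # [1, 4, 9, 16, 25]
print(total)     # 15
```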
MapReduce Benefits
● High-level parallel programming abstraction
● Framework implementations provide good performance results
● Greatly reduces parallel programming complexity
● However, it is not suitable for every parallel algorithm!
Synchronization and message passing

[Diagram: input key/value pairs are read from each data store (1 to n) and processed by parallel map tasks, each emitting (key, values...) pairs. A barrier step aggregates the intermediate values by output key, and one reduce task per key produces the final values.]
MapReduce Runtime System
● Partitions input data
● Schedules execution across a set of machines
● Handles load balancing
● Shuffles, partitions and sorts data between the Map and Reduce steps
● Handles machine failure transparently
● Manages inter-process communication
Contents

● Introduction to MapReduce
● Programming patterns
● Aggregate computations
Shuffle and Sort steps

Every key-value item generated by the mappers is collected
● Items are transferred over the network
Same-key items are grouped into a list of values
Data is partitioned among the Reducers
Data is copied over the network to each Reducer
The data provided to each Reducer is sorted according to the keys
Word Count Example
Shuffle and Sort
Shuffle and Sort – At each Mapper

All emitted key-value pairs are collected
● In-memory buffer (100MB default size), spilling to disk when full
Key/value pairs are partitioned depending on the target reducer
● Partitioning aims at an even split of keys
(Optionally) a Combiner runs on each partition
Output is made available to the Reducers through HTTP server threads
Shuffle and Sort – At each Reducer

The reducer downloads output from the mappers
● Potentially all Mappers are contacted
Partial values from each Mapper are merged
Keys are sorted and fed as input to the Reducer
● List of <k2, list<v2>>, sorted by k2
The cost of communications

Parallelising Map and Reduce jobs allows algorithms to scale close to linearly
One potential bottleneck for MapReduce programs is the cost of the Shuffle and Sort operations
● Data has to be copied over network connections
● All the keys emitted by the mappers are transferred
● Sorting large numbers of elements can be costly
The Combiner is an additional optional step that is executed before these steps
The Combiner

The combiner acts as a preliminary reducer
It is executed at each mapper node just before sending the key-value pairs for shuffling
● Reduces the number of emitted items
Improves efficiency
It cannot be mandatory (the algorithm must work correctly even if the Combiner is never invoked)
Frequently the same as the Reducer function (but not always)
The Combiner
Word count combiner
def mapper(_, text):
    words = text.split()
    for word in words:
        emit(word, 1)

def reducer(key, values):
    emit(key, sum(values))

def combiner(key, values):
    emit(key, sum(values))
Combiner Rules
The combiner has the same structure as the reducer (same input parameters) but must comply with these
rules
● Idempotent - The number of times the combiner is applied can't change the output
● Associative and commutative - The grouping and order of the inputs can't change the output
● Side-effect free - Combiners can't have side effects (or they won't be idempotent).
● Preserve the sort order - They can't change the keys to disrupt the sort order
● Preserve the partitioning - They can't change the keys to change the partitioning to the Reducers
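These rules can be checked mechanically for the word-count combiner: applying it zero times, once, or separately per partition must leave the reducer's output unchanged. A small sketch of that check (the function shapes mirror the pseudocode above, with `emit` modelled by return values):

```python
values = [1, 1, 1, 1, 1, 1]   # counts emitted by the mappers for one word

def combiner(vals):
    return [sum(vals)]        # partial sum, same shape as reducer input

def reducer(vals):
    return sum(vals)

no_combine    = reducer(values)                          # combiner skipped entirely
one_combine   = reducer(combiner(values))                # combiner applied once
split_combine = reducer(combiner(values[:2]) + combiner(values[2:]))  # applied per partition

# All three paths must agree, or the combiner is not safe to use
assert no_combine == one_combine == split_combine == 6
```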
Word Count Example
Word Count Example with Combiner
Inverted Index
Goal: Generate index from a dataset to allow faster searches for specific features
Examples:
● Building index from a textbook.
● Finding all websites that match a search term
Inverted Index Structure
Inverted Index Pattern
Inverted Index Pseudocode
def mapper(docId, text):
    features = find_features(text)
    for feature in features:
        emit(feature, docId)

def reducer(feature, docIds):
    emit(feature, formatNicely(docIds))
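Run locally, the pattern produces the index directly. In this sketch, `find_features` is taken to be simple word extraction and `formatNicely` a sorted, de-duplicated posting list; both definitions are illustrative stand-ins for the names in the pseudocode above:

```python
from collections import defaultdict

docs = {"d1": "big data processing", "d2": "data model"}

# Map phase: emit (feature, docId) for every word in every document
postings = defaultdict(list)
for doc_id, text in docs.items():
    for feature in text.split():        # find_features == word extraction here
        postings[feature].append(doc_id)

# Reduce phase: format the posting list for each feature
inverted = {feature: sorted(set(ids)) for feature, ids in postings.items()}

print(inverted["data"])   # ['d1', 'd2']
```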
Filtering
Goal: Filter out records/fields that are not of interest for further computation.
Speeds up the actual computation thanks to the reduced size of the dataset
Examples:
● Distributed grep
● Tracking a thread of events (e.g. logs from the same user)
● Data cleansing
Mapper-only job
Filtering Structure
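Because no grouping is needed, the whole pattern is a single map function and the reducer is omitted. A sketch of a distributed-grep-style filter (the pattern and log lines are invented for illustration):

```python
import re

def mapper(_, line):
    # Emit the line unchanged only if it matches; no reducer is required
    if re.search(r"ERROR", line):
        return [(None, line)]
    return []

log = ["INFO start", "ERROR disk full", "INFO done"]
errors = [value for line in log for _, value in mapper(None, line)]

print(errors)   # ['ERROR disk full']
```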
Top Ten Elements
Goal: Retrieve a small number of records, relative to a ranking
Examples:
● build top sellers view
● find outliers in the data.
Top Ten Pattern
Example Top Ten MapReduce
def mapper(_, row):
    studentId = parseId(row)
    grade = parseGrade(row)
    pair = (studentId, grade)
    emit(None, pair)

def reducer(_, pairs):
    top10 = pairs.sort().getTop(10)
    rank = 1
    for student in top10:
        emit(rank, student[0])
        rank += 1
Top Ten Structure
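A local sketch of the same pattern, using Python's heapq to keep only the ten largest grades at the single reducer. The input rows are invented for illustration, and `emit` is again modelled by return values:

```python
import heapq

rows = [("s%02d" % i, i * 5) for i in range(1, 21)]   # (studentId, grade) pairs

def mapper(_, pair):
    # Single key (None): every pair is routed to the same reducer
    return [(None, pair)]

def reducer(_, pairs):
    # Keep only the ten highest grades, then assign ranks 1..10
    top10 = heapq.nlargest(10, pairs, key=lambda p: p[1])
    return [(rank, student_id) for rank, (student_id, _) in enumerate(top10, start=1)]

collected = [pair for row in rows for _, pair in mapper(None, row)]
ranking = reducer(None, collected)

print(ranking[0])   # (1, 's20')
```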
Contents

● Introduction to MapReduce
● Programming patterns
● Aggregate computations
Top Ten Performance
How many Reducers?
● Performance issues?
What happens if we don’t use Combiners?
Performance depends greatly on the number of elements (and to a lesser extent on the size of the data)
Minimum requirement: the ranking data for a whole input split must fit into the memory of a single Mapper
Numerical Summarisation
Goal: Calculate aggregate statistical values over a dataset
Extract features from the dataset elements, compute the same function for each feature
Examples:
● Count occurrences
● Maximum / minimum values
● Average / median / standard deviation
Sample dataset: China’s Air Quality sensors
Sample numerical summarisation questions
● Compute what is the maximum PM2.5 registered for each location provided in the dataset
● Return the average AQI registered each week
● Compute for each day of the week the number of locations where the PM2.5 index exceeded 150
Numerical Summarisation Structure
Numerical Summarisation Map and Reduce functions
Numerical Summarisation Combiner?
Computing Averages
Combining Average
Average is NOT an associative operation
● It cannot be executed partially with Combiners
Solution: Change the Mapper results
● Emit aggregated quantities together with the number of elements
● Mapper: for mark values (100, 100, 20), emit (100,1), (100,1), (20,1)
● Combiner: adds aggregates and element counts, emits (220,3)
● Reducer: adds aggregates and computes the average
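The fix can be sketched directly: mappers emit (value, 1) pairs, the combiner adds both components, and only the reducer divides. The function shapes are illustrative, with `emit` modelled by return values:

```python
def combiner(pairs):
    # Adds aggregates and element counts; safe to apply any number of times
    total = sum(value for value, _ in pairs)
    count = sum(n for _, n in pairs)
    return [(total, count)]

def reducer(pairs):
    # Only the final step divides, so partial combining never skews the result
    total = sum(value for value, _ in pairs)
    count = sum(n for _, n in pairs)
    return total / count

marks = [(100, 1), (100, 1), (20, 1)]   # mapper output for marks 100, 100, 20

direct   = reducer(marks)               # combiner skipped
combined = reducer(combiner(marks))     # combiner applied: (220, 3)

assert direct == combined == 220 / 3
```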
