ECS765P W2
● Introduction to MapReduce
● Programming patterns
● Aggregate computations
Our first parallel program (Reminder from last week)
● Task: count the number of occurrences of each word in one document
● Input: text document
● Output: sequence of: word, count
The 56
School 23
Queen 10
● Collection Stage: Not applicable in this case
● Ingestion Stage: Move the file to the data lake with an applicable protocol, e.g. HTTP/FTP
● Preparation Stage: Remove characters that might confuse the algorithm, e.g. quotation marks (see the sketch below)
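As an illustration of the preparation stage, a minimal sketch in Python (assuming simple punctuation stripping is acceptable for the task; the name clean_text is illustrative):

import re

def clean_text(text):
    # Replace anything that is not a word character or whitespace
    # (quotation marks, commas, ...) with a space, so the counter
    # does not treat Mary. and Mary as different words
    return re.sub(r"[^\w\s]", " ", text)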
Program Input
QMUL has been ranked 9th among multi-faculty institutions in the UK, according to tables published today in the Times Higher
Education.
A total of 154 institutions were submitted for the exercise.
The 2008 RAE confirmed Queen Mary to be one of the rising stars of the UK research environment and the REF 2014 shows that this
upward trajectory has been maintained.
Professor Simon Gaskell, President and Principal of Queen Mary, said: “This is an outstanding result for Queen Mary. We have built
upon the progress that was evidenced by the last assessment exercise and have now clearly cemented our position as one of the UK’s
foremost research-led universities. This achievement is derived from the talent and hard work of our academic staff in all disciplines,
and the colleagues who support them.”
The Research Excellence Framework (REF) is the system for assessing the quality of research in UK higher education institutions.
Universities submit their work across 36 panels of assessment. Research is judged according to quality of output (65 per cent),
environment (15 per cent) and, for the first time, the impact of research (20 per cent).
How to solve the problem?
How to solve the problem on a single processor?
# input: text, a string with the complete text
words = text.split()
count = dict()
for word in words:
    if word in count:
        count[word] = count[word] + 1
    else:
        count[word] = 1
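An equivalent, more idiomatic version using Python's standard library:

from collections import Counter

count = Counter(text.split())  # word -> number of occurrences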
Parallelising the problem
Splitting the load into subtasks:
● Split sentences/lines into words
● Count all the occurrences of each word
…What do we do with the intermediate results?
● Merge them into a single collection
● Possibly requires parallelism too (see the sketch below)
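A sketch of this split-and-merge idea on a single machine, using Python's multiprocessing (the chunking here is illustrative; a real cluster framework would distribute the subtasks for us):

from multiprocessing import Pool
from collections import Counter

def count_chunk(lines):
    # Subtask: split each line into words and count locally
    c = Counter()
    for line in lines:
        c.update(line.split())
    return c

if __name__ == "__main__":
    # Hypothetical pre-split input: one list of lines per subtask
    chunks = [["the school the"], ["queen the school"]]
    with Pool() as pool:
        partials = pool.map(count_chunk, chunks)
    # Merge the intermediate results into a single collection
    total = sum(partials, Counter())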
MapReduce
“A simple and powerful interface that enables automatic parallelization and distribution of large-scale
computations, combined with an implementation of this interface that achieves high performance on
large clusters of commodity PCs.” (Dean and Ghemawat, “MapReduce: Simplified Data Processing on
Large Clusters”, Google Inc.)
More simply, MapReduce is:
● A parallel programming model and associated implementation.
MapReduce Pattern
MapReduce Programming Model
Data is processed with map() and reduce() functions
● The map() function is called on every item in the input and emits a series of intermediate key/value pairs
● All the emitted values for a given key are grouped together
● The reduce() function is called once for each unique key, with its collected values, and emits a partial
result that is added to the output
Example wordcount (pythonish pseudocode)
def mapper(_, text):
    words = text.split()
    for word in words:
        emit(word, 1)
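A matching reducer in the same pseudocode; the framework groups all the values emitted for each word before calling it:

def reducer(word, counts):
    # counts: all the 1s emitted for this word by every mapper
    emit(word, sum(counts))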
[Figure: map() tasks running in parallel over Data store 1 … Data store n]
● Introduction to MapReduce
● Programming patterns
● Aggregate computations
Shuffle and Sort steps
Parallelising Map and Reduce jobs allows algorithms to scale close to linearly
One potential bottleneck for MapReduce programs is the cost of the Shuffle and Sort operations
● Data has to be copied over the network
● All the key/value pairs emitted by the mappers must be transferred
● Sorting large numbers of elements can be costly
The Combiner is an optional additional step that is executed before these operations
The Combiner
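For wordcount, a combiner can be sketched in the same pythonish pseudocode; it runs over each mapper's local output before the shuffle, so one partial count per word crosses the network instead of one pair per occurrence:

def combiner(word, counts):
    # Pre-aggregate locally; safe because addition is associative
    # and commutative
    emit(word, sum(counts))

Note the combiner has the same signature and body as the reducer here; that is typical whenever the aggregation function is associative and commutative.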
● Introduction to MapReduce
● Programming patterns
● Aggregate computations
Top Ten Performance
How many Reducers?
● Performance issues?
What happens if we don’t use Combiners?
Performance depends greatly on the number of elements (and to a lesser extent on the size of the data)
Minimum requirement: the ranking data for a whole input split must fit into the memory of a single Mapper
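One common way to meet that requirement is to keep a small in-memory heap per mapper (a sketch in the same pythonish pseudocode; close() stands in for whatever end-of-split hook the framework provides):

import heapq

top = []  # this mapper's local candidates

def mapper(_, record):
    # record = (score, item); keep only the ten largest seen so far
    heapq.heappush(top, record)
    if len(top) > 10:
        heapq.heappop(top)  # drop the current smallest

def close():
    # At the end of the split, emit the local top ten under a single
    # shared key so that one reducer sees all the candidates
    for record in top:
        emit(None, record)

def reducer(_, records):
    # At most ten records arrive per mapper; pick the global top ten
    for record in heapq.nlargest(10, records):
        emit(record, None)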
Numerical Summarisation
Goal: Calculate aggregate statistical values over a dataset
Extract features from the dataset elements and compute the same function for each feature
Examples:
● Count occurrences
● Maximum / minimum values
● Average / median / standard deviation
Sample dataset: China’s Air Quality sensors
Sample numerical summarisation questions
● Compute the maximum PM2.5 value registered for each location in the dataset
● Return the average AQI registered each week
● Compute, for each day of the week, the number of locations where the PM2.5 index exceeded 150
Numerical Summarisation Structure
Numerical Summarisation Map and Reduce functions
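For the first question above (maximum PM2.5 per location), a sketch in the same pythonish pseudocode; the field names location and pm25 are assumptions about the record layout:

def mapper(_, record):
    # Feature extraction: key = location, value = one PM2.5 reading
    emit(record.location, record.pm25)

def reducer(location, readings):
    # The same aggregation function is computed for every key
    emit(location, max(readings))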
Numerical Summarisation Combiner?
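Because max is associative and commutative, the Reducer above can be reused unchanged as a Combiner: each Mapper then forwards a single local maximum per location instead of every reading. Averages are different, as the next slides show.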
Computing Averages
Combining Average
Average is NOT an associative operation
● Cannot be partially computed by the Combiners
Solution: change the Mapper results
● Emit aggregated quantities together with the number of elements
● Mapper: for mark values (100, 100, 20), emits (100,1), (100,1), (20,1)
● Combiner: adds aggregates and element counts; emits (220,3)
● Reducer: adds aggregates and element counts, then computes the average (see the sketch below)
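Putting the whole scheme together, a sketch in the same pythonish pseudocode:

def mapper(_, mark):
    emit("marks", (mark, 1))  # (aggregate, number of elements)

def combiner(key, pairs):
    total = sum(a for a, n in pairs)
    count = sum(n for a, n in pairs)
    emit(key, (total, count))  # (100,1),(100,1),(20,1) -> (220,3)

def reducer(key, pairs):
    total = sum(a for a, n in pairs)
    count = sum(n for a, n in pairs)
    emit(key, total / count)  # 220 / 3 = 73.33...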