Map Reduce Design and EXECUTION FRAMEWORK

The document discusses MapReduce and how to use it to solve problems involving large datasets in parallel. It covers the basic MapReduce model including mappers, reducers and combiners. It also discusses using MRJob to implement MapReduce jobs in Python and defines multiple steps for processing data.

Uploaded by

l200908
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
17 views21 pages

Map Reduce Design and EXECUTION FRAMEWORK

The document discusses MapReduce and how to use it to solve problems involving large datasets in parallel. It covers the basic MapReduce model including mappers, reducers and combiners. It also discusses using MRJob to implement MapReduce jobs in Python and defines multiple steps for processing data.

Uploaded by

l200908
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 21

MAP REDUCE (continued)
Refinement: Combiners
■ Back to our word counting example:
– A combiner aggregates the values for each key emitted by a single mapper (on one machine) before the shuffle
– Much less data needs to be copied and shuffled!
■ The combiner is usually the same as the reduce function
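As a toy sketch (not the actual job code), local aggregation of one mapper's word-count output can be illustrated in plain Python:

```python
from collections import Counter

# Toy output of a single mapper: each word occurrence is emitted with count 1.
pairs = [("the", 1), ("quick", 1), ("the", 1), ("fox", 1), ("the", 1)]

# The combiner sums counts per key locally, so only one pair per distinct
# word leaves this machine instead of one pair per occurrence.
combined = Counter()
for word, n in pairs:
    combined[word] += n

print(sorted(combined.items()))  # [('fox', 1), ('quick', 1), ('the', 3)]
```

Here five emitted pairs shrink to three before the shuffle; on real documents the savings are far larger.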
Word Count Using MapReduce

Pseudocode:

map(key, value):
    // key: document name; value: text of the document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: a list of counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)

The same job in Python with mrjob:

from mrjob.job import MRJob

class WordCount(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield (word, 1)

    def combiner(self, word, counts):
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    WordCount.run()
Computing the Mean: Version 1
■ Mean(1, 2, 3, 4, 5) = (1 + 2 + 3 + 4 + 5) / 5 = 3
– But averaging partial means gives a different (wrong) answer:
– Mean(1, 2) = (1 + 2) / 2 = 1.5; Mean(3, 4, 5) = (3 + 4 + 5) / 3 = 4
– Mean(1.5, 4) = 2.75 ≠ 3

Computing the Mean
■ Can we use the reducer as a combiner?
Computing the Mean: Version 2
■ Does this work? No — a combiner that emits partial means loses the counts, so the reducer cannot weight them correctly.
Computing the Mean: Version 3
■ Fixed? Yes — the combiner emits (sum, count) pairs, and the reducer divides the total sum by the total count.
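The failure and the fix can be shown with plain arithmetic (toy numbers, no mrjob plumbing):

```python
# Naive combining: machine A averages [1, 2], machine B averages [3, 4, 5];
# the reducer then averages the partial means -> wrong answer.
naive = ((1 + 2) / 2 + (3 + 4 + 5) / 3) / 2
print(naive)            # 2.75, not the true mean 3.0

# Fix: combiners emit (partial_sum, partial_count) pairs; the reducer adds
# the sums and the counts before dividing, so the result is exact.
partials = [(1 + 2, 2), (3 + 4 + 5, 3)]
total = sum(s for s, c in partials)
count = sum(c for s, c in partials)
print(total / count)    # 3.0
```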
[Figure: input temperature records → average temperatures as output]
Example: Analysis of Weather Dataset
■ Data from NCDC (National Climatic Data Center): a large volume of log data collected by weather sensors, e.g. temperature
■ Data format
– Line-oriented ASCII format with many elements
– We focus on the temperature element
– Data files are organized by date and weather station

Sample records (year and temperature fields embedded in each line):

0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...

[Figure: contents of data files and list of data files]
Example: Analysis of Weather Dataset
■ Query: What is the highest recorded global temperature for each year in the dataset?
– A complete run for the century took 42 minutes on a single EC2 High-CPU Extra Large instance
– To speed up the processing, we need to run parts of the program in parallel

Hadoop MapReduce
■ To use MapReduce, we need to express our query as a MapReduce job
■ A MapReduce job consists of:
– a Map function
– a Reduce function
■ Each function takes key-value pairs as input and produces key-value pairs as output
– The types of the input and output pairs are chosen by the programmer
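In general, the two functions follow these type signatures, where K1/V1, K2/V2, and K3/V3 are the programmer-chosen key and value types:

map:    (K1, V1)       → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)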
MapReduce Design of NCDC Example
■ Map phase
– Text input format of the dataset files:
■ Key: the byte offset of the line within the file (unneeded here)
■ Value: the text of each line
– Pull out the year and the temperature
■ The map phase is simply a data-preparation phase
■ Drop bad records (filtering)

[Figure: input file → map function, (key, value) in and (key, value) out]
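The mapper's parsing step might look like the sketch below; the fixed-width offsets are assumptions based on the standard NCDC format and may need adjusting for a particular dataset:

```python
MISSING = 9999  # the NCDC sentinel for a missing temperature reading

def parse_record(line):
    """Pull the year and temperature out of one NCDC log line.

    Assumed offsets: year at columns 15-19, signed temperature in
    tenths of a degree Celsius at columns 87-92, quality code at 92.
    """
    year = line[15:19]
    temp = int(line[87:92])      # int() accepts a leading '+' or '-'
    quality = line[92:93]
    # Filtering: drop missing readings and bad quality codes
    if temp == MISSING or quality not in "01459":
        return None
    return (year, temp)
```

The map function would emit the (year, temp) pair whenever parse_record returns one, and emit nothing otherwise.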
MapReduce Design of NCDC Example
■ The output from the map function is processed by the MapReduce framework: it is sorted and grouped by key
■ The reduce function iterates through each list of values and picks the maximum

[Figure: sort and group by → reduce]
■ Any improvement that you can suggest?
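One improvement: since max is associative and commutative, the reduce function can be reused unchanged as a combiner, cutting shuffle traffic. A toy simulation of the idea (made-up temperatures in tenths of a degree):

```python
# Simulated output of one mapper for the max-temperature job.
mapper_output = [("1949", 111), ("1949", 78),
                 ("1950", 0), ("1950", 22), ("1950", -11)]

# Without a combiner every pair is shuffled; with one, each mapper
# pre-aggregates locally.  Because max is associative and commutative,
# combining then reducing gives the same result as reducing everything.
def combine(pairs):
    best = {}
    for year, temp in pairs:
        best[year] = max(temp, best.get(year, temp))
    return sorted(best.items())

print(combine(mapper_output))   # [('1949', 111), ('1950', 22)]
```

Only one pair per year leaves the mapper instead of one pair per record.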
MRJob
■ A job is defined by a class that inherits from MRJob. This class contains methods that define the steps of your job.
■ A "step" consists of a mapper, a combiner, and a reducer.
– All of these are optional, though you must have at least one.
– So you could have a step that's just a mapper, or just a combiner and a reducer.
■ When you only have one step, all you have to do is write methods called mapper(), combiner(), and reducer().
MRJob
■ Most of the time, you’ll need more than one step in your job.
■ To define multiple steps, override steps() to return a list of MRSteps.
