Map Reduce Design and EXECUTION FRAMEWORK

The document discusses MapReduce and how to use it to solve problems involving large datasets in parallel. It covers the basic MapReduce model including mappers, reducers and combiners. It also discusses using MRJob to implement MapReduce jobs in Python and defines multiple steps for processing data.

Uploaded by

l200908
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
17 views21 pages

Map Reduce Design and EXECUTION FRAMEWORK

The document discusses MapReduce and how to use it to solve problems involving large datasets in parallel. It covers the basic MapReduce model including mappers, reducers and combiners. It also discusses using MRJob to implement MapReduce jobs in Python and defines multiple steps for processing data.

Uploaded by

l200908
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 21

MAP REDUCE (continued)
Refinement: Combiners
■ Back to our word counting example:
– A combiner aggregates the values for each key emitted by a single mapper (on one machine) before the shuffle
– Much less data needs to be copied and shuffled!
■ The combiner is usually the same as the reduce function
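As a toy sketch (not the actual job code), local aggregation of one mapper's word-count output can be illustrated in plain Python:

```python
from collections import Counter

# Toy output of a single mapper: each word occurrence is emitted with count 1.
pairs = [("the", 1), ("quick", 1), ("the", 1), ("fox", 1), ("the", 1)]

# The combiner sums counts per key locally, so only one pair per distinct
# word leaves this machine instead of one pair per occurrence.
combined = Counter()
for word, n in pairs:
    combined[word] += n

print(sorted(combined.items()))  # [('fox', 1), ('quick', 1), ('the', 3)]
```

Here five emitted pairs shrink to three before the shuffle; on real documents the savings are far larger.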
Word Count Using MapReduce

Pseudocode:

map(key, value):
    // key: document name; value: text of the document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: a list of counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)

The same job in Python with mrjob:

from mrjob.job import MRJob

class WordCount(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield (word, 1)

    def combiner(self, word, counts):
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    WordCount.run()
Computing the Mean: Version 1
■ Mean(1, 2, 3, 4, 5) = (1 + 2 + 3 + 4 + 5) / 5 = 3
– But averaging partial means gives a different (wrong) answer:
– Mean(1, 2) = (1 + 2) / 2 = 1.5; Mean(3, 4, 5) = (3 + 4 + 5) / 3 = 4
– Mean(1.5, 4) = 2.75 ≠ 3

Computing the Mean
■ Can we use the reducer as a combiner?
Computing the Mean: Version 2
■ Does this work? No — a combiner that emits partial means loses the counts, so the reducer cannot weight them correctly.
Computing the Mean: Version 3
■ Fixed? Yes — the combiner emits (sum, count) pairs, and the reducer divides the total sum by the total count.
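The failure and the fix can be shown with plain arithmetic (toy numbers, no mrjob plumbing):

```python
# Naive combining: machine A averages [1, 2], machine B averages [3, 4, 5];
# the reducer then averages the partial means -> wrong answer.
naive = ((1 + 2) / 2 + (3 + 4 + 5) / 3) / 2
print(naive)            # 2.75, not the true mean 3.0

# Fix: combiners emit (partial_sum, partial_count) pairs; the reducer adds
# the sums and the counts before dividing, so the result is exact.
partials = [(1 + 2, 2), (3 + 4 + 5, 3)]
total = sum(s for s, c in partials)
count = sum(c for s, c in partials)
print(total / count)    # 3.0
```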
[Figure: input temperature records → average temperatures as output]
Example: Analysis of Weather Dataset
■ Data from NCDC (National Climatic Data Center): a large volume of log data collected by weather sensors, e.g. temperature
■ Data format
– Line-oriented ASCII format with many elements
– We focus on the temperature element
– Data files are organized by date and weather station

Sample records (year and temperature fields embedded in each line):

0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...

[Figure: contents of data files and list of data files]
Example: Analysis of Weather Dataset
■ Query: What is the highest recorded global temperature for each year in the dataset?
– A complete run for the century took 42 minutes on a single EC2 High-CPU Extra Large instance
– To speed up the processing, we need to run parts of the program in parallel

Hadoop MapReduce
■ To use MapReduce, we need to express our query as a MapReduce job
■ A MapReduce job consists of:
– a Map function
– a Reduce function
■ Each function takes key-value pairs as input and produces key-value pairs as output
– The types of the input and output pairs are chosen by the programmer
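In general, the two functions follow these type signatures, where K1/V1, K2/V2, and K3/V3 are the programmer-chosen key and value types:

map:    (K1, V1)       → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)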
MapReduce Design of NCDC Example
■ Map phase
– Text input format of the dataset files:
■ Key: the byte offset of the line within the file (unneeded here)
■ Value: the text of each line
– Pull out the year and the temperature
■ The map phase is simply a data-preparation phase
■ Drop bad records (filtering)

[Figure: input file → map function, (key, value) in and (key, value) out]
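The mapper's parsing step might look like the sketch below; the fixed-width offsets are assumptions based on the standard NCDC format and may need adjusting for a particular dataset:

```python
MISSING = 9999  # the NCDC sentinel for a missing temperature reading

def parse_record(line):
    """Pull the year and temperature out of one NCDC log line.

    Assumed offsets: year at columns 15-19, signed temperature in
    tenths of a degree Celsius at columns 87-92, quality code at 92.
    """
    year = line[15:19]
    temp = int(line[87:92])      # int() accepts a leading '+' or '-'
    quality = line[92:93]
    # Filtering: drop missing readings and bad quality codes
    if temp == MISSING or quality not in "01459":
        return None
    return (year, temp)
```

The map function would emit the (year, temp) pair whenever parse_record returns one, and emit nothing otherwise.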
MapReduce Design of NCDC Example
■ The output from the map function is processed by the MapReduce framework: it is sorted and grouped by key
■ The reduce function iterates through each list of values and picks the maximum

[Figure: sort and group by → reduce]
■ Any improvement that you can suggest?
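One improvement: since max is associative and commutative, the reduce function can be reused unchanged as a combiner, cutting shuffle traffic. A toy simulation of the idea (made-up temperatures in tenths of a degree):

```python
# Simulated output of one mapper for the max-temperature job.
mapper_output = [("1949", 111), ("1949", 78),
                 ("1950", 0), ("1950", 22), ("1950", -11)]

# Without a combiner every pair is shuffled; with one, each mapper
# pre-aggregates locally.  Because max is associative and commutative,
# combining then reducing gives the same result as reducing everything.
def combine(pairs):
    best = {}
    for year, temp in pairs:
        best[year] = max(temp, best.get(year, temp))
    return sorted(best.items())

print(combine(mapper_output))   # [('1949', 111), ('1950', 22)]
```

Only one pair per year leaves the mapper instead of one pair per record.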
MRJob
■ A job is defined by a class that inherits from MRJob. This class contains methods that define the steps of your job.
■ A "step" consists of a mapper, a combiner, and a reducer.
– All of these are optional, though you must have at least one.
– So you could have a step that's just a mapper, or just a combiner and a reducer.
■ When you only have one step, all you have to do is write methods called mapper(), combiner(), and reducer().
MRJob
■ Most of the time, you’ll need more than one step in your job.
■ To define multiple steps, override steps() to return a list of MRSteps.
