M4 06 MapReduce
• Challenges at Google
• Input data too large -> requires distributed computing
• Most computations are straightforward (log processing, inverted
index) -> boring, repetitive work
Reduce with '+' over (0 1 2 3) -> 6
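The fold above can be sketched in plain Java (the class name is illustrative, not from the lecture):

```java
import java.util.List;

public class ReduceExample {
    public static void main(String[] args) {
        // Fold the list (0 1 2 3) with '+', starting from the identity 0.
        int sum = List.of(0, 1, 2, 3).stream().reduce(0, Integer::sum);
        System.out.println(sum); // prints 6
    }
}
```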
Programming Model
• Programmers only need to specify two functions:
• Map Function
map (in_key, in_value) -> list(out_key, intermediate_value)
• Process input key/value pair
• Produce set of output key/intermediate value pairs
• Reduce Function
reduce (out_key, intermediate_value) -> list(out_value)
• Process intermediate key/value pairs
• Combines intermediate values per unique key
• Produce a set of merged output values (usually just one)
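A minimal in-memory sketch of this model in plain Java (not Hadoop code; all names are illustrative): map emits (word, 1) pairs, the framework groups intermediate values by key, and reduce sums the values per unique key:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    // Map: process one input (key, value) pair,
    // emit a list of (out_key, intermediate_value) pairs.
    static List<Map.Entry<String, Integer>> map(String key, String value) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : value.split("\\s+")) {
            out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Reduce: combine the intermediate values for one unique key
    // into a merged output value (here, their sum).
    static int reduce(String key, List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> inputs = List.of("the quick fox", "the lazy dog");

        // Shuffle: group intermediate values by key, as the framework would.
        Map<String, List<Integer>> groups = new HashMap<>();
        for (String line : inputs) {
            for (Map.Entry<String, Integer> kv : map("doc", line)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
            }
        }

        // Reduce each unique key; "the" appears twice, everything else once.
        groups.forEach((k, vs) -> System.out.println(k + "\t" + reduce(k, vs)));
    }
}
```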
[input (key, value)] -> Map Function -> [intermediate (key, value)]
Programming Model
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
JobClient.runJob(conf);
Input and Output Formats
• A Map/Reduce job may specify how its input is to be read by specifying an
InputFormat to be used
• A Map/Reduce job may specify how its output is to be written by specifying an
OutputFormat to be used
• These default to TextInputFormat and TextOutputFormat, which process
line-based text data
• Another common choice is SequenceFileInputFormat and
SequenceFileOutputFormat for binary data
• These are file-based
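TextInputFormat's line-based behavior can be sketched in plain Java (an illustrative mock, not the Hadoop class itself): each record handed to the mapper is (byte offset of the line, line text):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LineRecords {
    // Mimic TextInputFormat: key = byte offset of the line, value = line text.
    static Map<Long, String> toRecords(String data) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : data.split("\n", -1)) {
            records.put(offset, line);
            offset += line.getBytes().length + 1; // +1 for the newline
        }
        return records;
    }

    public static void main(String[] args) {
        toRecords("hello\nworld").forEach((k, v) -> System.out.println(k + " -> " + v));
        // 0 -> hello
        // 6 -> world
    }
}
```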
How many Maps and Reduces
• Maps
• Usually as many as the number of HDFS blocks being processed; this is the default
• Otherwise the number of maps can be specified as a hint
• The number of maps can also be controlled by specifying the minimum split size
• The actual sizes of the map inputs are computed by:
• max(min(block_size,data/#maps), min_split_size)
• Reduces
• Unless the amount of data being processed is small, a good rule of thumb is:
• 0.95 * num_nodes * mapred.tasktracker.tasks.maximum
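The two formulas above can be checked with a small calculation (the cluster sizes and byte counts here are made-up examples, not defaults from the lecture):

```java
public class TaskCounts {
    // Map input size: max(min(block_size, data/#maps), min_split_size), in bytes.
    static long splitSize(long blockSize, long totalData, long numMaps, long minSplitSize) {
        return Math.max(Math.min(blockSize, totalData / numMaps), minSplitSize);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Example: 1 GB of input, 64 MB blocks, a hint of 10 maps, 1 MB minimum split.
        // data/#maps is ~102 MB, so the block size (64 MB) wins the min().
        long split = splitSize(64 * mb, 1024 * mb, 10, 1 * mb);
        System.out.println(split / mb + " MB per map");

        // Reduces: 0.95 * num_nodes * mapred.tasktracker.tasks.maximum
        // e.g. 20 nodes with 2 reduce slots each.
        int reduces = (int) (0.95 * 20 * 2);
        System.out.println(reduces + " reduces");
    }
}
```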
System Implementation: Overview
• Cluster characteristics
• 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory each
• Storage is on local IDE disks
• Infrastructure
• GFS: distributed file system manages data (SOSP'03)
• Job scheduling system: jobs made up of tasks, scheduler assigns tasks to
machines
Control Flow and Data Flow
[Figure: control and data flow of a MapReduce job]
• The user program submits the job to the scheduler; the master assigns map and
reduce tasks to workers
• Map workers read their input splits (Split 0, Split 1, Split 2) from GFS, write
intermediate results to local disk, and notify the master of their locations
• Reduce workers remotely read the intermediate data, sort it, and write the final
output files (Output File 0, Output File 1) to GFS
MapReduce Architecture
[Figure: master coordinates Map workers]
• Map-side combine runs locally on each map worker: combine(k1, list(v1))