3.4 Map Scheduler

MapReduce is a programming model that processes large datasets by dividing tasks into 'Map' and 'Reduce' phases, where 'Map' generates intermediate key/value pairs and 'Reduce' aggregates these values. The document explains the architecture and scheduling of MapReduce, particularly in the context of Hadoop, including code examples for Map and Reduce functions. Additionally, it discusses applications of MapReduce, fault tolerance mechanisms, and the YARN scheduler used in Hadoop 2.x and above.

Uploaded by

vanitha

MAP REDUCE SCHEDULERS AND MAP REDUCE ARCHITECTURE

What is MapReduce?
• Terms are borrowed from functional languages (e.g., Lisp)
Sum of squares:
• (map square '(1 2 3 4))
– Output: (1 4 9 16)
[processes each record sequentially and independently]

• (reduce + '(1 4 9 16))
– (+ 16 (+ 9 (+ 4 1)))
– Output: 30
[processes set of all records in batches]
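
The same sum of squares can be written with Java streams. This is only a sketch of the functional idea, not Hadoop code; the class and method names (SumOfSquares, mapSquare, reduceSum) are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class SumOfSquares {
    // "map": apply square to each record independently of the others.
    static List<Integer> mapSquare(List<Integer> xs) {
        return xs.stream().map(x -> x * x).collect(Collectors.toList());
    }

    // "reduce": fold the whole set of records into one value with +.
    static int reduceSum(List<Integer> xs) {
        return xs.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<Integer> squares = mapSquare(Arrays.asList(1, 2, 3, 4));
        System.out.println(squares);            // [1, 4, 9, 16]
        System.out.println(reduceSum(squares)); // 30
    }
}
```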

• Let’s consider a sample application: Wordcount
– You are given a huge dataset (e.g., a Wikipedia dump or all of Shakespeare’s works) and asked to list the count for each of the words in each of the documents therein
Map
• Process individual records to generate intermediate key/value pairs.

Input <filename, file text>:
Welcome Everyone
Hello Everyone

Intermediate output:
Key Value
Welcome 1
Everyone 1
Hello 1
Everyone 1
Map
• Process individual records in parallel to generate intermediate key/value pairs.

Input <filename, file text>:
MAP TASK 1: Welcome Everyone → Welcome 1, Everyone 1
MAP TASK 2: Hello Everyone → Hello 1, Everyone 1
Map
• Process a large number of individual records in parallel to generate intermediate key/value pairs.

Input <filename, file text>:
Welcome Everyone
Hello Everyone
Why are you here
I am also here
They are also here
Yes, it’s THEM!
The same people we were thinking of
…….

Intermediate output (MAP TASKS):
Welcome 1
Everyone 1
Hello 1
Everyone 1
Why 1
Are 1
You 1
Here 1
…….
Reduce
• Reduce processes and merges all intermediate values associated with each key

Intermediate input:
Key Value
Welcome 1
Everyone 1
Hello 1
Everyone 1

Output:
Key Value
Everyone 2
Hello 1
Welcome 1
Reduce
• Each key is assigned to one Reduce task
• Processes and merges all intermediate values in parallel by partitioning keys

REDUCE TASK 1: Everyone 1, Everyone 1 → Everyone 2
REDUCE TASK 2: Welcome 1, Hello 1 → Hello 1, Welcome 1
• Popular: hash partitioning, i.e., a key is assigned to reduce # = hash(key) % number of reduce servers
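
A minimal sketch of hash partitioning in plain Java. The class and method names are hypothetical. Masking off the sign bit before taking the remainder keeps the result non-negative even when hashCode() is negative; to my understanding Hadoop's default HashPartitioner uses the same arithmetic, but treat that as an assumption.

```java
final class HashPartitionSketch {
    // Clear the sign bit so the value is never negative, then take the
    // remainder modulo the number of reduce tasks.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        String[] keys = {"Welcome", "Everyone", "Hello", "Everyone"};
        for (String k : keys) {
            // Both "Everyone" records print the same reduce task number.
            System.out.println(k + " -> reduce task " + partitionFor(k, 2));
        }
    }
}
```

Because the result depends only on the key, every record with the same key lands on the same reduce task, which is what the Reduce phase relies on.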
Hadoop Code - Map
// Needs: java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  // Reused for every token instead of allocating a new object each time
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // value holds one line of the input text
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one); // emit <word, 1>
    }
  }
} // Source: http://developer.yahoo.com/hadoop/tutorial/module4.html#wordcount
Hadoop Code - Reduce
// Needs: java.io.IOException, java.util.Iterator,
// org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
public static class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  // Called once per key with all the values emitted for that key
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum)); // emit <word, count>
  }
} // Source: http://developer.yahoo.com/hadoop/tutorial/module4.html#wordcount
Hadoop Code - Driver
// Tells Hadoop how to run your Map-Reduce job
public void run(String inputPath, String outputPath)
    throws Exception {
  // The job. WordCount contains MapClass and ReduceClass.
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("mywordcount");
  // The keys are words (strings)
  conf.setOutputKeyClass(Text.class);
  // The values are counts (ints)
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(MapClass.class);
  conf.setReducerClass(ReduceClass.class);
  FileInputFormat.addInputPath(conf, new Path(inputPath));
  FileOutputFormat.setOutputPath(conf, new Path(outputPath));
  JobClient.runJob(conf);
} // Source: http://developer.yahoo.com/hadoop/tutorial/module4.html#wordcount
Some Applications of MapReduce
Distributed Grep:
– Input: large set of files
– Output: lines that match pattern

– Map – Emits a line if it matches the supplied pattern

– Reduce – Copies the intermediate data to output
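
The grep job can be simulated in a single process as below. This is a sketch under assumptions: it uses no Hadoop API, GrepSketch and its methods are hypothetical names, and "matches" is taken to mean that the regular expression is found anywhere in the line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class GrepSketch {
    // Map: emit the line only if the pattern is found in it.
    static List<String> map(String line, Pattern pattern) {
        List<String> out = new ArrayList<>();
        if (pattern.matcher(line).find()) {
            out.add(line);
        }
        return out;
    }

    // Reduce: identity, copies the intermediate data to the output.
    static List<String> reduce(List<String> intermediate) {
        return new ArrayList<>(intermediate);
    }

    static List<String> grep(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line, pattern));
        }
        return reduce(intermediate);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "Welcome Everyone", "Hello Everyone", "Why are you here");
        System.out.println(grep(lines, "Hello|here"));
    }
}
```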
Some Applications of MapReduce (2)
Reverse Web-Link Graph
– Input: Web graph: tuples (a, b) where (page a → page b)
– Output: For each page, list of pages that link to it

– Map – Processes the web graph and, for each input <source, target>, outputs <target, source>
– Reduce – Emits <target, list(source)>
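
A single-process sketch of the reversal, assuming edges are given as two-element String arrays {source, target}. The names are hypothetical, and the grouping map stands in for the shuffle that a real cluster would perform between Map and Reduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ReverseLinksSketch {
    // Map: <source, target> becomes <target, source>.
    static String[] map(String source, String target) {
        return new String[] {target, source};
    }

    // Shuffle + Reduce: group the sources by target, giving <target, list(source)>.
    static Map<String, List<String>> run(List<String[]> edges) {
        Map<String, List<String>> incoming = new TreeMap<>();
        for (String[] edge : edges) {
            String[] kv = map(edge[0], edge[1]);
            incoming.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return incoming;
    }

    public static void main(String[] args) {
        List<String[]> edges = Arrays.asList(
            new String[] {"a", "b"},
            new String[] {"c", "b"},
            new String[] {"b", "a"});
        System.out.println(run(edges)); // {a=[b], b=[a, c]}
    }
}
```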
Some Applications of MapReduce (3)
Count of URL access frequency
– Input: Log of accessed URLs, e.g., from proxy server
– Output: For each URL, % of total accesses for that URL

– Map – Processes web log and outputs <URL, 1>
– Multiple Reducers – Emit <URL, URL_count>
(So far, like Wordcount. But still need %)
– Chain another MapReduce job after the one above
– Map – Processes <URL, URL_count> and outputs <1, (<URL, URL_count>)>
– 1 Reducer – Sums up the URL_counts to calculate overall_count. Emits multiple <URL, URL_count/overall_count>
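
The two chained jobs can be sketched in plain Java as two functions. The names are hypothetical, nothing here uses Hadoop, and the result is a fraction between 0 and 1 rather than a percentage.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class UrlFrequencySketch {
    // Job 1: Map emits <URL, 1>; the reducers sum per URL to get <URL, URL_count>.
    static Map<String, Integer> countUrls(List<String> log) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String url : log) {
            counts.merge(url, 1, Integer::sum);
        }
        return counts;
    }

    // Job 2: every <URL, URL_count> goes to the single key 1, so one reducer
    // sees them all, sums overall_count, and emits <URL, URL_count / overall_count>.
    static Map<String, Double> fractions(Map<String, Integer> counts) {
        int overallCount = 0;
        for (int c : counts.values()) {
            overallCount += c;
        }
        Map<String, Double> out = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.put(e.getKey(), (double) e.getValue() / overallCount);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("/a", "/b", "/a", "/a");
        System.out.println(fractions(countUrls(log))); // {/a=0.75, /b=0.25}
    }
}
```

The second job needs a single reducer because overall_count is a global sum: no reducer that sees only part of the keys could compute it.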
Some Applications of MapReduce (4)
Sort
– Input: Series of (key, value) pairs
– Output: Sorted <value>s

– Map – <key, value> → <value, _> (identity)
– Reducer – <key, value> → <key, value> (identity)
– Partitioning function – partition keys across reducers based on ranges (can’t use hashing!)
• Take data distribution into account to balance reducer tasks
– Map task’s output is sorted (e.g., quicksort)
– Reduce task’s input is sorted (e.g., mergesort)
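
A minimal range partitioner, assuming integer keys and a sorted array of distinct split points (hypothetical names throughout). With k split points there are k+1 reducers, and reducer i only ever sees keys smaller than those of reducer i+1, so concatenating the reducer outputs in order gives a globally sorted result. Hashing would scatter neighbouring keys across reducers and lose that property. Picking the split points from a sample of the keys is one way to take the data distribution into account.

```java
import java.util.Arrays;

final class RangePartitionSketch {
    // splitPoints must be sorted ascending and distinct.
    // Reducer 0 gets keys below splitPoints[0]; reducer i gets keys in
    // [splitPoints[i-1], splitPoints[i]); the last reducer gets the rest.
    static int partitionFor(int key, int[] splitPoints) {
        int idx = Arrays.binarySearch(splitPoints, key);
        // When the key is absent, binarySearch returns -(insertionPoint) - 1.
        return idx >= 0 ? idx + 1 : -idx - 1;
    }

    public static void main(String[] args) {
        int[] splitPoints = {10, 20}; // three reducers: <10, 10..19, >=20
        for (int key : new int[] {5, 10, 15, 20, 25}) {
            System.out.println(key + " -> reducer " + partitionFor(key, splitPoints));
        }
    }
}
```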
Programming MapReduce
Externally: For user
1. Write a Map program (short), write a Reduce program (short)
2. Specify number of Maps and Reduces (parallelism level)
3. Submit job; wait for result
4. Need to know very little about parallel/distributed programming!

Internally: For the Paradigm and Scheduler

1. Parallelize Map
2. Transfer data from Map to Reduce
3. Parallelize Reduce
4. Implement Storage for Map input, Map output, Reduce input, and Reduce output
(Ensure that no Reduce starts before all Maps are finished. That is, ensure the barrier between the Map phase and Reduce phase)
Inside MapReduce
For the cloud:
1. Parallelize Map: easy! Each map task is independent of the others!
2. Transfer data from Map to Reduce:
• All Map output records with the same key are assigned to the same Reduce task
• Use a partitioning function, e.g., hash(key) % number of reducers
3. Parallelize Reduce: easy! Each reduce task is independent of the others!
4. Implement Storage for Map input, Map output, Reduce input, and Reduce output
• Map input: from distributed file system
• Map output: to local disk (at Map node); uses local file system
• Reduce input: from (multiple) remote disks; uses local file systems
• Reduce output: to distributed file system
local file system = Linux FS, etc.
distributed file system = GFS (Google File System), HDFS (Hadoop Distributed File System)
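
The four internal steps can be mimicked in one process for Wordcount. This is a toy sketch, not how Hadoop is implemented: in-memory maps stand in for the local disks and the distributed file system, the loops run sequentially where a cluster would run tasks in parallel, and all names are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MiniMapReduce {
    // Map: one record in, a list of <word, 1> pairs out.
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
        }
        return out;
    }

    // Reduce: all values for one key in, one count out.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> records, int numReducers) {
        // One bucket per reduce task, holding key -> list of values.
        List<Map<String, List<Integer>>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new TreeMap<>());
        }
        // Steps 1 and 2: run Map on each record, then route each pair by hash(key) % reducers.
        for (String record : records) {
            for (Map.Entry<String, Integer> kv : map(record)) {
                int p = (kv.getKey().hashCode() & Integer.MAX_VALUE) % numReducers;
                buckets.get(p).computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        // Barrier: the loop above has finished, so every Map is done before any Reduce starts.
        // Step 3: each bucket is reduced independently of the others.
        Map<String, Integer> output = new TreeMap<>();
        for (Map<String, List<Integer>> bucket : buckets) {
            for (Map.Entry<String, List<Integer>> e : bucket.entrySet()) {
                output.put(e.getKey(), reduce(e.getKey(), e.getValue()));
            }
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("Welcome Everyone", "Hello Everyone"), 2));
        // {Everyone=2, Hello=1, Welcome=1}
    }
}
```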
[Figure: Blocks 1–7 from the DFS are read by Map tasks on servers A, B, C; Reduce tasks on servers A, B, C (local write, remote read) produce output files I, II, III into the DFS. The Resource Manager assigns maps and reduces to servers.]
The YARN Scheduler
• Used in Hadoop 2.x and later
• YARN = Yet Another Resource Negotiator
• Treats each server as a collection of containers
– Container = fixed CPU + fixed memory
• Has 3 main components
– Global Resource Manager (RM)
• Scheduling
– Per-server Node Manager (NM)
• Daemon and server-specific functions
– Per-application (job) Application Master (AM)
• Container negotiation with RM and NMs
• Detecting task failures of that job
YARN: How a job gets a container
In this figure
• 2 servers (A, B)
• 2 jobs (1, 2)
[Figure: Resource Manager (Capacity Scheduler); Node A with Node Manager A; Node B with Node Manager B; Application Master 1, Application Master 2, and a Task (App2) run on the nodes.]
1. Need container
2. Container Completed
3. Container on Node B
4. Start task, please!
Fault Tolerance
• Server Failure
– NM heartbeats to RM
• If server fails, RM lets all affected AMs know, and AMs take action
– NM keeps track of each task running at its server
• If task fails while in-progress, mark the task as idle and restart it
– AM heartbeats to RM
• On failure, RM restarts AM, which then syncs up with its running tasks
• RM Failure
– Use old checkpoints and bring up secondary RM
• Heartbeats also used to piggyback container requests
– Avoids extra messages
Slow Servers
Slow tasks are called Stragglers
• The slowest task slows the entire job down (why?)
• Due to bad disk, network bandwidth, CPU, or memory
• Keep track of “progress” of each task (% done)
• Perform proactive backup (replicated) execution of a straggler task: the task is considered done when the first replica completes. Called Speculative Execution.
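
One way to turn per-task progress into a backup decision is sketched below. The rule used here (flag a task whose progress is more than a fixed lag below the average) is a hypothetical heuristic chosen for illustration, not Hadoop's actual policy, and the names are made up.

```java
import java.util.ArrayList;
import java.util.List;

final class StragglerSketch {
    // progress[i] is the fraction of task i that is done (0.0 to 1.0).
    // Returns the indices of tasks that trail the average by more than `lag`;
    // these are the candidates for a backup (speculative) copy.
    static List<Integer> stragglers(double[] progress, double lag) {
        double sum = 0;
        for (double p : progress) {
            sum += p;
        }
        double average = progress.length == 0 ? 0 : sum / progress.length;
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < progress.length; i++) {
            if (progress[i] < average - lag) {
                out.add(i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] progress = {0.9, 0.95, 0.3, 0.85};
        System.out.println(stragglers(progress, 0.2)); // [2]
    }
}
```

The job finishes only when its last task finishes, which is why one slow task delays everything and why a backup copy of that task is worth the extra work.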
Locality
– The cloud has a hierarchical topology (e.g., racks)
– GFS/HDFS stores 3 replicas of each chunk (e.g., 64 MB in size)
• Maybe on different racks, e.g., 2 on a rack, 1 on a different rack
– MapReduce attempts to schedule a map task on
• a machine that contains a replica of the corresponding input data, or failing that,
• on the same rack as a machine containing the input, or failing that,
• anywhere
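
The three-level preference can be sketched as a lookup over the free servers. The data model is an assumption made for illustration (server names as strings, a map from server to rack), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class LocalitySketch {
    // Returns the server to run the map task on, or null if none is free.
    static String pickServer(List<String> freeServers,
                             Set<String> replicaServers,
                             Map<String, String> rackOf) {
        // 1. A machine that holds a replica of the input.
        for (String s : freeServers) {
            if (replicaServers.contains(s)) {
                return s;
            }
        }
        // 2. A machine on the same rack as some replica.
        Set<String> replicaRacks = new HashSet<>();
        for (String r : replicaServers) {
            String rack = rackOf.get(r);
            if (rack != null) {
                replicaRacks.add(rack);
            }
        }
        for (String s : freeServers) {
            if (replicaRacks.contains(rackOf.get(s))) {
                return s;
            }
        }
        // 3. Anywhere.
        return freeServers.isEmpty() ? null : freeServers.get(0);
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = new HashMap<>();
        rackOf.put("A", "rack1");
        rackOf.put("B", "rack1");
        rackOf.put("C", "rack2");
        Set<String> replicas = Collections.singleton("A");
        // A is busy; B shares A's rack, so it is preferred over C.
        System.out.println(pickServer(Arrays.asList("C", "B"), replicas, rackOf)); // B
    }
}
```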
MapReduce: Summary
• MapReduce uses parallelization + aggregation to schedule applications across clusters
• Need to deal with failure
• Plenty of ongoing research work in scheduling and fault-tolerance for MapReduce and Hadoop
