T4: MapReduce and Hadoop
Strathmore University
What is MapReduce?
Map
Input <filename, file text>:
Welcome Everyone
Hello Everyone
Output:
Key Value
Welcome 1
Everyone 1
Hello 1
Everyone 1
Map
Input <filename, file text>
MAP TASK 1: Welcome Everyone
– Welcome 1
– Everyone 1
MAP TASK 2: Hello Everyone
– Hello 1
– Everyone 1
Map
Input <filename, file text>:
Welcome Everyone
Hello Everyone
Why are you here
I am also here
They are also here
Yes, it’s THEM!
The same people we were thinking of
…….
MAP TASKS output:
Welcome 1
Everyone 1
Hello 1
Everyone 1
Why 1
Are 1
You 1
Here 1
…….
Reduce
• Reduce processes and merges all intermediate values associated with each key
Input:
Key Value
Welcome 1
Everyone 1
Hello 1
Everyone 1
Output:
Key Value
Everyone 2
Hello 1
Welcome 1
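The Map and Reduce steps on the word-count example can be simulated in a single JVM. The following is a minimal sketch in plain Java, not Hadoop code; the class name `WordCountSketch` and its method names are hypothetical and chosen only for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process simulation of word count; not the Hadoop API.
final class WordCountSketch {

    // Map: emit one (word, 1) pair per whitespace-separated token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Reduce: merge all intermediate values that share a key by summing them.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run Map on every line, group the pairs by key, then run Reduce per key.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same input as the slides; prints {Everyone=2, Hello=1, Welcome=1}
        System.out.println(run(Arrays.asList("Welcome Everyone", "Hello Everyone")));
    }
}
```

The grouping step in `run` stands in for the shuffle that a real framework performs between the Map and Reduce phases.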
Reduce
Input:
Welcome 1
Everyone 1
Hello 1
Everyone 1
REDUCE TASK 1 output:
Everyone 2
REDUCE TASK 2 output:
Hello 1
Welcome 1
• Popular: hash partitioning, i.e., each key is assigned to a reduce task by
– reduce # = hash(key) % number of reduce tasks
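The hash-partitioning rule can be written as a one-line function. This is a minimal sketch, assuming Java's `String.hashCode()` as the hash; `HashPartitionSketch` and `reduceTaskFor` are hypothetical names, not Hadoop classes.

```java
// Hypothetical helper illustrating hash partitioning; not Hadoop's own class.
final class HashPartitionSketch {

    // reduce # = hash(key) % number of reduce tasks.
    // Math.floorMod keeps the result in [0, numReduceTasks) even when
    // hashCode() is negative, which the plain % operator would not.
    static int reduceTaskFor(String key, int numReduceTasks) {
        return Math.floorMod(key.hashCode(), numReduceTasks);
    }

    public static void main(String[] args) {
        int numReduceTasks = 2;
        for (String key : new String[] {"Welcome", "Everyone", "Hello"}) {
            System.out.println(key + " -> reduce task "
                    + reduceTaskFor(key, numReduceTasks));
        }
    }
}
```

Because the function depends only on the key, every (key, value) pair with the same key reaches the same reduce task, regardless of which map task produced it.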
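Hadoop Code - Map
// Word-count mapper (old org.apache.hadoop.mapred API)
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);  // emit (word, 1)
    }
  }
}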
Hadoop Code - Reduce
public static class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(
      Text key,
      Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();  // add up the counts for this key
    }
    output.collect(key, new IntWritable(sum));  // emit (word, total)
  }
}
// Source: http://developer.yahoo.com/hadoop/tutorial/module4.html#wordcount
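Hadoop Code - Driver
// Tells Hadoop how to run your Map-Reduce job
public void run(String inputPath, String outputPath)
    throws Exception {
  // The job; WordCount is assumed to be the class enclosing MapClass and ReduceClass
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("mywordcount");
  // The output keys are words (strings)
  conf.setOutputKeyClass(Text.class);
  // The output values are counts (ints)
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(MapClass.class);
  conf.setReducerClass(ReduceClass.class);
  FileInputFormat.addInputPath(
      conf, new Path(inputPath));
  FileOutputFormat.setOutputPath(
      conf, new Path(outputPath));
  JobClient.runJob(conf);
}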
Some Applications of MapReduce
Distributed Grep:
• Input: large set of files
• Output: lines that match pattern
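A minimal plain-Java sketch of distributed grep follows. The slide gives only the input and output, so the split into Map and Reduce is an assumption here: Map emits a line when it matches the pattern, and Reduce simply copies the intermediate lines to the output. `GrepSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical single-process simulation of distributed grep; not Hadoop code.
final class GrepSketch {

    // Map (assumed): emit the line if it contains a match for the pattern.
    static List<String> map(String line, Pattern pattern) {
        List<String> out = new ArrayList<>();
        if (pattern.matcher(line).find()) {
            out.add(line);
        }
        return out;
    }

    // Reduce (assumed): identity, so the matching lines are copied to the output.
    static List<String> run(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> output = new ArrayList<>();
        for (String line : lines) {
            output.addAll(map(line, pattern));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Welcome Everyone", "Hello Everyone", "Why are you here");
        // Prints [Welcome Everyone, Hello Everyone]
        System.out.println(run(lines, "Everyone"));
    }
}
```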
Some Applications of MapReduce (2)
• Map – processes the web log and, for each input <source, target>, outputs <target, source>
• Reduce – emits <target, list(source)>
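The Map and Reduce steps above can be sketched in plain Java as follows. This is an illustration only, assuming each web-log record has already been parsed into a two-element array `{source, target}`; `ReverseLinksSketch` and the page names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process simulation; not Hadoop code.
final class ReverseLinksSketch {

    // Map: for an input <source, target>, emit <target, source>.
    static String[] map(String source, String target) {
        return new String[] {target, source};
    }

    // Group the map output by key; Reduce then emits <target, list(source)>.
    static Map<String, List<String>> run(List<String[]> links) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] link : links) {
            String[] kv = map(link[0], link[1]);
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> links = Arrays.asList(
                new String[] {"a.html", "c.html"},
                new String[] {"b.html", "c.html"},
                new String[] {"c.html", "a.html"});
        // Prints {a.html=[c.html], c.html=[a.html, b.html]}
        System.out.println(run(links));
    }
}
```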
Some Applications of MapReduce (3)
Some Applications of MapReduce (4)
Sort
• Input: Series of (key, value) pairs
• Output: Sorted <value>s
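One common way to sort with MapReduce is sketched below. It is an assumption here, since the slide lists only the input and output: Map emits each value as a key, a range partitioner sends lower key ranges to lower-numbered reduce tasks, each reduce task sees its keys in sorted order, and concatenating the reduce outputs gives a globally sorted list. `SortSketch` and the `boundaries` parameter are hypothetical, and the sketch keeps only the values of the input pairs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical single-process simulation of sorting by range partitioning.
final class SortSketch {

    // Range partitioner: values below boundaries[0] go to reduce task 0,
    // values in [boundaries[0], boundaries[1]) go to task 1, and so on.
    static int partition(int value, int[] boundaries) {
        int task = 0;
        while (task < boundaries.length && value >= boundaries[task]) {
            task++;
        }
        return task;
    }

    static List<Integer> run(List<Integer> values, int[] boundaries) {
        // One bucket per reduce task.
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i <= boundaries.length; i++) {
            buckets.add(new ArrayList<>());
        }
        // Map: emit each value as a key and route it by range.
        for (int v : values) {
            buckets.get(partition(v, boundaries)).add(v);
        }
        // Reduce: identity over keys that arrive sorted within each task.
        List<Integer> output = new ArrayList<>();
        for (List<Integer> bucket : buckets) {
            Collections.sort(bucket);
            output.addAll(bucket);
        }
        return output;
    }

    public static void main(String[] args) {
        // Prints [3, 7, 19, 42, 88]
        System.out.println(run(Arrays.asList(42, 7, 19, 3, 88), new int[] {10, 50}));
    }
}
```

Hash partitioning would not work for this job, because it scatters neighbouring keys across reduce tasks and the concatenated output would no longer be ordered.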
Programming MapReduce
8. Implement storage for Map input, Map output, Reduce input, and Reduce output
(Ensure that no Reduce starts before all Maps are finished; that is, enforce the barrier between the Map phase and the Reduce phase.)
Inside MapReduce
[Figure: Map tasks (A, B, C) over inputs 1–7, Reduce tasks (A, B, C), and output files I–III written into DFS]
[Figure: container allocation steps: 1. Need container; 2. Container completed; 3. Container on Node B]
Fault Tolerance
• Server Failure
– NM (NodeManager) heartbeats to RM (ResourceManager)
• If a server fails: the RM times out waiting for the next heartbeat, the RM lets all affected AMs (ApplicationMasters) know, and the AMs take appropriate action
– NM keeps track of each task running at its server
• If a task fails while in progress, mark the task as idle and restart it
– AM heartbeats to RM
• On failure, the RM restarts the AM, which then syncs up with its running tasks
• RM Failure
Locality
• Locality
– The cloud has a hierarchical topology (e.g., racks)
– For server fault tolerance, GFS/HDFS stores 3 replicas of each chunk (e.g., 64 MB in size)
• For rack fault tolerance, replicas are placed on different racks, e.g., 2 on one rack, 1 on a different rack
– MapReduce attempts to schedule a map task on
1. a machine that contains a replica of the corresponding input data, or failing that,
2. a machine on the same rack as a machine containing the input, or failing that,
3. anywhere
– Note: The 2-1 split of replicas is intended to reduce bandwidth when writing a file.
• Using more racks does not affect overall MapReduce scheduling performance
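The three-step scheduling preference can be sketched as a small selection function. This is an illustration under simplifying assumptions, not the scheduler's actual code: machines are plain strings, `rackOf` maps each machine to its rack, and `LocalitySketch.chooseMachine` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of locality-aware map-task placement.
final class LocalitySketch {

    static String chooseMachine(Set<String> replicaMachines,
                                Map<String, String> rackOf,
                                List<String> freeMachines) {
        // 1. Prefer a free machine that holds a replica of the input.
        for (String m : freeMachines) {
            if (replicaMachines.contains(m)) {
                return m;
            }
        }
        // 2. Otherwise, a free machine on the same rack as some replica.
        Set<String> replicaRacks = new HashSet<>();
        for (String m : replicaMachines) {
            replicaRacks.add(rackOf.get(m));
        }
        for (String m : freeMachines) {
            if (replicaRacks.contains(rackOf.get(m))) {
                return m;
            }
        }
        // 3. Otherwise, anywhere (null if no machine is free).
        return freeMachines.isEmpty() ? null : freeMachines.get(0);
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = new HashMap<>();
        rackOf.put("m1", "rack1");
        rackOf.put("m2", "rack1");
        rackOf.put("m3", "rack2");
        rackOf.put("m4", "rack3");
        Set<String> replicas = new HashSet<>(Arrays.asList("m1", "m3"));
        // m1 holds a replica but is busy; m2 shares its rack. Prints m2
        System.out.println(chooseMachine(replicas, rackOf, Arrays.asList("m4", "m2")));
    }
}
```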
That was Hadoop 2.x…
MapReduce: Summary
Further MapReduce Exercises
Exercise 1
1. (MapReduce) You are given a symmetric social network (like Facebook) where a
is a friend of b implies that b is also a friend of a. The input is a dataset D (sharded)
containing such pairs (a, b) – note that either a or b may be a lexicographically
lower name. Pairs appear exactly once and are not repeated. Find the last names of
those users whose first name is “Kanye” and who have at least 300 friends. You can
chain Mapreduces if you want (but only if you must, and even then, only the least
number). You don’t need to write code – pseudocode is fine as long as it is
understandable. Your pseudocode may assume the presence of appropriate
primitives (e.g., “firstname(user_id)”, etc.). The Map function takes as input a tuple
(key=a,value=b).
Exercise 1: Solution
• M1(a, b):
– if (firstname(a) == "Kanye") then output (a, b)
– if (firstname(b) == "Kanye") then output (b, a)
– // Note that the second if is NOT an else if, so a single M1 call may output up to 2 KV pairs!
• R1(x, V):
– if |V| >= 300 then output (lastname(x), -)
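The pseudocode above can be checked on a small dataset with a plain-Java simulation. This is a sketch under assumptions: user ids are taken to be strings of the form "First Last", so `firstname` and `lastname` are simple stand-ins for the primitives the exercise allows, and the friend threshold is a parameter (`minFriends`) so that a tiny input can exercise it. `KanyeFriendsSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical single-process simulation of the Exercise 1 solution.
final class KanyeFriendsSketch {

    // Assumed id format: "First Last".
    static String firstname(String userId) {
        return userId.substring(0, userId.indexOf(' '));
    }

    static String lastname(String userId) {
        return userId.substring(userId.indexOf(' ') + 1);
    }

    // M1(a, b): two independent ifs, so up to 2 KV pairs per call.
    static List<String[]> map(String a, String b) {
        List<String[]> out = new ArrayList<>();
        if (firstname(a).equals("Kanye")) out.add(new String[] {a, b});
        if (firstname(b).equals("Kanye")) out.add(new String[] {b, a});
        return out;
    }

    // R1(x, V): emit lastname(x) when x has at least minFriends friends.
    static Optional<String> reduce(String x, List<String> friends, int minFriends) {
        return friends.size() >= minFriends
                ? Optional.of(lastname(x)) : Optional.empty();
    }

    static List<String> run(List<String[]> pairs, int minFriends) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] p : pairs) {
            for (String[] kv : map(p[0], p[1])) {
                grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
            }
        }
        List<String> lastNames = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            reduce(e.getKey(), e.getValue(), minFriends).ifPresent(lastNames::add);
        }
        return lastNames;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[] {"Kanye West", "Ann Lee"},
                new String[] {"Bob Ray", "Kanye West"},
                new String[] {"Kanye East", "Ann Lee"},
                new String[] {"Ann Lee", "Bob Ray"});
        // With a threshold of 2 instead of 300, prints [West]
        System.out.println(run(pairs, 2));
    }
}
```

Because each pair appears exactly once in the input, the size of the value list for a key equals that user's friend count, so a single MapReduce job suffices.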
Exercise 2
Exercise 2: Solution
• M1(a, b):
– Output (key=a, value=(OUT, b))
– Output (key=b, value=(IN, a))
– // Note that a single M1 call outputs TWO KV pairs
• R1(key=x, V):
– Collect Sout = set of all (OUT, *) value items from V
– Collect Sin = set of all (IN, *) value items from V
– if (|Sout| < 20 AND |Sin| >= 2M AND all items in Sout are also present in Sin) // third term via nested for loops
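The R1 condition can be tried out with a plain-Java simulation. This is a sketch under assumptions: the input is taken to be directed pairs (a, b), the two thresholds are parameters (`outLimit`, `inMin`) so that a small input can exercise them, and since the source does not show what R1 emits when the condition holds, the sketch simply returns the matching keys. `InOutSketch` is a hypothetical name, and `containsAll` stands in for the nested loops mentioned in the note.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical single-process simulation of the Exercise 2 solution.
final class InOutSketch {

    static List<String> run(List<String[]> edges, int outLimit, int inMin) {
        Map<String, Set<String>> sOut = new TreeMap<>();
        Map<String, Set<String>> sIn = new TreeMap<>();
        // M1(a, b): emit (a, (OUT, b)) and (b, (IN, a)); grouping by key
        // is folded into the two maps here.
        for (String[] e : edges) {
            sOut.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            sIn.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        // R1(x, V): test the three-part condition for every key x.
        Set<String> keys = new TreeSet<>(sOut.keySet());
        keys.addAll(sIn.keySet());
        List<String> matches = new ArrayList<>();
        for (String x : keys) {
            Set<String> out = sOut.getOrDefault(x, Collections.<String>emptySet());
            Set<String> in = sIn.getOrDefault(x, Collections.<String>emptySet());
            if (out.size() < outLimit && in.size() >= inMin && in.containsAll(out)) {
                matches.add(x);  // assumed output: the key itself
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String[]> edges = Arrays.asList(
                new String[] {"a", "c"}, new String[] {"b", "c"},
                new String[] {"c", "a"}, new String[] {"d", "c"},
                new String[] {"a", "b"});
        // With limits 2 and 3 instead of 20 and 2M, prints [c]
        System.out.println(run(edges, 2, 3));
    }
}
```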
Exercise 3
Exercise 3: Solution