Advanced MapReduce
2. Implement Mapper
– Input is text – a line from sample.txt
– Tokenize the text and emit each token with a count of 1: <token, 1>
3. Implement Reducer
– Sum up the counts for each token
– Write out the result to HDFS
job.setMapperClass(TokenizerMapper.class);
job.setReducerClass(IntSumReducer.class);
job.setCombinerClass(IntSumReducer.class);
1. Configure Job
• job.waitForCompletion(true)
– Submits the job and waits for it to complete
– The boolean parameter specifies whether progress should be reported to the console
– If the job completes successfully ‘true’ is returned, otherwise ‘false’ is returned
System.exit(job.waitForCompletion(true) ? 0 : 1);
Our Count Job is configured to
• Chop up text files into lines
• Send records to mappers as key-value pairs
– The line’s byte offset in the file and the line’s text
• Mapper class is TokenizerMapper
– Receives key-value pairs of <LongWritable, Text>
– Outputs key-value pairs of <Text, IntWritable>
• Reducer class is IntSumReducer
– Receives key-value of <Text, IntWritable>
– Outputs key-values of <Text, IntWritable> as text
• Combiner class is IntSumReducer
1. Configure Count Job
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.printf("Usage: wordcount <input dir> <output dir>\n");
      System.exit(-1);
    }
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
2. Implement Mapper Class
• Class has 4 Java Generics parameters
– (1) input key (2) input value (3) output key (4) output value
– Input and output use Hadoop’s I/O framework
• org.apache.hadoop.io
public class TokenizerMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      context.write(new Text(itr.nextToken()), new IntWritable(1));
    }
  }
}
3. Implement Reducer
• Analogous to the Mapper – a generic class with four type parameters
– (1) input key (2) input value (3) output key (4) output value
– The output types of the map function must match the input types of the reduce function
• In this case Text and IntWritable
– The MapReduce framework groups the key-value pairs produced by the mappers by key
• For each key there is a set of one or more values
• e.g., map outputs <the,1>, <cat,1>, <the,1> arrive at the reducer as <cat,[1]> and <the,[1,1]>
• Input to a reducer is sorted by key
• Known as Shuffle and Sort
– The reduce function accepts a key and its set of values and outputs key-value pairs
• Also uses a Context object (similar to the Mapper)
3. Implement Reducer
public class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
Key and Value Types
• Keys and values use types from the org.apache.hadoop.io package
– LongWritable for Long
– IntWritable for Integer
– Text for String
– Etc...
• Keys must implement the WritableComparable interface
– Extends Writable and java.lang.Comparable<T>
– Required because keys are sorted prior to the reduce phase (see the sketch after this list)
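When the built-in types are not enough, a custom key can implement WritableComparable directly. The sketch below is an illustration only (PairKey is a hypothetical composite key, not part of the original example): it serializes two strings with the Writable methods and sorts by the first field, then the second.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: sorts by 'first', then by 'second'.
public class PairKey implements WritableComparable<PairKey> {
  private String first = "";
  private String second = "";

  public PairKey() { }  // no-arg constructor required by the framework
  public PairKey(String first, String second) { this.first = first; this.second = second; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(first);    // serialized when shuffled between map and reduce
    out.writeUTF(second);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    first = in.readUTF();   // deserialized on the reduce side
    second = in.readUTF();
  }

  @Override
  public int compareTo(PairKey other) {
    int cmp = first.compareTo(other.first);   // used by the sort phase
    return cmp != 0 ? cmp : second.compareTo(other.second);
  }

  @Override
  public int hashCode() {
    return first.hashCode() * 31 + second.hashCode();  // used by the default HashPartitioner
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof PairKey)) return false;
    PairKey p = (PairKey) o;
    return first.equals(p.first) && second.equals(p.second);
  }
}
Because keys are also partitioned and grouped during the shuffle, hashCode() and equals() are defined consistently with compareTo().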
JobTracker (JT) responsibilities:
• Resource management (of the TaskTrackers, TTs)
• Tracking resource consumption/availability
• Job life-cycle management
MRv2 Overview
Fundamental idea:
Re-architect the JT’s resource management and job scheduling & monitoring into two
separate components: the ResourceManager and the ApplicationMaster (AM)
Building Blocks
• ResourceManager: manages the global assignment of compute resources to applications.
It has a pluggable scheduler for allocating resources to the running applications, subject to
constraints such as capacities and queues. It optimizes for cluster utilization (keep all
resources in use all the time) against various constraints such as capacity guarantees,
fairness, and SLAs. It does NOT provide fault tolerance for resources (the AM does).
• The AM can make very specific requests to the RM for containers, such as:
– Resource name: hostname, rack name
– Memory (in MB)
– CPU (in cores), added after March 2012
– Future: disk, network, GPUs, etc.
Building Blocks
• A resource request from the AM to the scheduler in the RM
contains the following (see the sketch after this list):
– Resource name: hostname, rack name (Future: VMs on a host, networks)
– Priority: priority within the application, not across the cluster
– Resource requirement: memory, CPU (Future: GPUs)
– Number of containers: just a count
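As an illustration only (an assumption, not part of the original slides), the sketch below shows how an ApplicationMaster might express such a request through YARN's AMRMClient API; the hostname, rack name, sizes, and container count are placeholder values, and this code would normally run inside the ApplicationMaster's own container.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerRequestSketch {
  // Illustrative only: issues a resource request with the fields listed above.
  public static void requestWorkerContainers() throws Exception {
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new Configuration());
    rmClient.start();

    // Register this ApplicationMaster with the ResourceManager (placeholder host/port/URL).
    rmClient.registerApplicationMaster("am-host.example.com", 0, "");

    Resource capability = Resource.newInstance(1024, 1);  // memory in MB, CPU in vcores
    Priority priority = Priority.newInstance(0);          // priority within this application only

    String[] nodes = { "node01.example.com" };            // resource name: preferred hostname
    String[] racks = { "/rack1" };                        // resource name: preferred rack

    // The number of containers is expressed by adding the same request several times.
    for (int i = 0; i < 3; i++) {
      rmClient.addContainerRequest(new ContainerRequest(capability, nodes, racks, priority));
    }

    // Allocations arrive asynchronously in responses to allocate() heartbeats (omitted here).

    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    rmClient.stop();
  }
}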
1) A client program submits the application, including the specification needed to launch the
application-specific ApplicationMaster.
2) The ResourceManager assumes the responsibility to negotiate a specified container in which to start
the ApplicationMaster and then launches the ApplicationMaster.
3) The ApplicationMaster, on boot-up, registers with the ResourceManager – the registration allows the
client program to query the ResourceManager for details, which allow it to directly communicate with its
own ApplicationMaster.
4) During normal operation the ApplicationMaster negotiates appropriate resource containers via the
resource-request protocol.
5) On successful container allocations, the ApplicationMaster launches the container by providing the
container launch specification to the NodeManager. The launch specification, typically, includes the
necessary information to allow the container to communicate with the ApplicationMaster itself.
6) The application code executing within the container then provides necessary information (progress,
status etc.) to its ApplicationMaster via an application-specific protocol.
7) During the application execution, the client that submitted the program communicates directly with
the ApplicationMaster to get status, progress updates etc. via an application-specific protocol.
8) Once the application is complete, and all necessary work has been finished, the ApplicationMaster
deregisters with the ResourceManager and shuts down, allowing its own container to be repurposed.
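For the client side of this flow, the sketch below (an assumption, not from the slides) uses YARN's YarnClient API to cover steps 1, 2 and 7: it obtains a new application, submits it, and polls the ResourceManager for status. The ApplicationMaster launch specification is omitted, so this is only a skeleton of the submission and monitoring path.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;

public class ClientSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new Configuration());
    yarnClient.start();

    // Step 1: ask the RM for a new application and fill in the submission context.
    // (A real client must also set the ApplicationMaster's container launch spec here.)
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("sketch-app");

    // Step 2: the RM negotiates a container for the ApplicationMaster and launches it.
    ApplicationId appId = yarnClient.submitApplication(ctx);

    // Step 7: the client polls the RM for status and progress of the application.
    ApplicationReport report = yarnClient.getApplicationReport(appId);
    while (report.getYarnApplicationState() != YarnApplicationState.FINISHED
        && report.getYarnApplicationState() != YarnApplicationState.FAILED
        && report.getYarnApplicationState() != YarnApplicationState.KILLED) {
      Thread.sleep(1000);
      report = yarnClient.getApplicationReport(appId);
    }

    System.out.println("Final status: " + report.getFinalApplicationStatus());
    yarnClient.stop();
  }
}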
Thank You