6. Map Reduce Programming
Contents
• Map-Reduce Programming
• Exercises
• Mappers & Reducers
• Hadoop combiners
• Hadoop partitioners
Overview
• Hadoop MapReduce is a software framework for easily writing applications
which process vast amounts of data (multi-terabyte data-sets) in-parallel
on large clusters (thousands of nodes) of commodity hardware in a
reliable, fault-tolerant manner.
• A MapReduce job usually splits the input data-set into independent chunks
which are processed by the map tasks in a completely parallel manner.
• The framework sorts the outputs of the maps, which are then input to the
reduce tasks.
• Typically both the input and the output of the job are stored in a file-system.
• The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
• Typically the compute nodes and the storage nodes are the same, that
is, the MapReduce framework and the Hadoop Distributed File System
are running on the same set of nodes.
• This configuration allows the framework to effectively schedule tasks on
the nodes where data is already present, resulting in very high aggregate
bandwidth across the cluster.
• The MapReduce framework consists of a single master ResourceManager, one worker NodeManager per cluster node, and one MRAppMaster per application.
What is Map Reduce?
Word count Job
Input : Text file
Output : count of words
File.txt (Size :: 500MB):
Hi how are you
how is your job
how is your family
how is your sister
how is your brother
what is the time now
what is the strength of the Hadoop
The input file is read by an input format, which turns it into (key, value) records for the mappers:
• TextInputFormat (key = byte offset, value = record/line)
• KeyValueTextInputFormat
• SequenceFileInputFormat
• SequenceFileAsTextInputFormat
The input file is divided into splits, and each split is processed by a separate Mapper.
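With TextInputFormat, each record handed to a mapper has the line's byte offset as the key and the line itself as the value. A minimal plain-Java sketch of how those (byte offset, line) records are produced (class and method names are hypothetical, not the Hadoop API):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ByteOffsetRecords {
    // Mimic TextInputFormat: key = byte offset of each line, value = the line.
    static List<Object[]> toRecords(String fileContents) {
        List<Object[]> records = new ArrayList<>();
        long offset = 0;
        for (String line : fileContents.split("\n", -1)) {
            records.add(new Object[]{offset, line});
            // advance past the line's bytes plus the '\n' separator
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String file = "Hi how are you\nhow is your job";
        for (Object[] r : toRecords(file)) {
            System.out.println(r[0] + "\t" + r[1]);
        }
        // prints: 0	Hi how are you
        //         15	how is your job
    }
}
```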
2024/9/22 9
MapReduce: In Parallel
Steps of MapReduce
Three core steps of MapReduce, wrapped by input and output:
• Sequentially read a lot of data
• Map: extract something you care about
• Group by key: sort and shuffle
• Reduce: aggregate, summarize, filter, or transform
• Output the result
MapReduce Examples #example1
• Word count using MapReduce:

map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)

For the input "hi how are you", the map phase emits (hi,1)(how,1)(are,1)(you,1); pairs are then grouped by key before reduce.
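The word-count pseudocode can be simulated end to end in plain Java without a Hadoop cluster; the class below is a hypothetical sketch, with the map phase emitting (word, 1) pairs and the grouping-plus-sum standing in for shuffle and reduce:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSim {
    static Map<String, Integer> wordCount(String document) {
        // Map phase: emit (word, 1) for every word in the input.
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String w : document.split("\\s+")) {
            if (!w.isEmpty()) emitted.add(new SimpleEntry<>(w, 1));
        }
        // Group-by-key + reduce phase: collect pairs per key and sum the counts.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> e : emitted) {
            counts.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("hi how are you hi how"));
        // prints: {are=1, hi=2, how=2, you=1}
    }
}
```

In real Hadoop, the grouping step is done by the framework's sort-and-shuffle across the cluster rather than by a single in-memory map.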
6. SequenceFileAsBinaryInputFormat
By using SequenceFileAsBinaryInputFormat we can retrieve the sequence file's keys and values as opaque binary objects.
7. NLineInputFormat
• It is another form of TextInputFormat: the keys are the byte offsets of the lines and the values are the contents of the lines.
• With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input; the number depends on the size of the split and on the length of the lines.
• If we want each mapper to receive a fixed number of lines of input, we use NLineInputFormat.
• N is the number of lines of input that each mapper receives.
• By default (N = 1), each mapper receives exactly one line of input.
• Suppose N = 2; then each split contains two lines, so one mapper receives the first two key-value pairs and another mapper receives the next two key-value pairs.
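The way NLineInputFormat carves the input into splits of N lines each can be sketched in plain Java (a hypothetical helper, not the Hadoop API); with N = 2 it reproduces the example above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NLineSplitter {
    // Group input lines into splits of at most n lines each,
    // as NLineInputFormat does; each split feeds one mapper.
    static List<List<String>> split(List<String> lines, int n) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            splits.add(lines.subList(i, Math.min(i + n, lines.size())));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("l1", "l2", "l3", "l4", "l5");
        // With N = 2, three mappers receive [l1, l2], [l3, l4], [l5].
        System.out.println(split(lines, 2));
    }
}
```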
8. DBInputFormat
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);   // emit (word, 1)
        }
    }
}

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();           // add up the 1s emitted for this word
        }
        result.set(sum);
        context.write(key, result);     // emit (word, total count)
    }
}
In the driver class:
• job.setCombinerClass(ReduceClass.class);
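A combiner runs the reduce logic locally on each mapper's output, shrinking the amount of data that must cross the network during the shuffle. A plain-Java sketch of that effect (hypothetical helper, not the Hadoop API):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CombinerDemo {
    // Combine a single mapper's (word, 1) output by summing per key,
    // which is what setting the reducer as the combiner achieves locally.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> partial = new HashMap<>();
        for (Map.Entry<String, Integer> e : mapOutput) {
            partial.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return new ArrayList<>(partial.entrySet());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : "hi how hi hi how".split(" ")) out.add(new SimpleEntry<>(w, 1));
        // 5 records before combining, 2 after: (hi, 3) and (how, 2).
        System.out.println(out.size() + " -> " + combine(out).size());
        // prints: 5 -> 2
    }
}
```

Using the reducer as a combiner is safe here because summing is associative and commutative; not every reduce function has that property.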
The combiner class can also be defined in a separate Java file.
• A MapReduce job takes an input data set and produces a list of key-value pairs. In the map phase, the input data is split, each map task processes one split, and each map outputs a list of key-value pairs. The output of the map phase is then sent to the reduce tasks, which apply the user-defined reduce function to the map outputs.
How many Partitioners?