6. MapReduce Programming
Contents
• Map-Reduce Programming
• Exercises
• Mappers & Reducers
• Hadoop combiners
• Hadoop partitioners
• MapReduce is a programming model and associated implementation
for processing and generating large data sets.
• Users specify a map function that processes a Key/Value pair to
generate a set of intermediate key/value pairs, and a reduce function
that merges all intermediate values associated with the same
intermediate key.
• Many real world tasks are expressible in this model
• Programs written in this functional style are automatically parallelized
and executed on a large cluster of commodity machines
• The run-time system takes care of the details of partitioning the input
data, scheduling the program’s execution across a set of machines,
handling machine failures and managing the required inter-machine
communication.
• A typical MapReduce computation processes many terabytes of data on
thousands of machines.
MapReduce Examples
#example1: word count using MapReduce

• map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)
    // intermediate pairs, e.g. (hi,1) (how,1) (hi,1) (you,1)

• reduce(key, values):
    // key: a word; values: an iterator over counts
    // the shuffle groups the pairs by key, e.g. (hi, (1,1))
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
#example2: counting words of different lengths
• Input file:
    hi how are you?
    Welcome to Nirma University.
• Output file:
    2:2, 3:3, 5:1, 7:1, 10:1
• How?
    Word lengths: hi:2, how:3, are:3, you:3, Welcome:7, to:2, Nirma:5, University:10
• Mapper task:
    emit (2,hi), (2,to), (3,how), ...
• Reducer task:
    (2:[hi,to])
    (3:[how,are,you])
    ...
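A minimal Java sketch of this job, using the standard Hadoop MapReduce API. The class names LengthMapper and LengthReducer are illustrative (not from the slides), and in practice each class would go in its own .java file:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: for every word in the line, emit (word length, word)
    public class LengthMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String word : value.toString().split("\\W+")) {
                if (!word.isEmpty())
                    context.write(new IntWritable(word.length()), new Text(word));
            }
        }
    }

    // Reducer: for every length, count how many words were received
    class LengthReducer extends Reducer<IntWritable, Text, IntWritable, IntWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (Text ignored : values) count++;
            context.write(key, new IntWritable(count));
        }
    }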
Recap: HDFS
• HDFS: A specialized distributed file system
• Good for large amounts of data, sequential reads
• Bad for lots of small files, random access, non-append writes
• Architecture: Blocks, namenode, datanodes
• File data is broken into large blocks (64MB default)
• Blocks are stored & replicated by datanodes
• Single namenode manages all the metadata
• Secondary namenode: Housekeeping & (some) redundancy
• Usage: Special command-line interface
• Example: hadoop fs -ls /path/in/hdfs
Example
input : Dear, Bear, River, Car, Car, River, Deer, Car and Bear
Word count job
• Input: a text file (File.txt, size 200 MB)
• Output: count of each word

File.txt:
    Hi how are you
    how is your job
    how is your family
    how is your sister
    how is your brother
    what is the time now
    what is the strength of the Hadoop
Input formats (from the figure):
• TextInputFormat (default): key = byte offset, value = one record (line) of the input file
• KeyValueTextInputFormat
• SequenceFileInputFormat
• SequenceFileAsTextInputFormat
The 200 MB input file is divided into input splits, and one Mapper runs per split (four Mappers in the figure).
MapReduce: A Diagram
What is MapReduce?
• Sum of squares:
• (map square '(1 2 3 4))
    Output: (1 4 9 16)
• (reduce + '(1 4 9 16))
    = (+ 16 (+ 9 (+ 4 1)))
    Output: 30
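The same sum-of-squares idea expressed in plain Java (a non-Hadoop sketch using the java.util.stream API):

    import java.util.List;
    import java.util.stream.Collectors;

    public class SumOfSquares {
        public static void main(String[] args) {
            List<Integer> nums = List.of(1, 2, 3, 4);
            // "map" step: square every element -> [1, 4, 9, 16]
            List<Integer> squares = nums.stream().map(x -> x * x).collect(Collectors.toList());
            // "reduce" step: fold the list with + -> 30
            int sum = squares.stream().reduce(0, Integer::sum);
            System.out.println(squares + " -> " + sum);
        }
    }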
Mapper Class
What do we need to write a MR program?
• A mapper
• Accepts (key,value) pairs from the input
• Produces intermediate (key,value) pairs, which are then shuffled
• A reducer
• Accepts intermediate (key,value) pairs
• Produces final (key,value) pairs for the output
• A driver
• Specifies which inputs to use, where to put the outputs
• Chooses the mapper and the reducer to use
• Hadoop takes care of the rest!!
• Default behaviors can be customized by the driver
The Mapper
• Input format: (file offset, line)
• Intermediate format: can be freely chosen

import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;
Find out the word length histogram (#example3)
• A particular document is given, and we have to find out how many big, medium, and small words appear in it; this gives the word-length histogram.
The abstract class InputFormat (org.apache.hadoop.mapreduce) declares two methods:

public abstract class InputFormat<K, V> {
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;
    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException;
}
Types of InputFormat in MapReduce
• TextInputFormat (the default)
• KeyValueTextInputFormat
• SequenceFileInputFormat
• SequenceFileAsTextInputFormat
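How a driver selects one of these input formats (a minimal sketch; TextInputFormat is the default, so the call below only matters when switching to another format such as KeyValueTextInputFormat, which splits each line at the first tab into a (key, value) pair):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

    public class InputFormatChoice {
        public static Job configure() throws IOException {
            Job job = Job.getInstance(new Configuration(), "input format demo");
            // Default is TextInputFormat (key = byte offset, value = line);
            // switch to KeyValueTextInputFormat instead.
            job.setInputFormatClass(KeyValueTextInputFormat.class);
            return job;
        }
    }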
Record Reader
• The MapReduce RecordReader in Hadoop takes the byte-oriented view of the input provided by the InputSplit and presents it as a record-oriented view to the Mapper.
• The map task passes the split to the createRecordReader() method of the InputFormat to obtain a RecordReader for that split. The RecordReader loads data from its source and converts it into key-value pairs suitable for reading by the mapper.
Types of Hadoop RecordReader in MapReduce
i. LineRecordReader
ii. SequenceFileRecordReader
Add the following lines // Importing libraries (for the driver)
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
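The driver class itself is not shown on the slides; a minimal sketch using these imports and the WCMapper/WCReducer classes from this section might look like the following (input and output paths are taken from the command line):

    public class WCDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WCDriver.class);
            job.setMapperClass(WCMapper.class);
            job.setReducerClass(WCReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (must not already exist)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

It could then be run with, for example, hadoop jar wc.jar WCDriver /input /output (the jar name and paths are illustrative).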
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Tokenize the line and emit (word, 1) for every word
        for (String word : value.toString().trim().split("\\s+"))
            if (!word.isEmpty()) context.write(new Text(word), new IntWritable(1));
    }
}
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable>
• We have created a class WCMapper that extends the class Mapper, which is already defined in the MapReduce framework.
• We define the data types of the input and output key/value pairs after the class declaration using angle brackets.
• Both the input and the output of the Mapper are key/value pairs.
• Input:
  • The key is nothing but the offset of each line in the text file: LongWritable
  • The value is each individual line: Text
• Output:
  • The key is the tokenized word: Text
  • The value is hardcoded, in our case 1: IntWritable
  • Example – Dear 1, Bear 1, etc.
• We have written Java code that tokenizes each word and assigns it a hardcoded value equal to 1.
Add the following lines // Importing libraries (for the mapper class)
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        // Sum all the 1s emitted for this word
        for (IntWritable v : values)
            count += v.get();
        context.write(key, new IntWritable(count));
    }
}
• We have created a class WCReducer that extends the class Reducer, just as we did for the Mapper.
• We define the data types of the input and output key/value pairs after the class declaration using angle brackets, as done for the Mapper.
• Both the input and the output of the Reducer are key/value pairs.
• Input:
  • The key is nothing but one of the unique words generated after the sorting and shuffling phase: Text
  • The value is a list of integers corresponding to that key: IntWritable
  • Example – Bear, [1, 1], etc.
• Output:
  • The key is one of the unique words present in the input text file: Text
  • The value is the number of occurrences of that word: IntWritable
  • Example – Bear, 2; Car, 3, etc.
• We have aggregated the values in the list corresponding to each key and produced the final answer.
• In general, the reduce() method is called once for each unique key, but you can configure the number of reduce tasks, e.g. in mapred-site.xml or programmatically in the driver (as shown below).
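A small sketch of setting the reducer count in the driver (the value 4 is arbitrary; this corresponds to the mapreduce.job.reduces configuration property):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCountDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "reducer count demo");
            // Run this job with four reduce tasks instead of the configured default
            job.setNumReduceTasks(4);
            System.out.println("reduces = " + job.getNumReduceTasks());
        }
    }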
Add the following lines // Importing libraries (for the reducer class)
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
NCDC data: raw format of weather data (figure not reproduced)
In the Driver class
• job.setCombinerClass(ReduceClass.class);
• Here ReduceClass is the reducer class (e.g. WCReducer above). Word-count addition is associative and commutative, so partial sums computed on the map side by the combiner (e.g. three (Car,1) pairs combined into (Car,3)) give the same final result while reducing the data shuffled across the network.
A custom partitioner (discussed next) is written as a separate Java file.
• Before the reduce phase, partitioning of the map output takes place on the basis of the key, and the output is sorted.
• Partitioning ensures that all the values for a single key are grouped together and go to the same reducer, thus allowing an even distribution of the map output over the reducers.
• The Partitioner in Hadoop MapReduce redirects the mapper output to the reducers by determining which reducer is responsible for a particular key.
• The default partitioner in Hadoop MapReduce is the HashPartitioner, which computes a hash value of the key and assigns the partition based on this result (effectively the key's hash code modulo the number of reduce tasks).
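A minimal custom partitioner sketch (the class name and routing rule are illustrative; it assumes the word-count key/value types Text and IntWritable and two reduce tasks):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Route short words (length <= 3) to partition 0 and all other words to partition 1
    public class WordLengthPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (numPartitions < 2) return 0;   // with a single reducer everything goes to partition 0
            return key.toString().length() <= 3 ? 0 : 1;
        }
    }

    // In the driver: job.setPartitionerClass(WordLengthPartitioner.class);
    //                job.setNumReduceTasks(2);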
How many Partitioners?
• The total number of partitions equals the number of reduce tasks for the job: one partition is created for each reducer, and the partitioner decides which of them a given key goes to.