6. Map Reduce Programming

MapReduce is a programming model for processing large data sets using a map function to generate intermediate key/value pairs and a reduce function to merge values. It automatically parallelizes tasks across a cluster, handling data partitioning and machine failures. The document also covers practical examples, including word count and inverted index, and details the roles of Mapper, Reducer, and Driver classes in a MapReduce job.

Map Reduce Programming
Contents
• Map-Reduce Programming
• Exercises
• Mappers & Reducers
• Hadoop combiners
• Hadoop partitioners
• MapReduce is a programming model and associated implementation for processing and generating large data sets.
• Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
• Many real-world tasks are expressible in this model.
• Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines.
• The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication.
• A typical MapReduce computation processes many terabytes of data on thousands of machines.
MapReduce Examples
#example1: word count using MapReduce

map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)              // e.g. (hi,1) (how,1) (hi,1) (you,1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0                // e.g. key = hi, values = (1,1)
  for each count v in values:
    result += v
  emit(key, result)
#example2: Counting words of different lengths
• Input file:
  hi how are you?
  Welcome to Nirma University.
• Output file:
  2:2, 3:3, 5:1, 7:1, 10:1
• How? The word lengths are hi:2, how:3, are:3, you:3, welcome:7, to:2, Nirma:5, University:10
• Mapper task: emit (2,hi), (2,to), (3,how), ...
• Reducer task: receives (2:[hi,to]), (3:[how,are,you]), ... and emits the count of words per length (a sketch follows below)
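A minimal Java sketch of this example, using the same Hadoop types the later slides introduce (the class names LengthMapper and LengthReducer are illustrative, not from the deck; each class would live in its own .java file with a driver like the one shown later):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (word length, word) for every word in the line, e.g. (2, hi), (3, how)
public class LengthMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      String word = token.replaceAll("\\W", "");   // drop punctuation such as '?' and '.'
      if (!word.isEmpty())
        context.write(new IntWritable(word.length()), new Text(word));
    }
  }
}

// Reducer: for each length, count how many words arrived, e.g. (2, [hi, to]) -> (2, 2)
public class LengthReducer extends Reducer<IntWritable, Text, IntWritable, IntWritable> {
  @Override
  protected void reduce(IntWritable length, Iterable<Text> words, Context context)
      throws IOException, InterruptedException {
    int count = 0;
    for (Text w : words) count++;
    context.write(length, new IntWritable(count));
  }
}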
Recap: HDFS
• HDFS: A specialized distributed file system
• Good for large amounts of data, sequential reads
• Bad for lots of small files, random access, non-append writes
• Architecture: Blocks, namenode, datanodes
• File data is broken into large blocks (64MB default)
• Blocks are stored & replicated by datanodes
• Single namenode manages all the metadata
• Secondary namenode: Housekeeping & (some) redundancy
• Usage: Special command-line interface
• Example: hadoop fs -ls /path/in/hdfs
Example
• Input: Dear, Bear, River, Car, Car, River, Deer, Car and Bear
• Word count job: the input is a text file, the output is the count of each word.
• Sample input file (File.txt, size 200 MB):
  Hi how are you
  how is your job
  how is your family
  how is your sister
  how is your brother
  what is the time now
  what is the strength of the Hadoop
• Input formats that turn the input file into (key, value) pairs:
  TextInputFormat
  SequenceFileInputFormat
  SequenceFileAsTextInputFormat
• Data flow (diagram): input file → input splits → record readers (emit (byte offset, record)) → mappers.
• The Java collections framework does not work with primitive types, so Java provides wrapper classes; collections work with objects, so an object of the wrapper class has to be created.
• Just as Java introduced a wrapper class for each primitive type, Hadoop has introduced Box classes (the Writable types).
• In Java the conversion from a primitive to its wrapper happens automatically (autoboxing), but in Hadoop the conversion has to be written explicitly, e.g. new IntWritable(int) to box an int and get() to get it back (see the sketch below).

  Primitive type   Java wrapper class   Hadoop Box class
  int              Integer              IntWritable
  long             Long                 LongWritable
  float            Float                FloatWritable
  double           Double               DoubleWritable
  String           String               Text
  …                …                    …
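A tiny standalone illustration of explicit boxing and unboxing with Hadoop's Box classes (the class name BoxDemo is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class BoxDemo {
  public static void main(String[] args) {
    IntWritable boxed = new IntWritable(42);   // explicit boxing: int -> IntWritable
    int unboxed = boxed.get();                 // explicit unboxing: IntWritable -> int
    Text word = new Text("hadoop");            // Hadoop's box class for strings
    System.out.println(unboxed + " " + word);
  }
}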
More Details
• Input: a set of key-value pairs
• Programmer specifies two methods:
• Map(k, v) → <k’, v’>*
• Takes a key-value pair and outputs a set of key-value pairs
• E.g., key is the filename, value is a single line in the file
• There is one Map call for every (k,v) pair

• Reduce(k’, <v’>*) → <k’, v’’>*


• All values v’ with the same key k’ are reduced together and processed in v’ order
• There is one Reduce function call per unique key k’

MapReduce: A Diagram

What is Map Reduce?
• Sum of squares (in functional-programming notation, with a Java equivalent below):
• (map square ‘(1 2 3 4))
  Output: (1 4 9 16)
• (reduce + ‘(1 4 9 16))
  Evaluates as (+ 16 (+ 9 (+ 4 1)))
  Output: 30
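The same map/reduce idea expressed with Java streams, as an illustrative aside (this is plain Java, not the Hadoop API):

import java.util.stream.IntStream;

public class SumOfSquares {
  public static void main(String[] args) {
    // map: square each element; reduce: sum the squares
    int result = IntStream.of(1, 2, 3, 4)
                          .map(x -> x * x)          // (1 4 9 16)
                          .reduce(0, Integer::sum); // 30
    System.out.println(result);
  }
}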
Mapper Class

• The first stage in data processing using MapReduce is the Mapper class. Here, the RecordReader processes each input record and generates the respective key-value pair. Hadoop saves this intermediate mapper output to the local disk.
• Input Split
  It is the logical representation of data. It represents a block of work that corresponds to a single map task in the MapReduce program.
• RecordReader
  It interacts with the input split and converts the obtained data into key-value pairs.
Reducer Class

• The intermediate output generated by the mapper is fed to the reducer, which processes it and generates the final output, which is then saved in HDFS.
Driver Class

• The major component in a MapReduce job is the Driver class. It is responsible for setting up a MapReduce job to run in Hadoop. We specify the names of the Mapper and Reducer classes along with the data types and their respective job names.
More examples
• Distributed grep – all lines matching a pattern
• Map: filter by pattern (a mapper sketch follows below)
• Reduce: output set
• Count URL access frequency
• Map: output each URL as key, with count 1
• Reduce: sum the counts
• Reverse web-link graph
• Map: output (target, source) pairs when a link to target is found in source
• Reduce: concatenate values and emit (target, list(source))
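A possible mapper for the distributed-grep example (a sketch; it assumes the pattern is passed through the job configuration under a made-up key "grep.pattern", and the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  private String pattern;

  @Override
  protected void setup(Context context) {
    // read the pattern from the job configuration (the key name is an assumption)
    pattern = context.getConfiguration().get("grep.pattern", "");
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // emit only the lines that match the pattern; the reducer can simply output the set
    if (value.toString().matches(".*" + pattern + ".*"))
      context.write(value, NullWritable.get());
  }
}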

What do we need to write a MR program?

• A mapper
• Accepts (key,value) pairs from the input
• Produces intermediate (key,value) pairs, which are then shuffled
• A reducer
• Accepts intermediate (key,value) pairs
• Produces final (key,value) pairs for the output
• A driver
• Specifies which inputs to use, where to put the outputs
• Chooses the mapper and the reducer to use
• Hadoop takes care of the rest!!
• Default behaviors can be customized by the driver
The Mapper
• The input format is (file offset, line); the intermediate format can be freely chosen.

import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;

public class WCMapper extends Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value, Context context)
      throws java.io.IOException, InterruptedException {
    context.write(new Text("foo"), value);
  }
}

• Extends the abstract 'Mapper' class
• Input/output types are specified as type parameters
• Implements a 'map' function
• Accepts a (key,value) pair of the specified type
• Writes output pairs by calling the 'write' method on the context
• Mixing up the types will cause problems at runtime (!)
The Reducer
• The intermediate format (the same as the mapper output) and the output format are given as type parameters.
• Note: we may get multiple values for the same key!

import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;

public class WCReducer extends Reducer<Text, Text, IntWritable, Text> {

  public void reduce(Text key, Iterable<Text> values, Context context)
      throws java.io.IOException, InterruptedException {
    for (Text value : values)
      context.write(new IntWritable(4711), value);
  }
}

• Extends the abstract 'Reducer' class
• Must specify the types again (they must be compatible with the mapper!)
• Implements a 'reduce' function
• Values are passed in as an 'Iterable'
• Caution: these are NOT normal Java classes. Do not store them in collections - their content can change between iterations!
The Driver

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WCDriver {

  public static void main(String[] args) throws Exception {
    Configuration c = new Configuration();
    Job j = Job.getInstance(c, "Word count Example");
    j.setJarByClass(WordCount.class);          // Mapper & Reducer are in the same jar as WCDriver
    j.setMapperClass(WCMapper.class);
    j.setNumReduceTasks(3);
    j.setReducerClass(WCReducer.class);
    j.setOutputKeyClass(Text.class);           // format of the (key,value) pairs
    j.setOutputValueClass(IntWritable.class);  // output by the reducer
    FileInputFormat.addInputPath(j, new Path(args[0]));     // input path
    FileOutputFormat.setOutputPath(j, new Path(args[1]));   // output path
    System.exit(j.waitForCompletion(true) ? 0 : 1);
  }
}

• Specifies how the job is to be executed
• Input and output directories; mapper & reducer classes
Find out the word-length histogram #example3
• A particular document is given, and we have to find out how many big, medium, small, and tiny words appear in it; this gives the word-length histogram.
• Categories (a mapper sketch follows below):
  Big: Yellow: 10+ letters
  Medium: Red: 5 to 9 letters
  Small: Blue: 2 to 4 letters
  Tiny: Pink: 1 letter
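A minimal mapper sketch for the histogram (the class and category names are illustrative); unlike #example2, it emits the category rather than the exact length, and the word-count reducer pattern shown later can then sum the 1s per category:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LengthHistogramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String word : value.toString().split("\\s+")) {
      int len = word.length();
      if (len == 0) continue;
      String category = (len >= 10) ? "big"
                      : (len >= 5)  ? "medium"
                      : (len >= 2)  ? "small" : "tiny";
      context.write(new Text(category), ONE);   // e.g. ("medium", 1)
    }
  }
}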
Inverted Index #example4
• Finding a given word, as a search engine does
• Input:
  Tweet1, “I love pancakes for breakfast”
  Tweet2, “I dislike pancakes”
  Tweet3, “What should I eat for breakfast?”
  Tweet4, “I love to eat”
• Output:
  pancakes (tweet1, tweet2)
  breakfast (tweet1, tweet3)
  eat (tweet3, tweet4)
  love (tweet1, tweet4)
• Find out the Mapper and Reducer (one possible sketch follows below)
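One possible solution sketch, assuming each input line looks like "Tweet1, I love pancakes ..." so the tweet id can be split off at the first comma (class names are illustrative; each class would go in its own .java file):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (word, tweetId) for every word of the tweet text
public class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split(",", 2);   // [tweet id, tweet text]
    if (parts.length < 2) return;
    Text tweetId = new Text(parts[0].trim());
    for (String word : parts[1].toLowerCase().split("\\W+"))
      if (!word.isEmpty())
        context.write(new Text(word), tweetId);
  }
}

// Reducer: collect the distinct tweet ids for each word
public class IndexReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text word, Iterable<Text> tweetIds, Context context)
      throws IOException, InterruptedException {
    Set<String> ids = new HashSet<>();      // copy the values: Hadoop reuses the Text objects
    for (Text id : tweetIds) ids.add(id.toString());
    context.write(word, new Text(String.join(",", ids)));
  }
}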
In Detail
• Hadoop Mapper
• Hadoop Reducer
• Key-Value Pairs
• Input Format
• Record Reader
• Partitioner
Mapper in Hadoop Map-Reduce
• How is key value pair generated in Hadoop?
1. Input Split
2. Record Reader
InputSplit
• InputSplit in Hadoop MapReduce is the logical representation of data.
It describes a unit of work that contains a single map task in a
MapReduce program.
• As a user, we don’t need to deal with InputSplit directly, because they
are created by an InputFormat
• The split size can be controlled with the mapred.min.split.size parameter in mapred-site.xml, or by overriding the parameter in the Job object used to submit a particular MapReduce job (see the sketch after this list).
• The client running the job calculates the splits for the job by calling getSplits(); the splits are then sent to the application master, which uses their storage locations to schedule map tasks that will process them on the cluster.
• Each map task then passes its split to the createRecordReader() method on the InputFormat to get a RecordReader for the split, and the RecordReader generates records (key-value pairs), which it passes to the map function.
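A hedged sketch of overriding the split size for a particular job from the driver, using FileInputFormat's setters (the 128 MB and 256 MB values are only examples):

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// in the driver (e.g. WCDriver.main), after the Job 'j' has been created:
FileInputFormat.setMinInputSplitSize(j, 128L * 1024 * 1024);   // no split smaller than 128 MB
FileInputFormat.setMaxInputSplitSize(j, 256L * 1024 * 1024);   // no split larger than 256 MB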
What is Hadoop InputFormat?
• The InputFormat class is one of the fundamental classes in the Hadoop
MapReduce framework which provides the following functionality:
1. The files or other objects that should be used for input are selected by the InputFormat.
2. The InputFormat defines the data splits, which define both the size of the individual map tasks and their potential execution servers.
3. The InputFormat defines the RecordReader, which is responsible for reading actual records from the input files.
How do we get the data to the mapper?

public abstract class InputFormat<K, V> {
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}
Types of InputFormat in MapReduce
Record Reader
• The MapReduce RecordReader in Hadoop takes the byte-oriented view of the input provided by the InputSplit and presents it as a record-oriented view to the Mapper.
• The map task passes the split to the createRecordReader() method on the InputFormat to obtain a RecordReader for that split. The RecordReader loads data from its source and converts it into key-value pairs suitable for reading by the mapper.
Types of Hadoop RecordReader in MapReduce
i. LineRecordReader
ii. SequenceFileRecordReader

• Maximum size for a single record:
  conf.setInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
• A line larger than this maximum value (the default is 2,147,483,647) will be ignored.
Hadoop Record Writer
• The RecordWriter writes the output key-value pairs from the Reducer phase to output files. Hadoop provides several output formats (choosing one is shown in the sketch after this list):
• TextOutputFormat
• SequenceFileOutputFormat
• SequenceFileAsBinaryOutputFormat
• MapFileOutputFormat
• MultipleOutputs
• LazyOutputFormat
• DBOutputFormat
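A small sketch of choosing one of these output formats in the driver (TextOutputFormat is already the default; the commented line shows how another format from the list would be selected):

import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// in the driver, after the Job 'j' has been created:
j.setOutputFormatClass(TextOutputFormat.class);            // the default: "key <tab> value" text files
// j.setOutputFormatClass(SequenceFileOutputFormat.class); // alternative: binary SequenceFile output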
Java Programs for the Word Count Problem
• Driver Code
• Mapper Code
• Reducer Code
$hadoop jar wc.jar wordcount i/p o/p

public class WCDriver {

  public static void main(String[] args)
      throws IOException, InterruptedException, ClassNotFoundException {
    Configuration c = new Configuration();
    Job j = Job.getInstance(c, "Word count Example");
    j.setJarByClass(WordCount.class);
    j.setMapperClass(WCMapper.class);
    j.setNumReduceTasks(3);
    j.setReducerClass(WCReducer.class);
    j.setOutputKeyClass(Text.class);
    j.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(j, new Path(args[0]));
    FileOutputFormat.setOutputPath(j, new Path(args[1]));
    System.exit(j.waitForCompletion(true) ? 0 : 1);
  }
}
Add the following lines // Importing libraries

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String x = value.toString();              // x = "hi how are you"
    for (String word : x.split(" ")) {        // ["hi", "how", "are", "you"]
      context.write(new Text(word), new IntWritable(1));
    }
  }
}
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable>

• We have created a class WCMapper that extends the class Mapper, which is already defined in the MapReduce framework.
• We define the data types of the input and output key/value pairs after the class declaration using angle brackets.
• Both the input and the output of the Mapper are key/value pairs.
• Input:
  • The key is nothing but the offset of each line in the text file: LongWritable
  • The value is each individual line: Text
• Output:
  • The key is the tokenized word: Text
  • The value is hardcoded in our case to 1: IntWritable
  • Example – Dear 1, Bear 1, etc.
• We have written Java code that tokenizes each word and assigns it a hardcoded value equal to 1.
Add the following lines // Importing libraries

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int count = 0;
    for (IntWritable val : values) {   // e.g. A -> <1,1,1>
      count += val.get();
    }
    context.write(key, new IntWritable(count));
  }
}
• We have created a class WCReducer which extends the class Reducer, like that of the Mapper.
• We define the data types of the input and output key/value pairs after the class declaration using angle brackets, as done for the Mapper.
• Both the input and the output of the Reducer are key-value pairs.
• Input:
  • The key is nothing but the unique words which have been generated after the sorting and shuffling phase: Text
  • The value is a list of integers corresponding to each key: IntWritable
  • Example – Bear, [1, 1], etc.
• Output:
  • The key is one of the unique words present in the input text file: Text
  • The value is the number of occurrences of that word: IntWritable
  • Example – Bear, 2; Car, 3, etc.
• We have aggregated the values present in the list corresponding to each key and produced the final answer.
• In general, a single reduce call is made for each unique key, but you can specify the number of reducers in mapred-site.xml (or via the job, as shown later).
Add the following lines // Importing libraries

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
NCDC data (weather data example; the figures, omitted here, show the data at each stage)
• Raw format of the weather data
• Mapper input data
• Mapper output data
• Reducer input data
• Reducer output data
Hadoop/Map Reduce Combiners
• When we run a MapReduce job on a large dataset, large chunks of intermediate data are generated by the Mapper, and this intermediate data is passed on to the Reducer for further processing, which leads to enormous network congestion. The MapReduce framework provides a facility known as the Hadoop Combiner that plays a key role in reducing this network congestion.
• The combiner in MapReduce is also known as a ‘mini-reducer’. The primary job of the Combiner is to process the output data from the Mapper before passing it to the Reducer. It runs after the mapper and before the Reducer, and its use is optional.
MapReduce program without Combiner (diagram)
MapReduce program with Combiner (diagram)
Advantages of MapReduce Combiner
• Hadoop Combiner reduces the time taken for data transfer between
mapper and reducer.
• It decreases the amount of data that needs to be processed by the reducer.
• The Combiner improves the overall performance of the reducer.
Disadvantages of MapReduce
Combiner
• MapReduce jobs cannot depend on the Hadoop Combiner's execution, because there is no guarantee that it will run.
• Hadoop stores the intermediate key-value pairs in the local filesystem and runs the combiner later, which causes expensive disk I/O.
Where to define it?
• In the Driver class:
  job.setCombinerClass(ReduceClass.class);
• Or as a separate Java file:
  public class CombinersHadoop extends Reducer<...> { ... }
(a concrete sketch for the word count job follows below)
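For the word count job shown earlier (WCMapper/WCReducer with Text and IntWritable types), the reducer's input and output types both match the mapper output, so the reducer itself can double as the combiner; a minimal sketch of the one extra driver line:

// in WCDriver.main(), next to setMapperClass()/setReducerClass():
j.setCombinerClass(WCReducer.class);   // runs WCReducer as a mini-reducer on each mapper's local output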


Hadoop Partitioner / MapReduce Partitioner
• Partitioning of the keys of the intermediate map output is controlled
by the Partitioner.
Need of the Hadoop MapReduce Partitioner
• A MapReduce job takes an input data set and produces a list of key-value pairs as the result of the map phase: the input data is split, each map task processes its split, and each map outputs a list of key-value pairs. The output of the map phase is then sent to the reduce tasks, which run the user-defined reduce function on the map outputs.

• Before the reduce phase, the map output is partitioned on the basis of the key and sorted.
• Partitioning groups all the values for each key together and makes sure that all the values of a single key go to the same reducer, which allows an even distribution of the map output over the reducers.
• Partitioner in Hadoop MapReduce redirects the mapper output to the
reducer by determining which reducer is responsible for the particular
key.
• The default partitioner in Hadoop MapReduce is HashPartitioner, which computes a hash value of the key and assigns the partition based on this result (see the sketch below).
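The behaviour of the default HashPartitioner boils down to the following logic (a minimal sketch of the idea, written here as a custom Partitioner):

import org.apache.hadoop.mapreduce.Partitioner;

public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // mask the sign bit so the hash is non-negative, then take it modulo the number of reducers
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}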
How many partitions?
• The total number of partitions is equal to the number of reducers, i.e. the Partitioner divides the data according to the number of reducers, which is set by the JobConf.setNumReduceTasks() method.
• Thus, the data in a single partition is processed by a single reducer, and a Partitioner is created only when there are multiple reducers.
Poor Partitioning in Hadoop MapReduce
• If, in the input data, one key appears far more often than any other key, then:
  1. The key appearing more often will be sent to one partition.
  2. All the other keys will be sent to partitions according to their hashCode().
• If the hashCode() method does not uniformly distribute the other keys over the partition range, data will not be evenly sent to the reducers.
• Poor partitioning of the data means that some reducers will have more input data than others, i.e. they will have more work to do than the other reducers. The entire job will then wait for one reducer to finish its extra-large share of the load.
• We can create a custom Partitioner, which allows sharing the workload uniformly across the reducers.
• Partitioner provides the getPartition() method, which you implement yourself if you want to declare a custom partitioning for your job. For example:

public static class MyPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numReduceTasks) {
    if (numReduceTasks == 0)
      return 0;
    if (key.equals(new Text("Male")))
      return 0;
    if (key.equals(new Text("Female")))
      return 1;
    return 0;   // default partition for any other key (added so the method always returns)
  }
}
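To use the custom partitioner, the driver has to register it and set a matching number of reducers; a hedged sketch of the extra driver lines (the variable name j follows the drivers shown earlier):

// in the driver, alongside setMapperClass()/setReducerClass():
j.setPartitionerClass(MyPartitioner.class);   // send "Male"/"Female" keys to fixed partitions
j.setNumReduceTasks(2);                       // one reducer per partition returned by getPartition()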
How to set the number of reducers?
• By default, the number of reducers is 1.
• If you call JobConf.setNumReduceTasks(0), the number of reducers is 0 and the job is executed using mappers only; no sorting and shuffling is applied.
• Methods to set the number of reducers:
  1. On the command line (bin/hadoop jar -Dmapreduce.job.maps=5 yourapp.jar ..)
     mapred.map.tasks --> mapreduce.job.maps (old property name --> new property name)
     mapred.reduce.tasks --> mapreduce.job.reduces
  2. In the code, by configuring JobConf variables:
     job.setNumMapTasks(5); // 5 mappers
     job.setNumReduceTasks(2); // 2 reducers
MapReduce: In Parallel
