Understanding MapReduce
MapReduce
Key/value pairs
The Hadoop Java API for MapReduce
The Mapper class
The Reducer class
Presented by Gopikrishna PP
UNDERSTANDING MAPREDUCE FUNDAMENTALS
■ MapReduce is a software framework and programming model used for processing huge amounts of data. MapReduce programs work in two phases, namely Map and Reduce. Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and reduce the data.
■ The input to each phase is key-value pairs. In addition, every programmer needs to specify two functions: the map function and the reduce function.
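Conceptually, the two functions have the following general form (the standard MapReduce type notation, added here for clarity; it is not from the original slides):

map: (k1, v1) → list(k2, v2)
reduce: (k2, list(v2)) → list(k2, v3)

The mapper transforms each input record into zero or more intermediate pairs, and the reducer folds the list of values collected under each intermediate key into the final output.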
Phases of execution
■ The whole process goes through four phases of execution:
1. Mapper
This is the first phase of MapReduce programming and contains the coding logic of the mapper function.
The conditional logic is applied to the ‘n’ data blocks spread across the various data nodes.
The mapper function accepts key-value pairs (k, v) as input, where the key represents the offset address of each record and the value represents the entire record content.
The output of the Mapper phase is also in key-value format, as (k’, v’).
■ 2. Shuffle and Sort
The output of the various mappers (k’, v’) then goes into the Shuffle and Sort phase.
Values with duplicate keys are merged: the values are grouped together based on their common keys.
The output of the Shuffle and Sort phase is again key-value pairs, as a key and an array of values (k, v[]).
For example, the mapper outputs ("apple", 1), ("banana", 1), ("apple", 1) are grouped into ("apple", [1, 1]) and ("banana", [1]).
■ 3. Reducer
The output of the Shuffle and Sort phase (k, v[]) becomes the input of the Reducer phase.
In this phase the reducer function’s logic is executed and all the values are aggregated against their corresponding keys.
The reducer consolidates the outputs of the various mappers and computes the final job output.
The final output is then written into a single file in an output directory of HDFS.
■ 4. Combiner
In this phase, the various outputs of the mappers are locally reduced at the node level.
For example, if different mapper outputs (k, v) coming from a single node contain duplicates, they get combined, i.e. locally reduced into a single (k, v[]) output.
This phase makes the Shuffle and Sort phase work even quicker, thereby enabling additional performance gains.
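In the Hadoop Java API, the combiner is optional and is registered on the job; when the reduce logic is associative and commutative (as in a word count), the reducer class itself is commonly reused as the combiner. A minimal sketch, assuming a configured Job object named job and a WordCountReducer class like the one shown later in this section:

// Optional: locally reduce each mapper's output before the shuffle
job.setCombinerClass(WordCountReducer.class);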
■ The following illustration shows how Twitter manages its tweets with the help of MapReduce.
■ As shown in the illustration, the MapReduce algorithm performs the following actions −
Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.
Count − Generates a token counter per word.
Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.
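As an illustration of the Tokenize and Filter steps, a hedged mapper sketch follows; the class name, the stop-word list, and the tokenization scheme are hypothetical, not from the original slides:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TweetTokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Hypothetical stop-word list for the Filter step
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "to"));
    private static final IntWritable ONE = new IntWritable(1);
    private final Text token = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize: split the tweet text into tokens
        StringTokenizer tokens = new StringTokenizer(value.toString().toLowerCase());
        while (tokens.hasMoreTokens()) {
            String t = tokens.nextToken();
            if (!STOP_WORDS.contains(t)) {   // Filter: drop unwanted words
                token.set(t);
                context.write(token, ONE);   // feed (token, 1) pairs to the Count step
            }
        }
    }
}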
Intermediate keys
The key-value pairs generated by the mapper are known as intermediate keys.
Mapper in MapReduce
■ Map-Reduce is a programming model that is mainly divided into two phases: the Map phase and the Reduce phase. It is designed to process data in parallel, with the data divided across various machines (nodes).
■ Hadoop Java programs consist of a Mapper class and a Reducer class, along with a driver class.
Understand the Mapper in Map-Reduce:
Mapper is a simple user-defined program that performs some operations on input splits as per its design.
Mapper is a base class that needs to be extended by the developer or programmer in their own code according to the organization’s requirements. The input and output key/value types are declared as type arguments to the Mapper class, which the developer fills in.
■ For Example:
class MyMapper extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
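Filling in the four type arguments gives a complete Mapper. The following word-count sketch uses the standard org.apache.hadoop.mapreduce API; the class name WordCountMapper is hypothetical:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// KEYIN = LongWritable (record offset), VALUEIN = Text (record content),
// KEYOUT = Text (word), VALUEOUT = IntWritable (count)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);  // emit an intermediate (k', v') pair
        }
    }
}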
Five components
Mapper is the first code that interacts with the input dataset.
The Mapper mainly consists of 5 components:
Input
Input Splits
Record Reader
Map
Intermediate output disk
• Here, in the above image, we can observe that there are multiple Mappers generating key-value pairs as output. The output of each mapper is sent to the sorter, which sorts the key-value pairs by key.
• Shuffling also takes place during the sorting process, and the output is sent to the Reducer, where the final output is produced.
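For the Reducer side (listed in the agenda), a matching hedged sketch: this reducer sums the grouped values (k, v[]) produced by the shuffle and sort; the class name WordCountReducer is hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {  // values is the grouped v[] for this key
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);  // emit (word, total count)
    }
}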
The Hadoop Java API for MapReduce
The MapReduce API consists of the classes and interfaces that define the different jobs in MapReduce.
The Hadoop MapReduce API is implemented in Java, so MapReduce applications are generally Java-based.
In this section, we focus on the MapReduce APIs. Here, we learn about the classes and methods used in MapReduce programming.
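Tying the pieces together, a minimal driver-class sketch; it assumes the hypothetical WordCountMapper and WordCountReducer classes from the earlier examples, with input and output paths passed on the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a driver is typically packaged into a jar and launched with hadoop jar wordcount.jar WordCountDriver <input> <output>.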