Understanding MapReduce
MAPPER
1. RecordReader
Function: Converts a byte-oriented view of the input into a record-oriented view.
Input Split: Data is divided into smaller chunks (input splits) before being
passed to the mapper.
Output: Presents data as key-value pairs to the mapper.
o The key typically represents positional information (e.g., an offset
in the file).
o The value represents a chunk of data (e.g., a line in a text file).
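To make this concrete, here is a minimal Python sketch (not the Hadoop API itself) of what a line-based RecordReader does: it walks a byte stream and emits (byte offset, line) pairs for the mapper.

```python
import io

def line_records(stream):
    """Yield (byte_offset, line) pairs from a byte stream,
    mimicking how a text RecordReader turns bytes into records."""
    offset = 0
    for raw in stream:
        # Key: byte offset where the line starts; value: the line text.
        yield offset, raw.decode("utf-8").rstrip("\n")
        offset += len(raw)

data = io.BytesIO(b"hello world\nfoo bar\n")
records = list(line_records(data))
# records[0] is (0, "hello world"); records[1] is (12, "foo bar")
```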
2. Map
Core Function: The mapper function processes the input key-value pairs
produced by RecordReader and generates zero or more intermediate
key-value pairs.
Logic: The transformation logic is user-defined and varies depending on
the problem.
o For example, in word count applications, the mapper generates
(word, 1) for each word found.
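The word-count mapper mentioned above can be sketched in a few lines of Python (illustrative only; a real Hadoop mapper would implement the Java Mapper interface):

```python
def word_count_mapper(offset, line):
    """Emit an intermediate (word, 1) pair for every word in the line.
    The input key (byte offset) is ignored for word count."""
    for word in line.split():
        yield word.lower(), 1

pairs = list(word_count_mapper(0, "the quick the"))
# pairs is [("the", 1), ("quick", 1), ("the", 1)]
```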
3. Combiner (Optional)
Purpose: Acts as a local reducer to aggregate mapper output before
sending it to the reducer.
Performance Benefit: Reduces the amount of data transferred over the
network, saving bandwidth and disk space.
Functionality: Combines multiple intermediate key-value pairs (e.g.,
summing counts for words) before sending them to the reducer.
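A combiner for word count can be sketched as a local sum over one mapper's output, assuming (as the text notes) that the operation is a simple aggregation:

```python
from collections import defaultdict

def combine(pairs):
    """Locally sum counts per key on the mapper node, shrinking the
    data before it crosses the network to the reducers."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

combined = combine([("the", 1), ("quick", 1), ("the", 1)])
# combined is [("quick", 1), ("the", 2)] -- 3 pairs shrunk to 2
```

Note that a combiner is only safe when the reduce operation is commutative and associative (like summation), since Hadoop may apply it zero, one, or many times.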
4. Partitioner
Function: Divides intermediate key-value pairs into partitions (shards)
and assigns each partition to a reducer.
Key Assignment: Ensures that all intermediate pairs with the same key are sent to the same reducer.
Data Storage: The partitioned data is written to the local disk and pulled
by the corresponding reducer for further processing.
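The default assignment strategy can be sketched as hashing the key modulo the number of reducers, which mirrors the behavior of Hadoop's HashPartitioner (the sketch below uses a CRC32 checksum purely so the example is deterministic; Hadoop uses the key's hashCode):

```python
import zlib

def partition(key, num_reducers):
    """Map a key to a reducer index in [0, num_reducers).
    Identical keys always hash to the same partition, so every
    occurrence of a key ends up at the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

p1 = partition("the", 3)   # from mapper 1
p2 = partition("the", 3)   # from mapper 2 -- same key, same partition
```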
Reducer
1. Shuffle and Sort
Function: The shuffle phase fetches the reducer's assigned partition from every mapper's output and copies it to the reducer's local machine.
Sorting: Data is sorted by keys to group similar keys together. This
grouping is necessary so the reducer can process all values associated
with a key in a single pass.
Purpose: Ensures that all key-value pairs for a particular key are
processed together, facilitating efficient reduction.
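The merge-sort-group sequence above can be sketched as follows (a simplified in-memory stand-in for Hadoop's external merge sort):

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(mapper_outputs):
    """Merge pairs fetched from all mappers, sort them by key, and
    group values so the reducer sees each key exactly once with
    all of its values together."""
    merged = sorted((pair for output in mapper_outputs for pair in output),
                    key=itemgetter(0))
    return [(key, [v for _, v in group])
            for key, group in groupby(merged, key=itemgetter(0))]

groups = shuffle_and_sort([[("b", 1), ("a", 1)], [("a", 2)]])
# groups is [("a", [1, 2]), ("b", [1])]
```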
2. Reduce
Core Task: The reducer iterates through the sorted data, applies user-defined logic, and processes one key-value group at a time.
Operations: It can perform operations like aggregation, filtering, and
combining. For example, in a word count problem, it aggregates word
counts from all mappers.
Output: The output can be zero or more key-value pairs, depending on
the logic applied in the reduce function.
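For the word-count case, the reduce function is a sum over each key's grouped values; a minimal sketch:

```python
def sum_reducer(key, values):
    """Collapse all counts for one key into a single total.
    Called once per key group produced by shuffle-and-sort."""
    yield key, sum(values)

# "the" appeared with counts [1, 2] across mappers:
result = list(sum_reducer("the", [1, 2]))
# result is [("the", 3)]
```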
3. Output Format
Writing the Output: The default format separates each key and value with a tab and writes the final results to a file in the Hadoop Distributed File System (HDFS).
Custom Formatting: Users can customize the output format as needed.
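The default tab-separated layout described above can be sketched as (writing to an in-memory buffer here instead of HDFS):

```python
import io

def write_tab_separated(groups, out):
    """Write one 'key<TAB>value' line per reduced pair, like the
    default text output format described above."""
    for key, value in groups:
        out.write(f"{key}\t{value}\n")

buf = io.StringIO()
write_tab_separated([("quick", 1), ("the", 3)], buf)
# buf now holds "quick\t1\nthe\t3\n"
```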