Hadoop MapReduce - Data Flow
Last Updated: 04 Aug, 2025
MapReduce is a Hadoop processing framework that efficiently handles large-scale data across distributed machines. Unlike traditional systems, it works directly on data stored across nodes in HDFS.
Hadoop MapReduce follows a simple yet powerful data processing model that breaks large datasets into smaller chunks and processes them in parallel across a cluster. This flow from input splitting to mapping, shuffling and reducing ensures scalable, fault-tolerant and efficient data processing over distributed systems.
Below is the step-by-step workflow of Hadoop MapReduce.
How MapReduce Works (Step-by-Step)
1. Input Splitting
The input data (e.g., a large log file or dataset) is divided into smaller chunks called Input Splits. Each split is processed independently by a separate Mapper.
Example: If you have a 1 GB file and Hadoop splits it into four 256 MB chunks, it will use four Mappers, one for each chunk.
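The split calculation can be sketched in plain Python (a simulation of the idea, not Hadoop's actual `InputFormat` code; `plan_splits` is a hypothetical helper):

```python
import math

def plan_splits(file_size_bytes, split_size_bytes):
    """Return (start, length) byte ranges, one per input split / Mapper."""
    n = math.ceil(file_size_bytes / split_size_bytes)
    return [(i * split_size_bytes,
             min(split_size_bytes, file_size_bytes - i * split_size_bytes))
            for i in range(n)]

# A 1 GB file with a 256 MB split size yields 4 splits -> 4 Mappers
GB = 1024 * 1024 * 1024
MB = 1024 * 1024
print(len(plan_splits(1 * GB, 256 * MB)))  # 4
```

Note that if the file size is not an exact multiple of the split size, the last split is simply shorter, so no data is lost.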
2. Mapper Phase
Each Mapper runs in parallel on different nodes and processes one input split.
What it does:
- Reads the input data line by line
- Transforms each line into key-value pairs
- Stores the intermediate output locally (not yet in HDFS)
Example: For word count, the Mapper reads the line "Data is power" and emits:
("Data", 1), ("is", 1), ("power", 1)
3. Shuffling & Sorting
This is a behind-the-scenes phase handled by Hadoop after mapping is done.
What it does:
- Shuffles intermediate key-value pairs across the cluster
- Groups all values with the same key
- Sorts them by key before sending to the Reducer
Example: From all Mappers, these pairs:
("Data", 1), ("Data", 1), ("power", 1)
are grouped into:
("Data", [1, 1]), ("power", [1])
4. Reducer Phase
Each Reducer receives a list of values for each unique key.
What it does:
- Applies aggregation logic (e.g., sum, average, filter)
- Generates the final key-value output
- Stores the result in HDFS
Example: For word count, the Reducer gets:
("Data", [1, 1]) --> outputs: ("Data", 2)
The final output is saved in files like: part-r-00000
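The reduce step can be sketched the same way (a simulation; a real Hadoop Reducer is a Java class whose output Hadoop writes to HDFS files such as part-r-00000):

```python
def reducer(key, values):
    """Word-count reduce function: sum the occurrence counts for one key."""
    return (key, sum(values))

grouped = [("Data", [1, 1]), ("power", [1])]
print([reducer(k, v) for k, v in grouped])
# [('Data', 2), ('power', 1)]
```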