
Hadoop MapReduce - Data Flow

Last Updated : 04 Aug, 2025

MapReduce is Hadoop's processing framework for handling large-scale data across distributed machines. Unlike traditional systems that pull data to a central server for processing, it moves the computation to the nodes where the data already resides in HDFS.

Hadoop MapReduce follows a simple yet powerful processing model: it breaks large datasets into smaller chunks and processes them in parallel across a cluster. This flow, from input splitting through mapping, shuffling and sorting to reducing, makes data processing scalable, fault-tolerant and efficient on distributed systems.

Below is the workflow of Hadoop MapReduce with a simple data flow diagram.

[Diagram: Hadoop MapReduce data flow]

How MapReduce Works (Step-by-Step)

1. Input Split

The input data (e.g., a big log file or dataset) is divided into smaller chunks called Input Splits. Each split is processed independently by a separate Mapper.

Example: If you have a 1 GB file and Hadoop splits it into four 256 MB chunks, it will launch 4 Mappers, one for each chunk.
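
As a rough sketch (a configuration fragment, not a complete program), the maximum split size, and therefore the Mapper count, can be tuned in the driver; the 256 MB value below simply mirrors the example above:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Cap each input split at 256 MB; a 1 GB file then yields ~4 splits,
// and hence ~4 Mappers (assuming the standard FileInputFormat behavior)
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024);

By default, the split size follows the HDFS block size (128 MB in recent Hadoop versions), so a file's split count usually matches its block count.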

2. Mapper Phase

Each Mapper runs in parallel on different nodes and processes one input split.

What it does:

  • Reads the input data line by line
  • Transforms each line into key-value pairs
  • Stores the intermediate output locally (not yet in HDFS)

Example: For a word-count job, the Mapper reads the line "Data is power" and emits:

("Data", 1), ("is", 1), ("power", 1)

3. Shuffling & Sorting

This is a behind-the-scenes phase handled by Hadoop after mapping is done.

What it does:

  • Shuffles intermediate key-value pairs across the cluster
  • Groups all values with the same key
  • Sorts them by key before sending to the Reducer

Example: From all Mappers, these pairs:

("Data", 1), ("Data", 1), ("power", 1)

are grouped into:

("Data", [1, 1]), ("power", [1])

4. Reducer Phase

Each Reducer receives a list of values for each unique key.

What it does:

  • Applies aggregation logic (e.g., sum, average, filter)
  • Generates the final key-value output
  • Stores the result in HDFS

Example: For word count, the Reducer gets:

("Data", [1, 1]) --> outputs: ("Data", 2)

The final output is saved in HDFS in files named like part-r-00000, one per Reducer.
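
To tie the phases together, a minimal driver wires the (illustrative) Mapper and Reducer classes from the sketches above into a Job; the input and output paths are taken from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // gets part-r-* files
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each Reducer writes one part-r-NNNNN file into the output directory.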

