Hadoop MapReduce – Data Flow
Last Updated: 30 Jul, 2020
Map-Reduce is a processing framework used to process data across a large number of machines. Hadoop uses Map-Reduce to process the data distributed over a Hadoop cluster. Map-Reduce is not like conventional processing frameworks such as Hibernate, JDK, .NET, etc. Those frameworks are designed for traditional systems where the data is stored in a single location, such as a Network File System or an Oracle database. But when we process big data, the data sits on multiple commodity machines with the help of HDFS.
So when the data is stored on multiple nodes, we need a processing framework that can copy the program to the locations where the data is present, i.e., it ships the program to all the machines that hold the data. This is where Map-Reduce comes into the picture for processing data on Hadoop over a distributed system. Hadoop would otherwise suffer from heavy cross-switch network traffic because of the massive volume of data, so Map-Reduce comes with a feature called Data Locality: the ability to move the computation closer to where the data actually resides on the machines.
Since Hadoop is designed to work on commodity hardware, it uses Map-Reduce because it is widely accepted and provides an easy way to process data over multiple nodes. Map-Reduce is not the only framework for parallel processing, though. Nowadays Spark is also a popular framework for distributed computing, and HAMA and MPI are other distributed processing frameworks.

Let’s Understand Data-Flow in Map-Reduce
Map-Reduce consists of a Map phase and a Reduce phase. The Mapper is used for transformation, while the Reducer is used for aggregation-type operations. The terms Map and Reduce are borrowed from functional programming languages such as Lisp and Scala. A Map-Reduce program has three main components: the Driver code, the Mapper (for transformation), and the Reducer (for aggregation).
Let’s take an example where you have a 10 TB file to process on Hadoop. The 10 TB of data is first distributed across multiple nodes with HDFS. To process it we use the Map-Reduce framework, starting with the Driver code, which is called a Job. If we are using the Java programming language to process the data on HDFS, then we initiate the Driver class with a Job object. If the framework were a car, the Driver code would be its start button: we need to run the Driver code to take advantage of the Map-Reduce framework.
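As a minimal sketch of such a Driver, assuming a word-count style job (the class names WordCountMapper and WordCountReducer and the input/output paths are illustrative, not taken from this article), the Job object can be configured and submitted like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The Job object is the "start button" of the framework:
        // it carries the Mapper, Reducer, and I/O settings to the cluster.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);    // transformation
        job.setReducerClass(WordCountReducer.class);  // aggregation

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```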
The framework also provides predefined Mapper and Reducer classes, which developers extend and modify as per the organization's requirements.
Brief Working of Mapper
The Mapper is the first piece of code that interacts with the input dataset. Suppose the dataset we are analyzing has 100 data blocks; in that case, 100 Mapper programs (processes) run in parallel on the machines (nodes), and each produces its own output, known as the intermediate output, which is stored on the local disk, not on HDFS. The output of the Mappers acts as input for the Reducer, which performs sorting and aggregation operations on the data and produces the final output.
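Sticking with the hypothetical word-count example used for the driver sketch above, a Mapper might emit one (word, 1) pair for every word in its input split; these pairs form the intermediate output buffered on the local disk:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One Mapper task runs per input split, in parallel across the cluster.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line of text itself
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);  // intermediate (word, 1) pair
            }
        }
    }
}
```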
Brief Working of Reducer
The Reducer is the second part of the Map-Reduce programming model. The Mapper produces its output in the form of key-value pairs, which then serve as input for the Reducer. But before these intermediate key-value pairs reach the Reducer, a shuffle-and-sort step groups and orders them by key. The output generated by the Reducer is the final output, which is then stored on HDFS (Hadoop Distributed File System). The Reducer mainly performs computation operations such as addition, filtering, and aggregation.
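Continuing the same hypothetical word-count example, the Reducer receives each word together with all of the counts emitted for it (already shuffled and sorted by key) and simply adds them up:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Called once per distinct key after the shuffle-and-sort phase.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();  // aggregate all counts for this word
        }
        context.write(key, new IntWritable(sum));  // final output, written to HDFS
    }
}
```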

Steps of Data-Flow:
- One input split is processed at a time. The Mapper is overridden by the developer according to the business logic, and these Mappers run in parallel on all the machines in the cluster.
- The intermediate output generated by the Mappers is stored on the local disk and shuffled to the Reducers for the reduce task.
- Once the Mappers finish their tasks, the output is sorted, merged, and provided to the Reducer.
- The Reducer performs reducing tasks such as aggregation and other compositional operations, and the final output is then stored on HDFS in a part-r-00000 file (created by default); a short sketch of reading this file back is shown after the list.
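As a small follow-up to the last step, the sketch below shows one way to read that part-r-00000 file back from HDFS with the Hadoop FileSystem API. The output directory /user/hadoop/output is only an assumed example path, so adjust it to whatever path the driver used:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadJobOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Default reducer output file inside the job's output directory
        // ("/user/hadoop/output" is just an assumed example path).
        Path result = new Path("/user/hadoop/output/part-r-00000");

        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(result)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);  // each line is a key<TAB>value pair
            }
        }
    }
}
```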