
What is MapReduce in Hadoop?

MapReduce is a software framework and programming model used for processing huge amounts of data. MapReduce programs work in two phases, namely Map and Reduce.

• Map tasks deal with the splitting and mapping of data, while Reduce tasks shuffle and reduce the data.

• Hadoop is capable of running MapReduce programs written in Java.

MapReduce programs are parallel in nature, and are thus very useful for performing large-scale data analysis using multiple machines in a cluster.

The input to each phase is a set of key-value pairs. In addition, every programmer needs to specify two functions: a map function and a reduce function.
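To make this concrete, here is a minimal sketch of those two user-defined functions for the word-count example used throughout this page, written in Java against Hadoop's org.apache.hadoop.mapreduce API. The class names TokenizerMapper and IntSumReducer are illustrative choices, not names required by Hadoop.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: for every word in an input split, emit the pair <word, 1>.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // e.g. <Hadoop, 1>
      }
    }
  }

  // Reduce phase: sum all the 1s collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);  // e.g. <Hadoop, 3>
    }
  }
}

The framework calls map once per record in a split and reduce once per distinct key, so neither function needs to know how the data was partitioned.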

MapReduce Architecture in Big Data Explained in Detail

The whole process goes through four phases of execution, namely splitting, mapping, shuffling, and reducing.

Let's understand this with a MapReduce example.

Consider the following input data for your MapReduce in Big Data program:

Welcome to Hadoop Class
Hadoop is good
Hadoop is bad
[Figure: MapReduce Architecture]

The final output of the MapReduce task is:

bad      1
Class    1
good     1
Hadoop   3
is       2
to       1
Welcome  1

The data goes through the following phases of MapReduce in Big Data:

Input Splits

An input to a MapReduce in Big Data job is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map task.

Mapping

This is the very first phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function to produce output values.

In our example, the job of the mapping phase is to count the number of occurrences of each word from the input splits and prepare a list in the form of <word, frequency>, as shown below.
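Assuming, purely for illustration, that each input line lands in its own split, the Mapping phase emits:

Split 1 (Welcome to Hadoop Class) → <Welcome, 1>, <to, 1>, <Hadoop, 1>, <Class, 1>
Split 2 (Hadoop is good)          → <Hadoop, 1>, <is, 1>, <good, 1>
Split 3 (Hadoop is bad)           → <Hadoop, 1>, <is, 1>, <bad, 1>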

Shuffling

This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output. In our example, the same words are clubbed together along with their respective frequencies, as illustrated below.
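Continuing the illustration, shuffling groups all the values emitted for each key:

<bad, [1]>, <Class, [1]>, <good, [1]>, <Hadoop, [1, 1, 1]>, <is, [1, 1]>, <to, [1]>, <Welcome, [1]>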

Reducing

In this phase, output values from the Shuffling phase are aggregated. This phase combines the values from the Shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset.

In our example, this phase aggregates the values from the Shuffling phase, i.e., it calculates the total occurrences of each word, as shown below.

• Unlike the map output, the reduce output is stored in HDFS.
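Applied to the grouped pairs above, the reduce function simply sums each value list, for example:

<Hadoop, [1, 1, 1]> → <Hadoop, 3>
<is, [1, 1]>        → <is, 2>

and likewise for the words that occur only once.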

How MapReduce Organizes Work

Now let's learn how MapReduce organizes the work. Hadoop divides the job into tasks. There are two types of tasks:

• Map tasks (Splits & Mapping)
• Reduce tasks (Shuffling, Reducing)

The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities:

• Jobtracker: acts like a master (responsible for the complete execution of the submitted job)
• Multiple Task Trackers: act like slaves, each of them performing part of the job

For every job submitted for execution in the system, there is one Jobtracker that resides on the Namenode, and there are multiple Tasktrackers which reside on Datanodes.
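To see where job submission happens in code, below is a minimal driver sketch that configures and submits the word-count job from earlier using Hadoop's standard Job API. The class name WordCountDriver and the command-line path arguments are illustrative assumptions, not fixed by Hadoop. (In Hadoop 2 and later, YARN's ResourceManager and NodeManagers have taken over the Jobtracker/Tasktracker roles, but jobs are submitted the same way.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver class: configures the job and hands it to the cluster.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");

    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);  // optional local pre-aggregation
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS

    // Submit the job and wait; the framework schedules the map and reduce tasks.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}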

How Hadoop MapReduce Works

• A job is divided into multiple tasks, which are then run on multiple data nodes in a cluster.

• It is the responsibility of the job tracker to coordinate the activity by scheduling tasks to run on different data nodes.

• Execution of an individual task is then looked after by the task tracker, which resides on every data node executing part of the job.

• The task tracker's responsibility is to send progress reports to the job tracker. In addition, the task tracker periodically sends a 'heartbeat' signal to the Jobtracker so as to notify it of the current state of the system.

Thus, the job tracker keeps track of the overall progress of each job. In the event of a task failure, the job tracker can reschedule it on a different task tracker.
