
MapReduce 

MapReduce programs are designed to compute large volumes of data in
a parallel fashion, which requires dividing the workload across a large
number of machines (nodes). The basic notion of MapReduce is to divide a
task into subtasks, handle the subtasks in parallel, and combine the
results of the subtasks to form the final output.
MapReduce consists of two key functions: the Mapper and the Reducer.
The Mapper is a function that processes the input data and produces several
small chunks of intermediate data. The input to the mapper function is in the
form of (key, value) pairs, even though the input to a MapReduce program is
a file or directory (which is stored in HDFS).
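
As an illustration only, here is a minimal word-count style Mapper written
against the Hadoop Java API. The class and variable names (WordCountMapper,
ONE, word) are illustrative choices, not part of the original text.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative Mapper: reads the input one line at a time and emits (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   = byte offset of the line within the file (typically ignored)
        // value = the contents of one line
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate (key, value) output
        }
    }
}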

Working of Mapper in MapReduce:

1. The input data from the user is passed to the Mapper as described by
an InputFormat. The InputFormat is specified in the driver code (a driver
sketch is shown after this list). It defines the location of the input data,
such as a file or directory on HDFS, and it determines how to split the
input data into input splits.
2. Each Mapper deals with a single input split. A RecordReader, an object
that is part of the InputFormat, is used to extract (key, value) records
from the input source (the split data).
3. The Mapper processes the input (key, value) pairs and produces output
that is also in the form of (key, value) pairs. The output from the Mapper
is called the intermediate output.
4. The Mapper may use or completely ignore the input key. For
example, a standard pattern is to read a file one line at a time. The key
is the byte offset into the file at which the line starts. The value is the
contents of the line itself. Typically the key is considered irrelevant. If
the Mapper writes anything out, the output must be in the form of
key/value pairs.
5. The output from the Mapper (the intermediate keys and their value
lists) is passed to the Reducer in sorted key order.
6. The Reducer outputs zero or more final key/value pairs, which are
written to HDFS. The Reducer usually emits a single key/value pair for
each input key (see the Reducer sketch after this list).
7. If a Mapper appears to be running more slowly than, or lagging behind,
the others, a new instance of the Mapper is started on another machine,
operating on the same data. The results of whichever Mapper finishes first
are used, and Hadoop kills the Mapper instance that is still running.
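
The following sketch, again only illustrative and assuming the word-count
Mapper shown earlier, pairs a Reducer with a driver that specifies the
InputFormat and the input/output locations on HDFS. The class names and the
job name are made up for the example.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative Reducer: sums the values for each key and writes one final pair per key.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // The intermediate values for one key arrive together, in sorted key order.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // final output, written to HDFS
    }
}

// Illustrative driver: the InputFormat and the HDFS input/output paths are set here.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setInputFormatClass(TextInputFormat.class);           // how input is read and split
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input file or directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory on HDFS

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In practice each class would live in its own file; the job would then be
packaged into a jar and submitted to the cluster with the hadoop jar command.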
The number of map tasks in a MapReduce program depends on the
number of data blocks in the input file. For example, if the block size is
128 MB and the input data is 1 GB in size, the input is split into
1024 MB / 128 MB = 8 blocks, so there will be 8 map tasks. The number of
map tasks increases as the input data grows, so parallelism increases,
which results in faster processing of the data.
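
As a small sanity check of the arithmetic above, assuming one input split
per full HDFS block (the class name here is made up for the example):

// Tiny sketch of the split-count arithmetic; not part of the Hadoop API.
public class SplitCountExample {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;                        // 128 MB per block
        long inputSize = 1024L * 1024 * 1024;                       // 1 GB of input data
        long mapTasks  = (inputSize + blockSize - 1) / blockSize;   // ceiling division
        System.out.println(mapTasks);                               // prints 8
    }
}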
