Map Reduce

MapReduce is a software framework for processing large datasets in a distributed computing environment. It allows for parallel processing of data across clusters of computers. The framework includes Mapper and Reducer classes that can be customized by developers. Mappers process input records to generate intermediate key-value pairs, which are then shuffled and sorted for input to Reducers. Reducers consolidate these intermediate pairs to produce the final output data. MapReduce thus provides an easy way to write applications that process vast amounts of structured and unstructured data stored in Hadoop Distributed Filesystem.


Hadoop

MapReduce
By: Vishakha Sharma
7CSE 8X
A2305218493
What is MapReduce?

MapReduce is the data processing layer of Hadoop. It is a software framework for easily
writing applications that process vast amounts of structured and unstructured data stored
in the Hadoop Distributed Filesystem (HDFS). It processes huge amounts of data in
parallel by dividing the submitted job into a set of independent sub-tasks. This parallel
processing improves both the speed and the reliability of the cluster.

The framework also provides predefined Mapper and Reducer classes, which developers
extend and customize as per the organization's requirements.
Why MapReduce?
• The emergence of massive datasets presents both challenges and
opportunities in data storage and analysis.
• These "big data" workloads challenge traditional analytic tools and
increasingly require new solutions.
• MapReduce is used to write scalable applications that process large
amounts of data in parallel on large clusters of commodity hardware
servers.
Mapper

The Mapper is the initial piece of code that interacts with the input
dataset.
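As a minimal sketch (plain Python, not the actual Hadoop Java API), a word-count Mapper reads one input record and emits intermediate key-value pairs; the function name `word_count_mapper` is illustrative:

```python
# Illustrative word-count mapper: for each input record (a line of
# text), emit an intermediate (word, 1) key-value pair per word.
def word_count_mapper(record):
    for word in record.split():
        yield (word.lower(), 1)

pairs = list(word_count_mapper("Hadoop stores data and Hadoop processes data"))
# pairs contains ("hadoop", 1) twice and ("data", 1) twice
```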
Reducer

The Reducer is the second part of the MapReduce programming model.
The Mapper produces output in the form of key-value pairs, which
serves as input for the Reducer.
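Continuing the plain-Python sketch (the real Hadoop Reducer is a Java class), a word-count Reducer receives one key together with all the intermediate values grouped under that key and consolidates them into a single output pair:

```python
# Illustrative word-count reducer: sum all intermediate counts that
# were grouped under one key, producing the final (word, total) pair.
def word_count_reducer(key, values):
    return (key, sum(values))

result = word_count_reducer("hadoop", [1, 1, 1])
# result == ("hadoop", 3)
```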
MapReduce flow

input → Map → Shuffling and sorting → Reduce → output

• Input: The data for a MapReduce task is stored in input files, which
typically live in HDFS.
• Map: The Mapper processes each input record and generates a new
key-value pair; the pairs generated by the Mapper can be completely
different from the input pair.
• Shuffling and sorting: The Mapper output is shuffled to the reduce
nodes. Shuffling is the physical movement of the data, which is done
over the network.
• Reduce: The Reducer takes the set of intermediate key-value pairs
produced by the Mappers as input and runs a reducer function on each
of them to generate the output.
• Output: The final output is obtained after the Reduce stage.
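The whole flow above can be simulated in a few lines of plain Python (a single-process sketch; real Hadoop distributes these stages across cluster nodes, and the names `mapper`, `reducer`, and `run_job` are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def mapper(record):
    # Map: each input record becomes intermediate (word, 1) pairs.
    for word in record.split():
        yield (word.lower(), 1)

def reducer(key, values):
    # Reduce: consolidate all values grouped under one key.
    return (key, sum(values))

def run_job(records):
    intermediate = [pair for rec in records for pair in mapper(rec)]
    # Shuffle and sort: group pairs by key. In Hadoop this step
    # physically moves data over the network to the reduce nodes.
    intermediate.sort(key=itemgetter(0))
    return [reducer(k, [v for _, v in grp])
            for k, grp in groupby(intermediate, key=itemgetter(0))]

out = run_job(["big data", "big cluster"])
# out == [("big", 2), ("cluster", 1), ("data", 1)]
```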
Steps of data flow
THANK YOU
