Map Reduce

This document provides an overview of MapReduce, describing how it divides large data processing tasks into smaller sub-tasks that can be run in parallel across clusters of computers. It explains that MapReduce programs take input as lists and output lists, and use Map and Reduce functions to distribute the workload. The Map function produces intermediate key-value pairs that get shuffled and sorted before being passed to the Reduce function to produce the final output. It also describes how MapReduce can be used to analyze large datasets like social media usage or trading firm reconciliations.
Map Reduce

By
Isha Shrestha
Jebina Maharjan
Manisha Bhandari
Sarah Gorkhali
Introduction
● MapReduce is designed to process large amounts of data in
parallel by dividing the work.
● The whole job is taken from the user, divided into smaller
tasks, and the tasks are assigned to worker nodes.
● MapReduce programs take a list as input and produce a list
as output.
Why Map Reduce
● Distribute the load
● Condense big data and extract meaningful
information.
Working of Map Reduce

MapReduce consists of two functions: Map and Reduce.

They are sequenced one after the other.
● The Map function takes input from the disk as <key,value> pairs, processes them, and
produces another set of intermediate <key,value> pairs as output.
● The Reduce function also takes inputs as <key,value> pairs, and produces
<key,value> pairs as output.
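The flow above can be sketched as a minimal in-memory word count. This is an illustrative example only, not tied to any particular framework; the function names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical:

```python
from collections import defaultdict

def map_fn(_, line):
    # Emit an intermediate <word, 1> pair for every word in the line.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Aggregate all values that share the same key.
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: produce intermediate <key, value> pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle and sort: group all values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reduce call per key, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

records = [(0, "to be or not to be")]
print(run_mapreduce(records, map_fn, reduce_fn))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In a real cluster the map calls run on different machines and the shuffle moves data over the network, but the logical steps are the same.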
Map

● The input data is first split into smaller blocks. Each block is
then assigned to a mapper for processing.
● For example, if a file has 100 records to be processed, 100
mappers can run in parallel to process one record each, or
50 mappers can process two records each. The Hadoop
framework decides how many mappers to use, based on the
size of the data to be processed and the memory available
on each mapper server.
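The splitting step can be sketched as simple chunking. This is illustrative only (the helper `split_records` is hypothetical); Hadoop's real input splits follow HDFS block boundaries rather than record counts:

```python
def split_records(records, num_mappers):
    # Divide the input into roughly equal blocks, one per mapper.
    size = -(-len(records) // num_mappers)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]

records = list(range(100))
splits = split_records(records, 50)
print(len(splits), len(splits[0]))  # 50 blocks of 2 records each
```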
Reduce

● After all the mappers complete processing, the framework
shuffles and sorts the results before passing them on to
the reducers. A reducer cannot start while a mapper is
still in progress. All the map output values that have the
same key are assigned to a single reducer, which then
aggregates the values for that key.
Combine and Partition

There are two intermediate steps between Map and Reduce.


Combine
● It is an optional process.
● The combiner is a reducer that runs individually on each mapper server.
● It reduces the data on each mapper further to a simplified form before passing it
downstream.
● This makes shuffling and sorting easier as there is less data to work with.
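Local combining on a single mapper's output can be sketched as follows. The input pairs are hypothetical word-count output; the combiner applies the same aggregation the reducer would, but only to this one mapper's data:

```python
from collections import Counter

def combine(mapper_output):
    # Pre-aggregate <word, 1> pairs locally before the shuffle,
    # so fewer pairs are sent across the network.
    totals = Counter()
    for word, count in mapper_output:
        totals[word] += count
    return sorted(totals.items())

mapper_output = [("be", 1), ("to", 1), ("be", 1), ("to", 1), ("or", 1)]
print(combine(mapper_output))  # [('be', 2), ('or', 1), ('to', 2)]
```

Five intermediate pairs shrink to three, which is exactly the saving the combiner provides at scale.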
Partition

● It is the process that translates the <key, value> pairs resulting from mappers to
another set of <key, value> pairs to feed into the reducer.
● It decides how the data has to be presented to the reducer and also assigns it to a
particular reducer.
● The default partitioner determines the hash value for the key, resulting from the
mapper, and assigns a partition based on this hash value. There are as many partitions
as there are reducers. So, once the partitioning is complete, the data from each
partition is sent to a specific reducer.
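The default behaviour described above amounts to hashing the key modulo the number of reducers. This sketch uses a CRC32 hash for deterministic illustration; Hadoop's default HashPartitioner actually uses the key's Java hashCode:

```python
import zlib

def partition(key, num_reducers):
    # Hash the key and map it to one of the reducer partitions.
    # The same key always lands in the same partition.
    return zlib.crc32(key.encode()) % num_reducers

num_reducers = 3
for word in ["be", "not", "or", "to"]:
    print(word, "-> reducer", partition(word, num_reducers))
```

Because the mapping depends only on the key, every occurrence of a given key, from every mapper, reaches the same reducer.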
Implementation
MapReduce programs can be written in Java, C, C++, Python, Ruby, Perl, etc.
Uses
It can be used with any complex problem that can be solved through
parallelization.
● A social media site could use it to determine how many new sign-ups it
received over the past month from different countries, to gauge its growing
popularity across geographies.
● A trading firm could perform its batch reconciliations faster and also
determine which scenarios often cause trades to break.
● Search engines could determine page views, and marketers could perform
sentiment analysis using MapReduce.
