Unit 2 Topic 4: MapReduce Basics
Dr. Anil Kumar Dubey
Associate Professor,
Computer Science & Engineering Department,
ABES EC, Ghaziabad
Affiliated to Dr. A.P.J. Abdul Kalam Technical University, Uttar
Pradesh, Lucknow
Basics of MapReduce
• MapReduce is a processing technique and a programming model for distributed
computing based on Java.
Resilience
• Each node periodically reports its status to the master node.
• If a slave node fails to send this notification, the master node reassigns the
currently running task of that slave node to other available nodes in the cluster.
Quick
• Data processing is quick because MapReduce uses HDFS as its storage system.
• MapReduce takes only minutes to process terabytes of unstructured data.
Parallel Processing
• In MapReduce, the job is divided among multiple nodes, and each node works
on its part of the job simultaneously.
• MapReduce is therefore based on the Divide and Conquer paradigm, which lets us
process the data on different machines.
• Because the data is processed by multiple machines in parallel rather than by a
single machine, the time taken to process the data is reduced by a
tremendous amount.
Availability
• Multiple replicas of the same data are sent to numerous nodes in
the network.
• Thus, in case of any failure, other copies are readily available for
processing without any loss.
Scalability
• Hadoop is a highly scalable platform.
• Traditional RDBMS systems do not scale with increasing data volume.
• MapReduce lets you run applications across a huge number of nodes,
processing terabytes and petabytes of data.
MapReduce Framework
• A MapReduce job usually splits the input dataset into independent
chunks, which are processed by the map tasks in a completely parallel
manner.
• The framework sorts the outputs of the maps, which are then input
to the reduce tasks.
• Typically both the input and the output of the job are stored in a file
system.
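As a concrete illustration, the following is a minimal driver sketch for the classic word count job, assuming the Hadoop Java API; the class names WordCountDriver, WordCountMapper, and WordCountReducer are illustrative assumptions (the mapper and reducer are sketched on the Mapper and Reducer slides below). Both the input and output paths point into the file system, typically HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver for a word count job; names are assumptions, not from the slides.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCountMapper.class);    // map phase
    job.setReducerClass(WordCountReducer.class);  // reduce phase

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Both input and output live in the (distributed) file system, e.g. HDFS.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}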
How MapReduce works
• MapReduce can perform distributed and parallel computations on
large datasets across a large number of nodes.
• HDFS is usually used to share the job files among the participating nodes.
Phases of the MapReduce model
• The MapReduce model has three major phases and one optional phase:
• Mapper
• Shuffle and Sort
• Reducer
• Combiner (optional)
Mapper
• It is the first phase of MapReduce programming and contains the
coding logic of the mapper function.
• This logic is applied to the ‘n’ data blocks spread across various
data nodes.
• The mapper function accepts key-value pairs (k, v) as input, where
the key represents the offset of each record within the input and the
value represents the entire record content.
• The output of the Mapper phase is also in key-value format,
written as (k’, v’).
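As a sketch (one of many possible mappers), a word count mapper using the Hadoop Java API could look like the following; the class name WordCountMapper is an illustrative assumption.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical word count mapper; generic types are <KEYIN, VALUEIN, KEYOUT, VALUEOUT>.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // key   = byte offset of this record within the input split
    // value = the entire record (one line of text)
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);   // emit (k', v') = (word, 1)
    }
  }
}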
Shuffle and Sort
• The outputs of the various mappers, the (k’, v’) pairs, then go into the
Shuffle and Sort phase.
• The pairs are sorted by key, and all values belonging to the same key are
grouped together.
• The output of the Shuffle and Sort phase is again key-value pairs, this
time as a key and an array of values (k, v[]).
• For example, in a word count job the mapper outputs ("the", 1), ("cat", 1),
("the", 1) are shuffled and sorted into ("cat", [1]) and ("the", [1, 1]).
Reducer
• The output of the Shuffle and Sort phase, (k, v[]), is the input
of the Reducer phase.
• In this phase the reducer function’s logic is executed and all the values
are aggregated against their corresponding keys.
• The Reducer consolidates the outputs of the various mappers and computes
the final job output.
• The final output is then written into a single file in an output
directory of HDFS.
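A matching sketch of a word count reducer, again using the Hadoop Java API; the class name WordCountReducer is an illustrative assumption.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical word count reducer; generic types are <KEYIN, VALUEIN, KEYOUT, VALUEOUT>.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // values = all counts grouped under this key by the Shuffle and Sort phase
    int sum = 0;
    for (IntWritable count : values) {
      sum += count.get();
    }
    total.set(sum);
    context.write(key, total);   // final (k, v) for this key, written to the job output
  }
}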
Combiner
• It is an optional phase in the MapReduce model.
• The Combiner phase is used to optimize the performance of
MapReduce jobs.
• In this phase, the outputs of the mappers are locally reduced
at the node level.
• For example, if different mapper outputs (k, v) coming from a
single node contain duplicate keys, then they get combined, i.e.
locally reduced, into a single (k, v[]) output.
• This makes the Shuffle and Sort phase work even quicker,
thereby improving the performance of MapReduce jobs.
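Continuing the hypothetical driver sketch from the MapReduce Framework slide (an assumption, not part of the original deck), the combiner is wired in with a single call; for word count, the reducer class can double as the combiner because summing counts is associative and commutative, so partial sums computed per node are safe to merge later.

// Added in the driver sketch above, before submitting the job:
// run a local reduce on each mapper node before the Shuffle and Sort phase.
job.setCombinerClass(WordCountReducer.class);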
THANK YOU