Unit 2, Topic 4: Map Reduce

The document provides an overview of the MapReduce framework, detailing its processing technique and programming model for distributed computing, which consists of two main tasks: Map and Reduce. It explains how data is processed in parallel across multiple nodes, with examples such as word counting, and highlights the benefits of MapReduce, including fault tolerance, resilience, and scalability. Additionally, it outlines the phases of the MapReduce model, including Mapper, Shuffle and Sort, Reducer, and the optional Combiner phase.


Map Reduce framework and basics
Dr. Anil Kumar Dubey
Associate Professor,
Computer Science & Engineering Department,
ABES EC, Ghaziabad
Affiliated to Dr. A.P.J. Abdul Kalam Technical University, Uttar
Pradesh, Lucknow
Basics of MapReduce
• MapReduce is a processing technique and a programming model for distributed
computing based on Java.

• The algorithm contains two important tasks:
• Map
• Reduce
• The Reduce task is always performed after the Map task.


Conti…
• Map takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key/value pairs).
• The Reduce task takes the output from a map as its input and combines
those data tuples into a smaller set of tuples.
Example: A Word Count
• Let us have a text file called example.txt whose contents are as
follows:
Deer Bear River Car Car River Deer Car Bear

• Now, suppose we have to perform a word count on
example.txt using MapReduce.

• So, we will be finding the unique words and the number of
occurrences of those unique words.
Conti…
[Figure: the word-count example, showing the three input splits flowing through the map, shuffle and sort, and reduce stages]
• First, we divide the input into three splits as shown in the figure.
This will distribute the work among all the map nodes.

• Then, we tokenize the words in each of the mappers and give a
hardcoded value (1) to each of the tokens or words.

• The rationale behind giving a hardcoded value equal to 1 is that
every word, in itself, will occur once.
Conti…
• Now, a list of key-value pairs will be created where the key is
nothing but the individual word and the value is one. So, for the first
line (Deer Bear River) we have 3 key-value pairs: Deer, 1; Bear, 1;
River, 1. The mapping process remains the same on all the nodes.

• After the mapper phase, a partition process takes place where
sorting and shuffling happen so that all the tuples with the same
key are sent to the corresponding reducer.
Conti…
• So, after the sorting and shuffling phase, each reducer will have a
unique key and a list of values corresponding to that very key. For
example, Bear, [1,1]; Car, [1,1,1], etc.
• Now, each Reducer counts the values present in that list of values.
As shown in the figure, the reducer gets the list of values [1,1] for
the key Bear. Then, it counts the number of ones in that list and
gives the final output as Bear, 2.
• Finally, all the output key/value pairs are then collected and
written in the output file.
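
• As an illustration only, the flow above can be sketched as a toy, single-JVM Java simulation. This is not the Hadoop API: the class name WordCountSimulation and the in-memory lists stand in for real splits and cluster nodes.

import java.util.*;

// Toy, single-JVM simulation of the word-count flow described above.
// Real MapReduce runs the map and reduce steps on separate cluster nodes.
public class WordCountSimulation {
    public static void main(String[] args) {
        // The three input splits from the example
        List<String> splits = List.of("Deer Bear River", "Car Car River", "Deer Car Bear");

        // Map phase: emit (word, 1) for every token in every split
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String split : splits) {
            for (String word : split.split("\\s+")) {
                mapped.add(Map.entry(word, 1));
            }
        }

        // Shuffle and sort: group the 1s by key
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce phase: sum the list of 1s for each key
        grouped.forEach((word, ones) ->
            System.out.println(word + ", " + ones.stream().mapToInt(Integer::intValue).sum()));
        // Prints: Bear, 2   Car, 3   Deer, 2   River, 2
    }
}

• In a real cluster, each split would be handled by a separate mapper task and each key group by a reducer task.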
Benefits of MapReduce
Fault tolerance
• If a machine holding a few data blocks fails in the middle of a MapReduce job,
the architecture handles the failure.
• It uses the replicated copies of those blocks on alternate machines for further
processing.

Resilience
• Each node periodically updates its status to the master node.
• If a slave node doesn’t send its notification, the master node reassigns the
currently running task of that slave node to other available nodes in the cluster.
Conti…
Quick
• Data processing is quick as MapReduce uses HDFS as the storage system.
• MapReduce takes only minutes to process terabytes of unstructured data.

Parallel Processing
• In MapReduce, we are dividing the job among multiple nodes and each
node works with a part of the job simultaneously.
• So, MapReduce is based on the Divide and Conquer paradigm, which helps us
process the data using different machines.
• As the data is processed in parallel by multiple machines instead of a single
machine, the time taken to process the data is reduced tremendously.
Conti…
Availability
• Multiple replicas of the same data are sent to numerous nodes in
the network.
• Thus, in case of any failure, other copies are readily available for
processing without any loss.

Scalability
• Hadoop is a highly scalable platform.
• Traditional RDBMS systems do not scale well as the data volume increases.
• MapReduce lets you run applications across a huge number of nodes,
using terabytes and petabytes of data.
Map Reduce Framework
• A MapReduce job usually splits the input data-set into independent
chunks which are processed by the map tasks in a completely parallel
manner.
• The framework sorts the outputs of the maps, which are then input
to the reduce tasks.
• Typically both the input and the output of the job are stored in a file system.
How Map Reduce works
• MapReduce can perform distributed and parallel computations
using large datasets across a large number of nodes.

• A MapReduce job usually splits the input datasets and then
processes each of them independently through the Map tasks in a
completely parallel manner.

• The output is then sorted and fed as input to the Reduce tasks.


Conti…
• Both job input and output are stored in file systems.

• Tasks are scheduled and monitored by the framework.

• The MapReduce architecture contains two core components as
daemon services, responsible for running mapper and reducer
tasks, monitoring them, and re-executing tasks on failure. From Hadoop
2 onwards, the Resource Manager and the Node Manager are these daemon
services.
Conti…
• When the job client submits a MapReduce job, these daemons
come into action. They are also responsible for parallel processing
and fault-tolerance features of MapReduce jobs.

• From Hadoop 2 onwards, the resource management and the job scheduling and
monitoring functionalities are segregated into different daemons by YARN
(Yet Another Resource Negotiator).
Conti…
• Compared to Hadoop 1 with Job Tracker and Task Tracker, Hadoop
2 contains a global Resource Manager (RM) and Application
Masters (AM) for each application.

• Job Client submits the job to the Resource Manager.

• The YARN Resource Manager’s scheduler is responsible for coordinating
the allocation of cluster resources among the running applications.
Conti…
• The YARN Node Manager runs on each node and performs node-level
resource management, coordinating with the Resource Manager.
It launches and monitors the compute containers on its machine
in the cluster.

• The Application Master negotiates resources from the Resource Manager
and uses the Node Managers to run and coordinate the MapReduce tasks.

• HDFS is usually used to share the job files between the other entities.
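
• As a sketch of the job-client side, a minimal word-count driver using Hadoop's org.apache.hadoop.mapreduce API might look as follows. WordCountMapper and WordCountReducer are illustrative names, referring to the classes sketched in the Mapper and Reducer phase sections below, and the input and output paths are placeholders passed on the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal word-count driver: the job client that submits the job to the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);    // sketched in the Mapper phase section
        job.setReducerClass(WordCountReducer.class);  // sketched in the Reducer phase section
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory holding example.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory, must not already exist

        // Submits the job (to the Resource Manager under YARN) and waits for completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}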
Phases of the MapReduce model
• The MapReduce model has three major phases and one optional phase:
• Mapper
• Shuffle and Sort
• Reducer
• Combiner
Conti…
Mapper
• It is the first phase of MapReduce programming and contains the
coding logic of the mapper function.
• This logic is applied to the ‘n’ data blocks
spread across various data nodes.
• The Mapper function accepts key-value pairs (k, v) as input, where
the key represents the offset address of each record and the value
represents the entire record content.
• The output of the Mapper phase will also be in the key-value
format, as (k’, v’).
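
• A minimal mapper sketch for the word-count example, assuming Hadoop's org.apache.hadoop.mapreduce API (the class name WordCountMapper is illustrative): the framework supplies the byte offset of each record as the key and the record itself as the value, and the mapper emits (word, 1) pairs.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// (k, v) = (offset of the record, record content) -> (k', v') = (word, 1)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the record and emit (word, 1) for every token.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}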
Conti…
Shuffle and Sort
• The output of various mappers (k’, v’), then goes into Shuffle and
Sort phase.
• The duplicate keys are merged, and all the values are
grouped together under their common key.
• The output of the Shuffle and Sort phase will be key-value pairs
again, as a key and an array of values (k, v[]).
Conti…
Reducer
• The output of the Shuffle and Sort phase (k, v[]) will be the input
of the Reducer phase.
• In this phase, the reducer function’s logic is executed and all the values
are aggregated against their corresponding keys.
• Reducer consolidates outputs of various mappers and computes
the final job output.
• The final output is then written into a single file in an output
directory of HDFS.
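
• A matching reducer sketch for the word-count example, again assuming Hadoop's org.apache.hadoop.mapreduce API (the class name WordCountReducer is illustrative): it receives (word, [1, 1, ...]) and emits (word, total count).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// (k, v[]) = (word, [1, 1, ...]) -> (k, v) = (word, total count)
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();        // aggregate all the values for this key
        }
        result.set(sum);
        context.write(key, result);    // e.g. (Bear, 2)
    }
}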
Conti…
Combiner
• It is an optional phase in the MapReduce model.
• The combiner phase is used to optimize the performance of
MapReduce jobs.
• In this phase, various outputs of the mappers are locally reduced
at the node level.
• For example, if the mapper outputs (k, v) coming from a
single node contain duplicate keys, they get combined, i.e.
locally reduced into a single output per key.
• This makes the Shuffle and Sort phase work even quicker,
thereby improving the performance of MapReduce jobs.
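
• For word count, the reduce logic (summing) is both associative and commutative, so the same reducer class can be registered as the combiner. This reuse is an assumption that holds for word count but not for every job; as a sketch, it is a single extra line in the driver shown earlier.

// In the WordCountDriver sketched earlier, after setting the mapper and reducer:
job.setCombinerClass(WordCountReducer.class);  // locally sums the (word, 1) pairs on each mapper node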
THANK YOU
