HDFS Unit 4


MapReduce

Unit 4
Introduction
MapReduce is a processing technique and a programming model for distributed computing, based on Java.

The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of
data and converts it into another set of data, where individual elements are broken down into tuples
(key/value pairs).

Secondly, the Reduce task takes the output of a Map as its input and combines those data tuples
into a smaller set of tuples. As the ordering of the name MapReduce implies, the Reduce task is always
performed after the Map job.
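
For a concrete illustration (the classic word-count example, not taken from these slides), the Map step emits one (word, 1) pair per word, and the Reduce step then sums the counts for each word:

map("to be or not to be")  ->  (to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
reduce("be", [1, 1])       ->  (be, 2)
reduce("to", [1, 1])       ->  (to, 2)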
Introduction
The major advantage of MapReduce is that it is easy to scale data processing over multiple computing
nodes.

Under the MapReduce model, the data processing primitives are called mappers and reducers.

Decomposing a data processing application into mappers and reducers is sometimes nontrivial.

But, once we write an application in the MapReduce form, scaling the application to run over hundreds,
thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This
simple scalability is what has attracted many programmers to use the MapReduce model.
Data Flow in MapReduce
MapReduce is used to process huge amounts of data. To handle this data in a parallel and
distributed manner, the data flows through several phases.
Phases of MapReduce Dataflow
Input reader
The input reader reads the incoming data and splits it into data blocks of an appropriate size (64 MB to 128 MB). Each
data block is associated with a Map function. Once the input reader has read the data, it generates the corresponding
key-value pairs. The input files reside in HDFS.

Map function
The Map function processes the incoming key-value pairs and generates the corresponding output key-value pairs. The Map
input and output types may differ from each other.
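
As a minimal sketch (a standard word-count style mapper using the org.apache.hadoop.mapreduce API; the class name TokenizerMapper is chosen here for illustration), a Map function could look like this:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key/value: (byte offset, line of text); output key/value: (word, 1).
// Note that the input types (LongWritable, Text) differ from the output types (Text, IntWritable).
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1) for every word in the line
        }
    }
}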

Partition function
The partition function assigns the output of each Map function to the appropriate reducer. It is given the key (and value)
and returns the index of the reducer.
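
As a rough sketch of a custom partition function (the class name AlphabetPartitioner and the a-m split are invented for illustration, and the job is assumed to run with two reducers):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys starting with a-m to reducer 0 and all other keys to reducer 1.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        if (s.isEmpty()) {
            return 0;                       // defensive default for empty keys
        }
        char first = Character.toLowerCase(s.charAt(0));
        int bucket = (first >= 'a' && first <= 'm') ? 0 : 1;
        return bucket % numPartitions;      // stay within the configured number of reducers
    }
}

A job would register it with job.setPartitionerClass(AlphabetPartitioner.class) and job.setNumReduceTasks(2).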
Phases of MapReduce Dataflow
Shuffling and Sorting
The data is shuffled between and within nodes so that it moves out of the Map phase and becomes ready for the Reduce
function. Shuffling the data can sometimes take considerable time. A sorting operation is then performed on the input to the
Reduce function: the keys are compared using a comparison function and arranged in sorted order.

Reduce function
The Reduce function is invoked once for each unique key. The keys arrive in sorted order, and the Reduce function iterates
over the values associated with each key and generates the corresponding output.
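
A minimal Reduce function matching the mapper sketched earlier (again the standard word-count example; the class name IntSumReducer is an illustrative choice) could be:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Called once per unique key with an Iterable over all values grouped under that key.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();                 // iterate over the values for this key
        }
        result.set(sum);
        context.write(key, result);         // emit (word, total count)
    }
}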

Output writer
Once the data has flowed through all the above phases, the output writer executes. Its role is to write the Reduce output
to stable storage.
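
Tying the phases together, a driver class might wire up the input reader, mapper, combiner, reducer, and output writer roughly as follows (this assumes the TokenizerMapper and IntSumReducer sketched above; paths and class names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input reader: the default TextInputFormat splits the HDFS files under this path
        // and feeds (offset, line) records to the mappers.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Output writer: each reducer writes a part-r-NNNNN file under this HDFS path.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}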
MapRed vs MapReduce
The two packages provide the input/output formats and the mapper and reducer base classes for the
corresponding Hadoop mapred and mapreduce APIs.
MapRed is the old API, used in Hadoop version 1: org.apache.hadoop.mapred.
The second is used in Hadoop version 2, where YARN was introduced; that version of the API is called
MapReduce: org.apache.hadoop.mapreduce.
MapRed and MapReduce are different APIs, but their functionality is almost the same.
The one major difference is that the old API was capable of pushing records to the mapper/reducer.
The new API, though, includes a few advancements that are lacking in MapRed and that accompanied
the upgrade to Hadoop version 2.
The new API is cleaner and faster. The old API was deprecated, but the deprecation was later reverted.
Which API is better to use depends on your tasks.
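
The difference is easiest to see in code. Below is a sketch of the same trivial mapper in both APIs (shown in one listing for comparison; in practice the two classes would live in separate files, and the class names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Old API (org.apache.hadoop.mapred): Mapper is an interface; the framework pushes each
// record into map() together with an OutputCollector and a Reporter.
class OldApiMapper extends MapReduceBase
        implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter) throws IOException {
        out.collect(value, new IntWritable(1));
    }
}

// New API (org.apache.hadoop.mapreduce): Mapper is a class; all interaction with the
// framework goes through a single Context object.
class NewApiMapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, new IntWritable(1));
    }
}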
Mapper→Combiner → Partitioner
The sequence of execution of the mentioned components happens in the below order:

Mapper -> Combiner -> Partitioner

Mapper: The input data is initially processed by all the Mappers/Map tasks, and the intermediate output is created.

Combiner: The Combiner optimizes the intermediate outputs by local aggregation before the shuffle/sort phase. The
primary goal of Combiners is to save as much bandwidth as possible by minimizing the number of key/value pairs that will be
shuffled across the network and provided as input to the Reducer.

Partitioner: In Hadoop, partitioning of the keys of the intermediate map output is controlled by the Partitioner. A hash
function is used to derive the partition. Each map output is partitioned on the basis of its key: records with the same key go
into the same partition (within each mapper), and each partition is then sent to a Reducer. The partition phase takes place
between the mapper and the reducer.

The default Partitioner (HashPartitioner) computes a hash value for the key and assigns the partition based on this result.
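
For reference, the default HashPartitioner behaves roughly like the sketch below (the real class is org.apache.hadoop.mapreduce.lib.partition.HashPartitioner; this re-implementation is for illustration only):

import org.apache.hadoop.mapreduce.Partitioner;

public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the hash is non-negative, then take the modulus,
        // so that every record with the same key lands in the same reduce partition.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}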
Apache Ambari
It provides a highly interactive dashboard that allows administrators to visualize the progress and status of every application
running over the Hadoop cluster.

Its flexible and scalable user interface allows a range of tools such as Pig, MapReduce, Hive, etc. to be installed on the cluster
and administers their performance in a user-friendly fashion. Some of the key features of this technology are:

● Instantaneous insight into the health of the Hadoop cluster using preconfigured operational metrics

● User-friendly configuration providing an easy step-by-step guide for installation

● Installation of Apache Ambari is possible through Hortonworks Data Platform (HDP)

● Monitoring dependencies and performances by visualizing and analyzing jobs and tasks

● Authentication, authorization, and auditing by installing Kerberos-based Hadoop clusters

● Flexible and adaptive technology fitting perfectly in the enterprise environment


Important Links
https://home.cs.colorado.edu/~kena/classes/5448/s11/presentations/hadoop.pdf

Working of Ambari UI

https://docs.cloudera.com/HDPDocuments/Ambari-2.6.2.0/bk_ambari-operations/bk_ambari-operations.pdf
