Unit 5 B
HDFS is Hadoop's distributed file system for storing big data. MapReduce is the
framework for processing vast amounts of data across the Hadoop cluster in a
distributed manner. YARN is responsible for managing resources among the
applications in the cluster.
MapReduce
A MapReduce job is the unit of work the client wants to perform. It consists of the
input data, the MapReduce program, and the configuration information. Hadoop runs
a MapReduce job by dividing it into two types of tasks: map tasks and reduce tasks.
Hadoop YARN schedules these tasks, and they run on the nodes in the cluster.
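As an illustration, here is a minimal Java driver sketch of how a client might put these three pieces together (the class names WordCountDriver, WordCountMapper, and WordCountReducer and the input/output paths are illustrative; the mapper and reducer themselves are sketched below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();               // configuration information
    Job job = Job.getInstance(conf, "word count");          // the unit of work
    job.setJarByClass(WordCountDriver.class);               // the MapReduce program
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // the input data
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit and wait
  }
}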
If a task fails due to unfavorable conditions such as hardware or software problems,
it is automatically rescheduled on a different node.
The user defines a map function and a reduce function to perform the MapReduce
job. Both the input to the map function and the output from the reduce function are
key-value pairs.
The function of the map task is to load, parse, filter, and transform the data. The
output of the map task becomes the input to the reduce task, which then performs
grouping and aggregation on it.
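A minimal WordCount sketch of the two user-defined functions (both classes are shown together for brevity; each would normally live in its own file):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: parse each input line and emit an intermediate (word, 1) pair per word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);   // zero or more intermediate key-value pairs
    }
  }
}

// Reduce: group and aggregate, summing the counts for each word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));   // final (word, count) pair
  }
}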
1. Map phase
a. RecordReader
Hadoop divides the input to a MapReduce job into fixed-size pieces called input
splits, or simply splits. The RecordReader transforms a split into records: it parses
the data into records but does not parse the record contents themselves. The
RecordReader provides the data to the mapper function as key-value pairs.
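For example, with the default TextInputFormat, the RecordReader (a LineRecordReader) turns each line of the split into one record, so the mapper receives it already parsed:

// Inside a Mapper<LongWritable, Text, Text, IntWritable> subclass,
// such as the WordCountMapper sketched above:
@Override
protected void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  // key   = byte offset of the line within the file, set by the RecordReader
  // value = the contents of the line, set by the RecordReader
}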
b. Map
In the map phase, Hadoop creates one map task per input split. The task runs the
user-defined map function for each record in the split and generates zero or more
intermediate key-value pairs as its output.
The map task writes its output to the local disk, not to HDFS. This intermediate
output is then processed by the reduce tasks, which run a user-defined reduce
function to produce the final output. Once the job completes, the intermediate map
output is discarded.
c. Combiner
The input to a single reduce task is the output from all the mappers, that is, the
output from all map tasks. Hadoop allows the user to define a combiner function that
runs on the map output.
The combiner groups the data on the map side before passing it to the reducer: it
combines the output of the map function, and its output is then passed as input to
the reduce function.
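In the WordCount sketch above, the reducer can double as the combiner, because summing counts is associative and commutative; since Hadoop may run the combiner zero, one, or many times, it must not change the final result. The driver line would be:

job.setCombinerClass(WordCountReducer.class);   // pre-aggregate on the map side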
d. Partitioner
When there are multiple reducers, the map tasks partition their output, each creating
one partition per reduce task. Each partition can contain many keys and their
associated values, but all the records for any given key are in a single partition.
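A sketch of a custom Partitioner that mirrors the behavior of Hadoop's default HashPartitioner (the class name WordPartitioner is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // The same key always hashes to the same partition, hence the same reducer.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
// In the driver:
// job.setPartitionerClass(WordPartitioner.class);
// job.setNumReduceTasks(4);   // one partition per reduce task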
2. Reduce phase
a. Shuffle and Sort
The framework shuffles the intermediate map outputs to the reducers and sorts
them by key into one merged list, so that the reduce task can iterate over each key's
values easily. The shuffle and sort are performed by the framework automatically;
the developer can control how the keys get sorted and grouped by supplying
comparator objects.
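The hooks for this are on the Job object (MyKeyComparator and MyGroupComparator are hypothetical RawComparator implementations, shown only to name the calls):

job.setSortComparatorClass(MyKeyComparator.class);       // controls how keys are sorted
job.setGroupingComparatorClass(MyGroupComparator.class); // controls which keys are
                                                         // grouped into one reduce() call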
b. Reduce
The reducer runs the user-defined reduce function once per key grouping. The
reduce function can filter, aggregate, and combine the data in several different ways.
Once the reduce task completes, it emits zero or more key-value pairs to the
OutputFormat. The reduce task output is stored in Hadoop HDFS.
c. OutputFormat
It takes the reducer output and writes it to an HDFS file through the RecordWriter.
By default, it separates the key and value with a tab character and each record with
a newline character.
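For example, with the default TextOutputFormat, the tab separator can be changed through a configuration property (the comma below is just an example value):

job.getConfiguration().set("mapreduce.output.textoutputformat.separator", ",");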
YARN
YARN stands for Yet Another Resource Negotiator. It is the resource management
layer of Hadoop, introduced in Hadoop 2.
YARN is designed around the idea of splitting the functionalities of resource
management and job scheduling into separate daemons. The basic idea is to have a
global ResourceManager and a per-application ApplicationMaster, where an
application is either a single job or a DAG of jobs.
1. ResourceManager
It has two main components: the Scheduler and the ApplicationsManager.
a. Scheduler
● The Scheduler allocates resources to the various applications running in
the cluster, subject to constraints such as capacities and queues.
● It is a pure scheduler: it does not monitor or track the status of the
application.
● The Scheduler offers no guarantees about restarting tasks that fail,
whether due to application failure or hardware failure.
● It performs scheduling based on the resource requirements of the
applications; a job can be directed to a specific scheduler queue, as
sketched below.
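A sketch of pointing a MapReduce job at a particular scheduler queue (this assumes the cluster admin has configured a queue named "dev"):

// Submit the job to a specific scheduler queue.
// Assumes a queue named "dev" exists on the cluster.
job.getConfiguration().set("mapreduce.job.queuename", "dev");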
b. ApplicationsManager
The ApplicationsManager accepts job submissions, negotiates the first container for
executing the application's ApplicationMaster, and restarts the ApplicationMaster
container if it fails.
2. NodeManager
The NodeManager is the per-machine agent. It launches the containers allocated on
its node, monitors their resource usage (CPU, memory, disk, network), and reports
this to the ResourceManager.
3. ApplicationMaster
Each application has its own ApplicationMaster. It negotiates containers from the
ResourceManager's Scheduler and works with the NodeManagers to execute and
monitor the application's tasks.