Unit5 B

Hadoop follows a master-slave topology: the master nodes assign tasks to the slave nodes. The architecture comprises three layers: HDFS, YARN, and MapReduce.

HDFS is the distributed file system in Hadoop for storing big data. MapReduce is the framework for processing vast data in the Hadoop cluster in a distributed manner. YARN is responsible for managing resources amongst the applications in the cluster.

MapReduce

It is the data processing layer of Hadoop: a software framework for writing applications that process vast amounts of data (terabytes to petabytes in range) in parallel on clusters of commodity hardware.

The MapReduce framework works on the <key, value> pairs.

A MapReduce job is the unit of work the client wants to perform. It mainly consists of the input data, the MapReduce program, and the configuration information. Hadoop runs a MapReduce job by dividing it into two types of tasks: map tasks and reduce tasks. Hadoop YARN schedules these tasks, which then run on the nodes in the cluster.
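As a minimal illustration (not from the original notes), a driver program in Java wires these three ingredients together. The WordCount example and its class names are assumptions made for illustration; sketches of WordCountMapper and WordCountReducer appear under the Map and Reduce phases below.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical WordCount driver: input data + MapReduce program +
    // configuration information, submitted as one job.
    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // configuration information
        Job job = Job.getInstance(conf, "word count");  // the MapReduce program
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // map tasks run this class
        job.setReducerClass(WordCountReducer.class);    // reduce tasks run this class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // the input data
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output location in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }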

If a task fails due to unfavorable conditions, it is automatically rescheduled on a different node.

The user defines the map function and the reduce function for performing the MapReduce job.

The input to the map function and the output from the reduce function are <key, value> pairs.

The map tasks load, parse, filter, and transform the data. The output of the map tasks becomes the input to the reduce tasks, which then perform grouping and aggregation on it.

A MapReduce job is carried out in two phases-

1. Map phase

a. RecordReader
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or simply splits. The RecordReader transforms each split into records: it parses the data into record boundaries but does not interpret the contents of the records themselves. The RecordReader then provides the records to the map function as key-value pairs.
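As a concrete example, continuing the hypothetical driver above: the InputFormat set on the job determines which RecordReader runs. With TextInputFormat, the default, the records handed to the mapper are (byte offset in file, line text) pairs.

    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // TextInputFormat is the default InputFormat; its RecordReader
    // (LineRecordReader) turns each split into (byte offset, line text)
    // records, which become the mapper's input key-value pairs.
    job.setInputFormatClass(TextInputFormat.class);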

b. Map

In the map phase, Hadoop creates one map task per input split; the task runs the user-defined map function once for each record in the split and generates zero or more intermediate key-value pairs as output.

Each map task writes its output to the local disk. This intermediate output is then processed by the reduce tasks, which run the user-defined reduce function to produce the final output. Once the job completes, the intermediate map output is discarded.
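A sketch of such a map function, continuing the hypothetical WordCount example from the driver above: it tokenizes each input line and emits one (word, 1) intermediate pair per token.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical WordCount mapper: called once per record of the split.
    public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          context.write(word, ONE); // zero or more intermediate key-value pairs
        }
      }
    }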

c. Combiner

The input to a single reduce task is the output of all the mappers, that is, the output of all the map tasks. Hadoop allows the user to define an optional combiner function that runs on the map output.

The combiner groups and pre-aggregates the data in the map phase before passing it to the reducer: it combines the output of the map function, and its output is then passed as input to the reduce function.
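Continuing the WordCount sketch, the reducer class can double as the combiner, but only because summing partial counts is associative and commutative; this is a property of this particular example, not of every job.

    // The framework may run the combiner zero, one, or many times on the
    // map output, so the result must not depend on how often it runs.
    job.setCombinerClass(WordCountReducer.class);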

d. Partitioner

When there are multiple reducers, the map tasks partition their output, each creating one partition per reduce task. A partition can contain many keys and their associated values, but all the records for any given key are in a single partition.

Hadoop allows users to control the partitioning by specifying a user-defined partitioning function. By default, a hash-based Partitioner buckets the keys using a hash function.
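As a sketch, a user-defined partitioner that mirrors the default hash-based behavior might look as follows (the class name is hypothetical); it would be registered with job.setPartitionerClass(WordPartitioner.class).

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical partitioner: all records with the same key hash to the
    // same partition, and hence to the same reduce task.
    public class WordPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }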

2. Reduce phase:

The various phases in the reduce task are as follows:

a. Sort and Shuffle:


The reduce task starts with a shuffle and sort step. The main purpose of this phase is to collect the equivalent keys together. The shuffle downloads the partitioned data written by the map tasks to the node where the reducer is running.

It then sorts these individual pieces of data into one large sorted list. The MapReduce framework performs this shuffle and sort so that the reduce task can easily iterate over the values for each key.

The framework performs the sort and shuffle automatically. The developer can, however, control how the keys get sorted and grouped by supplying comparator objects.
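The hooks for this control are set on the job; the comparator class names below are hypothetical and would implement RawComparator over the map output key type.

    // MyKeyComparator and MyGroupingComparator are hypothetical classes.
    job.setSortComparatorClass(MyKeyComparator.class);          // how keys are ordered
    job.setGroupingComparatorClass(MyGroupingComparator.class); // which keys share one reduce() call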

b. Reduce:

The reducer runs the user-defined reduce function once per key grouping. The reducer can filter, aggregate, and combine the data in several different ways. Once the reduce task is completed, it gives zero or more key-value pairs to the OutputFormat. The reduce task output is stored in Hadoop HDFS.
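A matching sketch of the reduce function for the hypothetical WordCount example: called once per key grouping, it sums the counts collected for each word.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical WordCount reducer: aggregates the values grouped under
    // each key and emits one (word, total) pair per key.
    public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

      private final IntWritable total = new IntWritable();

      @Override
      protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
          sum += count.get();
        }
        total.set(sum);
        context.write(word, total); // zero or more pairs handed to the OutputFormat
      }
    }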

c. OutputFormat

It takes the reducer output and writes it to an HDFS file via a RecordWriter. By default, it separates the key and value with a tab character and each record with a newline character.
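For example, continuing the driver sketch, the default tab separator of TextOutputFormat can be changed through the job configuration (property name as in Hadoop 2; an assumption worth verifying for other versions).

    // Replace the default tab between key and value with a comma.
    job.getConfiguration().set("mapreduce.output.textoutputformat.separator", ",");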

YARN stands for Yet Another Resource Negotiator. It is the resource management layer of Hadoop and was introduced in Hadoop 2.

YARN is designed around the idea of splitting the functionalities of job scheduling and resource management into separate daemons. The basic idea is to have a global ResourceManager and a per-application ApplicationMaster, where an application can be a single job or a DAG of jobs.

YARN consists of the ResourceManager, the NodeManager, and the per-application ApplicationMaster.

1. ResourceManager

It arbitrates resources amongst all the applications in the cluster.

It has two main components: the Scheduler and the ApplicationManager.

a. Scheduler
● The Scheduler allocates resources to the various applications running in the cluster, taking capacities, queues, and similar constraints into account.
● It is a pure scheduler: it does not monitor or track the status of the applications.
● The Scheduler does not guarantee the restart of tasks that fail, whether due to application failure or hardware failure.
● It performs scheduling based on the resource requirements of the applications. (The scheduler implementation itself is pluggable; see the configuration sketch after this list.)
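A hedged sketch of selecting a scheduler implementation, shown as a Java fragment against an org.apache.hadoop.conf.Configuration object named conf (an assumption); in practice this property is normally set cluster-wide in yarn-site.xml.

    // Choose the CapacityScheduler as the ResourceManager's scheduler.
    conf.set("yarn.resourcemanager.scheduler.class",
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");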

b. ApplicationManager

● It is responsible for accepting job submissions.
● The ApplicationManager negotiates the first container for executing the application-specific ApplicationMaster.
● It provides the service for restarting the ApplicationMaster container on failure.
● The per-application ApplicationMaster is responsible for negotiating containers from the Scheduler; it tracks and monitors their status and progress.

2. NodeManager:

NodeManager runs on the slave nodes. It is responsible for containers, for monitoring the machine's resource usage (CPU, memory, disk, and network), and for reporting this usage to the ResourceManager/Scheduler.

3. ApplicationMaster:

The per-application ApplicationMaster is a framework-specific library. It is responsible for negotiating resources from the ResourceManager, and it works with the NodeManager(s) to execute and monitor the tasks.
