
Big Data Notes Unit-3


UNIT-3

MAP-REDUCE
• MapReduce is a big data analysis model that processes data sets using a parallel
algorithm on computer clusters, typically Apache Hadoop clusters or cloud services such as
Amazon Elastic MapReduce (EMR).
• MapReduce and HDFS are the two major components of Hadoop that make it so
powerful and efficient to use. MapReduce is a programming model for efficient
parallel processing of large data sets in a distributed manner.
• The data is first split and then combined to produce the final result. MapReduce
libraries have been written in many programming languages, each with its own
optimizations.
• In Hadoop, MapReduce first maps each job into smaller tasks and then reduces the
intermediate results into an equivalent output, lowering the overhead on the cluster
network and the processing power required. A MapReduce task is mainly divided into two
phases: the Map phase and the Reduce phase.
• MapReduce is a core component of the Hadoop framework and essential to its operation.
While "reduce tasks" shuffle and reduce the data, "map tasks" deal with splitting and
mapping the data.
• MapReduce makes concurrent processing easier by dividing petabytes of data into
smaller chunks and processing them in parallel on Hadoop commodity servers. In the
end, it collects all the information from several servers and gives the application a
consolidated output.
• For example, consider a Hadoop cluster of 20,000 affordable commodity servers, each
holding a 256 MB block of data. Together they can process around five terabytes of
data simultaneously. Compared with processing such a big data set sequentially, the use
of MapReduce cuts down the amount of time needed for processing.
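The capacity figure above can be checked with a quick back-of-the-envelope calculation (the numbers are the example's, and the conversion uses decimal terabytes):

```python
# Cluster-capacity check for the example above:
# 20,000 servers, one 256 MB data block each.
servers = 20_000
block_mb = 256

total_mb = servers * block_mb        # 5,120,000 MB
total_tb = total_mb / 1_000_000      # decimal terabytes

print(f"{total_tb:.2f} TB processed in parallel")  # 5.12 TB
```

which matches the "around five terabytes" stated in the notes.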
Components of MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the job to MapReduce for
processing. There can be multiple clients that continuously send jobs for
processing to the Hadoop MapReduce Master.
2. Job: The MapReduce job is the actual work that the client wants done, which
consists of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs obtained after dividing the main job. The results
of all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
The MapReduce task is mainly divided into two phases, i.e. the Map phase and the Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs. The
input to a map is itself a key-value pair, where the key may be an identifier such as an
address and the value is the actual data it holds. The Map() function is executed in
its memory repository on each of these input key-value pairs and generates
intermediate key-value pairs, which serve as input for the Reducer, i.e. the Reduce() function.

2. Reduce: The intermediate key-value pairs that serve as input for the Reducer are shuffled,
sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data by
key according to the reducer algorithm written by the developer.
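The two phases above can be sketched in plain Python with the classic word-count example. This is a minimal stand-in for Hadoop's Map and Reduce, not the Hadoop API itself; the function names are illustrative:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    """Map: emit an intermediate (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reduce: aggregate all counts that share the same key."""
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: every input record becomes intermediate key-value pairs.
intermediate = [pair for line in lines for pair in map_phase(line)]

# Shuffle & sort: group the intermediate pairs by key before reducing.
intermediate.sort(key=itemgetter(0))
results = [reduce_phase(k, (v for _, v in g))
           for k, g in groupby(intermediate, key=itemgetter(0))]

print(dict(results))
# {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

Note that the sort-then-group step between the two functions plays the role of the shuffle described in phase 2.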

HOW DOES MAP-REDUCE WORK?


• MapReduce generally divides the input data into pieces and distributes them among the
computers in the cluster. The input data is broken up into key-value pairs, and
parallel map tasks process the chunked data on the cluster's machines.
• The reduction job combines the result into a specific key-value pair output, and the data
is then written to the Hadoop Distributed File System (HDFS).
• Typically, the MapReduce program operates on the same collection of computers as the
Hadoop Distributed File System.
• The time it takes to accomplish a task drops dramatically when the framework runs a
job on the nodes that store the data. The first iteration of MapReduce used several
component daemons, including the JobTracker and TaskTrackers.
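The split-map-reduce flow described above can be mimicked on one machine with a thread pool standing in for the map tasks on cluster nodes. This is only a sketch of the data flow, not how Hadoop actually distributes work:

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def split_input(records, n_chunks):
    """Break the input into roughly equal chunks, one per 'node'."""
    size = max(1, len(records) // n_chunks)
    return [records[i:i + size] for i in range(0, len(records), size)]

def map_task(chunk):
    """Each map task counts words in its own chunk independently."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts

records = ["a b a", "b c", "a c c", "b b"]
chunks = split_input(records, n_chunks=2)

# Parallel map tasks, one per chunk (threads stand in for cluster nodes).
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_task, chunks))

# Reduce: merge the per-chunk results into one consolidated output,
# like the consolidated output MapReduce returns to the application.
total = sum(partials, Counter())
print(total)  # Counter({'b': 4, 'a': 3, 'c': 3})
```

In real Hadoop the final result would be written to HDFS rather than merged in memory.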

YET ANOTHER RESOURCE NEGOTIATOR (YARN)


YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to
remove the bottleneck of the JobTracker that was present in Hadoop 1.0. YARN was
described as a "Redesigned Resource Manager" at the time of its launch, but it has since
evolved to be known as a large-scale distributed operating system for Big Data
processing.
The YARN architecture basically separates the resource-management layer from the processing layer.
In Hadoop 2.0, the responsibilities of the Hadoop 1.0 JobTracker are split between the
Resource Manager and the Application Master.

YARN also allows different data-processing engines, such as graph processing, interactive
processing, stream processing, and batch processing, to run and process data stored in
HDFS (Hadoop Distributed File System), making the system much more efficient.
Through its various components, it can dynamically allocate resources and schedule
application processing. For large-volume data processing, it is necessary to manage
the available resources properly so that every application can leverage them.
YARN Features: YARN gained popularity because of the following features-

• Scalability: The scheduler in the Resource Manager of the YARN architecture allows Hadoop to
extend to and manage thousands of nodes and clusters.
• Compatibility: YARN supports existing MapReduce applications without disruption,
making it compatible with Hadoop 1.0 as well.
• Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop,
which enables optimized cluster utilization.
• Multi-tenancy: It allows multiple engines to access the cluster, giving organizations the
benefit of multi-tenancy.
HADOOP YARN ARCHITECTURE

The main components of YARN architecture include:


(i) Client: It submits map-reduce jobs.
(ii) Resource Manager: It is the master daemon of YARN and is responsible for resource
assignment and management among all the applications. Whenever it receives a processing
request, it forwards it to the corresponding node manager and allocates resources for the
completion of the request accordingly. It has two major components:
• Scheduler: It performs scheduling based on the allocated application and available
resources. It is a pure scheduler, meaning it does not perform other tasks such as
monitoring or tracking and does not guarantee a restart if a task fails. The YARN
scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition
the cluster resources.
• Application Manager: It is responsible for accepting the application and negotiating
the first container from the Resource Manager. It also restarts the Application Master
container if a task fails.
(iii) Node Manager: It takes care of an individual node in the Hadoop cluster and manages the
applications and workflow on that particular node. Its primary job is to keep in sync with the
Resource Manager. It registers with the Resource Manager and sends heartbeats carrying the
health status of the node. It monitors resource usage, performs log management, and kills
containers when directed by the Resource Manager. It is also responsible for creating
a container process and starting it at the request of the Application Master.
(iv) Application Master: An application is a single job submitted to the framework. The
Application Master is responsible for negotiating resources with the Resource Manager and for
tracking the status and monitoring the progress of a single application. The Application Master
asks the Node Manager to launch a container by sending it a Container Launch Context (CLC),
which includes everything the application needs to run. Once the application is started, it
sends health reports to the Resource Manager from time to time.
(v) Container: It is a collection of physical resources such as RAM, CPU cores, and disk on a
single node. Containers are invoked via a Container Launch Context (CLC), which is a
record containing information such as environment variables, security tokens,
dependencies, etc.
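To make the CLC record concrete, here is an illustrative model of it as a Python dataclass. The field names are simplified stand-ins based on the description above, not Hadoop's actual Java API:

```python
from dataclasses import dataclass, field

@dataclass
class ContainerLaunchContext:
    """Simplified sketch of the CLC record described above."""
    commands: list                                       # commands the Node Manager runs to start the app
    environment: dict = field(default_factory=dict)      # environment variables
    security_tokens: bytes = b""                         # credentials for the container
    local_resources: dict = field(default_factory=dict)  # dependencies: jars, files, archives

# Hypothetical example values, for illustration only.
clc = ContainerLaunchContext(
    commands=["java -Xmx512m MyAppMaster"],
    environment={"CLASSPATH": "./*"},
    local_resources={"app.jar": "hdfs:///apps/app.jar"},
)
print(clc.commands[0])  # java -Xmx512m MyAppMaster
```

The real record in Hadoop carries the same kinds of information (launch commands, environment, tokens, local resources) to the Node Manager.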
APPLICATION WORKFLOW IN HADOOP YARN

1. The client submits an application.
2. The Resource Manager allocates a container to start the Application Master.
3. The Application Master registers itself with the Resource Manager.
4. The Application Master negotiates containers from the Resource Manager.
5. The Application Master notifies the Node Manager to launch the containers.
6. The application code is executed in the container.
7. The client contacts the Resource Manager/Application Master to monitor the application's status.
8. Once the processing is complete, the Application Master un-registers with the Resource
Manager.
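The workflow above can be walked through with toy classes. All of these are simplified stand-ins for the real daemons, kept only detailed enough to trace the numbered steps:

```python
class ResourceManager:
    def __init__(self):
        self.registered = []

    def allocate_container(self):        # steps 2 and 4: hand out a container
        return "container"

    def register(self, app_master):      # step 3
        self.registered.append(app_master)

    def unregister(self, app_master):    # step 8
        self.registered.remove(app_master)

class NodeManager:
    def launch(self, container, code):   # steps 5 and 6: launch and run app code
        return code()

class ApplicationMaster:
    def run(self, rm, nm, code):
        rm.register(self)                          # step 3
        container = rm.allocate_container()        # step 4
        result = nm.launch(container, code)        # steps 5-6
        rm.unregister(self)                        # step 8
        return result

rm, nm = ResourceManager(), NodeManager()
am_container = rm.allocate_container()             # step 2: RM starts the AM
am = ApplicationMaster()
print(am.run(rm, nm, lambda: "job done"))          # job done
print(rm.registered)                               # [] -- AM has un-registered
```

Step 7 (the client polling for status) is omitted here; in a real cluster the client would query the Resource Manager while `run` is in flight.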

JOB SCHEDULING
• Job scheduling is the process of allocating system resources to many different tasks
by an operating system (OS). The system maintains prioritized job queues that are
awaiting CPU time, and it must determine which job to take from which queue
and how much time to allocate to it. This type of scheduling makes sure
that all jobs are carried out fairly and on time.
• Job scheduling is performed by job schedulers. Job schedulers are programs that
enable scheduling and, at times, track computer "batch" jobs, or units of work such as the
run of a payroll program. Job schedulers can start and control
jobs automatically by running prepared job-control-language statements or by means
of similar communication with a human operator.
• Most OSs, such as Unix and Windows, include standard job-scheduling abilities. A
number of programs, including database management systems (DBMS), backup,
enterprise resource planning (ERP), and business process management (BPM) software, feature
specific job-scheduling capabilities as well.
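The prioritized-queue idea above can be sketched with a minimal scheduler built on a heap. This illustrates the general queueing mechanism only, not any particular OS or Hadoop scheduler:

```python
import heapq

class JobScheduler:
    """Minimal priority-based job scheduler sketch."""

    def __init__(self):
        self._queue = []
        self._counter = 0   # tie-breaker keeps FIFO order within a priority

    def submit(self, name, priority):
        """Queue a job; a lower priority number runs sooner."""
        heapq.heappush(self._queue, (priority, self._counter, name))
        self._counter += 1

    def run_next(self):
        """Pop and 'run' the highest-priority waiting job."""
        _, _, name = heapq.heappop(self._queue)
        return name

sched = JobScheduler()
sched.submit("backup", priority=5)
sched.submit("payroll", priority=1)   # the batch job gets top priority
sched.submit("reports", priority=5)

print(sched.run_next())  # payroll
print(sched.run_next())  # backup  (FIFO among equal priorities)
print(sched.run_next())  # reports
```

Real schedulers add time slicing, preemption, and fairness policies on top of this basic pick-the-next-job loop.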

DIFFERENTIATE BETWEEN YARN AND MAP-REDUCE

• MapReduce is a programming model for processing data; YARN is the resource-management
layer of Hadoop on which processing engines run.
• In Hadoop 1.0, MapReduce handled both processing and resource management through the
JobTracker; with YARN, resource management is handled separately by the Resource Manager.
• YARN can run processing engines other than MapReduce, while MapReduce itself is limited
to batch map-and-reduce processing.
TASK EXECUTION
