Big Data Notes Unit-3
MAP-REDUCE
• MapReduce is a big data analysis model that processes data sets using a parallel
algorithm on computer clusters, typically Apache Hadoop clusters or cloud systems like
Amazon Elastic MapReduce (EMR) clusters.
• MapReduce and HDFS are the two major components of Hadoop that make it so
powerful and efficient to use. MapReduce is a programming model used for efficient
parallel processing of large data sets in a distributed manner.
• The data is first split and then combined to produce the final result. MapReduce
libraries are available in many programming languages, each with different
optimizations.
• The purpose of MapReduce in Hadoop is to map each job and then reduce it into
equivalent tasks, which lowers the overhead on the cluster network and the processing
power required. A MapReduce task is mainly divided into two phases: the Map phase
and the Reduce phase.
• MapReduce is a core component of the Hadoop framework and essential to its operation.
Map tasks deal with splitting and mapping the data, while reduce tasks shuffle and
reduce it.
• MapReduce makes concurrent processing easier by dividing petabytes of data into
smaller chunks and processing them in parallel on Hadoop commodity servers. In the
end, it collects all the information from several servers and gives the application a
consolidated output.
• For example, consider a Hadoop cluster consisting of 20,000 affordable commodity
servers, each working on a 256 MB data block. The cluster can process around five
terabytes of data simultaneously. Compared to the sequential processing of such a big
data set, MapReduce cuts down the amount of time needed for processing.
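As a quick back-of-envelope check of the figure above, the following Python snippet (plain arithmetic, not Hadoop code) reproduces the estimate:

servers = 20_000          # commodity servers in the example cluster
block_mb = 256            # one data block processed per server
total_mb = servers * block_mb
print(f"{total_mb:,} MB is about {total_mb / 1_000_000:.2f} TB in one parallel pass")
# 5,120,000 MB is about 5.12 TB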
Components of MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the Job to the MapReduce for
processing. There can be multiple clients available that continuously send jobs for
processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce job is the actual work that the client wants done; it is composed
of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs obtained after dividing the main job. The results
of all the job-parts are combined to produce the final output (see the sketch after this list).
5. Input Data: The data set that is fed to the MapReduce for processing.
6. Output Data: The final result obtained after processing.
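The sketch below is a minimal, Hadoop-free illustration of the flow above: a job is divided into job-parts, the parts run in parallel, and their results are combined. Names such as run_job() and process_part() are hypothetical and only stand in for the Hadoop MapReduce Master and its tasks.

from concurrent.futures import ProcessPoolExecutor

def process_part(part):
    # One job-part works on its own slice of the input data.
    return sum(part)

def run_job(input_data, num_parts=4):
    # Divide the job into roughly equal job-parts ...
    size = max(1, len(input_data) // num_parts)
    parts = [input_data[i:i + size] for i in range(0, len(input_data), size)]
    # ... execute them in parallel, then combine the partial results.
    with ProcessPoolExecutor() as pool:
        return sum(pool.map(process_part, parts))

if __name__ == "__main__":
    print(run_job(list(range(1_000_000))))  # 499999500000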
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs.
The input to a map may itself be a key-value pair, where the key can be an id or address
of some kind and the value is the actual data it holds. The Map() function is executed on
each of these input key-value pairs and generates intermediate key-value pairs that serve
as input for the Reducer, i.e. the Reduce() function.
2. Reduce: The intermediate key-value pairs that work as input for the Reducer are
shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the
data by key according to the reduce algorithm written by the developer (a minimal
word-count sketch follows this list).
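The following is a minimal, single-machine word-count sketch of the two phases described above. It is illustrative only; real Hadoop jobs typically implement Mapper and Reducer classes in Java and run distributed across the cluster.

from collections import defaultdict

def map_phase(key, value):
    # key: e.g. a line id; value: the line of text.
    # Emits intermediate (word, 1) key-value pairs.
    for word in value.split():
        yield (word, 1)

def reduce_phase(key, values):
    # Aggregates all counts that share the same key.
    return (key, sum(values))

lines = {0: "deer bear river", 1: "car car river", 2: "deer car bear"}

# Shuffle and sort: group the intermediate pairs by key.
groups = defaultdict(list)
for k, v in lines.items():
    for word, count in map_phase(k, v):
        groups[word].append(count)

results = [reduce_phase(word, counts) for word, counts in sorted(groups.items())]
print(results)  # [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]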
YARN
YARN (Yet Another Resource Negotiator) allows different data processing engines, such as
graph processing, interactive processing, stream processing as well as batch processing, to
run and process data stored in HDFS (Hadoop Distributed File System), making the system
much more efficient. Through its various components, it can dynamically allocate resources
and schedule the application processing. For large-volume data processing, it is necessary to
manage the available resources properly so that every application can leverage them.
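The sketch below is a conceptual illustration of dynamic resource allocation, not YARN's actual API: a hypothetical Cluster object grants container requests only while enough memory and vcores remain.

class Cluster:
    def __init__(self, memory_mb, vcores):
        self.memory_mb = memory_mb
        self.vcores = vcores

    def request_container(self, app, memory_mb, vcores):
        # Grant the request only if enough resources remain in the cluster.
        if memory_mb <= self.memory_mb and vcores <= self.vcores:
            self.memory_mb -= memory_mb
            self.vcores -= vcores
            print(f"granted {memory_mb} MB / {vcores} vcores to {app}")
            return True
        print(f"denied {app}: insufficient resources")
        return False

cluster = Cluster(memory_mb=8192, vcores=8)
cluster.request_container("batch-job", 4096, 4)    # granted
cluster.request_container("stream-job", 2048, 2)   # granted
cluster.request_container("graph-job", 4096, 4)    # denied, only 2048 MB left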
YARN Features: YARN gained popularity because of features such as scalability,
compatibility with existing MapReduce applications, improved cluster utilization, and
multi-tenancy.
JOB SCHEDULING
• Job scheduling is the process by which an operating system (OS) allocates system
resources to many different tasks. The system maintains prioritized job queues awaiting
CPU time and must determine which job to take from which queue and how much time
to allocate to it (a minimal priority-queue sketch follows this list). This type of
scheduling makes sure that all jobs are carried out fairly and on time.
• Job scheduling is performed using job schedulers. Job schedulers are programs that
enable scheduling and, at times, track computer “batch” jobs, or units of work like the
operation of a payroll program. Job schedulers have the ability to start and control
jobs automatically by running prepared job-control-language statements or by means
of similar communication with a human operator.
• Most OSs like Unix, Windows, etc., include standard job-scheduling abilities. A
number of programs including database management systems (DBMS), backup,
enterprise resource planning (ERP) and business process management (BPM) feature
specific job-scheduling capabilities as well.
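A minimal sketch of prioritized job queues in Python (illustrative only; this is not how an OS or Hadoop scheduler is actually implemented). Lower priority numbers run first, and ties run in submission order.

import heapq

class JobScheduler:
    def __init__(self):
        self._queue = []
        self._counter = 0  # preserves submission order among equal priorities

    def submit(self, name, priority, func):
        heapq.heappush(self._queue, (priority, self._counter, name, func))
        self._counter += 1

    def run_all(self):
        while self._queue:
            priority, _, name, func = heapq.heappop(self._queue)
            print(f"running {name} (priority {priority})")
            func()

scheduler = JobScheduler()
scheduler.submit("payroll batch", priority=1, func=lambda: None)
scheduler.submit("log cleanup", priority=5, func=lambda: None)
scheduler.submit("nightly backup", priority=2, func=lambda: None)
scheduler.run_all()  # payroll batch, then nightly backup, then log cleanup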
TASK EXECUTION