Hadoop Class 2 PDF
MapReduce is a programming framework that allows us to perform distributed and parallel processing on large
data sets in a distributed environment.
● Parallel Processing: In MapReduce, the job is divided among multiple nodes, and each node works on its part of the job simultaneously. MapReduce is thus based on the divide-and-conquer paradigm, which helps us process the data on different machines very quickly.
● Data Locality: Instead of moving data to the processing unit, we are moving the processing unit to the data
in the MapReduce Framework. In the traditional system, we used to bring data to the processing unit and
process it. But, as the data grew and became very huge, bringing this huge amount of data to the processing
unit posed the following issues:
○ Moving huge amounts of data to the processing unit is costly and degrades network performance.
○ Processing takes time as the data is processed by a single unit which becomes the bottleneck.
○ The master node can get over-burdened and may fail.
Now, MapReduce allows us to overcome the above issues by bringing the processing unit to the data: network traffic stays low, every node works on its own share of the data in parallel, and no single node becomes a bottleneck.
● Partitioner: Each combiner output is partitioned according to the key value; records with the same key go into the same partition, and each partition is then sent to a reducer.
● Shuffling and Sorting: The partitioned output is then shuffled to the reduce node (a normal slave node on which the reduce phase runs, hence called the reducer node). Shuffling is the physical movement of the data over the network. Once all the mappers have finished and their output has been shuffled to the reducer nodes, this intermediate output is merged and sorted, and then provided as input to the reduce phase.
● Reducer: It takes the set of intermediate key-value pairs produced by the mappers as the input and then runs
a reducer function on each of them to generate the output. The output of the reducer is the final output, which
is stored in HDFS.
● RecordWriter: It writes the output key-value pairs from the Reducer phase to the output files.
● OutputFormat: The way these output key-value pairs are written to the output files by the RecordWriter is determined by the OutputFormat. A minimal job sketch tying these phases together is shown below.
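To make the phases above concrete, here is a minimal WordCount sketch against the Hadoop MapReduce Java API. The class names (WordCount, TokenMapper, HashKeyPartitioner, SumReducer) and the input/output paths are illustrative assumptions, not part of the original material; the partitioner simply mirrors the default hash behaviour to show where a custom Partitioner plugs in.

// Hypothetical WordCount job illustrating the Mapper -> Partitioner -> Shuffle/Sort -> Reducer
// pipeline described above; class names and paths are placeholders.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input split it is given.
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Partitioner: records with the same key always land in the same partition,
  // so a single reducer sees every count for a given word.
  public static class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Reducer: sums the values grouped under each key; the RecordWriter/OutputFormat
  // then persist the final (word, count) pairs to HDFS.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);          // combiner runs locally on mapper output
    job.setPartitionerClass(HashKeyPartitioner.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input path in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Running it with hadoop jar wordcount.jar WordCount /input /output (jar name and paths assumed) exercises the Mapper, Combiner, Partitioner, shuffle/sort, Reducer, and OutputFormat stages in order.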
Difference between Input Split & Block
A block is the physical division of data in HDFS (128 MB by default in Hadoop 2.x), whereas an input split is the logical division of data that is handed to a single mapper. Splits need not align with block boundaries: one split may cover parts of several blocks, and it is the number of splits, not blocks, that decides how many map tasks run, as the sketch below illustrates.
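A short, hedged sketch of how the two sizes are tuned independently in driver code; the property values and the /data/input path are assumptions for illustration, not recommendations.

// Minimal sketch showing that the HDFS block size and the MapReduce input split size
// are configured separately: blocks are a physical storage unit, splits are a logical
// unit that determines the number of map tasks.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitVsBlock {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Physical division: files written with this configuration are stored in 128 MB blocks.
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

    Job job = Job.getInstance(conf, "split vs block demo");
    // Logical division: cap each input split at 64 MB, so a single 128 MB block
    // can be processed by two map tasks.
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);
    FileInputFormat.addInputPath(job, new Path("/data/input")); // hypothetical path
  }
}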
YARN
Apache Hadoop YARN (Yet Another Resource Negotiator) is a resource management layer in Hadoop.
YARN came into the picture with the introduction of Hadoop 2.x. It allows various data processing engines
such as interactive processing, graph processing, batch processing, and stream processing to run and
process data stored in HDFS (Hadoop Distributed File System).
Components Of YARN
1. Resource Manager: Resource Manager is the master daemon of YARN. It is responsible for managing the applications running in the cluster, along with the global assignment of resources such as CPU and memory, and it is used for job scheduling. Resource Manager has two components:
a. Scheduler: The Scheduler's task is to allocate resources to the running applications. It deals purely with scheduling and performs no tracking or monitoring of applications.
b. Application Manager: The Application Manager manages the applications running in the cluster. Tasks such as starting the Application Master and monitoring it are handled by the Application Manager.
2. Node Manager: Node Manager is the slave daemon of YARN. It has the following responsibilities:
a. Node Manager monitors each container’s resource usage and reports it to the Resource Manager.
b. The health of the node on which YARN is running is tracked by the Node Manager.
c. It takes care of each node in the cluster while managing the workflow, along with user jobs on a
particular node.
d. It keeps the data in the Resource Manager updated.
e. Node Manager can also destroy or kill the container if it gets an order from the Resource Manager to do
so.
3. Application Master: Every job submitted to the framework is an application, and every application has a specific
Application Master associated with it. Application Master performs the following tasks:
● It coordinates the execution of the application in the cluster, along with managing the faults.
● It negotiates resources from the Resource Manager.
● It works with the Node Manager for executing and monitoring other components’ tasks.
● At regular intervals, it sends heartbeats to the Resource Manager to confirm its health and to update the record of its resource demands.
4. Container: A container is a set of physical resources (CPU cores, RAM, disks, etc.) on a single node. The tasks
of a container are listed below:
● It grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host.
● A YARN container is launched from a Container Launch Context (CLC). This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for Node Manager services, and the command necessary to create the process. A sketch of building such a launch context is shown below.
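A hedged sketch of building a Container Launch Context with the YARN Java API; the environment variable, path, and command values are placeholders chosen for illustration.

// The CLC tells a Node Manager which environment variables, local resources, and
// command a container process needs.
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.util.Records;

public class ClcExample {
  public static ContainerLaunchContext buildLaunchContext() {
    ContainerLaunchContext clc = Records.newRecord(ContainerLaunchContext.class);

    // Environment variables visible to the process started inside the container.
    clc.setEnvironment(Collections.singletonMap("APP_HOME", "/opt/myapp")); // placeholder

    // Dependencies (jars, config files) fetched from remotely accessible storage such as HDFS.
    clc.setLocalResources(Collections.<String, LocalResource>emptyMap());

    // The command the Node Manager executes to create the container process.
    clc.setCommands(Collections.singletonList(
        "java -Xmx512m com.example.MyTask 1>stdout 2>stderr")); // placeholder command
    return clc;
  }
}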
Running an Application through YARN
1. Application Submission: The client submits the application to the RM, which accepts it and triggers the creation of an ApplicationMaster (AM) instance. The AM is responsible for negotiating resources from the RM and working with the Node Managers (NMs) to execute and monitor the tasks.
2. Resource Request: The AM starts by requesting resources from the RM. It specifies what resources are needed, in which
locations, and other constraints. These resources are encapsulated in terms of "Resource Containers" which include
specifications like memory size, CPU cores, etc.
3. Resource Allocation: The Scheduler in the RM, based on the current system load and capacity, as well as policies (e.g.,
capacity, fairness), allocates resources to the applications by granting containers. The specific strategy depends on the scheduler
type (e.g., FIFO, Capacity Scheduler).
4. Container Launching: Post-allocation, the RM communicates with relevant NMs to launch the containers. The Node Manager
sets up the container's environment, then starts the container by executing the specified commands.
5. Task Execution: Each container then runs the task assigned by the ApplicationMaster. These are actual data processing tasks,
specific to the application's purpose.
6. Monitoring and Fault Tolerance: The AM monitors the progress of each task. If a container fails, the AM requests a new
container from the RM and retries the task, ensuring fault tolerance in the execution phase.
7. Completion and Release of Resources: Upon task completion, the AM releases the allocated containers, freeing up resources.
After all tasks are complete, the AM itself is terminated, and its resources are also released.
8. Finalization: The client polls the RM, or receives a notification, to learn the status of the application. Once informed of the completion, the client retrieves the result and finishes the process (see the client-side sketch below).
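A hedged client-side sketch of the flow above using the YarnClient API: submit an application, then poll the Resource Manager for its terminal state. The application name, AM command, and resource sizes are illustrative assumptions, and the sketch assumes a recent Hadoop release (2.8 or later) where Resource.setMemorySize is available.

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new Configuration());
    yarnClient.start();

    // 1. Application submission: ask the RM for a new application id and submission context.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("demo-app"); // placeholder name

    // 2./3. Resource request: the container spec for the ApplicationMaster.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList(
        "java com.example.MyAppMaster 1>stdout 2>stderr")); // placeholder AM command
    appContext.setAMContainerSpec(amContainer);

    Resource capability = Records.newRecord(Resource.class);
    capability.setMemorySize(1024); // MB for the AM container
    capability.setVirtualCores(1);
    appContext.setResource(capability);

    // 4.-7. The RM schedules the AM container; the AM then requests task containers itself.
    ApplicationId appId = yarnClient.submitApplication(appContext);

    // 8. Finalization: poll the RM until the application reaches a terminal state.
    YarnApplicationState state;
    do {
      Thread.sleep(1000);
      state = yarnClient.getApplicationReport(appId).getYarnApplicationState();
    } while (state != YarnApplicationState.FINISHED
        && state != YarnApplicationState.FAILED
        && state != YarnApplicationState.KILLED);

    System.out.println("Application " + appId + " ended in state " + state);
    yarnClient.stop();
  }
}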