Introduction to Hadoop

The document provides an overview of Hadoop, an open-source project for distributed computing, detailing its architecture, components like HDFS and YARN, and the MapReduce programming model. It explains the roles of master and slave nodes, data storage, and resource management within Hadoop. Additionally, it highlights the features of YARN, including multi-tenancy and scalability, and provides references for further reading.

Uploaded by

rahman2312091037

An introduction to Hadoop

Mr. ISRAFIL
Lecturer
Department of Computing & Information System
What is Hadoop?
• Hadoop is an open-source project overseen by the Apache Software Foundation
• Originally based on papers published by Google in 2003 and 2004
• Hadoop committers work at several different organizations, including Cloudera, Yahoo!, Facebook, and LinkedIn

Hadoop takes a radical new approach to the problem of distributed computing: the data is distributed as it is initially stored in the system, and individual nodes work on the data that is local to them.
History
Who Uses Hadoop?
Hadoop Components

[Figure: Hadoop layers. One stack shows a combined Application & Resource Management Layer on top of a Storage Layer; the other shows separate Application, Resource Management, and Storage Layers.]
HDFS
HDFS stands for Hadoop Distributed File System. It provides the data storage for Hadoop. HDFS
splits each data unit into smaller units called blocks and stores them in a distributed manner.
It runs two daemons:
• Master node: NameNode
• Slave nodes: DataNode
Master Nodes
• NameNode
• Only 1 per cluster
• A single NameNode stores all metadata
• Filenames, locations on DataNodes of each block, owner, group, etc.
• All information maintained in RAM for fast lookup
• File system metadata size is limited to the amount of available RAM on the NameNode
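As a rough illustration of the kind of metadata the NameNode keeps in RAM, the sketch below models it as an in-memory dictionary. The class names (`FileMeta`, `BlockMeta`), the `namespace` dictionary, and the sample file and node names are hypothetical and chosen only for illustration; they are not Hadoop's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass
class BlockMeta:
    block_id: int
    datanodes: list[str]  # DataNodes that hold a replica of this block


@dataclass
class FileMeta:
    owner: str
    group: str
    blocks: list[BlockMeta] = field(default_factory=list)


# Hypothetical in-memory namespace: filename -> metadata.
# Because everything lives in RAM, the number of files and blocks the
# cluster can track is bounded by the memory of this one process.
namespace: dict[str, FileMeta] = {}

namespace["/logs/app.log"] = FileMeta(
    owner="alice",
    group="analytics",
    blocks=[
        BlockMeta(1, ["dn1", "dn2", "dn3"]),
        BlockMeta(2, ["dn2", "dn4", "dn5"]),
    ],
)


def locate(path: str) -> list[list[str]]:
    """Return, for each block of the file, the DataNodes that store it."""
    return [block.datanodes for block in namespace[path].blocks]


print(locate("/logs/app.log"))
```

A client would use a lookup like `locate` to find out which DataNodes to contact; the file contents themselves never pass through this structure.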
Slave Nodes
• DataNode
• 1-4000 per cluster
• Store file contents
• Contents are stored as opaque ‘blocks’ on the underlying file system
• Different blocks of the same file are stored on different DataNodes
• Each block is stored on three (or more) DataNodes for redundancy
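The placement rules above can be sketched with a simple round-robin assignment. This is a simplified illustration under stated assumptions: the function `place_blocks` and the node names are hypothetical, and the real HDFS placement policy also takes rack topology into account, which is ignored here.

```python
def place_blocks(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    placement = {}
    for block in range(num_blocks):
        # Consecutive positions in the node list are always distinct
        # as long as replication <= len(datanodes).
        placement[block] = [
            datanodes[(block + i) % len(datanodes)] for i in range(replication)
        ]
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
placement = place_blocks(4, nodes)
for block, replicas in placement.items():
    print(block, replicas)
```

Each block ends up on three different nodes, and the blocks of one file are spread over the whole cluster rather than piled onto a single machine.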
Self-healing
• DataNodes send heartbeats to the NameNode
• After a period without any heartbeats, a DataNode is assumed to be lost
• NameNode determines which blocks were on the lost node
• NameNode finds other DataNodes with copies of these blocks
• These DataNodes are instructed to copy the blocks to other nodes
[Figure: Same block stored in different DataNodes]
• Replication is actively maintained
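The self-healing steps above can be sketched as follows. This is a minimal simulation, not Hadoop code: the timeout value, the function names (`find_lost_nodes`, `re_replicate`), and the block and node names are all assumptions made for the example.

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; an assumed value for this sketch
REPLICATION = 3


def find_lost_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """DataNodes whose last heartbeat is older than the timeout."""
    return {dn for dn, seen in last_heartbeat.items() if now - seen > timeout}


def re_replicate(block_map, lost, live):
    """Drop replicas on lost nodes and copy affected blocks to other live nodes."""
    for block, holders in block_map.items():
        survivors = [dn for dn in holders if dn not in lost]
        if not survivors:
            block_map[block] = []  # no copy left to replicate from
            continue
        candidates = [dn for dn in live if dn not in survivors]
        missing = max(REPLICATION - len(survivors), 0)
        block_map[block] = survivors + candidates[:missing]
    return block_map


last_heartbeat = {"dn1": 100.0, "dn2": 58.0, "dn3": 99.0, "dn4": 101.0}
block_map = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn2", "dn3", "dn4"]}

lost = find_lost_nodes(last_heartbeat, now=102.0)
live = [dn for dn in last_heartbeat if dn not in lost]
re_replicate(block_map, lost, live)
print(lost, block_map)
```

In this run only `dn2` has been silent for longer than the timeout, so both blocks lose one replica and each is copied to the one live node that did not already hold it.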
Block in HDFS
A block is the smallest unit of storage on a computer system: the smallest contiguous
storage allocated to a file. In Hadoop 2.x and later, the default HDFS block size is 128 MB; clusters are often configured with a larger size such as 256 MB.
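A small worked example may help here. The sketch below assumes a block size of 128 MB and a hypothetical 500 MB file; the helper name `split_into_blocks` is made up for this illustration.

```python
BLOCK_SIZE_MB = 128  # assumed block size for this example


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


blocks = split_into_blocks(500)
print(len(blocks), blocks)

# With three replicas of every block, the raw storage used is three
# times the file size.
raw_storage_mb = 3 * sum(blocks)
print(raw_storage_mb)
```

The file occupies three full blocks and one partial block of 116 MB. The last block only takes up as much space as the data it holds, not a full 128 MB.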
What is MapReduce

MapReduce is a method for distributing a task across multiple nodes.

Each node processes data stored on that node, where possible.

Consists of two phases:

• Map
• Reduce
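To make the two phases concrete, here is a minimal single-process sketch of a word count in plain Python. It does not use the Hadoop API; the function names `map_phase` and `reduce_phase` and the sample input are assumptions for illustration, and the grouping step in the middle stands in for what the framework does between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_phase(word, counts):
    """Reduce: combine all values emitted for one key."""
    return word, sum(counts)


lines = ["hadoop stores data", "hadoop processes data"]

# Group the mapped values by key before reducing.
grouped = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        grouped[word].append(one)

result = dict(reduce_phase(word, counts) for word, counts in sorted(grouped.items()))
print(result)
```

On a real cluster, each call to the map function would run on the node that holds that part of the input, and only the much smaller intermediate pairs would travel over the network.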
MapReduce
• Map Task: RecordReader, Map, Combiner, Partitioner
• Reduce Task: Shuffle and Sort, Reduce, OutputFormat
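The stages listed above can be traced in a small single-process simulation. This is a sketch under stated assumptions rather than Hadoop code: every function name is hypothetical, the number of reducers is fixed at two, and a CRC32 hash stands in for whatever hash the partitioner would apply to the key.

```python
import zlib
from collections import Counter, defaultdict

NUM_REDUCERS = 2  # assumed for this example


def record_reader(split):
    """RecordReader: turn a raw input split into (offset, line) records."""
    offset = 0
    for line in split.splitlines():
        yield offset, line
        offset += len(line) + 1


def mapper(_offset, line):
    """Map: emit (word, 1) for each word."""
    for word in line.split():
        yield word, 1


def combiner(pairs):
    """Combiner: pre-aggregate on the map side to shrink the shuffle."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return local.items()


def partitioner(key, num_reducers=NUM_REDUCERS):
    """Partitioner: pick the reducer for a key (deterministic hash modulo)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_map_task(split):
    mapped = (kv for off, line in record_reader(split) for kv in mapper(off, line))
    buckets = defaultdict(list)
    for key, value in combiner(mapped):
        buckets[partitioner(key)].append((key, value))
    return buckets


def run_reduce_task(pairs):
    grouped = defaultdict(list)
    for key, value in sorted(pairs):  # shuffle and sort
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}  # reduce


splits = ["a b a\nc a", "b c\nb"]
map_outputs = [run_map_task(split) for split in splits]

final = {}
for reducer in range(NUM_REDUCERS):
    incoming = [kv for out in map_outputs for kv in out.get(reducer, [])]
    final.update(run_reduce_task(incoming))  # stands in for OutputFormat

print(sorted(final.items()))
```

Because the partitioner is deterministic, every occurrence of a key reaches the same reducer, which is what allows each reducer to produce a complete total for its keys.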
YARN
YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop.

❑ Separates the resource management and job scheduling/monitoring functions into separate daemons
❑ One global ResourceManager and a per-application ApplicationMaster
❑ An application can be a single job or a DAG of jobs

Inside the YARN framework, we have two daemons:

• ResourceManager: arbitrates resources among all the competing applications in the system
• NodeManager: monitors the resource usage of each container and reports it to the ResourceManager
YARN
ResourceManager
The ResourceManager has two important components:
• Scheduler
• ApplicationManager

Scheduler
• The Scheduler is responsible for allocating resources to the various applications. It is a pure scheduler: it does not
track the status of applications, and it does not reschedule tasks that fail due to software or
hardware errors. It allocates resources based on the requirements of the applications.
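The idea of a pure scheduler can be illustrated with a toy allocator that only matches resource requests against free capacity. This is a simplified sketch, not YARN's scheduling logic: the `Request` class, the `schedule` function, and the capacity figures are hypothetical, and requests are simply granted in arrival order when they fit.

```python
from dataclasses import dataclass


@dataclass
class Request:
    app_id: str
    memory_mb: int
    vcores: int


def schedule(requests, free_memory_mb, free_vcores):
    """Grant each request, in arrival order, if it fits the remaining capacity.

    The function only allocates. It keeps no record of application status
    and makes no attempt to retry work that later fails.
    """
    granted, pending = [], []
    for req in requests:
        if req.memory_mb <= free_memory_mb and req.vcores <= free_vcores:
            free_memory_mb -= req.memory_mb
            free_vcores -= req.vcores
            granted.append(req.app_id)
        else:
            pending.append(req.app_id)
    return granted, pending


requests = [
    Request("app-1", 4096, 2),
    Request("app-2", 8192, 4),
    Request("app-3", 2048, 1),
]
granted, pending = schedule(requests, free_memory_mb=8192, free_vcores=4)
print(granted, pending)
```

Here `app-2` asks for more memory than remains after `app-1` is served, so it stays pending while the smaller `app-3` request is still granted.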

Application Manager
• Accepts job submissions.
• Negotiates the first container for executing the ApplicationMaster. A container incorporates elements such as CPU,
memory, disk, and network.
• Restarts the ApplicationMaster container on failure.

Once running, the per-application ApplicationMaster:
• Negotiates resource containers from the Scheduler.
• Tracks the status of those containers.
• Monitors the progress of the application.
Features of YARN

• Multi-tenancy
• Cluster Utilization
• Scalability
• Compatibility

See the following link for a detailed explanation:

https://www.geeksforgeeks.org/hadoop-yarn-architecture/
Reference

⮚ https://data-flair.training/blogs/hadoop-tutorial/
⮚ https://www.geeksforgeeks.org/hadoop-introduction/
Thank You
