Introduction to Hadoop

The document provides an overview of Hadoop, an open-source project for distributed computing, detailing its architecture, components like HDFS and YARN, and the MapReduce programming model. It explains the roles of master and slave nodes, data storage, and resource management within Hadoop. Additionally, it highlights the features of YARN, including multi-tenancy and scalability, and provides references for further reading.

Uploaded by

rahman2312091037
Copyright
© © All Rights Reserved

An introduction to Hadoop

Mr. ISRAFIL
Lecturer
Department of Computing & Information System
What is Hadoop?
• Hadoop is an open-source project overseen by the Apache Software Foundation
• Originally based on papers published by Google in 2003 and 2004
• Hadoop committers work at several different organizations, including Cloudera, Yahoo!, Facebook, and LinkedIn
• Hadoop takes a radically different approach to distributed computing: data is distributed across the cluster as it is initially stored, and individual nodes work on the data local to them
History
Who Uses Hadoop?
Hadoop Components

[Figure: Hadoop components shown as layers (application layer, resource management layer, storage layer)]
Storage Layer
HDFS
HDFS stands for Hadoop Distributed File System. It provides data storage for Hadoop. HDFS
splits each file into smaller units called blocks and stores them in a distributed manner.
It runs two daemons:
• Master node – NameNode
• Slave nodes – DataNode
Master Nodes
• NameNode
• Only 1 per cluster
• A single NameNode stores all metadata
• Filenames, locations on DataNodes of each block, owner, group, etc.
• All information maintained in RAM for fast lookup
• File system metadata size is limited to the amount of available RAM on the NameNode
Slave Nodes
• DataNode
• 1-4000 per cluster
• Store file contents
• Contents are stored as opaque ‘blocks’ on the underlying file system
• Different blocks of the same file are stored on different DataNodes
• Each block is stored on three (or more) DataNodes for redundancy
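The placement rules above can be illustrated with a minimal sketch. It assumes a simple round-robin policy, which is a simplification chosen for illustration; real HDFS placement also takes rack topology into account. The function name `place_blocks` and the node names are hypothetical.

```python
def place_blocks(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes (round-robin sketch)."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        # Start each block at a different node so blocks of one file spread out.
        start = block_id % len(datanodes)
        placement[block_id] = [
            datanodes[(start + i) % len(datanodes)] for i in range(replication)
        ]
    return placement


# Hypothetical cluster of five DataNodes storing a four-block file.
placement = place_blocks(4, ["dn1", "dn2", "dn3", "dn4", "dn5"])
for block_id, nodes in placement.items():
    print(block_id, nodes)
```

Each block ends up on three different nodes, and consecutive blocks of the same file start on different nodes.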
Self-healing
• DataNodes send heartbeats to the NameNode
• After a period without any heartbeats, a DataNode is assumed to be lost
• NameNode determines which blocks were on the lost node
• NameNode finds other DataNodes with copies of these blocks
• These DataNodes are instructed to copy the blocks to other nodes
• Replication is actively maintained
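The self-healing steps above can be sketched as follows. This is an illustrative model only, not the NameNode's actual implementation: the function names, the timeout value, and the timestamps are all assumptions made for the example.

```python
def find_dead_nodes(last_heartbeat, now, timeout):
    """Return the nodes whose last heartbeat is older than `timeout` seconds."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def plan_rereplication(block_map, dead, live_nodes, replication=3):
    """Return (block, source, target) copy instructions for under-replicated blocks."""
    plan = []
    for block_id, holders in sorted(block_map.items()):
        survivors = sorted(holders - dead)
        missing = replication - len(survivors)
        if missing <= 0 or not survivors:
            continue  # fully replicated, or no surviving copy to read from
        # Only live nodes that do not already hold the block are valid targets.
        candidates = [n for n in sorted(live_nodes) if n not in holders]
        for target in candidates[:missing]:
            plan.append((block_id, survivors[0], target))
    return plan


# Hypothetical state: dn3 has been silent for 60 s with a 30 s timeout.
last_heartbeat = {"dn1": 100, "dn2": 95, "dn3": 40, "dn4": 98}
dead = find_dead_nodes(last_heartbeat, now=100, timeout=30)
live = set(last_heartbeat) - dead
block_map = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn1", "dn2", "dn4"}}
plan = plan_rereplication(block_map, dead, live)
print(dead, plan)
```

Only `blk_1` lost a replica, so the plan contains a single copy instruction for it.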
Block in HDFS
A block is the smallest unit of storage that a file system allocates to a file, that is, the smallest
contiguous storage allocated to a file. In Hadoop 2 and later, the default block size is 128 MB, and larger sizes such as 256 MB are commonly configured.
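As a worked example of the block arithmetic, the sketch below splits a file into blocks, assuming the 128 MB default. The helper name `split_into_blocks` and the 300 MB file size are hypothetical.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the size in bytes of each block a file of `file_size` bytes occupies."""
    full_blocks, remainder = divmod(file_size, block_size)
    # The last block holds only the leftover bytes; it is not padded to full size.
    return [block_size] * full_blocks + ([remainder] if remainder else [])


sizes = split_into_blocks(300 * MB)
print([s // MB for s in sizes])  # block sizes in MB
print(3 * sum(sizes) // MB)      # raw storage in MB with a replication factor of 3
```

A 300 MB file therefore needs two full blocks plus one 44 MB block, and with three replicas it consumes 900 MB of raw cluster storage.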
What is MapReduce

MapReduce is a method for distributing a task across multiple nodes.
Each node processes data stored on that node, where possible.
It consists of two phases:
• Map
• Reduce
MapReduce
• Map Task
• RecordReader
• Map
• Combiner
• Partitioner

• Reduce Task
• Shuffle and Sort
• Reduce
• OutputFormat
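To make the stages listed above concrete, here is a minimal single-process word count in plain Python that mimics each stage. It is a teaching sketch, not the Hadoop API: every function name is hypothetical, the record reader uses line numbers as keys for simplicity, and the partitioner uses a simple deterministic hash chosen for the example.

```python
from collections import defaultdict


def record_reader(split):
    """Turn an input split into (key, value) records: (line number, line)."""
    for line_no, line in enumerate(split.splitlines()):
        yield line_no, line


def map_fn(_key, line):
    """Map: emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def combine(pairs):
    """Combiner: pre-aggregate counts locally before they leave the map task."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def partition(key, num_reducers):
    """Partitioner: choose the reducer for a key (deterministic toy hash)."""
    return sum(key.encode("utf-8")) % num_reducers


def reduce_fn(key, values):
    """Reduce: sum all partial counts for one word."""
    return key, sum(values)


def run_job(splits, num_reducers=2):
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:  # one map task per input split
        mapped = [kv for key, line in record_reader(split) for kv in map_fn(key, line)]
        for key, value in combine(mapped):
            buckets[partition(key, num_reducers)][key].append(value)  # shuffle
    output = []
    for bucket in buckets:  # one reduce task per partition
        for key in sorted(bucket):  # sort keys within the partition
            output.append(reduce_fn(key, bucket[key]))
    return output  # stands in for the OutputFormat writing results


result = dict(run_job(["the cat sat", "the dog sat the"]))
print(result)
```

The combiner reduces the second split's two occurrences of "the" to a single pair before the shuffle, which is the reason a combiner lowers network traffic.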
YARN
YARN (Yet Another Resource Negotiator) is the resource management layer of
Hadoop.

❑ Separates resource management and job scheduling/monitoring into separate daemons
❑ One global ResourceManager and a per-application ApplicationMaster
❑ An application can be a single job or a DAG of jobs

Inside the YARN framework, we have two daemons:

• ResourceManager
• Arbitrates resources among all the competing applications in the system

• NodeManager
• Monitors the resource usage of containers and reports it to the ResourceManager
YARN
ResourceManager
The ResourceManager has two important components:
• Scheduler
• Application Manager

Scheduler
• The Scheduler is responsible for allocating resources to the various applications. It is a pure scheduler: it does not
track application status, and it does not restart tasks that fail due to software or hardware errors.
It allocates resources based on the resource requirements of the applications.
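The idea of allocating purely on stated requirements can be sketched as below. This is a simplified first-come, first-served, first-fit model written for illustration; YARN's actual schedulers are pluggable and considerably more involved. The `Request` class, the `schedule` function, and all capacities are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    app_id: str
    memory_mb: int
    vcores: int


def schedule(requests, nodes):
    """Grant each request a container on the first node with enough free capacity."""
    free = {node: list(capacity) for node, capacity in nodes.items()}
    granted, pending = [], []
    for req in requests:  # requests are served in arrival order
        for node, (memory_mb, vcores) in free.items():
            if memory_mb >= req.memory_mb and vcores >= req.vcores:
                free[node][0] -= req.memory_mb
                free[node][1] -= req.vcores
                granted.append((req.app_id, node))
                break
        else:
            # No node fits right now; the scheduler does not track or retry the app.
            pending.append(req.app_id)
    return granted, pending


# Hypothetical cluster: node name -> (free memory in MB, free vcores).
nodes = {"nm1": (4096, 4), "nm2": (2048, 2)}
requests = [Request("app1", 3072, 2), Request("app2", 2048, 2), Request("app3", 2048, 1)]
granted, pending = schedule(requests, nodes)
print(granted, pending)
```

The sketch only matches requirements against free capacity; it keeps no application state, which mirrors the "pure scheduler" role described above.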

Application Manager
• Accepts job submissions.
• Negotiates the first container for executing the ApplicationMaster. A container incorporates elements such as CPU,
memory, disk, and network.
• Restarts the ApplicationMaster container on failure.

ApplicationMaster (one per application)
• Negotiates resource containers from the Scheduler.
• Tracks the status of its resource containers.
• Monitors the progress of the application.
Features of YARN

• Multi-tenancy
• Cluster Utilization
• Scalability
• Compatibility

See this link for a detailed explanation:

https://www.geeksforgeeks.org/hadoop-yarn-architecture/
Reference

⮚ https://data-flair.training/blogs/hadoop-tutorial/
⮚ https://www.geeksforgeeks.org/hadoop-introduction/
Thank You
