Introduction to Hadoop
Mr. ISRAFIL
Lecturer
Department of Computing & Information System
What is Hadoop?
Hadoop is an open-source project overseen by the Apache Software
Foundation
Originally based on papers published by Google in 2003 and 2004
Hadoop committers work at several different organizations, including
Cloudera, Yahoo!, Facebook, and LinkedIn
[Figure: Hadoop architecture layers: Resource Management Layer and Storage Layer]
Storage Layer
HDFS
HDFS stands for Hadoop Distributed File System. It provides the data storage layer of Hadoop. HDFS
splits each file into smaller units called blocks and stores them across the cluster in a distributed manner.
It runs two kinds of daemons:
• Master node – NameNode
• Slave nodes – DataNode
Master Nodes
• NameNode
• Only 1 per cluster
• A single NameNode stores all metadata
• Filenames, locations on DataNodes of each block, owner, group, etc.
• All information maintained in RAM for fast lookup
• File system metadata size is limited to the amount of available RAM on the NameNode
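To make the RAM limit concrete, here is a rough back-of-the-envelope sketch. It assumes an often-quoted rule of thumb of about 150 bytes of NameNode memory per file or block object; that figure, the constant name and the function name are illustrative assumptions, not values taken from these slides.

```python
# ASSUMPTION: roughly 150 bytes of NameNode heap per namespace object
# (file or block). This is a rule of thumb, not a measured value.
BYTES_PER_OBJECT = 150


def estimate_namenode_heap(num_files, blocks_per_file=1,
                           bytes_per_object=BYTES_PER_OBJECT):
    """Estimate the bytes of RAM needed to hold file and block metadata."""
    # Each file contributes one file object plus one object per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


heap_bytes = estimate_namenode_heap(num_files=10_000_000)
print(f"{heap_bytes / 1024**3:.2f} GiB")
```

Under this assumption, many small files cost far more metadata memory than the same data stored in a few large files, which is why the amount of NameNode RAM caps the size of the file system.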
Slave Nodes
• DataNode
• 1-4000 per cluster
• Store file contents
• Contents are stored as opaque ‘blocks’ on the underlying file system
• Different blocks of the same file are stored on different DataNodes
• The same block is stored on three (or more) DataNodes for redundancy
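The placement rules above can be sketched with a toy model. The round-robin strategy and the names `place_blocks`, `dn1` to `dn4` are assumptions chosen for illustration; real HDFS uses its own placement policy (which also takes racks into account), so treat this only as a picture of "different blocks on different nodes, each block on three nodes".

```python
def place_blocks(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes (round-robin toy policy)."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    placement = {}
    for block_id in range(num_blocks):
        # Shift the starting node per block so consecutive blocks spread out.
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)]
            for r in range(replication)
        ]
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
layout = place_blocks(num_blocks=3, datanodes=nodes)
for block_id, replicas in layout.items():
    print(block_id, replicas)
```

Because every block lives on three distinct nodes, losing any single DataNode still leaves two copies of each block it held.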
Self-healing
• DataNodes send heartbeats to the NameNode
• After a period without any heartbeats, a DataNode is assumed to be lost
• NameNode determines which blocks were on the lost node
• NameNode finds other DataNodes with copies of these blocks
• These DataNodes are instructed to copy the blocks to other nodes
• Replication is actively maintained
[Figure: the same block stored on different DataNodes]
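The self-healing steps above can be sketched as a small simulation. The timeout value, the function names and the node and block names are all hypothetical; the sketch only mirrors the sequence described on the slide (detect missing heartbeats, find affected blocks, copy from a surviving replica) and is not how the NameNode is actually implemented.

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; illustrative value, not an HDFS default


def find_dead_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return DataNodes whose last heartbeat is older than `timeout`."""
    return sorted(n for n, t in last_heartbeat.items() if now - t > timeout)


def re_replicate(block_map, dead_nodes, live_nodes, replication=3):
    """Return (block, source, target) copy instructions; updates block_map in place."""
    dead = set(dead_nodes)
    instructions = []
    for block, holders in sorted(block_map.items()):
        survivors = [n for n in holders if n not in dead]
        if not survivors:
            continue  # no replica left to copy from
        candidates = [n for n in live_nodes if n not in survivors]
        missing = max(replication - len(survivors), 0)
        source = survivors[0]
        for target in candidates[:missing]:
            instructions.append((block, source, target))
            survivors.append(target)
        block_map[block] = survivors
    return instructions


last_heartbeat = {"dn1": 100.0, "dn2": 100.0, "dn3": 40.0, "dn4": 100.0}
block_map = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn2", "dn3", "dn4"]}

dead = find_dead_nodes(last_heartbeat, now=110.0)
live = [n for n in last_heartbeat if n not in dead]
copies = re_replicate(block_map, dead, live)
print(dead, copies)
```

In this run `dn3` has been silent for 70 seconds, so both of its blocks are copied from a surviving replica to the one live node that does not yet hold them.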
Block in HDFS
A block is the smallest unit of storage in a file system: the smallest contiguous
storage allocated to a file. In Hadoop, the default block size is 128 MB (Hadoop 2.x and later), and clusters are often configured to use 256 MB.
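As a quick worked example of the block arithmetic, the sketch below splits a file size into block sizes using the 128 MB figure mentioned above. The function name `split_into_blocks` is hypothetical and exists only for this illustration.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # the 128 MB block size from the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block a file of `file_size` bytes needs."""
    if file_size < 0:
        raise ValueError("file size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds the leftover bytes.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(300 * MB)
print([size // MB for size in blocks])
```

A 300 MB file therefore needs three blocks: two full 128 MB blocks and a final block holding the remaining 44 MB.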
What is MapReduce?
• Reduce Task
• Shuffle and Sort
• Reduce
• OutputFormat
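The reduce-side phases listed above can be illustrated in plain Python with a word-count example. This is a conceptual sketch only: it does not use the Hadoop API, the function names are hypothetical, and the tab-separated output is an assumption modeled on a simple text format.

```python
from collections import defaultdict


def shuffle_and_sort(map_outputs):
    """Group values from all map tasks by key and sort the keys."""
    grouped = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return sorted(grouped.items())


def reduce_counts(key, values):
    """Reduce step: sum all values seen for one key."""
    return key, sum(values)


def format_output(records):
    """Output step: render each record as a tab-separated text line."""
    return [f"{key}\t{value}" for key, value in records]


# Hypothetical output of two map tasks: (word, 1) pairs.
map_outputs = [
    [("hadoop", 1), ("hdfs", 1), ("hadoop", 1)],
    [("yarn", 1), ("hadoop", 1)],
]

grouped = shuffle_and_sort(map_outputs)
reduced = [reduce_counts(key, values) for key, values in grouped]
lines = format_output(reduced)
print("\n".join(lines))
```

The key point is the ordering: values for the same key are brought together and sorted before the reduce function runs, and only then is the result written out.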
YARN
YARN (Yet Another Resource Negotiator) is the resource management layer of
Hadoop.
❑ Separates resource management and job scheduling/monitoring into separate daemons
❑ One global ResourceManager and a per-application ApplicationMaster
❑ An application can be a single job or a DAG of jobs
• NodeManager
• Monitors the resource usage of each container and reports it to the ResourceManager
ResourceManager
The ResourceManager has two important components:
• Scheduler
• ApplicationManager
Scheduler
• The Scheduler is responsible for allocating resources to the various applications. It is a pure scheduler: it does not
monitor or track application status, and it does not restart tasks that fail because of software or
hardware errors. It allocates resources based on the resource requirements of the applications.
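The idea of a pure scheduler can be sketched as a function that only matches resource requests against free capacity. The `Request` class, the `schedule` function and the first-fit policy are assumptions made for this illustration; YARN's real schedulers are pluggable and considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class Request:
    app: str        # hypothetical application name
    memory_mb: int  # requested memory
    vcores: int     # requested virtual cores


def schedule(requests, free_memory_mb, free_vcores):
    """Grant requests in arrival order whenever they fit the remaining capacity."""
    granted, pending = [], []
    for req in requests:
        if req.memory_mb <= free_memory_mb and req.vcores <= free_vcores:
            free_memory_mb -= req.memory_mb
            free_vcores -= req.vcores
            granted.append(req.app)
        else:
            # A pure scheduler just leaves the request waiting;
            # it does not track or retry the application itself.
            pending.append(req.app)
    return granted, pending


requests = [
    Request("app-1", memory_mb=4096, vcores=2),
    Request("app-2", memory_mb=6144, vcores=2),
    Request("app-3", memory_mb=2048, vcores=1),
]
granted, pending = schedule(requests, free_memory_mb=8192, free_vcores=4)
print(granted, pending)
```

Here `app-2` does not fit in the memory left after `app-1`, so it stays pending while the smaller `app-3` is granted. Nothing in the function monitors what the applications do afterwards, which is the sense in which the scheduler is "pure".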
ApplicationManager
• Accepts job submissions.
• Negotiates the first container for executing the ApplicationMaster. A container incorporates elements such as CPU,
memory, disk, and network.
• Restarts the ApplicationMaster container on failure.
ApplicationMaster (one per application)
• Negotiates resource containers from the Scheduler.
• Tracks the status of its resource containers.
• Monitors the progress of the application.
Features of YARN
• Multi-tenancy
• Cluster Utilization
• Scalability
• Compatibility
⮚ https://data-flair.training/blogs/hadoop-tutorial/
⮚ https://www.geeksforgeeks.org/hadoop-introduction/
Thank You