
ECS640U/ECS765P Big Data Processing

Hadoop Principles and Components


Lecturer: Ahmed M. A. Sayed
School of Electronic Engineering and Computer Science

Credit: Joseph Doyle, Jesus Carrion, Felix Cuadrado, …


Contents

● Introduction to Apache Hadoop


● HDFS
● YARN
● The Apache Hadoop Ecosystem
Map/Reduce job
Map/Reduce framework roles
Parallelising the problem
Hadoop is a MapReduce framework which executes on a cluster of networked PCs
● Each node runs a set of daemons, or services, to facilitate the execution of MapReduce jobs
● YARN daemons:
    ResourceManager
    NodeManager (as many as worker nodes in the cluster)
    JobHistoryServer
● HDFS (Hadoop Distributed File System) daemons:
    NameNode
    DataNode (as many as worker nodes in the cluster)
    SecondaryNameNode
Nodes vs Daemons
● A node is a (virtual or physical) machine, or a container, on which Hadoop processes run
● A daemon is a process which runs in the background rather than being directly controlled by a user
Hadoop Architecture
Leader-Follower Architecture (also known as master-slave)
Leader (1)
● Is aware of all the follower nodes
● Receives external requests
● Decides who executes what, and when
● Speaks with the followers

Follower (1..*)
● Worker node
● Executes the tasks the leader tells it to do
Hadoop Leader-Follower architecture, from the daemons' point of view:
Contents

● Introduction to Apache Hadoop


● HDFS (Hadoop Distributed File System)
● YARN
● The Apache Hadoop Ecosystem
HDFS
Hadoop Distributed File System (HDFS)
● Shared distributed storage among the nodes of the Hadoop cluster
    Storage for input and output of MapReduce jobs
HDFS is tailored for MapReduce jobs
● Large block size (128 MB default)
    But not too large: blocks define the minimum parallelisation unit
    Trade-offs for improving data processing throughput
HDFS Namenode (Master Node)

● Manages the file system namespace


● Maintains the filesystem tree and metadata for all files and directories in the tree
● Namespace data is stored persistently in two files:
    Namespace image file
    Edit log file
● Also knows which Datanodes possess the blocks for a given file (not persistently)

The NameNode maintains information about the DataNodes, such as which block is mapped to which
DataNode (this information is called metadata), and also executes operations like the renaming of files.
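
To make the NameNode's role concrete, here is a minimal sketch (not from the slides) that uses the HDFS Java client to ask which DataNodes hold each block of a file; the path /data/input.txt is a hypothetical example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical file; in a real cluster this must already exist in HDFS
        FileStatus status = fs.getFileStatus(new Path("/data/input.txt"));
        // The NameNode answers this from its in-memory block map
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
          System.out.println(b.getOffset() + " -> " + String.join(",", b.getHosts()));
        }
      }
    }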
HDFS Datanode (Slave Node)

● Workhorse of the filesystem


● Stores and retrieves blocks when instructed
● Reports to Namenode periodically with a list of blocks that it is storing
● Implements block caching for blocks which are frequently accessed
    Blocks are cached in the Datanode's memory
    By default a block is only cached in one Datanode's memory, but this is configurable

DataNodes store the actual data and also perform tasks like replication and deletion of data as instructed by
the NameNode. DataNodes also communicate with each other.
Hadoop Nodes Daemons
● DataNode (1..* per cluster)
Stores blocks from the HDFS
Report periodically to NameNode list of stored blocks
● NameNode (1 per cluster)
    Keeps the index table with (all) the locations of each block
    Heavy task, but no computation responsibilities (the daemons have no computation responsibilities!)
    Single point of failure


● Secondary Namenode (1 per cluster)
Communicates periodically with NameNode
Stores backup copy of index table
HDFS Data Distribution
Data distribution is a key element of the MapReduce model and architecture
“Move computation to data” principle
Blocks are replicated over the cluster for fault-tolerance purposes
Default number of replicas is three
Spread replicas among different physical locations
Improves reliability
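
As an illustration (not from the slides), the replication factor can also be changed per file through the HDFS Java client; the path and factor below are assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask HDFS to keep 5 replicas of this (hypothetical) file instead of the default 3
        fs.setReplication(new Path("/data/hot-dataset.csv"), (short) 5);
      }
    }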
Data Replication
HDFS Usage

Note: the Client is an interface that communicates with the NameNode for metadata and with the DataNodes
for read and write operations.
HDFS File Read operation
HDFS File Write operation
    (when writing, the client only contacts one Datanode, which forwards the data along the replication
    pipeline; the final step of the write acknowledges completion)
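
Both paths can be exercised from the client side with the FileSystem API. A minimal sketch follows (illustrative, not the lecture's code; the path is an assumption): the client just opens streams, and the HDFS client library handles the block-by-block communication with the NameNode and DataNodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadWriteSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Write: bytes stream to the first Datanode of the pipeline, which
        // forwards them to the replicas; completion is acknowledged at the end
        try (FSDataOutputStream out = fs.create(new Path("/data/out.txt"))) {
          out.writeUTF("hello HDFS");
        }

        // Read: the NameNode supplies block locations; data comes from Datanodes
        try (FSDataInputStream in = fs.open(new Path("/data/out.txt"))) {
          System.out.println(in.readUTF());
        }
      }
    }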
Contents

● Introduction to Apache Hadoop


● HDFS
● YARN (Yet Another Resource Negotiator)
● The Apache Hadoop Ecosystem

Quiz and Break


MapReduce Classic JobTracker Architecture
Competing demand for resources and execution cycles arising from the single point of control in the design

Reliability, Availability and Utilization issues


Scalability issues: clusters of 10,000 nodes and/or 200,000 cores
Unpredictable latency, a major customer concern
Job Execution Architecture (YARN)
The fundamental idea of YARN is to split up the two major functionalities of the JobTracker into separate
processes (daemons):
    (1) Resource Management
    (2) Job Scheduling & Monitoring

Hadoop computation tasks
Resource Management (ResourceManager, NodeManager)
● Being aware of what resources are in the cluster
● Which resources are available/used/failed now

Job Allocation (ResourceManager, ApplicationMaster)
● How many resources are needed to compute the job
● Which nodes should execute each of the tasks

Job Execution/Monitoring (ApplicationMaster, NodeManager)
● Coordinate task execution from workers
● Make sure the job completes, deal with failures
Hadoop job allocation
Resource management needs to estimate how many Map and Reduce tasks are needed for a given job
● Based on input dataset
● Based on job definition
Ideally, a single node (physical node, VM, or container) will be allocated for each different Map/Reduce task
● Otherwise, multiple tasks can be allocated to the same node (physical node, VM, or container)
Job Execution: complete MapReduce job flow
● Split (logically) input data into computing chunks
● Assign one chunk to a (co-located) NodeManager
● Run 1..* Mappers
● Shuffle and Sort
● Run 1..* Reducers
● Results from the Reducers create the job output
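
As a concrete illustration of this flow, here is a minimal word-count job in the classic Hadoop Java API (a sketch, not the lecture's code; class and path names are illustrative, with input and output paths taken from the command line):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: one instance runs per input split; emits (word, 1) pairs
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reducer: after shuffle and sort, receives all counts for one key
      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input, split into chunks
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // Reducers write the job output here
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }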
How many Mappers are needed?
Mapper parallelisation:
● Each Mapper processes a different input split
● Input dataset size is known
Number of mappers = input size / split size
● If input has multiple small files, more Mappers can be invoked (Hadoop inefficiency)
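
For example (illustrative numbers): a 1 GB input with the default 128 MB split size yields 1024 / 128 = 8 Mappers, whereas 1000 files of 1 MB each yield 1000 Mappers, one per file, which is why many small files are inefficient in Hadoop.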
How many Reducers are needed?
Reducer parallelisation
● Keys are partitioned across the reducers
● Hard to automatically estimate what is the right number
● Too many Reducers can result in too much shuffle and sort.
Number of reducers = User defined parameter
● (in MapReduce job definition)
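
Continuing the driver sketch above, this is a single call in the job definition (4 is an arbitrary illustrative value):

    // In the driver, before job submission; partitions the key space across 4 Reducers
    job.setNumReduceTasks(4);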
Hadoop Execution daemons
ResourceManager (1 per cluster)
● Receives job requests from Hadoop Clients
● Creates one ApplicationMaster per job to manage it
● Allocates Containers in slave nodes, with assigned/dedicated resources
● Keeps track of the health of NodeManager nodes
NodeManager (1..* per cluster)
● Coordinates execution of Map and Reduce tasks at the node
● Sends heartbeat messages to the ResourceManager
    (each Container runs one task on its assigned chunk)

ApplicationMaster (runs in a Container on a worker node; this is where most of the computation takes place)

One per job. Implements the specific computing framework
● After creation, negotiates with ResourceManager how many resources will be required for the job
● Decides which nodes will run Map and Reduce jobs among the Containers given by the
ResourceManager
● Reports to the ResourceManager about the progress and completion of the whole job
● Is destroyed when the job is completed
● Job outcome recorded in the JobHistoryServer
Responsibilities on computation tasks
● Resource management

● Job allocation

● Job execution
Three different schedulers available in YARN
● FIFO
● Capacity
● Fair
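
The scheduler is selected cluster-wide in yarn-site.xml. As an illustrative configuration sketch (this is the standard property name in Hadoop 2+, but verify against your distribution), choosing the Fair Scheduler looks like:

    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>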
FIFO Scheduler
First in, first out. Requests for the first application in the queue are allocated first; once its requests have
been satisfied, the next application in the queue is served, and so on.
● Easy to understand
● No configuration necessary
● Not suitable for shared clusters
● Large applications will use all resources in the cluster so each application will have to wait its turn
FIFO Scheduler
Capacity Scheduler
A separate dedicated queue allows a small job to start as soon as it is submitted
● Large jobs finish later
● Smaller jobs get results back in reasonable time
● The overall cluster utilization can be low since the queue capacity is reserved for jobs in that queue
Capacity Scheduler
Fair Scheduler
Cluster will dynamically balance resources between all running jobs. Just after the first (large) job starts, it
is the only job running, so it gets all the resources in the cluster. When the second (small) job starts, it is
allocated half of the cluster resources so that each job is using its fair share of resources.
● Lag between the time the second job starts and when it receives its fair share, since it has to wait for
resources to free up as containers used by the first job complete.
● High cluster utilization
● Timely small job completion
Fair Scheduler
    (figure: note the lag after job 2 is submitted, before it receives its fair share)

Contents

● Introduction to Apache Hadoop


● HDFS
● YARN
● The Apache Hadoop Ecosystem

Quiz!
The Apache Hadoop Ecosystem

Several other components can be used in the Hadoop Ecosystem


Three of the important ones are:
● Security (Kerberos)
● Distributed Coordination Service (ZooKeeper)
● Data Ingestion from event-based data (Flume)
Kerberos
By default, security in Hadoop is set to "simple", which uses a simple authentication mechanism
However, malicious users could assume the root's identity to access or delete any data in the cluster
Kerberos prevents this by introducing a three-step process to gain access to a service:
● Authentication: The client authenticates itself to the Authentication Server and receives a
timestamped Ticket-Granting Ticket (TGT).
● Authorization: The client uses the TGT to request a service ticket from the Ticket-Granting Server.
● Service request: The client uses the service ticket to authenticate itself to the server that is providing
the service. In the case of Hadoop, this might be the namenode or the resource manager.

TGTs last 10 hours by default, so the user will only need to go through this process every 10 hours (this is
also configurable). It is similar in spirit to Single Sign-On (SSO).
Kerberos
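
From a client's perspective, a Kerberos-secured cluster is typically accessed by logging in from a keytab before using any Hadoop API. A minimal sketch (the principal and keytab path are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLoginSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos"); // enable Kerberos auth
        UserGroupInformation.setConfiguration(conf);
        // Hypothetical principal and keytab path; this obtains and caches the TGT
        UserGroupInformation.loginUserFromKeytab("alice@EXAMPLE.COM",
                                                 "/etc/security/alice.keytab");
        // ... subsequent FileSystem/YARN calls now authenticate via Kerberos
      }
    }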
Zookeeper
● Distributed, open-source coordination service for distributed applications.
keeps the distributed system functioning together as a single unit via synchronization and coordination
● Quorum algorithms for selecting leaders, agreeing on shared state
The minimum number of servers required to run ZooKeeper is called the quorum. ZooKeeper replicates the
whole data tree to all of the quorum servers.

https://medium.com/@akashsingla19/zookeeper-quorum-44906bb17d74
Hadoop with Automated failover

ZKFailoverController (ZKFC) is a ZooKeeper client which implements automated failover for the
NameNodes (active and standby) by:

● Failure detection - each NameNode maintains a persistent session in ZooKeeper. If the machine
crashes, the ZooKeeper session will expire, notifying the other NameNodes that a failover should be
triggered.
● Active NameNode election - If the active NameNode crashes, another node may take a special
exclusive lock in ZooKeeper indicating that it should become the next active.
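
The "special exclusive lock" can be pictured with ZooKeeper's ephemeral znodes. Below is a minimal illustrative sketch, not the actual ZKFC code; the ensemble address, znode path and payload are assumptions:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class LeaderElectionSketch {
      public static void main(String[] args) throws Exception {
        // Hypothetical quorum of three ZooKeeper servers
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {});
        try {
          // An EPHEMERAL znode vanishes when the creator's session expires,
          // so a crashed NameNode automatically releases the lock
          zk.create("/election/active-lock", "nn1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
          System.out.println("Became active");
        } catch (KeeperException.NodeExistsException e) {
          System.out.println("Another node is active; staying standby");
        }
      }
    }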

https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.0/fault-tolerance/content/configuring_and_deploying_namenode_automatic_failover.html
Flume

● Flume runs agents, which are long-lived Java processes that run sources and sinks, connected by channels
● A source in Flume produces events and delivers them to the channel
● The channel stores the events until they are forwarded to the sink
● The Flume installation is made up of a collection of connected agents running in a distributed topology

https://flume.apache.org/FlumeUserGuide.html
Flume

● A Flume event is defined as a unit of data flow having a byte payload and an optional set of string
attributes
● A Flume agent is a (JVM) process that hosts the components through which events flow from an
external source to the next destination
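
A single agent wiring a source to a sink through a channel is declared in a properties file. The sketch below follows the style of the Flume User Guide cited above; the agent and component names (a1, r1, c1, k1) are illustrative:

    # Hypothetical agent "a1": netcat source -> memory channel -> logger sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: produces one event per line received on localhost:44444
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    # Channel: buffers events in memory until the sink takes them
    a1.channels.c1.type = memory

    # Sink: writes each event to the log
    a1.sinks.k1.type = logger

    # Wiring: connect the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1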
Contents

● Introduction to Apache Hadoop


● HDFS
● YARN
● The Apache Hadoop Ecosystem

Quiz and End!
