Chapter - 6 - Hadoop

Chapter 6 provides an overview of Hadoop, an open-source software framework for distributed processing of large data sets. It discusses the architecture of Hadoop, including its components like HDFS and YARN, and highlights its advantages and disadvantages. The chapter also covers the roles of various Hadoop daemons and the requirements for effective Hadoop operation.


Chapter 6

Bal Krishna Nyaupane


Assistant Professor
Department of Electronics and Computer Engineering
Institute of Engineering, Tribhuvan University
[email protected]
3 Hadoop’s Developers
4 What is Hadoop?
5 What is Hadoop?
 Apache Hadoop is an open-source software framework that provides
highly reliable distributed processing of large data sets using simple
programming models.
 Hadoop, known for its scalability, is built on clusters of commodity
computers, providing a cost-effective solution for storing and processing
massive amounts of structured, semi-structured, and unstructured data
with no format requirements.
 At Google, MapReduce operations run on a special file system called the
Google File System (GFS), which is not open source and is highly
optimized for this purpose.
 Doug Cutting and Yahoo! reverse-engineered GFS and called their
implementation the Hadoop Distributed File System (HDFS).
6 What is Hadoop?
 The software framework that supports HDFS, MapReduce and other related
entities is called the project Hadoop or simply Hadoop.
 This is distributed by Apache.
7 Core Hadoop Concept
8 Hadoop High-Level Overview
9 Why Hadoop?
10 Why Hadoop?
11 Disadvantages of Hadoop
▪ Hadoop is written in Java which is a widely used
programming language hence it is easily exploited by
cyber criminals which makes Hadoop vulnerable to
security breaches.
▪ At the core, Hadoop has a batch processing engine
which is not efficient in stream processing. It cannot
produce output in real-time with low latency. It only
works on data which we collect and store in a file in
advance before processing.
▪ High Up Processing: In Hadoop, the data is read from
the disk and written to the disk which makes read/write
operations very expensive when we are dealing with tera
and petabytes of data. Hadoop cannot do in-memory
calculations hence it incurs processing overhead.
▪ Hadoop can efficiently perform over a small number of
files of large size. Hadoop stores the file in the form of file
blocks which are from 128MB in size(by default) to
256MB.
12 Who Uses Hadoop?
13 Motivations for Hadoop
 What were the limitations of earlier large-scale computing?
 What requirements should an alternative approach have?
 How does Hadoop address those requirements?
14 Motivations for Hadoop
 Early Large-Scale Computing
▪ Historically, computation was processor-bound.
▪ Advances in computer technology have historically
centered on improving the power of a single
machine.
 Advances in CPUs
▪ Moore’s Law: the number of transistors on a dense
integrated circuit doubles every two years.
▪ Single-core computing can’t scale with current
computing needs.
▪ Single-Core Limitation: power consumption limits
the speed increase we get from transistor density.
15 Motivations for Hadoop
 Distributed Systems: Allow developers to use multiple machines for a single task.
 Distributed System Problems: Programming on a distributed system is much more
complex
▪ Synchronizing data exchanges
▪ Managing a finite bandwidth
▪ Controlling computation timing is complicated
 Distributed systems must be designed with the expectation of failure
 Distributed System Data Storage
▪ Typically divided into Data Nodes and Compute Nodes
▪ At compute time, data is copied to the Compute Nodes
▪ Fine for relatively small amounts of data
▪ Modern systems deal with far more data than was gathered in the past
16 Requirements for Hadoop
 Must support Partial Failures.
▪ Failure of a single component must not
cause the failure of the entire system, only a
degradation of the application performance.
▪ Failure should not result in the loss of any
data.
 Must support Component Recovery.
▪ If a component fails, it should be able to
recover without restarting the entire
system.
▪ Component failure or recovery during a
job must not affect the final output.
17 Requirements for Hadoop
 Must be scalable.
▪ Increasing resources should increase load capacity.
▪ Increasing the load on the system should result in a graceful decline
in performance for all jobs, not system failure.
18 The Hadoop Ecosystem
 Hadoop Common
▪ The common utilities that support the other Hadoop modules.
▪ It is an essential part or module of the Apache Hadoop Framework, along
with the Hadoop Distributed File System (HDFS), Hadoop YARN and Hadoop
MapReduce.
▪ Hadoop Common is also known as Hadoop Core.
19 The Hadoop Ecosystem
20 HDFS
HDFS is a file system written in Java, based on Google’s
GFS.
Responsible for storing data on the cluster.
Provides redundant storage for massive amounts of data.
Data files are split into blocks and distributed across the
nodes in the cluster.
Each block is replicated multiple times.
HDFS works best with a smaller number of large files:
millions as opposed to billions of files,
typically 100MB or more per file.
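As a concrete illustration, here is a minimal sketch of writing a file to HDFS and asking the NameNode where its blocks landed, using the standard org.apache.hadoop.fs client API; the path name is a hypothetical example and a configured cluster is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class HdfsBlocksDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // client handle to HDFS

    Path file = new Path("/user/demo/sample.txt"); // hypothetical path
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeBytes("hello hdfs\n");              // data is split into blocks as it is written
    }

    // Ask the NameNode which DataNodes hold each block of the file
    FileStatus status = fs.getFileStatus(file);
    for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println(loc);                     // block offset, length, and replica hosts
    }
  }
}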
21 HDFS: Master/Slave Architecture
22 HDFS: Master/Slave Architecture
23 HDFS: Master/Slave Architecture
24 HDFS Architecture
25 HDFS Architecture
Hadoop Distributed File System follows the master–slave
architecture. Each cluster comprises a single master node and
multiple slave nodes.
Internally, files are divided into one or more blocks, and
each block is stored on different slave machines depending on
the replication factor.
The master node stores and manages the file system
namespace, that is, information about the blocks of files such
as block locations, permissions, etc. The slave nodes store the
data blocks of files.
The master node is the NameNode, and the DataNodes are the
slave nodes.
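As a worked example: with the default 128 MB block size and a replication
factor of 3, a 500 MB file is split into four blocks (three of 128 MB and
one of 116 MB), and each of those blocks is stored on three different
DataNodes, so the cluster holds twelve block replicas for this one file.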
26 What is HDFS NameNode?
 NameNode is the centerpiece of the HDFS. It maintains and manages
the file system namespace and provides the right access
permission to the clients.
 The NameNode stores information about block locations,
permissions, etc. on the local disk in the form of two files:
▪ Fsimage: Fsimage stands for File System image. It contains the
complete namespace of the Hadoop file system since the
NameNode’s creation.
▪ Edit log: It contains all the recent changes made to the file
system namespace since the most recent Fsimage.
27 Functions of HDFS NameNode
1. It executes the file system namespace operations like opening,
renaming, and closing files and directories.
2. NameNode manages and maintains the DataNodes.
3. It determines the mapping of blocks of a file to DataNodes.
4. NameNode records each change made to the file system namespace.
5. It keeps the locations of each block of a file.
6. NameNode takes care of the replication factor of all the blocks.
7. NameNode receives heartbeats and block reports from all DataNodes,
which confirm that the DataNodes are alive.
8. If the DataNode fails, the NameNode chooses new DataNodes for
new replicas.
28 What is HDFS DataNode?
 DataNodes are the slave nodes in Hadoop HDFS. DataNodes are
inexpensive commodity hardware. They store blocks of a file.
 Functions of DataNode
▪ DataNode is responsible for serving client read/write
requests.
▪ Based on instructions from the NameNode, DataNodes
perform block creation, replication, and deletion.
▪ DataNodes send heartbeats to the NameNode to report the
health of HDFS.
▪ Each DataNode also sends block reports to the NameNode,
listing the blocks it stores.
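For scale: assuming stock Hadoop 2 defaults (dfs.heartbeat.interval and
dfs.blockreport.intervalMsec), heartbeats are sent every 3 seconds and full
block reports roughly every 6 hours, and a DataNode that stays silent for
about 10 minutes is declared dead, which triggers re-replication of its
blocks onto other nodes.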
29 HDFS: Data Replication
30 HDFS Data Block
31 HDFS Fault Tolerance
32 What is YARN?
33 Hadoop YARN
34 What is YARN?
 YARN stands for “Yet Another Resource Negotiator“.
 It was introduced in Hadoop 2.0 to remove the bottleneck of the JobTracker,
which was present in Hadoop 1.0.
 YARN was described as a “Redesigned Resource Manager” at the time of its
launch, but it has since evolved into a large-scale distributed
operating system for Big Data processing.
 YARN allows different data processing methods, such as graph processing,
interactive processing, and stream processing, as well as batch processing, to
run and process data stored in HDFS. YARN therefore opens up Hadoop to
other types of distributed applications beyond MapReduce.
 YARN enables users to perform operations as required using a variety of
tools, such as Spark for real-time processing, Hive for SQL, HBase for
NoSQL, and others.
35 Components of YARN
36 Components of YARN
 Apache Hadoop YARN Architecture consists of the following main
components.
 Resource Manager
➢ It is the ultimate authority in resource allocation.
➢ On receiving processing requests, it passes parts of the requests to the
corresponding Node Managers, where the actual processing
takes place.
➢ It is the arbitrator of the cluster resources and decides the allocation of the
available resources for competing applications.
➢ It has two major components: a) Scheduler b) Application Manager
➢ The scheduler is responsible for allocating resources to the various running
applications subject to constraints of capacities, queues etc.
➢ Application Manager is responsible for accepting job submissions.
37 Components of YARN
 Node Manager
▪ Node Managers are the slave daemons; they run on every DataNode and are
responsible for the execution of tasks on that node.
▪ A Node Manager takes care of an individual node in a Hadoop cluster and
manages user jobs and workflow on the given node.
▪ It registers with the Resource Manager and sends heartbeats with the health
status of the node.
 Application Master: Manages the user job lifecycle and resource needs of an
individual application. It works along with the Node Manager and monitors the
execution of tasks.
 Container: A package of resources including RAM, CPU, network, HDD, etc. on
a single node.
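To make these pieces concrete, here is a minimal sketch of the client side of
the YARN workflow using the public org.apache.hadoop.yarn.client.api.YarnClient
API. The application name, resource sizes, and launch command are illustrative
assumptions, not values from the slides.

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnSubmitSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();                                   // connect to the Resource Manager

    // Ask the Resource Manager for a new application id
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app");                   // illustrative name

    // Container spec for the Application Master: 512 MB RAM, 1 vCore (assumed sizes)
    ctx.setResource(Resource.newInstance(512, 1));

    // Command that launches the Application Master inside its container (hypothetical)
    ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
        Collections.emptyMap(),                           // local resources (jars, files)
        Collections.emptyMap(),                           // environment variables
        Collections.singletonList("java MyAppMaster"),    // hypothetical AM launch command
        null, null, null);
    ctx.setAMContainerSpec(amContainer);

    yarnClient.submitApplication(ctx);                    // Scheduler + Applications Manager take over
  }
}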
38 YARN Application Workflow
39 Hadoop Daemons
 Daemon means process. Hadoop daemons are the set of processes that
run on Hadoop. Hadoop is a framework written in Java, so all of these processes
are Java processes. Apache Hadoop 2 consists of the following daemons:
1. NameNode
2. DataNode
3. Secondary Name Node
4. Resource Manager
5. Node Manager
 “Running Hadoop” means running a set of daemons, or resident programs,
on the different servers in your network.
 These daemons have specific roles; some exist only on one server, some
exist across multiple servers.
40 Hadoop Daemons: NameNode
 The NameNode (master) directs the slave DataNode daemons to perform the
low-level I/O tasks.
 The NameNode is the bookkeeper of HDFS; it keeps track of how your files are
broken down into file blocks, which nodes store those blocks, and the overall
health of the distributed file system.
 The function of the NameNode is memory and I/O intensive. As such, the
server hosting the NameNode typically doesn’t store any user data or perform
any computations for a MapReduce program, in order to lower the workload
on the machine.
 It’s a single point of failure of your Hadoop cluster.
 For any of the other daemons, if their host nodes fail for software or hardware
reasons, the Hadoop cluster will likely continue to function smoothly or you can
quickly restart it. Not so for the NameNode.
41 Hadoop Daemons: DataNode
 Each slave machine in the cluster hosts a DataNode daemon to perform the
work of the distributed file system: reading and writing HDFS blocks to actual
files on the local file system. To read or write an HDFS file, the file is broken
into blocks, and the NameNode tells your client which DataNode each block
resides on.
 The client communicates directly with the DataNode daemons to process the
local files corresponding to the blocks.
 A DataNode may also communicate with other DataNodes to replicate its data
blocks for redundancy.
 DataNodes are constantly reporting to the NameNode. Upon initialization, each
of the DataNodes informs the NameNode of the blocks it’s currently storing.
 After this mapping is complete, the DataNodes continually poll the NameNode to
provide information regarding local changes as well as receive instructions to
create, move, or delete blocks from the local disk.
42 Hadoop Daemons: Secondary Name Node
 The Secondary NameNode (SNN) is an assistant daemon for monitoring
the state of the cluster HDFS.
 Like the NameNode, each cluster has one SNN.
 No other DataNode or TaskTracker daemons run on the same server.
 The SNN differs from the NameNode in that it doesn’t receive or record any
real-time changes to HDFS. Instead, it communicates with the NameNode to
take snapshots of the HDFS metadata at intervals defined by the cluster
configuration.
 As mentioned earlier, the NameNode is a single point of failure for a Hadoop
cluster, and the SNN snapshots help minimize the downtime and loss of
data.
 Nevertheless, a NameNode failure requires human intervention to
reconfigure the cluster to use the SNN as the primary NameNode.
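For reference, assuming stock Hadoop 2 defaults, the snapshot interval
mentioned above is governed by dfs.namenode.checkpoint.period (3600
seconds, i.e. hourly) and dfs.namenode.checkpoint.txns (1,000,000
uncheckpointed transactions), whichever threshold is reached first.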
43 Hadoop Daemons: Resource Manager
 The Resource Manager is also known as the global master daemon; it runs
on the master system. The Resource Manager manages the resources for
the applications that are running in a Hadoop cluster.
 The Resource Manager mainly consists of two components:
1. Applications Manager
2. Scheduler
 The Applications Manager is responsible for accepting requests from
clients and also arranges resources on the slaves in the Hadoop cluster
to host the Application Master.
 The Scheduler allocates resources to the applications running in a
Hadoop cluster; monitoring each application is left to its
Application Master.
44 JobTracker
 JobTracker process runs on a separate node and not usually on a DataNode.
 JobTracker is an essential Daemon for MapReduce execution in MRv1. It is replaced
by ResourceManager/ApplicationMaster in MRv2.
 JobTracker receives the requests for MapReduce execution from the client.
 JobTracker talks to the NameNode to determine the location of the data.
 JobTracker finds the best TaskTracker nodes to execute tasks based on the data
locality (proximity of the data) and the available slots to execute a task on a given
node.
 JobTracker monitors the individual TaskTrackers and submits the overall
status of the job back to the client.
 The JobTracker process is critical to the Hadoop cluster in terms of MapReduce
execution.
 When the JobTracker is down, HDFS is still functional, but MapReduce
execution cannot be started and the existing MapReduce jobs are halted.
45 TaskTracker
TaskTracker daemons run on DataNodes, typically on all DataNodes.
The TaskTracker is replaced by the Node Manager in MRv2.
Mapper and Reducer tasks are executed on DataNodes
administered by TaskTrackers.
TaskTrackers are assigned Mapper and Reducer tasks to
execute by the JobTracker.
A TaskTracker is in constant communication with the
JobTracker, signalling the progress of the task in execution.
TaskTracker failure is not considered fatal. When a TaskTracker
becomes unresponsive, the JobTracker assigns the tasks it was
executing to another node.
46 Hadoop Daemons: Node Manager
The Node Manager runs on the slave systems and manages the
resources, such as memory and disk, within its node.
Each slave node in a Hadoop cluster has a single NodeManager
daemon running on it.
The Node Manager monitors resource usage on the node and sends
this monitoring information to the Resource Manager.
47 Hadoop Configuration Modes
48 Standalone Mode
 The standalone mode is the default mode for Hadoop.
 Hadoop chooses to be conservative and assumes a minimal configuration.
All XML (configuration) files are empty under this default mode. With
empty configuration files, Hadoop runs completely on the local
machine.
 In standalone mode none of the daemons run, i.e. NameNode,
DataNode, Secondary NameNode, JobTracker, and TaskTracker. (The
JobTracker and TaskTracker handle processing in Hadoop 1; in
Hadoop 2 the Resource Manager and Node Manager do.)
 Standalone mode also means that we are installing Hadoop on only a
single system.
 We mainly use Hadoop in this mode for the purposes of learning,
testing, and debugging.
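Because standalone mode runs jobs entirely in the local JVM against the
local file system, a job such as the classic WordCount can be tried without
any cluster at all. Below is a minimal sketch using the standard
org.apache.hadoop.mapreduce API; the input and output paths are hypothetical
local directories.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws java.io.IOException, InterruptedException {
      // Emit (word, 1) for every whitespace-separated token in the line
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) { word.set(token); ctx.write(word, ONE); }
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws java.io.IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();   // total count per word
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // empty config => local (standalone) mode
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("input"));    // hypothetical local dir
    FileOutputFormat.setOutputPath(job, new Path("output")); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}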
49 Pseudo Distributed Mode
 The pseudo-distributed mode is running Hadoop in a “cluster of one” with all
daemons running on a single machine. This mode complements the
standalone mode for debugging your code, allowing you to examine memory
usage, HDFS input/output issues, and other daemon interactions.
 The NameNode and Resource Manager are used as masters, and the DataNode
and Node Manager are used as slaves. A Secondary NameNode is also used as a
master; its purpose is to keep an hourly checkpoint (backup) of the NameNode’s
metadata.
 In this mode,
▪ Hadoop is used both for development and for debugging purposes.
▪ HDFS (the Hadoop Distributed File System) is utilized for managing the
input and output processes.
▪ We need to change the configuration files mapred-site.xml, core-site.xml,
and hdfs-site.xml to set up the environment.
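The exact property values vary by version and tutorial; as one common sketch
(assuming a Hadoop 2 single-node setup with the NameNode listening on
localhost:9000), the three files above carry settings equivalent to the
following, shown here through the Java Configuration API:

import org.apache.hadoop.conf.Configuration;

public class PseudoDistributedSettings {
  public static Configuration build() {
    Configuration conf = new Configuration();
    // core-site.xml: point clients at the single local NameNode (port is an assumption)
    conf.set("fs.defaultFS", "hdfs://localhost:9000");
    // hdfs-site.xml: one machine, so keep a single replica of each block
    conf.set("dfs.replication", "1");
    // mapred-site.xml: run MapReduce on YARN instead of the local runner
    conf.set("mapreduce.framework.name", "yarn");
    return conf;
  }
}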
50 Fully Distributed Mode
 This is the most important mode, in which multiple nodes are used: a few
of them run the master daemons, namely the NameNode and the Resource
Manager, and the rest run the slave daemons, namely the DataNode and
the Node Manager.
 It gives the benefits of distributed storage and distributed computation:
▪ Master: the server that hosts the namenode and job-tracker
daemons
▪ Backup: the server that hosts the secondary namenode daemon
▪ Slaves: the servers that host both datanode and tasktracker
daemons
51 Thank You