
Big Data Analytics:

Hadoop
Dr. Shivangi Shukla
Assistant Professor
Computer Science and Engineering
IIIT Pune
Contents
• Introduction
• History
• The Apache Hadoop Project
• HDFS

Big Data Analytics: Concepts and Methods 2


Introduction to Hadoop
• Hadoop:
• open-source software framework
• provides reliable shared storage and analysis
• designed to handle big data in distributed computing environments
• the framework is written in Java, with some native code in C and shell scripts

Big Data Analytics: Hadoop 3


History
• The initial version of Hadoop was created in 2004
by Doug Cutting (and named after his son’s stuffed
elephant).
• Hadoop became a top-level Apache Software
Foundation project in January 2008.
• By then, Hadoop was being used by many companies, including Facebook and the New York Times.
• There have been many contributors, both academic and commercial,
• with Yahoo! being the largest such contributor
Big Data Analytics: Hadoop 4
The Apache Hadoop Project
• Hadoop
• collection of related subprojects that fall under
the umbrella of infrastructure for distributed
computing
• These projects are hosted by the Apache
Software Foundation
• provides support for a community of open
source software projects
• Hadoop is best known for
• MapReduce
• HDFS (Hadoop Distributed File System)
Big Data Analytics: Hadoop 5
The Apache Hadoop Project

Fig 1: Hadoop Subprojects

Big Data Analytics: Hadoop 6


The Apache Hadoop Project
• Core
• A set of components and interfaces for distributed
filesystems and general I/O (serialization, Java RPC,
persistent data structures)
• Avro
• A data serialization system for efficient, cross-language
RPC, and persistent data storage.
• HDFS (Hadoop Distributed File System)
• A distributed filesystem that runs on large clusters of
commodity machines.
• Allows storage of large amounts of data across multiple
machines

Big Data Analytics: Hadoop 7


The Apache Hadoop Project
• MapReduce
• A distributed data processing model and execution
environment that runs on large clusters of commodity
machines
• Pig
• A data flow language and execution environment for
exploring very large datasets. Pig runs on HDFS and
MapReduce clusters.
• HBase
• A distributed, column-oriented database that uses HDFS
for its underlying storage.
• It supports both batch-style computations using
MapReduce and point queries (random reads)
Big Data Analytics: Hadoop 8
The Apache Hadoop Project
• ZooKeeper
• A distributed, highly available coordination service that
provides primitives such as distributed locks that can be
used for building distributed applications.
• Hive
• A distributed data warehouse that manages data stored
in HDFS and provides a query language based on SQL for
querying the data.
• Chukwa
• A distributed data collection and analysis system that
runs collectors that store data in HDFS, and it uses
MapReduce to produce reports.
Big Data Analytics: Hadoop 9
Two Important Components of
Hadoop
• File System
• provided by HDFS (Hadoop Distributed File
System)
• allows storage of large amounts of data across
multiple machines
• Programming Paradigm
• provided by MapReduce
• used to process the large volumes of data stored in HDFS

Big Data Analytics: Hadoop 10


Hadoop Distributed File System
(HDFS)

11
HDFS (Hadoop Distributed File
System)
• Distributed Filesystems
• Filesystems that manage the storage across a
network of machines are called distributed
filesystems.
• Distributed filesystems are more complex than
regular disk filesystems owing to complications of
network programming.
• One of the biggest challenges is to tolerate node
failure without suffering data loss.

Big Data Analytics: Hadoop 12


HDFS Features
• Very Large Files
• Files that are hundreds of megabytes, gigabytes, or terabytes
in size.
• There are Hadoop clusters running today that store
petabytes of data.
• Streaming Data Access
• HDFS is built around the idea that the most efficient data
processing pattern is a write-once, read-many-times pattern.
• A dataset is typically generated or copied from a source, and then various analyses are performed on that dataset over time. Each analysis involves a large proportion, if not all, of the dataset;
• hence, the time to read the whole dataset is more important than the latency in reading the first record.
Big Data Analytics: Hadoop 13
HDFS Features..

• Commodity Hardware
• Hadoop is designed to run on clusters of commodity
hardware.
• Such commodity hardware has a high chance of node failure across the cluster, especially in the case of large clusters
• HDFS is designed to carry on working without a
noticeable interruption to the user in the face of such
failure

Big Data Analytics: Hadoop 14


Limitations of HDFS
• High-latency Data Access
• HDFS has high latency for data access because it is optimized for batch processing rather than real-time access, favoring throughput over latency
• Inefficient in handling lots of small files
• HDFS is not well suited to handling large numbers of small files because
• the metadata overhead increases: the NameNode tracks every file, directory, and block in memory
• HDFS uses large block sizes (typically 128 MB or 256 MB). Each small file still occupies its own block and metadata entry, so storing millions of small files inflates NameNode metadata and leads to inefficient storage utilization.
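A rough back-of-the-envelope illustration of this metadata pressure, as a minimal Python sketch. The per-object memory figure is only a commonly cited rough estimate, and the file counts are hypothetical:

```python
BYTES_PER_NAMESPACE_OBJECT = 150   # commonly cited rough estimate per file/block entry

def namenode_metadata_bytes(num_files, blocks_per_file=1):
    """Approximate NameNode heap used: one object per file plus one per block."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_NAMESPACE_OBJECT

# Roughly 1 TB of data stored two ways (128 MB block size assumed):
print(namenode_metadata_bytes(10_000_000))   # ~10 million 100 KB files -> ~3 GB of heap
print(namenode_metadata_bytes(8_000))        # ~8,000 files of 128 MB  -> ~2.4 MB of heap
```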
Big Data Analytics: Hadoop 15
Limitations of HDFS..
• No support for multiple writers
• HDFS does not support multiple writers
• Files in HDFS may be written to by a single writer.
Writes are always made at the end of the file.

Big Data Analytics: Hadoop 16


HDFS Concepts
• Blocks
• A disk has a block size, which defines the minimum amount of data that it can read or write
• Filesystem blocks are a few kilobytes in size, while disk blocks are normally 512 bytes
• HDFS has blocks of a large size: 64 MB by default
• HDFS blocks are large compared to disk blocks, and the reason is to minimize the seek time
• By making a block large enough, the transfer time (time to transfer the data from the disk) can be made significantly larger than the seek time (the time to seek to the start of the block)
Big Data Analytics: Hadoop 17
HDFS Concepts..
• Ques: Assume Seek Time = 10 ms and is 1% of Transfer Time. If the transfer rate is 100 MB/sec, what should the block size be?

Big Data Analytics: Hadoop 18


HDFS Concepts..
Soln:
Seek Time = 10 ms

Seek Time = 1% of Transfer Time

Transfer Time = 10 ms / 0.01 = 1000 ms

Transfer Rate = 100 MB/sec = 0.1 MB/ms

Block Size = Transfer Rate * Transfer Time
= 0.1 MB/ms * 1000 ms = 100 MB
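The same arithmetic as a minimal Python sketch; the 1% seek-overhead target and the input values are taken from the question above:

```python
def block_size_mb(seek_time_ms, transfer_rate_mb_per_s, seek_fraction=0.01):
    """Block size (in MB) such that seek time is `seek_fraction` of transfer time."""
    transfer_time_ms = seek_time_ms / seek_fraction          # 10 ms / 0.01 = 1000 ms
    transfer_rate_mb_per_ms = transfer_rate_mb_per_s / 1000  # 100 MB/s = 0.1 MB/ms
    return transfer_rate_mb_per_ms * transfer_time_ms        # 0.1 MB/ms * 1000 ms

print(block_size_mb(10, 100))  # -> 100.0 (MB)
```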

Big Data Analytics: Hadoop 19


HDFS Concepts..

• The default block size in HDFS is 64 MB, although many HDFS installations use 128 MB blocks
• This size will continue to grow as transfer speeds increase with new generations of disk drives.

Big Data Analytics: Hadoop 20


HDFS Concepts..
• Block abstraction offers several benefits:
• If a file is larger than any single disk in the network, its blocks can be stored on any of the disks in the cluster
• Block abstraction safeguards against corrupted blocks and
disk failure and machine failure as each block is replicated
to a small number of physically separate machines
(typically three).
• If a block becomes unavailable, a copy can be read
from another location in a way that is transparent to
the client.
• A block that is no longer available due to corruption or machine failure can be replicated from its alternative locations to other live machines to bring the replication factor back to the normal level.
Big Data Analytics: Hadoop 21
HDFS Architecture

Fig 1: HDFS Architecture

Big Data Analytics: Hadoop 22


HDFS Architecture
• HDFS
• block-structured filesystem
• each file is divided into blocks of a pre-determined size
• these blocks are stored across a cluster of one or
several machines
• follows a Master/Slave architecture, where a cluster comprises
• a single NameNode (Master node)
• other nodes that are DataNodes (Slave nodes)
• HDFS can be deployed on a broad spectrum of machines that support Java
• Though one can run several DataNodes on a single machine, in practice these DataNodes are spread across various machines.
Big Data Analytics: Hadoop 23
HDFS Architecture..

Fig 2: HDFS Architecture with NameNode and DataNode

Big Data Analytics: Hadoop 24
HDFS Architecture..
• NameNode (also called the Primary NameNode)
• Manages
• filesystem namespace
• Maintains
• filesystem tree and
• metadata for all the files and directories in the tree
• this information is stored persistently on the local disk in the
form of two files:
• FsImage (filesystem image or namespace image)
• EditLogs
• NameNode has knowledge about
• DataNodes on which all the blocks for a given file are
located
Big Data Analytics: Hadoop 25
HDFS Architecture..
• Client
• responsible for loading data and fetching results
• A client accesses the filesystem on behalf of the
user by communicating with the NameNode and
datanodes
• DataNode
• stores and retrieves blocks
• as instructed by NameNode or clients
• considered as workhorses of the filesystem
• DataNodes report back to NameNode
periodically
• with lists of blocks that they are storing

Big Data Analytics: Hadoop 26
HDFS Architecture..
• It should be noted that the filesystem cannot be
used without NameNode
• if the machine running the NameNode collapses
• all the files on the filesystem would be lost since
there would be no way of knowing how to
reconstruct the files from the blocks on the
DataNodes
• hence, it is important to ensure that the NameNode is resilient to failure

Big Data Analytics: Hadoop 27
Functions of NameNode

• master node that maintains and manages the DataNodes (slave nodes)
• executes file system namespace operations like
opening, closing, and renaming files and directories
• determines the mapping of blocks to DataNodes
(internally, a file is split into one or more blocks and
these blocks are stored in a set of DataNodes)

Big Data Analytics: Hadoop 28


Functions of NameNode..
• records the metadata of all the files stored in the
cluster
• location of blocks stored
• size of the files, permissions, hierarchy, etc.
• there are two files associated with the metadata:
• FsImage: contains the complete state of the
file system namespace since the start of the
Namenode
• EditLogs: contains all the recent modifications
made to the file system with respect to the
most recent FsImage.
Big Data Analytics: Hadoop 29
Functions of NameNode..
• records each change that takes place to the file system
metadata.
• For example, if a file is deleted in HDFS, the
NameNode will immediately record this in the EditLog
• regularly receives heartbeats and block reports from all the DataNodes in the cluster to ensure that the DataNodes are alive
• keeps a record of all the blocks in HDFS and of the nodes on which these blocks are located.
• NameNode is also responsible for taking care of the replication of all the blocks
• In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage, and manages communication traffic to the DataNodes

Big Data Analytics: Hadoop 30
Functions of DataNodes
DataNodes are commodity hardware, that is, inexpensive systems that are not of high quality or high availability, whereas the NameNode is a highly available server that manages the filesystem namespace and controls access to files by clients.
The DataNode functions are as follows:
• DataNodes are slave processes, one running on each slave machine
• actual data is stored on DataNodes
• DataNodes also perform block creation, deletion, and
replication upon instruction from the NameNode

Big Data Analytics: Hadoop 31


Functions of DataNodes..

The DataNode functions are as follows:


• DataNodes serve the low-level read and write requests from the file system's clients
• DataNodes send heartbeats to the NameNode periodically to report that they are functioning; by default, this interval is set to 3 seconds. Block reports listing the stored blocks are sent at longer intervals.

Big Data Analytics: Hadoop 32


What if NameNode fails??
• Hadoop has two mechanisms to provide resilience
against failure of NameNode:
1) Have back up of files that make up the persistent
state of the filesystem metadata.
➢Hadoop can be configured so that the
namenode writes its persistent state to multiple
filesystems.
➢These writes are synchronous and atomic.
❖The usual configuration choice is to write to local
disk as well as a remote NFS (Network File
System) mount.

Big Data Analytics: Hadoop 33
What if NameNode fails?? (contd)
2) Run a Secondary NameNode that keeps a copy of the merged
namespace image, which can be used in the event of the
namenode failing
➢Secondary NameNode does not act as namenode
➢Its main role is to periodically merge the namespace image
with the edit log to limit the edit log from becoming too large
➢It usually runs on a separate physical machine, since it
requires plenty of CPU and as much memory as the
namenode to perform the merge.
➢However, the state of the Secondary NameNode lags that of the Primary, so in the event of total failure of the primary, data loss is almost guaranteed.
❖The usual course of action in this case is to copy the NameNode's metadata files that are on NFS to the secondary and run it as the new primary.

Big Data Analytics: Hadoop 34
Secondary NameNode
• Secondary NameNode works concurrently with the
Primary NameNode as a helper node
• It should be noted that Secondary NameNode is not
backup of NameNode
• Secondary NameNode is responsible for performing
regular checkpoints in HDFS
• Thus, it is also called the Checkpoint Node

Big Data Analytics: Hadoop 35


Secondary NameNode..
• Functions of Secondary NameNode are as follows:
➢Secondary NameNode constantly reads all the file
systems and metadata from the RAM of the NameNode
and writes it into the hard disk or the file system
➢It is responsible for combining
the EditLogs with FsImage from the NameNode
➢It downloads the EditLogs from the NameNode at
regular intervals and applies to FsImage. The new
FsImage is copied back to the NameNode, which is used
whenever the NameNode is started the next time.

Big Data Analytics: Hadoop 36


Secondary NameNode..

Fig 3: NameNode and Secondary NameNode

Big Data Analytics: Hadoop 37


Important Points
• The Secondary NameNode lags the latest state of the Primary NameNode because the merging of FsImage and EditLogs happens only at certain intervals of time (by default, every hour)
• So, if the NameNode fails completely, then there will definitely be a loss of at least a small amount of data.
• To overcome this, Hadoop implemented the Backup Node and Checkpoint Node

Big Data Analytics: Hadoop 38


Secondary NameNode - Checkpoint
• The checkpoint process on Secondary NameNode is
controlled by two configuration parameters.
1) dfs.namenode.checkpoint.period
➢1 hour by default,
➢the maximum delay between two consecutive
checkpoints
2) dfs.namenode.checkpoint.txns
➢1 million by default
➢the number of uncheckpointed transactions on the NameNode that will force an urgent checkpoint, even if the checkpoint period has not been reached.
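As a rough illustration of how these two parameters interact, here is a minimal Python sketch of the checkpoint decision. The constants mirror the defaults listed above; the logic is a simplification, not the actual Hadoop implementation:

```python
import time

CHECKPOINT_PERIOD_SECS = 60 * 60   # dfs.namenode.checkpoint.period: 1 hour
CHECKPOINT_TXNS = 1_000_000        # dfs.namenode.checkpoint.txns: 1 million

def should_checkpoint(last_checkpoint_time, uncheckpointed_txns, now=None):
    """Checkpoint if the period has elapsed or too many edit-log
    transactions have accumulated since the last checkpoint."""
    now = time.time() if now is None else now
    period_elapsed = (now - last_checkpoint_time) >= CHECKPOINT_PERIOD_SECS
    too_many_txns = uncheckpointed_txns >= CHECKPOINT_TXNS
    return period_elapsed or too_many_txns
```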

Big Data Analytics: Hadoop 39


Secondary NameNode – Checkpoint..
• It should be noted that the Secondary NameNode
stores latest checkpoint in a directory which is
structured in the same way as Primary NameNode’s
directory.
• Thus, checkpoint image is always ready to be
used by Primary NameNode if necessary

Big Data Analytics: Hadoop 40


Primary and Secondary NameNode
Working
• NameNode stores modifications to the file system
as a log appended to a native file system
file, EditLog
• When a NameNode starts up, it reads HDFS state
from an image file, FsImage, and then applies edits
from the EditLog file
• It then writes new HDFS state to FsImage and
starts normal operation with an empty EditLog

Big Data Analytics: Hadoop 41


Primary and Secondary NameNode
Working..
• Since NameNode merges FsImage and EditLog files only
during start up,
• the EditLog file could get very large over time on a
busy cluster
• Also, large EditLog file increases the time required
for next restart of NameNode
• The Secondary NameNode merges the FsImage and the
EditLog files periodically and keeps the size of EditLog
file within a limit
• Usually, Secondary NameNode runs on a different machine
than the Primary NameNode since its memory
requirements are on the same order as the Primary
NameNode.

Big Data Analytics: Hadoop 42
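To make the FsImage/EditLog mechanics concrete, here is a minimal Python sketch of a checkpoint merge under simplified assumptions: the namespace is modeled as a plain dict persisted as JSON, and edits as a list of operations. The real FsImage and EditLog formats are binary and far more involved:

```python
import json

def load_fsimage(path):
    """Namespace snapshot modeled as {file_path: metadata}."""
    with open(path) as f:
        return json.load(f)

def apply_edits(namespace, edits):
    """Replay EditLog entries on top of the FsImage snapshot."""
    for op in edits:
        if op["op"] == "create":
            namespace[op["path"]] = op["meta"]
        elif op["op"] == "delete":
            namespace.pop(op["path"], None)
        elif op["op"] == "rename":
            namespace[op["dst"]] = namespace.pop(op["src"])
    return namespace

def checkpoint(fsimage_path, edits):
    """Merge edits into a new FsImage so the edit log can be truncated."""
    namespace = apply_edits(load_fsimage(fsimage_path), edits)
    with open(fsimage_path, "w") as f:
        json.dump(namespace, f)
    edits.clear()   # merged edits no longer need to be replayed at startup
```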
NameNode and DataNode
• NameNode and DataNodes are pieces of software
designed to run on commodity machines.
• These machines typically run a GNU/Linux operating
system (OS).
• HDFS is built using the Java language; any machine
that supports Java can run the NameNode or the
DataNode software.
• Usage of the highly portable Java language means
that HDFS can be deployed on a wide range of
machines.

Big Data Analytics: Hadoop 43


NameNode and DataNode..
• The typical deployment has a dedicated machine
that runs only the NameNode software. Each of the
other machines in the cluster runs one instance of
the DataNode software.
• The existence of a single NameNode in a cluster
greatly simplifies the architecture of the system.
The NameNode is the arbitrator and repository for
all HDFS metadata.

Big Data Analytics: Hadoop 44


The File System Namespace
• HDFS supports a traditional hierarchical file
organization.
▪ A user or an application can create directories
and store files inside these directories
▪ The file system namespace hierarchy is similar
to most other existing file systems;
▪ one can create and remove files, move a file
from one directory to another, or rename a file

Big Data Analytics: Hadoop 45


The File System Namespace
• The NameNode maintains the file system
namespace
▪ Any change to the file system namespace or its
properties is recorded by NameNode.
▪ An application can specify the number of
replicas of a file that should be maintained by
HDFS.
➢The number of copies of a file is called the
replication factor of that file. This
information is stored by the NameNode.

Big Data Analytics: Hadoop 46


Data Replication
• HDFS is designed to offer reliable storage for very large
files across machines in a large cluster
• Files in HDFS are write-once and have strictly one writer
at any time
• HDFS stores each file as a sequence of blocks
• all blocks in a file except the last block are the same
size
• To provide fault tolerance
• blocks of a file are replicated
• By default, replication factor is three (which is
configurable)
Big Data Analytics: Hadoop 47
Data Replication..
• Block size and Replication factor are configurable per file
• The application can specify the number of replicas of a file
• and this replication factor can be specified at the time of
file creation and can be changed later
• NameNode makes all decisions regarding replication of
blocks
• NameNode periodically receives a heartbeat and a block
report from each of the DataNodes in the cluster
• receipt of heartbeat implies that the DataNode is
functioning properly
• the block report provides the list of all blocks on the
respective DataNode

Big Data Analytics: Hadoop 48
Data Replication..

• the block report helps in maintaining the replication factor in HDFS
• if blocks are over-replicated
• then the NameNode deletes replicas of those blocks
• if blocks are under-replicated
• then the NameNode adds replicas of those blocks
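A minimal sketch of that decision, assuming a simplified view in which the NameNode only compares the replica count reported for each block against the target replication factor (illustrative names, not the actual Hadoop code):

```python
from collections import Counter

def replication_work(block_reports, replication_factor=3):
    """block_reports maps a DataNode id to the list of block ids it stores.
    Returns blocks needing extra replicas and blocks with surplus replicas."""
    replica_counts = Counter()
    for blocks in block_reports.values():
        replica_counts.update(blocks)

    under = {b: replication_factor - n for b, n in replica_counts.items()
             if n < replication_factor}    # schedule re-replication
    over = {b: n - replication_factor for b, n in replica_counts.items()
            if n > replication_factor}     # schedule replica deletion
    return under, over
```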

Big Data Analytics: Hadoop 49


Data Replication..

Fig 4: Block Replication(with the assumption replication factor is three)

Big Data Analytics: Hadoop 50


Rack Awareness
• HDFS stores files across multiple nodes (DataNodes) in
a cluster
• To get the maximum performance from Hadoop
• and to improve network traffic during file
read/write,
• NameNode chooses DataNodes on the same
rack or nearby racks for data read/write
• It follows an in-built Rack Awareness algorithm to
• choose the closest DataNode based on rack information
• NameNode ensures that all the replicas are not stored
on the same rack or a single rack
Big Data Analytics: Hadoop 51
Rack Awareness: Rack
• A rack is a collection of around 40-50 DataNodes connected using the same network switch
• If that switch fails, the whole rack becomes unavailable
• A large Hadoop cluster is deployed in multiple racks

Big Data Analytics: Hadoop 52


Data Replication (without Rack
Awareness)
• Simple but non-optimal policy
• place replicas on unique racks (as shown in Fig 4)
• this prevents losing data when an entire rack fails
• and evenly distributes replicas in the cluster
which makes it easy to balance load on
component failure
• However, this policy increases the cost of writes
because a write needs to transfer blocks to multiple
racks

Big Data Analytics: Hadoop 53


Rack Awareness in HDFS
• HDFS comprises of multiple racks
• and each rack consists of multiple DataNodes
• HDFS maintains rack IDs of each DataNode as rack
information
• each DataNode is associated with a specific rack,
and this association is identified by a rack ID
• rack ID is used to group DataNodes within the same
physical rack
• Communication between the DataNodes on the same
rack is more efficient as compared to the
communication between DataNodes residing on
different racks
Big Data Analytics: Hadoop 54
Rack Awareness in HDFS..
• To reduce the network traffic during file read/write,
• NameNode chooses the closest DataNode for
serving the client read/write request
• NameNode utilizes rack information to gain
knowledge about rack ids of each DataNode
• The concept of choosing the closest DataNode
based on the rack information is known as Rack
Awareness

Big Data Analytics: Hadoop 55


Need of Rack Awareness
The reasons for the Rack Awareness in HDFS are:
i. To reduce the network traffic while file read/write,
which improves the cluster performance
• The communication between DataNodes residing
on different racks is directed via switch. In general,
we have greater network bandwidth between
DataNodes in the same rack than the DataNodes
residing in different rack.
• So, Rack Awareness helps to reduce write traffic between different racks, thus providing better write performance. Similarly, the read performance also increases.
Big Data Analytics: Hadoop 56
Need of Rack Awareness..
The reasons for the Rack Awareness in HDFS are:
ii. To achieve fault tolerance, even when the rack goes
down
iii. To achieve high availability of data so that data is
available even in unfavorable conditions
• for instance, an entire rack fails because of the switch
failure or power failure
iv. To reduce latency, that is, to perform the file read/write operations with lower delay
NameNode uses a rack awareness algorithm while
placing the replicas in HDFS
Big Data Analytics: Hadoop 57
Replica Placement via Rack Awareness in HDFS
• HDFS stores replicas of data blocks of a file to provide
fault tolerance and high availability.
• Communication between the nodes (DataNodes) on
the same rack is more efficient as compared to the
communication between nodes (DataNodes) residing
on different racks
• the network bandwidth between nodes on the
same rack is higher than the network bandwidth
between nodes on a different rack

Big Data Analytics: Hadoop 58


Replica Placement via Rack Awareness in HDFS..
• If we store replicas on different nodes on the same
rack, then it improves the network bandwidth,
• but if the rack fails (rarely happens), then there
will be no copy of data on another rack
• If replicas are on unique racks,
• then the transfer of blocks to multiple racks during writes increases the cost of writes.

Big Data Analytics: Hadoop 59


Replica Placement via Rack Awareness in HDFS
• NameNode on multiple rack cluster maintains block
replication by using inbuilt Rack awareness
policies which says:
➢Not more than one replica is placed on any one DataNode.
➢Not more than two replicas are placed on the same rack.
➢The number of racks used for block replication should always be smaller than the number of replicas.
• The chance of rack failure is far less than that of node failure; this algorithm does not impact data reliability and availability guarantees.

Big Data Analytics: Hadoop 60
Replica Placement via Rack Awareness in HDFS..
• Usually the replication factor is three; the block replication policy
▪ puts the first replica on the local rack
▪ the second replica on a different DataNode on the same rack
▪ the third replica on a different rack
• To improve write performance and network traffic without compromising fault tolerance, while re-replicating a block,
▪ if the existing replica is one,
o place the second replica on a different rack
▪ if the existing replicas are two and are on the same rack,
o then place the third replica on a different rack
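The placement just described can be sketched in Python under simplified assumptions: nodes are (rack, node) pairs, the writer runs on a DataNode, and ties are broken randomly. This illustrates the policy only; it is not Hadoop's actual BlockPlacementPolicy code:

```python
import random

def place_replicas(writer, nodes, replication_factor=3):
    """writer: (rack, node) of the writing client; nodes: all (rack, node) pairs in the cluster."""
    targets = [writer]                                   # 1st replica: local node
    same_rack = [n for n in nodes if n[0] == writer[0] and n != writer]
    other_racks = [n for n in nodes if n[0] != writer[0]]

    if same_rack:                                        # 2nd replica: same rack, different node
        targets.append(random.choice(same_rack))
    if other_racks:                                      # 3rd replica: a different rack
        targets.append(random.choice(other_racks))

    remaining = [n for n in nodes if n not in targets]   # extra replicas: random remaining nodes
    while len(targets) < replication_factor and remaining:
        targets.append(remaining.pop(random.randrange(len(remaining))))
    return targets[:replication_factor]
```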
Big Data Analytics: Hadoop 61
Replica Placement via Rack Awareness in HDFS..

Fig 5: Replica Placement via Rack Awareness in HDFS


Big Data Analytics: Hadoop 62
Data Replication: Replica Selection
• To minimize global bandwidth consumption and
read latency,
• HDFS tries to satisfy a read request from a
replica that is closest to the reader
• If there exists a replica on the same rack as
the reader node, then that replica is
preferred to satisfy the read request
• If HDFS cluster spans multiple data centers,
• then a replica that is resident in the local data
center is preferred over any remote replica.
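A minimal sketch of that preference order, assuming each replica location is described by a hypothetical (data_center, rack, node) tuple (illustrative only):

```python
def pick_closest_replica(reader, replicas):
    """reader and each replica are (data_center, rack, node) tuples.
    Prefer same node, then same rack, then same data center, then anything."""
    def distance(replica):
        if replica == reader:
            return 0          # same node
        if replica[:2] == reader[:2]:
            return 1          # same rack
        if replica[0] == reader[0]:
            return 2          # same data center
        return 3              # remote data center
    return min(replicas, key=distance)
```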
Big Data Analytics: Hadoop 63
Data Replication..
• Checked in
• When a file is stored in HDFS, it is divided into
blocks, and each block is replicated across multiple
DataNodes according to the replication factor (by
default, 3 replicas).
• NameNode keeps track of where each block's
replicas are stored across the DataNodes
• After the NameNode initiates the replication of a
block to several DataNodes, it waits for each
DataNode to report back (via block reports) that it
has successfully received and stored the block. This
reporting back is what is referred to as "checked in."
Big Data Analytics: Hadoop 64
Data Replication..
• The checked in mechanism ensures the following:
➢Data Integrity: The "checked in" mechanism
ensures that the NameNode is fully aware of the
block's location across the cluster, which is crucial
for maintaining data integrity
➢Fault Tolerance: Only after all required replicas have checked in can the system conclude that the block is safely replicated. If a DataNode fails prior to checking in, the NameNode can instruct other DataNodes to create additional replicas, ensuring the data is not lost.

Big Data Analytics: Hadoop 65


Data Replication..
• A block is considered “safely replicated”
• when the minimum number of required replicas (as
per the replication factor) has successfully been
stored
• and the corresponding DataNodes have
"checked in" by sending a block report
• that confirms the storage of those replicas to
NameNode thereby, ensuring data
redundancy and reliability

Big Data Analytics: Hadoop 66


Data Replication: Safemode
• On startup, the NameNode enters a special state called
Safemode
• During Safe Mode, the NameNode does not allow any
modifications to the filesystem, meaning no new files can be
created, and no existing files can be modified or deleted.
• NameNode receives heartbeat and block report messages
from the DataNodes
• Once the NameNode determines that a configurable percentage of blocks are safely replicated (information provided through block reports confirming that DataNodes have checked in),
• the NameNode waits for an additional 30 seconds before exiting Safemode
Big Data Analytics: Hadoop 67
Data Replication: Safemode..
• The additional delay of 30 seconds before exiting
Safemode
• serves as a buffer period to receive any remaining
block reports and gives the system time to stabilize
before resuming normal operations
• After exiting Safemode
• NameNode allows clients to start writing data to
HDFS, modifying files, and performing other normal
operations
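A minimal sketch of the Safemode exit condition described above. The threshold value is a hypothetical stand-in for the configurable percentage; the 30-second extension matches the slide:

```python
import time

SAFE_REPLICATION_THRESHOLD = 0.999   # hypothetical value for the configurable percentage
SAFEMODE_EXTENSION_SECS = 30         # extra wait after the threshold is reached

def wait_to_exit_safemode(fraction_safely_replicated):
    """fraction_safely_replicated: callable returning the current fraction of
    blocks whose minimum number of replicas have checked in."""
    while fraction_safely_replicated() < SAFE_REPLICATION_THRESHOLD:
        time.sleep(1)                       # keep collecting block reports
    time.sleep(SAFEMODE_EXTENSION_SECS)     # buffer period before leaving Safemode
```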

Big Data Analytics: Hadoop 68


DataNode Failure in HDFS
• The primary objective of HDFS is to store data reliably even in the
presence of failures.
• Detection of DataNode failure,
• Each DataNode sends a heartbeat message to the NameNode
periodically
• NameNode detects DataNode failure by the absence of a
heartbeat message. NameNode marks DataNodes without
recent heartbeats as dead and does not forward any new input
or output requests to them.
• Any data that was registered to a dead DataNode is not
available to HDFS any more. Thus, a DataNode failure may
cause the replication factor of some blocks to fall below their
specified value.
• NameNode constantly tracks which blocks need to be
replicated and initiates replication whenever necessary.
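A minimal sketch of heartbeat-based failure detection, under the simplifying assumption of a single liveness timeout (the timeout constant is hypothetical; the real NameNode derives its dead-node interval from the heartbeat settings):

```python
import time

HEARTBEAT_TIMEOUT_SECS = 630   # hypothetical timeout; silent nodes are marked dead

class HeartbeatMonitor:
    def __init__(self):
        self.last_heartbeat = {}   # datanode_id -> time of last heartbeat

    def record_heartbeat(self, datanode_id):
        self.last_heartbeat[datanode_id] = time.time()

    def dead_datanodes(self):
        """DataNodes whose heartbeats have not been seen recently; blocks stored
        only on these nodes must be re-replicated elsewhere."""
        now = time.time()
        return [node for node, seen in self.last_heartbeat.items()
                if now - seen > HEARTBEAT_TIMEOUT_SECS]
```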
Big Data Analytics: Hadoop 69
DataNode Failure in HDFS..
• The necessity for re-replication may arise due to many
reasons:
❖ a DataNode may become unavailable, or
❖ a replica may become corrupted,
❖ hard disk on a DataNode may fail, or
❖ the replication factor of a file may be increased

Big Data Analytics: Hadoop 70


Cluster Rebalancing
• HDFS architecture is compatible with data rebalancing
schemes.
• A scheme might automatically move data from one
DataNode to another if the free space on a
DataNode falls below a certain threshold.
• In the event of a sudden high demand for a
particular file, a scheme might dynamically create
additional replicas and rebalance other data in the
cluster.
• However, these types of data rebalancing schemes
are not yet implemented.

Big Data Analytics: Hadoop 71


Data Integrity
• Blocks of data retrieved from DataNode can be
corrupted,
• this corruption can occur because of faults in a
storage device, network faults, or buggy software
• To provide data integrity in HDFS,
• client software implements checksum checking on
the contents of files
• When a client creates an HDFS file,
• client computes a checksum of each block of the file
and stores these checksums in a separate hidden
file in the same HDFS namespace
Big Data Analytics: Hadoop 72
Data Integrity..
• When a client retrieves file contents,
• client verifies that the data it received from each
DataNode matches the checksum stored in the
associated checksum file
• If not, then the client can opt to retrieve that block
from another DataNode that has a replica of that
block
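A minimal sketch of client-side checksum verification in the spirit described above. CRC32 over fixed-size chunks is an assumption for illustration; HDFS's actual checksum format and chunk size are implementation details not covered here:

```python
import zlib

CHUNK_SIZE = 512   # illustrative chunk size

def chunk_checksums(data):
    """Compute a CRC32 checksum for each fixed-size chunk of a block."""
    return [zlib.crc32(data[i:i + CHUNK_SIZE])
            for i in range(0, len(data), CHUNK_SIZE)]

def verify_block(data, stored_checksums):
    """Return True if the block read from a DataNode matches the stored checksums;
    if False, the client would retry the read from another replica."""
    return chunk_checksums(data) == stored_checksums
```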

Big Data Analytics: Hadoop 73


Metadata Disk Failure
• FsImage and the EditLog are central data structures of
HDFS
• corruption of these files can cause the HDFS
instance to be non-functional
• NameNode can be configured to support maintaining
multiple copies of the FsImage and EditLog
• Any update to either the FsImage or EditLog causes
each of the FsImages and EditLog to get updated
synchronously
• this synchronous updating of multiple copies of the
FsImage and EditLog may degrade the rate of
namespace transactions per second that NameNode
can support

Big Data Analytics: Hadoop 74
Metadata Disk Failure..
• The degradation of namespace transactions per second is
acceptable
• because HDFS applications are very data intensive in
nature, but they are not metadata intensive
• When a NameNode restarts,
• it selects the latest consistent FsImage and EditLog to
use
• The NameNode machine is a single point of failure for
HDFS cluster
• if NameNode machine fails, manual intervention is
necessary
• currently, automatic restart and failover of the
NameNode software to another machine is not
supported

Big Data Analytics: Hadoop 75
Snapshots
• Snapshots support storing a copy of data at a particular
instant of time
• One usage of the snapshot feature may be to roll back a
corrupted HDFS instance to a previously known good
point in time.

Big Data Analytics: Hadoop 76


Data Organization: Blocks
• HDFS is designed to support very large files.
• Applications that are compatible with HDFS are those
that deal with large data sets.
• These applications write their data only once but
they read it one or more times and require these
reads to be satisfied at streaming speeds.
• HDFS supports write-once-read-many semantics on
files.
• A typical block size used by HDFS is 64 MB.
• Thus, an HDFS file is chopped up into 64 MB chunks,
and if possible, each chunk will reside on a different
DataNode.

Big Data Analytics: Hadoop 77
Data Organization: Staging
• A client request to create a file does not reach the NameNode
immediately.
▪ Initially the HDFS client caches the file data into a temporary
local file. Application writes are transparently redirected to
this temporary local file.
▪ When the local file accumulates data worth over one HDFS
block size, the client contacts NameNode.
▪ NameNode inserts the file name into the file system
hierarchy and allocates a data block for it.
o NameNode responds to the client request with the
identity of the DataNode and the destination data block.
o Then the client flushes the block of data from the local
temporary file to the specified DataNode.
Big Data Analytics: Hadoop 78
Data Organization: Staging

• When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode.
• The client then tells the NameNode that the file is closed. At
this point, the NameNode commits the file creation
operation into a persistent store. If the NameNode dies
before the file is closed, the file is lost.
• The client writes to temporary local file and not to remote
file directly
• because network speed and congestion in the network
impacts throughput

Big Data Analytics: Hadoop 79


Data Organization: Replication
Pipelining
• When a client is writing data to an HDFS file, initially its
data is first written to a local file
• Assume the HDFS has a replication factor of three for
any file
➢When the local file accumulates a full block of user
data, the client retrieves a list of DataNodes from
the NameNode
oThis list contains DataNodes that will host a
replica of that block
➢The client then flushes the data block to the first
DataNode
Big Data Analytics: Hadoop 80
Data Organization: Replication
Pipelining..
➢The first DataNode starts receiving the data in small
portions (4 KB), writes each portion to its local
repository and transfers that portion to the second
DataNode in the list.
➢The second DataNode, in turn starts receiving each
portion of the data block, writes that portion to its
repository and then flushes that portion to the third
DataNode.
➢Finally, the third DataNode writes the data to its local
repository.
• DataNode can be receiving data from the previous one in
the pipeline and at the same time forwarding data to the
next one in the pipeline. Thus, the data is pipelined from
one DataNode to the next.

Big Data Analytics: Hadoop 81
Space Reclamation: File Deletion
• When a file is deleted by a user or an application, it is not
immediately removed from HDFS
• Instead, HDFS first renames it to a file in
the /trash directory.
• The file can be restored quickly as long as it remains
in /trash
• A file remains in /trash for a configurable amount of
time
• After the expiry of its life in /trash, NameNode deletes
the file from the HDFS namespace
• The deletion of a file causes the blocks associated with
the file to be freed
Big Data Analytics: Hadoop 82
Space Reclamation: File Deletion
• It should be noted that there could be an
appreciable time delay between the time a file is
deleted by a user and the time of the
corresponding increase in free space in HDFS.

Big Data Analytics: Hadoop 83


Space Reclamation: File Un-Deletion
• User can undelete a file after deleting it as long as it remains
in the /trash directory.
• If a user wants to undelete a file that he/she has deleted, then the user can navigate to the /trash directory and retrieve the file.
• The /trash directory contains only the latest copy of
the file that was deleted.
• The /trash directory is just like any other directory
with one special feature: HDFS applies specified policies
to automatically delete files from this directory.
• The current default policy is to delete files
from /trash that are more than six hours old.
• In the future, this policy will be configurable through a well-defined interface.

Big Data Analytics: Hadoop 84
HDFS Read/ Write Architecture
• HDFS follows Write Once-Read Many Philosophy.
• Users can’t edit files already stored in HDFS.
• However, users can append new data by re-
opening the file

Big Data Analytics: Hadoop 85


HDFS Write Architecture
• Assume an HDFS client wants to write a file named “example.txt” of size 248 MB
• Assume the block size is configured as 128 MB (default)
• The client will divide the file “example.txt” into two blocks: one of 128 MB (Block A) and the other of 120 MB (Block B)

Big Data Analytics: Hadoop 86


HDFS Write Architecture..
• The following protocol is followed whenever the data is
written into HDFS:
i. Client will reach out to the NameNode for a Write
Request against the two blocks, say, Block A &
Block B (two blocks of file “example.txt”)
ii. NameNode grants the client the write permission and provides the IP addresses of the DataNodes where the file blocks have to be copied eventually
❑this selection of DataNode IP addresses is randomized, subject to availability, the replication factor, and rack awareness

Big Data Analytics: Hadoop 87


HDFS Write Architecture..
iii. Assuming that replication factor is set to default i.e.
three. Therefore, for each block the NameNode will
be providing the client a list of three IP addresses of
DataNodes. It should be noted that this list will be
unique for each block.
iv. Assume that the NameNode provides following lists
of IP addresses to the client:
❑For Block A, list A = {IP of DataNode 1, IP of
DataNode 4, IP of DataNode 6}
❑For Block B, list B = {IP of DataNode 7, IP of
DataNode 9, IP of DataNode 3}
Big Data Analytics: Hadoop 88
HDFS Write Architecture..
v. Each block will be copied in three different
DataNodes to maintain the replication factor
consistent throughout the cluster.
vi. Now the whole data copy process will happen in
three stages:
A. Set up of Pipeline
B. Data streaming and replication
C. Shutdown of Pipeline (Acknowledgement
stage)

Big Data Analytics: Hadoop 89


HDFS Write Architecture..
A. Set up of Pipeline
• Before writing the blocks, the client confirms whether
the DataNodes, present in each of the list of IPs, are
ready to receive the data or not.
▪ If ready, then the client creates a pipeline for each
of the blocks by connecting the individual
DataNodes in the respective list for that block.
▪ Suppose for Block A, the list of DataNodes provided
by the NameNode is as follows:
▪ For Block A, list A = {IP of DataNode 1, IP of
DataNode 4, IP of DataNode 6}
Big Data Analytics: Hadoop 90
HDFS Write Architecture..
A. Set up of Pipeline
• For block A, the client has to perform the following steps to
create a pipeline:
1) The client chooses the first DataNode in the list (DataNode
IPs for Block A) which is DataNode 1 and establishes a
TCP/IP connection.
2) The client informs DataNode 1 to be ready to receive the
block. It will also provide the IPs of next two DataNodes (4
and 6) to the DataNode 1 where the block is supposed to
be replicated.
3) DataNode 1 connects to DataNode 4. DataNode 1 informs DataNode 4 to be ready to receive the block and provides the IP of DataNode 6. Then, DataNode 4 informs DataNode 6 to be ready for receiving the data.

Big Data Analytics: Hadoop 91
HDFS Write Architecture..
A. Set up of Pipeline
4) Next, the acknowledgement of readiness follows the reverse sequence,
❑i.e. the acknowledgement flows from DataNode 6 to DataNode 4 and then to DataNode 1.
5) At last, DataNode 1 informs the client that all the
DataNodes are ready and a pipeline is formed
between the client, DataNode 1, DataNode 4 and
DataNode 6.
Now pipeline set up is complete and the client will finally
begin the data copy or streaming process.
Big Data Analytics: Hadoop 92
HDFS Write Architecture..

Fig 6: Setting up of Write Pipeline in HDFS


Big Data Analytics: Hadoop 93
HDFS Write Architecture..
B. Data Streaming and Replication
• Once the pipeline has been created, the client pushes
the data into the pipeline.
• Data is replicated based on replication factor in HDFS.
• Thereby, as per our example, Block A will be stored to
three DataNodes as the assumed replication factor is
three.
• The client will copy the block A to DataNode 1 only.
• The replication is always done by DataNodes
sequentially.
Big Data Analytics: Hadoop 94
HDFS Write Architecture..
B. Data Streaming and Replication
The following steps are executed during replication:
i. Once the block has been written to DataNode 1
by the client, DataNode 1 connects to DataNode
4.
ii. Then, DataNode 1 pushes the block in the
pipeline and data is copied to DataNode 4.
iii. Again, DataNode 4 connects to DataNode 6 and
DataNode 6 copies the last replica of the block.

Big Data Analytics: Hadoop 95


HDFS Write Architecture..

Fig 7: HDFS Write operation for Block A


Big Data Analytics: Hadoop 96
HDFS Write Architecture..
C. Shutdown of Pipeline or Acknowledgement Stage
• Once the block has been copied into all the three
DataNodes,
• series of acknowledgements takes place to ensure the
client and NameNode that the data has been written
successfully
• Finally, the client closes the pipeline to end the TCP
session
• As per previous example, the acknowledgement
follows in the reverse sequence i.e. from DataNode 6
to DataNode 4 and then to DataNode 1
• Finally, DataNode 1 pushes three acknowledgements (including its own) into the pipeline and sends them to the client
• The client informs NameNode that data has been
written successfully. The NameNode updates its
metadata and the client will shut down the pipeline
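The three stages above can be summarized in a short Python simulation under simplified assumptions: in-memory objects stand in for DataNodes, no networking is involved, and the class names are illustrative rather than Hadoop classes:

```python
class DataNodeSim:
    def __init__(self, name):
        self.name = name
        self.storage = {}            # block_id -> data held by this simulated DataNode

def write_block(block_id, data, pipeline):
    """Stage A is implied by `pipeline` (ordered targets chosen by the NameNode).
    Stage B: each node stores the block and forwards it to the next node.
    Stage C: acknowledgements return in reverse order back to the client."""
    for node in pipeline:                            # data streams down the pipeline
        node.storage[block_id] = data
    acks = [node.name for node in reversed(pipeline)]
    return acks                                      # client then notifies the NameNode

# Usage: replication factor three, as in the example above.
pipeline = [DataNodeSim("DataNode 1"), DataNodeSim("DataNode 4"), DataNodeSim("DataNode 6")]
print(write_block("Block A", b"...", pipeline))      # ['DataNode 6', 'DataNode 4', 'DataNode 1']
```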
Big Data Analytics: Hadoop 97
HDFS Write Architecture..

Fig 8: Acknowledgment Stage for Block A in HDFS


Big Data Analytics: Hadoop 98
HDFS Write Architecture..
• As per previous example (“example.txt” into two blocks –
Block A and Block B),
• Block B will also be copied into the DataNodes in parallel
with Block A. So,
• The following points are to be noted here:
➢The client will copy Block A and Block B to the first
DataNode simultaneously.
➢Therefore, in our example, two pipelines will be
formed for each of the block and all the process will
happen in parallel in these two pipelines.
➢The client writes the block into the first DataNode
and then the DataNodes will be replicating the block
sequentially.
Big Data Analytics: Hadoop 99
HDFS Write Architecture..
List for Block A = {DataNode 1, DataNode 4, DataNode 6}
List for Block B = {DataNode 7, DataNode 9, DataNode 3}
The two pipelines are
• Block A: 1A -> 2A -> 3A -> 4A
• Block B: 1B -> 2B -> 3B -> 4B -> 5B -> 6B

Fig 9: Multi-Block Write Operation for Block A and Block B in HDFS


Big Data Analytics: Hadoop 100
HDFS Read Architecture
• Recall that the HDFS client wrote a file named “example.txt” of size 248 MB
• The block size is configured as 128 MB (default)
• The file “example.txt” is therefore stored as two blocks: one of 128 MB (Block A) and the other of 120 MB (Block B)

Big Data Analytics: Hadoop 101


HDFS Read Architecture
The following steps are executed to read the file:
• The client will reach out to NameNode asking for the
block metadata for the file “example.txt”.
• The NameNode will return the list of DataNodes where
each block (Block A and B) are stored.
• After that client, will connect to the DataNodes where
the blocks are stored.
• The client starts reading data in parallel from the DataNodes (Block A from DataNode 1 and Block B from DataNode 3).
• Once the client gets all the required file blocks, it will
combine these blocks to form a file.
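A minimal Python sketch of that client-side read flow, with the NameNode's block metadata modeled as a plain dict (illustrative structures, not an HDFS client API):

```python
def read_file(filename, namenode_metadata, datanodes,
              pick_replica=lambda locations: locations[0]):
    """namenode_metadata: {filename: [(block_id, [datanode ids holding a replica]), ...]}
    datanodes: {datanode_id: {block_id: bytes}}  (simulated DataNode storage)
    pick_replica: replica-selection strategy; a real client prefers the closest replica."""
    parts = []
    for block_id, replica_locations in namenode_metadata[filename]:
        chosen = pick_replica(replica_locations)
        parts.append(datanodes[chosen][block_id])    # read the block from that DataNode
    return b"".join(parts)                           # blocks combined in order
```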
Big Data Analytics: Hadoop 102
HDFS Read Architecture..
• While serving read request of the client, HDFS
selects the replica which is closest to the client.
• This reduces the read latency and the
bandwidth consumption.
• Therefore, that replica is selected which
resides on the same rack as the reader node,
if possible.

Big Data Analytics: Hadoop 103


HDFS Commands: Hadoop Shell
Commands to Manage HDFS
Following are some frequently used HDFS
commands:
fsck HDFS command to check the health of the Hadoop file system.
Command: hdfs fsck /
ls HDFS Command to display the list of Files and Directories in HDFS
Command: hdfs dfs -ls /
mkdir HDFS Command to create a directory in HDFS.
Command: hdfs dfs -mkdir /directory_name
Big Data Analytics: Hadoop 104
HDFS Commands: Hadoop Shell
Commands to Manage HDFS
Following are some frequently used HDFS
commands:
touchz HDFS command to create a file in HDFS with file size 0 bytes.
Command: hdfs dfs -touchz /directory/filename
du HDFS Command to check file size
Command: hdfs dfs -du -s /directory/filename
text HDFS Command that takes a source file and outputs the file in text format
Command: hdfs dfs -text /directory/filename
Big Data Analytics: Hadoop 105
HDFS Commands: Hadoop Shell
Commands to Manage HDFS
Following are some frequently used HDFS commands:

copyFromLocal HDFS command to copy a file from the local file system to HDFS
Command: hdfs dfs -copyFromLocal <localsrc> <hdfs destination>
copyToLocal HDFS Command to copy a file from HDFS to the local file system
Command: hdfs dfs -copyToLocal <hdfs source> <localdst>
Big Data Analytics: Hadoop 106
HDFS Commands: Hadoop Shell
Commands to Manage HDFS
Following are some frequently used HDFS commands:
put HDFS command to copy single or multiple sources from the local file system to the destination file system.
Command: hdfs dfs -put <localsrc> <destination>
get HDFS Command to copy files from HDFS to the local file system
Command: hdfs dfs -get <src> <localdst>
count HDFS Command to count the number of directories, files, and bytes under the paths that match the specified file pattern.
Command: hdfs dfs -count <path>

Big Data Analytics: Hadoop 107
HDFS Commands: Hadoop Shell
Commands to Manage HDFS
• Following are some frequently used HDFS
commands
rm HDFS command to remove a file from HDFS
Command: hdfs dfs -rm <path>
rm -r HDFS command to remove an entire directory and all its content from HDFS
Command: hdfs dfs -rm -r <path>
cp HDFS command to copy files from source to destination.
Command: hdfs dfs -cp <src> <dest>
Big Data Analytics: Hadoop 108
HDFS Commands: Hadoop Shell
Commands to Manage HDFS
• Following are some frequently used HDFS
commands
mv HDFS command to move files from source to destination
Command: hdfs dfs -mv <src> <dest>
expunge HDFS command that makes the trash empty
Command: hdfs dfs -expunge
rmdir HDFS command to remove a directory
Command: hdfs dfs -rmdir <path>

Big Data Analytics: Hadoop 109


HDFS Commands: Hadoop Shell
Commands to Manage HDFS
• Following are some frequently used HDFS
commands
usage HDFS command that returns the help for an individual command.
Command: hdfs dfs -usage <command>
help HDFS Command that displays help for a given command, or for all commands if none is specified.
Command: hdfs dfs -help

Big Data Analytics: Hadoop 110




References
• https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
• https://www.edureka.co/blog/apache-hadoop-hdfs-architecture/
• https://data-flair.training/blogs/rack-awareness-hadoop-hdfs/
• https://hadoop.apache.org/docs/r1.2.1/hdfs_user_guide.html
• https://www.edureka.co/blog/hdfs-commands-hadoop-shell-command

Big Data Analytics: Hadoop 114
