
Big Data Analytics:

Hadoop
Dr. Shivangi Shukla
Assistant Professor
Computer Science and Engineering
IIIT Pune
Contents
• Introduction
• History
• The Apache Hadoop Project
• HDFS



Introduction to Hadoop
• Hadoop:
• open-source software framework
• provides reliable shared storage and analysis
• designed to handle big data in distributed
computing
• the framework is written primarily in Java, with
some native code in C and shell scripts



History
• The initial version of Hadoop was created in 2004
by Doug Cutting (and named after his son’s stuffed
elephant).
• Hadoop became a top-level Apache Software
Foundation project in January 2008.
• Hadoop came to be used by many companies, such as
Facebook and the New York Times.
• There have been many contributors, both
academic and commercial,
• with Yahoo! being the largest such contributor
The Apache Hadoop Project
• Hadoop
• collection of related subprojects that fall under
the umbrella of infrastructure for distributed
computing
• These projects are hosted by the Apache
Software Foundation, which provides support for a
community of open-source software projects
• Hadoop is best known for
• MapReduce
• HDFS (Hadoop Distributed File System)
The Apache Hadoop Project

Fig 1: Hadoop Subprojects



The Apache Hadoop Project
• Core
• A set of components and interfaces for distributed
filesystems and general I/O (serialization, Java RPC,
persistent data structures)
• Avro
• A data serialization system for efficient, cross-language
RPC, and persistent data storage.
• HDFS (Hadoop Distributed File System)
• A distributed filesystem that runs on large clusters of
commodity machines.
• Allows storage of large amounts of data across multiple
machines



The Apache Hadoop Project
• MapReduce
• A distributed data processing model and execution
environment that runs on large clusters of commodity
machines
• Pig
• A data flow language and execution environment for
exploring very large datasets. Pig runs on HDFS and
MapReduce clusters.
• HBase
• A distributed, column-oriented database that uses HDFS
for its underlying storage.
• It supports both batch-style computations using
MapReduce and point queries (random reads)
The Apache Hadoop Project
• ZooKeeper
• A distributed, highly available coordination service that
provides primitives such as distributed locks that can be
used for building distributed applications.
• Hive
• A distributed data warehouse that manages data stored
in HDFS and provides a query language based on SQL for
querying the data.
• Chukwa
• A distributed data collection and analysis system that
runs collectors that store data in HDFS, and it uses
MapReduce to produce reports.
Two Important Components of
Hadoop
• File System
• provided by HDFS (Hadoop Distributed File
System)
• allows storage of large amounts of data across
multiple machines
• Programming Paradigm
• provided by MapReduce
• used to access extensive data stored in HDFS



Hadoop Distributed File System
(HDFS)

HDFS (Hadoop Distributed File
System)
• Distributed Filesystems
• Filesystems that manage the storage across a
network of machines are called distributed
filesystems.
• Distributed filesystems are more complex than
regular disk filesystems owing to complications of
network programming.
• One of the biggest challenges is to tolerate node
failure without suffering data loss.



HDFS Features
• Very Large Files
• Files that are hundreds of megabytes, gigabytes, or terabytes
in size.
• There are Hadoop clusters running today that store
petabytes of data.
• Streaming Data Access
• HDFS is built around the idea that the most efficient data
processing pattern is a write-once, read-many-times pattern.
• A dataset is typically generated or copied from source, then
various analyses are performed on that dataset over time.
Each analysis involves a large proportion, if not all, of the
dataset,
• hence, the time to read the whole dataset is more
important than the latency in reading the first record.
HDFS Features..

• Commodity Hardware
• Hadoop is designed to run on clusters of commodity
hardware.
• Such commodity hardware has a high chance of node
failure across the cluster, especially in the case of large
clusters
• HDFS is designed to carry on working without a
noticeable interruption to the user in the face of such
failure



Limitations of HDFS
• High-latency Data Access
• HDFS has high latency for data access because it is optimized
for batch processing rather than real-time access,
thus focusing on throughput rather than latency
• Inefficient in handling lots of small files
• HDFS is not well-suited for handling large numbers of
small files because
• the metadata overhead increases, since every file,
directory, and block is tracked in the NameNode's memory
• HDFS has large block sizes (typically 128 MB
or 256 MB). A small file does not waste a full block of
disk space, but each small file still occupies its own
block entry in the NameNode's metadata, so storing
many small files is inefficient.
Limitations of HDFS..
• No support for multiple writers
• HDFS does not support multiple writers
• Files in HDFS may be written to by a single writer.
Writes are always made at the end of the file.



HDFS Concepts
• Blocks
• A disk has a block size, which defines the minimum amount of
data that it can read or write
• Filesystem blocks are a few kilobytes in size, while disk
blocks are normally 512 bytes
• HDFS has blocks of a large size: 64 MB by default
• HDFS blocks are large compared to disk blocks, and the
reason is to minimize the seek time
• By making a block large enough, transfer time (time to
transfer the data from the disk) can be made to be
significantly larger than seek time (the time to seek to the
start of the block)
HDFS Concepts..
• Ques: Assume Seek Time = 10ms and is 1% of
Transfer Time. If transfer rate is 100 MB/sec then
what should be block size?



HDFS Concepts..
Soln:
Seek Time = 10 ms

Seek Time = 1% of Transfer Time
⇒ Transfer Time = 100 × Seek Time = 100 × 10 ms = 1000 ms

Transfer Rate = 100 MB/sec = 0.1 MB/ms

Block Size = Transfer Rate × Transfer Time
           = 0.1 MB/ms × 1000 ms = 100 MB



HDFS Concepts..

• The default block size in HDFS was 64 MB, although many
HDFS installations use 128 MB blocks (the default from
Hadoop 2 onwards)
• This size will continue to grow as transfer speeds increase
with new generations of disk drives.
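Block size is a per-file property chosen when the file is created. A minimal, hedged sketch using Hadoop's public FileSystem API (the path, block size, and replication value below are illustrative assumptions, not part of the original slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                 // handle to the configured filesystem (HDFS)

        Path file = new Path("/user/demo/large-input.dat");   // hypothetical path
        long blockSize = 256L * 1024 * 1024;                   // request 256 MB blocks for this file
        short replication = 3;                                 // replicate each block three times
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);

        // The block size requested here applies only to this file.
        try (FSDataOutputStream out =
                 fs.create(file, true, bufferSize, replication, blockSize)) {
            out.writeUTF("example payload");
        }
    }
}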



HDFS Concepts..
• Block abstraction offers several benefits:
• In case a file is larger than any single disk in the network,
HDFS takes advantage of any of the disks in the cluster
• Block abstraction safeguards against corrupted blocks and
disk failure and machine failure as each block is replicated
to a small number of physically separate machines
(typically three).
• If a block becomes unavailable, a copy can be read
from another location in a way that is transparent to
the client.
• A block that is no longer available due to corruption
or machine failure can be re-replicated from its
alternative locations to other live machines to bring
the replication factor back to the normal level.
HDFS Architecture

Fig 1: HDFS Architecture



HDFS Architecture
• HDFS
• block-structured filesystem
• each file is divided into blocks of a pre-determined size
• these blocks are stored across a cluster of one or
several machines
• follows a Master/Slave Architecture, where a cluster
comprises
• a single NameNode (Master node)
• other nodes acting as DataNodes (Slave nodes)
• HDFS can be deployed on a broad spectrum of machines
that support Java
• Though one can run several DataNodes on a single
machine, in the practical world these DataNodes are
spread across various machines.
HDFS Architecture..

Fig 2: HDFS Architecture with NameNode and DataNode

HDFS Architecture..
• NameNode (also called the Primary NameNode)
• Manages
• filesystem namespace
• Maintains
• filesystem tree and
• metadata for all the files and directories in the tree
• this information is stored persistently on the local disk in the
form of two files:
• FsImage (filesystem image or namespace image)
• EditLogs
• NameNode has knowledge about
• DataNodes on which all the blocks for a given file are
located
HDFS Architecture..
• Client
• responsible for loading data and fetching results
• A client accesses the filesystem on behalf of the
user by communicating with the NameNode and
datanodes
• DataNode
• stores and retrieves blocks
• as instructed by NameNode or clients
• considered as workhorses of the filesystem
• DataNodes report back to NameNode
periodically
• with lists of blocks that they are storing
HDFS Architecture..
• It should be noted that the filesystem cannot be
used without NameNode
• if the machine running the NameNode collapses
• all the files on the filesystem would be lost since
there would be no way of knowing how to
reconstruct the files from the blocks on the
DataNodes
• hence, it is important to ensure that the
NameNode is resilient to failure

Functions of NameNode

• master node that maintains and manages the DataNodes (slave nodes)
• executes file system namespace operations like
opening, closing, and renaming files and directories
• determines the mapping of blocks to DataNodes
(internally, a file is split into one or more blocks and
these blocks are stored in a set of DataNodes)



Functions of NameNode..
• records the metadata of all the files stored in the
cluster
• location of blocks stored
• size of the files, permissions, hierarchy, etc.
• there are two files associated with the metadata:
• FsImage: contains the complete state of the
file system namespace since the start of the
Namenode
• EditLogs: contains all the recent modifications
made to the file system with respect to the
most recent FsImage.
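To make the kind of metadata the NameNode keeps more concrete, here is a small, hedged sketch that asks HDFS for a file's size, permissions, replication, block size, and block-to-DataNode mapping through the public FileSystem API (the path is a hypothetical example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/large-input.dat");    // hypothetical path

        // File-level metadata served by the NameNode: size, permissions, replication, block size
        FileStatus status = fs.getFileStatus(file);
        System.out.printf("size=%d bytes, perms=%s, replication=%d, blockSize=%d%n",
                status.getLen(), status.getPermission(),
                status.getReplication(), status.getBlockSize());

        // Block-to-DataNode mapping: which hosts hold each block of the file
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    loc.getOffset(), loc.getLength(), String.join(",", loc.getHosts()));
        }
    }
}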
Functions of NameNode..
• records each change that takes place to the file system
metadata.
• For example, if a file is deleted in HDFS, the
NameNode will immediately record this in the EditLog
• regularly receives block reports from all the DataNodes in
the cluster to ensure that the DataNodes are live
• keeps a record of all the blocks in HDFS and in which
nodes these blocks are located.
• NameNode is also responsible to take care of
the replication of all the blocks
• In case of a DataNode failure, the NameNode chooses
new DataNodes for new replicas, balances disk usage, and
manages communication traffic to the DataNodes
Functions of DataNodes
DataNodes run on commodity hardware, that is, inexpensive
machines that are not of high quality or high availability,
whereas the NameNode is a highly available server
that manages the filesystem namespace and controls access
to files by clients.
The DataNode functions are as follows:
• DataNodes are the slave nodes or processes that run on each
slave machine
• actual data is stored on DataNodes
• DataNodes also perform block creation, deletion, and
replication upon instruction from the NameNode



Functions of DataNodes..

The DataNode functions are as follows:


• DataNodes perform the low-level read and write
requests from the file system’s clients
• DataNodes send heartbeats to the NameNode
periodically to report the overall health of HDFS; by
default, this heartbeat interval is set to 3 seconds. Block
reports, listing the blocks each DataNode stores, are sent
at longer, configurable intervals.



What if NameNode fails??
• Hadoop has two mechanisms to provide resilience
against failure of NameNode:
1) Keep a backup of the files that make up the persistent
state of the filesystem metadata.
➢Hadoop can be configured so that the
namenode writes its persistent state to multiple
filesystems.
➢These writes are synchronous and atomic.
❖The usual configuration choice is to write to local
disk as well as a remote NFS (Network File
System) mount.
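As an illustration of what writing the persistent state to multiple filesystems looks like in configuration terms, a hedged sketch of the relevant property (dfs.namenode.name.dir); in a real cluster it is set in hdfs-site.xml on the NameNode host rather than in client code, and the two directories below, one on local disk and one on an NFS mount, are hypothetical:

import org.apache.hadoop.conf.Configuration;

public class NameNodeDirsExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Comma-separated list of directories; the NameNode writes its FsImage and
        // EditLogs synchronously and atomically to every directory in the list.
        // (hypothetical paths: one local disk, one remote NFS mount)
        conf.set("dfs.namenode.name.dir",
                 "file:///data/hdfs/namenode,file:///mnt/nfs/hdfs/namenode");

        System.out.println(conf.get("dfs.namenode.name.dir"));
    }
}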

What if NameNode fails?? (contd)
2) Run a Secondary NameNode that keeps a copy of the merged
namespace image, which can be used in the event of the
namenode failing
➢Secondary NameNode does not act as namenode
➢Its main role is to periodically merge the namespace image
with the edit log to limit the edit log from becoming too large
➢It usually runs on a separate physical machine, since it
requires plenty of CPU and as much memory as the
namenode to perform the merge.
➢However, the state of the Secondary NameNode lags that of
the Primary, so in the event of a total failure of the Primary,
data loss is almost guaranteed.
❖The usual course of action in this case is to copy the
NameNode's metadata files that are on NFS to the
Secondary and run it as the new Primary.
Secondary NameNode
• Secondary NameNode works concurrently with the
Primary NameNode as a helper node
• It should be noted that Secondary NameNode is not
backup of NameNode
• Secondary NameNode is responsible for performing
regular checkpoints in HDFS
• Thus, it is also called the Checkpoint Node



Secondary NameNode..
• Functions of Secondary NameNode are as follows:
➢Secondary NameNode constantly reads all the file
systems and metadata from the RAM of the NameNode
and writes it into the hard disk or the file system
➢It is responsible for combining
the EditLogs with FsImage from the NameNode
➢It downloads the EditLogs from the NameNode at
regular intervals and applies to FsImage. The new
FsImage is copied back to the NameNode, which is used
whenever the NameNode is started the next time.



Secondary NameNode..

Fig 3: NameNode and Secondary NameNode



Important Points
• The Secondary NameNode lags the latest state of the
Primary NameNode because merging of FsImage
and EditLogs happens only at certain intervals of
time (by default, every hour)
• So, if the NameNode fails completely, there will
definitely be a loss of at least a small amount of
data.
• To overcome this, Hadoop implemented the Backup
Node and Checkpoint Node



Secondary NameNode - Checkpoint
• The checkpoint process on Secondary NameNode is
controlled by two configuration parameters.
1) dfs.namenode.checkpoint.period
➢1 hour by default,
➢the maximum delay between two consecutive
checkpoints
2) dfs.namenode.checkpoint.txns
➢1 million by default
➢the number of uncheckpointed transactions on the
NameNode that will force an urgent checkpoint,
even if the checkpoint period has not been reached.
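A hedged sketch of how these two properties can be overridden; in practice they are set in hdfs-site.xml and read by the Secondary NameNode at startup, and the values below are illustrative only:

import org.apache.hadoop.conf.Configuration;

public class CheckpointConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Checkpoint at most every 30 minutes instead of the default 1 hour (value in seconds)
        conf.setLong("dfs.namenode.checkpoint.period", 30 * 60);
        // Force an urgent checkpoint after 500,000 uncheckpointed transactions
        // instead of the default 1,000,000
        conf.setLong("dfs.namenode.checkpoint.txns", 500_000);

        System.out.println(conf.get("dfs.namenode.checkpoint.period"));
        System.out.println(conf.get("dfs.namenode.checkpoint.txns"));
    }
}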



Secondary NameNode – Checkpoint..
• It should be noted that the Secondary NameNode
stores latest checkpoint in a directory which is
structured in the same way as Primary NameNode’s
directory.
• Thus, the checkpoint image is always ready to be
used by the Primary NameNode if necessary



Primary and Secondary NameNode
Working
• NameNode stores modifications to the file system
as a log appended to a native file system
file, EditLog
• When a NameNode starts up, it reads HDFS state
from an image file, FsImage, and then applies edits
from the EditLog file
• It then writes new HDFS state to FsImage and
starts normal operation with an empty EditLog



Primary and Secondary NameNode
Working..
• Since NameNode merges FsImage and EditLog files only
during start up,
• the EditLog file could get very large over time on a
busy cluster
• Also, large EditLog file increases the time required
for next restart of NameNode
• The Secondary NameNode merges the FsImage and the
EditLog files periodically and keeps the size of EditLog
file within a limit
• Usually, the Secondary NameNode runs on a different machine
than the Primary NameNode, since its memory
requirements are on the same order as those of the Primary
NameNode
NameNode and DataNode
• NameNode and DataNodes are pieces of software
designed to run on commodity machines.
• These machines typically run a GNU/Linux operating
system (OS).
• HDFS is built using the Java language; any machine
that supports Java can run the NameNode or the
DataNode software.
• Usage of the highly portable Java language means
that HDFS can be deployed on a wide range of
machines.



NameNode and DataNode..
• The typical deployment has a dedicated machine
that runs only the NameNode software. Each of the
other machines in the cluster runs one instance of
the DataNode software.
• The existence of a single NameNode in a cluster
greatly simplifies the architecture of the system.
The NameNode is the arbitrator and repository for
all HDFS metadata.



The File System Namespace
• HDFS supports a traditional hierarchical file
organization.
▪ A user or an application can create directories
and store files inside these directories
▪ The file system namespace hierarchy is similar
to most other existing file systems;
▪ one can create and remove files, move a file
from one directory to another, or rename a file



The File System Namespace
• The NameNode maintains the file system
namespace
▪ Any change to the file system namespace or its
properties is recorded by NameNode.
▪ An application can specify the number of
replicas of a file that should be maintained by
HDFS.
➢The number of copies of a file is called the
replication factor of that file. This
information is stored by the NameNode.



Data Replication
• HDFS is designed to offer reliable storage for very large
files across machines in a large cluster
• Files in HDFS are write-once and have strictly one writer
at any time
• HDFS stores each file as a sequence of blocks
• all blocks in a file except the last block are the same
size
• To provide fault tolerance
• blocks of a file are replicated
• By default, replication factor is three (which is
configurable)
Data Replication..
• Block size and Replication factor are configurable per file
• The application can specify the number of replicas of a file
• and this replication factor can be specified at the time of
file creation and can be changed later
• NameNode makes all decisions regarding replication of
blocks
• NameNode periodically receives a heartbeat and a block
report from each of the DataNodes in the cluster
• receipt of heartbeat implies that the DataNode is
functioning properly
• the block report provides the list of all blocks on the
respective DataNode
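As noted above, the replication factor is a per-file setting that can be changed after the file has been created. A minimal, hedged sketch using the FileSystem API (the path and the new value are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/large-input.dat");    // hypothetical existing file

        // Raise the replication factor of this one file from the default (3) to 5.
        // The NameNode then sees the file as under-replicated and schedules
        // DataNodes to create the extra replicas.
        boolean accepted = fs.setReplication(file, (short) 5);
        System.out.println("replication change accepted: " + accepted);
    }
}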
Data Replication..

• the block report helps in maintaining the replication factor in HDFS
• if the blocks are over-replicated
• then NameNode deletes replica of blocks
• if the blocks are under-replicated
• then NameNode adds replica of blocks



Data Replication..

Fig 4: Block Replication (with the assumption that the replication factor is three)



Rack Awareness
• HDFS stores files across multiple nodes (DataNodes) in
a cluster
• To get the maximum performance from Hadoop
• and to improve network traffic during file
read/write,
• NameNode chooses DataNodes on the same
rack or nearby racks for data read/write
• It follows an in-built Rack Awareness Algorithm to
choose the closest DataNode based on rack
information
• NameNode ensures that all the replicas are not stored
on the same rack or a single rack
Rack Awareness: Rack
• The Rack is the collection of around 40-50
DataNodes connected using the same network
switch
• If the rack's network switch goes down, the whole rack will be
unavailable
• A large Hadoop cluster is deployed in multiple racks



Data Replication (without Rack
Awareness)
• Simple but non-optimal policy
• place replicas on unique racks (as shown in Fig 4)
• this prevents losing data when an entire rack fails
• and evenly distributes replicas in the cluster
which makes it easy to balance load on
component failure
• However, this policy increases the cost of writes
because a write needs to transfer blocks to multiple
racks



Rack Awareness in HDFS
• An HDFS cluster comprises multiple racks
• and each rack consists of multiple DataNodes
• HDFS maintains rack IDs of each DataNode as rack
information
• each DataNode is associated with a specific rack,
and this association is identified by a rack ID
• rack ID is used to group DataNodes within the same
physical rack
• Communication between the DataNodes on the same
rack is more efficient as compared to the
communication between DataNodes residing on
different racks
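The rack ID of each DataNode is normally supplied by an administrator-provided topology mapping. A hedged sketch of the relevant property; the script path is hypothetical, the script itself maps a DataNode's host or IP to a rack ID such as /rack1, and in practice the property is set in core-site.xml rather than in client code:

import org.apache.hadoop.conf.Configuration;

public class RackTopologyExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Admin-provided executable that maps a DataNode host/IP to a rack ID
        // (e.g. /rack1). The NameNode uses this mapping for rack awareness.
        conf.set("net.topology.script.file.name", "/etc/hadoop/conf/rack-topology.sh");

        System.out.println(conf.get("net.topology.script.file.name"));
    }
}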
Rack Awareness in HDFS..
• To reduce the network traffic during file read/write,
• NameNode chooses the closest DataNode for
serving the client read/write request
• NameNode utilizes rack information to gain
knowledge about rack ids of each DataNode
• The concept of choosing the closest DataNode
based on the rack information is known as Rack
Awareness



Need of Rack Awareness
The reasons for the Rack Awareness in HDFS are:
i. To reduce the network traffic while file read/write,
which improves the cluster performance
• The communication between DataNodes residing
on different racks is directed via switch. In general,
we have greater network bandwidth between
DataNodes in the same rack than the DataNodes
residing in different rack.
• So, Rack Awareness helps to reduce write
traffic between different racks, thus
providing better write performance. Similarly,
read performance also increases.
Need of Rack Awareness..
The reasons for the Rack Awareness in HDFS are:
ii. To achieve fault tolerance, even when the rack goes
down
iii. To achieve high availability of data so that data is
available even in unfavorable conditions
• for instance, an entire rack fails because of the switch
failure or power failure
iv. To reduce the latency, that is, to make the file
read/write operations done with lower delay
NameNode uses a rack awareness algorithm while
placing the replicas in HDFS
Replica Placement via Rack Awareness in HDFS
• HDFS stores replicas of data blocks of a file to provide
fault tolerance and high availability.
• Communication between the nodes (DataNodes) on
the same rack is more efficient as compared to the
communication between nodes (DataNodes) residing
on different racks
• the network bandwidth between nodes on the
same rack is higher than the network bandwidth
between nodes on a different rack



Replica Placement via Rack Awareness in HDFS..
• If we store replicas on different nodes on the same
rack, then it improves the network bandwidth,
• but if the rack fails (rarely happens), then there
will be no copy of data on another rack
• If replicas are on unique racks,
• then the transfer of blocks to multiple
racks during writes increases the cost of writes.



Replica Placement via Rack Awareness in HDFS
• The NameNode on a multi-rack cluster maintains block
replication by using in-built Rack Awareness
policies, which say:
➢Not more than one replica be placed on one
DataNode.
➢Not more than two replicas are placed on the same
rack.
➢The number of racks used for block replication
should always be smaller than the number of
replicas.
• The chance of rack failure is far less than that of node
failure; this algorithm does not impact data reliability
and availability guarantees.
Replica Placement via Rack Awareness in HDFS..
• Usually the replication factor is three,
▪ the block replication policy puts the first replica on the local
rack
▪ the second replica on a different DataNode on the same
rack
▪ the third replica on a different rack
• To improve write performance and network traffic without
compromising fault tolerance, while re-replicating a block,
▪ if there is one existing replica,
o place the second replica on a different rack
▪ if the existing replicas are two and are on the same rack,
o then place the third replica on a different rack
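The placement policy above can be summarized in a small, deliberately simplified sketch; it is illustrative only, as the real NameNode placement logic also weighs node load, free disk space, and other factors:

import java.util.ArrayList;
import java.util.List;

public class PlacementSketch {
    // For a replication factor of three: two replicas on the writer's (local) rack,
    // one replica on a different rack, so the number of racks used (two) stays
    // smaller than the number of replicas (three).
    static List<String> chooseRacks(String localRack, List<String> otherRacks) {
        List<String> targets = new ArrayList<>();
        targets.add(localRack);   // replica 1: a DataNode on the local rack
        targets.add(localRack);   // replica 2: a different DataNode on the same rack
        // replica 3: a different rack (fall back to the local rack if the cluster has one rack)
        targets.add(otherRacks.isEmpty() ? localRack : otherRacks.get(0));
        return targets;
    }

    public static void main(String[] args) {
        System.out.println(chooseRacks("/rack1", List.of("/rack2", "/rack3")));
        // prints [/rack1, /rack1, /rack2]
    }
}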
Replica Placement via Rack Awareness in HDFS..

Fig 5: Replica Placement via Rack Awareness in HDFS


Data Replication: Replica Selection
• To minimize global bandwidth consumption and
read latency,
• HDFS tries to satisfy a read request from a
replica that is closest to the reader
• If there exists a replica on the same rack as
the reader node, then that replica is
preferred to satisfy the read request
• If HDFS cluster spans multiple data centers,
• then a replica that is resident in the local data
center is preferred over any remote replica.
Data Replication..
• Checked in
• When a file is stored in HDFS, it is divided into
blocks, and each block is replicated across multiple
DataNodes according to the replication factor (by
default, 3 replicas).
• NameNode keeps track of where each block's
replicas are stored across the DataNodes
• After the NameNode initiates the replication of a
block to several DataNodes, it waits for each
DataNode to report back (via block reports) that it
has successfully received and stored the block. This
reporting back is what is referred to as "checked in."
Data Replication..
• The checked in mechanism ensures the following:
➢Data Integrity: The "checked in" mechanism
ensures that the NameNode is fully aware of the
block's location across the cluster, which is crucial
for maintaining data integrity
➢Fault Tolerance: Only after all required replicas
have checked in can the system conclude that the
block is safely replicated. If a DataNode fails prior to
checking in, the NameNode can instruct other
DataNodes to create additional replicas, ensuring
the data is not lost.



Data Replication..
• A block is considered “safely replicated”
• when the minimum number of required replicas (as
per the replication factor) has successfully been
stored
• and the corresponding DataNodes have
"checked in" by sending a block report
• that confirms the storage of those replicas to
NameNode thereby, ensuring data
redundancy and reliability



Data Replication: Safemode
• On startup, the NameNode enters a special state called
Safemode
• During Safe Mode, the NameNode does not allow any
modifications to the filesystem, meaning no new files can be
created, and no existing files can be modified or deleted.
• NameNode receives heartbeat and block report messages
from the DataNodes
• Once the NameNode determines that the configurable
percentage of blocks are safely replicated (information
provided through block report that DataNodes have checked
in),
• the NameNode waits for an additional 30-second delay
before exiting Safemode
Data Replication: Safemode..
• The additional delay of 30 seconds before exiting
Safemode
• serves as a buffer period to receive any remaining
block reports and gives the system time to stabilize
before resuming normal operations
• After exiting Safemode
• NameNode allows clients to start writing data to
HDFS, modifying files, and performing other normal
operations



DataNode Failure in HDFS
• The primary objective of HDFS is to store data reliably even in the
presence of failures.
• Detection of DataNode failure,
• Each DataNode sends a heartbeat message to the NameNode
periodically
• NameNode detects DataNode failure by the absence of a
heartbeat message. NameNode marks DataNodes without
recent heartbeats as dead and does not forward any new input
or output requests to them.
• Any data that was registered to a dead DataNode is not
available to HDFS any more. Thus, a DataNode failure may
cause the replication factor of some blocks to fall below their
specified value.
• NameNode constantly tracks which blocks need to be
replicated and initiates replication whenever necessary.
DataNode Failure in HDFS..
• The necessity for re-replication may arise due to many
reasons:
❖ a DataNode may become unavailable, or
❖ a replica may become corrupted,
❖ hard disk on a DataNode may fail, or
❖ the replication factor of a file may be increased



Cluster Rebalancing
• HDFS architecture is compatible with data rebalancing
schemes.
• A scheme might automatically move data from one
DataNode to another if the free space on a
DataNode falls below a certain threshold.
• In the event of a sudden high demand for a
particular file, a scheme might dynamically create
additional replicas and rebalance other data in the
cluster.
• However, these types of data rebalancing schemes
are not yet implemented.



References
• https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
• https://www.edureka.co/blog/apache-hadoop-hdfs-architecture/
• https://data-flair.training/blogs/rack-awareness-hadoop-hdfs/
• https://hadoop.apache.org/docs/r1.2.1/hdfs_user_guide.html
