
Big Data Analytics:

Hadoop
Dr. Shivangi Shukla
Assistant Professor
Computer Science and Engineering
IIIT Pune
Contents
• Introduction
• History
• The Apache Hadoop Project
• HDFS



Introduction to Hadoop
• Hadoop:
• open-source software framework
• provides reliable shared storage and analysis
• designed to handle big data in distributed
computing
• the framework is written primarily in Java, with
some native code in C and shell scripts



History
• The initial version of Hadoop was created in 2004
by Doug Cutting (and named after his son’s stuffed
elephant).
• Hadoop became a top-level Apache Software
Foundation project in January 2008.
• Hadoop came to be used by many companies, such as
Facebook and the New York Times.
• There have been many contributors, both
academic and commercial,
• with Yahoo! being the largest such contributor
The Apache Hadoop Project
• Hadoop
• collection of related subprojects that fall under
the umbrella of infrastructure for distributed
computing
• These projects are hosted by the Apache
Software Foundation, which provides support for a
community of open-source software projects
• Hadoop is best known for
• MapReduce
• HDFS (Hadoop Distributed File System)
The Apache Hadoop Project

Fig 1: Hadoop Subprojects



The Apache Hadoop Project
• Core
• A set of components and interfaces for distributed
filesystems and general I/O (serialization, Java RPC,
persistent data structures)
• Avro
• A data serialization system for efficient, cross-language
RPC, and persistent data storage.
• HDFS (Hadoop Distributed File System)
• A distributed filesystem that runs on large clusters of
commodity machines.
• Allows storage of large amounts of data across multiple
machines



The Apache Hadoop Project
• MapReduce
• A distributed data processing model and execution
environment that runs on large clusters of commodity
machines
• Pig
• A data flow language and execution environment for
exploring very large datasets. Pig runs on HDFS and
MapReduce clusters.
• HBase
• A distributed, column-oriented database that uses HDFS
for its underlying storage.
• It supports both batch-style computations using
MapReduce and point queries (random reads)
The Apache Hadoop Project
• ZooKeeper
• A distributed, highly available coordination service that
provides primitives such as distributed locks that can be
used for building distributed applications.
• Hive
• A distributed data warehouse that manages data stored
in HDFS and provides a query language based on SQL for
querying the data.
• Chukwa
• A distributed data collection and analysis system that
runs collectors that store data in HDFS, and it uses
MapReduce to produce reports.
Two Important Components of
Hadoop
• File System
• provided by HDFS (Hadoop Distributed File
System)
• allows storage of large amounts of data across
multiple machines
• Programming Paradigm
• provided by MapReduce
• used to access extensive data stored in HDFS



Hadoop Distributed File System
(HDFS)

HDFS (Hadoop Distributed File
System)
• Distributed Filesystems
• Filesystems that manage the storage across a
network of machines are called distributed
filesystems.
• Distributed filesystems are more complex than
regular disk filesystems owing to complications of
network programming.
• One of the biggest challenges is to tolerate node
failure without suffering data loss.



HDFS Features
• Very Large Files
• Files that are hundreds of megabytes, gigabytes, or terabytes
in size.
• There are Hadoop clusters running today that store
petabytes of data.
• Streaming Data Access
• HDFS is built around the idea that the most efficient data
processing pattern is a write-once, read-many-times pattern.
• A dataset is typically generated or copied from source, then
various analyses are performed on that dataset over time.
Each analysis involves a large proportion, if not all, of the
dataset,
• hence, the time to read the whole dataset is more
important than the latency in reading the first record.
HDFS Features..

• Commodity Hardware
• Hadoop is designed to run on clusters of commodity
hardware.
• Such commodity hardware has a high chance of node
failure across the cluster, especially in the case of large
clusters
• HDFS is designed to carry on working without a
noticeable interruption to the user in the face of such
failure



Limitations of HDFS
• High-latency Data Access
• HDFS has high latency for data access because it is optimized
for batch processing rather than real-time access,
thus focusing on throughput rather than latency
• Inefficient in handling lots of small files
• HDFS is not well-suited for handling large numbers of
small files because
• the metadata overhead increases, since every file,
directory, and block is tracked in the NameNode's memory
• HDFS has large block sizes (typically 128 MB
or 256 MB). A small file does not waste a full block of
disk space, but each small file still occupies its own
block entry in the NameNode's metadata, so storing
many small files is inefficient.
Limitations of HDFS..
• No support for multiple writers
• HDFS does not support multiple writers
• Files in HDFS may be written to by a single writer.
Writes are always made at the end of the file.



HDFS Concepts
• Blocks
• A disk has a block size, which defines the minimum amount of
data that it can read or write
• Filesystem blocks are a few kilobytes in size, while disk
blocks are normally 512 bytes
• HDFS has blocks of a large size: 64 MB by default
• HDFS blocks are large compared to disk blocks, and the
reason is to minimize the seek time
• By making a block large enough, transfer time (time to
transfer the data from the disk) can be made to be
significantly larger than seek time (the time to seek to the
start of the block)
HDFS Concepts..
• Ques: Assume Seek Time = 10ms and is 1% of
Transfer Time. If transfer rate is 100 MB/sec then
what should be block size?



HDFS Concepts..
Soln:
Seek Time = 10 ms

Seek Time = 1% of Transfer Time
⇒ Transfer Time = 100 × Seek Time = 100 × 10 ms = 1000 ms

Transfer Rate = 100 MB/sec = 0.1 MB/ms

Block Size = Transfer Rate × Transfer Time
           = 0.1 MB/ms × 1000 ms = 100 MB



HDFS Concepts..

• The default block size in HDFS was 64 MB, although many
HDFS installations use 128 MB blocks (the default from
Hadoop 2 onwards)
• This size will continue to grow as transfer speeds increase
with new generations of disk drives.
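Block size is a per-file property chosen when the file is created. A minimal, hedged sketch using Hadoop's public FileSystem API (the path, block size, and replication value below are illustrative assumptions, not part of the original slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                 // handle to the configured filesystem (HDFS)

        Path file = new Path("/user/demo/large-input.dat");   // hypothetical path
        long blockSize = 256L * 1024 * 1024;                   // request 256 MB blocks for this file
        short replication = 3;                                 // replicate each block three times
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);

        // The block size requested here applies only to this file.
        try (FSDataOutputStream out =
                 fs.create(file, true, bufferSize, replication, blockSize)) {
            out.writeUTF("example payload");
        }
    }
}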



HDFS Concepts..
• Block abstraction offers several benefits:
• In case a file is larger than any single disk in the network,
HDFS takes advantage of any of the disks in the cluster
• Block abstraction safeguards against corrupted blocks and
disk failure and machine failure as each block is replicated
to a small number of physically separate machines
(typically three).
• If a block becomes unavailable, a copy can be read
from another location in a way that is transparent to
the client.
• A block that is no longer available due to corruption
or machine failure can be re-replicated from its
alternative locations to other live machines to bring
the replication factor back to the normal level.
HDFS Architecture

Fig 1: HDFS Architecture



HDFS Architecture
• HDFS
• block-structured filesystem
• each file is divided into blocks of a pre-determined size
• these blocks are stored across a cluster of one or
several machines
• follows a Master/Slave Architecture, where a cluster
comprises
• a single NameNode (Master node)
• other nodes acting as DataNodes (Slave nodes)
• HDFS can be deployed on a broad spectrum of machines
that support Java
• Though one can run several DataNodes on a single
machine, in the practical world these DataNodes are
spread across various machines.
HDFS Architecture..

Fig 2: HDFS Architecture with NameNode and DataNode

HDFS Architecture..
• NameNode (also called the Primary NameNode)
• Manages
• filesystem namespace
• Maintains
• filesystem tree and
• metadata for all the files and directories in the tree
• this information is stored persistently on the local disk in the
form of two files:
• FsImage (filesystem image or namespace image)
• EditLogs
• NameNode has knowledge about
• DataNodes on which all the blocks for a given file are
located
HDFS Architecture..
• Client
• responsible for loading data and fetching results
• A client accesses the filesystem on behalf of the
user by communicating with the NameNode and
datanodes
• DataNode
• stores and retrieves blocks
• as instructed by NameNode or clients
• considered as workhorses of the filesystem
• DataNodes report back to NameNode
periodically
• with lists of blocks that they are storing
HDFS Architecture..
• It should be noted that the filesystem cannot be
used without NameNode
• if the machine running the NameNode collapses
• all the files on the filesystem would be lost since
there would be no way of knowing how to
reconstruct the files from the blocks on the
DataNodes
• hence, it is important to ensure that the
NameNode is resilient to failure

Functions of NameNode

• master node that maintains and manages the DataNodes (slave nodes)
• executes file system namespace operations like
opening, closing, and renaming files and directories
• determines the mapping of blocks to DataNodes
(internally, a file is split into one or more blocks and
these blocks are stored in a set of DataNodes)



Functions of NameNode..
• records the metadata of all the files stored in the
cluster
• location of blocks stored
• size of the files, permissions, hierarchy, etc.
• there are two files associated with the metadata:
• FsImage: contains the complete state of the
file system namespace since the start of the
Namenode
• EditLogs: contains all the recent modifications
made to the file system with respect to the
most recent FsImage.
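To make the kind of metadata the NameNode keeps more concrete, here is a small, hedged sketch that asks HDFS for a file's size, permissions, replication, block size, and block-to-DataNode mapping through the public FileSystem API (the path is a hypothetical example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/large-input.dat");    // hypothetical path

        // File-level metadata served by the NameNode: size, permissions, replication, block size
        FileStatus status = fs.getFileStatus(file);
        System.out.printf("size=%d bytes, perms=%s, replication=%d, blockSize=%d%n",
                status.getLen(), status.getPermission(),
                status.getReplication(), status.getBlockSize());

        // Block-to-DataNode mapping: which hosts hold each block of the file
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    loc.getOffset(), loc.getLength(), String.join(",", loc.getHosts()));
        }
    }
}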
Functions of NameNode..
• records each change that takes place to the file system
metadata.
• For example, if a file is deleted in HDFS, the
NameNode will immediately record this in the EditLog
• regularly receives block reports from all the DataNodes in
the cluster to ensure that the DataNodes are live
• keeps a record of all the blocks in HDFS and in which
nodes these blocks are located.
• NameNode is also responsible to take care of
the replication of all the blocks
• In case of a DataNode failure, the NameNode chooses
new DataNodes for new replicas, balances disk usage, and
manages communication traffic to the DataNodes
Functions of DataNodes
DataNodes run on commodity hardware, that is, inexpensive
machines that are not of high quality or high availability,
whereas the NameNode is a highly available server
that manages the filesystem namespace and controls access
to files by clients.
The DataNode functions are as follows:
• DataNodes are the slave nodes or processes that run on each
slave machine
• actual data is stored on DataNodes
• DataNodes also perform block creation, deletion, and
replication upon instruction from the NameNode



Functions of DataNodes..

The DataNode functions are as follows:


• DataNodes perform the low-level read and write
requests from the file system’s clients
• DataNodes send heartbeats to the NameNode
periodically to report the overall health of HDFS; by
default, this heartbeat interval is set to 3 seconds. Block
reports, listing the blocks each DataNode stores, are sent
at longer, configurable intervals.



What if NameNode fails??
• Hadoop has two mechanisms to provide resilience
against failure of NameNode:
1) Keep a backup of the files that make up the persistent
state of the filesystem metadata.
➢Hadoop can be configured so that the
namenode writes its persistent state to multiple
filesystems.
➢These writes are synchronous and atomic.
❖The usual configuration choice is to write to local
disk as well as a remote NFS (Network File
System) mount.
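As an illustration of what writing the persistent state to multiple filesystems looks like in configuration terms, a hedged sketch of the relevant property (dfs.namenode.name.dir); in a real cluster it is set in hdfs-site.xml on the NameNode host rather than in client code, and the two directories below, one on local disk and one on an NFS mount, are hypothetical:

import org.apache.hadoop.conf.Configuration;

public class NameNodeDirsExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Comma-separated list of directories; the NameNode writes its FsImage and
        // EditLogs synchronously and atomically to every directory in the list.
        // (hypothetical paths: one local disk, one remote NFS mount)
        conf.set("dfs.namenode.name.dir",
                 "file:///data/hdfs/namenode,file:///mnt/nfs/hdfs/namenode");

        System.out.println(conf.get("dfs.namenode.name.dir"));
    }
}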

What if NameNode fails?? (contd)
2) Run a Secondary NameNode that keeps a copy of the merged
namespace image, which can be used in the event of the
namenode failing
➢Secondary NameNode does not act as namenode
➢Its main role is to periodically merge the namespace image
with the edit log to limit the edit log from becoming too large
➢It usually runs on a separate physical machine, since it
requires plenty of CPU and as much memory as the
namenode to perform the merge.
➢However, the state of the Secondary NameNode lags that of
the Primary, so in the event of a total failure of the Primary,
data loss is almost guaranteed.
❖The usual course of action in this case is to copy the
NameNode's metadata files that are on NFS to the
Secondary and run it as the new Primary.
Secondary NameNode
• Secondary NameNode works concurrently with the
Primary NameNode as a helper node
• It should be noted that Secondary NameNode is not
backup of NameNode
• Secondary NameNode is responsible for performing
regular checkpoints in HDFS
• Thus, it is also called the Checkpoint Node



Secondary NameNode..
• Functions of Secondary NameNode are as follows:
➢Secondary NameNode constantly reads all the file
systems and metadata from the RAM of the NameNode
and writes it into the hard disk or the file system
➢It is responsible for combining
the EditLogs with FsImage from the NameNode
➢It downloads the EditLogs from the NameNode at
regular intervals and applies to FsImage. The new
FsImage is copied back to the NameNode, which is used
whenever the NameNode is started the next time.



Secondary NameNode..

Fig 3: NameNode and Secondary NameNode



Important Points
• The Secondary NameNode lags the latest state of the
Primary NameNode because merging of FsImage
and EditLogs happens only at certain intervals of
time (by default, every hour)
• So, if the NameNode fails completely, there will
definitely be a loss of at least a small amount of
data.
• To overcome this, Hadoop implemented the Backup
Node and Checkpoint Node



Secondary NameNode - Checkpoint
• The checkpoint process on Secondary NameNode is
controlled by two configuration parameters.
1) dfs.namenode.checkpoint.period
➢1 hour by default,
➢the maximum delay between two consecutive
checkpoints
2) dfs.namenode.checkpoint.txns
➢1 million by default
➢the number of uncheckpointed transactions on the
NameNode that will force an urgent checkpoint,
even if the checkpoint period has not been reached.
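A hedged sketch of how these two properties can be overridden; in practice they are set in hdfs-site.xml and read by the Secondary NameNode at startup, and the values below are illustrative only:

import org.apache.hadoop.conf.Configuration;

public class CheckpointConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Checkpoint at most every 30 minutes instead of the default 1 hour (value in seconds)
        conf.setLong("dfs.namenode.checkpoint.period", 30 * 60);
        // Force an urgent checkpoint after 500,000 uncheckpointed transactions
        // instead of the default 1,000,000
        conf.setLong("dfs.namenode.checkpoint.txns", 500_000);

        System.out.println(conf.get("dfs.namenode.checkpoint.period"));
        System.out.println(conf.get("dfs.namenode.checkpoint.txns"));
    }
}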



Secondary NameNode – Checkpoint..
• It should be noted that the Secondary NameNode
stores latest checkpoint in a directory which is
structured in the same way as Primary NameNode’s
directory.
• Thus, the checkpoint image is always ready to be
used by the Primary NameNode if necessary



Primary and Secondary NameNode
Working
• NameNode stores modifications to the file system
as a log appended to a native file system
file, EditLog
• When a NameNode starts up, it reads HDFS state
from an image file, FsImage, and then applies edits
from the EditLog file
• It then writes new HDFS state to FsImage and
starts normal operation with an empty EditLog



Primary and Secondary NameNode
Working..
• Since NameNode merges FsImage and EditLog files only
during start up,
• the EditLog file could get very large over time on a
busy cluster
• Also, large EditLog file increases the time required
for next restart of NameNode
• The Secondary NameNode merges the FsImage and the
EditLog files periodically and keeps the size of EditLog
file within a limit
• Usually, the Secondary NameNode runs on a different machine
than the Primary NameNode, since its memory
requirements are on the same order as those of the Primary
NameNode
NameNode and DataNode
• NameNode and DataNodes are pieces of software
designed to run on commodity machines.
• These machines typically run a GNU/Linux operating
system (OS).
• HDFS is built using the Java language; any machine
that supports Java can run the NameNode or the
DataNode software.
• Usage of the highly portable Java language means
that HDFS can be deployed on a wide range of
machines.



NameNode and DataNode..
• The typical deployment has a dedicated machine
that runs only the NameNode software. Each of the
other machines in the cluster runs one instance of
the DataNode software.
• The existence of a single NameNode in a cluster
greatly simplifies the architecture of the system.
The NameNode is the arbitrator and repository for
all HDFS metadata.



The File System Namespace
• HDFS supports a traditional hierarchical file
organization.
▪ A user or an application can create directories
and store files inside these directories
▪ The file system namespace hierarchy is similar
to most other existing file systems;
▪ one can create and remove files, move a file
from one directory to another, or rename a file



The File System Namespace
• The NameNode maintains the file system
namespace
▪ Any change to the file system namespace or its
properties is recorded by NameNode.
▪ An application can specify the number of
replicas of a file that should be maintained by
HDFS.
➢The number of copies of a file is called the
replication factor of that file. This
information is stored by the NameNode.



Data Replication
• HDFS is designed to offer reliable storage for very large
files across machines in a large cluster
• Files in HDFS are write-once and have strictly one writer
at any time
• HDFS stores each file as a sequence of blocks
• all blocks in a file except the last block are the same
size
• To provide fault tolerance
• blocks of a file are replicated
• By default, replication factor is three (which is
configurable)
Data Replication..
• Block size and Replication factor are configurable per file
• The application can specify the number of replicas of a file
• and this replication factor can be specified at the time of
file creation and can be changed later
• NameNode makes all decisions regarding replication of
blocks
• NameNode periodically receives a heartbeat and a block
report from each of the DataNodes in the cluster
• receipt of heartbeat implies that the DataNode is
functioning properly
• the block report provides the list of all blocks on the
respective DataNode
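As noted above, the replication factor is a per-file setting that can be changed after the file has been created. A minimal, hedged sketch using the FileSystem API (the path and the new value are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/large-input.dat");    // hypothetical existing file

        // Raise the replication factor of this one file from the default (3) to 5.
        // The NameNode then sees the file as under-replicated and schedules
        // DataNodes to create the extra replicas.
        boolean accepted = fs.setReplication(file, (short) 5);
        System.out.println("replication change accepted: " + accepted);
    }
}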
Data Replication..

• the block report helps in maintaining the replication factor in HDFS
• if the blocks are over-replicated
• then NameNode deletes replica of blocks
• if the blocks are under-replicated
• then NameNode adds replica of blocks



Data Replication..

Fig 4: Block Replication (with the assumption that the replication factor is three)



Rack Awareness
• HDFS stores files across multiple nodes (DataNodes) in
a cluster
• To get the maximum performance from Hadoop
• and to improve network traffic during file
read/write,
• NameNode chooses DataNodes on the same
rack or nearby racks for data read/write
• It follows an in-built Rack Awareness Algorithm to
choose the closest DataNode based on rack
information
• NameNode ensures that all the replicas are not stored
on the same rack or a single rack
Rack Awareness: Rack
• The Rack is the collection of around 40-50
DataNodes connected using the same network
switch
• If the rack's network switch goes down, the whole rack will be
unavailable
• A large Hadoop cluster is deployed in multiple racks



Data Replication (without Rack
Awareness)
• Simple but non-optimal policy
• place replicas on unique racks (as shown in Fig 4)
• this prevents losing data when an entire rack fails
• and evenly distributes replicas in the cluster
which makes it easy to balance load on
component failure
• However, this policy increases the cost of writes
because a write needs to transfer blocks to multiple
racks



Rack Awareness in HDFS
• An HDFS cluster comprises multiple racks
• and each rack consists of multiple DataNodes
• HDFS maintains rack IDs of each DataNode as rack
information
• each DataNode is associated with a specific rack,
and this association is identified by a rack ID
• rack ID is used to group DataNodes within the same
physical rack
• Communication between the DataNodes on the same
rack is more efficient as compared to the
communication between DataNodes residing on
different racks
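The rack ID of each DataNode is normally supplied by an administrator-provided topology mapping. A hedged sketch of the relevant property; the script path is hypothetical, the script itself maps a DataNode's host or IP to a rack ID such as /rack1, and in practice the property is set in core-site.xml rather than in client code:

import org.apache.hadoop.conf.Configuration;

public class RackTopologyExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Admin-provided executable that maps a DataNode host/IP to a rack ID
        // (e.g. /rack1). The NameNode uses this mapping for rack awareness.
        conf.set("net.topology.script.file.name", "/etc/hadoop/conf/rack-topology.sh");

        System.out.println(conf.get("net.topology.script.file.name"));
    }
}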
Rack Awareness in HDFS..
• To reduce the network traffic during file read/write,
• NameNode chooses the closest DataNode for
serving the client read/write request
• NameNode utilizes rack information to gain
knowledge about rack ids of each DataNode
• The concept of choosing the closest DataNode
based on the rack information is known as Rack
Awareness



Need of Rack Awareness
The reasons for the Rack Awareness in HDFS are:
i. To reduce the network traffic while file read/write,
which improves the cluster performance
• The communication between DataNodes residing
on different racks is directed via switch. In general,
we have greater network bandwidth between
DataNodes in the same rack than the DataNodes
residing in different rack.
• So, Rack Awareness helps to reduce write
traffic between different racks, thus
providing better write performance. Similarly,
read performance also increases.
Need of Rack Awareness..
The reasons for the Rack Awareness in HDFS are:
ii. To achieve fault tolerance, even when the rack goes
down
iii. To achieve high availability of data so that data is
available even in unfavorable conditions
• for instance, an entire rack fails because of the switch
failure or power failure
iv. To reduce the latency, that is, to make the file
read/write operations done with lower delay
NameNode uses a rack awareness algorithm while
placing the replicas in HDFS
Replica Placement via Rack Awareness in HDFS
• HDFS stores replicas of data blocks of a file to provide
fault tolerance and high availability.
• Communication between the nodes (DataNodes) on
the same rack is more efficient as compared to the
communication between nodes (DataNodes) residing
on different racks
• the network bandwidth between nodes on the
same rack is higher than the network bandwidth
between nodes on a different rack



Replica Placement via Rack Awareness in HDFS..
• If we store replicas on different nodes on the same
rack, then it improves the network bandwidth,
• but if the rack fails (rarely happens), then there
will be no copy of data on another rack
• If replicas are on unique racks,
• then the transfer of blocks to multiple
racks during writes increases the cost of writes.



Replica Placement via Rack Awareness in HDFS
• The NameNode on a multi-rack cluster maintains block
replication by using in-built Rack Awareness
policies, which say:
➢Not more than one replica be placed on one
DataNode.
➢Not more than two replicas are placed on the same
rack.
➢The number of racks used for block replication
should always be smaller than the number of
replicas.
• The chance of rack failure is far less than that of node
failure; this algorithm does not impact data reliability
and availability guarantees.
Replica Placement via Rack Awareness in HDFS..
• Usually the replication factor is three,
▪ the block replication policy puts the first replica on the local
rack
▪ the second replica on a different DataNode on the same
rack
▪ the third replica on a different rack
• To improve write performance and network traffic without
compromising fault tolerance, while re-replicating a block,
▪ if there is one existing replica,
o place the second replica on a different rack
▪ if the existing replicas are two and are on the same rack,
o then place the third replica on a different rack
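The placement policy above can be summarized in a small, deliberately simplified sketch; it is illustrative only, as the real NameNode placement logic also weighs node load, free disk space, and other factors:

import java.util.ArrayList;
import java.util.List;

public class PlacementSketch {
    // For a replication factor of three: two replicas on the writer's (local) rack,
    // one replica on a different rack, so the number of racks used (two) stays
    // smaller than the number of replicas (three).
    static List<String> chooseRacks(String localRack, List<String> otherRacks) {
        List<String> targets = new ArrayList<>();
        targets.add(localRack);   // replica 1: a DataNode on the local rack
        targets.add(localRack);   // replica 2: a different DataNode on the same rack
        // replica 3: a different rack (fall back to the local rack if the cluster has one rack)
        targets.add(otherRacks.isEmpty() ? localRack : otherRacks.get(0));
        return targets;
    }

    public static void main(String[] args) {
        System.out.println(chooseRacks("/rack1", List.of("/rack2", "/rack3")));
        // prints [/rack1, /rack1, /rack2]
    }
}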
Replica Placement via Rack Awareness in HDFS..

Fig 5: Replica Placement via Rack Awareness in HDFS


Data Replication: Replica Selection
• To minimize global bandwidth consumption and
read latency,
• HDFS tries to satisfy a read request from a
replica that is closest to the reader
• If there exists a replica on the same rack as
the reader node, then that replica is
preferred to satisfy the read request
• If HDFS cluster spans multiple data centers,
• then a replica that is resident in the local data
center is preferred over any remote replica.
Data Replication..
• Checked in
• When a file is stored in HDFS, it is divided into
blocks, and each block is replicated across multiple
DataNodes according to the replication factor (by
default, 3 replicas).
• NameNode keeps track of where each block's
replicas are stored across the DataNodes
• After the NameNode initiates the replication of a
block to several DataNodes, it waits for each
DataNode to report back (via block reports) that it
has successfully received and stored the block. This
reporting back is what is referred to as "checked in."
Data Replication..
• The checked in mechanism ensures the following:
➢Data Integrity: The "checked in" mechanism
ensures that the NameNode is fully aware of the
block's location across the cluster, which is crucial
for maintaining data integrity
➢Fault Tolerance: Only after all required replicas
have checked in can the system conclude that the
block is safely replicated. If a DataNode fails prior to
checking in, the NameNode can instruct other
DataNodes to create additional replicas, ensuring
the data is not lost.



Data Replication..
• A block is considered “safely replicated”
• when the minimum number of required replicas (as
per the replication factor) has successfully been
stored
• and the corresponding DataNodes have
"checked in" by sending a block report
• that confirms the storage of those replicas to
NameNode thereby, ensuring data
redundancy and reliability



Data Replication: Safemode
• On startup, the NameNode enters a special state called
Safemode
• During Safe Mode, the NameNode does not allow any
modifications to the filesystem, meaning no new files can be
created, and no existing files can be modified or deleted.
• NameNode receives heartbeat and block report messages
from the DataNodes
• Once the NameNode determines that the configurable
percentage of blocks are safely replicated (information
provided through block report that DataNodes have checked
in),
• the NameNode waits for an additional 30-second delay
before exiting Safemode
Data Replication: Safemode..
• The additional delay of 30 seconds before exiting
Safemode
• serves as a buffer period to receive any remaining
block reports and gives the system time to stabilize
before resuming normal operations
• After exiting Safemode
• NameNode allows clients to start writing data to
HDFS, modifying files, and performing other normal
operations



DataNode Failure in HDFS
• The primary objective of HDFS is to store data reliably even in the
presence of failures.
• Detection of DataNode failure,
• Each DataNode sends a heartbeat message to the NameNode
periodically
• NameNode detects DataNode failure by the absence of a
heartbeat message. NameNode marks DataNodes without
recent heartbeats as dead and does not forward any new input
or output requests to them.
• Any data that was registered to a dead DataNode is not
available to HDFS any more. Thus, a DataNode failure may
cause the replication factor of some blocks to fall below their
specified value.
• NameNode constantly tracks which blocks need to be
replicated and initiates replication whenever necessary.
DataNode Failure in HDFS..
• The necessity for re-replication may arise due to many
reasons:
❖ a DataNode may become unavailable, or
❖ a replica may become corrupted,
❖ hard disk on a DataNode may fail, or
❖ the replication factor of a file may be increased



Cluster Rebalancing
• HDFS architecture is compatible with data rebalancing
schemes.
• A scheme might automatically move data from one
DataNode to another if the free space on a
DataNode falls below a certain threshold.
• In the event of a sudden high demand for a
particular file, a scheme might dynamically create
additional replicas and rebalance other data in the
cluster.
• However, these types of data rebalancing schemes
are not yet implemented.



References
• https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
• https://www.edureka.co/blog/apache-hadoop-hdfs-architecture/
• https://data-flair.training/blogs/rack-awareness-hadoop-hdfs/
• https://hadoop.apache.org/docs/r1.2.1/hdfs_user_guide.html
