HDFS Unit 4
What is HDFS?
Hadoop Distributed File System (HDFS) is a distributed file system designed for storing
unstructured data and large files. It is optimized for high-bandwidth data streaming, following
a write-once, read-many-times pattern. HDFS is inspired by Google File System (GFS) and is
a simpler variant.
Apache Hadoop is a framework that offers distributed storage (HDFS) and computing, using
the MapReduce model to process large data sets. HDFS is scalable, fault-tolerant, and
primarily serves the MapReduce paradigm. Like GFS, HDFS is designed for data-intensive
applications, not for end-users, and is not POSIX-compliant. Access is typically through
HDFS clients or APIs.
Limitations of HDFS:
Low-latency data access: HDFS prioritizes throughput over latency, making it
unsuitable for applications requiring fast data access.
Handling small files: The NameNode stores all filesystem metadata in memory,
limiting the number of files based on its memory capacity. Storing billions of files is
beyond its capability.
Concurrent writing and file modifications: HDFS does not support multiple
concurrent writers or arbitrary file modifications. It only allows appending data at the
end of a file.
HDFS architecture
All files stored in HDFS are broken into multiple fixed-size blocks, where each block is 128
megabytes by default (configurable on a per-file basis). Each file stored in HDFS
consists of two parts: the actual file data and the metadata, i.e., how many blocks the
file has, where those blocks are located, the total file size, and so on. An HDFS cluster primarily consists of a
NameNode, which manages the file system metadata, and DataNodes, which store the actual data.
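The block-count arithmetic implied here can be sketched quickly (the 300 MB file size below is a made-up example, not from the text):

```shell
# Sketch: how many 128 MB blocks a 300 MB file occupies (sizes are illustrative).
BLOCK_MB=128
FILE_MB=300

FULL_BLOCKS=$((FILE_MB / BLOCK_MB))   # complete 128 MB blocks
LAST_MB=$((FILE_MB % BLOCK_MB))       # size of the final, smaller block
TOTAL=$FULL_BLOCKS
if [ "$LAST_MB" -gt 0 ]; then
  TOTAL=$((TOTAL + 1))
fi

echo "$TOTAL blocks; last block is $LAST_MB MB"   # 3 blocks; last block is 44 MB
```

This also shows why all blocks of a file are the same size except the last one.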
HDFS high-level architecture
All blocks of a file are of the same size except the last one.
HDFS uses large block sizes because it is designed to store extremely large files to
enable MapReduce jobs to process them efficiently.
Each block is identified by a unique 64-bit ID called BlockID.
All read/write operations in HDFS operate at the block level.
DataNodes store each block in a separate file on the local file system and provide
read/write access.
When a DataNode starts up, it scans through its local file system and sends the list of
hosted data blocks (called BlockReport) to the NameNode.
The NameNode maintains two on-disk data structures to store the file system’s state:
an FsImage file and an EditLog. FsImage is a checkpoint of the file system metadata
at some point in time, while the EditLog is a log of all of the file system metadata
transactions since the image file was last created. These two files allow the NameNode to
recover from failures.
User applications interact with HDFS through its client. The HDFS client contacts the
NameNode for metadata, but all data transfers happen directly between the client and the
DataNodes.
To achieve high availability, HDFS creates multiple copies (replicas) of each block and
distributes them on nodes throughout the cluster.
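The FsImage/EditLog recovery described above can be modeled as a checkpoint file plus an append-only log replayed on top of it. The file layout and the `mkdir` operation below are a toy sketch, not HDFS's actual on-disk format:

```shell
# Toy model of checkpoint + edit-log recovery (not HDFS's real on-disk format).
workdir=$(mktemp -d)
fsimage="$workdir/fsimage"   # checkpoint: namespace state at some point in time
editlog="$workdir/editlog"   # transactions recorded since that checkpoint

printf '/a\n/b\n' > "$fsimage"      # checkpointed namespace: two directories
printf 'mkdir /c\n' > "$editlog"    # a later transaction, not yet checkpointed

# "Recovery": load the checkpoint, then replay the log on top of it.
state=$(cat "$fsimage")
while read -r op path; do
  if [ "$op" = "mkdir" ]; then
    state=$(printf '%s\n%s' "$state" "$path")
  fi
done < "$editlog"

echo "$state"    # prints /a, /b, /c on separate lines
rm -rf "$workdir"
```

Periodically merging the log into a new checkpoint keeps the replay short; that merging is exactly the snapshot role the secondary NameNode plays later in these notes.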
Explain features of HDFS. Discuss the design and concepts of the Hadoop Distributed File
System in detail.
HDFS (Hadoop Distributed File System) is the basic storage system of Hadoop.
Large data files processed on a cluster of commodity hardware are stored in HDFS. It can
store data reliably even when hardware fails. The key aspects of HDFS are:
HDFS was developed taking inspiration from the Google File System (GFS).
Storage component: stores data in Hadoop.
Distributes data across several nodes: divides a large file into blocks and stores them on
various DataNodes.
Natively redundant: replicates the blocks across various DataNodes.
High-throughput access: provides access to the data blocks that are nearest to the
client.
Re-replicates blocks when nodes fail.
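The redundancy and block splitting described above are controlled by cluster configuration. A minimal hdfs-site.xml fragment might look like this (both values shown equal the stock Hadoop defaults):

```xml
<!-- hdfs-site.xml: illustrative fragment; the values match stock defaults -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>           <!-- each block is stored on 3 DataNodes -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>   <!-- 128 MB, in bytes -->
  </property>
</configuration>
```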
HDFS Daemons:
(i) NameNode
The NameNode is the master of HDFS that directs the slave DataNodes to
perform I/O tasks.
Blocks: HDFS breaks large file into smaller pieces called blocks.
rackID: the NameNode uses a rackID to identify DataNodes in a rack (a rack is a
collection of DataNodes within the cluster).
The NameNode keeps track of the blocks of each file.
File System Namespace: the NameNode is the bookkeeper of HDFS. It keeps track
of how files are broken down into blocks and which DataNodes store those blocks.
The namespace is the collection of files in the cluster.
FsImage: the file system namespace, including the mapping of blocks to files and file
properties, is stored in a file called FsImage.
EditLog: the NameNode uses an EditLog (a transaction log) to record every transaction
that happens to the file system metadata.
The NameNode is a single point of failure of the Hadoop cluster.
(ii) DataNode
There are multiple DataNodes per cluster. Each slave machine in the cluster runs a
DataNode daemon, which reads and writes HDFS blocks of the actual file on the local file system.
During pipelined reads and writes, DataNodes communicate with each other.
A DataNode also continuously sends "heartbeat" messages to the NameNode to ensure
connectivity between the NameNode and the DataNode.
If no heartbeat is received for a period of time, the NameNode assumes that the DataNode
has failed and re-replicates its blocks.
Fig. Interaction between NameNode and DataNode.
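The "period of time" mentioned above comes from two settings. Assuming stock defaults (dfs.heartbeat.interval of 3 seconds and dfs.namenode.heartbeat.recheck-interval of 300 000 ms), the NameNode marks a DataNode dead after about 10.5 minutes:

```shell
# Dead-node timeout from stock Hadoop defaults (the values are the shipped defaults).
HEARTBEAT_S=3        # dfs.heartbeat.interval, in seconds
RECHECK_MS=300000    # dfs.namenode.heartbeat.recheck-interval, in milliseconds

# Hadoop's rule: timeout = 2 * recheck-interval + 10 * heartbeat-interval
TIMEOUT_S=$((2 * RECHECK_MS / 1000 + 10 * HEARTBEAT_S))
echo "DataNode declared dead after ${TIMEOUT_S} s"   # 630 s, i.e., 10.5 minutes
```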
(iii) Secondary NameNode
Takes a snapshot of the HDFS metadata at intervals specified in the Hadoop configuration.
The secondary NameNode requires the same amount of memory as the NameNode,
but it runs on a different machine.
In case of NameNode failure, the secondary NameNode can be manually configured to
bring up the cluster, i.e., the secondary NameNode is made the NameNode.
1. Creating a directory:
Syntax: hdfs dfs -mkdir <path>
E.g., hdfs dfs -mkdir /chp
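A few more common hdfs dfs commands follow the same pattern (a sketch only: they need a running HDFS cluster, and the /chp and localfile.txt names are made up for illustration):

```shell
# Illustrative hdfs dfs usage; the paths and file names are hypothetical.
hdfs dfs -ls /                       # list the root directory
hdfs dfs -put localfile.txt /chp     # copy a local file into HDFS
hdfs dfs -cat /chp/localfile.txt     # print a file's contents
hdfs dfs -get /chp/localfile.txt .   # copy a file back to the local file system
hdfs dfs -rm -r /chp                 # remove a directory and its contents
```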