
Unit 4: Hadoop Distributed File System Architecture

What is HDFS?

Hadoop Distributed File System (HDFS) is a distributed file system designed for storing large files and unstructured data. It is optimized for high-bandwidth data streaming, following a write-once, read-many-times access pattern. HDFS is inspired by the Google File System (GFS) and is a simpler variant of it.

Apache Hadoop is a framework that offers distributed storage (HDFS) and computing, using
the MapReduce model to process large data sets. HDFS is scalable, fault-tolerant, and
primarily serves the MapReduce paradigm. Like GFS, HDFS is designed for data-intensive
applications, not for end-users, and is not POSIX-compliant. Access is typically through
HDFS clients or APIs.
Limitations of HDFS:
• Low-latency data access: HDFS prioritizes throughput over latency, making it unsuitable for applications requiring fast data access.
• Handling small files: The NameNode stores all filesystem metadata in memory, limiting the number of files based on its memory capacity. Storing billions of files is beyond its capability.
• Concurrent writing and file modifications: HDFS does not support multiple concurrent writers or arbitrary file modifications. It only allows appending data at the end of a file (see the sketch after this list).
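To make the append-only restriction concrete, the sketch below opens an existing file for append, which is the only kind of mutation HDFS allows; there is no API for overwriting bytes in the middle of a file. It uses the Hadoop FileSystem Java API; the NameNode URI and the file path are placeholder assumptions, not values from this unit:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS must point at the cluster's NameNode; this address is an assumption
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/chp/abc.txt"); // hypothetical existing file

        // HDFS permits writes only at the end of an existing file;
        // append() is the sole mutation API besides create().
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("one more line\n");
        }
    }
}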
HDFS architecture
All files stored in HDFS are broken into multiple fixed-size blocks, where each block is 128 megabytes by default (configurable on a per-file basis, as the sketch below shows). Each file stored in HDFS consists of two parts: the actual file data and the metadata, i.e., how many blocks the file has, where they are located, the total file size, etc. An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data.
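Because the block size is a per-file property, a client can override the 128 MB default when creating a file. Below is a minimal sketch using the Hadoop FileSystem Java API; the NameNode URI, the path, and the chosen sizes are placeholder assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        // create(path, overwrite, bufferSize, replication, blockSize):
        // this file gets 256 MB blocks instead of the 128 MB default.
        long blockSize = 256L * 1024 * 1024;
        try (FSDataOutputStream out = fs.create(
                new Path("/chp/big.dat"), true, 4096, (short) 3, blockSize)) {
            out.writeBytes("data\n");
        }
    }
}

Note that the same create() overload also fixes the replication factor on a per-file basis.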
HDFS high-level architecture
• All blocks of a file are the same size except the last one.
• HDFS uses large block sizes because it is designed to store extremely large files and to enable MapReduce jobs to process them efficiently.
• Each block is identified by a unique 64-bit ID called a BlockID.
• All read/write operations in HDFS operate at the block level.
• DataNodes store each block in a separate file on the local file system and provide read/write access.
• When a DataNode starts up, it scans through its local file system and sends the list of hosted data blocks (called a BlockReport) to the NameNode.
• The NameNode maintains two on-disk data structures to store the file system's state: an FsImage file and an EditLog. FsImage is a checkpoint of the file system metadata at some point in time, while the EditLog is a log of all of the file system metadata transactions since the image file was last created. These two files help the NameNode recover from failure.
• User applications interact with HDFS through its client. The HDFS client contacts the NameNode for metadata, but all data transfers happen directly between the client and the DataNodes (a sketch of such a metadata query follows this list).
• To achieve high availability, HDFS creates multiple copies of the data and distributes them on nodes throughout the cluster.
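To illustrate the metadata path described above, the following sketch asks the NameNode for a file's block locations without transferring any block data. It is a minimal example against the public FileSystem Java API; the NameNode URI and the file path are placeholder assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/chp/abc1.txt"); // hypothetical file
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers this metadata query; no block data is read.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                b.getOffset(), b.getLength(),
                String.join(",", b.getHosts()));
        }
    }
}

Each BlockLocation corresponds to one block of the file, together with the DataNodes hosting its replicas.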

Q. Explain the features of HDFS. Discuss the design of the Hadoop Distributed File System and its concepts in detail.
HDFS (Hadoop Distributed File System): HDFS is the basic storage system of Hadoop. The large data files running on a cluster of commodity hardware are stored in HDFS. It can store data in a reliable manner even when hardware fails. The key aspects of HDFS are:
• HDFS was developed taking inspiration from the Google File System (GFS).
• Storage component: stores data in Hadoop.
• Distributes data across several nodes: divides large files into blocks and stores them on various DataNodes.
• Natively redundant: replicates the blocks across various DataNodes.
• High-throughput access: provides access to the data blocks that are nearest to the client.
• Re-replicates data blocks when nodes fail.
HDFS Daemons:
(i) NameNode
• The NameNode is the master of HDFS that directs the slave DataNodes to perform I/O tasks.
• Blocks: HDFS breaks large files into smaller pieces called blocks.
• rackID: the NameNode uses a rackID to identify DataNodes in a rack. (A rack is a collection of DataNodes within the cluster.)
• The NameNode keeps track of the blocks of each file.
• File System Namespace: the NameNode is the bookkeeper of HDFS. It keeps track of how files are broken down into blocks and which DataNodes store these blocks. The namespace is the collection of files in the cluster.
• FsImage: the file system namespace, including the mapping of files to blocks and file properties, is stored in a file called FsImage.
• EditLog: the NameNode uses an EditLog (transaction log) to record every transaction that happens to the file system metadata.
• The NameNode is a single point of failure of the Hadoop cluster.
(ii) DataNode
• There are multiple DataNodes per cluster. Each slave machine in the cluster runs a DataNode daemon for reading and writing the HDFS blocks of the actual file on its local file system.
• During pipelined reads and writes, DataNodes communicate with each other.
• A DataNode also continuously sends "heartbeat" messages to the NameNode to ensure connectivity between the NameNode and the DataNode.
• If no heartbeat is received for a period of time, the NameNode assumes that the DataNode has failed and re-replicates its blocks on other DataNodes.
Fig. Interaction between NameNode and DataNode.
(iii) Secondary NameNode
• Takes snapshots of the HDFS metadata at intervals specified in the Hadoop configuration.
• Requires the same amount of memory as the NameNode.
• Runs on a different machine than the NameNode.
• In case of NameNode failure, the Secondary NameNode can be configured manually to bring up the cluster, i.e., the Secondary NameNode is made the NameNode.

File Read operation:


The steps involved in the File Read are as follows:
1. The client opens the file that it wishes to read by calling open() on the DistributedFileSystem (DFS).
2. The DFS communicates with the NameNode to get the locations of the data blocks. The NameNode returns the addresses of the DataNodes that the data blocks are stored on. Subsequent to this, the DFS returns an FSDataInputStream to the client to read from the file.
3. The client then calls read() on the stream DFSInputStream, which has the addresses of the DataNodes for the first few blocks of the file.
4. The client calls read() repeatedly to stream the data from the DataNodes. When the end of a block is reached, DFSInputStream closes the connection with that DataNode, then finds the best DataNode for the next block, and so on for subsequent blocks.
5. When the client completes the reading of the file, it calls close() on the FSDataInputStream to close the connection.
Fig. File Read Anatomy
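The read sequence above maps onto only a few lines of client code, since steps 2-5 happen inside the stream returned by open(). A minimal sketch using the Hadoop FileSystem Java API; the NameNode URI and the path are placeholder assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        // Steps 1-2: open() asks the NameNode for block locations and
        // returns an FSDataInputStream (wrapping DFSInputStream).
        try (FSDataInputStream in = fs.open(new Path("/chp/abc1.txt"))) {
            // Steps 3-4: read() streams block data directly from DataNodes.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } // Step 5: close() happens automatically via try-with-resources
    }
}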

File Write operation:


The steps involved in the File Write are as follows:
1. The client calls create() on the DistributedFileSystem (DFS) to create a file.
2. An RPC call to the NameNode happens through the DFS to create the new file.
3. As the client writes data, the data is split into packets by DFSOutputStream, which writes them to an internal queue called the data queue. The DataStreamer consumes the data queue.
4. The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each packet and forwards it to the second DataNode in the pipeline.
5. In addition to the internal queue, DFSOutputStream also maintains an "ack queue" of packets that are waiting to be acknowledged by the DataNodes.
6. When the client finishes writing the file, it calls close() on the stream.

Fig. File Write Anatomy
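The client-side code for a write is similarly short; the packet splitting, data queue, and DataNode pipeline of steps 3-5 are managed internally by the stream returned by create(). A minimal sketch, with the NameNode URI and path as placeholder assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        // Steps 1-2: create() makes an RPC to the NameNode to register
        // the new file, then returns an FSDataOutputStream.
        try (FSDataOutputStream out = fs.create(new Path("/chp/new.txt"))) {
            // Steps 3-5: writes are split into packets, queued, and
            // pipelined to the DataNodes by the DataStreamer thread.
            out.writeBytes("hello hdfs\n");
        } // Step 6: close() flushes remaining packets and completes the file
    }
}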


Special features of HDFS:
1. Data Replication: There is no need for a client application to track all the blocks of a file. The NameNode directs the client to the nearest replica to ensure high performance.
2. Data Pipeline: A client application writes a block to the first DataNode in the pipeline. That DataNode then takes over and forwards the data to the next node in the pipeline. This process continues for all the data blocks, until all the replicas are written to disk.

Fig. File Replication Strategy
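The replication factor is a per-file setting that a client may change after the fact; the NameNode then schedules re-replication (or removal of excess replicas) in the background. A minimal sketch using the FileSystem Java API, with a placeholder NameNode URI and file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/chp/abc1.txt"); // hypothetical file
        // Ask HDFS to keep 3 replicas of every block of this file;
        // the NameNode performs the re-replication asynchronously.
        boolean changed = fs.setReplication(file, (short) 3);
        System.out.println("replication change requested: " + changed);
    }
}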

Q. Explain basic HDFS File operations with an example.

1. Creating a directory:
Syntax: hdfs dfs -mkdir <path>
Eg. hdfs dfs -mkdir /chp

2. Remove a file in a specified path:
Syntax: hdfs dfs -rm <src>
Eg. hdfs dfs -rm /chp/abc.txt

3. Copy a file from the local file system to HDFS:
Syntax: hdfs dfs -copyFromLocal <src> <dst>
Eg. hdfs dfs -copyFromLocal /home/hadoop/sample.txt /chp/abc1.txt

4. To display the list of contents in a directory:
Syntax: hdfs dfs -ls <path>
Eg. hdfs dfs -ls /chp

5. To display the contents of a file:
Syntax: hdfs dfs -cat <path>
Eg. hdfs dfs -cat /chp/abc1.txt

6. Copy a file from HDFS to the local file system:
Syntax: hdfs dfs -copyToLocal <src> <dst>
Eg. hdfs dfs -copyToLocal /chp/abc1.txt /home/hadoop/Desktop/sample.txt

7. To display the last few lines of a file:
Syntax: hdfs dfs -tail <path>
Eg. hdfs dfs -tail /chp/abc1.txt

8. Display the aggregate length of a file in bytes:
Syntax: hdfs dfs -du <path>
Eg. hdfs dfs -du /chp

9. To count the number of directories, files and bytes under a given path:
Syntax: hdfs dfs -count <path>
Eg. hdfs dfs -count /chp
o/p: 1 1 60

10. Remove a directory from HDFS (-rmr is deprecated in newer Hadoop releases; hdfs dfs -rm -r <path> is the current equivalent):
Syntax: hdfs dfs -rmr <path>
Eg. hdfs dfs -rmr /chp
