HDFS Unit 4
What is HDFS?
Hadoop Distributed File System (HDFS) is a distributed file system designed for storing
unstructured data and large files. It is optimized for high-bandwidth data streaming, following
a write-once, read-many-times pattern. HDFS is inspired by Google File System (GFS) and is
a simpler variant.
Apache Hadoop is a framework that offers distributed storage (HDFS) and computing, using
the MapReduce model to process large data sets. HDFS is scalable, fault-tolerant, and
primarily serves the MapReduce paradigm. Like GFS, HDFS is designed for data-intensive
applications, not for end-users, and is not POSIX-compliant. Access is typically through
HDFS clients or APIs.
Limitations of HDFS:
Low-latency data access: HDFS prioritizes throughput over latency, making it
unsuitable for applications requiring fast data access.
Handling small files: The NameNode stores all filesystem metadata in memory,
limiting the number of files based on its memory capacity. Storing billions of files is
beyond its capability.
Concurrent writing and file modifications: HDFS does not support multiple
concurrent writers or arbitrary file modifications. It only allows appending data at the
end of a file.
HDFS architecture
All files stored in HDFS are broken into multiple fixed-size blocks, where each block is 128
megabytes by default (configurable on a per-file basis). Each file stored in HDFS
consists of two parts: the actual file data and the metadata, i.e., how many blocks the
file has, where those blocks are located, the total file size, and so on. An HDFS cluster primarily consists of a
NameNode, which manages the file system metadata, and DataNodes, which store the actual data.
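The block-count arithmetic implied here can be sketched quickly (the 300 MB file size below is a made-up example, not from the text):

```shell
# Sketch: how many 128 MB blocks a 300 MB file occupies (sizes are illustrative).
BLOCK_MB=128
FILE_MB=300

FULL_BLOCKS=$((FILE_MB / BLOCK_MB))   # complete 128 MB blocks
LAST_MB=$((FILE_MB % BLOCK_MB))       # size of the final, smaller block
TOTAL=$FULL_BLOCKS
if [ "$LAST_MB" -gt 0 ]; then
  TOTAL=$((TOTAL + 1))
fi

echo "$TOTAL blocks; last block is $LAST_MB MB"   # 3 blocks; last block is 44 MB
```

This also shows why all blocks of a file are the same size except the last one.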
HDFS high-level architecture
All blocks of a file are of the same size except the last one.
HDFS uses large block sizes because it is designed to store extremely large files to
enable MapReduce jobs to process them efficiently.
Each block is identified by a unique 64-bit ID called BlockID.
All read/write operations in HDFS operate at the block level.
DataNodes store each block in a separate file on the local file system and provide
read/write access.
When a DataNode starts up, it scans through its local file system and sends the list of
hosted data blocks (called BlockReport) to the NameNode.
The NameNode maintains two on-disk data structures to store the file system’s state:
an FsImage file and an EditLog. FsImage is a checkpoint of the file system metadata
at some point in time, while the EditLog is a log of all of the file system metadata
transactions since the image file was last created. These two files allow the NameNode to
recover from failures.
User applications interact with HDFS through its client. The HDFS client contacts the
NameNode for metadata, but all data transfers happen directly between the client and the
DataNodes.
To achieve high availability, HDFS creates multiple copies (replicas) of each block and
distributes them on nodes throughout the cluster.
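The FsImage/EditLog recovery described above can be modeled as a checkpoint file plus an append-only log replayed on top of it. The file layout and the `mkdir` operation below are a toy sketch, not HDFS's actual on-disk format:

```shell
# Toy model of checkpoint + edit-log recovery (not HDFS's real on-disk format).
workdir=$(mktemp -d)
fsimage="$workdir/fsimage"   # checkpoint: namespace state at some point in time
editlog="$workdir/editlog"   # transactions recorded since that checkpoint

printf '/a\n/b\n' > "$fsimage"      # checkpointed namespace: two directories
printf 'mkdir /c\n' > "$editlog"    # a later transaction, not yet checkpointed

# "Recovery": load the checkpoint, then replay the log on top of it.
state=$(cat "$fsimage")
while read -r op path; do
  if [ "$op" = "mkdir" ]; then
    state=$(printf '%s\n%s' "$state" "$path")
  fi
done < "$editlog"

echo "$state"    # prints /a, /b, /c on separate lines
rm -rf "$workdir"
```

Periodically merging the log into a new checkpoint keeps the replay short; that merging is exactly the snapshot role the secondary NameNode plays later in these notes.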
Explain features of HDFS. Discuss the design and concepts of the Hadoop Distributed File
System in detail.
HDFS (Hadoop Distributed File System) is the basic storage system of Hadoop.
Large data files processed on a cluster of commodity hardware are stored in HDFS. It can
store data reliably even when hardware fails. The key aspects of HDFS are:
HDFS was developed taking inspiration from the Google File System (GFS).
Storage component: stores data in Hadoop.
Distributes data across several nodes: divides a large file into blocks and stores them on
various DataNodes.
Natively redundant: replicates the blocks across various DataNodes.
High-throughput access: provides access to the data blocks that are nearest to the
client.
Re-replicates blocks when nodes fail.
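The redundancy and block splitting described above are controlled by cluster configuration. A minimal hdfs-site.xml fragment might look like this (both values shown equal the stock Hadoop defaults):

```xml
<!-- hdfs-site.xml: illustrative fragment; the values match stock defaults -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>           <!-- each block is stored on 3 DataNodes -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>   <!-- 128 MB, in bytes -->
  </property>
</configuration>
```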
HDFS Daemons:
(i) NameNode
The NameNode is the master of HDFS that directs the slave DataNodes to
perform I/O tasks.
Blocks: HDFS breaks large file into smaller pieces called blocks.
rackID: the NameNode uses a rackID to identify DataNodes in a rack (a rack is a
collection of DataNodes within the cluster).
The NameNode keeps track of the blocks of each file.
File System Namespace: the NameNode is the bookkeeper of HDFS. It keeps track
of how files are broken down into blocks and which DataNodes store those blocks.
The namespace is the collection of files in the cluster.
FsImage: the file system namespace, including the mapping of blocks to files and file
properties, is stored in a file called FsImage.
EditLog: the NameNode uses an EditLog (a transaction log) to record every transaction
that happens to the file system metadata.
The NameNode is a single point of failure of the Hadoop cluster.
(ii) DataNode
There are multiple DataNodes per cluster. Each slave machine in the cluster runs a
DataNode daemon, which reads and writes HDFS blocks of the actual file on the local file system.
During pipelined reads and writes, DataNodes communicate with each other.
A DataNode also continuously sends "heartbeat" messages to the NameNode to ensure
connectivity between the NameNode and the DataNode.
If no heartbeat is received for a period of time, the NameNode assumes that the DataNode
has failed and re-replicates its blocks.
Fig. Interaction between NameNode and DataNode.
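The "period of time" mentioned above comes from two settings. Assuming stock defaults (dfs.heartbeat.interval of 3 seconds and dfs.namenode.heartbeat.recheck-interval of 300 000 ms), the NameNode marks a DataNode dead after about 10.5 minutes:

```shell
# Dead-node timeout from stock Hadoop defaults (the values are the shipped defaults).
HEARTBEAT_S=3        # dfs.heartbeat.interval, in seconds
RECHECK_MS=300000    # dfs.namenode.heartbeat.recheck-interval, in milliseconds

# Hadoop's rule: timeout = 2 * recheck-interval + 10 * heartbeat-interval
TIMEOUT_S=$((2 * RECHECK_MS / 1000 + 10 * HEARTBEAT_S))
echo "DataNode declared dead after ${TIMEOUT_S} s"   # 630 s, i.e., 10.5 minutes
```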
(iii) Secondary NameNode
Takes a snapshot of the HDFS metadata at intervals specified in the Hadoop configuration.
The secondary NameNode requires the same amount of memory as the NameNode,
but it runs on a different machine.
In case of NameNode failure, the secondary NameNode can be manually configured to
bring up the cluster, i.e., the secondary NameNode is made the NameNode.
1. Creating a directory:
Syntax: hdfs dfs -mkdir <path>
E.g., hdfs dfs -mkdir /chp
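A few more common hdfs dfs commands follow the same pattern (a sketch only: they need a running HDFS cluster, and the /chp and localfile.txt names are made up for illustration):

```shell
# Illustrative hdfs dfs usage; the paths and file names are hypothetical.
hdfs dfs -ls /                       # list the root directory
hdfs dfs -put localfile.txt /chp     # copy a local file into HDFS
hdfs dfs -cat /chp/localfile.txt     # print a file's contents
hdfs dfs -get /chp/localfile.txt .   # copy a file back to the local file system
hdfs dfs -rm -r /chp                 # remove a directory and its contents
```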