Understanding HDFS (Hadoop Distributed File System) for Beginners

HDFS (Hadoop Distributed File System) is a special kind of storage system designed
to handle large amounts of data efficiently. It is different from normal file
systems because it is built for big data processing. Let’s go through its key
features in a simple way.

1. Hardware Failure is Common, So HDFS is Built to Handle It

Imagine you have a system with thousands of computers working together.

In such a big system, some computers will fail from time to time.

HDFS does not stop working if a computer fails. Instead, it keeps extra copies of
the data (replication) on different computers.

If one computer crashes, HDFS automatically recovers the data from another copy.

2. Designed for Fast Data Processing, Not Quick Responses

HDFS is not meant for everyday, interactive use like a normal file system.

Instead of handling many small files with quick responses, it is built to process
huge files efficiently.

It focuses on high throughput when reading large amounts of data.

It relaxes a few strict file-system rules (full POSIX semantics) so that large-scale data processing can run faster.

3. Handles Very Large Files Easily

Normal file systems may struggle with very big files (like terabytes of data).

HDFS is specially designed to handle massive files smoothly.

It can store and process millions of files across many computers.

4. Once a File is Written, It Cannot Be Changed (Only Append is Allowed)

In HDFS, you can write data to a file and read it later, but you cannot change a
file in the middle.

You can only add more data at the end (this is called appending).

This makes HDFS simple and prevents data corruption.
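
To make the write-once model concrete, here is a minimal sketch using the standard Hadoop Java FileSystem API (the path /user/demo/log.txt is just an illustrative placeholder, and append must be supported on the cluster): the only way to add data to an existing file is fs.append(), which writes at the end.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);         // connect to the default file system (HDFS)
        Path file = new Path("/user/demo/log.txt");   // hypothetical existing file

        // HDFS lets you add data only at the end of a file, never in the middle.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("one more record\n");
        }
        fs.close();
    }
}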

5. It is Faster to Move Computation to Data Instead of Moving Data

Think of it like this: If you have a huge book, it is easier to read it where it is
rather than carrying it to another place.

Similarly, in HDFS, instead of moving large files across the network, the
processing happens where the data is stored.

This reduces network traffic and makes processing faster.

6. Works on Different Computers and Operating Systems

HDFS is not restricted to a single type of hardware or software.


It can be installed on different types of computers and operating systems.

This makes it easy to use in many different places.

Final Summary

HDFS is made to store and process large data efficiently. It handles hardware
failures automatically, focuses on high-speed processing, supports very large
files, and follows a write-once-read-many model. Instead of moving data, it moves
computation to where the data is stored, making processing faster and more
efficient.

This is why big companies use HDFS for big data processing like analyzing large
customer data, social media trends, and scientific research. 🚀

1. NameNode (The Boss - Master Node)


📌 What It Does:

Stores metadata (information about files and where they are stored).
Does not store actual data, only records locations.
Manages the file system structure (directories, permissions, etc.).
If a file is deleted, NameNode updates its records, but the actual data is still in
DataNodes until cleaned up.
📌 Example:

Think of NameNode as a librarian in a big library.


The librarian does not hold the books but knows which shelf (DataNode) stores each
book (data block).
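
To see what this "librarian's index" looks like from a client, here is a minimal sketch with the Hadoop Java API (the directory /user/demo is a placeholder): listing a path returns only bookkeeping information such as size, owner, permissions, and replication factor, never the file contents.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListMetadata {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Every answer here comes from the NameNode's metadata, not from DataNodes.
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.printf("%s  size=%d  owner=%s  perms=%s  replication=%d%n",
                    status.getPath(), status.getLen(), status.getOwner(),
                    status.getPermission(), status.getReplication());
        }
        fs.close();
    }
}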
📌 What Happens If NameNode Fails?

The whole system can stop working because it manages everything.

That’s why production clusters run a Standby NameNode (High Availability setup) that can take over. Note that the Secondary NameNode, despite its name, is only a checkpointing helper that merges metadata snapshots; it is not a hot backup.
2. DataNode (The Workers - Slave Nodes)
📌 What It Does:

Stores actual data in small parts called blocks.

Serves read and write requests from clients, and carries out block operations (replication, deletion) when instructed by the NameNode.
Regularly sends heartbeats and block reports to the NameNode, so the NameNode knows which DataNodes are alive and which blocks they hold.
📌 Example:

DataNodes are like bookshelves in a library where books (data) are stored.
If one shelf (DataNode) breaks, the librarian (NameNode) finds another shelf with a
copy.
📌 What Happens If a DataNode Fails?

Hadoop keeps multiple copies of the data (replication), so if one DataNode fails,
another copy is still available, and the NameNode re-replicates the missing blocks onto healthy DataNodes.
How NameNode and DataNode Work Together
You upload a file → the HDFS client splits it into blocks, and the NameNode assigns each block to a set of DataNodes.
You request a file → the NameNode tells you which DataNodes hold the blocks.
The DataNodes then send the data directly back to you.
Feature                             NameNode (Master)    DataNode (Worker)
Stores actual data?                 ❌ No                 ✅ Yes
Stores metadata (file structure)?   ✅ Yes                ❌ No
Controls file system?               ✅ Yes                ❌ No
Recovers lost data?                 ✅ Yes                ❌ No, but stores copies
Sends health reports?               ❌ No                 ✅ Yes

HDFS File Storage and Retrieval Process


Example: Data Stored in NameNode and DataNodes
Let's say we have a file called "bigdata.txt" that is 300 MB in size.

How HDFS Stores This File


HDFS splits this file into blocks of 128 MB each (default block size).

The file is divided into 3 blocks:

Block 1 → 128 MB

Block 2 → 128 MB

Block 3 → 44 MB (remaining part)
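
The split is plain arithmetic; a tiny illustrative sketch in Java (not part of the HDFS API, just mirroring what the client does) shows where the 128 MB / 128 MB / 44 MB figures come from:

public class BlockMath {
    public static void main(String[] args) {
        long fileSize  = 300L * 1024 * 1024;   // the 300 MB example file
        long blockSize = 128L * 1024 * 1024;   // default HDFS block size: 128 MB

        long fullBlocks = fileSize / blockSize;   // 2 full 128 MB blocks
        long lastBlock  = fileSize % blockSize;   // 44 MB remainder

        System.out.println("Full 128 MB blocks : " + fullBlocks);
        System.out.println("Last block (MB)    : " + lastBlock / (1024 * 1024));
        // Output: 2 full blocks plus a 44 MB final block, i.e. 3 blocks in total.
    }
}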

📌 How NameNode Stores Metadata


The NameNode does not store the actual file. It keeps a table-like record
(metadata) of where each block is stored in DataNodes.

File Name       Block ID    Stored In (DataNodes)
bigdata.txt     Block 1     DataNode 1, DataNode 2
bigdata.txt     Block 2     DataNode 2, DataNode 3
bigdata.txt     Block 3     DataNode 3, DataNode 1
💡 Why Multiple DataNodes?
HDFS replicates each block for fault tolerance (the default is 3 copies; this simplified example keeps 2 copies of each block). So if one DataNode fails, another copy still exists.

📌 How DataNodes Store Actual Data


DataNode 1 stores: Block 1 & Block 3

DataNode 2 stores: Block 1 & Block 2

DataNode 3 stores: Block 2 & Block 3
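
You can ask the NameNode for exactly this block-to-DataNode map through the Hadoop Java API; a minimal sketch (the path is a placeholder, and the host names printed depend on your actual cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/bigdata.txt");   // hypothetical 300 MB file

        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block: its offset, its length, and the DataNodes holding replicas.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}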

📌 How Data is Retrieved in HDFS (Step-by-Step Flow) 🚀


1️⃣ Client Requests File

The client wants to read a file (e.g., "bigdata.txt").

It sends a request to the NameNode asking where the file blocks are stored.

2️⃣ NameNode Responds with Metadata


The NameNode does NOT send actual data.

It only provides a list of DataNodes that store the file's blocks.

3️⃣ Client Connects to DataNodes


The client directly contacts the DataNodes where the blocks are stored.

DataNodes send the actual data to the client.


4️⃣ Client Assembles Data
The client receives all blocks from multiple DataNodes.

It reconstructs the original file and makes it ready for use.
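
The same read flow in code is just an open() followed by ordinary stream reads; the block lookup at the NameNode and the direct DataNode connections happen behind the scenes. A minimal sketch with the Hadoop Java API (the path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() asks the NameNode where the blocks are; the stream then pulls
        // each block directly from a DataNode and presents one continuous file.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/bigdata.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);   // stream the whole file to stdout
        }
        fs.close();
    }
}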

📌 Step-by-Step Data Write Process in HDFS


1️⃣ Client Requests to Store a File
The client sends a request to the NameNode to store a file (e.g., "bigdata.txt").

The NameNode checks:

If the file already exists.

If the client has the required permissions.

2️⃣ Client Splits the File into Blocks


The NameNode does NOT split the file.

The HDFS client divides the file into blocks before sending them to HDFS.

Example:

Block 1 → 128MB

Block 2 → 128MB

Block 3 → 44MB

3️⃣ NameNode Decides Where to Store Each Block


The NameNode does NOT store actual data—it just selects which DataNodes will store
the blocks.

4️⃣ Client Sends Data to the First DataNode


The client sends Block 1 to the first chosen DataNode.

The first DataNode forwards the block to the second DataNode.

The second DataNode forwards the block to the third DataNode.

5️⃣ DataNodes Send Acknowledgments


After all copies are stored, the last DataNode sends an acknowledgment to the
previous DataNode.

The acknowledgment flows back to the client, confirming successful storage.

6️⃣ NameNode Updates Metadata


Once all blocks are successfully stored, the NameNode updates its metadata with the
block locations.

The file is now fully stored in HDFS, and the client can read it later.
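
Seen from the client, the whole write path is a single create() plus ordinary writes; a minimal sketch with the Hadoop Java API (the path and the content are placeholders; block size and replication fall back to the cluster defaults):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/bigdata.txt");   // hypothetical target path

        // create() goes to the NameNode (existence and permission checks);
        // the returned stream cuts what you write into blocks and pushes each
        // block through the DataNode replication pipeline described above.
        try (FSDataOutputStream out = fs.create(file, true)) {   // true = overwrite if it exists
            out.writeBytes("some very large content...\n");
        }
        fs.close();
    }
}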

✅ Key Takeaways
✔ The HDFS client splits the file into blocks, NOT the NameNode.
✔ The NameNode does NOT store file data—it only stores metadata.
✔ DataNodes do NOT divide blocks further—each block is stored as a whole.
✔ Replication happens in a pipeline (one DataNode forwards to the next).

That covers how data is stored, retrieved, and written in HDFS.
How Data is Sent in HDFS Pipeline Transfer
1️⃣ The client sends a full block (e.g., Block 1) to DataNode 1.
2️⃣ DataNode 1 receives the full block and immediately forwards the same full block
to DataNode 2.
3️⃣ DataNode 2 receives the full block and then forwards the same full block to
DataNode 3.
4️⃣ Once DataNode 3 stores the block, it sends an acknowledgment back to DataNode 2 →
DataNode 1 → Client.

🚀 Important:
✔ The block is never divided across DataNodes; every replica is the complete block (internally the data moves through the pipeline in small packets, but each DataNode ends up storing the whole block).
✔ The transfer follows a chained pipeline process.
✔ The same full block is forwarded from one DataNode to the next until all replicas
are written.
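
The replication factor that drives this pipeline is configurable. A small sketch showing both the client-side default (dfs.replication) and a per-file change (the path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);   // replicas requested for files this client creates
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/bigdata.txt");
        // Raise this one file to 4 replicas; the NameNode schedules the extra copies.
        fs.setReplication(file, (short) 4);
        System.out.println("replication = " + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}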


To recap the write path step by step:

Client splits the file into blocks (usually 128MB per block).

Client contacts the NameNode, not to send the blocks, but to ask where to store
each block.

NameNode replies with the list of DataNodes for each block (including replicas).

Client sends the block directly to the first DataNode.

First DataNode sends to second, and second sends to third — this is replication,
done in a pipeline.

In short:

Client splits the file ✅ Yes

Client sends data to the NameNode ❌ No (it only sends a metadata request)

NameNode transfers data to DataNodes ❌ No (the NameNode just gives instructions)

Client sends a block to a DataNode, which forwards it to the next DataNode ✅ Yes (this is where replication happens)
