Big Data
HDFS (Hadoop Distributed File System) is a special kind of storage system designed
to handle large amounts of data efficiently. It is different from normal file
systems because it is built for big data processing. Let’s go through its key
features in a simple way.
In such a big system, some computers will fail from time to time.
HDFS does not stop working if a computer fails. Instead, it keeps extra copies of
the data (replication) on different computers.
If one computer crashes, HDFS automatically recovers the data from another copy.
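If you use the Java FileSystem API, the number of copies can even be changed per file. Here is a minimal sketch, assuming a Hadoop client on the classpath and an example path /data/sales.csv (not mentioned in these notes):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        // Load cluster settings (core-site.xml / hdfs-site.xml on the classpath).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Ask HDFS to keep 3 copies of this file's blocks on different DataNodes.
        Path file = new Path("/data/sales.csv"); // example path
        boolean accepted = fs.setReplication(file, (short) 3);
        System.out.println("Replication change accepted: " + accepted);

        fs.close();
    }
}
```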
HDFS is not meant for everyday use like a normal file system.
Instead of serving many small files with low latency, it is optimized for streaming huge files with high throughput.
It relaxes some strict file-system rules (POSIX semantics) to make data processing faster.
Normal file systems may struggle with very large files (terabytes of data); HDFS handles them by splitting each file into blocks and spreading the blocks across many computers.
In HDFS, you can write data to a file and then read it as many times as you like, but you cannot change existing data in the middle of a file.
You can only add more data at the end (this is called appending).
This is the write-once-read-many model.
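A minimal sketch of this write-once, append-only pattern with the Java FileSystem API (the path and the text written here are just examples; append must be enabled on the cluster):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceAppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/logs/events.log"); // example path

        // Write the file once; there is no API for changing bytes in the middle later.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("first batch of records\n");
        }

        // New data can only be added at the end of the existing file.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("more records appended at the end\n");
        }

        fs.close();
    }
}
```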
Think of it like this: If you have a huge book, it is easier to read it where it is
rather than carrying it to another place.
Similarly, in HDFS, instead of moving large files across the network, the
processing happens where the data is stored.
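You can see this placement yourself by asking the NameNode where each block of a file lives; processing frameworks use exactly this information to schedule tasks on the machines that already hold the data. A minimal sketch, using an example path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sales.csv"); // example path

        // Ask the NameNode which DataNodes hold each block of the file.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // A scheduler can use these host names to run its tasks on the
        // machines that already store the data, instead of moving the data.
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " is stored on: " + String.join(", ", block.getHosts()));
        }

        fs.close();
    }
}
```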
Final Summary
HDFS is made to store and process large data efficiently. It handles hardware
failures automatically, focuses on high-speed processing, supports very large
files, and follows a write-once-read-many model. Instead of moving data, it moves
computation to where the data is stored, making processing faster and more
efficient.
This is why big companies use HDFS for big data processing like analyzing large
customer data, social media trends, and scientific research. 🚀
The NameNode stores metadata (information about files and where their blocks are stored).
It does not store the actual data; it only records where each block lives.
It manages the file system structure (directories, permissions, etc.).
If a file is deleted, the NameNode updates its records, but the actual data stays on the DataNodes until it is cleaned up.
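A minimal sketch of reading this metadata through the Java FileSystem API (the paths are examples); everything printed here comes from the NameNode, and no DataNode is contacted:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadataExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Per-file metadata: size, block size, replication factor, permissions.
        FileStatus status = fs.getFileStatus(new Path("/data/sales.csv")); // example path
        System.out.println("Length      : " + status.getLen() + " bytes");
        System.out.println("Block size  : " + status.getBlockSize() + " bytes");
        System.out.println("Replication : " + status.getReplication());
        System.out.println("Permissions : " + status.getPermission());

        // A directory listing (like 'ls') is also pure metadata from the NameNode.
        for (FileStatus child : fs.listStatus(new Path("/data"))) {
            System.out.println(child.getPath() + (child.isDirectory() ? " (dir)" : ""));
        }

        fs.close();
    }
}
```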
📌 Example:
DataNodes are like bookshelves in a library where books (data) are stored.
If one shelf (DataNode) breaks, the librarian (NameNode) finds another shelf with a
copy.
📌 What Happens If a DataNode Fails?
Every DataNode sends regular heartbeats to the NameNode. When the heartbeats stop, the NameNode marks that DataNode as dead and creates new copies of its blocks on other DataNodes, using the surviving replicas.
📌 How a File Is Written to HDFS
The HDFS client divides the file into blocks before sending anything to HDFS.
It then sends a request to the NameNode asking where each block should be stored.
Example: a 300 MB file with the default 128 MB block size is split into:
Block 1 → 128 MB
Block 2 → 128 MB
Block 3 → 44 MB
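A tiny sketch of the splitting arithmetic (plain Java, no Hadoop APIs), matching the 300 MB example above:

```java
public class BlockSplitExample {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // default HDFS block size: 128 MB
        long fileSize  = 300L * 1024 * 1024; // example file: 300 MB

        long remaining = fileSize;
        int blockNumber = 1;
        while (remaining > 0) {
            long thisBlock = Math.min(blockSize, remaining);
            System.out.println("Block " + blockNumber + " -> " + thisBlock / (1024 * 1024) + " MB");
            remaining -= thisBlock;
            blockNumber++;
        }
        // Prints: Block 1 -> 128 MB, Block 2 -> 128 MB, Block 3 -> 44 MB
    }
}
```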
Each block is then sent to the DataNodes chosen by the NameNode (the pipeline transfer below shows how). Once every block and its replicas are written, the file is fully stored in HDFS, and the client can read it later.
✅ Key Takeaways
✔ The HDFS client splits the file into blocks, NOT the NameNode.
✔ The NameNode does NOT store file data—it only stores metadata.
✔ DataNodes do NOT divide blocks further—each block is stored as a whole.
✔ Replication happens in a pipeline (one DataNode forwards to the next).
This covers how data is stored, retrieved, and written in HDFS.
How Data is Sent in HDFS Pipeline Transfer
1️⃣ The client sends a full block (e.g., Block 1) to DataNode 1.
2️⃣ DataNode 1 receives the full block and immediately forwards the same full block
to DataNode 2.
3️⃣ DataNode 2 receives the full block and then forwards the same full block to
DataNode 3.
4️⃣ Once DataNode 3 stores the block, it sends an acknowledgment back to DataNode 2 →
DataNode 1 → Client.
🚀 Important:
✔ The block is not split during the transfer.
✔ The transfer follows a chained pipeline process.
✔ The same full block is forwarded from one DataNode to the next until all replicas are written.
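To make the chained forwarding and the acknowledgment path concrete, here is a toy in-memory simulation. It is not Hadoop code; the FakeDataNode class and the block contents are invented purely for illustration:

```java
import java.util.Arrays;

public class PipelineSimulation {

    // A toy stand-in for a DataNode: it stores the block, forwards it to the
    // next node in the chain, and returns an acknowledgment back up the chain.
    static class FakeDataNode {
        final String name;
        final FakeDataNode next; // next DataNode in the pipeline, or null
        byte[] storedBlock;

        FakeDataNode(String name, FakeDataNode next) {
            this.name = name;
            this.next = next;
        }

        boolean receiveBlock(byte[] block) {
            storedBlock = Arrays.copyOf(block, block.length); // keep the full block
            System.out.println(name + " stored " + block.length + " bytes");
            // Forward the same full block to the next DataNode, if there is one.
            boolean downstreamAck = (next == null) || next.receiveBlock(block);
            System.out.println(name + " sends ack upstream");
            return downstreamAck;
        }
    }

    public static void main(String[] args) {
        // Build the pipeline: DataNode 1 -> DataNode 2 -> DataNode 3.
        FakeDataNode dn3 = new FakeDataNode("DataNode 3", null);
        FakeDataNode dn2 = new FakeDataNode("DataNode 2", dn3);
        FakeDataNode dn1 = new FakeDataNode("DataNode 1", dn2);

        // The client sends the whole block once, to the first DataNode only.
        byte[] block1 = new byte[1024]; // pretend this is a 128 MB block
        boolean ack = dn1.receiveBlock(block1);
        System.out.println("Client received final ack: " + ack);
    }
}
```

In a real cluster the client streams the block through this pipeline in small packets rather than sending it in one piece, but the chain of DataNodes and the acknowledgments travelling back to the client work just as shown.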
Client splits the file into blocks (usually 128MB per block).
Client contacts the NameNode, not to send the blocks, but to ask where to store
each block.
NameNode replies with the list of DataNodes for each block (including replicas).
The client sends each block to the first DataNode on the list.
The first DataNode forwards the block to the second, and the second forwards it to the third; this is replication, done in a pipeline.
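All of these steps happen inside the HDFS client library; application code simply opens a stream and writes. A minimal sketch, assuming an example path /data/report.txt:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The client library does all the work listed above: splitting the stream
        // into blocks, asking the NameNode where each block should go, and pushing
        // each block through the DataNode replication pipeline.
        try (FSDataOutputStream out = fs.create(new Path("/data/report.txt"))) { // example path
            for (int i = 0; i < 1_000_000; i++) {
                out.writeBytes("row " + i + "\n");
            }
        }

        fs.close();
    }
}
```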