HDFS Presentation Kunal Yadav


Understanding Hadoop Distributed File System (HDFS)


• The Foundation of Big Data Storage in Hadoop

• Presented by: Kunal Yadav


• Submitted to: Professor Devesh Kumar Lal
Introduction to HDFS
• What is HDFS?

• - Primary storage system of Hadoop.


• - Designed to store vast amounts of data across multiple machines.

• Purpose of HDFS
• - Handles large data sets with high fault tolerance.
• - Supports distributed data storage.
Architecture of HDFS
• Master-Slave Architecture

• - Namenode (Master): Manages metadata and the file directory.
• - Datanode (Slave): Stores actual data and communicates with the Namenode.

• Replication Factor
• - Data is split into blocks, each replicated across nodes for redundancy.
Key Concepts in HDFS
• Blocks

• - Fixed-size data units (default 128 MB).

• Replication
• - Default replication factor of 3 per block.

• Fault Tolerance
• - Data remains available despite node failures.
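With the defaults above (128 MB blocks, replication factor 3), the on-cluster footprint of a file is easy to estimate. A minimal sketch (illustrative only; `hdfs_footprint` is a hypothetical helper, not a Hadoop API):

```python
import math

BLOCK_SIZE_MB = 128   # HDFS default block size
REPLICATION = 3       # HDFS default replication factor


def hdfs_footprint(file_size_mb: float) -> tuple[int, float]:
    """Return (number of blocks, total raw cluster storage in MB)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every block is stored REPLICATION times; HDFS only stores the
    # actual bytes of the last, partially full block, not a full 128 MB.
    raw_mb = file_size_mb * REPLICATION
    return blocks, raw_mb


# A 1 GB (1024 MB) file: 8 blocks, 3 GB of raw cluster storage.
print(hdfs_footprint(1024))   # -> (8, 3072)
```

Note that replication multiplies storage cost by three, which is the price paid for the fault tolerance described on this slide.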
Namenode & Datanode Roles
• Namenode (Master)

• - Manages the filesystem namespace.


• - Controls access, keeps track of file metadata.

• Datanodes (Slaves)
• - Store and retrieve data blocks.
• - Send regular status updates to Namenode.
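The division of labour above can be sketched as a toy data structure (all names here are hypothetical, not Hadoop's real classes): the Namenode holds only metadata, while the actual bytes live on Datanodes.

```python
# Toy model of the Namenode/Datanode split (hypothetical names).
# The Namenode keeps only metadata: which blocks make up a file,
# and which Datanodes hold each block replica.

namenode_namespace = {
    # file path      -> ordered list of block IDs
    "/logs/app.log": ["blk_001", "blk_002"],
}

block_locations = {
    # block ID -> Datanodes holding a replica (replication factor 3)
    "blk_001": ["datanode-1", "datanode-2", "datanode-3"],
    "blk_002": ["datanode-2", "datanode-3", "datanode-4"],
}

# Datanodes store the actual bytes; the Namenode never does.
datanode_storage = {
    "datanode-1": {"blk_001": b"first 128 MB of the file"},
    "datanode-2": {"blk_001": b"first 128 MB of the file",
                   "blk_002": b"remainder of the file"},
}

# To read /logs/app.log, a client asks the Namenode for the block IDs
# and their locations, then fetches each block from any listed Datanode.
for blk in namenode_namespace["/logs/app.log"]:
    print(blk, "->", block_locations[blk])
```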
Data Storage Process in HDFS
• Data Write Process

• - Client splits data into blocks.


• - Namenode assigns Datanodes for storage.
• - Blocks are written to Datanodes with replication.

• Data Read Process


• - Client requests file from Namenode.
• - Namenode provides Datanode locations of blocks.
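The write and read paths above can be sketched end to end. This is a toy simulation with hypothetical names (a real HDFS write pipelines replicas from Datanode to Datanode rather than having the client write each copy):

```python
BLOCK_SIZE = 4        # tiny block size so the example actually splits data
REPLICATION = 2

datanodes = {f"dn{i}": {} for i in range(3)}   # node -> {block_id: bytes}
namespace = {}                                  # path -> [block_id, ...]
locations = {}                                  # block_id -> [node, ...]


def write(path: str, data: bytes) -> None:
    """Client splits data into blocks; the 'Namenode' assigns Datanodes."""
    block_ids = []
    for i in range(0, len(data), BLOCK_SIZE):
        blk_id = f"blk_{len(locations)}"
        # Pick the least-loaded Datanodes as replica targets.
        targets = sorted(datanodes, key=lambda n: len(datanodes[n]))[:REPLICATION]
        for node in targets:                    # replicate the block
            datanodes[node][blk_id] = data[i:i + BLOCK_SIZE]
        locations[blk_id] = targets
        block_ids.append(blk_id)
    namespace[path] = block_ids


def read(path: str) -> bytes:
    """Client asks the 'Namenode' for locations, then reads any replica."""
    return b"".join(datanodes[locations[b][0]][b] for b in namespace[path])


write("/demo.txt", b"hello hdfs!")
print(read("/demo.txt"))    # -> b'hello hdfs!'
```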
Fault Tolerance in HDFS
• Data Redundancy

• - Replication ensures data availability.

• Heartbeat Mechanism
• - Datanodes send "heartbeat" signals to the Namenode.
• - Namenode re-replicates data if a Datanode fails.
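A toy version of that recovery logic (hypothetical names; in real HDFS, Datanodes heartbeat every few seconds and the Namenode waits several minutes of silence before declaring a node dead):

```python
REPLICATION = 2

# block_id -> set of live Datanodes holding a replica
locations = {
    "blk_a": {"dn1", "dn2"},
    "blk_b": {"dn2", "dn3"},
}
live_nodes = {"dn1", "dn2", "dn3"}


def on_missed_heartbeats(dead_node: str) -> list[str]:
    """Namenode-side logic when a Datanode stops sending heartbeats."""
    live_nodes.discard(dead_node)
    re_replicated = []
    for blk, holders in locations.items():
        holders.discard(dead_node)              # its replicas are lost
        while len(holders) < REPLICATION:
            # Copy the block to a live node that does not yet hold it.
            target = min(live_nodes - holders)
            holders.add(target)
            re_replicated.append(blk)
    return re_replicated


# dn2 stops heartbeating: blk_a and blk_b each regain a replica elsewhere.
print(on_missed_heartbeats("dn2"))
```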
Advantages of HDFS
• Scalable

• - Increase storage by adding nodes.

• Cost-Effective
• - Uses commodity hardware.

• High Availability
• - Replication ensures data remains accessible.
Limitations of HDFS
• Not for Small Files

• - Inefficient due to block size.

• Latency
• - Slower for real-time processing.

• Single Point of Failure


• - Namenode failure can impact operations (mitigated with HA setups).
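The small-files limitation comes from Namenode memory: every file, directory, and block is an in-heap metadata object, and a commonly cited rule of thumb is roughly 150 bytes per object. A quick estimate (the 150-byte figure is an approximation, not an exact specification):

```python
import math

BYTES_PER_OBJECT = 150   # rough rule of thumb for Namenode heap usage
BLOCK_SIZE_MB = 128


def namenode_heap_mb(num_files: int, avg_file_mb: float) -> float:
    """Approximate Namenode heap needed for file metadata, in MB."""
    blocks_per_file = max(1, math.ceil(avg_file_mb / BLOCK_SIZE_MB))
    objects = num_files * (1 + blocks_per_file)  # 1 file obj + block objs
    return objects * BYTES_PER_OBJECT / 1e6


# Ten million 1 MB files vs. roughly the same data in 128 MB files:
print(namenode_heap_mb(10_000_000, 1))   # ~3000 MB of heap for metadata
print(namenode_heap_mb(80_000, 128))     # ~24 MB of heap for metadata
```

The same volume of data costs two orders of magnitude more Namenode memory when stored as many small files, which is why HDFS favours large files.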
HDFS Use Cases
• Big Data Storage

• - Ideal for large data archives.

• Data Processing
• - Supports batch processing like log analysis.

• Data Backup
• - Secure, distributed storage for large datasets.
Conclusion
• Summary

• - HDFS is crucial to Hadoop’s reliable, large-scale data storage and processing.

• Future of HDFS
• - Advancements for scalability and resilience.
