HDFS Paper Summary
1. Introduction
HDFS is the file system component of Hadoop, designed for handling
large-scale data with high reliability and fault tolerance. While its
interface is inspired by the UNIX file system, it prioritizes performance
over strict adherence to standards. HDFS stores metadata separately from
application data, similar to other distributed file systems like GFS, PVFS,
and Lustre.
2. HDFS Architecture
A. NameNode
The NameNode manages the hierarchical namespace of files and
directories. It stores metadata, including:
• File attributes (permissions, timestamps, quotas)
• Mapping of file blocks to DataNodes
Each file is divided into large blocks (default 128 MB), and each block is
replicated (typically three copies). The NameNode directs clients to the
appropriate DataNodes for read and write operations, and it keeps all
namespace metadata in RAM for fast access.
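As a minimal sketch of how these defaults are expressed, the snippet below sets the block size and replication factor on a Hadoop Configuration object. The property names dfs.blocksize and dfs.replication are the usual Hadoop 2.x-and-later keys, and both values can also be chosen per file when it is created; the class name is invented for illustration.

import org.apache.hadoop.conf.Configuration;

public class HdfsDefaultsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Cluster-wide defaults; property names as used in Hadoop 2.x and later.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks
        conf.setInt("dfs.replication", 3);                 // three replicas per block

        System.out.println("block size  = " + conf.getLong("dfs.blocksize", 0));
        System.out.println("replication = " + conf.getInt("dfs.replication", 0));
    }
}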
Failure Recovery:
• Stores a persistent checkpoint of the namespace and logs changes
in a journal.
• During a restart, it reconstructs the latest state by loading the
checkpoint and replaying the journal (a simplified sketch follows this list).
• Redundant copies of these files can be stored on different servers for
durability.
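The checkpoint-plus-journal scheme can be pictured with a toy write-ahead-log example. The sketch below is purely illustrative: it models the namespace as a set of path strings and invents a two-operation journal format, rather than using HDFS's actual image and edit-log classes.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.TreeSet;

/** Toy illustration of checkpoint + journal recovery (not HDFS code). */
public class NamespaceRecoverySketch {
    private final TreeSet<String> namespace = new TreeSet<>(); // in-memory "namespace"

    /** Rebuild the latest state: load the checkpoint, then replay the journal. */
    void recover(Path checkpoint, Path journal) throws IOException {
        namespace.clear();
        namespace.addAll(Files.readAllLines(checkpoint, StandardCharsets.UTF_8));

        // Each journal record is "MKDIR <path>" or "DELETE <path>", appended in order.
        for (String record : Files.readAllLines(journal, StandardCharsets.UTF_8)) {
            String[] parts = record.split(" ", 2);
            switch (parts[0]) {
                case "MKDIR":  namespace.add(parts[1]);    break;
                case "DELETE": namespace.remove(parts[1]); break;
                default: throw new IOException("corrupt journal record: " + record);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path checkpoint = Files.createTempFile("checkpoint", ".txt");
        Path journal    = Files.createTempFile("journal", ".txt");
        Files.write(checkpoint, List.of("/user", "/user/alice"));
        Files.write(journal,    List.of("MKDIR /user/bob", "DELETE /user/alice"));

        NamespaceRecoverySketch nn = new NamespaceRecoverySketch();
        nn.recover(checkpoint, journal);
        System.out.println(nn.namespace); // prints [/user, /user/bob]
    }
}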
B. DataNodes
Each DataNode stores blocks of data and manages:
• Block Metadata: checksums and generation stamps used to detect
corruption (see the checksum sketch after this list).
• Heartbeat Mechanism: Sends periodic signals to the NameNode to
confirm its availability.
• Block Reports: Provide up-to-date information on stored blocks.
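As a concrete illustration of the corruption check, the sketch below recomputes a CRC32 checksum over a block's bytes and compares it with the value recorded at write time. It is a simplification: HDFS checksums the data in small chunks and keeps them in a separate per-block metadata file, but the principle is the same.

import java.util.zip.CRC32;

/** Illustrative only: verify a block replica against a stored checksum. */
public class ChecksumSketch {

    /** Returns true when the data still matches the checksum recorded at write time. */
    static boolean verify(byte[] blockData, long storedChecksum) {
        CRC32 crc = new CRC32();
        crc.update(blockData, 0, blockData.length);
        return crc.getValue() == storedChecksum;
    }

    public static void main(String[] args) {
        byte[] data = "block contents".getBytes();
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        long checksum = crc.getValue();   // what would be persisted alongside the block

        data[0] ^= 0x1;                   // simulate on-disk corruption
        System.out.println(verify(data, checksum)); // false -> replica reported as corrupt
    }
}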
C. HDFS Client
The HDFS client provides an API for user applications, allowing them to:
• Read and write files: The client first contacts the NameNode for
metadata and then transfers data directly with DataNodes (see the
client sketch after this list).
• Configure replication: Default is three copies, but critical files can
have a higher replication factor.
• Optimize data locality: Applications like MapReduce can schedule
computations near the data to minimize transfer overhead.
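The sketch below shows this client flow with the standard org.apache.hadoop.fs.FileSystem API. The file path and the replication factor of 5 are invented for illustration, and the configuration is assumed to point at an existing cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to the default file system (HDFS)

        Path file = new Path("/user/demo/report.txt");   // hypothetical path

        // Write: the client asks the NameNode to allocate blocks, then streams
        // the data through a pipeline of DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello, hdfs");
        }

        // Raise the replication factor for a critical file (cluster default is 3).
        fs.setReplication(file, (short) 5);

        // Read: the client asks the NameNode for block locations, then reads
        // directly from a DataNode holding each block.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}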
3. File I/O Operations and Replica Management
A. File Read and Write
• Only one client may write to or append to a given file at a time
(single-writer, multiple-reader model).
• The writer holds a lease on the file, which it renews periodically via
heartbeats. If the lease expires, another client may claim the file.
Files are split into blocks that are written in a pipeline through multiple
DataNodes for redundancy. HDFS supports appending to existing files, but
previously written data cannot be modified in place.
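A minimal append sketch, assuming the file already exists and the cluster permits appends; the lease behaviour described above is enforced by the NameNode, not by anything the client has to code. The path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/user/demo/events.log");  // hypothetical existing file

        // append() succeeds only while this client holds the file's lease;
        // a second concurrent writer is rejected by the NameNode.
        try (FSDataOutputStream out = fs.append(log)) {
            // New bytes go to the end of the file; existing bytes are never rewritten.
            out.writeBytes("new record\n");
        }
        fs.close();
    }
}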
B. Replica Management
The NameNode monitors the replication status:
• Under-replicated blocks: Scheduled for additional copies (a
client-side check is sketched after this list).
• Over-replicated blocks: Excess copies are deleted to free up
space.
• Replica Balancing: Replicas are spread across DataNodes and racks
so that the failure of a single node or rack cannot destroy all copies
of a block.
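Re-replication and deletion are scheduled inside the NameNode, but a client can observe the resulting state. The sketch below uses real FileSystem calls (with a hypothetical path) to compare each block's live replica locations against the file's target replication factor.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheckSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/user/demo/report.txt")); // hypothetical
        short target = st.getReplication();               // desired number of replicas

        BlockLocation[] blocks = fs.getFileBlockLocations(st, 0, st.getLen());
        for (BlockLocation b : blocks) {
            int live = b.getHosts().length;               // DataNodes currently holding this block
            String state = live < target ? "under-replicated"
                         : live > target ? "over-replicated" : "healthy";
            System.out.printf("offset %d: %d/%d replicas (%s)%n",
                              b.getOffset(), live, target, state);
        }
        fs.close();
    }
}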
C. Balancer Tool
To prevent uneven disk utilization, HDFS includes a balancer, which
redistributes block replicas to under-utilized nodes while maintaining
replication guarantees.
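The balancer's core test is simple to state: a node is considered balanced when its disk utilization stays within a threshold of the utilization of the cluster as a whole. The sketch below writes that test down directly; the threshold and the capacity figures are invented for illustration, and the real tool does considerably more while it moves blocks.

/** Toy version of the balancer's balance test (not the actual balancer code). */
public class BalancerSketch {

    /** A node is balanced when its utilization is within `threshold`
     *  of the utilization of the cluster as a whole. */
    static boolean isBalanced(double nodeUsed, double nodeCapacity,
                              double clusterUsed, double clusterCapacity,
                              double threshold) {
        double nodeUtil    = nodeUsed / nodeCapacity;
        double clusterUtil = clusterUsed / clusterCapacity;
        return Math.abs(nodeUtil - clusterUtil) <= threshold;
    }

    public static void main(String[] args) {
        // Invented numbers: a 100 TB cluster that is 60% full;
        // one 10 TB node holds 8.5 TB (85% full).
        boolean ok = isBalanced(8.5, 10.0, 60.0, 100.0, 0.10);
        System.out.println(ok ? "balanced" : "candidate for the balancer"); // candidate
    }
}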
D. Node Decommissioning
When a DataNode is removed from service, the NameNode gradually
transfers its blocks to other nodes before marking it as decommissioned.
E. Inter-Cluster Data Copying
HDFS provides DistCp, a MapReduce-based tool for efficiently copying
large datasets between HDFS clusters.
4. Practice at Yahoo!
HDFS is central to Yahoo!'s Web Map application, which indexes the World
Wide Web, processing 500 TB of intermediate data in 75 hours.
A. Performance Benchmarks
HDFS achieves:
• 66 MB/s per node (Read)
• 40 MB/s per node (Write)
• Busy cluster throughput: ~1 MB/s per node (due to job mix)
B. Security and Resource Management
• A UNIX-style permission model controls access to files and
directories (see the sketch after this list).
• Storage quotas prevent excessive resource consumption by
individual users.
• Hadoop Archives (HAR) optimize storage for small files.
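A small sketch of the permission side using the standard FileSystem API; the directory, owner, and group names are hypothetical, and changing ownership normally requires superuser privileges. Quotas and Hadoop Archives are managed with separate administrative tools rather than through these calls.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class PermissionsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/user/demo/private");   // hypothetical directory

        fs.mkdirs(dir);
        // rwx for the owner, r-x for the group, nothing for others (i.e. mode 750).
        fs.setPermission(dir,
                new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE));
        fs.setOwner(dir, "demo", "analytics");       // requires superuser privileges
        fs.close();
    }
}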
5. Conclusion
HDFS is a highly scalable, fault-tolerant distributed file system
designed for handling big data workloads. It is widely adopted in
production environments like Yahoo!, where it supports petabyte-scale
data storage and processing.
Key Benefits of HDFS:
• Scalable and High-Performance: Supports clusters with
thousands of nodes.
• Optimized for Batch Processing: Designed for MapReduce
workloads.
• Fault-Tolerant and Reliable: Replication-based redundancy
ensures data availability.
• Flexible and Extensible: Open-source framework allows
customization.
HDFS continues to evolve, integrating features like real-time processing
(HBase, Scribe) and stronger security models for enterprise use.