The Hadoop Distributed File System (HDFS) is designed for large-scale data storage with high reliability and fault tolerance, utilizing a centralized NameNode for metadata management and DataNodes for actual data storage. HDFS supports a single-writer, multiple-reader model and employs a rack-aware block placement policy to optimize performance and fault tolerance. Widely used in environments like Yahoo!, HDFS efficiently handles petabyte-scale data processing while ensuring data integrity and availability through replication and monitoring mechanisms.


The Hadoop Distributed File System (HDFS) – Summary

1. Introduction
HDFS is the file system component of Hadoop, designed for handling
large-scale data with high reliability and fault tolerance. While its
interface borrows from UNIX file-system conventions, it relaxes strict
POSIX semantics in favor of performance. HDFS stores metadata separately from
application data, similar to other distributed file systems like GFS, PVFS,
and Lustre.

Key Characteristics of HDFS:


• Metadata Management: A centralized NameNode stores the
namespace (metadata), while DataNodes store actual file data.
• Fault Tolerance: Instead of using RAID, HDFS ensures data
durability through replication across multiple DataNodes.
• Scalability: HDFS is designed to handle thousands of nodes and
petabytes of data efficiently.

2. HDFS Architecture

A. NameNode
The NameNode manages the hierarchical namespace of files and
directories. It stores metadata, including:
• File attributes (permissions, timestamps, quotas)
• Mapping of file blocks to DataNodes
Each file is divided into large blocks (default 128MB), and each block is
replicated (typically three copies). The NameNode directs clients to the
appropriate DataNodes for reading or writing operations. It maintains all
namespace metadata in RAM for faster performance.
Failure Recovery:
• Stores a persistent checkpoint of the namespace and logs changes
in a journal.
• During a restart, it reconstructs the latest state by reading the
checkpoint and replaying the journal.

• Redundant copies of these files can be stored on different servers for
durability.
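The checkpoint-plus-journal recovery described above can be sketched as follows. This is an illustrative model, not HDFS source code: the namespace is simplified to a path-to-metadata dictionary, and the journal to a list of logged operations.

```python
# Sketch (not HDFS source): reconstructing namespace state on restart by
# loading the persistent checkpoint and replaying the journaled changes.

def replay(checkpoint, journal):
    """Apply journaled operations, in order, to a checkpointed image."""
    namespace = dict(checkpoint)          # start from the checkpoint
    for op, path, value in journal:       # replay the journal
        if op == "create":
            namespace[path] = value
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

checkpoint = {"/logs/a": "meta-a"}
journal = [("create", "/logs/b", "meta-b"), ("delete", "/logs/a", None)]
print(replay(checkpoint, journal))  # {'/logs/b': 'meta-b'}
```

The journal only ever grows between checkpoints, which is why the CheckpointNode described later periodically folds it back into the image.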

B. DataNodes
Each DataNode stores blocks of data and manages:
• Block Metadata: Includes checksums and generation timestamps
to detect corruption.
• Heartbeat Mechanism: Sends periodic signals to the NameNode to
confirm its availability.
• Block Reports: Provide up-to-date information on stored blocks.

During failures, the NameNode schedules re-replication of the lost
copies so that the replication factor is maintained.
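Heartbeat-based liveness detection can be sketched as below; the node names and timeout value are hypothetical, chosen only for illustration.

```python
# Sketch: a DataNode that has not sent a heartbeat within the timeout is
# treated as dead, making its blocks candidates for re-replication.

def dead_nodes(last_heartbeat, now, timeout):
    """Return the nodes whose last heartbeat is older than the timeout."""
    return {node for node, t in last_heartbeat.items() if now - t > timeout}

beats = {"dn1": 100.0, "dn2": 95.0, "dn3": 40.0}  # node -> last heartbeat time
print(dead_nodes(beats, now=100.0, timeout=30.0))  # {'dn3'}
```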

C. HDFS Client
The HDFS client provides an API for user applications, allowing them to:
• Read and write files: The client first contacts the NameNode for
metadata and then interacts directly with DataNodes.
• Configure replication: Default is three copies, but critical files can
have a higher replication factor.
• Optimize data locality: Applications like MapReduce can schedule
computations near the data to minimize transfer overhead.
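The two-step read path, metadata from the NameNode followed by direct reads from DataNodes, can be modeled as below. The file path, block ids, and node names are hypothetical, and both services are simulated with plain dictionaries.

```python
# Sketch: an HDFS-style read. Step 1 asks the (simulated) NameNode for
# the file's block list and replica locations; step 2 fetches each block
# directly from a DataNode that holds a replica.

BLOCK_MAP = {"/data/file1": ["blk_1", "blk_2"]}   # file -> ordered blocks
REPLICAS = {                                       # block -> replica holders
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}
STORAGE = {("dn1", "blk_1"): b"hello ", ("dn2", "blk_2"): b"world"}

def read_file(path):
    data = b""
    for block in BLOCK_MAP[path]:             # 1. metadata lookup
        for dn in REPLICAS[block]:            # 2. try replicas in order
            if (dn, block) in STORAGE:
                data += STORAGE[(dn, block)]  # read from that DataNode
                break
    return data

print(read_file("/data/file1"))  # b'hello world'
```

Note that file bytes never pass through the NameNode; it only hands out locations, which is what keeps it from becoming a data-path bottleneck.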

D. Image and Journal Management


The NameNode maintains the namespace image (metadata snapshot)
and journal logs (record of transactions). These ensure consistency and
durability.
• CheckpointNode: Periodically merges the journal into the
namespace image and replaces the existing checkpoint.
• BackupNode: Maintains an up-to-date, in-memory copy of the
namespace and can create checkpoints without downloading state from the
NameNode.

3. File Operations and Replica Management

A. File I/O Operations


HDFS follows a single-writer, multiple-reader model:
• A file can be read concurrently by multiple clients.

• Only one client can write or append to a file at a time.
• The writer holds a lease, renewed via periodic heartbeats; if the
lease expires, another client may claim write access.
Files are divided into blocks and written in a pipeline across multiple
DataNodes for redundancy. HDFS supports appending to a file, but
existing data cannot be modified in place.
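The single-writer lease can be sketched as a small state machine; the writer names, times, and lease duration here are hypothetical.

```python
# Sketch: a lease that the current writer extends via heartbeats and
# that another client can only take over after it expires.

class Lease:
    def __init__(self, holder, expires):
        self.holder, self.expires = holder, expires

    def renew(self, holder, now, duration):
        # a heartbeat from the current holder extends the lease
        if holder == self.holder:
            self.expires = now + duration

    def acquire(self, holder, now, duration):
        # a new writer succeeds only once the old lease has expired
        if now > self.expires:
            self.holder, self.expires = holder, now + duration
            return True
        return holder == self.holder

lease = Lease("writer-1", expires=10.0)
print(lease.acquire("writer-2", now=5.0, duration=10.0))   # False
print(lease.acquire("writer-2", now=11.0, duration=10.0))  # True
```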

B. Block Placement Policy


HDFS follows a rack-aware policy for placing block replicas:
1. The first replica is stored on the node where the writer runs.
2. The second and third replicas are placed on two different nodes in a
separate rack.
3. Additional replicas are spread across the cluster for load
balancing.
This placement strategy ensures fault tolerance and optimizes read/write
bandwidth.
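The three placement rules above can be sketched directly; the rack topology and node names are made up for illustration, and real HDFS additionally randomizes its choices and checks node health and load.

```python
# Sketch of rack-aware placement: first replica on the writer's node,
# second and third on two different nodes in one other rack.

RACKS = {"dn1": "r1", "dn2": "r1", "dn3": "r2", "dn4": "r2", "dn5": "r3"}

def place_replicas(writer):
    first = writer                                        # rule 1
    remote = [n for n in RACKS if RACKS[n] != RACKS[writer]]
    remote_rack = RACKS[remote[0]]                        # pick one other rack
    same_remote = [n for n in remote if RACKS[n] == remote_rack]
    return [first] + same_remote[:2]                      # rule 2

print(place_replicas("dn1"))  # ['dn1', 'dn3', 'dn4']
```

Keeping two replicas in a single remote rack limits cross-rack write traffic while still surviving the loss of an entire rack.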

C. Replica Management
The NameNode monitors the replication status:
• Under-replicated blocks: Scheduled for additional copies.
• Over-replicated blocks: Excess copies are deleted to free up
space.
• Replica Balancing: Ensures even distribution across DataNodes
and racks.
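The under/over-replication check reduces to comparing each block's replica count against the target factor; block ids and node names below are hypothetical.

```python
# Sketch: classify blocks by replication status against a target factor,
# as the NameNode's replication monitor does.

def classify(block_replicas, target=3):
    under = {b for b, r in block_replicas.items() if len(r) < target}
    over = {b for b, r in block_replicas.items() if len(r) > target}
    return under, over

replicas = {
    "blk_1": ["dn1", "dn2"],                  # needs one more copy
    "blk_2": ["dn1", "dn2", "dn3", "dn4"],    # one copy too many
    "blk_3": ["dn1", "dn2", "dn3"],           # healthy
}
under, over = classify(replicas)
print(under, over)  # {'blk_1'} {'blk_2'}
```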

D. Balancer Tool
To prevent uneven disk utilization, HDFS includes a balancer, which
redistributes block replicas to under-utilized nodes while maintaining
replication guarantees.
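A minimal sketch of how a balancer might pick move candidates: nodes whose disk utilization deviates from the cluster average by more than a threshold. The real balancer uses a comparable utilization-threshold idea, but the numbers and node names here are invented.

```python
# Sketch: flag over-utilized nodes as donors and under-utilized nodes as
# receivers, relative to the cluster-wide mean utilization.

def plan_moves(usage, threshold=0.10):
    mean = sum(usage.values()) / len(usage)
    donors = [n for n, u in usage.items() if u > mean + threshold]
    receivers = [n for n, u in usage.items() if u < mean - threshold]
    return donors, receivers

usage = {"dn1": 0.90, "dn2": 0.55, "dn3": 0.20}  # fraction of disk used
print(plan_moves(usage))  # (['dn1'], ['dn3'])
```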

E. Data Integrity (Block Scanner)


• Each DataNode periodically scans stored blocks to verify checksums.
• Corrupt blocks are detected and replaced with healthy copies from
other DataNodes.
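Checksum verification amounts to recomputing a checksum over the stored bytes and comparing it with the one recorded at write time; a CRC-32 is used below purely for illustration.

```python
# Sketch: block-scanner-style integrity check. A mismatch between the
# recomputed and stored checksum marks the replica as corrupt.

import zlib

def verify(block_data, stored_checksum):
    return zlib.crc32(block_data) == stored_checksum

block = b"block contents"
good = zlib.crc32(block)          # checksum recorded when block was written
print(verify(block, good))        # True
print(verify(b"corrupted", good)) # False
```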

F. Node Decommissioning
When a DataNode is removed from service, the NameNode gradually
transfers its blocks to other nodes before marking it as decommissioned.

G. Inter-Cluster Data Copying
HDFS provides DistCp, a MapReduce-based tool for efficiently copying
large datasets between HDFS clusters.

4. HDFS at Yahoo! (Practical Use Case)


Yahoo! operates some of the largest HDFS clusters, supporting
massive-scale data processing.

A. Cluster Configuration at Yahoo!


• Nodes per Cluster: 3,500
• Total Storage: 9.8 PB (replicated 3 times) → 3.3 PB usable storage
• Replication Factor: 3 copies per block
• Network Setup: Racks of 40 nodes connected via core switches
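The usable-storage figure above is simple arithmetic: raw capacity divided by the replication factor.

```python
# 9.8 PB of raw capacity with 3x replication leaves ~3.3 PB usable.
raw_pb, replication = 9.8, 3
print(round(raw_pb / replication, 1))  # 3.3
```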

In total, the cluster stores about 60 million files in 63 million
blocks, and clients create roughly 2 million new files per day.

HDFS is crucial for Yahoo!’s Web Map, which indexes the World Wide
Web, processing 500 TB of intermediate data in 75 hours.

B. Durability & Fault Tolerance


• Node Failures: About 0.8% of nodes fail each month (~1-2 nodes
per day).
• Block Recovery: Lost replicas are recreated in ~2 minutes using
parallel re-replication.
• Rack Failures: HDFS tolerates rack switch failures but can
experience temporary data unavailability during core switch failures.

C. Performance Benchmarks
HDFS achieves:
• 66 MB/s per node (Read)
• 40 MB/s per node (Write)
• Busy cluster throughput: ~1 MB/s per node (due to job mix)

For large-scale sorting (Gray Sort Competition):


• 1 PB dataset sorted in ~58,500s using 3,658 nodes
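A back-of-the-envelope calculation of the throughput implied by that result (assuming 1 PB = 10^15 bytes; the benchmark's exact byte count may differ):

```python
# Aggregate and per-node throughput implied by sorting 1 PB in 58,500 s
# on 3,658 nodes.
bytes_sorted = 1e15
seconds, nodes = 58_500, 3_658
aggregate = bytes_sorted / seconds   # cluster-wide bytes/s
per_node = aggregate / nodes         # bytes/s per node
print(round(aggregate / 1e9, 1), round(per_node / 1e6, 1))  # 17.1 4.7
```

The per-node figure is well below the raw benchmark numbers above, since a sort also spends time on shuffle, merge, and writing replicated output.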

D. Security and Resource Management
• UNIX-style permission framework controls access to files and
directories.
• Storage quotas prevent excessive resource consumption by
individual users.
• Hadoop Archives (HAR) optimize storage for small files.

5. Conclusion
HDFS is a highly scalable, fault-tolerant distributed file system
designed for handling big data workloads. It is widely adopted in
production environments like Yahoo!, where it supports petabyte-scale
data storage and processing.
Key Benefits of HDFS:
• Scalable and High-Performance: Supports clusters with
thousands of nodes.
• Optimized for Batch Processing: Designed for MapReduce
workloads.
• Fault-Tolerant and Reliable: Replication-based redundancy
ensures data availability.
• Flexible and Extensible: Open-source framework allows
customization.
HDFS continues to evolve, integrating features like real-time processing
(HBase, Scribe) and stronger security models for enterprise use.
