HDFS Paper Summary
1. Introduction
HDFS is the file system component of Hadoop, designed for handling
large-scale data with high reliability and fault tolerance. While its
interface is inspired by the UNIX file system, it prioritizes performance
over strict adherence to standards. HDFS stores metadata separately from
application data, similar to other distributed file systems like GFS, PVFS,
and Lustre.
2. HDFS Architecture
A. NameNode
The NameNode manages the hierarchical namespace of files and
directories. It stores metadata, including:
• File attributes (permissions, timestamps, quotas)
• Mapping of file blocks to DataNodes
Each file is divided into large blocks (default 128 MB), and each block is
replicated (typically three copies). The NameNode directs clients to the
appropriate DataNodes for read and write operations, and it keeps all
namespace metadata in RAM for fast access.
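As a minimal sketch of how these defaults are expressed, the snippet below sets the block size and replication factor on a Hadoop Configuration object. The property names dfs.blocksize and dfs.replication are the usual Hadoop 2.x-and-later keys, and both values can also be chosen per file when it is created; the class name is invented for illustration.

import org.apache.hadoop.conf.Configuration;

public class HdfsDefaultsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Cluster-wide defaults; property names as used in Hadoop 2.x and later.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks
        conf.setInt("dfs.replication", 3);                 // three replicas per block

        System.out.println("block size  = " + conf.getLong("dfs.blocksize", 0));
        System.out.println("replication = " + conf.getInt("dfs.replication", 0));
    }
}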
Failure Recovery:
• Stores a persistent checkpoint of the namespace and logs changes
in a journal.
• During a restart, it reconstructs the latest state by loading the
checkpoint and replaying the journal (a simplified sketch follows this list).
• Redundant copies of these files can be stored on different servers for
durability.
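The checkpoint-plus-journal scheme can be pictured with a toy write-ahead-log example. The sketch below is purely illustrative: it models the namespace as a set of path strings and invents a two-operation journal format, rather than using HDFS's actual image and edit-log classes.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.TreeSet;

/** Toy illustration of checkpoint + journal recovery (not HDFS code). */
public class NamespaceRecoverySketch {
    private final TreeSet<String> namespace = new TreeSet<>(); // in-memory "namespace"

    /** Rebuild the latest state: load the checkpoint, then replay the journal. */
    void recover(Path checkpoint, Path journal) throws IOException {
        namespace.clear();
        namespace.addAll(Files.readAllLines(checkpoint, StandardCharsets.UTF_8));

        // Each journal record is "MKDIR <path>" or "DELETE <path>", appended in order.
        for (String record : Files.readAllLines(journal, StandardCharsets.UTF_8)) {
            String[] parts = record.split(" ", 2);
            switch (parts[0]) {
                case "MKDIR":  namespace.add(parts[1]);    break;
                case "DELETE": namespace.remove(parts[1]); break;
                default: throw new IOException("corrupt journal record: " + record);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path checkpoint = Files.createTempFile("checkpoint", ".txt");
        Path journal    = Files.createTempFile("journal", ".txt");
        Files.write(checkpoint, List.of("/user", "/user/alice"));
        Files.write(journal,    List.of("MKDIR /user/bob", "DELETE /user/alice"));

        NamespaceRecoverySketch nn = new NamespaceRecoverySketch();
        nn.recover(checkpoint, journal);
        System.out.println(nn.namespace); // prints [/user, /user/bob]
    }
}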
B. DataNodes
Each DataNode stores blocks of data and manages:
• Block Metadata: checksums and generation stamps used to detect
corruption (see the checksum sketch after this list).
• Heartbeat Mechanism: Sends periodic signals to the NameNode to
confirm its availability.
• Block Reports: Provide up-to-date information on stored blocks.
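As a concrete illustration of the corruption check, the sketch below recomputes a CRC32 checksum over a block's bytes and compares it with the value recorded at write time. It is a simplification: HDFS checksums the data in small chunks and keeps them in a separate per-block metadata file, but the principle is the same.

import java.util.zip.CRC32;

/** Illustrative only: verify a block replica against a stored checksum. */
public class ChecksumSketch {

    /** Returns true when the data still matches the checksum recorded at write time. */
    static boolean verify(byte[] blockData, long storedChecksum) {
        CRC32 crc = new CRC32();
        crc.update(blockData, 0, blockData.length);
        return crc.getValue() == storedChecksum;
    }

    public static void main(String[] args) {
        byte[] data = "block contents".getBytes();
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        long checksum = crc.getValue();   // what would be persisted alongside the block

        data[0] ^= 0x1;                   // simulate on-disk corruption
        System.out.println(verify(data, checksum)); // false -> replica reported as corrupt
    }
}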
C. HDFS Client
The HDFS client provides an API for user applications, allowing them to:
• Read and write files: The client first contacts the NameNode for
metadata and then transfers data directly with DataNodes (see the
client sketch after this list).
• Configure replication: Default is three copies, but critical files can
have a higher replication factor.
• Optimize data locality: Applications like MapReduce can schedule
computations near the data to minimize transfer overhead.
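The sketch below shows this client flow with the standard org.apache.hadoop.fs.FileSystem API. The file path and the replication factor of 5 are invented for illustration, and the configuration is assumed to point at an existing cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to the default file system (HDFS)

        Path file = new Path("/user/demo/report.txt");   // hypothetical path

        // Write: the client asks the NameNode to allocate blocks, then streams
        // the data through a pipeline of DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello, hdfs");
        }

        // Raise the replication factor for a critical file (cluster default is 3).
        fs.setReplication(file, (short) 5);

        // Read: the client asks the NameNode for block locations, then reads
        // directly from a DataNode holding each block.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}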
3. File I/O Operations and Replica Management
A. File Read and Write
• Only one client may write to or append to a given file at a time
(single-writer, multiple-reader model).
• The writer holds a lease on the file, which it renews periodically via
heartbeats. If the lease expires, another client may claim the file.
Files are split into blocks that are written in a pipeline through multiple
DataNodes for redundancy. HDFS supports appending to existing files, but
previously written data cannot be modified in place.
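A minimal append sketch, assuming the file already exists and the cluster permits appends; the lease behaviour described above is enforced by the NameNode, not by anything the client has to code. The path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/user/demo/events.log");  // hypothetical existing file

        // append() succeeds only while this client holds the file's lease;
        // a second concurrent writer is rejected by the NameNode.
        try (FSDataOutputStream out = fs.append(log)) {
            // New bytes go to the end of the file; existing bytes are never rewritten.
            out.writeBytes("new record\n");
        }
        fs.close();
    }
}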
B. Replica Management
The NameNode monitors the replication status:
• Under-replicated blocks: Scheduled for additional copies (a
client-side check is sketched after this list).
• Over-replicated blocks: Excess copies are deleted to free up
space.
• Replica Balancing: Replicas are spread across DataNodes and racks
so that the failure of a single node or rack cannot destroy all copies
of a block.
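Re-replication and deletion are scheduled inside the NameNode, but a client can observe the resulting state. The sketch below uses real FileSystem calls (with a hypothetical path) to compare each block's live replica locations against the file's target replication factor.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheckSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/user/demo/report.txt")); // hypothetical
        short target = st.getReplication();               // desired number of replicas

        BlockLocation[] blocks = fs.getFileBlockLocations(st, 0, st.getLen());
        for (BlockLocation b : blocks) {
            int live = b.getHosts().length;               // DataNodes currently holding this block
            String state = live < target ? "under-replicated"
                         : live > target ? "over-replicated" : "healthy";
            System.out.printf("offset %d: %d/%d replicas (%s)%n",
                              b.getOffset(), live, target, state);
        }
        fs.close();
    }
}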
C. Balancer Tool
To prevent uneven disk utilization, HDFS includes a balancer, which
redistributes block replicas to under-utilized nodes while maintaining
replication guarantees.
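The balancer's core test is simple to state: a node is considered balanced when its disk utilization stays within a threshold of the utilization of the cluster as a whole. The sketch below writes that test down directly; the threshold and the capacity figures are invented for illustration, and the real tool does considerably more while it moves blocks.

/** Toy version of the balancer's balance test (not the actual balancer code). */
public class BalancerSketch {

    /** A node is balanced when its utilization is within `threshold`
     *  of the utilization of the cluster as a whole. */
    static boolean isBalanced(double nodeUsed, double nodeCapacity,
                              double clusterUsed, double clusterCapacity,
                              double threshold) {
        double nodeUtil    = nodeUsed / nodeCapacity;
        double clusterUtil = clusterUsed / clusterCapacity;
        return Math.abs(nodeUtil - clusterUtil) <= threshold;
    }

    public static void main(String[] args) {
        // Invented numbers: a 100 TB cluster that is 60% full;
        // one 10 TB node holds 8.5 TB (85% full).
        boolean ok = isBalanced(8.5, 10.0, 60.0, 100.0, 0.10);
        System.out.println(ok ? "balanced" : "candidate for the balancer"); // candidate
    }
}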
D. Node Decommissioning
When a DataNode is removed from service, the NameNode gradually
transfers its blocks to other nodes before marking it as decommissioned.
E. Inter-Cluster Data Copying
HDFS provides DistCp, a MapReduce-based tool for efficiently copying
large datasets between HDFS clusters.
4. Practice at Yahoo!
HDFS is central to Yahoo!'s Web Map application, which indexes the World
Wide Web, processing 500 TB of intermediate data in 75 hours.
A. Performance Benchmarks
HDFS achieves:
• 66 MB/s per node (Read)
• 40 MB/s per node (Write)
• Busy cluster throughput: ~1 MB/s per node (due to job mix)
B. Security and Resource Management
• A UNIX-style permission model controls access to files and
directories (see the sketch after this list).
• Storage quotas prevent excessive resource consumption by
individual users.
• Hadoop Archives (HAR) optimize storage for small files.
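A small sketch of the permission side using the standard FileSystem API; the directory, owner, and group names are hypothetical, and changing ownership normally requires superuser privileges. Quotas and Hadoop Archives are managed with separate administrative tools rather than through these calls.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class PermissionsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/user/demo/private");   // hypothetical directory

        fs.mkdirs(dir);
        // rwx for the owner, r-x for the group, nothing for others (i.e. mode 750).
        fs.setPermission(dir,
                new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE));
        fs.setOwner(dir, "demo", "analytics");       // requires superuser privileges
        fs.close();
    }
}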
5. Conclusion
HDFS is a highly scalable, fault-tolerant distributed file system
designed for handling big data workloads. It is widely adopted in
production environments like Yahoo!, where it supports petabyte-scale
data storage and processing.
Key Benefits of HDFS:
• Scalable and High-Performance: Supports clusters with
thousands of nodes.
• Optimized for Batch Processing: Designed for MapReduce
workloads.
• Fault-Tolerant and Reliable: Replication-based redundancy
ensures data availability.
• Flexible and Extensible: Open-source framework allows
customization.
HDFS continues to evolve, integrating features like real-time processing
(HBase, Scribe) and stronger security models for enterprise use.