HDFS
When a dataset exceeds the storage capacity of a single physical machine, it becomes necessary to partition
and distribute it across multiple machines. Distributed filesystems manage this distributed storage across a
network of machines, allowing for scalability and fault tolerance.
Key Points:
• Definition: Distributed filesystems are filesystems that operate over a network, managing storage
across multiple machines.
• Complexity: They are more complex than traditional disk filesystems due to the challenges of network
programming and ensuring data integrity across nodes.
• Challenges:
o Node Failure: Ensuring the filesystem can tolerate the failure of nodes without data loss.
o Network Issues: Handling latency, bandwidth limitations, and reliability of the network.
HDFS is a distributed filesystem designed to store and manage large datasets across a cluster of machines. It
is a core component of the Hadoop ecosystem.
Key Features:
• Scalability: HDFS is designed to scale out by adding more nodes to the cluster, which increases
storage capacity and computational power.
• Fault Tolerance: It replicates data across multiple nodes to ensure data is not lost in case of hardware
failures.
• High Throughput: Optimized for high throughput rather than low latency, making it suitable for
processing large files.
Key Components:
• NameNode: Manages the metadata of the filesystem, such as file names, directory structure, and file-to-block mapping. In a non-HA deployment it is a single point of failure and crucial for the filesystem's operation.
• DataNode: Stores the actual data blocks. Data is replicated across multiple DataNodes for fault
tolerance.
• Secondary NameNode: Periodically merges the NameNode's edit log into the namespace image (checkpointing) and keeps a copy of the merged metadata. It limits data loss if the NameNode fails, but it is not a hot standby.
HDFS Operation:
• Data Storage: Files are split into blocks (128 MB by default, often configured larger, e.g., 256 MB) and distributed across the cluster. Each block is replicated multiple times (default replication factor is 3) to ensure data reliability. For example, a 500 MB file is stored as four blocks (three full 128 MB blocks plus one 116 MB block); with replication factor 3, the cluster holds 12 block replicas in total.
• Access: HDFS is designed for large-scale data processing and follows a write-once, read-many access pattern. It is not optimized for storing many small files or for frequent updates.
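As a hedged illustration of these mechanics, the following Java sketch writes a file through the FileSystem API and adjusts its replication factor; the path and the replication value of 2 are assumptions for the example, not cluster defaults.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // filesystem named by fs.defaultFS
            Path path = new Path("/user/example/events.log"); // hypothetical path
            // create() opens a write-once stream; each block is replicated by HDFS.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeBytes("first record\n");
            }
            fs.setReplication(path, (short) 2); // override the default replication factor
        }
    }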
Hadoop’s filesystem abstraction allows it to integrate with various storage systems beyond HDFS:
• Local Filesystem: Hadoop can use the local filesystem for development or smaller datasets.
• Amazon S3: Hadoop can interact with Amazon S3 as a storage backend, allowing for scalable storage
in the cloud. This integration uses the S3A file system client to read and write data.
• Other Systems: RDBMSs, data lakes, data warehouses, streaming systems, and other cloud storage systems, via additional connectors and filesystem implementations.
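Reading from S3 goes through the same FileSystem API as HDFS. The sketch below is a minimal illustration, assuming the hadoop-aws module is on the classpath; the bucket name is hypothetical and the credential values are placeholders (credentials can also come from the environment or IAM roles).

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3ARead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.s3a.access.key", "...");   // placeholder credentials
            conf.set("fs.s3a.secret.key", "...");
            // Bind a FileSystem instance to the (hypothetical) bucket via the S3A client.
            FileSystem s3 = FileSystem.get(URI.create("s3a://example-bucket/"), conf);
            for (FileStatus st : s3.listStatus(new Path("s3a://example-bucket/data/"))) {
                System.out.println(st.getPath() + "  " + st.getLen() + " bytes");
            }
        }
    }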
Key Considerations:
• Data Consistency: When integrating with other storage systems, ensure that the data consistency
models are compatible with your use case.
• Performance: Consider the performance implications of different storage backends, particularly in
terms of latency and throughput.
Design and Use Cases of HDFS
HDFS (Hadoop Distributed File System) is specifically designed for storing and processing very large files with streaming (write-once, read-many) access patterns on clusters of commodity hardware. While this design serves many use cases well, there are specific scenarios where HDFS is not the best fit:
• Latency Constraints: Applications requiring quick, sub-second data access (in the tens of
milliseconds range) may not perform well with HDFS. Its optimization for high throughput rather than
low latency means it may not meet the performance needs of such applications.
• Alternative Solutions: HBase, which is built on top of HDFS, is often used for scenarios requiring
low-latency data access. HBase provides capabilities for fast random access to data.
Handling Lots of Small Files:
• Metadata Management: HDFS stores metadata (such as file and directory information) in memory
on the NameNode. This means the scalability of HDFS in terms of the number of files is limited by the
memory capacity of the NameNode.
• Scalability Constraints: Each file, directory, and block takes up about 150 bytes of Namenode memory. For example, one million files, each occupying one block, amount to two million objects and need at least 300 MB of memory. Managing millions of files is feasible, but billions of files would need hundreds of gigabytes of heap, which exceeds current hardware capabilities.
Multiple Writers and Arbitrary File Modifications:
• Write Restrictions: HDFS supports a single writer per file, with data being appended to the end of the file (a sketch of the append API follows this list). It does not support multiple writers or modifications at arbitrary file offsets.
• Future Considerations: Although support for multiple writers and arbitrary modifications might be
introduced in the future, such features are likely to be less efficient compared to the current append-
only model.
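A minimal sketch of the single-writer append model, assuming append is enabled on the cluster and that the (hypothetical) file already exists:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/tmp/log.txt"); // hypothetical existing file
            // Only one writer may hold the lease on this file at a time,
            // and writes always go to the end of the file.
            try (FSDataOutputStream out = fs.append(path)) {
                out.writeBytes("new record\n");
            }
        }
    }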
HDFS Concepts
Blocks
In both traditional filesystems and HDFS, the concept of "blocks" is fundamental. Here's a detailed look at
what blocks are and their significance:
• Traditional Filesystems:
o Disk Blocks: The smallest unit of storage on a disk, usually 512 bytes.
o Filesystem Blocks: Typically larger, a few kilobytes in size, used by the filesystem to manage
data. The filesystem block size is an integral multiple of the disk block size.
o Tools: Commands like df and fsck operate at the filesystem block level for maintenance and
checking.
• HDFS Blocks:
o Size: Much larger than traditional filesystem blocks, with a default size of 128 MB.
o Function: Files in HDFS are divided into blocks of this size, which are stored independently
across the cluster.
o Storage Efficiency: If a file is smaller than a block, only the space needed for the file is used
(e.g., a 1 MB file on a 128 MB block uses only 1 MB of disk space).
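To make the block layout concrete, here is a minimal Java sketch (the file path is hypothetical) that prints a file's block size and the hosts holding each block's replicas:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/user/example/big.dat"); // hypothetical file
            FileStatus status = fs.getFileStatus(path);
            System.out.println("block size: " + status.getBlockSize()); // e.g. 134217728 (128 MB)
            // One BlockLocation per block, each listing the hosts storing its replicas.
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset " + loc.getOffset() + " length " + loc.getLength()
                    + " hosts " + String.join(",", loc.getHosts()));
            }
        }
    }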
Why Are HDFS Blocks So Large?
The choice of large block sizes in HDFS is driven by several practical considerations:
• Seek Overhead: Blocks are large enough that the time to transfer a block's data substantially exceeds the time to seek to its start. For example, with a seek time of about 10 ms and a transfer rate of 100 MB/s, a block of roughly 100 MB keeps seek time to about 1% of transfer time.
• Less Metadata: Fewer, larger blocks per file mean less block metadata for the Namenode to hold in memory and fewer client-Namenode interactions.
Namenodes and Datanodes
HDFS (Hadoop Distributed File System) operates with a master-worker architecture involving two main types of nodes: Namenodes and Datanodes. Understanding their roles and mechanisms is essential for maintaining the health and performance of an HDFS cluster.
1. Namenode
The Namenode is the master node in the HDFS architecture, responsible for managing the filesystem
namespace and metadata.
Responsibilities:
• Namespace Management: Maintains the filesystem tree and the metadata for all files and directories, persisted as a namespace image plus an edit log.
• Block Mapping: Knows which Datanodes hold the blocks of each file; block locations are not persisted but are rebuilt from Datanode block reports at startup.
• Critical Role:
o If the Namenode fails, the filesystem cannot be used, and data loss can occur because the system
would not know how to reconstruct files from the blocks on Datanodes.
• Resilience Mechanisms:
o Backup:
▪ Namenode's persistent state is backed up to multiple filesystems. This includes
synchronous and atomic writes to local disks and remote NFS mounts.
o Secondary Namenode:
▪ Role: Periodically merges the namespace image with the edit log to prevent the edit log
from becoming too large.
▪ Operation: Runs on a separate physical machine with sufficient CPU and memory. It
keeps a copy of the merged namespace image.
▪ Limitations: The state of the Secondary Namenode lags behind the primary, so there is
a risk of data loss if the primary fails. In such cases, the primary's metadata files are
copied to the Secondary Namenode, which then acts as the new primary.
▪ Alternative: A hot standby Namenode can be used for high availability, which provides
a more robust failover solution.
2. Datanodes
Datanodes are the worker nodes in the HDFS architecture, responsible for storing and managing the data
blocks.
Responsibilities:
• Store and retrieve blocks when told to by clients or the Namenode.
• Report back to the Namenode periodically with lists of the blocks they are storing.
Operation:
• Data Management:
o Datanodes handle the actual data blocks, and their efficiency directly impacts the performance
of HDFS.
o They ensure data redundancy and availability by storing multiple replicas of each block.
• Client Operations:
o Clients interact with the Namenode to get information about where to find the blocks of a file.
o The Namenode directs clients to the appropriate Datanodes for block retrieval or storage.
• Datanode Reporting:
o Datanodes regularly send heartbeat signals and block reports to the Namenode.
o This reporting helps the Namenode monitor the health of the Datanodes and manage data
replication and block recovery.
Block Caching
Purpose:
• Block caching improves read performance by storing frequently accessed blocks in the memory of
Datanodes.
• It allows quick access to data without having to read from disk repeatedly.
Mechanics:
• Caching Location: Blocks are cached in an off-heap memory area on the Datanodes. Off-heap
memory is used to avoid garbage collection overhead.
• Cache Scope: By default, each block is cached in one Datanode’s memory. This is configurable on a
per-file basis, allowing multiple Datanodes to cache the same block if needed.
• Cache Management: Administrators can specify which files should be cached, and for how long, using cache directives added to a cache pool (see the command sketch after this list).
• Cache Pools: Cache pools are used to manage cache permissions and resource usage. They help in
organizing and controlling access to cached data.
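As a hedged illustration (the pool and path names are hypothetical), cache pools and directives are managed with the hdfs cacheadmin command-line tool:

    hdfs cacheadmin -addPool lookup-pool                  # create a cache pool
    hdfs cacheadmin -addDirective -path /tables/lookup \
        -pool lookup-pool -replication 2                  # cache the file on two Datanodes
    hdfs cacheadmin -listDirectives                       # inspect the active directives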
Benefits:
• Performance Improvement: Increases read performance by reducing disk I/O. For instance, job
schedulers in frameworks like MapReduce or Spark can schedule tasks on Datanodes that have relevant
blocks cached, leading to faster data access.
• Use Cases: Ideal for use cases such as small lookup tables used in joins, where data is frequently
accessed.
HDFS Federation
Purpose:
• HDFS Federation enhances the scalability of HDFS by allowing the use of multiple Namenodes.
Mechanics:
• Namespace Management: In HDFS Federation, the filesystem namespace is split across multiple
Namenodes. Each Namenode manages a portion of the namespace and its associated block pool.
• Namespace Volumes: Each Namenode is responsible for a namespace volume, which includes
metadata for its portion of the namespace. These volumes are independent, meaning that the failure of
one Namenode does not affect others.
• Block Pool Storage: Datanodes register with all Namenodes in the cluster and store blocks from
multiple block pools. This ensures that blocks from different namespaces can be stored and managed
efficiently.
Access:
• Client Interaction: Clients use client-side mount tables to map file paths to the appropriate
Namenodes. Configuration is managed using ViewFileSystem and the viewfs:// URIs.
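A minimal sketch of a client-side mount table, set programmatically here for illustration (the mount points and Namenode hosts are hypothetical; the same keys normally live in core-site.xml):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ViewFsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Map /user and /data to different federated Namenodes (hypothetical hosts).
            conf.set("fs.viewfs.mounttable.cluster.link./user", "hdfs://nn1.example.com:8020/user");
            conf.set("fs.viewfs.mounttable.cluster.link./data", "hdfs://nn2.example.com:8020/data");
            conf.set("fs.defaultFS", "viewfs://cluster/");
            FileSystem viewFs = FileSystem.get(conf);
            // Paths now resolve through the mount table to the owning Namenode.
            System.out.println(viewFs.exists(new Path("/user")));
        }
    }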
Benefits:
• Scalability: Allows the HDFS cluster to scale beyond the limitations of a single Namenode's memory
by distributing the namespace management.
• Fault Tolerance: Improves fault tolerance by isolating namespace management across multiple
Namenodes.
HDFS High Availability (HA)
Purpose:
• HDFS HA addresses the single point of failure issue associated with the Namenode by providing a pair
of Namenodes in an active-standby configuration.
Mechanics:
• Active-Standby Configuration: One Namenode acts as the active Namenode, handling client
requests, while the other acts as the standby, ready to take over if the active Namenode fails.
• Shared Storage: Namenodes use highly available shared storage to keep the edit log. Two main
choices for this storage are:
o NFS Filer: Traditional Network File System for shared storage.
o Quorum Journal Manager (QJM): A specialized HDFS component designed to provide
highly available edit logs. It uses a group of journal nodes where each edit must be written to a
majority of nodes (e.g., three nodes, allowing for one node failure).
• Datanode Reporting: Datanodes must report block information to both the active and standby
Namenodes.
• Client Configuration: Clients must be configured to handle Namenode failover transparently.
Failover Process:
• Quick Failover: The standby Namenode can take over quickly (within seconds) as it maintains up-to-
date state in memory, including the latest edit log and block mappings.
• Recovery: In case the standby Namenode is down when the active fails, the administrator can start the
standby from cold. While this process is better than the non-HA scenario, it still requires standard
operational procedures.
Advantages:
• Reduced Downtime: Provides high availability and reduces downtime by enabling a rapid failover
mechanism.
• Operational Efficiency: Standardizes the failover process, making it more predictable and
manageable.
Failover Process
Failover Controller:
• Role: Manages the transition between the active and standby Namenodes. It ensures that only one
Namenode is active at a time.
• Default Implementation: Uses ZooKeeper, which coordinates the failover process by monitoring the
health of the Namenodes and triggering failover if needed.
• Process:
o Heartbeat Mechanism: Each Namenode runs a lightweight failover controller that sends
heartbeats to check the status of the other Namenode.
o Graceful Failover: Can be initiated manually by an administrator, such as during routine
maintenance. This involves an orderly transition where both Namenodes switch roles smoothly.
o Ungraceful Failover: Occurs automatically if the active Namenode fails unexpectedly. This
can happen due to issues like network partitions or slow networks, where the active Namenode
might still be running but is unreachable.
• Transparent to Clients: The client library manages failover transparently. Clients are configured with
a logical hostname that maps to a pair of Namenode addresses.
• Failover Mechanism: The client library attempts connections to each Namenode address in turn until
it succeeds. This ensures continuous service availability even during failover.
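A minimal sketch of the standard client-side HA configuration (the nameservice and host names are hypothetical); the logical nameservice hides the Namenode pair from the client:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class HaClientConfig {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Logical nameservice used by clients instead of a single Namenode host.
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "host1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "host2.example.com:8020");
            // Proxy provider that tries each Namenode in turn until one responds.
            conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
            conf.set("fs.defaultFS", "hdfs://mycluster");
            FileSystem fs = FileSystem.get(conf); // failover is transparent from here on
            System.out.println(fs.getUri());
        }
    }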
Fencing Mechanisms
Purpose:
• Prevent Data Corruption: Ensures that the previously active Namenode, which might still be running
or reachable, does not interfere with the cluster operations or cause data corruption.
Fencing Techniques:
• SSH Fencing Command: A simple method where an SSH command is used to kill the process of the previously active Namenode. This is needed because the failover controller cannot always be sure that the old Namenode has actually stopped.
• NFS Filer: When using an NFS filer for shared edit logs, stronger fencing methods are required
because NFS filers do not enforce exclusive write access as effectively as QJM.
o Revoking Access: Commands to revoke the Namenode’s access to the shared storage directory
can be used.
o Disabling Network Port: The Namenode’s network port can be disabled via remote
management commands to prevent it from accepting requests.
o STONITH (Shoot The Other Node In The Head): This drastic measure involves using a
specialized power distribution unit (PDU) to forcibly power down the host machine of the failed
Namenode, ensuring it can no longer affect the system.
Hadoop Filesystems: Overview and Usage
Hadoop’s flexible filesystem abstraction allows it to interact with various storage systems, each designed for
specific use cases. Understanding these filesystems can help in choosing the right one for your needs, whether
for local testing, distributed processing, or cloud integration.
1. Local (fs.LocalFileSystem): The locally attached disk, with client-side checksums (fs.RawLocalFileSystem is the variant without checksums). Best for development and small-scale testing.
2. HDFS (hdfs.DistributedFileSystem): Hadoop's distributed filesystem, designed for large-scale data processing.
3. WebHDFS (hdfs.web.WebHdfsFileSystem): Read/write access to HDFS over HTTP.
4. Secure WebHDFS (hdfs.web.SWebHdfsFileSystem): The HTTPS variant of WebHDFS.
5. HAR (fs.HarFileSystem): Hadoop Archives, layered over another filesystem to pack many files into one archive and reduce Namenode memory usage.
6. ViewFS (viewfs.ViewFileSystem): A client-side mount table over other Hadoop filesystems, commonly used with federated Namenodes.
7. FTP (fs.ftp.FTPFileSystem): A filesystem backed by an FTP server.
8. Amazon S3 (fs.s3a.S3AFileSystem): A filesystem backed by Amazon S3; the S3A client supersedes the older s3n implementation.
Choosing a Filesystem
• HDFS is recommended for large-scale data processing due to its optimization for data locality and
fault tolerance.
• Cloud-based filesystems (S3, Azure, Swift) are used for integrating with cloud storage platforms and
can be suitable depending on specific storage and processing requirements.
• Local filesystems are best for small-scale or development testing.
Each filesystem has distinct use cases depending on your needs for data volume, processing, and integration
with other storage systems.
Interface
Hadoop provides several interfaces to interact with its filesystems, catering to different programming
languages, environments, and access needs. Here’s an overview of the key interfaces:
1. Java API
• Purpose: Hadoop is written in Java, so its primary filesystem interactions are handled through Java
APIs.
• Implementation: The org.apache.hadoop.fs.FileSystem class provides methods for filesystem
operations such as reading, writing, and listing files.
• Usage: Commonly used for Java applications and for running Hadoop commands.
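A minimal sketch of common operations through this API (the paths are hypothetical); Configuration picks up the cluster settings from core-site.xml and hdfs-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsOperations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path dir = new Path("/tmp/demo");               // hypothetical directory
            fs.mkdirs(dir);                                 // create a directory
            try (FSDataOutputStream out = fs.create(new Path(dir, "hello.txt"))) {
                out.writeBytes("hello hdfs\n");             // write a small file
            }
            for (FileStatus st : fs.listStatus(dir)) {      // list directory contents
                System.out.println(st.getPath() + " " + st.getLen());
            }
            fs.delete(dir, true);                           // recursive delete
        }
    }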
2. HTTP (WebHDFS)
• Purpose: Provides access to HDFS via HTTP, enabling non-Java applications to interact with Hadoop.
• Implementation: The WebHDFS protocol exposes an HTTP REST API that allows file operations
over HTTP or HTTPS.
• Access Methods:
o Direct Access: Clients interact directly with HDFS daemons (namenodes and datanodes) via
embedded web servers.
o Proxy Access: Clients connect through an HttpFS proxy server, which forwards requests to
HDFS. This approach can help with firewall and bandwidth management, and is useful for
cross-data-center or cloud access.
• Performance: WebHDFS is slower than the native Java client, so it’s less suitable for very large data
transfers.
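As a hedged illustration (host, port, and paths are hypothetical; the Namenode's default HTTP port is 9870 in Hadoop 3 and 50070 in Hadoop 2), typical WebHDFS requests look like this:

    # List a directory (metadata operations go to the namenode)
    curl -s "http://namenode.example.com:9870/webhdfs/v1/user/example?op=LISTSTATUS"

    # Read a file; the namenode replies with a redirect to a datanode (-L follows it)
    curl -s -L "http://namenode.example.com:9870/webhdfs/v1/user/example/data.txt?op=OPEN"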
3. C API (libhdfs)
• Purpose: Allows C (and C-compatible) applications to interact with Hadoop filesystems.
• Implementation: The libhdfs library mirrors the Java FileSystem interface, using JNI to call a Java filesystem client.
4. NFS Gateway
• Purpose: Allows HDFS to be mounted on a local client's filesystem, enabling interaction using standard Unix utilities.
• Implementation: Hadoop's NFSv3 gateway makes HDFS appear as a local filesystem; appends work, but random modifications do not, since HDFS is append-only.
• Usage: Useful for Unix-like environments where standard tools and libraries are used for file operations.
5. FUSE (Filesystem in Userspace)
• Purpose: Another way to mount HDFS (or any Hadoop filesystem) as a standard local filesystem.
• Implementation: The Fuse-DFS module, written in C on top of libhdfs; the NFS gateway is generally the more robust mounting option.
These interfaces provide flexibility for accessing Hadoop filesystems from various programming environments and tools.
Each interface serves different use cases and provides different levels of access, performance, and
compatibility.
DATA FLOW
Anatomy of a File Read
Components Involved
• The client, which opens a file through DistributedFileSystem (the HDFS implementation of FileSystem).
• DFSInputStream, wrapped in the FSDataInputStream returned to the client, which manages Datanode and Namenode I/O.
• The Namenode, which supplies block locations, and the Datanodes, which serve the block data.
Sequence of Events
1. The client calls open() on DistributedFileSystem, which asks the Namenode (via RPC) for the locations of the first blocks of the file.
2. For each block, the Namenode returns the addresses of the Datanodes holding replicas, sorted by proximity to the client.
3. The client calls read(); DFSInputStream connects to the closest Datanode for the first block and streams data back.
4. When a block ends, DFSInputStream closes that connection and opens one to the best Datanode for the next block, transparently to the client.
5. When the client finishes, it calls close() on the stream.
Failure Handling
• Datanode Failures:
o Retry: If an error occurs while communicating with a datanode, DFSInputStream will try the
next closest datanode.
o Tracking Failures: It remembers failed datanodes to avoid retrying them unnecessarily.
• Checksum Failures:
o Clients verify checksums for the data they read; if verification fails, the client reports the corrupt replica to the Namenode and reads the block from another Datanode.
o Example: if checksum verification fails for block 1 from /d1/r1/n1, read block 1 from /d1/r1/n2.
Design Benefits
• Data traffic is spread across all the Datanodes in the cluster, so HDFS can scale to many concurrent clients.
• The Namenode serves only block-location requests (held in memory, hence efficient) and sits outside the data path, so it does not become a bottleneck.
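From the client's point of view, the read path is just a stream. The following hedged sketch (the path is hypothetical) shows that block switching, retries, and checksum handling all happen inside DFSInputStream:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadFlow {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration()); // DistributedFileSystem for hdfs:// URIs
            try (FSDataInputStream in = fs.open(new Path("/user/example/data.txt"))) { // hypothetical path
                IOUtils.copyBytes(in, System.out, 4096, false);  // blocks are fetched datanode by datanode
                in.seek(0);                                      // seek back to the start and re-read
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }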