UNIT 3 HDFS, Hadoop Environment Part 1
HDFS: When a dataset outgrows the storage capacity of a single physical machine, it becomes
necessary to partition it across a number of separate machines. Filesystems that manage the
storage across a network of machines are called distributed filesystems.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed
Filesystem. HDFS is designed around the following assumptions:
Very large files: “Very large” in this context means files that are hundreds of megabytes,
gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of
data.
Streaming data access: HDFS is built around the idea that the most efficient data processing
pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from
source, then various analyses are performed on that dataset over time.
Commodity hardware: Hadoop doesn't require expensive, highly reliable hardware to run on.
It's designed to run on clusters of commodity hardware (commonly available hardware from
multiple vendors) for which the chance of node failure across the cluster is high, at least for
large clusters. HDFS is designed to carry on working without a noticeable interruption to the
user in the face of such failure.
HDFS Concepts:
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks
are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized
chunks, which are stored as independent units. Unlike a regular filesystem, a file in HDFS
that is smaller than the block size does not occupy a full block's worth of storage; e.g., a
5 MB file stored in HDFS with a 128 MB block size takes only 5 MB of space. The HDFS
block size is large in order to minimize the cost of seeks (a small sketch after this list
shows how to query these settings programmatically).
2. Name Node: HDFS works in a master-worker pattern, where the name node acts as the master.
The name node is the controller and manager of HDFS, as it knows the status and the metadata
of all the files in HDFS; the metadata consists of file permissions, names, and the location
of each block. The metadata is small, so it is stored in the memory of the name node,
allowing faster access to it. Moreover, since the HDFS cluster is accessed by multiple
clients concurrently, all this information is handled by a single machine. Filesystem
operations such as opening, closing, and renaming are executed by the name node.
3. Data Node: Data nodes store and retrieve blocks when they are told to by clients or the
name node. They report back to the name node periodically with a list of the blocks they
are storing. The data nodes, being commodity hardware, also do the work of block creation,
deletion, and replication as directed by the name node.
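As referenced in point 1 above, here is a minimal sketch of how a client can query the block
size and replication factor a cluster applies to new files, using the Hadoop FileSystem API.
The NameNode address hdfs://namenode:9000 and the path are hypothetical placeholders.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSettings {
        public static void main(String[] args) throws Exception {
            // Hypothetical NameNode address; substitute your cluster's fs.defaultFS.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

            Path path = new Path("/");
            // Block size and replication factor the cluster would apply to new
            // files written under this path (128 MB and 3 by default).
            System.out.println("Default block size : " + fs.getDefaultBlockSize(path) + " bytes");
            System.out.println("Default replication: " + fs.getDefaultReplication(path));

            fs.close();
        }
    }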
Secondary Name Node: It is a separate physical machine that acts as a helper to the name node.
It performs periodic checkpoints: it communicates with the name node and takes snapshots of the
metadata, which helps minimize downtime and data loss.
Benefits of HDFS:
As an open source subproject within Hadoop, HDFS offers five core benefits when dealing with
big data:
Fault Tolerance: HDFS has been designed to detect faults and automatically recover
quickly, ensuring continuity and reliability.
Speed: Because of its cluster architecture, HDFS can sustain data transfer rates of around
2 GB per second.
Access to more types of data: Specifically streaming data. Because it is designed to
handle large amounts of data for batch processing, HDFS allows high data throughput rates,
making it well suited to streaming data.
Compatibility & Portability: HDFS is designed to be portable across a variety of
hardware setups and compatible with several underlying operating systems, ultimately
giving users the option of running HDFS on their own tailored setup.
Scalable: You can scale resources according to the size of your file system. HDFS
includes both vertical and horizontal scalability mechanisms.
These are areas where HDFS is not a good fit today (Challenges):
Low-latency data access: Applications that require low-latency access to data, in the tens of
milliseconds range, will not work well with HDFS.
Lots of small files: Since the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on the namenode.
Multiple writers, arbitrary file modifications: Files in HDFS may be written to by a single
writer. Writes are always made at the end of the file. There is no support for multiple writers, or
for modifications at arbitrary offsets in the file.
All the blocks of a file are of the same size except the last one (if the file size is not a
multiple of 128 MB).
Suppose we have a file of size 612 MB, and we are using the default block configuration (128
MB). Five blocks are therefore created: the first four blocks are 128 MB in size, and the fifth
block is 100 MB in size (128*4 + 100 = 612 MB).
From the above example, we can conclude that:
1. A file in HDFS that is smaller than a single block does not occupy a full block's worth of
space in the underlying storage.
2. Each file stored in HDFS doesn’t need to be an exact multiple of the configured block size.
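To make the arithmetic concrete, here is a minimal sketch (plain Java, no Hadoop dependency)
that splits the hypothetical 612 MB file from the example above into 128 MB blocks.

    public class BlockSplit {
        public static void main(String[] args) {
            final long MB = 1024L * 1024L;
            long fileSize = 612 * MB;    // example file from the text
            long blockSize = 128 * MB;   // default HDFS block size

            // Number of blocks, rounding up for the partial last block.
            long fullBlocks = fileSize / blockSize;
            long remainder = fileSize % blockSize;
            long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);

            System.out.println("Total blocks      : " + totalBlocks);            // 5
            System.out.println("Full 128 MB blocks: " + fullBlocks);             // 4
            System.out.println("Last block size   : " + remainder / MB + " MB"); // 100 MB
        }
    }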
Having a block abstraction for a distributed filesystem brings several
benefits:
In HDFS the abstraction is made over the blocks of a file rather than the file as a whole, which
simplifies the storage subsystem. Since the size of the blocks is fixed, it is easy to manage and
calculate how many blocks can be stored on a single disk.
The first benefit:
A file can be larger than any single disk in the network. There’s nothing that requires the blocks
from a file to be stored on the same disk, so they can take advantage of any of the disks in the
cluster.
Second:
Making the unit of abstraction a block rather than a file simplifies the storage subsystem. The
storage subsystem deals with blocks, simplifying storage management (since blocks are a fixed
size, it is easy to calculate how many can be stored on a given disk) and eliminating metadata
concerns.
Third:
Blocks fit well with replication for providing fault tolerance and availability. To insure against
corrupted blocks and disk and machine failure, each block is replicated to a small number of
physically separate machines (typically three).
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks.
By making a block large enough, the time to transfer the data from the disk can be made to be
significantly larger than the time to seek to the start of the block. Thus the time to transfer a
large file made of multiple blocks operates at the disk transfer rate.
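As a back-of-the-envelope illustration, the sketch below assumes a 10 ms seek time and a
100 MB/s transfer rate (typical textbook figures, not measured values) and computes what
fraction of the total block access time the seek represents for a 128 MB block.

    public class SeekOverhead {
        public static void main(String[] args) {
            double seekTimeMs = 10.0;       // assumed average seek time
            double transferRateMBs = 100.0; // assumed sustained transfer rate
            double blockSizeMB = 128.0;     // default HDFS block size

            double transferTimeMs = blockSizeMB / transferRateMBs * 1000.0; // 1280 ms
            double seekFraction = seekTimeMs / (seekTimeMs + transferTimeMs);

            // With these figures the seek is under 1% of the total access time.
            System.out.printf("Transfer time: %.0f ms, seek overhead: %.2f%%%n",
                    transferTimeMs, seekFraction * 100);
        }
    }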
Data Replication:
In a Hadoop HDFS cluster, data replication is a crucial aspect of ensuring fault tolerance and
data durability. Let's analyse each possible approach to data replication and determine the
most plausible one:
1. Client to Master:
In this approach, the client would send the data to the master node, and the master node
would be responsible for replicating the data to other datanodes. However, this approach is
not commonly used in Hadoop HDFS because it introduces a single point of failure. If the
master node fails, it can result in data loss and disrupt the replication process.
2. Client sending data to each datanode:
In this approach, the client directly sends the data to each individual datanode in the cluster.
While this method would ensure replication, it requires the client to have knowledge of all
the datanodes’ IP addresses and manually handle the replication process. This approach is
not practical as it adds complexity and overhead to the client’s responsibilities.
3. Client copying data first in one datanode and then that datanode copying it to another
datanode:
This approach involves the client sending the data to one datanode, and then that datanode
takes the responsibility of copying the data to other datanodes in the cluster. This is the
most plausible and widely used approach for data replication in Hadoop HDFS.
In Hadoop HDFS, the responsibility for data replication lies with the datanodes themselves.
When the client sends data to a particular datanode, that datanode replicates the data to other
datanodes based on the replication factor specified in the Hadoop configuration. The replication
process is handled internally by the datanodes, leveraging the cluster’s distributed architecture
and the HDFS replication pipeline.
By distributing the replication responsibility among the datanodes, Hadoop HDFS achieves fault
tolerance and data reliability. If a datanode fails, the replicated copies stored on other datanodes
ensure that the data remains accessible. Additionally, HDFS’s block placement policy ensures
that replicated blocks are distributed across different racks and nodes, further enhancing fault
tolerance and data availability.
In conclusion, the most plausible approach for data replication in Hadoop HDFS is for the client
to send the data to one datanode, and then for the datanodes to handle the replication process
internally. This approach leverages the distributed architecture of Hadoop and ensures fault
tolerance and data durability without placing an additional burden on the client.
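The sketch below illustrates this from the client's side: the client simply creates the file
through the FileSystem API (here with an explicit replication factor of 3), and the datanodes
and the write pipeline take care of placing the replicas. The NameNode URI and file path are
hypothetical.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicatedWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical cluster address; in practice taken from core-site.xml.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

            Path file = new Path("/user/demo/sample.txt");
            // create(path, overwrite, bufferSize, replication, blockSize):
            // the client only states the desired replication factor; the
            // datanodes replicate each block through the write pipeline.
            try (FSDataOutputStream out =
                     fs.create(file, true, 4096, (short) 3, 128 * 1024 * 1024L)) {
                out.writeUTF("hello hdfs");
            }

            fs.close();
        }
    }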
NameNode: The NameNode can be considered the master of the system. It maintains the file
system tree and the metadata for all the files and directories present in the system. Two
files, the 'namespace image' and the 'edit log', are used to store the metadata information.
The NameNode has knowledge of all the datanodes containing data blocks for a given file;
however, it does not store block locations persistently. This information is reconstructed
from the datanodes every time the system starts.
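Here is a hedged sketch of how a client can ask for this block-location metadata (served by the
NameNode) through the FileSystem API; the cluster address and file path are assumptions.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

            // Hypothetical file; each entry below corresponds to one block.
            FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt"));
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
            }

            fs.close();
        }
    }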
DataNode: DataNodes are slaves which reside on each machine in a cluster and provide
the actual storage. They are responsible for serving read and write requests from clients.
Read/write operations in HDFS operate at the block level. Data files in HDFS are broken into
block-sized chunks, which are stored as independent units. The default block size is 128 MB
(older Hadoop 1.x releases used 64 MB).
HDFS operates on a concept of data replication wherein multiple replicas of data blocks are
created and are distributed on nodes throughout a cluster to enable high availability of data in
the event of node failure.
Store Operation in HDFS:
As we now know, HDFS data is stored in something called blocks. These blocks are the smallest
unit of data that the file system can store. Files are processed and broken down into these blocks,
which are then taken and distributed across the cluster - and also replicated for safety.
A data read request is served by HDFS through the NameNode and the DataNodes. Let's call the
reader a 'client'.
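For example, a client reads a file by asking the NameNode for the file's blocks and then
streaming them from the DataNodes. A minimal read sketch using the FileSystem API follows;
the NameNode address and file path are hypothetical.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            // Hypothetical path; the NameNode supplies the block locations and
            // the client then streams the data directly from the DataNodes.
            String uri = "hdfs://namenode:9000/user/demo/sample.txt";
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);

            try (FSDataInputStream in = fs.open(new Path(uri))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }

            fs.close();
        }
    }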
Hadoop has an abstract notion of filesystems, of which HDFS is just one implementation.
The Java abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a
filesystem in Hadoop, and there are several concrete implementations. Hadoop is written in
Java, so most Hadoop filesystem interactions are mediated through the Java API. The
filesystem shell, for example, is a Java application that uses the Java FileSystem class to
provide filesystem operations. By exposing its filesystem interface as a Java API, Hadoop
makes it awkward for non-Java applications to access HDFS. The HTTP REST API exposed
by the WebHDFS protocol makes it easier for other languages to interact with HDFS. Note
that the HTTP interface is slower than the native Java client, so it should be avoided for very
large data transfers if possible.
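Because FileSystem is an abstract class, the same client code can be pointed at different
implementations just by changing the URI scheme. The sketch below lists a directory and would
work with hdfs://, webhdfs:// or file:// URIs; the host names and the WebHDFS port shown in the
comments are assumptions for illustration.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListDir {
        public static void main(String[] args) throws Exception {
            // Any of these schemes resolves to a concrete FileSystem implementation:
            //   hdfs://namenode:9000/user/demo      (native Java client)
            //   webhdfs://namenode:9870/user/demo   (HTTP REST via WebHDFS)
            //   file:///tmp                         (local filesystem)
            String uri = args.length > 0 ? args[0] : "file:///tmp";

            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);

            for (FileStatus status : fs.listStatus(new Path(uri))) {
                System.out.println(status.getPath() + "\t" + status.getLen());
            }

            fs.close();
        }
    }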