
BIG DATA (KCS 061)

UNIT 3: HDFS (Hadoop Distributed File System), Hadoop Environment

HDFS: When a dataset outgrows the storage capacity of a single physical machine, it becomes
necessary to partition it across a number of separate machines. Filesystems that manage the
storage across a network of machines are called distributed filesystems.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed
Filesystem.

The Design of HDFS:


HDFS is a filesystem designed for storing very large files with streaming data access patterns,
running on clusters of commodity hardware.

Very large files: “Very large” in this context means files that are hundreds of megabytes,
gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of
data.

Streaming data access: HDFS is built around the idea that the most efficient data processing
pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from
a source, and then various analyses are performed on that dataset over time.

Commodity hardware: Hadoop doesn't require expensive, highly reliable hardware to run on.
It's designed to run on clusters of commodity hardware (commonly available hardware from
multiple vendors) for which the chance of node failure across the cluster is high, at least
for large clusters. HDFS is designed to carry on working without a noticeable interruption to the
user in the face of such failure.

HDFS Concepts:

1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks
   are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized
   chunks, which are stored as independent units. Unlike a traditional filesystem, a file in
   HDFS that is smaller than the block size does not occupy a full block's worth of underlying
   storage; for example, a 5 MB file stored in HDFS with a 128 MB block size takes only 5 MB
   of space. The HDFS block size is large in order to minimize the cost of seeks.
2. Name Node: HDFS works in a master-worker pattern in which the name node acts as the master.
   The name node is the controller and manager of HDFS, as it knows the status and the metadata
   of all the files in HDFS; the metadata includes file permissions, names, and the location of
   each block. The metadata is small, so it is stored in the memory of the name node, allowing
   fast access. Moreover, since the HDFS cluster is accessed by multiple clients concurrently,
   all this information is handled by a single machine. Filesystem operations such as opening,
   closing, and renaming are executed by the name node. (A short Java sketch showing how a
   client can query this block metadata follows this list.)
3. Data Node: Data nodes store and retrieve blocks when they are told to, by the client or the
   name node. They report back to the name node periodically with lists of the blocks that they
   are storing. The data node, being commodity hardware, also does the work of block creation,
   deletion, and replication as instructed by the name node.
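
The following is a minimal, illustrative sketch (not part of the original notes) showing how a
client can see the block metadata that the name node manages, using the public Hadoop Java API.
The file path is a placeholder, and the connection details are assumed to come from the client's
configuration files.

// Sketch: querying the block metadata tracked by the NameNode for one file.
// The path "/data/sample.txt" is a hypothetical example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // connects to the default filesystem (HDFS)

        Path file = new Path("/data/sample.txt");
        FileStatus status = fs.getFileStatus(file); // metadata served by the NameNode

        // One BlockLocation per block, listing the DataNodes that hold a replica of it.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
        }
        fs.close();
    }
}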

[Figures (not reproduced here): HDFS NameNode/DataNode architecture; HDFS read operation; HDFS write operation]


Since all the metadata is stored in the name node, it is very important. If it fails, the filesystem
cannot be used, as there would be no way of knowing how to reconstruct the files from the blocks
present on the data nodes. To overcome this, the concept of the secondary name node arises.

Secondary Name Node: It is a separate physical machine which acts as a helper to the name node.
It performs periodic checkpoints: it communicates with the name node and takes snapshots of the
metadata, which helps minimize downtime and data loss.

Benefits of HDFS:
As an open source subproject within Hadoop, HDFS offers five core benefits when dealing with
big data:

• Fault tolerance: HDFS has been designed to detect faults and recover automatically and
  quickly, ensuring continuity and reliability.
• Speed: because of its cluster architecture, it can sustain data rates of around 2 GB per second.
• Access to more types of data, specifically streaming data: because it is designed to handle
  large amounts of data for batch processing, it allows high data throughput rates, making it
  well suited to streaming data.
• Compatibility and portability: HDFS is designed to be portable across a variety of hardware
  setups and compatible with several underlying operating systems, ultimately giving users the
  option to run HDFS with their own tailored setup.
• Scalable: you can scale resources according to the size of your file system. HDFS includes
  both vertical and horizontal scalability mechanisms.

These are areas where HDFS is not a good fit today (Challenges):
Low-latency data access: Applications that require low-latency access to data, in the tens of
milliseconds range, will not work well with HDFS.

Lots of small files: Since the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on the namenode.
Multiple writers, arbitrary file modifications: Files in HDFS may be written to by a single
writer. Writes are always made at the end of the file. There is no support for multiple writers, or
for modifications at arbitrary offsets in the file.

File Sizes, Block Sizes and Block Abstraction in HDFS:


Hadoop is known for its reliable storage. Hadoop HDFS can store data of any size and format.
HDFS has the concept of a block, but it is a much larger unit than a disk block: 128 MB by
default in current releases (64 MB in older Hadoop 1.x releases). Files in HDFS are broken into
block-sized chunks called data blocks, and these blocks are stored as independent units. This
design gives Hadoop HDFS a number of advantages.
Hadoop distributes these blocks across different slave machines, and the master machine stores
the metadata about block locations. All the blocks of a file are of the same size except the last
one (if the file size is not an exact multiple of the block size).

The block size can be configured as required by changing the dfs.blocksize property (historically
dfs.block.size) in hdfs-site.xml, or per file through the Java API, as sketched below.
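
A minimal, hedged sketch of the per-file override through the Java API is shown below; the 256 MB
block size, the path, and the replication factor are assumptions chosen only for illustration.

// Sketch: creating a file with a custom block size via FileSystem.create().
// Values below are illustrative, not recommendations.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 256L * 1024 * 1024; // 256 MB instead of the 128 MB default
        short replication = 3;               // the usual replication factor
        int bufferSize = 4096;

        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(new Path("/data/big-file.bin"),
                true, bufferSize, replication, blockSize);
        out.write("example payload".getBytes("UTF-8"));
        out.close();
        fs.close();
    }
}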

Suppose we have a file of size 612 MB, and we are using the default block configuration (128
MB). Therefore five blocks are created, the first four blocks are 128 MB in size, and the fifth
block is 100 MB in size (128*4+100=612).
From the above example, we can conclude that:

1. A file in HDFS that is smaller than a single block does not occupy a full block's worth of
space in the underlying storage.
2. A file stored in HDFS does not need to be an exact multiple of the configured block size.
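
As a quick check of the arithmetic above, the following self-contained snippet (plain Java, no
Hadoop dependency; the sizes are simply those from the example) computes the block count and the
size of the last block.

// Block arithmetic for a 612 MB file with a 128 MB block size.
public class BlockMath {
    public static void main(String[] args) {
        long fileSize = 612L * 1024 * 1024;  // 612 MB
        long blockSize = 128L * 1024 * 1024; // default HDFS block size

        long fullBlocks = fileSize / blockSize;                    // 4 full blocks of 128 MB
        long lastBlockMb = (fileSize % blockSize) / (1024 * 1024); // 100 MB remainder
        long totalBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0);

        System.out.println("total blocks    = " + totalBlocks);  // 5
        System.out.println("last block (MB) = " + lastBlockMb);  // 100
    }
}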
Having a block abstraction for a distributed filesystem brings several
benefits:
In HDFS the abstraction is made over the blocks of a file rather than over a single file, which
simplifies the storage subsystem. Since the size of the blocks is fixed, it is easy to manage and
calculate how many blocks can be stored on a single disk.
The first benefit:
A file can be larger than any single disk in the network. There’s nothing that requires the blocks
from a file to be stored on the same disk, so they can take advantage of any of the disks in the
cluster.
Second:
Making the unit of abstraction a block rather than a file simplifies the storage subsystem. The
storage subsystem deals with blocks, simplifying storage management (since blocks are a fixed
size, it is easy to calculate how many can be stored on a given disk) and eliminating metadata
concerns.
Third:
Blocks fit well with replication for providing fault tolerance and availability. To insure against
corrupted blocks and against disk and machine failure, each block is replicated to a small number
of physically separate machines (typically three).

Why Is a Block in HDFS So Large?

HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks.
By making a block large enough, the time to transfer the data from the disk can be made
significantly larger than the time to seek to the start of the block. Thus the time to transfer a
large file made of multiple blocks operates at the disk transfer rate. For example, if the seek
time is around 10 ms and the transfer rate is 100 MB/s, then keeping the seek time to roughly 1%
of the transfer time requires a block size of around 100 MB, which is why the default is of that
order.

Data Replication:
In a Hadoop HDFS cluster, data replication is a crucial aspect of ensuring fault tolerance and
data durability. Let's analyse each possible approach to data replication and determine the most
plausible one:

1. Client to Master:
In this approach, the client would send the data to the master node, and the master node
would be responsible for replicating the data to other datanodes. However, this approach is
not commonly used in Hadoop HDFS because it introduces a single point of failure. If the
master node fails, it can result in data loss and disrupt the replication process.
2. Client sending data to each datanode:
In this approach, the client directly sends the data to each individual datanode in the cluster.
While this method would ensure replication, it requires the client to have knowledge of all
the datanodes’ IP addresses and manually handle the replication process. This approach is
not practical as it adds complexity and overhead to the client’s responsibilities.
3. Client copying data first in one datanode and then that datanode copying it to another
datanode:
This approach involves the client sending the data to one datanode, and then that datanode
takes the responsibility of copying the data to other datanodes in the cluster. This is the
most plausible and widely used approach for data replication in Hadoop HDFS.

In Hadoop HDFS, the responsibility for data replication lies with the datanodes themselves.
When the client sends data to a particular datanode, that datanode replicates the data to other
datanodes based on the replication factor specified in the Hadoop configuration. The replication
process is handled internally by the datanodes, leveraging the cluster’s distributed architecture
and the HDFS replication pipeline.

By distributing the replication responsibility among the datanodes, Hadoop HDFS achieves fault
tolerance and data reliability. If a datanode fails, the replicated copies stored on other datanodes
ensure that the data remains accessible. Additionally, HDFS’s block placement policy ensures
that replicated blocks are distributed across different racks and nodes, further enhancing fault
tolerance and data availability.

In conclusion, the most plausible approach for data replication in Hadoop HDFS is for the client
to send the data to one datanode, and then the datanodes handle the replication process internally.
This approach leverages the distributed architecture of Hadoop and ensures fault tolerance and
data durability without placing an additional burden on the client.
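
As an illustrative sketch (not from the notes), the replication factor of an existing file can be
inspected and changed through the FileSystem API; the file path and the target factor of 4 are
assumptions made for this example, and the cluster-wide default comes from the dfs.replication
property in hdfs-site.xml.

// Sketch: inspecting and raising the replication factor of one file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/important.csv"); // hypothetical file
        FileStatus before = fs.getFileStatus(file);
        System.out.println("current replication = " + before.getReplication());

        // Ask the NameNode for more replicas; the DataNodes copy the extra
        // replicas among themselves through the replication pipeline.
        fs.setReplication(file, (short) 4);
        fs.close();
    }
}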

How does HDFS Store, Read, & Write files:


An HDFS cluster primarily consists of a NameNode, which manages the filesystem metadata, and
DataNodes, which store the actual data.

• NameNode: The NameNode can be considered the master of the system. It maintains the filesystem
  tree and the metadata for all the files and directories present in the system. Two files, the
  'namespace image' and the 'edit log', are used to store metadata information. The namenode has
  knowledge of all the datanodes containing data blocks for a given file; however, it does not
  store block locations persistently. This information is reconstructed from the datanodes every
  time the system starts.
• DataNode: DataNodes are slaves which reside on each machine in a cluster and provide the actual
  storage. They are responsible for serving read and write requests from clients.

Read/write operations in HDFS operate at the block level. Data files in HDFS are broken into
block-sized chunks, which are stored as independent units. The default block size is 128 MB
(64 MB in older releases).

HDFS operates on a concept of data replication wherein multiple replicas of data blocks are
created and are distributed on nodes throughout a cluster to enable high availability of data in
the event of node failure.
Store Operation in HDFS:

As we now know, HDFS data is stored in something called blocks. These blocks are the smallest
unit of data that the file system can store. Files are processed and broken down into these blocks,
which are then taken and distributed across the cluster - and also replicated for safety.

Read Operation in HDFS:

Data read requests are served by HDFS, the NameNode, and the DataNodes. Let's call the reader a
'client'. The diagram referenced earlier depicts the file read operation in Hadoop.

1. A client initiates the read request by calling the 'open()' method of the FileSystem object;
   it is an object of type DistributedFileSystem.
2. This object connects to the namenode using RPC and gets metadata information such as the
   locations of the blocks of the file. Note that these addresses are of the first few blocks
   of the file.
3. In response to this metadata request, the addresses of the DataNodes having a copy of each
   block are returned.
4. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is
   returned to the client. FSDataInputStream contains DFSInputStream, which takes care of
   interactions with the DataNodes and the NameNode. In step 4 shown in the diagram, the client
   invokes the 'read()' method, which causes DFSInputStream to establish a connection with the
   first DataNode holding the first block of the file.
5. Data is read in the form of streams, with the client invoking the 'read()' method repeatedly.
   This read() process continues until it reaches the end of the block.
6. Once the end of a block is reached, DFSInputStream closes the connection and moves on to
   locate the next DataNode for the next block.
7. Once the client is done reading, it calls the close() method. (A minimal Java sketch of this
   sequence follows.)
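
The following is a minimal sketch of this read path using the public FileSystem API; the input
path is a placeholder and the connection settings are assumed to come from the client's
configuration files.

// Sketch of the read path: open() consults the NameNode, and repeated reads
// stream the blocks from the DataNodes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf); // DistributedFileSystem for an HDFS URI

        FSDataInputStream in = null;
        try {
            in = fs.open(new Path("/data/input.txt"));      // step 1: open()
            IOUtils.copyBytes(in, System.out, 4096, false); // steps 4-6: repeated read() calls
        } finally {
            IOUtils.closeStream(in);                        // step 7: close()
            fs.close();
        }
    }
}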
Write Operation In HDFS:

1. A client initiates the write operation by calling the 'create()' method of the
   DistributedFileSystem object, which creates a new file (step 1 in the diagram).
2. The DistributedFileSystem object connects to the NameNode using an RPC call and initiates the
   new file creation. However, this file-creation operation does not associate any blocks with
   the file. It is the responsibility of the NameNode to verify that the file being created does
   not already exist and that the client has the correct permissions to create a new file. If the
   file already exists or the client does not have sufficient permission to create a new file,
   then an IOException is thrown to the client. Otherwise, the operation succeeds and a new
   record for the file is created by the NameNode.
3. Once a new record in NameNode is created, an object of type FSDataOutputStream is
returned to the client. A client uses it to write data into the HDFS. Data write method is
invoked (step 3 in the diagram).
4. FSDataOutputStream contains DFSOutputStream object which looks after
communication with DataNodes and NameNode. While the client continues writing
data, DFSOutputStream continues creating packets with this data. These packets are
enqueued into a queue called the DataQueue.
5. There is one more component called DataStreamer which consumes this DataQueue.
DataStreamer also asks NameNode for allocation of new blocks thereby picking desirable
DataNodes to be used for replication.
6. Now, the process of replication starts by creating a pipeline using DataNodes. In our case,
we have chosen a replication level of 3 and hence there are 3 DataNodes in the pipeline.
7. The DataStreamer pours packets into the first DataNode in the pipeline.
8. Every DataNode in the pipeline stores each packet it receives and forwards it to the next
DataNode in the pipeline.
9. Another queue, ‘Ack Queue’ is maintained by DFSOutputStream to store packets which
are waiting for acknowledgment from DataNodes.
10. Once acknowledgment for a packet in the queue is received from all DataNodes in the
pipeline, it is removed from the ‘Ack Queue’. In the event of any DataNode failure,
packets from this queue are used to reinitiate the operation.
11. After the client is done writing data, it calls the close() method (step 9 in the diagram).
The call to close() flushes the remaining data packets to the pipeline and then waits for
acknowledgments.
12. Once the final acknowledgment is received, the NameNode is contacted to tell it that the file
write operation is complete. (A minimal Java sketch of this write sequence follows.)
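
A correspondingly minimal sketch of the write path is shown below; the destination path and
payload are placeholders, and error handling is omitted for brevity.

// Sketch of the write path: create() registers the file with the NameNode,
// and the returned stream feeds the DataNode replication pipeline.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        FSDataOutputStream out = fs.create(new Path("/data/output.txt"), true); // steps 1-2
        out.writeBytes("hello hdfs\n"); // steps 3-10: packets queued and pushed down the pipeline
        out.close();                    // steps 11-12: flush, wait for acks, notify the NameNode
        fs.close();
    }
}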

JAVA Interfaces to HDFS:

Hadoop has an abstract notion of filesystems, of which HDFS is just one implementation.
The Java abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a
filesystem in Hadoop, and there are several concrete implementations. Hadoop is written in
Java, so most Hadoop filesystem interactions are mediated through the Java API. The
filesystem shell, for example, is a Java application that uses the Java FileSystem class to
provide filesystem operations.

By exposing its filesystem interface as a Java API, Hadoop makes it awkward for non-Java
applications to access HDFS. The HTTP REST API exposed by the WebHDFS protocol makes it easier
for other languages to interact with HDFS. Note that the HTTP interface is slower than the native
Java client, so it should be avoided for very large data transfers if possible.
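
As a small illustrative sketch of this abstract FileSystem interface, the code below lists a
directory on HDFS; the namenode host, port, and directory are placeholders, and changing the URI
scheme would point the same code at a different filesystem implementation.

// Sketch: the abstract FileSystem API used against an explicit HDFS URI.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListDirectory {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf); // placeholder URI

        for (FileStatus status : fs.listStatus(new Path("/user"))) {
            System.out.printf("%s\t%d bytes\treplication=%d%n",
                    status.getPath(), status.getLen(), status.getReplication());
        }
        fs.close();
    }
}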
