UNIT 3 HDFS, Hadoop Environment Part 1
HDFS: When a dataset outgrows the storage capacity of a single physical machine, it becomes
necessary to partition it across a number of separate machines. Filesystems that manage the
storage across a network of machines are called distributed filesystems.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed
Filesystem. HDFS is designed around the following assumptions:
Very large files: “Very large” in this context means files that are hundreds of megabytes,
gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of
data.
Streaming data access: HDFS is built around the idea that the most efficient data processing
pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from
source, then various analyses are performed on that dataset over time.
Commodity hardware: Hadoop doesn't require expensive, highly reliable hardware to run on.
It's designed to run on clusters of commodity hardware (commonly available hardware from
multiple vendors) for which the chance of node failure across the cluster is high, at least for
large clusters. HDFS is designed to carry on working without a noticeable interruption to the
user in the face of such failure.
HDFS Concepts:
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks
are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized
chunks, which are stored as independent units. Unlike a regular filesystem, a file in HDFS
that is smaller than the block size does not occupy a full block's worth of storage; e.g., a
5 MB file stored in HDFS with a 128 MB block size takes only 5 MB of space. The HDFS
block size is large in order to minimize the cost of seeks (a small sketch after this list
shows how to query these settings programmatically).
2. Name Node: HDFS works in a master-worker pattern, where the name node acts as the master.
The name node is the controller and manager of HDFS, as it knows the status and the metadata
of all the files in HDFS; the metadata consists of file permissions, names, and the location
of each block. The metadata is small, so it is stored in the memory of the name node,
allowing faster access to it. Moreover, since the HDFS cluster is accessed by multiple
clients concurrently, all this information is handled by a single machine. Filesystem
operations such as opening, closing, and renaming are executed by the name node.
3. Data Node: Data nodes store and retrieve blocks when they are told to by clients or the
name node. They report back to the name node periodically with a list of the blocks they
are storing. The data nodes, being commodity hardware, also do the work of block creation,
deletion, and replication as directed by the name node.
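As referenced in point 1 above, here is a minimal sketch of how a client can query the block
size and replication factor a cluster applies to new files, using the Hadoop FileSystem API.
The NameNode address hdfs://namenode:9000 and the path are hypothetical placeholders.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSettings {
        public static void main(String[] args) throws Exception {
            // Hypothetical NameNode address; substitute your cluster's fs.defaultFS.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

            Path path = new Path("/");
            // Block size and replication factor the cluster would apply to new
            // files written under this path (128 MB and 3 by default).
            System.out.println("Default block size : " + fs.getDefaultBlockSize(path) + " bytes");
            System.out.println("Default replication: " + fs.getDefaultReplication(path));

            fs.close();
        }
    }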
Secondary Name Node: It is a separate physical machine that acts as a helper to the name node.
It performs periodic checkpoints: it communicates with the name node and takes snapshots of the
metadata, which helps minimize downtime and data loss.
Benefits of HDFS:
As an open source subproject within Hadoop, HDFS offers five core benefits when dealing with
big data:
Fault Tolerance: HDFS has been designed to detect faults and automatically recover
quickly, ensuring continuity and reliability.
Speed: Because of its cluster architecture, HDFS can sustain data transfer rates of around
2 GB per second.
Access to more types of data: Specifically streaming data. Because it is designed to
handle large amounts of data for batch processing, HDFS allows high data throughput rates,
making it well suited to streaming data.
Compatibility & Portability: HDFS is designed to be portable across a variety of
hardware setups and compatible with several underlying operating systems, ultimately
giving users the option of running HDFS on their own tailored setup.
Scalable: You can scale resources according to the size of your file system. HDFS
includes both vertical and horizontal scalability mechanisms.
These are areas where HDFS is not a good fit today (Challenges):
Low-latency data access: Applications that require low-latency access to data, in the tens of
milliseconds range, will not work well with HDFS.
Lots of small files: Since the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on the namenode.
Multiple writers, arbitrary file modifications: Files in HDFS may be written to by a single
writer. Writes are always made at the end of the file. There is no support for multiple writers, or
for modifications at arbitrary offsets in the file.
All the blocks of a file are of the same size except the last one (if the file size is not a
multiple of 128 MB).
Suppose we have a file of size 612 MB, and we are using the default block configuration (128
MB). Five blocks are therefore created: the first four blocks are 128 MB in size, and the fifth
block is 100 MB in size (128*4 + 100 = 612 MB).
From the above example, we can conclude that:
1. A file in HDFS that is smaller than a single block does not occupy a full block's worth of
space in the underlying storage.
2. Each file stored in HDFS doesn’t need to be an exact multiple of the configured block size.
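To make the arithmetic concrete, here is a minimal sketch (plain Java, no Hadoop dependency)
that splits the hypothetical 612 MB file from the example above into 128 MB blocks.

    public class BlockSplit {
        public static void main(String[] args) {
            final long MB = 1024L * 1024L;
            long fileSize = 612 * MB;    // example file from the text
            long blockSize = 128 * MB;   // default HDFS block size

            // Number of blocks, rounding up for the partial last block.
            long fullBlocks = fileSize / blockSize;
            long remainder = fileSize % blockSize;
            long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);

            System.out.println("Total blocks      : " + totalBlocks);            // 5
            System.out.println("Full 128 MB blocks: " + fullBlocks);             // 4
            System.out.println("Last block size   : " + remainder / MB + " MB"); // 100 MB
        }
    }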
Having a block abstraction for a distributed filesystem brings several
benefits:
In HDFS the abstraction is made over the blocks of a file rather than the file as a whole, which
simplifies the storage subsystem. Since the size of the blocks is fixed, it is easy to manage and
calculate how many blocks can be stored on a single disk.
The first benefit:
A file can be larger than any single disk in the network. There’s nothing that requires the blocks
from a file to be stored on the same disk, so they can take advantage of any of the disks in the
cluster.
Second:
Making the unit of abstraction a block rather than a file simplifies the storage subsystem. The
storage subsystem deals with blocks, simplifying storage management (since blocks are a fixed
size, it is easy to calculate how many can be stored on a given disk) and eliminating metadata
concerns.
Third:
Blocks fit well with replication for providing fault tolerance and availability. To insure against
corrupted blocks and disk and machine failure, each block is replicated to a small number of
physically separate machines (typically three).
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks.
By making a block large enough, the time to transfer the data from the disk can be made to be
significantly larger than the time to seek to the start of the block. Thus the time to transfer a
large file made of multiple blocks operates at the disk transfer rate.
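As a back-of-the-envelope illustration, the sketch below assumes a 10 ms seek time and a
100 MB/s transfer rate (typical textbook figures, not measured values) and computes what
fraction of the total block access time the seek represents for a 128 MB block.

    public class SeekOverhead {
        public static void main(String[] args) {
            double seekTimeMs = 10.0;       // assumed average seek time
            double transferRateMBs = 100.0; // assumed sustained transfer rate
            double blockSizeMB = 128.0;     // default HDFS block size

            double transferTimeMs = blockSizeMB / transferRateMBs * 1000.0; // 1280 ms
            double seekFraction = seekTimeMs / (seekTimeMs + transferTimeMs);

            // With these figures the seek is under 1% of the total access time.
            System.out.printf("Transfer time: %.0f ms, seek overhead: %.2f%%%n",
                    transferTimeMs, seekFraction * 100);
        }
    }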
Data Replication:
In a Hadoop HDFS cluster, data replication is a crucial aspect of ensuring fault tolerance and
data durability. Let's analyse each possible approach to data replication and determine the
most plausible one:
1. Client to Master:
In this approach, the client would send the data to the master node, and the master node
would be responsible for replicating the data to other datanodes. However, this approach is
not commonly used in Hadoop HDFS because it introduces a single point of failure. If the
master node fails, it can result in data loss and disrupt the replication process.
2. Client sending data to each datanode:
In this approach, the client directly sends the data to each individual datanode in the cluster.
While this method would ensure replication, it requires the client to have knowledge of all
the datanodes’ IP addresses and manually handle the replication process. This approach is
not practical as it adds complexity and overhead to the client’s responsibilities.
3. Client copying data first in one datanode and then that datanode copying it to another
datanode:
This approach involves the client sending the data to one datanode, and then that datanode
takes the responsibility of copying the data to other datanodes in the cluster. This is the
most plausible and widely used approach for data replication in Hadoop HDFS.
In Hadoop HDFS, the responsibility for data replication lies with the datanodes themselves.
When the client sends data to a particular datanode, that datanode replicates the data to other
datanodes based on the replication factor specified in the Hadoop configuration. The replication
process is handled internally by the datanodes, leveraging the cluster’s distributed architecture
and the HDFS replication pipeline.
By distributing the replication responsibility among the datanodes, Hadoop HDFS achieves fault
tolerance and data reliability. If a datanode fails, the replicated copies stored on other datanodes
ensure that the data remains accessible. Additionally, HDFS’s block placement policy ensures
that replicated blocks are distributed across different racks and nodes, further enhancing fault
tolerance and data availability.
In conclusion, the most plausible approach for data replication in Hadoop HDFS is for the client
to send the data to one datanode, and then for the datanodes to handle the replication process
internally. This approach leverages the distributed architecture of Hadoop and ensures fault
tolerance and data durability without placing an additional burden on the client.
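The sketch below illustrates this from the client's side: the client simply creates the file
through the FileSystem API (here with an explicit replication factor of 3), and the datanodes
and the write pipeline take care of placing the replicas. The NameNode URI and file path are
hypothetical.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicatedWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical cluster address; in practice taken from core-site.xml.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

            Path file = new Path("/user/demo/sample.txt");
            // create(path, overwrite, bufferSize, replication, blockSize):
            // the client only states the desired replication factor; the
            // datanodes replicate each block through the write pipeline.
            try (FSDataOutputStream out =
                     fs.create(file, true, 4096, (short) 3, 128 * 1024 * 1024L)) {
                out.writeUTF("hello hdfs");
            }

            fs.close();
        }
    }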
NameNode: The NameNode can be considered the master of the system. It maintains the file
system tree and the metadata for all the files and directories present in the system. Two
files, the 'namespace image' and the 'edit log', are used to store the metadata information.
The NameNode has knowledge of all the datanodes containing data blocks for a given file;
however, it does not store block locations persistently. This information is reconstructed
from the datanodes every time the system starts.
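Here is a hedged sketch of how a client can ask for this block-location metadata (served by the
NameNode) through the FileSystem API; the cluster address and file path are assumptions.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

            // Hypothetical file; each entry below corresponds to one block.
            FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt"));
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
            }

            fs.close();
        }
    }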
DataNode: DataNodes are slaves which reside on each machine in a cluster and provide
the actual storage. They are responsible for serving read and write requests from clients.
Read/write operations in HDFS operate at the block level. Data files in HDFS are broken into
block-sized chunks, which are stored as independent units. The default block size is 128 MB
(older Hadoop 1.x releases used 64 MB).
HDFS operates on a concept of data replication wherein multiple replicas of data blocks are
created and are distributed on nodes throughout a cluster to enable high availability of data in
the event of node failure.
Store Operation in HDFS:
As we now know, HDFS data is stored in something called blocks. These blocks are the smallest
unit of data that the file system can store. Files are processed and broken down into these blocks,
which are then taken and distributed across the cluster - and also replicated for safety.
A data read request is served by HDFS through the NameNode and the DataNodes. Let's call the
reader a 'client'.
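For example, a client reads a file by asking the NameNode for the file's blocks and then
streaming them from the DataNodes. A minimal read sketch using the FileSystem API follows;
the NameNode address and file path are hypothetical.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            // Hypothetical path; the NameNode supplies the block locations and
            // the client then streams the data directly from the DataNodes.
            String uri = "hdfs://namenode:9000/user/demo/sample.txt";
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);

            try (FSDataInputStream in = fs.open(new Path(uri))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }

            fs.close();
        }
    }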
Hadoop has an abstract notion of filesystems, of which HDFS is just one implementation.
The Java abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a
filesystem in Hadoop, and there are several concrete implementations. Hadoop is written in
Java, so most Hadoop filesystem interactions are mediated through the Java API. The
filesystem shell, for example, is a Java application that uses the Java FileSystem class to
provide filesystem operations. By exposing its filesystem interface as a Java API, Hadoop
makes it awkward for non-Java applications to access HDFS. The HTTP REST API exposed
by the WebHDFS protocol makes it easier for other languages to interact with HDFS. Note
that the HTTP interface is slower than the native Java client, so it should be avoided for very
large data transfers if possible.
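Because FileSystem is an abstract class, the same client code can be pointed at different
implementations just by changing the URI scheme. The sketch below lists a directory and would
work with hdfs://, webhdfs:// or file:// URIs; the host names and the WebHDFS port shown in the
comments are assumptions for illustration.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListDir {
        public static void main(String[] args) throws Exception {
            // Any of these schemes resolves to a concrete FileSystem implementation:
            //   hdfs://namenode:9000/user/demo      (native Java client)
            //   webhdfs://namenode:9870/user/demo   (HTTP REST via WebHDFS)
            //   file:///tmp                         (local filesystem)
            String uri = args.length > 0 ? args[0] : "file:///tmp";

            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);

            for (FileStatus status : fs.listStatus(new Path(uri))) {
                System.out.println(status.getPath() + "\t" + status.getLen());
            }

            fs.close();
        }
    }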