BCS061 Notes Unit 3
The Hadoop Distributed File System (HDFS) was developed using distributed file system design. It runs on
commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and
designed using low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge data,
the files are stored across multiple machines. These files are stored in a redundant fashion to
rescue the system from possible data loss in case of failure. HDFS also makes applications
available for parallel processing.
Features of HDFS
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity
hardware components, failure of components is frequent. Therefore, HDFS should have mechanisms for
quick and automatic fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage
applications having huge datasets.
Hardware at data − A requested task can be done efficiently when the computation takes
place near the data. Especially where huge datasets are involved, this reduces network
traffic and increases throughput.
HDFS Architecture
Given below is the architecture of a Hadoop File System.
HDFS follows the master-slave architecture and it has the following elements.
NameNode
The NameNode is the single point of contact for accessing files in HDFS, and it determines the
block IDs and locations for data access. The NameNode therefore plays the master role in the master/slave
architecture, whereas the DataNodes act as slaves. File system metadata is stored on the NameNode.
File system metadata mainly contains file names, file permissions and the locations of each
block of a file. The metadata is therefore relatively small and fits into the main memory of a
machine, so it is stored in the main memory of the NameNode to allow fast access.
FsImage: A file on the NameNode's local file system containing the entire HDFS file system
namespace (including the mapping of blocks to files and file system properties).
EditLog: A transaction log residing on the NameNode's local file system that contains a
record/entry for every change that occurs to the file system metadata.
DataNode
DataNodes are the slave part of the master/slave architecture; actual HDFS files are stored on
them in the form of fixed-size chunks of data called blocks. DataNodes serve read and write
requests from clients on HDFS files and also perform block creation, replication and deletion.
DataNode Failure Recovery
Each DataNode in the cluster periodically sends a heartbeat message to the NameNode, which
the NameNode uses to detect DataNode failures based on missing heartbeats.
The name node marks data nodes without recent heartbeats as dead, and does not dispatch
any new I/O requests to them. Because data located at a dead data node is no longer available
to HDFS, data node death may cause the replication factor of some blocks to fall below their
specified values. The name node constantly tracks which blocks must be re-replicated, and
initiates replication whenever necessary.
Thus, all blocks on a dead data node are re-replicated on other live data nodes and the
replication factor returns to normal.
HDFS Blocks
HDFS is a block-structured file system. Each HDFS file is broken into blocks of a fixed size,
usually 128 MB, which are stored across various data nodes in the cluster. Each of these
blocks is stored as a separate file on the local file system of the data nodes (commodity machines in
the cluster).
Thus, to access a file on HDFS, multiple data nodes need to be referenced, and the list of
data nodes that need to be accessed is determined by the file system metadata stored on the
NameNode.
So, any HDFS client trying to access/read an HDFS file will first get block information from the
NameNode, and then, based on the block IDs and locations, data will be read from the
corresponding data nodes in the cluster. HDFS's fsck command is useful for getting the file
and block details of the file system:
$ hadoop fsck / -files -blocks
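As a quick illustration of this read path (the file path below is hypothetical), a client can read a file and inspect where its blocks and replicas are placed:
$ hdfs dfs -cat /user/hadoop/sample.txt # the NameNode supplies block locations; the data itself is streamed from DataNodes
$ hdfs fsck /user/hadoop/sample.txt -files -blocks -locations # lists each block and the DataNodes holding its replicas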
Backup Node
In Hadoop, Backup node keeps an in-memory, up-to-date copy of the file system namespace.
It is always synchronized with the active NameNode state.
Replication Management
Block replication provides fault tolerance. If one copy is not accessible or is corrupted, we can
read the data from another copy. The number of copies or replicas of each block of a file in the HDFS
architecture is called the replication factor.
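For example (the path is hypothetical), the replication factor of an existing file can be changed from the command line:
$ hdfs dfs -setrep -w 2 /user/hadoop/sample.txt # -w waits until every block actually reaches the new replication factor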
Rack Awareness
Rack awareness in Hadoop is the concept of choosing DataNodes based on rack
information.
The NameNode achieves rack awareness by maintaining the rack ID of each DataNode.
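An administrator can inspect the resulting rack mapping from the command line (this requires HDFS superuser privileges):
$ hdfs dfsadmin -printTopology # prints the rack IDs and the DataNodes assigned to each rack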
What is HAR?
Hadoop Archives (HAR) is an archiving facility that packs files into HDFS blocks
efficiently; hence, HAR can be used to tackle the small-files problem in Hadoop.
A HAR file is created from a collection of files, and the archiving tool (a simple command) runs
a MapReduce job to process the input files in parallel and create the archive file.
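A minimal sketch of creating and browsing an archive (the input and output directories are hypothetical):
$ hadoop archive -archiveName files.har -p /user/hadoop/input /user/hadoop/output # runs a MapReduce job to build the archive
$ hdfs dfs -ls -R har:///user/hadoop/output/files.har # the archive is browsed through the har:// scheme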
HADOOP I/O
Like any I/O subsystem, Hadoop comes with a set of I/O primitives. These primitive
considerations, although generic in nature, also apply to the Hadoop I/O system, with some
special connotations of course. Hadoop deals with multi-terabyte datasets, so a special
consideration of these primitives gives an idea of how Hadoop handles data input and output.
This section quickly skims over these primitives to give a perspective on the Hadoop
input/output system.
Compression:
Keeping in mind the volume of data Hadoop deals with, compression is not a luxury
but a requirement.
There are many obvious benefits of file compression when it is rightly used by Hadoop:
it economizes storage requirements and is a must-have capability to speed up data
transmission over the network and to and from disks.
There are many compression tools, techniques and algorithms commonly used by Hadoop.
Many of them are quite popular and have been used for file compression for ages;
for example, gzip, bzip2, LZO and zip are often used.
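As a quick illustration (the file path is hypothetical), the available native codecs can be checked, and a gzip-compressed file stored in HDFS can be viewed directly:
$ hadoop checknative -a # reports whether native zlib, snappy, lz4 and bzip2 libraries are available
$ hdfs dfs -text /logs/app.log.gz # -text picks a codec from the file extension and decompresses while reading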
Serialization:
The process that turns structured objects into a stream of bytes is called serialization.
This is specifically required for transmitting data over the network or persisting raw
data on disk.
Deserialization is just the reverse process, where a stream of bytes is transformed back into
a structured object.
This is required to reconstruct objects from the raw bytes.
Therefore, it is not surprising that distributed computing uses serialization in a couple of
distinct areas: inter-process communication and data persistence.
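As a small illustration (the path is hypothetical), Hadoop's own serialized container format, the SequenceFile, can be inspected from the command line; -text deserializes the keys and values for display, whereas -cat would show only the raw serialized bytes:
$ hdfs dfs -text /data/events.seq # deserializes the SequenceFile's keys and values into readable text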
When data is ingested in batches, data items are imported in discrete chunks from the source at intervals, whereas streaming ingestion moves data continuously as it is generated.
Apache Flume:
Apache Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data into HDFS.
It has a simple and flexible architecture based on streaming data flows; it is robust and
fault tolerant, with tunable reliability mechanisms and many failover and recovery
mechanisms.
It uses a simple, extensible data model that allows for online analytic applications.
Flume employs the familiar producer-consumer model. A Source is the entity through
which data enters Flume. Sources either actively poll for data or passively wait for data
to be delivered to them. A Sink, on the other hand, is the entity that delivers the data to the
destination. Flume has many built-in sources (e.g. log4j and syslog) and sinks (e.g. HDFS
and HBase). A Channel is the conduit between the Source and the Sink.
Sources ingest events into the channel and the sinks drain the channel. Channels allow
decoupling of ingestion rate from drain rate. When data are generated faster than what the
destination can handle, the channel size increases.
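A minimal sketch of starting a Flume agent from the command line (the agent name a1 and the configuration file path are hypothetical; the configuration file must define a1's sources, channels and sinks):
$ flume-ng agent --name a1 --conf conf --conf-file /etc/flume/conf/example.conf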
A service for streaming logs into Hadoop
Flume lets Hadoop users ingest high-volume streaming data into HDFS for storage.
Specifically, Flume allows users to:
Stream data
Ingest streaming data from multiple sources into Hadoop for storage and analysis
Insulate systems
Buffer the storage platform from transient spikes, when the rate of incoming data exceeds
the rate at which data can be written to the destination
Guarantee data delivery
Flume NG uses channel-based transactions to guarantee reliable message delivery. When a
message moves from one agent to another, two transactions are started: one on the agent that
delivers the event and the other on the agent that receives the event. This ensures guaranteed
delivery semantics.
Scale horizontally
To ingest new data streams and additional volume as needed
Apache Sqoop:
Apache Sqoop is a tool designed to efficiently transfer data between Hadoop and relational
databases. We can use Sqoop to import data from a relational database table into HDFS. The
import process is performed in parallel and thus generates multiple files in the format of
delimited text, Avro, or SequenceFile. In addition, Sqoop generates a Java class that
encapsulates one row of the imported table, which can be used in subsequent MapReduce
processing of the data. Moreover, Sqoop can export the data (e.g. the results of MapReduce
processing) back to the relational database for consumption by external applications or users.
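A hedged sketch of a typical import and export (the connection URL, credentials, table names and HDFS paths are hypothetical):
$ sqoop import --connect jdbc:mysql://dbhost/sales --username etl -P --table orders --target-dir /user/hadoop/orders -m 4
$ sqoop export --connect jdbc:mysql://dbhost/sales --username etl -P --table order_summary --export-dir /user/hadoop/summary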
Hadoop Security:
Hadoop security is generally defined as a procedure to secure the Hadoop data storage unit by
offering a virtually impenetrable wall of security against any potential cyber threat. Hadoop attains
this high-calibre security by following the security protocol below.
Authentication
Authorization
Auditing
Authentication:
Authentication is the first stage, where the user's credentials are verified. The credentials
typically include the user's dedicated username and a secret password. The entered credentials
are checked against the details available in the security database. If they are valid, the user is
authenticated.
Authorization :
Authorization is the second stage, where the system decides whether to grant the user
permission to access the data. It is based on a predefined access control
list. Confidential information is kept secure, and only authorized personnel can access it.
Auditing:
Auditing is the last stage; it keeps track of the operations performed by the
authenticated user during the period in which they were logged into the cluster.
This is done solely for security purposes.
Types of Hadoop Security:
1)Kerberos Security:
Kerberos is one of the leading network authentication protocols, designed to provide
powerful authentication services to both the server and client ends through secret-key
cryptography. It is proven to be highly secure since it uses encrypted service tickets
throughout the entire session.
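As an illustration (the principal and realm are hypothetical), on a Kerberized cluster a user first obtains a ticket and can then work with HDFS:
$ kinit hdfsuser@EXAMPLE.COM # obtain a Kerberos ticket-granting ticket
$ klist # verify that the ticket was issued
$ hdfs dfs -ls /user/hdfsuser # HDFS requests are now authenticated with the Kerberos ticket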
Step 6:
Format the Hadoop file system. From the Hadoop directory, run the following:
bin/hadoop namenode -format
The default block size in HDFS is 128 MB (it can be configured). A larger block size minimizes
seek time and allows efficient storage of large files.
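For instance (the file and destination path are hypothetical, and this assumes the cluster honours the client-supplied setting, which is the usual behaviour), a file can be written with a non-default block size by overriding dfs.blocksize at upload time:
$ hdfs dfs -D dfs.blocksize=268435456 -put largefile.dat /user/hadoop/ # 268435456 bytes = 256 MB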
Data replication in HDFS ensures fault tolerance by storing multiple copies of each data block
(default: 3 copies) across different nodes to prevent data loss in case of node failure.
When reading, the client contacts the NameNode, retrieves block locations, and reads from
DataNodes.
When writing, data is first written to a temporary local file, then split into blocks and stored on
DataNodes with replication.
The NameNode manages metadata, including the file directory structure, block locations, and
access permissions. It does not store actual data.
Flume is a tool used for ingesting streaming data (e.g., log files, social media feeds) into
Hadoop for real-time processing.
HAR helps to reduce the number of small files in HDFS by grouping them into larger archive
files, improving storage efficiency and metadata management.
YARN (Yet Another Resource Negotiator) manages cluster resources and job scheduling,
allowing multiple applications to run simultaneously in Hadoop.
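For example, the nodes and applications that YARN is currently managing can be inspected with its command-line client:
$ yarn node -list # NodeManagers registered with the ResourceManager
$ yarn application -list # applications currently running on the cluster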
Hadoop security includes Kerberos authentication, HDFS file permissions, data encryption,
and access control lists (ACLs).
14. How does Hadoop in the cloud differ from on-premises Hadoop?
Hadoop in the cloud (e.g., AWS EMR, Azure HDInsight) provides scalability, reduced
hardware costs, and managed services, while on-premises Hadoop requires dedicated
infrastructure.
Sqoop is used for importing and exporting structured data (e.g., from MySQL, Oracle)
between HDFS and relational databases.
16. What is HDFS monitoring, and which tools are used for it?
HDFS monitoring ensures system health and performance using tools like JMX, Ambari, and
Ganglia to track resource usage and failures.
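As a simple illustration, basic HDFS health can also be checked from the command line:
$ hdfs dfsadmin -report # capacity, remaining space, and the list of live and dead DataNodes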
AKTU:
NameNode – The master node that manages metadata and controls access to files.
DataNode – The worker nodes that store actual data and handle read/write
requests.
2. Describe the concepts of file sizes, block sizes and block abstraction in
HDFS?
HDFS splits files into blocks and stores them across multiple DataNodes.
The NameNode maintains metadata about which DataNodes store each block.
Blocks are replicated (default: 3 copies) across different DataNodes for fault tolerance
and high availability.
When reading a file, the client retrieves blocks in parallel, improving performance.
Fault Tolerance: If a node fails, copies of the block exist on other nodes.
Parallel Processing: Different blocks of a file can be processed simultaneously.
Efficient Storage Management: Large files are managed efficiently, avoiding fragmentation.
3. What are the benefits and challenges of using HDFS for Distributed
storage and processing?
Benefits and Challenges of Using HDFS for Distributed Storage and Processing
Benefits of HDFS:
Scalability – HDFS can handle petabytes of data by adding more nodes to the cluster.
Fault Tolerance – Data is replicated (default 3 copies) across multiple nodes to prevent data
loss.
Optimized for Large Files – HDFS efficiently stores and processes large files by dividing them
into blocks.
Data Locality – Computation is moved closer to data, reducing network bandwidth usage.
Integration with Big Data Tools – Compatible with MapReduce, Spark, Hive, Pig, HBase,
etc.
Challenges of HDFS:
Not Suitable for Small Files – Large block size (128MB+) leads to inefficient storage for small
files.
High Latency for Random Reads – HDFS is optimized for sequential access; random access
is slow.
Metadata Overhead – The NameNode stores metadata in memory, which can become a
bottleneck.
AKTU Questions:
1. Discuss in brief the cluster specification. Describe how to set up a Hadoop
Cluster.
2. Explain the core concepts of HDFS, including NameNode, DataNode, and the file
system namespace. How do these components work together to manage data storage
and replication in Hadoop clusters?
3. Describe the considerations for deploying Hadoop in a cloud environment. What are
the advantages and challenges of running Hadoop clusters on cloud platforms like
Amazon Web Services (AWS), Microsoft Azure and Google Cloud Platform (GCP)?