BCS061 Notes Unit 3

Unit 3

HDFS: Hadoop Distributed File System

The Hadoop Distributed File System follows a distributed file system design and runs on
commodity hardware. Unlike many other distributed systems, HDFS is highly fault tolerant
even though it is built from low-cost hardware.
HDFS holds very large amounts of data and provides easy access to it. To store such huge data,
files are spread across multiple machines and stored redundantly, so the system can recover
from data loss when a component fails. HDFS also makes the data available for parallel
processing by applications.

Features of HDFS

 It is suitable for distributed storage and processing of large datasets.
 Hadoop provides a command-line interface to interact with HDFS.
 The built-in servers of the NameNode and DataNodes help users easily check the status of
the cluster.
 Streaming access to file system data.
 HDFS provides file permissions and authentication.

Goals of HDFS

Fault detection and recovery − Since an HDFS cluster includes a large number of commodity
hardware components, failure of components is frequent. Therefore HDFS should have
mechanisms for quick and automatic fault detection and recovery.
Huge datasets − HDFS should scale to hundreds of nodes per cluster to manage the
applications having huge datasets.
Hardware at data − A requested task can be done efficiently when the computation takes
place near the data. Especially where huge datasets are involved, this reduces network
traffic and increases throughput.
HDFS Architecture
The architecture of the Hadoop Distributed File System is described below.

HDFS follows the master-slave architecture and it has the following elements.

NameNode

The NameNode is the single point of contact for accessing files in HDFS, and it determines the
block IDs and locations for data access. So the NameNode plays the master role in the
master/slave architecture, whereas the DataNodes act as slaves. File system metadata is stored
on the NameNode.
File system metadata mainly contains file names, file permissions, and the locations of each
block of every file. Metadata is therefore relatively small in size and fits into the main memory
of a computer. It is stored in the main memory of the NameNode to allow fast access.

Important components of name node:

FsImage: a file on the NameNode’s local file system containing the entire HDFS file system
namespace (including the mapping of blocks to files and file system properties).

Editlog: a transaction log residing on the NameNode’s local file system that contains a
record/entry for every change that occurs to the file system metadata.
DataNode

DataNodes are the slave part of the master/slave architecture, and the actual HDFS files are
stored on them in the form of fixed-size chunks of data called blocks. DataNodes serve
read and write requests from clients on HDFS files and also perform block creation, replication
and deletion.
DataNode Failure Recovery: Each DataNode on the cluster periodically sends a heartbeat
message to the NameNode, which the NameNode uses to detect DataNode failures based on
missing heartbeats.
The NameNode marks DataNodes without recent heartbeats as dead, and does not dispatch
any new I/O requests to them. Because data located on a dead DataNode is no longer available
to HDFS, the death of a DataNode may cause the replication factor of some blocks to fall below
their specified values. The NameNode constantly tracks which blocks must be re-replicated, and
initiates replication whenever necessary.
Thus all the blocks on a dead DataNode are re-replicated to other live DataNodes and the
replication factor returns to normal.

HDFS Blocks

HDFS is a block-structured file system. Each HDFS file is broken into blocks of a fixed size,
usually 128 MB, which are stored across various DataNodes in the cluster. Each of these
blocks is stored as a separate file on the local file system of the DataNodes (commodity
machines in the cluster).
Thus, to access a file on HDFS, multiple DataNodes need to be referenced, and the list of
DataNodes that need to be accessed is determined by the file system metadata stored on the
NameNode.
So any HDFS client trying to access/read an HDFS file will first get block information from the
NameNode, and then, based on the block IDs and locations, data will be read from the
corresponding DataNodes in the cluster. HDFS’s fsck command is useful for listing the files
and blocks of the file system:
$ hadoop fsck / -files -blocks

Backup Node
In Hadoop, Backup node keeps an in-memory, up-to-date copy of the file system namespace.
It is always synchronized with the active NameNode state.

Replication Management
Block replication provides fault tolerance. If one copy is inaccessible or corrupted, the data can
be read from another copy. The number of copies (replicas) of each block of a file in HDFS is
called the replication factor.
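As a small illustration (not part of the original notes), the replication factor of an existing file
can also be changed from a Java client through Hadoop's FileSystem API; the file path below is
a hypothetical placeholder and the sketch assumes a reachable cluster configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Ask the NameNode to keep 3 copies of this (hypothetical) file;
        // the extra replicas are created asynchronously by the cluster
        fs.setReplication(new Path("/data/example.txt"), (short) 3);
        fs.close();
    }
}
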
Rack Awareness
Rack Awareness in Hadoop is the concept of choosing DataNodes based on rack
information.
The NameNode obtains rack information by maintaining the rack ID of each DataNode.

HDFS Read and Write operation.


Write Operation: When a client wants to write a file to HDFS, it contacts the NameNode for
metadata. The NameNode responds with the number of blocks, their locations, replicas and
other details.
Read Operation: To read from HDFS, the client first contacts the NameNode for metadata.
The NameNode returns the names of the file's blocks and their locations, and the client then
reads the data from the corresponding DataNodes.

How Does the Hadoop File System Work?


The Hadoop Distributed File System (HDFS) is the primary data storage system used by
Hadoop applications. It employs a NameNode and DataNode architecture to implement a
distributed file system that provides high-performance access to data across highly scalable
Hadoop clusters.
HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a
file is split into one or more blocks and these blocks are stored in a set of DataNodes.
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system
written in Java for the Hadoop framework. A Hadoop cluster nominally has a single
NameNode plus a cluster of DataNodes. Each DataNode serves up blocks of data over the
network using a block protocol specific to HDFS.
The NameNode keeps all metadata information about where the data is stored: the location of
the data files, how the files are split across DataNodes, and so on. HDFS stores large files
(typically in the range of gigabytes to terabytes) across multiple machines, called DataNodes.
The files are split into blocks (usually 64 MB or 128 MB) and stored on a series of
DataNodes (which DataNodes are used for each file is managed by the NameNode).
The blocks of data are also replicated (usually three times) on other DataNodes so that in case
of hardware failures, clients can find the data on another server. The information about the
location of the data blocks and the replicas is also managed by the NameNode. DataNodes
can talk to each other to rebalance data, to move copies around, and to keep the replication of
data high.
● The HDFS file system also includes a secondary NameNode, whose name is misleading, as
it might be interpreted as a backup for the NameNode. In fact, the secondary NameNode
regularly connects to the primary NameNode and builds snapshots of all the directory
information managed by the latter, which the system then saves to local or remote
directories. These snapshots can later be used to restart a failed primary NameNode without
having to replay the entire journal of file system actions and then edit the log to create an
up-to-date directory structure.
● One of the issues with the HDFS architecture is that the NameNode is the single point for
storage and management of metadata, so it can become a bottleneck when dealing with a
huge number of files, especially a large number of small files. HDFS Federation, a newer
addition, aims to tackle this problem to a certain extent by allowing multiple namespaces
served by separate NameNodes. Moreover, there are some known issues in HDFS, namely
the small-file issue, the scalability problem, the Single Point of Failure (SPoF), and the
bottleneck created by huge metadata requests.

Java Interface with HDFS


● Interfaces are derived from real-world scenarios; their main purpose is to allow an
object to be used according to strict rules.
● Java interfaces have the same behaviour: they set strict rules on how to interact with objects.
Hadoop has an abstract notion of filesystems, of which HDFS is just one implementation.
● The Java abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a
filesystem in Hadoop, and there are several concrete implementations.
● Hadoop is written in Java, so most Hadoop filesystem interactions are mediated through the
Java API.
● The filesystem shell, for example, is a Java application that uses the Java FileSystem class
to provide filesystem operations.
● By exposing its file system interface as a Java API, Hadoop makes it awkward for non-Java
applications to access HDFS.
● The HTTP REST API exposed by the WebHDFS protocol makes it easier for other
languages to interact with HDFS.
● Note that the HTTP interface is slower than the native Java client, so should be avoided for
very large data transfers if possible.
● There are two ways of accessing HDFS over HTTP: directly, where the HDFS daemons
serve HTTP requests to clients, and via a proxy, which accesses HDFS on the client’s behalf
using the usual DistributedFileSystem API.
● In the first case, the embedded web servers in the NameNode and DataNodes act as
WebHDFS endpoints.
● File metadata operations are handled by the namenode, while file read (and write)
operations are sent first to the namenode, which sends an HTTP redirect to the client
indicating the datanode from which to stream the file data.
● The second case of accessing HDFS over HTTP relies on one or more standalone proxy
servers.
● All traffic to the cluster passes through the proxy, so the client never accesses the
namenode or datanode directly.
● This allows for stricter firewall and bandwidth-limiting policies to be put in place.
● The HttpFS proxy exposes the same HTTP (and HTTPS) interface as WebHDFS, so clients
can access both using webhdfs (or swebhdfs) URIs.
● The HttpFS proxy is started independently of the namenode and datanode daemons, using
the httpfs.sh script, and by default it listens on port 14000.
There are two ways to use the Java API in HDFS, both illustrated in the sketch below:
1. Reading Data Using the FileSystem API.
2. Writing Data Using the FileSystem API.
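A minimal sketch of both operations follows (not from the original notes); the NameNode URI
and the file path are placeholder assumptions:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hdfs:// uses the native Java client; a webhdfs:// URI would go over HTTP instead
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000/"), conf);
        Path path = new Path("/user/demo/sample.txt");   // hypothetical file

        // 1. Writing data using the FileSystem API
        try (FSDataOutputStream out = fs.create(path, true)) {   // true = overwrite if it exists
            out.writeUTF("hello hdfs");
        }

        // 2. Reading data using the FileSystem API
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
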

Hadoop Archives (HAR)


Hadoop archives are special format archives. A Hadoop archive maps to a file system
directory. A Hadoop archive always has a *.har extension. A Hadoop archive directory
contains metadata (in the form of _index and _masterindex) and data (part-*) files.
The _index file contains the name of the files that are part of the archive and the location
within the part files. Hadoop Archives (HAR) offers an effective way to deal with the small
files problem.
The Problem with Small Files
Hadoop works best with big files and small files are handled inefficiently in HDFS. As we
know, Namenode holds the metadata information in memory for all the files stored in HDFS.
Let’s say we have a file in HDFS which is 1 GB in size and the Namenode will store
metadata information of the file – like file name, creator, created time stamp, blocks,
permissions etc.
Now assume we decide to split this 1 GB file into 1000 pieces and store all 1000 “small”
files in HDFS.
Now the Namenode has to store metadata information for 1000 small files in memory.
This is not very efficient: first, it takes up a lot of memory, and second, the Namenode will
soon become a bottleneck as it tries to manage so much metadata.

What is HAR?
Hadoop Archives (HAR) is an archiving facility that packs files into HDFS blocks
efficiently, and hence HAR can be used to tackle the small files problem in Hadoop.
HAR is created from a collection of files and the archiving tool (a simple command) will run
a MapReduce job to process the input files in parallel and create an archive file.
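For illustration (not from the original notes), once an archive has been created with the
hadoop archive tool, the files inside it can be listed and read through the ordinary FileSystem
API using the har:// URI scheme; the archive path and the member file name below are
hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHar {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path archive = new Path("har:///user/demo/files.har");   // hypothetical archive
        FileSystem fs = archive.getFileSystem(conf);             // HarFileSystem over the default FS

        // List the logical files packed inside the archive
        for (FileStatus status : fs.listStatus(archive)) {
            System.out.println(status.getPath());
        }

        // Read one (hypothetical) member file as if it were a normal HDFS file
        try (FSDataInputStream in = fs.open(new Path(archive, "small-file-1.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
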

Limitation of HAR Files


● Once an archive file is created, you cannot update it to add or remove files. In other
words, .har files are immutable.
● An archive file contains a copy of all the original files, so once a .har is created it takes as
much space as the original files. Do not mistake .har files for compressed files.
● When a .har file is given as input to a MapReduce job, the small files inside it are still
processed individually by separate mappers, which is inefficient.

HADOOP I/O
 Like any I/O subsystem, Hadoop comes with a set of I/O primitives.
 These primitives, although generic in nature, take on a special connotation in the Hadoop
I/O system.
 Hadoop deals with multi-terabyte datasets, so a careful look at these primitives gives an
idea of how Hadoop handles data input and output.
 This section quickly skims over these primitives to give a perspective on the Hadoop
input/output system.

Compression:
 Keeping in mind the volume of data Hadoop deals with, compression is not a luxury
but a requirement.
 There are many obvious benefits of file compression rightly used by Hadoop.
 It economizes storage requirements and is a must-have capability to speed up data
transmission over the network and disks.
 There are many compression tools, techniques, and algorithms commonly used by Hadoop.
 Many of them are quite popular and have long been used in file compression. For example,
gzip, bzip2, LZO, and zip are often used (a short usage sketch follows below).
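As a quick sketch (not from the original notes), the snippet below uses one of Hadoop's built-in
codecs, GzipCodec, to compress a stream; everything written through the wrapped stream ends
up gzip-compressed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Codecs are instantiated via ReflectionUtils so they pick up the configuration
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        // Wrap standard output: bytes written here come out gzip-compressed
        CompressionOutputStream out = codec.createOutputStream(System.out);
        out.write("compress me with gzip".getBytes("UTF-8"));
        out.finish();   // flush the compressed data without closing System.out
    }
}
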
Serialization:

 The process that turns structured objects to stream of bytes is called serialization.
 This is specifically required for data transmission over the network or persisting raw
data in disks.
 Deserialization is just the reverse process, where a stream of bytes is transformed into
a structured object.
 This is particularly required to reconstruct structured objects from the raw bytes.
 Therefore, it is not surprising that distributed computing uses this in a couple of
distinct areas: inter-process communication and data persistence.

 Hadoop uses RPC (Remote Procedure Call) for inter-process communication between
nodes.
 The RPC protocol therefore uses serialization to render a message into a stream of bytes
before sending it across the network, and deserialization to turn the bytes back into a
message at the other end.
 However, the format must be compact enough to make the best use of network bandwidth,
as well as fast, interoperable, and flexible enough to accommodate protocol updates over time.
Hadoop has its own compact and fast serialization format, Writables, which MapReduce
programs use for their key and value types, as illustrated in the sketch below.
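To make the idea concrete (this example is not in the original notes), the sketch serializes a
Writable to raw bytes and deserializes it back, mirroring what the RPC layer and the
MapReduce shuffle do internally:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
    public static void main(String[] args) throws Exception {
        // Serialize: structured object -> stream of bytes
        IntWritable original = new IntWritable(163);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Deserialize: stream of bytes -> structured object
        IntWritable restored = new IntWritable();
        restored.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(restored.get());   // prints 163
    }
}
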

AVRO: FILE-BASED DATA STRUCTURE


 To transfer data over a network or for its persistent storage, you need to serialize the
data.
 In addition to the serialization APIs provided by Java and Hadoop, there is a special
utility called Avro, a schema-based serialization technique.
 Apache Avro is a language-neutral data serialization system. It was developed by
Doug Cutting, the father of Hadoop.
 Since Hadoop Writable classes lack language portability, Avro becomes quite helpful,
as it deals with data formats that can be processed by multiple languages.
 Avro is a preferred tool to serialize data in Hadoop.
 Avro has a schema-based system. A language-independent schema is associated with
its read and write operations.
 Avro serializes the data which has a built-in schema.
 Avro serializes the data into a compact binary format, which can be deserialized by
any application.
 Avro uses JSON format to declare the data structures.
 Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby.
 Avro is a row-based storage format for Hadoop that is widely used as a
serialization platform.
 Avro stores the data definition (schema) in JSON format, making it easy to read and
interpret by any program.
 The data itself is stored in binary format, making it compact and efficient; a small
example follows.
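A minimal sketch of these ideas (not from the original notes) using Avro's generic Java API;
the record schema and field values here are purely illustrative:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroSketch {
    public static void main(String[] args) throws Exception {
        // Schema declared in JSON, as described above (illustrative schema)
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 21);

        // Serialize the record into Avro's compact binary format
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();
        System.out.println("serialized size = " + out.size() + " bytes");
    }
}
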
DATA INGESTION: Data ingestion is the process of obtaining and importing data for immediate use
or for storage in a database. To ingest something is to "take something in or absorb something."

 Data can be streamed in real time or ingested in batches.
 When data is ingested in real time, each data item is imported as soon as it is emitted by the
source.
 When data is ingested in batches, data items are imported in discrete chunks at
periodic intervals of time.

Note:
An effective data ingestion process begins by prioritizing data sources, validating
individual files and routing data items to the correct destination.

Apache Flume:
Apache Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data into HDFS.
It has a simple and flexible architecture based on streaming data flows, and it is robust and
fault tolerant with tunable reliability mechanisms and many failover and recovery
mechanisms.
It uses a simple, extensible data model that allows for online analytic applications.
Flume employs the familiar producer-consumer model. A Source is the entity through
which data enters Flume. Sources either actively poll for data or passively wait for data
to be delivered to them. On the other hand, a Sink is the entity that delivers the data to the
destination. Flume has many built-in sources (e.g. log4j and syslog) and sinks (e.g. HDFS
and HBase). A Channel is the conduit between the Source and the Sink.
Sources ingest events into the channel and the sinks drain the channel. Channels allow
decoupling of ingestion rate from drain rate. When data are generated faster than what the
destination can handle, the channel size increases.
Flume is a service for streaming logs into Hadoop: it lets Hadoop users ingest high-volume
streaming data into HDFS for storage. Specifically, Flume allows users to:
 Stream data: ingest streaming data from multiple sources into Hadoop for storage and
analysis.
 Insulate systems: buffer the storage platform from transient spikes, when the rate of
incoming data exceeds the rate at which data can be written to the destination.
 Guarantee data delivery: Flume NG uses channel-based transactions to guarantee reliable
message delivery. When a message moves from one agent to another, two transactions are
started, one on the agent that delivers the event and one on the agent that receives the
event. This ensures guaranteed delivery semantics.
 Scale horizontally: ingest new data streams and additional volume as needed.

Apache Sqoop:
Apache Sqoop is a tool designed to efficiently transfer data between Hadoop and relational
databases. We can use Sqoop to import data from a relational database table into HDFS. The
import process is performed in parallel and thus generates multiple files in the format of
delimited text, Avro, or Sequence File. Besides, Sqoop generates a Java class that
encapsulates one row of the imported table, which can be used in subsequent MapReduce
processing of the data. Moreover, Sqoop can export the data (e.g. the results of MapReduce
processing) back to the relational database for consumption by external applications or users.

Hadoop Security:
Hadoop Security is generally defined as a procedure to secure the Hadoop data storage unit by
offering a virtually impenetrable wall of security against potential cyber threats. Hadoop attains
this high-calibre security by following the security measures below.

 Authentication
 Authorization
 Auditing
Authentication:
Authentication is the first stage where the user’s credentials are verified. The credentials
typically include the user’s dedicated User-Name and a secret password. Entered credentials
will be checked against the available details on the security database. If valid, the user will
be authenticated.
Authorization:
Authorization is the second stage, where the system decides whether to grant the user
permission to access data or not. It is based on predefined access control lists. Confidential
information is kept secure, and only authorized personnel can access it.
Auditing:
Auditing is the last stage; it simply keeps track of the operations performed by the
authenticated user during the period in which they were logged into the cluster.
This is done solely for security purposes.
Types of Hadoop Security:
1) Kerberos Security:
Kerberos is one of the leading network authentication protocols, designed to provide
powerful authentication services to both server and client ends through secret-key
cryptography techniques. It is proven to be highly secure since it uses encrypted
service tickets throughout the entire session.
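As an illustrative sketch (not from the original notes), a Java client on a Kerberized cluster
typically authenticates before touching HDFS; the principal name and keytab path below are
placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");

        // Authenticate with a (placeholder) service principal and its keytab file
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(
                "hdfsuser@EXAMPLE.COM", "/etc/security/keytabs/hdfsuser.keytab");

        // Subsequent HDFS calls carry the Kerberos credentials automatically
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/")));
    }
}
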

2) HDFS Encryption: HDFS encryption is one of the most significant security advancements
Hadoop has embraced. Here, the data is completely encrypted from the source to the
destination (HDFS). This procedure does not require any changes to the original Hadoop
application, making the client the only party authorized to access the data.

3) Traffic Encryption: Traffic encryption is none other than HTTPS (HyperText
Transfer Protocol Secure). This procedure is used to secure data transmission from the
website as well as data transmission to the website.
Many online banking gateways use this method to secure transactions using a security
certificate.

Hadoop Single Node Setup


Step 1:
Download Hadoop from http://hadoop.apache.org/mapreduce/releases.html
Step 2:
Untar the Hadoop file:
tar xvfz hadoop-0.20.2.tar.gz
Step 3:
Set the path to the Java installation by editing the JAVA_HOME parameter in
hadoop/conf/hadoop-env.sh
Step 4:
Create an RSA key to be used by hadoop when ssh'ing to localhost:
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Step 5:
Make the following changes to the configuration files under hadoop/conf
core-site.xml:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:6000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/Users/ims-a/Documents/hadoop-install/hdfs_location_new/</value>
<description>A base for other temporary directories.</description>
</property>

Step 6:
Format the Hadoop file system. From the hadoop directory, run the following:
bin/hadoop namenode -format

Can Hadoop Run in the Cloud?


Technically yes, but cloud and Hadoop architectures were created for different purposes.
While Hadoop was really meant for physical data centers, cloud environments were built to
provide better elasticity and speed of resource provisioning.
Maintaining a large permanent cluster in the cloud is expensive, especially at scale, so to run
a permanent HDFS you might need to use cloud storage solutions and other SQL engine
alternatives.
Another challenge is that there is really no universal solution for security. While you
might have a preferred cloud solution from one vendor, your Hadoop vendor may use another
cloud platform.
Using a mix of solutions can result in a steep learning curve and possibly switching costs,
and when it comes to ephemeral clusters in the cloud, governance and security become even
more challenging.
There is not yet an ideal solution to these problems. Running Hadoop in the cloud should
therefore not be viewed as a simple technical choice, but as a choice based on overall
capability and long-term strategic vision.

HDFS Monitoring & Maintenance


 In the Hadoop world, a Systems Administrator is called a Hadoop Administrator.
 Hadoop Admin Roles and Responsibilities include setting up Hadoop clusters.
 Other duties involve backup, recovery and maintenance.
 Hadoop administration requires good knowledge of hardware systems and excellent
understanding of Hadoop architecture.
 It’s easy to get started with Hadoop administration because Linux system
administration is a pretty well-known beast, and because systems administrators are
used to administering all kinds of existing complex applications.
Problems faced by a Hadoop Administrator:
 Human error: Humans cause the most common errors affecting the health of systems and
machines. Even a simple mistake can create a disaster and lead to downtime. To avoid these
errors it is important to have a proper, proactive diagnostic process in place.
 Misconfiguration is another problem that Hadoop administrators come across.
Even now, after so much development, Hadoop is considered a young technology.
 Hardware does not fail straight away, but it degrades over time and leads to
different failures. HDFS is good at detecting corrupt data blocks and automatically
re-replicating them to new copies without human intervention.
 Resource exhaustion: this is also a major factor that causes problems. As an
administrator, one should measure and track task failures to help users identify and
correct the offending processes. Repetitive task failures can occupy task slots and
take resources away from other jobs; therefore, they should be seen as a drain
on overall capacity.
 Network partition: a situation where the network is unable to communicate with
hosts on another segment of the network. This means that host X on switch 1 cannot send
messages to host Y on switch 2.
Section A [2 Marks]
1. What is the default block size in HDFS, and why is it large?[AKTU 21-22]

The default block size in HDFS is 128 MB (can be configured). A larger block size minimizes
seeking time and allows efficient storage of large files.

2. What is data replication in HDFS, and why is it important?[AKTU22-23]

Data replication in HDFS ensures fault tolerance by storing multiple copies of each data block
(default: 3 copies) across different nodes to prevent data loss in case of node failure.

3. How does HDFS handle read and write operations?

When reading, the client contacts the NameNode, retrieves block locations, and reads from
DataNodes.

When writing, data is first written to a temporary local file, then split into blocks and stored on
DataNodes with replication.

4. What is Avro, and why is it used in Hadoop?


Avro is a data serialization system that enables efficient data storage and schema evolution,
making it useful for exchanging data between Hadoop applications.

5. What is the role of the NameNode in HDFS?

The NameNode manages metadata, including the file directory structure, block locations, and
access permissions. It does not store actual data.

6. What is the role of Flume in Hadoop?

Flume is a tool used for ingesting streaming data (e.g., log files, social media feeds) into
Hadoop for real-time processing.

7. What is the purpose of the Hadoop Archives (HAR)?

HAR helps to reduce the number of small files in HDFS by grouping them into larger archive
files, improving storage efficiency and metadata management.

8. What is the difference between Hadoop’s command-line interface and Java API?

The command-line interface (CLI) provides basic file operations (e.g., hdfs dfs -ls),
while the Java API offers programmatic control for advanced file handling and application
integration.

9. What are the key components required to set up a Hadoop cluster?

A Hadoop cluster requires Master nodes (NameNode, ResourceManager) and Worker nodes
(DataNodes, NodeManagers) with networked storage and proper Hadoop configuration.

10. What is Hadoop benchmarking, and why is it important?


Hadoop benchmarking tests cluster performance using tools like TestDFSIO and TeraSort to
measure read/write speeds and processing efficiency.

11. What is the role of YARN in Hadoop?

YARN (Yet Another Resource Negotiator) manages cluster resources and job scheduling,
allowing multiple applications to run simultaneously in Hadoop.

12. What are the security features in Hadoop?

Hadoop security includes Kerberos authentication, HDFS file permissions, data encryption,
and access control lists (ACLs).

13. How does Hadoop handle fault tolerance?


Hadoop ensures fault tolerance by replicating data blocks, using automatic task re-execution,
and checkpointing metadata in Secondary NameNode.

14. How does Hadoop in the cloud differ from on-premises Hadoop?
Hadoop in the cloud (e.g., AWS EMR, Azure HDInsight) provides scalability, reduced
hardware costs, and managed services, while on-premises Hadoop requires dedicated
infrastructure.

15. What is the purpose of Sqoop in Hadoop?

Sqoop is used for importing and exporting structured data (e.g., from MySQL, Oracle)
between HDFS and relational databases.

16. What is HDFS monitoring, and which tools are used for it?

HDFS monitoring ensures system health and performance using tools like JMX, Ambari, and
Ganglia to track resource usage and failures.
AKTU:

1. Name the two types of Nodes in Hadoop?

The two types of nodes in Hadoop are:

NameNode – The master node that manages metadata and controls access to files.

DataNode – The worker nodes that store actual data and handle read/write
requests.

2. Describe the concepts of file sizes, block sizes and block abstraction in
HDFS?

File Sizes in HDFS

 HDFS is designed to handle large files efficiently.


 Files are divided into fixed-size blocks to optimize storage and retrieval.
 HDFS performs best when working with large files (hundreds of MBs to TBs) rather
than small files.

Block Sizes in HDFS

 A block is the smallest unit of data storage in HDFS.


 The default block size is 128 MB (configurable to 256 MB or more for large files).
 Unlike traditional file systems (which use small block sizes like 4 KB), HDFS uses large
blocks to reduce disk seeks and improve throughput.

Block Abstraction in HDFS

 HDFS splits files into blocks and stores them across multiple DataNodes.
 The NameNode maintains metadata about which DataNodes store each block.
 Blocks are replicated (default: 3 copies) across different DataNodes for fault tolerance
and high availability.
 When reading a file, the client retrieves blocks in parallel, improving performance.

Advantages of Block Abstraction

Fault Tolerance: If a node fails, copies of the block exist on other nodes.
Parallel Processing: Different blocks of a file can be processed simultaneously.
Efficient Storage Management: Large files are managed efficiently, avoiding fragmentation.

3. What are the benefits and challenges of using HDFS for Distributed
storage and processing?
Benefits and Challenges of Using HDFS for Distributed Storage and Processing

Benefits of HDFS:

Scalability – HDFS can handle petabytes of data by adding more nodes to the cluster.

Fault Tolerance – Data is replicated (default 3 copies) across multiple nodes to prevent data
loss.

High Throughput – Supports parallel processing of large datasets, improving performance.

Cost-Effective – Works on commodity hardware, reducing infrastructure costs.

Optimized for Large Files – HDFS efficiently stores and processes large files by dividing them
into blocks.

Data Locality – Computation is moved closer to data, reducing network bandwidth usage.

Integration with Big Data Tools – Compatible with MapReduce, Spark, Hive, Pig, HBase,
etc.

Challenges of HDFS:

Not Suitable for Small Files – Large block size (128MB+) leads to inefficient storage for small
files.

High Latency for Random Reads – HDFS is optimized for sequential access; random access
is slow.

Metadata Overhead – The NameNode stores metadata in memory, which can become a
bottleneck.

Complex Administration – Requires cluster setup, monitoring, and maintenance, increasing
management overhead.

Security Concerns – Requires additional configuration for authentication, encryption, and
access control.

Replication Overhead – Data replication increases storage requirements (e.g., 3x replication
means more disk usage).

Dependency on NameNode – The NameNode is a single point of failure (unless High
Availability (HA) is configured).

4. Differentiate between Flume and Sqoop?

Flume is designed for ingesting streaming data (such as log files and event data) from multiple
sources into HDFS, whereas Sqoop transfers structured data between relational databases and
HDFS, supporting both import and export. Flume works with continuous, event-driven data
flows, while Sqoop works in batch mode against RDBMS tables.
SECTION-B & SECTION-C
1. Explain the architecture of HDFS and its key components. How does it handle large-
scale data storage efficiently?[AKTU 22-23, 21-22]
2. Discuss the benefits and challenges of using HDFS. How does block abstraction improve
data management in HDFS? [AKTU 21-22]
3. Describe the data replication mechanism in HDFS. How does HDFS ensure data
availability and fault tolerance?
4. Explain the process of reading and writing files in HDFS. How do Java interfaces
and the command-line interface facilitate HDFS operations? [AKTU 22-23, 21-22]
5. What is the role of Flume and Sqoop in data ingestion? Explain their working principles
and use cases.
6. Explain Hadoop I/O operations, including compression, serialization, and Avro.
How do these features enhance Hadoop performance?
7. What are the key steps in setting up a Hadoop cluster? Explain Hadoop cluster
specification, security aspects, and monitoring tools used for administration. [AKTU]

AKTU Questions:

1. Discuss in brief the cluster specification. Describe how to set up a Hadoop cluster.
2. Explain the core concepts of HDFS, including the NameNode, DataNode, and the file
system namespace. How do these components work together to manage data storage
and replication in Hadoop clusters?
3. Describe the considerations for deploying Hadoop in a cloud environment. What are
the advantages and challenges of running Hadoop clusters on cloud platforms like
Amazon Web Services (AWS), Microsoft Azure and Google Cloud Platform (GCP)?
