Big Data Unit-3

The document provides an overview of the Hadoop Distributed File System (HDFS), detailing its architecture, components, and functionalities. It explains the roles of the NameNode and DataNode, the process of data storage and retrieval, and the importance of replication for fault tolerance. Additionally, it discusses the benefits and challenges of HDFS, as well as interfaces for interacting with the system, including Java APIs and command-line tools.


Components of Hadoop

Hadoop Distributed File System

⚫ In Hadoop, data resides in a distributed file system which is called the Hadoop Distributed File System.
⚫ The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware.
⚫ HDFS is the storage unit of Hadoop.
⚫ HDFS splits files into blocks and distributes them across the nodes of large clusters.
⚫ HDFS is designed for storing very large data files, running on clusters of commodity hardware.
⚫ It is fault tolerant, scalable, and extremely simple to expand.
⚫ Hadoop HDFS has a Master/Slave architecture, in which the Master is the NameNode and the Slaves are the DataNodes.
⚫ The HDFS architecture consists of a single NameNode; all the other nodes are DataNodes.
HDFS Architecture

HDFS NameNode
⚫ It is also known as the Master node.
⚫ The HDFS NameNode stores metadata, i.e. the number of data blocks, replicas, and other details.
⚫ This metadata is kept in memory on the master for faster retrieval of data.
⚫ The NameNode maintains and manages the slave nodes and assigns tasks to them.
⚫ It should be deployed on reliable hardware, as it is the centerpiece of an HDFS cluster.
Functions of NameNode
⚫ Manages the file system namespace.
⚫ Regulates clients' access to files.
⚫ Executes file system operations such as naming, closing, and opening files/directories.
⚫ Receives a heartbeat and block report from every DataNode in the Hadoop cluster.
⚫ Takes care of the replication factor of all the blocks.
Files present in the NameNode metadata are:
FsImage –
⚫ It is an “image file”; FsImage stands for File System Image.
⚫ It contains the complete directory structure (namespace) of HDFS, with details about which data blocks make up each file and which blocks are stored on which node.
⚫ It is stored as a file in the NameNode's local file system.
⚫ FsImage is a point-in-time snapshot of the HDFS namespace; the most recent snapshot is what is actually stored in the FsImage.
EditLogs –
⚫ The EditLog is a transaction log that records the changes in the HDFS file system, or any action performed on the HDFS cluster, such as the addition of a new block, replication, deletion, etc.
⚫ The edit log records every change made since the last snapshot; in short, it records the changes since the last FsImage was created.
⚫ It therefore contains all the recent modifications made to the file system on top of the most recent FsImage.
⚫ The NameNode receives a create/update/delete request from the client; the change is then first recorded in the EditLog (tools for inspecting these files are sketched below).
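Hadoop also ships offline viewers for these two files; a minimal sketch, assuming made-up file names taken from a NameNode storage directory:

    hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml
    hdfs oev -i edits_inprogress_0000000000000000043 -o edits.xml

The oiv (Offline Image Viewer) tool dumps an FsImage and oev (Offline Edits Viewer) dumps an EditLog segment, both into a readable format such as XML.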
HDFS DataNode
⚫ It is also known as a Slave node.
⚫ In the Hadoop HDFS architecture, the DataNode stores the actual data in HDFS.
⚫ It performs read and write operations as per the request of the client.
⚫ DataNodes can be deployed on commodity hardware.
Functions of DataNode
⚫ Block replica creation, deletion, and replication according to the instructions of the NameNode.
⚫ The DataNode manages the data storage of the system.
⚫ DataNodes send heartbeats to the NameNode; by default, this frequency is set to 3 seconds, so every 3 seconds each DataNode sends a heartbeat signal to the NameNode (the corresponding setting is sketched below).
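The 3-second interval is the Hadoop default rather than a fixed value; it is governed by the dfs.heartbeat.interval property (value in seconds) in hdfs-site.xml. A minimal sketch of the setting with its default value:

    <property>
      <name>dfs.heartbeat.interval</name>
      <value>3</value>
    </property>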
Blocks
⚫ HDFS in Apache Hadoop splits huge files into smaller chunks known as blocks.
⚫ These are the smallest unit of data in the filesystem.
⚫ We (clients and admins) do not have any control over the blocks, such as block location.
Block size in HDFS

A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks, and if possible, each chunk will reside on a different DataNode.
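The 128 MB figure is the default, not a hard limit; it is controlled by the dfs.blocksize property in hdfs-site.xml and can also be set per file when the file is created. A minimal sketch of the cluster-wide default (value in bytes, 134217728 bytes = 128 MB):

    <property>
      <name>dfs.blocksize</name>
      <value>134217728</value>
    </property>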
Replication Management
⚫ Block replication provides fault tolerance: if one copy is inaccessible or corrupted, we can read the data from another copy.
⚫ The number of copies or replicas of each block of a file is its replication factor. The default replication factor is 3, which is again configurable. So, by default each block is replicated three times and stored on different DataNodes (a command-line example follows this list).
⚫ The NameNode receives block reports from the DataNodes periodically in order to maintain the replication factor.
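The cluster-wide default comes from the dfs.replication property in hdfs-site.xml, and the replication factor of existing data can be changed from the command line. A hedged example (the path is a placeholder):

    hdfs dfs -setrep -w 2 /user/demo/data.txt

The -w flag makes the command wait until the requested replication level has actually been reached.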
Secondary NameNode:
⚫ The Secondary NameNode downloads the FsImage and EditLogs from the NameNode.
⚫ It then merges the EditLogs with the FsImage (File System Image).
⚫ This keeps the edit log size within a limit.
⚫ It stores the modified FsImage in persistent storage.
⚫ We can use this copy in the case of a NameNode failure.
Rack Awareness
A rack is a collection of around 40-50 DataNodes connected using the same network switch. If the network switch goes down, the whole rack becomes unavailable, so a large Hadoop cluster is deployed across multiple racks.
Block Abstraction
• Block abstraction means that files are broken into logical
blocks that are distributed across multiple nodes in the
cluster.
• The abstraction hides the complexities of physical storage,
allowing users to interact with files without worrying about
their physical location.

• Key Aspects of Block Abstraction in HDFS:


• 1. Logical Representation – Files are represented as a sequence
of blocks rather than contiguous storage.
• 2. Fixed Block Size – The default is 128 MB (often configured to 256 MB), much larger than the block size of traditional file systems.

Block Abstraction

• 3. Replication Mechanism – Each block is replicated (default: 3 copies) for fault tolerance.
• 4. Distributed Storage – Blocks are spread across multiple nodes to enable parallel processing.
• 5. Fault Tolerance & Recovery – If a node fails, the NameNode ensures data availability by managing block replicas.
File Size
Checking File Size in HDFS

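A hedged example of checking file sizes from the command line (paths are placeholders):

    hdfs dfs -du -h /user/demo
    hdfs dfs -du -s -h /user/demo

The -du subcommand reports the size of each file or directory, -s prints a single summarized total, and -h shows human-readable sizes (KB, MB, GB).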
Listing Files with Sizes

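A hedged example of listing files together with their sizes (the path is a placeholder):

    hdfs dfs -ls -h /user/demo

Each output line shows permissions, replication factor, owner, group, file size, modification time, and path; the -h flag prints the size in a human-readable form.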
Rack Awareness
⚫ In the HDFS architecture, the NameNode makes sure that all the replicas of a block are not stored on the same (single) rack.
⚫ It follows a Rack Awareness algorithm to reduce latency as well as to improve fault tolerance (a sample topology configuration is sketched after this list).
⚫ We know that the default replication factor is 3.
⚫ Rack Awareness is important to improve:
 Data high availability and reliability.
 The performance of the cluster.
 Network bandwidth usage.
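Rack awareness is not automatic: the administrator tells Hadoop how nodes map to racks, typically via a topology script referenced from core-site.xml. A minimal sketch, assuming a made-up script path:

    <property>
      <name>net.topology.script.file.name</name>
      <value>/etc/hadoop/conf/topology.sh</value>
    </property>

The script receives node IP addresses or hostnames and prints a rack path (for example /rack1) for each, which the NameNode then uses when placing replicas.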
Benefits of HDFS
Scalability – Handles large datasets across multiple nodes.
Fault Tolerance – Data replication ensures reliability.
High Throughput – Optimized for batch processing.
Cost-Effective – Runs on commodity hardware.
Integration – Works with Hadoop ecosystem tools.
Write-Once, Read-Many – Efficient for big data analytics.
Challenges of HDFS
Latency Issues – Not suitable for real-time processing.
Small File Problem – High overhead for many small files.
Complexity – Requires expertise to manage.
Replication Overhead – Increases storage needs.
Security Concerns – Requires extra configurations.
Limited Random Writes – No in-place file modifications.
How does HDFS store, read and write files
Store
• Before the NameNode can help you store and manage the data, it first needs to partition the file into smaller, manageable data blocks.
• This process is called data block splitting.
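As a worked example (the file size is arbitrary), a 500 MB file stored with the default 128 MB block size is split into four blocks of 128 MB, 128 MB, 128 MB, and 116 MB; the final, smaller block only occupies as much storage as it actually needs rather than a full 128 MB.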
How does HDFS store, read and write files
Read
How does HDFS store, read and write files
Write
Interfaces to use HDFS

Java Interfaces to use HDFS
• Java provides multiple interfaces to interact with HDFS, mainly through the Hadoop FileSystem API, WebHDFS (REST API), and Hadoop RPC. Below are the key interfaces:
Java Interfaces to use HDFS

1. Hadoop FileSystem API (org.apache.hadoop.fs.FileSystem)
• The FileSystem class is the primary interface for interacting with HDFS in Java.

Key interfaces and classes in org.apache.hadoop.fs:
• FileSystem → Abstract class to interact with HDFS (see the sketch after this list).
• Path → Represents a file or directory path in HDFS.
• FSDataInputStream → Stream for reading files.
• FSDataOutputStream → Stream for writing files.
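A minimal sketch of these classes in use; the NameNode URI, port, and file path below are placeholders for a real cluster:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // "hdfs://namenode-host:9000" is a placeholder for the cluster's NameNode URI.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);

            Path file = new Path("/user/demo/hello.txt");  // illustrative path only

            // Write: fs.create() returns an FSDataOutputStream.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read: fs.open() returns an FSDataInputStream.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }

            fs.close();
        }
    }

Because FileSystem is an abstraction, the same code can run against the local file system simply by pointing the URI at file:/// instead of hdfs://.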
Java Interfaces to use HDFS
2. WebHDFS (REST API)
HDFS provides a RESTful API (WebHDFS) for applications that interact with HDFS over HTTP.
When to Use WebHDFS?
• When using non-Java applications to interact with HDFS.
• When you need remote access via HTTP (an example request is shown below).
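A hedged illustration of the URL shape WebHDFS uses (host, port, and paths are placeholders; the NameNode's HTTP port depends on the Hadoop version and configuration):

    GET http://namenode-host:9870/webhdfs/v1/user/demo/hello.txt?op=OPEN
    GET http://namenode-host:9870/webhdfs/v1/user/demo?op=LISTSTATUS

The op query parameter selects the file system operation: OPEN reads a file and LISTSTATUS lists a directory.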

3. Hadoop RPC (Remote Procedure Call)
The HDFS NameNode uses Hadoop RPC to communicate with clients. It allows direct interaction with HDFS metadata.
When to Use Hadoop RPC?
• For low-level interaction with HDFS.
• When building custom HDFS clients.
HDFS Command Line Interface (CLI)
HDFS provides a Command Line Interface (CLI) that allows users to interact with HDFS using commands.
The main command used is hdfs dfs, followed by specific subcommands to manage files and directories.
1. Basic HDFS Commands
1.1 Listing Files and Directories
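A hedged sample of this and the other most common hdfs dfs subcommands (paths and file names are placeholders):

    hdfs dfs -ls /
    hdfs dfs -mkdir -p /user/demo
    hdfs dfs -put local.txt /user/demo/
    hdfs dfs -get /user/demo/local.txt .
    hdfs dfs -cat /user/demo/local.txt
    hdfs dfs -rm /user/demo/local.txt

These list the root directory, create a directory (with parent directories), copy a local file into HDFS, copy it back out, print its contents, and delete it, respectively.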
Hadoop File System Interfaces
• The File System Interface in HDFS allows users and applications to interact with the Hadoop Distributed File System (HDFS).
• It provides various ways to read, write, and manage files stored in HDFS.

HDFS provides several interfaces for file system interaction:
• Command-Line Interface (CLI) – hdfs dfs commands
• Java API – org.apache.hadoop.fs.FileSystem class
• WebHDFS REST API – HTTP-based file system operations
• NFS Gateway – Mount HDFS as an NFS file system
Hadoop File System Interfaces
Java API – the org.apache.hadoop.fs.FileSystem Class
• Hadoop provides a Java API to interact with HDFS programmatically.
• The org.apache.hadoop.fs.FileSystem class is the main entry point for performing file system operations.

Key Features:
• Provides methods for file operations like reading, writing, deleting, and listing files.
• Supports both local and distributed file system implementations.
• Allows fine-grained control over HDFS access in Java-based applications.
Streaming and Piping
Read and write data in HDFS

HDFS Operations to Read the file
⚫ To read any file from HDFS, you have to interact with the NameNode, as it stores the metadata about the DataNodes.
⚫ The user gets a token from the NameNode that specifies the address where the data is stored.
⚫ You can put a read request to the NameNode for a particular block location through the distributed file system.
⚫ The NameNode then checks your privilege to access the DataNode and allows you to read the address block if the access is valid.
Read & Write Operations in
HDFS
⚫ You can execute various reading, writing
operations such as creating a directory,
providing permissions, copying files,
updating files, deleting, etc.
Analysing Data with Hadoop
⚫ To increase developer productivity, several higher-level languages and APIs have been created that abstract away the low-level details of the MapReduce programming model.
⚫ There are several choices available for writing data analysis jobs:
 The Hive and Pig projects
 HBase
