Big Data Unit-3
Hadoop Distributed File System
HDFS NameNode
⚫ It is also known as the Master node.
⚫ The HDFS NameNode stores metadata, i.e. the number of data blocks, replicas and other details.
⚫ This metadata is kept in memory on the master for faster retrieval of data.
⚫ NameNode maintains and manages the slave nodes, and assigns tasks to them.
⚫ It should be deployed on reliable hardware, as it is the centerpiece of the HDFS cluster.
Functions of NameNode
⚫ Manages the file system namespace.
⚫ Regulates clients' access to files.
⚫ It also executes file system operations such as opening, closing, and renaming files/directories.
⚫ All DataNodes send a Heartbeat and block report to the NameNode in the Hadoop cluster.
⚫ NameNode is also responsible for taking care of the Replication Factor of all the blocks (a quick sketch follows).
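As a quick sketch of the replication factor in practice (hdfs getconf and the -setrep shell option are standard; the path is a hypothetical example):

  # Check the configured default replication factor (usually 3)
  hdfs getconf -confKey dfs.replication

  # Set the replication factor of a (hypothetical) file to 2;
  # -w waits until replication has actually been adjusted
  hdfs dfs -setrep -w 2 /user/demo/data.txt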
Files present in the NameNode metadata are:
FsImage –
⚫ It is an "image file"; FsImage stands for File System Image.
⚫ It contains the complete directory structure (namespace) of HDFS, with details about the data blocks and which blocks are stored on which node.
⚫ It is stored as a file in the NameNode's local file system.
⚫ FsImage is a point-in-time snapshot of HDFS's namespace; the last snapshot taken is what is actually stored in FsImage.
EditLogs –
⚫ EditLogs is a transaction log that records the changes in the HDFS file system, or any action performed on the HDFS cluster, such as the addition of a new block, replication, deletion, etc.
⚫ The edit log records every change since the last snapshot; in short, it records all modifications made to the file system since the most recent FsImage was created.
⚫ When the NameNode receives a create/update/delete request from a client, the request is first recorded in the EditLog before the in-memory metadata is updated.
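On disk, both files live in the NameNode's metadata directory (dfs.namenode.name.dir). A typical listing looks roughly like the sketch below; the transaction IDs are made-up examples:

  $ ls <dfs.namenode.name.dir>/current
  VERSION
  fsimage_0000000000000000042                     # last checkpointed namespace snapshot
  fsimage_0000000000000000042.md5
  edits_0000000000000000001-0000000000000000042   # finalized edit segments
  edits_inprogress_0000000000000000043            # changes since the last FsImage
  seen_txid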
HDFS DataNode
⚫ It is also known as the Slave node.
⚫ In the Hadoop HDFS architecture, DataNodes store the actual data in HDFS.
⚫ They perform read and write operations as per the requests of clients.
⚫ DataNodes can be deployed on commodity hardware.
Functions of DataNode
⚫ Block replica creation, deletion, and replication, according to the instructions of the NameNode.
⚫ DataNode manages the data storage of the system.
⚫ Every DataNode sends a heartbeat signal to the NameNode; by default, this frequency is set to 3 seconds (a quick check is sketched below).
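A minimal check of this interval on a live cluster (dfs.heartbeat.interval is the standard property; its default value is 3 seconds):

  # Print the effective heartbeat interval, in seconds
  hdfs getconf -confKey dfs.heartbeat.interval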
Blocks
⚫ HDFS in Apache Hadoop splits huge files into small chunks known as Blocks.
⚫ These are the smallest units of data in the filesystem.
⚫ We (client and admin) do not have any control over the blocks, such as block location.
Block size in HDFS
The default block size is 128 MB in Hadoop 2.x and 3.x (64 MB in Hadoop 1.x), and it is configurable cluster-wide or per file; a quick sketch follows.
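A small sketch of checking and overriding the block size, assuming Hadoop 2.x+ defaults; the paths are hypothetical examples:

  # Print the configured default block size (134217728 bytes = 128 MB)
  hdfs getconf -confKey dfs.blocksize

  # Upload a file with a 256 MB block size, for this file only
  hdfs dfs -D dfs.blocksize=268435456 -put big.log /user/demo/big.log

  # Inspect how a file was actually split into blocks
  hdfs fsck /user/demo/big.log -files -blocks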
Secondary NameNode:
⚫ The Secondary NameNode downloads the FsImage and EditLogs from the NameNode.
⚫ It then merges the EditLogs into the FsImage (File System Image).
⚫ This keeps the edit log size within a limit.
⚫ It stores the modified FsImage in persistent storage.
⚫ The merged FsImage can be used in the case of a NameNode failure.
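How often this merge (checkpoint) happens is controlled by two standard properties; the sketch below prints their usual defaults, assuming a stock Hadoop configuration:

  # Checkpoint every hour (default: 3600 seconds) ...
  hdfs getconf -confKey dfs.namenode.checkpoint.period

  # ... or sooner, once this many uncheckpointed transactions accumulate (default: 1000000)
  hdfs getconf -confKey dfs.namenode.checkpoint.txns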
Rack Awareness
A rack is a collection of around 40-50 DataNodes connected using the same network switch. If the switch or network goes down, the whole rack becomes unavailable. A large Hadoop cluster is therefore deployed across multiple racks.
Block Abstraction
Because a file is stored as a set of blocks rather than as a single unit, a file in HDFS can be larger than any single disk in the cluster, and storage management and replication are simplified.
File Size
Checking file sizes and listing files with sizes in HDFS can be done directly from the HDFS shell, as sketched below.
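A minimal sketch using standard HDFS shell commands; /user/demo is a hypothetical path:

  # Show the size of each file/directory, human-readable (-h)
  hdfs dfs -du -h /user/demo

  # List files with sizes, permissions, and modification times
  hdfs dfs -ls -h /user/demo

  # Count directories, files, and total bytes under a path
  hdfs dfs -count /user/demo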
Rack Awareness
⚫ In the HDFS architecture, the NameNode makes sure that all replicas of a block are not stored on the same (single) rack.
⚫ It follows the Rack Awareness algorithm to reduce latency and improve fault tolerance.
⚫ We know that the default replication factor is 3.
⚫ Rack Awareness is important to improve:
Data availability and reliability.
The performance of the cluster.
Network bandwidth utilization.
A sketch of how rack mapping is configured follows.
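Hadoop learns each node's rack from an admin-supplied topology script. A minimal sketch, using the standard property net.topology.script.file.name; the script path and IP ranges are hypothetical examples:

  <!-- core-site.xml: point Hadoop at a script that maps an address to a rack -->
  <property>
    <name>net.topology.script.file.name</name>
    <value>/etc/hadoop/conf/topology.sh</value>
  </property>

  #!/bin/bash
  # topology.sh (hypothetical): print one rack name per address Hadoop passes in
  for node in "$@"; do
    case "$node" in
      10.1.1.*) echo "/rack1" ;;
      10.1.2.*) echo "/rack2" ;;
      *)        echo "/default-rack" ;;
    esac
  done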
Benefits of HDFS
Scalability – Handles large datasets across multiple nodes.
Fault Tolerance – Data replication ensures reliability.
High Throughput – Optimized for batch processing.
Cost-Effective – Runs on commodity hardware.
Integration – Works with Hadoop ecosystem tools.
Write-Once, Read-Many – Efficient for big data analytics.
Challenges of HDFS
Latency Issues – Not suitable for real-time processing.
Small File Problem – High overhead for many small files.
Complexity – Requires expertise to manage.
Replication Overhead – Increases storage needs.
Security Concerns – Requires extra configuration.
Limited Random Writes – No in-place file modifications.
How does HDFS store, read and write files?
Store
Before the NameNode can help you store and manage the data, the file first needs to be partitioned into smaller, manageable data blocks. This process is called data block splitting; a worked example follows.
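As a worked example (a sketch assuming the default 128 MB block size), a 300 MB file is split into three blocks:

  Block 1: 128 MB
  Block 2: 128 MB
  Block 3:  44 MB   (the last block occupies only as much space as the remaining data)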
How does HDFS store, read and write files?
Read
To read a file, the client first contacts the NameNode for the locations of the file's blocks, then streams the block data directly from the nearest DataNodes holding the replicas.
How does HDFS store, read and write files?
Write
To write a file, the client asks the NameNode to create an entry in the namespace and to nominate DataNodes; the client then streams each block to the first DataNode, which forwards it along a pipeline of DataNodes until the replication factor is met.
Interfaces to use HDFS
Java Interfaces to use HDFS
• Java provides multiple interfaces to interact with HDFS, mainly through the Hadoop FileSystem API, WebHDFS (REST API), and Hadoop RPC. Below are the key interfaces:
1. Hadoop FileSystem API (org.apache.hadoop.fs.FileSystem)
The primary programmatic interface, covered in detail in the "Hadoop File System Interfaces" section below.
Java Interfaces to use HDFS
2. WebHDFS (REST API)
HDFS provides a RESTful API (WebHDFS) for applications
that interact with HDFS over HTTP.
When to Use WebHDFS?
• When using non-Java applications to interact with HDFS.
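A minimal sketch of calling WebHDFS with curl, assuming a Hadoop 3.x NameNode on the default HTTP port 9870 and hypothetical paths; LISTSTATUS and OPEN are standard WebHDFS operations:

  # List a directory (equivalent to 'hdfs dfs -ls')
  curl -i "http://namenode-host:9870/webhdfs/v1/user/demo?op=LISTSTATUS"

  # Read a file (-L follows the redirect the NameNode issues to a DataNode)
  curl -i -L "http://namenode-host:9870/webhdfs/v1/user/demo/data.txt?op=OPEN"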
HDFS Command Line Interface (CLI)
The HDFS shell (hdfs dfs, formerly hadoop fs) is the most common way to interact with HDFS; representative commands are sketched below.
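A sketch of everyday HDFS shell commands; the paths are hypothetical examples, and all commands shown are part of the standard HDFS shell:

  hdfs dfs -mkdir -p /user/demo             # create a directory
  hdfs dfs -put localfile.txt /user/demo    # upload a local file to HDFS
  hdfs dfs -get /user/demo/localfile.txt .  # download a file from HDFS
  hdfs dfs -cat /user/demo/localfile.txt    # print a file's contents
  hdfs dfs -cp /user/demo/a.txt /tmp/       # copy within HDFS
  hdfs dfs -mv /tmp/a.txt /user/demo/b.txt  # move/rename within HDFS
  hdfs dfs -rm -r /user/demo/old            # delete a file or directory tree
  hdfs dfs -chmod 750 /user/demo            # change permissions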
Hadoop File System Interfaces
• The File System Interface in HDFS allows users and
applications to interact with the Hadoop Distributed
File System (HDFS).
• It provides various ways to read, write, and manage
files stored in HDFS.
• Below are the main interfaces for this interaction.
Hadoop File System Interfaces
Java API – org.apache.hadoop.fs.FileSystem Class
• Hadoop provides a Java API to interact with HDFS
programmatically.
• The org.apache.hadoop.fs.FileSystem class is the main entry
point for performing file system operations.
Key Features:
• Provides methods for file operations like reading, writing,
deleting, and listing files.
• Supports both local and distributed file system
implementations.
• Allows fine-grained control over HDFS access in Java-based applications.
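A minimal, self-contained sketch of this API; it assumes the Hadoop client libraries are on the classpath, and the NameNode URI and paths are hypothetical examples:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.nio.charset.StandardCharsets;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // hypothetical address
          FileSystem fs = FileSystem.get(conf);

          // Write a file (overwrite if it exists)
          Path file = new Path("/user/demo/hello.txt");
          try (FSDataOutputStream out = fs.create(file, true)) {
              out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
          }

          // Read the first line back
          try (BufferedReader in = new BufferedReader(
                  new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
              System.out.println(in.readLine());
          }

          // List the directory with file sizes
          for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
              System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
          }

          fs.close();
      }
  }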
Streaming and Piping
Because the HDFS shell can read from stdin and write to stdout, HDFS data can be combined with ordinary Unix pipelines, as sketched below.
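A minimal sketch, assuming hypothetical paths; hdfs dfs -cat streams file contents to stdout, and hdfs dfs -put - reads from stdin:

  # Pipe HDFS data through standard Unix tools
  hdfs dfs -cat /user/demo/logs/part-* | grep ERROR | wc -l

  # Stream the output of a local command straight into an HDFS file
  gzip -c access.log | hdfs dfs -put - /user/demo/access.log.gz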
Read and Write Data in HDFS
HDFS Operations to Read the file
⚫ To read any file from HDFS, you have to interact with the NameNode, as it stores the metadata about the DataNodes.
⚫ The user gets a token from the NameNode that specifies the address where the data is stored.
⚫ You put a read request to the NameNode for a particular block location through the DistributedFileSystem.
⚫ The NameNode then checks your privileges to access the DataNode and, if the access is valid, allows you to read the block from that address.
Read & Write Operations in
HDFS
⚫ You can execute various read and write operations such as creating a directory, setting permissions, copying files, updating files, deleting files, etc.
Analysing Data with Hadoop
⚫ To increase developer productivity, several higher-level languages and APIs have been created that abstract away the low-level details of the MapReduce programming model.
⚫ There are several choices available for writing data analysis jobs:
The Hive and Pig projects
HBase