Big Data Unit-3

The document provides an overview of the Hadoop Distributed File System (HDFS), detailing its architecture, components, and functionalities. It explains the roles of the NameNode and DataNode, the process of data storage and retrieval, and the importance of replication for fault tolerance. Additionally, it discusses the benefits and challenges of HDFS, as well as interfaces for interacting with the system, including Java APIs and command-line tools.


Components of Hadoop

Hadoop Distributed File System

⚫ In Hadoop, data resides in a distributed file system which is called the Hadoop Distributed File System.
⚫ The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware.
⚫ HDFS is the storage unit of Hadoop.
⚫ HDFS splits files into blocks and distributes them across the nodes of large clusters.
⚫ HDFS is designed for storing very large data files, running on clusters of commodity hardware.
⚫ It is fault tolerant, scalable, and extremely simple to expand.
⚫ Hadoop HDFS has a Master/Slave architecture, in which the Master is the NameNode and the Slaves are the DataNodes.
⚫ The HDFS architecture consists of a single NameNode; all the other nodes are DataNodes.
HDFS Architecture

HDFS NameNode
⚫ It is also known as the Master node.
⚫ The HDFS NameNode stores metadata, i.e. the number of data blocks, replicas, and other details.
⚫ This metadata is kept in memory on the master for faster retrieval of data.
⚫ The NameNode maintains and manages the slave nodes and assigns tasks to them.
⚫ It should be deployed on reliable hardware, as it is the centerpiece of an HDFS cluster.
Functions of NameNode
⚫ Manages the file system namespace.
⚫ Regulates clients' access to files.
⚫ Executes file system operations such as naming, closing, and opening files/directories.
⚫ Receives a heartbeat and block report from every DataNode in the Hadoop cluster.
⚫ Takes care of the replication factor of all the blocks.
Files present in the NameNode metadata are:
FsImage –
⚫ It is an “image file”; FsImage stands for File System Image.
⚫ It contains the complete directory structure (namespace) of HDFS, with details about which data blocks make up each file and which blocks are stored on which node.
⚫ It is stored as a file in the NameNode's local file system.
⚫ FsImage is a point-in-time snapshot of the HDFS namespace; the most recent snapshot is what is actually stored in the FsImage.
EditLogs –
⚫ The EditLog is a transaction log that records the changes in the HDFS file system, or any action performed on the HDFS cluster, such as the addition of a new block, replication, deletion, etc.
⚫ The edit log records every change made since the last snapshot; in short, it records the changes since the last FsImage was created.
⚫ It therefore contains all the recent modifications made to the file system on top of the most recent FsImage.
⚫ The NameNode receives a create/update/delete request from the client; the change is then first recorded in the EditLog (tools for inspecting these files are sketched below).
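Hadoop also ships offline viewers for these two files; a minimal sketch, assuming made-up file names taken from a NameNode storage directory:

    hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml
    hdfs oev -i edits_inprogress_0000000000000000043 -o edits.xml

The oiv (Offline Image Viewer) tool dumps an FsImage and oev (Offline Edits Viewer) dumps an EditLog segment, both into a readable format such as XML.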
HDFS DataNode
⚫ It is also known as a Slave node.
⚫ In the Hadoop HDFS architecture, the DataNode stores the actual data in HDFS.
⚫ It performs read and write operations as per the request of the client.
⚫ DataNodes can be deployed on commodity hardware.
Functions of DataNode
⚫ Block replica creation, deletion, and replication according to the instructions of the NameNode.
⚫ The DataNode manages the data storage of the system.
⚫ DataNodes send heartbeats to the NameNode; by default, this frequency is set to 3 seconds, so every 3 seconds each DataNode sends a heartbeat signal to the NameNode (the corresponding setting is sketched below).
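The 3-second interval is the Hadoop default rather than a fixed value; it is governed by the dfs.heartbeat.interval property (value in seconds) in hdfs-site.xml. A minimal sketch of the setting with its default value:

    <property>
      <name>dfs.heartbeat.interval</name>
      <value>3</value>
    </property>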
Blocks
⚫ HDFS in Apache Hadoop splits huge files into smaller chunks known as blocks.
⚫ These are the smallest unit of data in the filesystem.
⚫ We (clients and admins) do not have any control over the blocks, such as block location.
Block size in HDFS

A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks, and if possible, each chunk will reside on a different DataNode.
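The 128 MB figure is the default, not a hard limit; it is controlled by the dfs.blocksize property in hdfs-site.xml and can also be set per file when the file is created. A minimal sketch of the cluster-wide default (value in bytes, 134217728 bytes = 128 MB):

    <property>
      <name>dfs.blocksize</name>
      <value>134217728</value>
    </property>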
Replication Management
⚫ Block replication provides fault tolerance: if one copy is inaccessible or corrupted, we can read the data from another copy.
⚫ The number of copies or replicas of each block of a file is its replication factor. The default replication factor is 3, which is again configurable. So, by default each block is replicated three times and stored on different DataNodes (a command-line example follows this list).
⚫ The NameNode receives block reports from the DataNodes periodically in order to maintain the replication factor.
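The cluster-wide default comes from the dfs.replication property in hdfs-site.xml, and the replication factor of existing data can be changed from the command line. A hedged example (the path is a placeholder):

    hdfs dfs -setrep -w 2 /user/demo/data.txt

The -w flag makes the command wait until the requested replication level has actually been reached.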
Secondary NameNode:
⚫ The Secondary NameNode downloads the FsImage and EditLogs from the NameNode.
⚫ It then merges the EditLogs with the FsImage (File System Image).
⚫ This keeps the edit log size within a limit.
⚫ It stores the modified FsImage in persistent storage.
⚫ We can use this copy in the case of a NameNode failure.
Rack Awareness
A rack is a collection of around 40-50 DataNodes connected using the same network switch. If the network switch goes down, the whole rack becomes unavailable, so a large Hadoop cluster is deployed across multiple racks.
Block Abstraction
• Block abstraction means that files are broken into logical
blocks that are distributed across multiple nodes in the
cluster.
• The abstraction hides the complexities of physical storage,
allowing users to interact with files without worrying about
their physical location.

• Key Aspects of Block Abstraction in HDFS:


• 1. Logical Representation – Files are represented as a sequence
of blocks rather than contiguous storage.
• 2. Fixed Block Size – The default is 128 MB (often configured to 256 MB), much larger than the block size of traditional file systems.

Block Abstraction

• 3. Replication Mechanism – Each block is replicated (default: 3 copies) for fault tolerance.
• 4. Distributed Storage – Blocks are spread across multiple nodes to enable parallel processing.
• 5. Fault Tolerance & Recovery – If a node fails, the NameNode ensures data availability by managing block replicas.
File Size
Checking File Size in HDFS

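A hedged example of checking file sizes from the command line (paths are placeholders):

    hdfs dfs -du -h /user/demo
    hdfs dfs -du -s -h /user/demo

The -du subcommand reports the size of each file or directory, -s prints a single summarized total, and -h shows human-readable sizes (KB, MB, GB).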
Listing Files with Sizes

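A hedged example of listing files together with their sizes (the path is a placeholder):

    hdfs dfs -ls -h /user/demo

Each output line shows permissions, replication factor, owner, group, file size, modification time, and path; the -h flag prints the size in a human-readable form.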
Rack Awareness
⚫ In the HDFS architecture, the NameNode makes sure that all the replicas of a block are not stored on the same (single) rack.
⚫ It follows a Rack Awareness algorithm to reduce latency as well as to improve fault tolerance (a sample topology configuration is sketched after this list).
⚫ We know that the default replication factor is 3.
⚫ Rack Awareness is important to improve:
 Data high availability and reliability.
 The performance of the cluster.
 Network bandwidth usage.
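Rack awareness is not automatic: the administrator tells Hadoop how nodes map to racks, typically via a topology script referenced from core-site.xml. A minimal sketch, assuming a made-up script path:

    <property>
      <name>net.topology.script.file.name</name>
      <value>/etc/hadoop/conf/topology.sh</value>
    </property>

The script receives node IP addresses or hostnames and prints a rack path (for example /rack1) for each, which the NameNode then uses when placing replicas.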
Benefits of HDFS
Scalability – Handles large datasets across multiple nodes.
Fault Tolerance – Data replication ensures reliability.
High Throughput – Optimized for batch processing.
Cost-Effective – Runs on commodity hardware.
Integration – Works with Hadoop ecosystem tools.
Write-Once, Read-Many – Efficient for big data analytics.
Challenges of HDFS
Latency Issues – Not suitable for real-time processing.
Small File Problem – High overhead for many small files.
Complexity – Requires expertise to manage.
Replication Overhead – Increases storage needs.
Security Concerns – Requires extra configurations.
Limited Random Writes – No in-place file modifications.
How does HDFS store, read and write files
Store
• Before the NameNode can help you store and manage the data, it first needs to partition the file into smaller, manageable data blocks.
• This process is called data block splitting.
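As a worked example (the file size is arbitrary), a 500 MB file stored with the default 128 MB block size is split into four blocks of 128 MB, 128 MB, 128 MB, and 116 MB; the final, smaller block only occupies as much storage as it actually needs rather than a full 128 MB.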
How does HDFS store, read and write files
Read
How does HDFS store, read and write files
Write
Interfaces to use HDFS

Java Interfaces to use HDFS
• Java provides multiple interfaces to interact with HDFS, mainly through the Hadoop FileSystem API, WebHDFS (REST API), and Hadoop RPC. Below are the key interfaces:
Java Interfaces to use HDFS

1. Hadoop FileSystem API (org.apache.hadoop.fs.FileSystem)
• The FileSystem class is the primary interface for interacting with HDFS in Java.

Key interfaces and classes in org.apache.hadoop.fs:
• FileSystem → Abstract class to interact with HDFS (see the sketch after this list).
• Path → Represents a file or directory path in HDFS.
• FSDataInputStream → Stream for reading files.
• FSDataOutputStream → Stream for writing files.
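A minimal sketch of these classes in use; the NameNode URI, port, and file path below are placeholders for a real cluster:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // "hdfs://namenode-host:9000" is a placeholder for the cluster's NameNode URI.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);

            Path file = new Path("/user/demo/hello.txt");  // illustrative path only

            // Write: fs.create() returns an FSDataOutputStream.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read: fs.open() returns an FSDataInputStream.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }

            fs.close();
        }
    }

Because FileSystem is an abstraction, the same code can run against the local file system simply by pointing the URI at file:/// instead of hdfs://.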
Java Interfaces to use HDFS
2. WebHDFS (REST API)
HDFS provides a RESTful API (WebHDFS) for applications that interact with HDFS over HTTP.
When to Use WebHDFS?
• When using non-Java applications to interact with HDFS.
• When you need remote access via HTTP (an example request is shown below).
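A hedged illustration of the URL shape WebHDFS uses (host, port, and paths are placeholders; the NameNode's HTTP port depends on the Hadoop version and configuration):

    GET http://namenode-host:9870/webhdfs/v1/user/demo/hello.txt?op=OPEN
    GET http://namenode-host:9870/webhdfs/v1/user/demo?op=LISTSTATUS

The op query parameter selects the file system operation: OPEN reads a file and LISTSTATUS lists a directory.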

3. Hadoop RPC (Remote Procedure Call)
The HDFS NameNode uses Hadoop RPC to communicate with clients. It allows direct interaction with HDFS metadata.
When to Use Hadoop RPC?
• For low-level interaction with HDFS.
• When building custom HDFS clients.
HDFS Command Line Interface (CLI)
HDFS provides a Command Line Interface (CLI) that allows users to interact with HDFS using commands.
The main command used is hdfs dfs, followed by specific subcommands to manage files and directories.
1. Basic HDFS Commands
1.1 Listing Files and Directories
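A hedged sample of this and the other most common hdfs dfs subcommands (paths and file names are placeholders):

    hdfs dfs -ls /
    hdfs dfs -mkdir -p /user/demo
    hdfs dfs -put local.txt /user/demo/
    hdfs dfs -get /user/demo/local.txt .
    hdfs dfs -cat /user/demo/local.txt
    hdfs dfs -rm /user/demo/local.txt

These list the root directory, create a directory (with parent directories), copy a local file into HDFS, copy it back out, print its contents, and delete it, respectively.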
Hadoop File System Interfaces
• The File System Interface in HDFS allows users and applications to interact with the Hadoop Distributed File System (HDFS).
• It provides various ways to read, write, and manage files stored in HDFS.

HDFS provides several interfaces for file system interaction:
• Command-Line Interface (CLI) – hdfs dfs commands
• Java API – org.apache.hadoop.fs.FileSystem class
• WebHDFS REST API – HTTP-based file system operations
• NFS Gateway – Mount HDFS as an NFS file system
Hadoop File System Interfaces
Java API – the org.apache.hadoop.fs.FileSystem Class
• Hadoop provides a Java API to interact with HDFS programmatically.
• The org.apache.hadoop.fs.FileSystem class is the main entry point for performing file system operations.

Key Features:
• Provides methods for file operations like reading, writing, deleting, and listing files.
• Supports both local and distributed file system implementations.
• Allows fine-grained control over HDFS access in Java-based applications.
Streaming and Piping
Read and write data in HDFS

HDFS Operations to Read the file
⚫ To read any file from HDFS, you have to interact with the NameNode, as it stores the metadata about the DataNodes.
⚫ The user gets a token from the NameNode that specifies the address where the data is stored.
⚫ You can put a read request to the NameNode for a particular block location through the distributed file system.
⚫ The NameNode then checks your privilege to access the DataNode and allows you to read the address block if the access is valid.
Read & Write Operations in
HDFS
⚫ You can execute various reading, writing
operations such as creating a directory,
providing permissions, copying files,
updating files, deleting, etc.
Analysing Data with Hadoop
⚫ To increase developer productivity, several higher-level languages and APIs have been created that abstract away the low-level details of the MapReduce programming model.
⚫ There are several choices available for writing data analysis jobs:
 The Hive and Pig projects
 HBase
