HDFS 3

Uploaded by

himavamsi19

HDFS (Hadoop Distributed File System)

What is HDFS?
• HDFS is a distributed file system designed for storing large data sets across clusters of computers.
• It is a core component of the Apache Hadoop ecosystem.
• HDFS provides a reliable, scalable, and efficient storage solution for big data applications.
Purpose of HDFS
• HDFS is designed to store large datasets across multiple machines in a reliable and scalable manner, and to make them available to the frameworks that process them.
Key Features of HDFS
• Fault tolerance: HDFS is designed to handle hardware failures by replicating data across multiple nodes.
• Scalability: HDFS can scale horizontally by adding more nodes to the cluster.
• High throughput: HDFS is optimized for sequential data access, making it suitable for big data processing.
NameNode: The NameNode is the master node that manages the file system namespace and regulates access to files. It stores the metadata about files and directories, including the file hierarchy, permissions, and block locations.
DataNode: DataNodes are the worker nodes that store the actual data blocks of files. They are responsible for reading and writing data to the local file system and replicating data blocks to ensure fault tolerance.
Data Replication
• Ensures data availability and fault tolerance.
• Each data block is replicated across multiple DataNodes.
• Replication factor: a configurable parameter that sets the number of copies of each block (3 by default).
• Improves data availability: if a DataNode fails, data can still be accessed from other replicas.
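To make the availability claim concrete, here is a minimal sketch that checks which blocks remain readable after some DataNodes fail. The block IDs, node names, and the `readable_blocks` helper are hypothetical and only model the idea; they are not part of any Hadoop API.

```python
from typing import Dict, Set


def readable_blocks(block_replicas: Dict[str, Set[str]], failed: Set[str]) -> Set[str]:
    """Return the blocks that still have at least one replica on a live DataNode."""
    return {blk for blk, nodes in block_replicas.items() if nodes - failed}


# Hypothetical placement with replication factor 3 on a four-node cluster.
placement = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}

# One failed node: every block still has two live replicas.
print(sorted(readable_blocks(placement, failed={"dn2"})))

# Three failed nodes: only the block with a replica on dn4 survives.
print(sorted(readable_blocks(placement, failed={"dn1", "dn2", "dn3"})))
```

With a replication factor of r, a block stays readable as long as fewer than r of the nodes holding it are down at the same time.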
Client
• Applications interact with HDFS through the HDFS client.
• Available as a command-line tool (hdfs dfs) and programmatic APIs.
• Performs file system operations on behalf of users.
• Communicates with the NameNode for metadata management.
• Interacts directly with DataNodes for data transfer (read/write).
Block
• The fundamental unit of data storage in HDFS.
• Files are divided into fixed-size blocks (configurable); the last block of a file may be smaller than the block size.
• Default block size: 128 MB (can be adjusted based on workload).
• Each block is replicated across multiple DataNodes.
• Block size impacts performance and storage efficiency.
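The arithmetic behind block splitting can be sketched as follows. The helper `block_layout` is a hypothetical name, and the replication factor of 3 used here is an assumption for the example; both values are configurable in a real cluster.

```python
MB = 1024 * 1024


def block_layout(file_size: int, block_size: int = 128 * MB, replication: int = 3) -> dict:
    """Return the block count, the size of the last block, and the raw bytes stored."""
    full_blocks, remainder = divmod(file_size, block_size)
    num_blocks = full_blocks + (1 if remainder else 0)
    # The last block only occupies as much space as it actually holds.
    if remainder:
        last_block = remainder
    else:
        last_block = block_size if full_blocks else 0
    return {
        "blocks": num_blocks,
        "last_block_bytes": last_block,
        "raw_bytes": file_size * replication,
    }


layout = block_layout(1000 * MB)
print(layout["blocks"])                    # 7 full blocks plus 1 partial block
print(layout["last_block_bytes"] // MB)    # size of the partial block in MB
print(layout["raw_bytes"] // MB)           # total MB consumed across all replicas
```

A 1000 MB file therefore occupies eight blocks, and the cluster stores three times the file size in raw bytes.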
Advantages of Block
• Fault tolerance
• Parallel processing
• Scalability
• Data locality
• Ease of data management
Rack
• A group of DataNodes physically located together in a data center.
• Connected by a high-bandwidth network switch.
• Improves data locality and network performance.
• Enables rack awareness for data placement strategies.
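One way to picture a rack-aware placement strategy is the sketch below. It is loosely modelled on the commonly described default for three replicas (first replica on the writer's node, the other two on different nodes of one other rack) and ignores everything a real policy must handle, such as node load, free space, and clusters with a single rack. The topology and function names are hypothetical.

```python
import random
from typing import Dict, List


def rack_of(topology: Dict[str, List[str]], node: str) -> str:
    """Return the rack that contains the given DataNode."""
    return next(rack for rack, nodes in topology.items() if node in nodes)


def place_replicas(topology: Dict[str, List[str]], writer: str, rng: random.Random) -> List[str]:
    """Pick three DataNodes: the writer's node, then two distinct nodes on one other rack."""
    local_rack = rack_of(topology, writer)
    remote_racks = [r for r in topology if r != local_rack and len(topology[r]) >= 2]
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer, second, third]


# Hypothetical three-rack cluster.
topology = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5", "dn6"],
}
replicas = place_replicas(topology, "dn1", random.Random(0))
print(replicas, [rack_of(topology, n) for n in replicas])
```

Spreading replicas over two racks means the block survives the loss of a whole rack, while keeping two replicas on the same rack limits cross-rack write traffic.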
Read Operation
• Client Request
• NameNode Lookup and Block Locations
• Client Reads Data Blocks
• Data Streaming and Block-by-Block Processing
• Stream Management and Next Block Lookups
• Delivery to Application
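The read steps above can be sketched as a toy, in-memory model. `ToyNameNode`, `ToyDataNode`, and `read_file` are hypothetical names invented for illustration; the real client speaks RPC to the NameNode and streams bytes from DataNodes over the network.

```python
from typing import Dict, List, Tuple


class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where replicas live."""

    def __init__(self, block_map: Dict[str, List[str]], locations: Dict[str, List[str]]):
        self.block_map = block_map    # path -> ordered block IDs
        self.locations = locations    # block ID -> DataNode names holding a replica

    def get_block_locations(self, path: str) -> List[Tuple[str, List[str]]]:
        return [(blk, self.locations[blk]) for blk in self.block_map[path]]


class ToyDataNode:
    """Holds the actual block contents."""

    def __init__(self, blocks: Dict[str, bytes]):
        self.blocks = blocks

    def read_block(self, blk: str) -> bytes:
        return self.blocks[blk]


def read_file(path: str, namenode: ToyNameNode, datanodes: Dict[str, ToyDataNode]) -> bytes:
    """Ask the NameNode for block locations, then read each block from a live replica."""
    data = bytearray()
    for blk, holders in namenode.get_block_locations(path):
        for name in holders:
            node = datanodes.get(name)          # None models a DataNode that is down
            if node is not None and blk in node.blocks:
                data += node.read_block(blk)
                break
        else:
            raise IOError(f"no live replica for {blk}")
    return bytes(data)


namenode = ToyNameNode(
    {"/demo.txt": ["b1", "b2"]},
    {"b1": ["dn1", "dn2"], "b2": ["dn2", "dn3"]},
)
datanodes = {
    "dn1": ToyDataNode({"b1": b"hello "}),
    "dn2": ToyDataNode({"b1": b"hello ", "b2": b"world"}),
    "dn3": ToyDataNode({"b2": b"world"}),
}
print(read_file("/demo.txt", namenode, datanodes))
```

Note that file contents never pass through the NameNode in this model: it only answers the metadata lookup, and the client falls back to another replica when a DataNode is missing.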
Write Operation
• Client Initiates Write Request
• NameNode Interaction and File Creation
• Client Writes Data and Packet Creation
• DataStreamer, Block Allocation, and Pipeline Creation
• Data Replication Pipeline
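The packet and pipeline steps above can be illustrated with a small in-memory sketch. The function names and the tiny 4-byte packet size are assumptions chosen for readability; the real client uses much larger packets, and acknowledgements travel back along the pipeline rather than being counted in a loop.

```python
from typing import Dict, List


def split_into_packets(data: bytes, packet_size: int) -> List[bytes]:
    """Cut the client's byte stream into fixed-size packets (the last may be shorter)."""
    return [data[i:i + packet_size] for i in range(0, len(data), packet_size)]


def write_block(data: bytes, pipeline: List[str], storage: Dict[str, bytearray],
                packet_size: int = 4) -> int:
    """Push each packet through every DataNode in pipeline order; return packets acknowledged."""
    acked = 0
    for packet in split_into_packets(data, packet_size):
        # The first node stores the packet and forwards it to the next, and so on.
        for node in pipeline:
            storage.setdefault(node, bytearray()).extend(packet)
        # Counted as acknowledged once the last node in the pipeline has stored it.
        acked += 1
    return acked


storage: Dict[str, bytearray] = {}
pipeline = ["dn1", "dn2", "dn3"]   # hypothetical DataNodes chosen for this block
acked = write_block(b"0123456789", pipeline, storage)
print(acked, {node: bytes(buf) for node, buf in storage.items()})
```

The client sends each packet only once, to the first DataNode; replication happens as the nodes forward packets to one another.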
What is HDFS Federation?
• HDFS Federation is an extension of the Hadoop Distributed File System (HDFS) that lets a single cluster run multiple independent NameNodes, each managing its own namespace. All NameNodes share the cluster's DataNodes as common block storage, which overcomes the scalability limits of a single NameNode. Clients can combine the namespaces into one view with a client-side mount table (ViewFs).
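Features of HDFS Federation
• Overcoming Limitations: removes the single-NameNode bottleneck on namespace size and metadata throughput
• Scalability: the namespace scales horizontally by adding NameNodes
• Improved Manageability
• Isolation: each namespace is served by its own NameNode, so workloads do not interfere with one another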
HDFS Federation
• Supports multiple NameNodes for scalability and isolation
• Namespaces are federated, but not shared
• Useful for large clusters or multiple organizations
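Because federated namespaces are independent, a client needs some way to decide which NameNode owns a given path. The sketch below shows the idea with a longest-prefix mount table, similar in spirit to a client-side mount table such as ViewFs. The host names, mount points, and `resolve_namespace` helper are hypothetical.

```python
from typing import Dict


def resolve_namespace(path: str, mount_table: Dict[str, str]) -> str:
    """Return the NameNode responsible for a path, using the longest matching mount point."""
    best = None
    for prefix in mount_table:
        # Match whole path components only, so "/data" does not match "/database".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no namespace mounted for {path}")
    return mount_table[best]


# Hypothetical mount table: each prefix is served by a different NameNode.
mounts = {
    "/user": "nn1.example.com",
    "/data": "nn2.example.com",
    "/data/logs": "nn3.example.com",
}

print(resolve_namespace("/user/alice/report.csv", mounts))
print(resolve_namespace("/data/logs/app.log", mounts))
```

Each NameNode only ever sees requests for its own subtree, which is what "federated, but not shared" amounts to in practice.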
HDFS High Availability
• Provides NameNode redundancy
• Active NameNode and Standby NameNode
• Automatic failover in case of NameNode failure
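The active/standby arrangement can be pictured as a very small state machine. This is only a sketch of the role swap; `NameNodePair` is a hypothetical class, and a real deployment also needs shared edit logs, health monitoring, and fencing of the old active node.

```python
class NameNodePair:
    """Tracks which of two NameNodes is active and promotes the standby on failure."""

    def __init__(self, active: str, standby: str):
        self.active = active
        self.standby = standby
        self.healthy = {active: True, standby: True}

    def report_failure(self, node: str) -> None:
        """Mark a node as failed; if it was active, promote a healthy standby."""
        self.healthy[node] = False
        if node == self.active and self.healthy[self.standby]:
            self.active, self.standby = self.standby, self.active

    def is_available(self) -> bool:
        """The namespace is served only while the active node is healthy."""
        return self.healthy[self.active]


pair = NameNodePair(active="nn1", standby="nn2")
pair.report_failure("nn1")
print(pair.active, pair.is_available())   # nn2 has taken over
```

Losing the standby alone does not interrupt service, but it does leave the cluster without a failover target until the node is repaired.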
HDFS Permissions
• Supports POSIX-like permissions (user, group, other)
• Permissions are managed by the NameNode
• Useful for securing sensitive data
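The user/group/other model can be sketched with plain bit arithmetic on an octal mode. This is a simplified illustration with hypothetical user and group names; it ignores the superuser, ACLs, and directory traversal checks that a real NameNode also applies.

```python
from typing import Set


def is_allowed(mode: int, owner: str, group: str,
               user: str, user_groups: Set[str], want: str) -> bool:
    """Check a single r/w/x permission against a POSIX-style mode such as 0o640."""
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == owner:
        shift = 6          # owner bits
    elif group in user_groups:
        shift = 3          # group bits
    else:
        shift = 0          # other bits
    return bool((mode >> shift) & bit)


# File owned by alice, group analysts, mode rw-r-----
mode = 0o640
print(is_allowed(mode, "alice", "analysts", "alice", {"staff"}, "w"))     # owner may write
print(is_allowed(mode, "alice", "analysts", "bob", {"analysts"}, "r"))    # group may read
print(is_allowed(mode, "alice", "analysts", "carol", {"guests"}, "r"))    # others may not
```

Only one of the three classes applies to a given user: an owner is judged by the owner bits even if the group bits would grant more.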
HDFS Snapshots
• Provides read-only, point-in-time copies of the file system or of a directory subtree
• Efficient storage utilization by storing only differences
• Useful for data recovery or auditing purposes
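The "store only differences" idea can be sketched with a toy directory that records, per snapshot, the old value of each file the first time it changes. `SnapshottableDir` is a hypothetical class that works on whole files in memory; it is not how HDFS lays out snapshot data, only an illustration of why taking a snapshot copies nothing up front.

```python
_ABSENT = object()   # marks "this file did not exist when the snapshot was taken"


class SnapshottableDir:
    def __init__(self):
        self.files = {}       # current state: file name -> content
        self.snapshots = {}   # snapshot name -> {file name: content at snapshot time}

    def create_snapshot(self, name: str) -> None:
        self.snapshots[name] = {}          # nothing is copied at creation time

    def _remember(self, fname: str) -> None:
        # Save the pre-change value once per snapshot, before the first modification.
        for diff in self.snapshots.values():
            if fname not in diff:
                diff[fname] = self.files.get(fname, _ABSENT)

    def write(self, fname: str, content: str) -> None:
        self._remember(fname)
        self.files[fname] = content

    def delete(self, fname: str) -> None:
        self._remember(fname)
        self.files.pop(fname, None)

    def read_snapshot(self, name: str) -> dict:
        """Rebuild the snapshot view by applying saved differences to the current state."""
        view = dict(self.files)
        for fname, old in self.snapshots[name].items():
            if old is _ABSENT:
                view.pop(fname, None)
            else:
                view[fname] = old
        return view


d = SnapshottableDir()
d.write("a.txt", "v1")
d.create_snapshot("s1")
d.write("a.txt", "v2")
d.write("b.txt", "new")
d.delete("a.txt")
print(d.files)                 # current state
print(d.read_snapshot("s1"))   # state as of the snapshot
```

Files that never change after the snapshot cost no extra storage, which is the point of recording differences rather than full copies.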
HDFS File Operations
• Support for basic file operations (create, read, write, delete, rename, etc.)
• Optimized for streaming data access patterns
• Not recommended for large numbers of small files, because the NameNode keeps metadata for every file and block in memory
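A rough estimate shows why many small files are a problem. The sketch below counts one NameNode object per file plus one per block and multiplies by about 150 bytes per object. That figure is an often-quoted rule of thumb, not a measured value, and should be treated as an assumption; the function names are hypothetical.

```python
MB = 1024 * 1024
BYTES_PER_OBJECT = 150   # assumption: rough rule-of-thumb heap cost per file or block object


def namenode_objects(num_files: int, file_size: int, block_size: int = 128 * MB) -> int:
    """Count metadata objects: one per file plus one per block of each file."""
    blocks_per_file = max(1, -(-file_size // block_size))   # ceiling division
    return num_files * (1 + blocks_per_file)


def estimated_heap_bytes(num_files: int, file_size: int, block_size: int = 128 * MB) -> int:
    return namenode_objects(num_files, file_size, block_size) * BYTES_PER_OBJECT


# Roughly the same amount of data stored two ways.
small = estimated_heap_bytes(num_files=10_240_000, file_size=1 * MB)
large = estimated_heap_bytes(num_files=10_000, file_size=1024 * MB)
print(small // MB, "MB of NameNode heap for 1 MB files")
print(large // MB, "MB of NameNode heap for 1 GB files")
```

The data volume is the same in both cases, but the small-file layout needs far more NameNode memory, which is why small files are usually packed into larger ones before being stored.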
HDFS Command Line Interface
• The hdfs dfs command line interface provides a way to interact with HDFS from the command line.
• Common commands include ls (list files/directories), put (upload files), get (download files), rm (remove files/directories), mkdir (create directories), chmod (change permissions), and others.
• The command line interface is useful for managing and administering HDFS clusters.
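As an illustration of the commands listed above, the sketch below builds `hdfs dfs` argument lists in Python and prints them without running anything. The paths and file names are hypothetical, and `hdfs_dfs` is a small helper invented for this example.

```python
import shlex
from typing import List


def hdfs_dfs(command: str, *args: str) -> List[str]:
    """Build the argument list for one 'hdfs dfs' subcommand, e.g. hdfs dfs -ls /path."""
    return ["hdfs", "dfs", f"-{command}", *args]


commands = [
    hdfs_dfs("mkdir", "-p", "/user/alice/input"),                 # create a directory
    hdfs_dfs("put", "local.txt", "/user/alice/input/"),           # upload a local file
    hdfs_dfs("ls", "/user/alice/input"),                          # list the directory
    hdfs_dfs("chmod", "640", "/user/alice/input/local.txt"),      # change permissions
    hdfs_dfs("get", "/user/alice/input/local.txt", "copy.txt"),   # download a copy
    hdfs_dfs("rm", "-r", "/user/alice/input"),                    # remove recursively
]

for argv in commands:
    print(shlex.join(argv))
```

On a machine with a configured Hadoop client, each list could be passed to `subprocess.run(argv, check=True)` to execute the command.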
