
HDFS

HDFS is a distributed file system that runs on commodity hardware. It provides scalable and reliable storage for large files across servers. HDFS has a master/slave architecture with a single NameNode that manages file system metadata and DataNodes that store file data in blocks. Data is replicated across multiple DataNodes for fault tolerance. The NameNode tracks mapping of blocks to DataNodes. HDFS provides high throughput access to application data and is suitable for applications processing large datasets.
Copyright © All Rights Reserved

HDFS

 “Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway.” – Geoffrey Moore
 Geoffrey Moore (born 1946) is an American organizational theorist, management consultant and author, known for his work Crossing the Chasm: Marketing and Selling High-Tech Products to Mainstream Customers.
HDFS Architecture

[Diagram: an HDFS Client talks to the NameNode, with a Secondary NameNode alongside it; DataNodes are spread across Rack 1 … Rack n.]
Hadoop Distributed File System

 HDFS runs on large clusters and provides high-throughput access to data
 Highly fault-tolerant system that works with commodity hardware
 Stores each file as a sequence of blocks
 Blocks are replicated to provide fault tolerance
Characteristics of HDFS
 Scalable storage for large files
 Replication – the default block size is 128 MB and the default replication factor is 3
 Streaming – provides high-throughput streaming reads and writes. The HDFS design relaxes POSIX (Portable Operating System Interface) requirements for access to streaming data
 File appends – files were originally immutable; recent versions support appends
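The block-size and replication defaults above can be made concrete with a small Python sketch (illustrative only, not Hadoop code). It computes how many blocks a file occupies and how much raw cluster storage its replicas consume:

```python
import math

BLOCK_SIZE_MB = 128   # Hadoop's default dfs.blocksize (128 MB)
REPLICATION = 3       # Hadoop's default dfs.replication

def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of HDFS blocks a file of the given size occupies."""
    return math.ceil(file_size_mb / block_size_mb)

def raw_storage_mb(file_size_mb, replication=REPLICATION):
    """Raw cluster storage consumed, counting every replica.
    The final partial block only occupies its actual length, so raw
    usage is simply file size times the replication factor."""
    return file_size_mb * replication

print(block_count(300))     # 3 blocks for a 300 MB file
print(raw_storage_mb(300))  # 900 MB of raw storage across the cluster
```

Note that a 300 MB file uses three block *slots* but the third block stores only 44 MB; HDFS does not pad partial blocks to the full block size.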
HDFS Architecture
Master/Slave architecture: a cluster consists of a single NameNode (master node) and several DataNodes (slave nodes).
HDFS can be deployed on a broad spectrum of machines that support Java.
Name Node
Manages the file system namespace.
Stores the filesystem metadata on disk as two files (the block-to-DataNode mapping is held only in memory and is rebuilt from DataNode block reports):
fsimage – contains a complete snapshot of the filesystem metadata
edits file – stores incremental updates to the metadata
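The fsimage/edits split can be sketched with a toy Python model (an illustration of the snapshot-plus-log idea only; Hadoop's actual on-disk formats are binary and far richer). Every mutation is appended to the edits log, and taking a snapshot truncates the log:

```python
import json
import os
import tempfile

class MiniNameNodeStore:
    """Toy model of NameNode persistence: a full snapshot (fsimage)
    plus an append-only log of incremental updates (edits).
    Class and file names here are illustrative, not Hadoop's."""

    def __init__(self, directory):
        self.fsimage_path = os.path.join(directory, "fsimage")
        self.edits_path = os.path.join(directory, "edits")
        self.namespace = {}  # path -> list of block ids

    def save_fsimage(self):
        """Write a complete snapshot and truncate the edits log."""
        with open(self.fsimage_path, "w") as f:
            json.dump(self.namespace, f)
        open(self.edits_path, "w").close()

    def create_file(self, path, blocks):
        """Apply the change in memory and log it as an edit."""
        self.namespace[path] = blocks
        with open(self.edits_path, "a") as f:
            f.write(json.dumps({"op": "create", "path": path,
                                "blocks": blocks}) + "\n")

d = tempfile.mkdtemp()
store = MiniNameNodeStore(d)
store.save_fsimage()
store.create_file("/logs/app.log", ["blk_1", "blk_2"])
print(sum(1 for _ in open(store.edits_path)))  # 1 edit since the snapshot
```

Logging the edit on every mutation is what lets the NameNode recover metadata changes made after the last snapshot.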
Name Node
 When the NameNode starts, it loads fsimage into memory and applies the edits file to bring fsimage up to date
 It then writes a new fsimage file to disk
Secondary NameNode
 The edits file grows over time
 Checkpointing – applying the accumulated edits to the fsimage file
 Done every hour, or after a configured number of un-checkpointed transactions on the NameNode
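The two checkpoint triggers can be expressed as a single predicate. The defaults below mirror Hadoop's `dfs.namenode.checkpoint.period` (3600 s) and `dfs.namenode.checkpoint.txns` (1,000,000); the function itself is a sketch, not Hadoop code:

```python
def should_checkpoint(seconds_since_last, txns_since_last,
                      period_s=3600, txn_threshold=1_000_000):
    """Checkpoint when an hour has elapsed OR enough un-checkpointed
    transactions have accumulated, whichever comes first."""
    return seconds_since_last >= period_s or txns_since_last >= txn_threshold

print(should_checkpoint(120, 5_000))      # False: neither trigger fired
print(should_checkpoint(4_000, 0))        # True: hour elapsed
print(should_checkpoint(60, 1_500_000))   # True: transaction threshold hit
```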
Secondary NameNode
 Downloads the fsimage and edits files from the NameNode
 Applies the edits file to the fsimage file and uploads the new fsimage file back to the NameNode
Secondary NameNode
[Diagram: the Secondary NameNode queries the Active NameNode for edit logs, produces an updated FsImage with the edit logs applied, and copies the new FsImage back to the NameNode; a Standby NameNode provides high availability of the NameNode.]
Data Nodes
 Store files as data blocks
 Serve read and write requests
 Send heartbeat messages to the NameNode
 Send block reports to the NameNode
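Heartbeats are how the NameNode decides which DataNodes are alive. A simplified NameNode-side liveness check (the 630 s default reflects Hadoop's dead-node interval of 2 × recheck interval + 10 × heartbeat interval; the function and data are illustrative):

```python
def live_datanodes(last_heartbeat, now, timeout_s=630):
    """Return DataNodes whose last heartbeat is within the timeout.
    Nodes outside the window are treated as dead, and the NameNode
    would schedule re-replication of their blocks."""
    return [dn for dn, t in last_heartbeat.items() if now - t <= timeout_s]

# Last-seen timestamps in seconds; dn3 has been silent for 900 s.
heartbeats = {"dn1": 995, "dn2": 1000, "dn3": 100}
print(live_datanodes(heartbeats, now=1000))  # ['dn1', 'dn2']
```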
Rack-Aware Placement Policy
 Blocks are replicated across DataNodes
 One replica is placed on the local node
 Another on a node in a remote rack, and the third on a different node in that same remote rack
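The three placement rules above can be sketched as a small chooser (a toy model of the default policy; it ignores the disk-space and load checks the real `BlockPlacementPolicyDefault` performs, and the topology/node names are invented):

```python
import random

def choose_replica_nodes(topology, writer_rack, writer_node, rng=random):
    """Pick three replica locations per the default rack-aware policy:
    1) the writer's own node, 2) a node on a remote rack,
    3) a different node on that same remote rack."""
    remote_racks = [rack for rack in topology if rack != writer_rack]
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [(writer_rack, writer_node),
            (remote_rack, second),
            (remote_rack, third)]

topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4", "dn5"]}
replicas = choose_replica_nodes(topology, "rack1", "dn1")
print(replicas[0])  # ('rack1', 'dn1') - the local replica
# Replicas 2 and 3 land on two different nodes of the same remote rack.
```

This layout survives the loss of a whole rack (one replica is always elsewhere) while keeping two of the three replica transfers within a single rack's switch.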
HDFS Read Path
 The client requests block locations from the NameNode
 The NameNode checks that the file exists and that the client has permission to read it
 It returns the data block locations sorted by distance from the client node
 The client reads from the local node and other nodes based on the sorted list
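The "sorted by distance" step can be sketched as follows. The 0/2/4 weights echo Hadoop's network-topology distances (same node / same rack / off-rack), but this is a simplified stand-in for the real `NetworkTopology` code, with invented rack and node names:

```python
def sort_by_distance(block_locations, client_rack, client_node):
    """Order replica locations the way the NameNode returns them to
    a reader: local node first, then same-rack, then off-rack."""
    def distance(loc):
        rack, node = loc
        if node == client_node:
            return 0   # local read, no network hop
        if rack == client_rack:
            return 2   # same rack, one switch away
        return 4       # different rack, across the core

    return sorted(block_locations, key=distance)

locs = [("rack2", "dn4"), ("rack1", "dn2"), ("rack1", "dn1")]
print(sort_by_distance(locs, client_rack="rack1", client_node="dn1"))
# [('rack1', 'dn1'), ('rack1', 'dn2'), ('rack2', 'dn4')]
```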
HDFS Write Path
 The client asks the NameNode to create a new file in the filesystem namespace
 The NameNode checks whether the file already exists and whether the client has permission to write it
 It returns an output stream object
 The client writes to the output stream, which splits the data into packets and puts them into a data queue
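The packet-queueing step can be sketched in a few lines. The 64 KB figure matches Hadoop's default `dfs.client-write-packet-size`; the function itself is an illustration, not the real client:

```python
from collections import deque

PACKET_SIZE = 64 * 1024  # HDFS clients write in ~64 KB packets by default

def enqueue_packets(data, queue, packet_size=PACKET_SIZE):
    """Chop outgoing data into fixed-size packets and append them to
    the data queue that the streamer thread later drains."""
    for i in range(0, len(data), packet_size):
        queue.append(data[i:i + packet_size])

q = deque()
enqueue_packets(b"x" * (150 * 1024), q)  # 150 KB of data
print(len(q))      # 3 packets
print(len(q[-1]))  # 22528 bytes in the final, partial packet
```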
HDFS Write Path
 Thread of consumed data pockets
gets block location information from
Namenode
 Pockets of data from data queue
written to first datanode on the
replication pipeline, when then
writes to second datanode and so
on. This goes on till block size is
reached
 Client requests Namenode for new
blocks for additional data
HDFS Write Path
 Acknowledgements are sent back to the client from the DataNodes
 The process continues until all data packets are written to the DataNodes and acknowledged
 The client closes the output stream and asks the NameNode to close the file
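The forward-then-acknowledge flow of the write pipeline can be modelled in miniature (a toy sketch of the data flow only, not the real DataNode wire protocol):

```python
def pipeline_write(packet, pipeline):
    """Forward one packet down the replication pipeline, then collect
    acknowledgements travelling back up the chain to the client."""
    stored = []
    for dn in pipeline:               # forward pass: dn1 -> dn2 -> dn3
        stored.append((dn, packet))
    acks = list(reversed(pipeline))   # acks return: dn3 -> dn2 -> dn1
    return stored, acks

stored, acks = pipeline_write(b"pkt-1", ["dn1", "dn2", "dn3"])
print([dn for dn, _ in stored])  # ['dn1', 'dn2', 'dn3']
print(acks)                      # ['dn3', 'dn2', 'dn1']
```

Because each DataNode forwards the packet before the client sees an ack, the client only pays the latency of the pipeline, not three separate uploads.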