HDFS
04/13/24
Reference
• The Hadoop Distributed File System: Architecture and Design, Apache Software Foundation.
Basic Features: HDFS
• Highly fault-tolerant
• High throughput
• Suitable for applications with large data sets
• Streaming access to file system data
• Can be built out of commodity hardware
Fault tolerance
Data Characteristics
MapReduce
[Figure: MapReduce word-count dataflow — the input (words such as Cat, Bat, Dog, and other words; total size: TByte) is divided into splits; each split is mapped and combined, and the combined outputs are reduced into output partitions part0, part1, and part2.]
Architecture
Namenode and Datanodes
Master/slave architecture.
An HDFS cluster consists of a single Namenode: a master server that manages the file
system namespace and regulates access to files by clients.
There are a number of DataNodes, usually one per node in the cluster.
The DataNodes manage the storage attached to the nodes they run on.
HDFS exposes a file system namespace and allows user data to be stored in files.
A file is split into one or more blocks, and the blocks are stored in DataNodes.
DataNodes serve read and write requests, and perform block creation, deletion, and
replication upon instruction from the Namenode.
HDFS Architecture
[Figure: HDFS architecture — clients send metadata ops to the Namenode, which holds the metadata (name, replicas, e.g. /home/foo/data, 6, ...); the Namenode sends block ops to the Datanodes; clients read and write blocks directly from/to the Datanodes, and Datanodes replicate blocks among themselves.]
File system Namespace
Data Replication
HDFS is designed to store very large files across machines in a large cluster.
Each file is a sequence of blocks; all blocks in a file except the last are the same size.
Blocks are replicated for fault tolerance.
The block size and replication factor are configurable per file.
The Namenode receives a Heartbeat and a BlockReport from each DataNode in the
cluster.
A BlockReport lists all the blocks on a DataNode.
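The BlockReport mechanism above can be sketched in a few lines. This is a hedged illustration, not Hadoop code: it shows how a Namenode-like process could aggregate per-Datanode BlockReports into a block map and flag blocks whose replica count has fallen below the target.

```python
# Sketch only: aggregate BlockReports into a block map and find
# under-replicated blocks. All names here are invented for illustration.
from collections import defaultdict

def find_under_replicated(block_reports, replication_factor):
    """block_reports maps datanode id -> list of block ids it stores."""
    block_map = defaultdict(set)          # block id -> set of datanodes
    for datanode, blocks in block_reports.items():
        for block in blocks:
            block_map[block].add(datanode)
    return {b: nodes for b, nodes in block_map.items()
            if len(nodes) < replication_factor}

reports = {"dn1": ["b1", "b2"], "dn2": ["b1"], "dn3": ["b1", "b2"]}
print(find_under_replicated(reports, 3))  # b2 has only 2 replicas
```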
Replica Placement
The placement of replicas is critical to HDFS reliability and performance.
Optimizing replica placement distinguishes HDFS from other distributed file systems.
Rack-aware replica placement:
Goal: improve reliability, availability, and network bandwidth utilization.
Still a research topic.
A cluster spans many racks; communication between racks goes through switches.
Network bandwidth between machines on the same rack is greater than between
machines on different racks.
The Namenode determines the rack id of each DataNode.
Placing every replica on a unique rack is simple but non-optimal: writes are expensive.
Default replication factor is 3.
Another research topic?
Default policy: one replica on a node in the local rack, one on a different node in the
local rack, and one on a node in a different rack.
Thus 1/3 of the replicas are on one node, 2/3 of the replicas are on one rack, and the
other 1/3 are distributed evenly across the remaining racks.
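The default policy for replication factor 3 can be sketched as follows. This is a hedged illustration of the rack-aware rule described above, not the actual Hadoop implementation: first replica on the writer's node, second on another node in the same rack, third on a node in a different rack.

```python
# Sketch only: rack-aware placement for replication factor 3.
# The topology format and names are invented for illustration.
import random

def place_replicas(writer, topology):
    """topology maps rack id -> list of node names; writer is a node name."""
    local_rack = next(r for r, nodes in topology.items() if writer in nodes)
    first = writer                                       # local node
    second = random.choice(                              # same rack, other node
        [n for n in topology[local_rack] if n != writer])
    remote_rack = random.choice([r for r in topology if r != local_rack])
    third = random.choice(topology[remote_rack])         # different rack
    return [first, second, third]

topology = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5"]}
print(place_replicas("n1", topology))  # e.g. ['n1', 'n3', 'n5']
```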
Replica Selection
• For READ operations, HDFS tries to minimize bandwidth consumption and latency by
selecting the replica closest to the reader.
• If there is a replica on the reader's node, that replica is preferred.
• An HDFS cluster may span multiple data centers: a replica in the local data center is
preferred over a remote one.
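The preference order above amounts to a distance function. A minimal sketch, with invented names and a simplified two-level notion of distance (same node, same data center, remote):

```python
# Sketch only: distance-based replica selection for reads.
def select_replica(reader, replicas, datacenter_of):
    """replicas: list of node names holding the block;
    datacenter_of: node name -> data center id."""
    def distance(node):
        if node == reader:
            return 0                                   # replica on reader's node
        if datacenter_of[node] == datacenter_of[reader]:
            return 1                                   # same data center
        return 2                                       # remote data center
    return min(replicas, key=distance)

dcs = {"n1": "dc1", "n2": "dc1", "n3": "dc2"}
print(select_replica("n2", ["n1", "n3"], dcs))  # n1: same data center wins
```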
Safemode Startup
Filesystem Metadata
Namenode
Keeps an image of the entire file system namespace and file Blockmap in memory.
4 GB of local RAM is sufficient to support these data structures even for a huge
number of files and directories.
When the Namenode starts up, it reads the FsImage and EditLog from its local file
system, applies the EditLog transactions to the FsImage, and then stores the updated
FsImage back to the file system as a checkpoint.
Periodic checkpointing is done so that the system can recover to the last
checkpointed state after a crash.
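The startup sequence above (load FsImage, replay EditLog, write back a checkpoint) can be sketched as follows. The file formats and operation names here are invented for illustration; real HDFS uses its own binary formats.

```python
# Sketch only: Namenode-style checkpointing with made-up JSON formats.
import json

def load_checkpoint(fsimage_path, editlog_path):
    with open(fsimage_path) as f:
        namespace = json.load(f)          # e.g. {"/a": {"replicas": 3}}
    with open(editlog_path) as f:
        for line in f:                    # replay logged transactions in order
            op = json.loads(line)         # e.g. {"op": "create", "path": ...}
            if op["op"] == "create":
                namespace[op["path"]] = op["meta"]
            elif op["op"] == "delete":
                namespace.pop(op["path"], None)
    with open(fsimage_path, "w") as f:    # persist the merged checkpoint
        json.dump(namespace, f)
    open(editlog_path, "w").close()       # truncate the replayed EditLog
    return namespace
```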
Datanode
Protocol
The Communication Protocol
All HDFS communication protocols are layered on top of the TCP/IP protocol.
A client establishes a connection to a configurable TCP port on the Namenode
machine and speaks the ClientProtocol to the Namenode.
The Datanodes talk to the Namenode using the Datanode protocol.
An RPC abstraction wraps both the ClientProtocol and the Datanode protocol.
By design, the Namenode is simply a server: it never initiates a request, it only
responds to RPC requests issued by Datanodes or clients.
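The responder-only design can be sketched as a dispatch table: the Namenode side routes each incoming RPC to the right protocol handler and never calls out on its own. All class, method, and field names below are invented for illustration; they are not Hadoop's actual interfaces.

```python
# Sketch only: a pure request/response dispatcher in the spirit of the
# Namenode's RPC design. Names are invented, not Hadoop APIs.
def make_namenode_server(client_protocol, datanode_protocol):
    handlers = {"client": client_protocol, "datanode": datanode_protocol}
    def handle(request):
        # request: {"protocol": ..., "method": ..., "args": [...]}
        proto = handlers[request["protocol"]]
        return getattr(proto, request["method"])(*request["args"])
    return handle                         # the server only ever responds

class ClientProtocol:
    def mkdir(self, path):
        return f"created {path}"

class DatanodeProtocol:
    def heartbeat(self, datanode_id):
        return "ack"

handle = make_namenode_server(ClientProtocol(), DatanodeProtocol())
print(handle({"protocol": "client", "method": "mkdir", "args": ["/foodir"]}))
# prints "created /foodir"
```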
Robustness
Objectives
DataNode failure and heartbeat
• A network partition can cause a subset of Datanodes to lose connectivity with the
Namenode.
• The Namenode detects this condition by the absence of Heartbeat messages.
• The Namenode marks Datanodes without recent Heartbeats as dead and does not
send any IO requests to them.
• Any data registered to a dead Datanode is no longer available to HDFS.
• The death of a Datanode may also cause the replication factor of some blocks to
fall below their specified value.
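The detection rule is a simple timeout check. A hedged sketch (not Hadoop code; the timeout value below is an assumption, the real one is configurable):

```python
# Sketch only: heartbeat-based failure detection. A Datanode is considered
# dead if no Heartbeat arrived within the timeout; dead nodes get no IO.
HEARTBEAT_TIMEOUT = 10 * 60   # seconds; illustrative value, not the HDFS default

def dead_datanodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """last_heartbeat maps datanode id -> time of its last Heartbeat."""
    return {dn for dn, t in last_heartbeat.items() if now - t > timeout}

beats = {"dn1": 1000.0, "dn2": 390.0}
print(dead_datanodes(beats, now=1030.0))  # dn2 missed its heartbeats
```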
Re-replication
Cluster Rebalancing
• HDFS architecture is compatible with data rebalancing schemes.
• A scheme might move data from one Datanode to another if the free space on a
Datanode falls below a certain threshold.
• In the event of a sudden high demand for a particular file, a scheme might dynamically
create additional replicas and rebalance other data in the cluster.
• These types of data rebalancing are not yet implemented: research issue.
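Since such a scheme was not implemented at the time, the following is purely a sketch of what a threshold-based rebalancing decision could look like; every name and number is invented.

```python
# Sketch only: propose block moves off Datanodes whose free space has
# fallen below a threshold, toward the node with the most free space.
def plan_moves(free_space, threshold):
    """free_space maps datanode id -> free bytes; returns (src, dst) pairs."""
    moves = []
    for dn, free in free_space.items():
        if free < threshold:                           # this node is too full
            dst = max(free_space, key=free_space.get)  # roomiest node
            if dst != dn:
                moves.append((dn, dst))
    return moves

print(plan_moves({"dn1": 5, "dn2": 80, "dn3": 50}, threshold=10))
# dn1 is low on space -> move some of its blocks to dn2
```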
Data Integrity
Metadata Disk Failure
Data Organization
Data Blocks
Staging
Staging (contd.)
Replication Pipelining
• When the client receives a response from the Namenode, it flushes its block in small
pieces (4 KB) to the first replica, which in turn copies each piece to the next replica,
and so on.
• Thus data is pipelined from one Datanode to the next.
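The pipelining above can be sketched as follows. This is a hedged, sequential simulation (not Hadoop code, and not actually concurrent): the client writes 4 KB pieces to the first replica only, and each replica stores a piece before forwarding it downstream.

```python
# Sketch only: replication pipelining in 4 KB pieces; dicts stand in for
# Datanode block stores, and "forwarding" is modeled by iteration order.
PIECE_SIZE = 4 * 1024

def pipeline_write(block, replicas):
    """replicas: ordered list of dicts acting as Datanode block stores."""
    for offset in range(0, len(block), PIECE_SIZE):
        piece = block[offset:offset + PIECE_SIZE]
        for store in replicas:            # dn1 -> dn2 -> dn3
            store.setdefault("block", b"")
            store["block"] += piece       # store, then forward downstream
    return replicas

dn1, dn2, dn3 = {}, {}, {}
pipeline_write(b"x" * 10000, [dn1, dn2, dn3])
print(len(dn1["block"]), dn1 == dn2 == dn3)  # every replica holds the full block
```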
API (Accessibility)
Application Programming Interface
FS Shell, Admin and Browser Interface
• HDFS organizes its data in files and directories.
• It provides a command line interface called the FS shell that lets users interact with
data in HDFS.
• The syntax of the commands is similar to bash and csh.
• Example: to create a directory /foodir:
/bin/hadoop dfs -mkdir /foodir
• There is also a DFSAdmin interface available.
• A browser interface is also available for viewing the namespace.
Space Reclamation
• When a file is deleted by a client, HDFS renames the file into the /trash directory,
where it stays for a configurable amount of time.
• A client can request an undelete within this allowed time.
• After the specified time, the file is deleted and the space is reclaimed.
• When the replication factor of a file is reduced, the Namenode selects excess replicas
that can be deleted.
• The next Heartbeat transfers this information to the Datanode, which then removes
the blocks and frees the space.
Summary
• We discussed the features of the Hadoop Distributed File System, a peta-scale file
system for handling big data sets.
• What we discussed: architecture, protocol, API, etc.
• Missing element: implementation
• The Hadoop file system internals
• An implementation of an instance of HDFS (for use by applications such as web crawlers).