Huawei
File System
Foreword
⚫ This chapter describes the HDFS concept, advantages and
disadvantages, architecture, read and write processes, basic
commands, and use cases.
Objectives
⚫ Upon completion of this course, you will understand:
HDFS advantages and disadvantages
HDFS architecture and key features
HDFS read and write processes
Common HDFS commands and basic operations
Contents
1. HDFS Overview
HDFS Overview
⚫ HDFS (Hadoop Distributed File System) is the storage layer at the bottom
of the Hadoop big data ecosystem and provides distributed storage for big
data. It is designed for large data sets and for high-throughput operations
on large files.
HDFS Advantages
⚫ High fault tolerance: multi-replica mechanism and automatic replica restoration
⚫ Efficiency: computation is moved to the data nodes and runs in parallel
⚫ Cross-platform compatibility: written in Java, so it is highly portable
⚫ Streaming data ingestion: data is ingested as a stream and processed by block in batches
⚫ Simple file model
⚫ Suitable for big data processing
HDFS Disadvantages
⚫ HDFS has the following disadvantages:
HDFS is optimized for writing vast amounts of data in a given period of time, which increases the latency of obtaining data.
HDFS files are stored as blocks. Block metadata occupies a large amount of NameNode memory, and NameNode memory is limited.
A file can be modified by only one user at a time, not by multiple users concurrently.
Data files can only be appended to, not modified at arbitrary positions.
HDFS Architecture
Figure: HDFS architecture, showing clients, the NameNode with its metadata (name, replicas, ...; for example /home/foo/data, 3, ...), metadata ops, block ops, and block replication across Rack 1 and Rack 2.
Block
⚫ The default HDFS block size is 64 MB in versions earlier than Hadoop 2.0 and
128 MB in Hadoop 2.0 or later. A file is divided into multiple blocks. A block
is the storage unit.
⚫ The block size is much larger than that of a common file system, minimizing
the addressing overhead.
⚫ Using the block as the storage unit has the following benefits:
Large-scale file storage
Simplified system design
Well suited to data backup
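The block arithmetic above can be sketched in a few lines of Python. This is an illustration only: `split_into_blocks` is a hypothetical helper, not an HDFS API, and the 128 MB value is the Hadoop 2.0 or later default quoted above.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default block size in Hadoop 2.0 or later, in bytes


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks that a file of file_size bytes occupies.

    Hypothetical helper for illustration; not part of any HDFS API.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes


# A 300 MB file occupies two full 128 MB blocks plus one 44 MB block.
sizes = split_into_blocks(300 * 1024 * 1024)
print(len(sizes), [s // (1024 * 1024) for s in sizes])  # 3 [128, 128, 44]
```

Only the last block can be smaller than the configured block size, so a small file does not consume a full block of disk space.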
Client
⚫ Users most commonly access HDFS through a client, which is provided when HDFS is deployed.
⚫ It is a library that contains HDFS interfaces that hide most of the complexity in HDFS
implementation.
⚫ It supports common operations such as opening, reading, and writing, and provides a shell-
like command line mode to access data in HDFS.
⚫ HDFS also provides Java APIs that serve as client programming interfaces for applications to
access the file system.
⚫ Strictly speaking, the client is not part of the HDFS architecture. It is an HDFS library
that ships with Hadoop and runs outside the HDFS services.
NameNode Functions
⚫ In HDFS, the NameNode manages the namespace of the distributed file system and
stores two core data structures: FsImage and EditLog.
⚫ FsImage (metadata image): The FsImage file contains a serialized form of all the
directory and file inodes in the file system. It maintains the metadata of the file system
tree and of all the files and folders in that tree.
⚫ EditLog (operation log): The EditLog file records update operations such as file creation,
deletion, and renaming.
⚫ When a NameNode starts, FsImage is loaded into memory, and the operations in EditLog are
then replayed so that the in-memory metadata matches the actual metadata. The in-memory
metadata serves client read operations.
Figure: the NameNode holds FsImage and EditLog; the file system tree runs from the root directory through subdirectories down to files.
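The startup sequence (load FsImage, then replay EditLog) can be sketched as follows. This is a toy model under stated assumptions: the namespace is a plain dictionary from path to block list, the three operation names mirror the creation, deletion, and renaming operations named above, and `replay` and the `blk_*` identifiers are hypothetical. The real on-disk formats are different.

```python
def replay(fsimage, editlog):
    """Apply the logged operations to a copy of FsImage and return the result."""
    namespace = dict(fsimage)  # in-memory copy; the FsImage itself is left unchanged
    for op, *args in editlog:
        if op == "create":
            (path,) = args
            namespace[path] = []  # new file with no blocks yet
        elif op == "delete":
            (path,) = args
            namespace.pop(path, None)
        elif op == "rename":
            src, dst = args
            namespace[dst] = namespace.pop(src)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


fsimage = {"/home/foo/data": ["blk_1", "blk_2"]}
editlog = [
    ("create", "/home/foo/tmp"),
    ("rename", "/home/foo/data", "/home/foo/data.old"),
    ("delete", "/home/foo/tmp"),
]
namespace = replay(fsimage, editlog)
print(namespace)  # {'/home/foo/data.old': ['blk_1', 'blk_2']}
```

The longer the EditLog, the longer this replay takes, which is why a long log slows down a NameNode restart.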
SecondaryNameNode Functions
⚫ SecondaryNameNode is a component of the HDFS architecture. It is used to store
replicas of HDFS metadata in a NameNode and reduce the restart time of the
NameNode. Its main function is to periodically combine the FsImage file and
the EditLog file of the NameNode to prevent the log file from being too large.
Typically, SecondaryNameNode runs separately on a node.
⚫ SecondaryNameNode is not the standby node when a NameNode is faulty. It plays a
different role from the NameNode.
⚫ The merge process shown in the figure has the following steps:
1. Sends notifications.
2. Obtains EditLog and FsImage from the active NameNode.
3. Merges the FsImage and EditLog files.
4. Uploads the newly generated FsImage file to the active NameNode.
5. Rolls back FsImage.
Figure: the NameNode holds EditLog, FsImage, and EditLog.new; the SecondaryNameNode holds copies of EditLog and FsImage and produces Fsimage.ckpt.
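The five merge steps can be walked through with a small in-memory model. This is a sketch under assumptions, not Hadoop code: the `NameNode` class, `merge`, and `checkpoint` are hypothetical names, only create and delete operations are modeled, and step 5 is interpreted here as the NameNode replacing FsImage with Fsimage.ckpt and EditLog with EditLog.new.

```python
class NameNode:
    def __init__(self, fsimage, editlog):
        self.fsimage = dict(fsimage)  # last checkpointed namespace
        self.editlog = list(editlog)  # operations logged since that checkpoint
        self.editlog_new = None       # receives new operations during a checkpoint

    def log(self, op):
        target = self.editlog if self.editlog_new is None else self.editlog_new
        target.append(op)

    def roll_editlog(self):
        """Step 1: on notification, start writing new operations to EditLog.new."""
        self.editlog_new = []

    def install_checkpoint(self, fsimage_ckpt):
        """Step 5: replace FsImage with the checkpoint and EditLog with EditLog.new."""
        self.fsimage = fsimage_ckpt
        self.editlog = self.editlog_new
        self.editlog_new = None


def merge(fsimage, editlog):
    """Step 3: apply each (operation, path) pair to a copy of FsImage."""
    merged = dict(fsimage)
    for op, path in editlog:
        if op == "create":
            merged[path] = []
        elif op == "delete":
            merged.pop(path, None)
    return merged


def checkpoint(namenode):
    namenode.roll_editlog()                                # step 1
    fsimage, editlog = namenode.fsimage, namenode.editlog  # step 2: obtain both files
    fsimage_ckpt = merge(fsimage, editlog)                 # step 3
    namenode.install_checkpoint(fsimage_ckpt)              # steps 4 and 5


nn = NameNode({"/a": []}, [("create", "/b"), ("delete", "/a")])
checkpoint(nn)
nn.log(("create", "/c"))  # logged after the checkpoint
print(nn.fsimage, nn.editlog)  # {'/b': []} [('create', '/c')]
```

After the checkpoint, the EditLog contains only operations newer than the FsImage, so the next restart has far less to replay.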
Block Replication
⚫ HDFS stores very large files across machines in a large cluster. It stores each
file as a sequence of blocks; all blocks in a file except the last block are the
same size.
⚫ In most cases, a file has three replicas. Under the default placement policy
documented for current Apache Hadoop releases, the first replica is stored on
the local node when the writer runs on a DataNode, the second on a node in a
different rack, and the third on another node in the same rack as the second.
In this way, blocks are replicated.
Rack Awareness
⚫ Rack awareness: In an HDFS cluster, two nodes on different racks communicate with each
other through a switch. The NameNode knows which rack each DataNode belongs to and uses
this information to place the replicas of a block on DataNodes on different racks.
⚫ Advantages: This prevents data loss when a rack fails and allows the bandwidth of multiple
racks to be fully utilized when data is read. Replicas are also distributed evenly in
the cluster, which balances the load.
⚫ Disadvantage: Under this policy, a write operation must transmit blocks to multiple racks,
which increases the write cost.
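A minimal sketch of rack-aware placement follows. It assumes the default three-replica policy described in the Apache Hadoop documentation (first replica on the writer's node, second on a node in a different rack, third on another node in that second rack). The function `place_replicas`, the rack names, and the DataNode names are hypothetical, and the real NameNode also weighs factors such as free space and node load.

```python
import random


def place_replicas(writer, topology, rng=random):
    """Pick three DataNodes for one block.

    topology maps a rack name to the list of DataNode names on that rack.
    writer is the DataNode on which the writing client runs.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]
    # Candidate remote racks need two nodes to hold the second and third replicas.
    remote_racks = [
        rack for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    ]
    if not remote_racks:
        raise ValueError("need a second rack with at least two DataNodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer, second, third]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas("dn1", topology, random.Random(0))
print(replicas)
```

Because the replicas always span two racks, the block survives the loss of either rack, at the cost of one cross-rack transfer per write.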
Cluster Balancing Policy
⚫ After a balancer task is started, the cluster reads the disk usage of each
node. Based on the configured threshold for the difference in space usage
between hosts, it replicates data from nodes whose space usage is far above
the average to nodes whose space usage is below the average. After the
replication is complete, the data on the original node is deleted. In this
way, the cluster load is balanced.
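The balancing loop can be illustrated with a simplified model. The sketch below is an assumption-laden toy, not the Hadoop balancer: `balance` and `usage` are hypothetical names, data moves in fixed units rather than blocks, and the 10-point threshold is an illustrative value.

```python
def usage(node):
    """Space usage of a node as a percentage of its capacity."""
    return 100.0 * node["used"] / node["capacity"]


def balance(nodes, threshold=10.0, step=1):
    """Move data until no node is more than threshold points above the average.

    nodes maps a node name to {"used": ..., "capacity": ...} in the same unit.
    Returns the list of (source, target, amount) moves and updates nodes in place.
    """
    total_used = sum(n["used"] for n in nodes.values())
    total_capacity = sum(n["capacity"] for n in nodes.values())
    average = 100.0 * total_used / total_capacity
    moves = []
    while True:
        source = max(nodes, key=lambda name: usage(nodes[name]))
        target = min(nodes, key=lambda name: usage(nodes[name]))
        if usage(nodes[source]) <= average + threshold:
            break  # no node is far above the average any more
        if usage(nodes[target]) >= average:
            break  # no node below the average is left to receive data
        after = 100.0 * (nodes[target]["used"] + step) / nodes[target]["capacity"]
        if after > average + threshold:
            break  # the move would only shift the imbalance to the target
        nodes[target]["used"] += step  # replicate to the emptier node first
        nodes[source]["used"] -= step  # then delete the data from the original node
        moves.append((source, target, step))
    return moves


nodes = {
    "dn1": {"used": 90, "capacity": 100},
    "dn2": {"used": 30, "capacity": 100},
    "dn3": {"used": 30, "capacity": 100},
}
moves = balance(nodes)
print(len(moves), {name: n["used"] for name, n in nodes.items()})
```

In this example the average usage is 50%, so data leaves dn1 until its usage falls to the 60% limit, and the total amount of stored data is unchanged.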
Data Integrity
⚫ HDFS uses checksums to verify data integrity. When an HDFS file is created,
the client calculates a checksum for each block in the file and stores the
checksums in a separate hidden file in the same HDFS namespace. When the
client later reads the file, it checks whether the data obtained from each
DataNode matches the checksum in the checksum file. If they do not match,
the client can obtain a replica of the block from another DataNode to ensure
that the obtained data is complete.
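The verify-then-fall-back behavior can be sketched as follows. This is an illustration under assumptions: `zlib.crc32` stands in for whatever checksum algorithm the cluster is configured to use, and `read_block` and the DataNode names are hypothetical.

```python
import zlib


def checksum(block):
    # zlib.crc32 is a stand-in for the checksum algorithm HDFS actually uses.
    return zlib.crc32(block)


def read_block(replicas, expected):
    """Return the first replica whose checksum matches expected.

    replicas is a list of (datanode_name, block_bytes) pairs.
    """
    for datanode, data in replicas:
        if checksum(data) == expected:
            return datanode, data
        # Mismatch: this copy is corrupt, so try the next DataNode.
    raise IOError("all replicas failed checksum verification")


original = b"student records"
expected = checksum(original)     # calculated by the client when the file is created
replicas = [
    ("dn1", b"student recordz"),  # corrupted copy
    ("dn2", original),            # intact copy
]
datanode, data = read_block(replicas, expected)
print(datanode)  # dn2
```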
Snapshot Principle
⚫ A snapshot is a copy of specified files in HDFS at a certain point in time. In
other words, a snapshot is an image of a file or a directory at a specific time.
⚫ Creating an HDFS snapshot creates an index of the file system and a new
space in which files modified afterwards are stored. Once a snapshot is created,
the files and directory structure as of that point in time can be restored from
the snapshot regardless of later changes. Snapshots are read-only and can be
used to restore important data and to recover from misoperations.
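The idea of a read-only, point-in-time index can be shown with a toy model. Everything here is hypothetical (the `Namespace` class, the snapshot name `s0`, the `blk_1` identifier), and the sketch copies the whole index for simplicity, so it is not meant to reflect how HDFS implements snapshots internally.

```python
from types import MappingProxyType


class Namespace:
    """Toy file system index: path -> tuple of block IDs."""

    def __init__(self):
        self.files = {}
        self.snapshots = {}

    def write(self, path, blocks):
        self.files[path] = tuple(blocks)

    def delete(self, path):
        del self.files[path]

    def create_snapshot(self, name):
        # Copy the index only, not the block data; the proxy makes it read-only.
        self.snapshots[name] = MappingProxyType(dict(self.files))

    def restore(self, name):
        self.files = dict(self.snapshots[name])


ns = Namespace()
ns.write("/user/inputs/student.txt", ["blk_1"])
ns.create_snapshot("s0")
ns.delete("/user/inputs/student.txt")  # a misoperation
ns.restore("s0")
print(ns.files)  # {'/user/inputs/student.txt': ('blk_1',)}
```

Attempting to assign into `ns.snapshots["s0"]` raises `TypeError`, which mirrors the read-only property of snapshots.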
HDFS Data Write Process
Figure: HDFS data write process, showing a client node and three DataNodes. Only the step numbers 4 and 5 are recoverable from the figure.
HDFS Data Read Process
Figure: HDFS data read process. Only step 4, "Reads data.", is recoverable from the figure.
Common HDFS Commands (1)
Common HDFS Commands (2)
Uploading Data (Write Operation)
⚫ Uploading a local file to HDFS:
[root@localhost ~]# hdfs dfs -put /root/student.txt /user/inputs
⚫ This command uploads the student.txt file in the local /root
directory to the /user/inputs directory in HDFS.
Downloading Data (Read Operation)
⚫ Downloading a file from HDFS to the local host:
[root@localhost ~]# hdfs dfs -get /user/inputs/student.txt /root/inputs
⚫ This command is used to download the student.txt file from HDFS to the
/root/inputs directory on the local host.
Section Summary
⚫ This chapter described the HDFS concept, advantages and
disadvantages, architecture, read and write processes, basic
commands, and use cases.
Q&A
1. What are the common commands for adding, deleting,
modifying, and querying HDFS data?
Recommendations
⚫ Huawei Cloud websites
Official website: https://www.huaweicloud.com/intl/en-us/
Developer Institute: https://edu.huaweicloud.com/intl/en-us/
Thank You.
Copyright© 2023 Huawei Technologies Co., Ltd. All Rights Reserved.