
Chapter 3 HDFS: Hadoop Distributed File System
Foreword
⚫ This chapter describes the HDFS concept, advantages and
disadvantages, architecture, read and write processes, basic
commands, and use cases.

Objectives
⚫ Upon completion of this course, you will understand:
 HDFS advantages and disadvantages
 HDFS architecture and key features
 HDFS read and write processes
 Common HDFS commands and basic operations

Contents
1. HDFS Overview

2. HDFS Basic Components

3. HDFS Key Features

4. HDFS Read and Write Processes

5. HDFS Use Cases

HDFS Overview
⚫ HDFS is the core storage layer at the bottom of the Hadoop big data
ecosystem and supports distributed storage of big data. It is designed
and developed to process large data sets and provides high-throughput,
large-scale file operations.

HDFS Advantages
⚫ High fault tolerance
 Multi-replica mechanism
 Automatic replica restoration
⚫ Efficiency
 Parallel computing: computation is moved to the DataNodes that store the data
⚫ Streaming data ingestion
 Streaming ingestion mode
 Processing by block, in batches
⚫ Cross-platform compatibility
 Java as the programming language
 Strong portability
⚫ Simple file model
 Files are written once and read multiple times
 Files cannot be modified, but can be appended
⚫ Suitable for big data processing
 Support for thousands of nodes in a cluster
 PB-level data processing
HDFS Disadvantages
⚫ HDFS has the following disadvantages:
 Restricted low-latency data access: HDFS is optimized for writing vast amounts of data in a given period of time (high throughput), which increases the latency of retrieving individual records.
 Not suitable for small file storage: HDFS files are stored as blocks, and the metadata of every block occupies memory on the NameNode, whose memory resources are limited.
 Not suitable for concurrent write: a file can be written by only one user at a time, not by multiple users concurrently.
 No support for random file modification: data files can only be appended, not randomly modified.
HDFS Architecture
⚫ Figure: HDFS architecture. The NameNode stores the metadata (file name, replica count, and block list, e.g. /home/foo/data, 3 replicas). Clients send metadata operations to the NameNode and read and write blocks directly on the DataNodes. DataNodes, spread across racks (Rack 1, Rack 2), store the blocks and replicate them between racks under the NameNode's block operations.
Block
⚫ The default HDFS block size is 64 MB in versions earlier than Hadoop 2.0 and
128 MB in Hadoop 2.0 or later. A file is divided into multiple blocks. A block
is the storage unit.
⚫ The block size is much larger than that of a common file system, minimizing
the addressing overhead.
⚫ A block has the following benefits:
 Large-scale file storage
 Simplified system design
 Applicable to data backup
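To make the block-size arithmetic concrete, the sketch below (plain Python, not Hadoop code) divides a hypothetical 300 MB file into default-sized blocks: two full 128 MB blocks plus a 44 MB remainder.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default block size in Hadoop 2.0 and later

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is divided into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

sizes = split_into_blocks(300 * 1024 * 1024)  # a hypothetical 300 MB file
print(len(sizes))                  # 3 blocks
print(sizes[-1] // (1024 * 1024))  # the last block holds only 44 MB
```

Note that the last block occupies only as much storage as it needs, which is why a huge block size does not waste space on small remainders.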

Client
⚫ Clients are the most common way of using HDFS. HDFS provides a client during deployment.
⚫ It is a library that contains HDFS interfaces that hide most of the complexity in HDFS
implementation.
⚫ It supports common operations such as opening, reading, and writing, and provides a shell-
like command line mode to access data in HDFS.
⚫ HDFS also provides Java APIs that serve as client programming interfaces for applications to
access the file system.
⚫ Strictly speaking, the client is not part of the HDFS architecture; it is a library that
ships with Hadoop and is external to HDFS.

NameNode Functions
⚫ In HDFS, the NameNode manages the namespace of the distributed file system and
stores two core data structures: FsImage and EditLog.
 FsImage (metadata mirror): contains a serialized form of all the directory and file inodes in the file system. It maintains the metadata of the file system tree, from the root directory down through subdirectories, files, and the blocks of each file.
 EditLog (operation log): records update operations such as file creation, deletion, and renaming. When the NameNode starts, FsImage is loaded into memory, and the operations in EditLog are then replayed so that the metadata in memory matches the actual metadata. The in-memory metadata serves client read operations.
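The startup sequence above (load FsImage, then replay EditLog) can be sketched with a toy namespace. The dictionary layout and operation names are illustrative, not the real FsImage format or EditLog opcodes:

```python
# Toy namespace: path -> metadata. FsImage is a snapshot of this mapping.
fsimage = {"/user": {"type": "dir"}, "/user/a.txt": {"type": "file"}}

# EditLog: ordered update operations recorded since the snapshot was taken.
editlog = [
    ("create", "/user/b.txt"),
    ("rename", "/user/a.txt", "/user/c.txt"),
    ("delete", "/user/b.txt"),
]

def replay(image, log):
    ns = dict(image)              # 1. load FsImage into memory
    for op, *args in log:         # 2. replay EditLog operations in order
        if op == "create":
            ns[args[0]] = {"type": "file"}
        elif op == "delete":
            del ns[args[0]]
        elif op == "rename":
            ns[args[1]] = ns.pop(args[0])
    return ns

namespace = replay(fsimage, editlog)
print(sorted(namespace))  # ['/user', '/user/c.txt']
```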
Functions of DataNodes
⚫ A DataNode is a worker node of HDFS. It stores and retrieves data based on
Client requests or NameNode scheduling, and periodically sends the list of
blocks stored on it to NameNodes.
 A DataNode is a place where data is stored in a file system.
 A Client or NameNode may request to write or read blocks to or from a DataNode, and
the DataNode periodically returns information about the blocks stored in it to the
NameNode.

SecondaryNameNode Functions
⚫ SecondaryNameNode is a component of the HDFS architecture. It stores replicas of
the HDFS metadata kept by the NameNode and reduces the NameNode restart time.
Its main function is to periodically merge the FsImage and EditLog files of the
NameNode so that the log file does not grow too large. Typically,
SecondaryNameNode runs on a separate node.
⚫ The checkpoint proceeds as follows (figure):
1. The SecondaryNameNode sends a notification to the active NameNode, which rolls over to a new EditLog.
2. It obtains the EditLog and FsImage from the active NameNode.
3. It merges the FsImage and EditLog files into a checkpoint file (FsImage.ckpt).
4. It uploads the newly generated FsImage file to the active NameNode.
5. The active NameNode rolls the new FsImage into place.
⚫ SecondaryNameNode is not a standby node that takes over when the NameNode is
faulty; it plays a different role from the NameNode.
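Conceptually, the merge in step 3 replays the EditLog on top of the FsImage and lets the NameNode start a fresh, empty log. A minimal sketch with toy data structures (not the real checkpoint protocol):

```python
def checkpoint(fsimage, editlog):
    """Merge EditLog into FsImage and return a rolled-over, empty EditLog."""
    new_image = dict(fsimage)
    for op, *args in editlog:        # apply each logged operation in order
        if op == "create":
            new_image[args[0]] = "file"
        elif op == "delete":
            new_image.pop(args[0], None)
    return new_image, []             # new FsImage.ckpt and a fresh EditLog

image, log = checkpoint({"/a": "file"}, [("create", "/b"), ("delete", "/a")])
print(image, log)  # {'/b': 'file'} []
```

Because the checkpoint keeps the EditLog short, a restarting NameNode only has to replay a small log instead of every operation since the cluster began, which is what reduces the restart time.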
Block Replication
⚫ HDFS stores very large files across machines in a large cluster. It stores each
file as a sequence of blocks; all blocks in a file except the last block are the
same size.
⚫ In most cases, a file has three replicas. The HDFS storage policy is to store a
replica on a node in the local rack, store a replica on another node in the
same rack, and store the last replica on a node in a different rack. In this
way, blocks are replicated.
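The placement policy described above can be sketched as follows. The rack and node names are hypothetical, and the real placement logic also weighs node load and free space:

```python
import random

# Toy cluster topology: rack name -> node names (hypothetical example data)
topology = {
    "rack1": ["n1", "n2", "n3"],
    "rack2": ["n4", "n5"],
}

def place_replicas(local_node, topology):
    """Pick 3 targets as described above: the local node, another node on
    the same rack, and a node on a different rack."""
    local_rack = next(r for r, nodes in topology.items() if local_node in nodes)
    same_rack = [n for n in topology[local_rack] if n != local_node]
    other_racks = [n for r, nodes in topology.items() if r != local_rack
                   for n in nodes]
    return [local_node, random.choice(same_rack), random.choice(other_racks)]

replicas = place_replicas("n1", topology)
print(replicas)  # e.g. ['n1', 'n3', 'n5']
```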

Rack Awareness
⚫ Rack awareness: In an HDFS cluster, two nodes on different racks communicate with each
other through a switch. NameNode can horizontally replicate block replicas and store them
on DataNodes on different racks. This process is rack awareness.

⚫ Advantages: This prevents data loss when a rack fails and allows the bandwidth of multiple
racks to be fully utilized when data is read. In this way, replicas can be evenly distributed in
the cluster, implementing load balancing.

⚫ Disadvantage: A write operation of this policy needs to transmit blocks to multiple racks,
which increases the write cost.

Cluster Balancing Policy
⚫ After Hadoop starts a balancer task, the cluster automatically reads the disk
usage of each node and replicates data from the node whose space usage is
far greater than the average value to the node whose space usage is lower
than the average value based on the configured host space usage difference.
After the replication is complete, the original node data is deleted. In this
way, the cluster load is balanced.
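The balancing decision can be sketched as a toy function. The utilization figures and the 0.1 threshold are illustrative, and the real balancer moves individual blocks over the network rather than abstract amounts:

```python
def balance(usage, threshold=0.1):
    """Move load from nodes far above the average utilization to the
    most under-used nodes until every node is within threshold of it."""
    avg = sum(usage.values()) / len(usage)
    moves = []
    for src in sorted(usage, key=usage.get, reverse=True):
        while usage[src] - avg > threshold:
            dst = min(usage, key=usage.get)                    # most under-used node
            amount = round(min(usage[src] - avg, avg - usage[dst]), 2)
            moves.append((src, dst, amount))                   # replicate, then delete at src
            usage[src] = round(usage[src] - amount, 2)
            usage[dst] = round(usage[dst] + amount, 2)
    return moves

usage = {"n1": 0.90, "n2": 0.30, "n3": 0.30}   # disk usage ratio per node
moves = balance(usage)
print(moves)  # [('n1', 'n2', 0.2), ('n1', 'n3', 0.2)]
```

After the two moves, every node sits at the 0.5 average, mirroring the replicate-then-delete cycle described above.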

Data Integrity
⚫ With checksum checkpointing, when an HDFS file is created on the client, the
client calculates the checksum of all blocks in the file and stores the
checksum as an independent hidden file in the same namespace in HDFS.
After obtaining the file, the client checks whether the data obtained from a
DataNode matches the checksum in the checksum file. If they do not match,
the client can obtain a replica of the block from another DataNode to ensure
that the obtained data is complete.
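The verify-and-fall-back behavior can be sketched with CRC-32. A real client checksums small (512-byte) chunks and reports the corrupt replica; this toy checksums whole blocks:

```python
import zlib

def checksum(block: bytes) -> int:
    return zlib.crc32(block)

def read_with_verification(replicas, expected):
    """Try each DataNode replica until one matches the stored checksum."""
    for node, block in replicas.items():
        if checksum(block) == expected:
            return node, block
    raise IOError("all replicas corrupt")

block = b"hello hdfs"
expected = checksum(block)              # stored in the hidden checksum file
replicas = {"dn1": b"hellX hdfs",       # corrupted replica
            "dn2": b"hello hdfs"}       # intact replica
node, data = read_with_verification(replicas, expected)
print(node)  # dn2
```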

Snapshot Principle
⚫ A snapshot is a copy of specified files in HDFS at a certain point in time. In
other words, a snapshot is an image of a file or a directory at a specific time.
⚫ An HDFS snapshot is used to create an index for a file system and create a
new space to store modified files. Once a snapshot is created, the file and file
directory structure at a certain time point can be restored using the snapshot
regardless of the file directory changes. Snapshots are read-only and can be
used to restore important data and prevent misoperations.
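The restore-regardless-of-later-changes property can be sketched with a toy directory class; real HDFS snapshots record directory diffs rather than full copies:

```python
class Dir:
    """Toy directory with read-only restore points."""
    def __init__(self):
        self.files = {}
        self.snapshots = {}

    def snapshot(self, name):
        # Record the current state; later changes do not affect it.
        self.snapshots[name] = dict(self.files)

    def restore(self, name):
        self.files = dict(self.snapshots[name])

d = Dir()
d.files = {"a.txt": 1}
d.snapshot("s0")
d.files["a.txt"] = 2       # modify after the snapshot...
del d.files["a.txt"]       # ...or even delete the file by mistake
d.restore("s0")
print(d.files)  # {'a.txt': 1}
```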

HDFS Data Write Process
⚫ Figure: the HDFS client writes a file through the following steps:
1. The client sends a request to the distributed file system to create the file.
2. The distributed file system asks the NameNode to create the file metadata.
3. The client writes data through an FSDataOutputStream.
4. The output stream writes data packets to the first DataNode, which forwards them along a pipeline of DataNodes, one per replica.
5. Acknowledgment packets travel back along the pipeline to the client.
6. The client closes the file.
7. The distributed file system notifies the NameNode that the write operation is complete.
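Steps 4 and 5 (the replica pipeline) can be simulated with in-memory stand-ins for DataNodes; real HDFS streams 64 KB packets with sequence numbers and timeouts:

```python
def write_pipeline(packet: bytes, pipeline: list):
    """Step 4: forward the packet node by node; step 5: acks return in reverse."""
    stored = {}
    for node in pipeline:            # each DataNode stores the packet, then forwards it
        stored[node] = packet
    acks = list(reversed(pipeline))  # acknowledgments flow back toward the client
    return stored, acks

stored, acks = write_pipeline(b"pkt-0", ["dn1", "dn2", "dn3"])
print(acks)  # ['dn3', 'dn2', 'dn1']
```

The pipeline shape is why the client only ever talks to the first DataNode: replication to the second and third replicas happens DataNode-to-DataNode.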
HDFS Data Read Process
⚫ Figure: the HDFS client reads a file through the following steps:
1. The client asks the distributed file system to open the file.
2. The distributed file system obtains the block information from the NameNode.
3. The client issues read requests through an FSDataInputStream.
4. The input stream reads each block from a DataNode that holds a replica.
5. Subsequent blocks are read from the corresponding DataNodes until the whole file has been read.
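An end-to-end toy version of the read path follows. The block map and node names are hypothetical, and a real client picks the closest replica using network topology:

```python
# NameNode metadata: file -> ordered block IDs; block -> replica locations
blocks_of = {"/user/a.txt": ["b0", "b1"]}
locations = {"b0": ["dn1", "dn2"], "b1": ["dn2", "dn3"]}

# DataNode storage: node -> {block ID -> block bytes}
datanodes = {"dn1": {"b0": b"hello "},
             "dn2": {"b0": b"hello ", "b1": b"hdfs"},
             "dn3": {"b1": b"hdfs"}}

def read_file(path: str) -> bytes:
    data = b""
    for block_id in blocks_of[path]:       # step 2: block info from the NameNode
        node = locations[block_id][0]      # step 4: pick a replica holder
        data += datanodes[node][block_id]  # steps 4-5: read the block data
    return data

print(read_file("/user/a.txt"))  # b'hello hdfs'
```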
Common HDFS Commands (1)

hdfs dfs -cat <hdfs file> : Views the content of a specified file in HDFS.
hdfs dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH... : Modifies the permission on a file.
hdfs dfs -count <hdfs path> : Collects statistics on the number of directories, files, and total file bytes in a specified directory in HDFS.
hdfs dfs -ls <hdfs path> : Lists directories and files in a specified directory in HDFS.
hdfs dfs -mkdir <hdfs path> : Creates a subdirectory in a specified directory in HDFS.
hdfs dfs -get <hdfs file> <local file or dir> : Downloads a specified file from HDFS to a local file or directory.
Common HDFS Commands (2)

hdfs dfs -put <local file> <hdfs file> : Uploads a file to HDFS.
hdfs dfs -rm <hdfs file> : Deletes a file from HDFS.
hdfs dfs -rm -r <hdfs dir> : Deletes directories and files from a specified directory in HDFS.
hdfs dfs -cp <path/file> <path/file> : Copies a file in HDFS.
hdfs dfs -mv <hdfs file> <hdfs file> : Moves a file in HDFS, which is equivalent to cutting or renaming a file.
hdfs dfs -tail <hdfs file> : Displays the content at the end of a file.
hdfs dfs -text <hdfs file> : Displays the file content in characters.
Uploading Data (Write Operation)
⚫ Uploading a local file to HDFS:
[root@localhost ~]# hdfs dfs -put /root/student.txt /user/inputs

⚫ This command is used to upload the student.txt file in the local root
directory to the /user/inputs directory in HDFS.

Downloading Data (Read Operation)
⚫ Downloading a file from HDFS to the local host:
[root@localhost ~]# hdfs dfs -get /user/inputs/student.txt /root/inputs

⚫ This command is used to download the student.txt file from HDFS to the
/root/inputs directory on the local host.

Section Summary
⚫ This chapter described the HDFS concept, advantages and
disadvantages, architecture, read and write processes, basic
commands, and use cases.

Q&A
1. What are the common commands for adding, deleting,
modifying, and querying HDFS data?

2. Describe the HDFS file read process.

3. Describe the HDFS file write process.

Recommendations
⚫ Huawei Cloud websites
 Official website: https://www.huaweicloud.com/intl/en-us/
 Developer Institute: https://edu.huaweicloud.com/intl/en-us/

Thank You.
Copyright© 2023 Huawei Technologies Co., Ltd. All Rights Reserved.
The information in this document may contain predictive statements including,
without limitation, statements regarding the future financial and operating results,
future product portfolio, new technology, etc. There are a number of factors that
could cause actual results and developments to differ materially from those
expressed or implied in the predictive statements. Therefore, such information is
provided for reference purpose only and constitutes neither an offer nor an
acceptance. Huawei may change the information at any time without notice.
