Back-end (BE) ports connect a controller enclosure to a disk enclosure and provide
disks with channels for reading and writing data.
A cache is a memory chip on a disk controller. It provides fast data access and is a
buffer between the internal storage and external interfaces.
An engine is the core component of a program or system developed on an electronic
platform; it usually provides support for a program or a set of systems. In a storage
system, an engine typically refers to a controller enclosure together with the
controllers it houses.
Coffer disks store user data, system configurations, logs, and dirty data in the cache
to protect against unexpected power outages.
− Built-in coffer disk: Each controller of Huawei OceanStor Dorado V6 has one or
two built-in SSDs as coffer disks. See the product documentation for more
details.
− External coffer disk: The storage system automatically selects four disks as coffer
disks. Each coffer disk provides 2 GB space to form a RAID 1 group. The
remaining space can store service data. If a coffer disk is faulty, the system
automatically replaces the faulty coffer disk with a normal disk for redundancy.
Power module: The controller enclosure uses AC power modules for normal
operation.
− A 4 U controller enclosure has four power modules (PSU 0, PSU 1, PSU 2, and
PSU 3). PSU 0 and PSU 1 form a power plane to power controllers A and C and
provide mutual redundancy. PSU 2 and PSU 3 form the other power plane to
power controllers B and D and provide mutual redundancy. It is recommended
that you connect PSU 0 and PSU 2 to one PDU and PSU 1 and PSU 3 to another
PDU for maximum reliability.
− A 2 U controller enclosure has two power modules (PSU 0 and PSU 1) to power
controllers A and B. The two power modules form a power plane and provide
mutual redundancy. Connect PSU 0 and PSU 1 to different PDUs for maximum
reliability.
1.1.3.2 CE Switch
Huawei CloudEngine series fixed switches are next-generation Ethernet switches for data
centers and provide high performance, high port density, and low latency. The switches
use a flexible front-to-rear or rear-to-front design for airflow and support IP SANs and
distributed storage networks.
1.1.4 HDD
1.1.4.1 HDD Structure
A platter is coated with magnetic materials on both surfaces with polarized magnetic
grains to represent a binary information unit, or bit.
A read/write head reads and writes data for platters. It changes the polarities of
magnetic grains on the platter surface to save data.
The actuator arm moves the read/write head to the specified position.
The spindle has a motor and bearing underneath. It rotates the platter so that the
specified position passes under the read/write head.
The control circuit controls the speed of the platter and movement of the actuator
arm, and delivers commands to the head.
Tracks may vary in the number of sectors. A sector can generally store 512 bytes of
user data, but some disks can be formatted into even larger sectors of 4 KB.
The internal transfer rate, also called the sustained transfer rate, is measured when
the head does not need to change tracks or seek a specified sector, but reads and
writes all sectors sequentially and cyclically on one track.
External transfer rate is also called burst data transfer rate or interface transfer rate.
It refers to the data transfer rate between the system bus and the disk buffer and
depends on the disk port type and buffer size.
Serial transmission transfers fewer bits per clock cycle than parallel transmission,
but it generally achieves higher overall speeds because the transmission frequency
can be raised much further.
Serial transmission is also suited to long-distance transmission. The PCIe interface is
a typical example of serial transmission: the transmission rate of a single lane is up
to 2.5 Gbit/s.
Advantages:
− It is applicable to a wide range of devices. One SCSI controller card can connect
to 15 devices simultaneously.
− It provides high performance with multi-task processing, low CPU usage, fast
rotation speed, and a high transmission rate.
− SCSI disks can serve as external or built-in devices in a wide range of
applications and support hot swapping.
Disadvantages:
− High cost and complex installation and configuration.
SAS port:
SAS is similar to SATA in its use of a serial architecture for a high transmission rate
and streamlined internal space with shorter internal connections.
SAS improves the efficiency, availability, and scalability of the storage system. It is
backward compatible with SATA for the physical and protocol layers.
Advantages:
− SAS outperforms parallel SCSI in transmission rate, anti-interference capability,
and connection distance.
Disadvantages:
− SAS disks are more expensive.
Fibre Channel port:
Fibre Channel was originally designed for network transmission rather than disk
ports. It has gradually been applied to disk systems in pursuit of higher speed.
Advantages:
− Easy to upgrade. Supports optical fiber cables with a length over 10 km.
− Large bandwidth
− Strong universality
Disadvantages:
− High cost
− Complex to build
1.1.5 SSD
1.1.5.1 SSD Overview
Traditional disks use magnetic materials to store data, but SSDs use NAND flash with
cells as storage units. NAND flash is a non-volatile random access storage medium that
can retain stored data after the power is turned off. It quickly and compactly stores
digital information.
SSDs eliminate high-speed rotational components for higher performance, lower power
consumption, and zero noise.
SSDs do not have mechanical parts, but this does not mean that they have an infinite life
cycle. Because NAND flash is a non-volatile medium, original data must be erased before
new data can be written. However, there is a limit to how many times each cell can be
erased. Once the limit is reached, data reads and writes become invalid on that cell.
These four types of cells (SLC, MLC, TLC, and QLC) have similar costs but store different amounts of data per cell.
Originally, the capacity of an SSD was only 64 GB or smaller. Now, a TLC SSD can store
up to 2 TB of data. However, each cell type has a different life cycle, resulting in different
SSD reliability. The life cycle is also an important factor in selecting SSDs.
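The summary below lists typical industry figures for the four cell types. The bits-per-cell values are standard; the program/erase (P/E) cycle counts are rough, illustrative approximations that vary by vendor and manufacturing process, not product specifications.

# Rough comparison of NAND cell types (illustrative industry figures only;
# actual endurance varies by vendor and process).
CELL_TYPES = {
    #        bits per cell   approx. P/E cycles (assumption)
    "SLC": {"bits": 1, "pe_cycles": 100_000},
    "MLC": {"bits": 2, "pe_cycles": 10_000},
    "TLC": {"bits": 3, "pe_cycles": 3_000},
    "QLC": {"bits": 4, "pe_cycles": 1_000},
}

for name, spec in CELL_TYPES.items():
    print(f"{name}: {spec['bits']} bit(s) per cell, "
          f"~{spec['pe_cycles']} program/erase cycles")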
For example, host page A was originally stored in flash page X, and the mapping
relationship was A to X. Later, the host rewrites the host page. Flash memory does
not overwrite data, so the SSD writes the new data to a new page Y, establishes the
new mapping relationship of A to Y, and cancels the original mapping relationship.
The data in page X becomes aged and invalid, which is also known as garbage data.
The host continues to write data to the SSD until it is full. In this case, the host
cannot write more data unless the garbage data is cleared.
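The remapping behavior described above can be sketched as a minimal flash translation layer (FTL) model. The class and its structures are illustrative only; a real SSD controller is far more complex.

class SimpleFTL:
    """Minimal sketch of a flash translation layer (illustrative only)."""

    def __init__(self, flash_pages):
        self.free_pages = list(range(flash_pages))  # unused flash pages
        self.mapping = {}      # host page -> flash page
        self.garbage = set()   # flash pages holding invalid (aged) data

    def write(self, host_page):
        if not self.free_pages:
            raise RuntimeError("SSD full: garbage collection required")
        new_page = self.free_pages.pop(0)
        old_page = self.mapping.get(host_page)
        if old_page is not None:
            self.garbage.add(old_page)      # old data becomes garbage
        self.mapping[host_page] = new_page  # establish the new mapping
        return new_page

ftl = SimpleFTL(flash_pages=4)
ftl.write("A")   # host page A -> flash page X (0)
ftl.write("A")   # rewrite: A -> new page Y (1); page 0 becomes garbage
print(ftl.mapping, ftl.garbage)   # {'A': 1} {0}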
SSD read process:
An 8-fold increase in read speed depends on whether the read data is evenly
distributed in the blocks of each channel. If the 32 KB data is stored in the blocks of
channels 1 through 4, the read speed can only support a 4-fold improvement at
most. That is why smaller files are transmitted at a slower rate.
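The speedup reasoning can be expressed numerically. The sketch below assumes an SSD with 8 channels and a 4 KB flash page (illustrative assumptions): the achievable read speedup is capped by the number of channels the requested data actually spans.

def read_speedup(data_kib, page_kib=4, channels_used=8, total_channels=8):
    """Speedup over a single channel, limited by the channels the data spans."""
    pages = -(-data_kib // page_kib)   # pages needed (ceiling division)
    return min(pages, channels_used, total_channels)

print(read_speedup(32, channels_used=8))   # 8-fold: 32 KB spread over 8 channels
print(read_speedup(32, channels_used=4))   # 4-fold: same data on only 4 channels
print(read_speedup(4))                     # small file: no parallel gain (1)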
SmartIO interface modules provide Fibre Channel ports running at various Gbit/s rates
and connect storage devices to application servers.
The optical module rate must match the rate on the interface module label. Otherwise,
the storage system will report an alarm and the port will become unavailable.
1.1.7 Quiz
1. (Multiple-answer question) Which of the following are SSD types?
A. SLC
B. MLC
C. TLC
D. QLC
2. (Multiple-answer question) Which of the following affect HDD performance?
A. Disk capacity
B. Rotation speed
One is storing data copies on another redundant disk to improve data reliability and
read performance.
The other is parity. Parity data is additional information calculated using user data.
For a RAID array that uses parity, an additional parity disk is required. The XOR
(symbol: ⊕) algorithm is used for parity.
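The XOR parity principle can be demonstrated directly: parity is the byte-wise XOR of the data strips, and any single lost strip can be rebuilt by XOR-ing the surviving strips with the parity. The sketch below is a generic illustration, not a specific RAID implementation.

def xor_bytes(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"   # data strips in one stripe
parity = xor_bytes(d0, d1, d2)           # parity strip

# The disk holding d1 fails: rebuild it from the survivors and the parity.
rebuilt_d1 = xor_bytes(d0, d2, parity)
assert rebuilt_d1 == d1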
1.2.1.2 RAID 0
RAID 0, also referred to as striping, provides the best storage performance among all
RAID levels. RAID 0 uses the striping technology to distribute data to all disks in a RAID
array.
Upon receiving a data read request, RAID 0 locates the target data blocks and reads data
from all member disks concurrently. The preceding figure shows a data read process. A
RAID 0 array provides read/write performance that is directly proportional to the number
of member disks.
Best Practices for RAID 0
As shown in the preceding figure, when a system sends a data I/O request to a logical
drive (a RAID 0 array) formed by three drives, the request is converted into three
operations, one for each of the three physical drives.
In this RAID 0 array, a data request is concurrently processed on all of the three drives.
Theoretically, drive read and write rates triple in that operations are performed
concurrently on three disks. In practice, drive read and write rates increase by less than
this due to a variety of factors, such as bus bandwidth. However, the parallel
transmission rate of large amounts of data improves remarkably over the serial
transmission rate.
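A minimal sketch of how striping distributes consecutive data blocks across member disks, assuming a fixed strip size and simple round-robin placement (illustrative only):

def stripe_layout(num_blocks, num_disks):
    """Map consecutive block numbers onto disks in round-robin (RAID 0) order."""
    layout = {disk: [] for disk in range(num_disks)}
    for block in range(num_blocks):
        layout[block % num_disks].append(block)
    return layout

# Blocks D0..D8 striped across a three-disk RAID 0 array.
print(stripe_layout(9, 3))
# {0: [0, 3, 6], 1: [1, 4, 7], 2: [2, 5, 8]}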
1.2.1.3 RAID 1
RAID 1, also referred to as mirroring, maximizes data security. A RAID 1 array uses two
identical disks including one mirror disk. When data is written to a disk, a copy of the
same data is stored in the mirror disk. When the source (physical) disk fails, the mirror
disk takes over services from the source disk to maintain service continuity. The mirror
disk is used as a backup to provide high data reliability.
The amount of data stored in a RAID 1 array is only equal to the capacity of a single disk,
because a copy of the data is retained on the other disk. That is, each gigabyte of data
requires two gigabytes of disk space. Therefore, a RAID 1 array consisting of two disks
has a space utilization of 50%.
As shown in the preceding figure, data blocks D0, D1, and D2 are to be written to disks.
D0 and its copy are written to the two disks (disk 1 and disk 2) at the same time.
Other data blocks are written to the RAID 1 array in the same way by mirroring.
Generally, a RAID 1 array provides write performance of a single disk.
A RAID 1 array reads data from the data disk and the mirror disk at the same time to
improve read performance. If one disk fails, data can be read from the other disk.
A RAID 1 array provides read performance which is the sum of the read performance of
the two disks. When a RAID array degrades, its performance decreases by half.
Best Practices for RAID 1
As shown in the preceding figure, a system sends a data I/O request to a logical drive (a
RAID 1 array) formed by two drives.
When data is written to drive 0, the same data is automatically copied to drive 1.
In a data read operation, data is read from drive 0 and drive 1 at the same time.
1.2.1.4 RAID 3
RAID 3 is similar to RAID 0 but uses dedicated parity stripes. In a RAID 3 array, a
dedicated disk (parity disk) is used to store the parity data of strips in other disks in the
same stripe. If incorrect data is detected or a disk fails, data in the faulty disk can be
recovered using the parity data. RAID 3 applies to data-intensive or single-user
environments where data blocks need to be continuously accessed for a long time. RAID
3 writes data to all member data disks. However, when new data is written to any disk,
RAID 3 recalculates and rewrites parity data. Therefore, when a large amount of data
from an application is written, the parity disk in a RAID 3 array needs to process heavy
workloads. Parity operations have certain impact on the read and write performance of a
RAID 3 array. In addition, the parity disk is subject to the highest failure rate in a RAID 3
array due to heavy workloads. A write penalty occurs when just a small amount of data is
written to multiple disks, which does not improve disk performance as compared with
data writes to a single disk.
RAID 3 uses a single disk for fault tolerance and performs parallel data transmission.
RAID 3 uses striping to divide data into blocks and writes XOR parity data to the last disk
(parity disk).
The write performance of RAID 3 depends on the amount of changed data, the number
of disks, and the time required to calculate and store parity data. If a RAID 3 array
consists of N member disks of the same rotational speed and write penalty is not
considered, its sequential I/O write performance is theoretically slightly inferior to N – 1
times that of a single disk when full-stripe write is performed. (Additional time is
required to calculate redundancy check.)
In a RAID 3 array, data is read by stripe. Data blocks in a stripe can be read concurrently
because all member disks are driven in parallel.
RAID 3 performs parallel data reads and writes. The read performance of a RAID 3 array
depends on the amount of data to be read and the number of member disks.
1.2.1.5 RAID 5
RAID 5 is improved based on RAID 3 and consists of striping and parity. In a RAID 5 array,
data is written to disks by striping. In a RAID 5 array, the parity data of different strips is
distributed among member disks instead of a parity disk.
Similar to RAID 3, a write penalty occurs when just a small amount of data is written.
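To illustrate how RAID 5 rotates parity across its member disks instead of dedicating one parity disk, the sketch below uses one simple rotation scheme; real arrays may rotate parity differently.

def raid5_parity_disk(stripe_index, num_disks):
    """Disk holding the parity strip of a given stripe (simple rotation)."""
    return (num_disks - 1 - stripe_index) % num_disks

for stripe in range(6):
    print(f"stripe {stripe}: parity on disk {raid5_parity_disk(stripe, 5)}")
# The parity strip moves across all five disks instead of staying on one disk.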
1.2.1.6 RAID 6
Data protection mechanisms of all RAID arrays previously discussed considered only
failures of individual disks (excluding RAID 0). The time required for reconstruction
increases along with the growth of disk capacities. It may take several days instead of
hours to reconstruct a RAID 5 array consisting of large-capacity disks. During the
reconstruction, the array is in the degraded state, and the failure of any additional disk
will cause the array to be faulty and data to be lost. This is why some organizations or
units need a dual-redundancy system. In other words, a RAID array should tolerate
failures of up to two disks while maintaining normal access to data. Such dual-
redundancy data protection can be implemented in the following ways:
The first one is multi-mirroring. Multi-mirroring is a method of storing multiple
copies of a data block in redundant disks when the data block is stored in the
primary disk. This means heavy overheads.
The second one is a RAID 6 array. A RAID 6 array protects data by tolerating failures
of up to two disks even at the same time.
The formal name of RAID 6 is distributed double-parity (DP) RAID. It is essentially an
improved RAID 5, and also consists of striping and distributed parity. RAID 6 supports
double parity, which means that:
When user data is written, double parity calculation needs to be performed.
Therefore, RAID 6 provides the slowest data writes among all RAID levels.
Additional parity data takes storage spaces in two disks. This is why RAID 6 is
considered as an N + 2 RAID.
Currently, RAID 6 is implemented in different ways. Different methods are used for
obtaining parity data.
RAID 6 P+Q
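One widely used implementation computes P as the plain XOR of the data strips and Q as a Reed-Solomon syndrome over the Galois field GF(2^8). The sketch below illustrates the principle only, with generic coefficients; it is not a specific vendor implementation. With both P and Q available, any two simultaneously failed strips in a stripe can be recovered by solving two equations.

def gf_mul(a, b, poly=0x11d):
    """Multiply two bytes in GF(2^8) using the common reducing polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly
    return result

def p_q_parity(strips):
    """P = XOR of strips; Q = sum of (2^i) * strip_i in GF(2^8)."""
    length = len(strips[0])
    p = bytearray(length)
    q = bytearray(length)
    for i, strip in enumerate(strips):
        coeff = 1
        for _ in range(i):
            coeff = gf_mul(coeff, 2)   # coeff = 2^i in GF(2^8)
        for j, byte in enumerate(strip):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)

strips = [b"\x10\x20", b"\x30\x40", b"\x55\x66"]   # data strips in one stripe
p, q = p_q_parity(strips)
print(p.hex(), q.hex())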
1.2.1.7 RAID 10
For most enterprises, RAID 0 is not really a practical choice, while RAID 1 is limited by
disk capacity utilization. RAID 10 provides the optimal solution by combining RAID 1 and
RAID 0. In particular, RAID 10 provides superior performance by eliminating write penalty
in random writes.
A RAID 10 array consists of an even number of disks. User data is written to half of the
disks and mirror copies of user data are retained in the other half of disks. Mirroring is
performed based on stripes.
1.2.1.8 RAID 50
RAID 50 combines RAID 0 and RAID 5. Two RAID 5 sub-arrays form a RAID 0 array. The
two RAID 5 sub-arrays are independent of each other. A RAID 50 array requires at least
six disks because a RAID 5 sub-array requires at least three disks.
Prolonged reconstruction and risk of data loss: During reconstruction, the array
remains in the degraded state, and data will be lost if any additional disk or data
block fails. Therefore, a longer reconstruction duration results in higher risk of data
loss.
Material impact on services: During reconstruction, member disks are engaged in
reconstruction and provide poor service performance, which will affect the operation
of upper-layer services.
To solve the preceding problems of traditional RAID and ride on the development of
virtualization technologies, the following alternative solutions emerged:
LUN virtualization: A traditional RAID array is further divided into small units. These
units are regrouped into storage spaces accessible to hosts.
Block virtualization: Disks in a storage pool are divided into small data blocks. A
RAID array is created using these data blocks so that data can be evenly distributed
to all disks in the storage pool. Then, resources are managed based on data blocks.
Huawei RAID 2.0+ is implemented in another way. A disk domain should be created
first. A disk domain is a disk array. A disk can belong to only one disk domain. One or
more disk domains can be created in an OceanStor storage system. A disk domain may
seem similar to a RAID array in that both consist of disks, but there are significant
differences. A RAID array consists of disks of the same type, size, and rotational
speed, and such disks are associated with a RAID level. In contrast, a disk domain
can consist of more than 100 disks of up to three types. Each type of disk is
associated with a storage tier. For example, SSDs are associated with the high
performance tier, SAS disks are associated with the performance tier, and NL-SAS
disks are associated with the capacity tier. A storage tier would not exist if there are
no disks of the corresponding type in a disk domain. A disk domain separates an
array of disks from another array of disks for fully isolating faults and maintaining
independent performance and storage resources. RAID levels are not specified when
a disk domain is created. That is, data redundancy protection methods are not
specified. Actually, RAID 2.0+ provides more flexible and specific data redundancy
protection methods. The storage space formed by disks in a disk domain is divided
into storage pools of a smaller granularity and hot spare space shared among
storage tiers. The system automatically sets the hot spare space based on the hot
spare policy (high, low, or none) set by an administrator for the disk domain and the
number of disks at each storage tier in the disk domain. In a traditional RAID array,
an administrator has to specify a disk as the hot spare disk.
2. Storage Pool and Storage Tier
A storage pool is a storage resource container. The storage resources used by
application servers are all from storage pools.
A storage tier is a collection of storage media providing the same performance level
in a storage pool. Different storage tiers manage storage media of different
performance levels and provide storage space for applications that have different
performance requirements.
A storage pool created based on a specified disk domain dynamically allocates CKs
from the disk domain to form CKGs according to the RAID policy of each storage tier
for providing storage resources with RAID protection to applications.
A storage pool can be divided into multiple tiers based on disk types.
When creating a storage pool, a user is allowed to specify a storage tier and related
RAID policy and capacity for the storage pool.
OceanStor storage systems support RAID 1, RAID 10, RAID 3, RAID 5, RAID 50, and
RAID 6 and related RAID policies.
The capacity tier consists of large-capacity SATA and NL-SAS disks. DP RAID 6 is
recommended.
3. Disk Group
An OceanStor storage system automatically divides disks of each type in each disk
domain into one or more disk groups (DGs) according to disk quantity.
One DG consists of disks of only one type.
CKs in a CKG are allocated from different disks in a DG.
Data Storage Technology Page 27
DGs are internal objects automatically configured by OceanStor storage systems and
typically used for fault isolation. DGs are not presented externally.
4. Logical Drive
A logical drive (LD) is a disk that is managed by a storage system and corresponds to
a physical disk.
5. CK
A chunk (CK) is a disk space of a fixed size carved from a disk in the disk domain. It is
the basic unit of a RAID array.
6. CKG
A chunk group (CKG) is a logical storage unit that consists of CKs from different
disks in the same DG based on the RAID algorithm. It is the minimum unit for
allocating resources from a disk domain to a storage pool.
All CKs in a CKG are allocated from the disks in the same DG. A CKG has RAID
attributes, which are actually configured for corresponding storage tiers. CKs and
CKGs are internal objects automatically configured by storage systems. They are not
presented externally.
7. Extent
Each CKG is divided into logical storage spaces of a specific and adjustable size called
extents. Extent is the minimum unit (granularity) for migration and statistics of hot
data. It is also the minimum unit for space application and release in a storage pool.
An extent belongs to a volume or LUN. A user can set the extent size when creating
a storage pool. After that, the extent size cannot be changed. Different storage pools
may consist of extents of different sizes, but one storage pool must consist of extents
of the same size.
8. Grain
When a thin LUN is created, extents are divided into 64 KB blocks which are called
grains. A thin LUN allocates storage space by grains. Logical block addresses (LBAs)
in a grain are consecutive.
Grains are mapped to thin LUNs. A thick LUN does not involve grains.
9. Volume and LUN
A volume is an internal management object in a storage system.
A LUN is a storage unit that can be directly mapped to a host for data reads and
writes. A LUN is the external embodiment of a volume.
A volume organizes all extents and grains of a LUN and applies for and releases
extents to increase and decrease the actual space used by the volume.
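The objects above form a strict hierarchy: disks in a disk domain are carved into CKs, CKs from different disks in a DG form CKGs under a RAID policy, CKGs are divided into extents, and (for thin LUNs) extents are divided into grains. The data-structure sketch below summarizes this hierarchy; the class names and default sizes are illustrative assumptions, not the storage system's internal implementation.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Chunk:                 # CK: fixed-size space carved from one disk
    disk_id: int
    size_mib: int = 64       # illustrative size

@dataclass
class ChunkGroup:            # CKG: CKs from different disks, bound by a RAID policy
    raid_policy: str
    chunks: List[Chunk] = field(default_factory=list)

@dataclass
class Extent:                # granularity of hot-data migration and space allocation
    ckg: ChunkGroup
    size_mib: int = 4        # set at storage pool creation; illustrative value

@dataclass
class Grain:                 # 64 KB allocation unit used by thin LUNs only
    extent: Extent
    size_kib: int = 64

# One CKG built from CKs on five different disks under RAID 5 protection.
ckg = ChunkGroup("RAID 5", [Chunk(disk_id=d) for d in range(5)])
extent = Extent(ckg)
grain = Grain(extent)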
If a disk fails and no spare space is available from the remaining disks, the system
dynamically reconstructs the original N + M chunks into (N - 1) + M chunks. When a new
SSD is inserted, the system migrates data from the (N - 1) + M chunks to the newly
constructed N + M chunks for efficient disk utilization.
Dynamic RAID adopts the erasure coding (EC) algorithm and, when only SSDs are used,
can dynamically adjust the number of CKs in a CKG to meet system reliability and
capacity requirements.
1.2.3.2 RAID-TP
RAID protection is essential to a storage system for consistently high reliability and
performance. However, the reliability of RAID protection is challenged by uncontrollably
long reconstruction times caused by the drastic increase in disk capacities.
RAID-TP achieves optimal performance, reliability, and capacity utilization.
Customers have to purchase disks of larger capacity to replace existing disks for system
upgrades. In such a case, one system may contain disks of different capacities. How can
optimal capacity utilization be maintained in a system that uses a mix of disks with
different capacities?
RAID-TP uses Huawei's optimized FlexEC algorithm that allows the system to tolerate
failures of up to three disks, improving reliability while allowing a longer reconstruction
time window.
RAID-TP with FlexEC algorithm reduces the amount of data read from a single disk by
70%, as compared with traditional RAID, minimizing the impact on system performance.
In a typical 4:2 RAID 6 array, the capacity utilization is about 67%. The capacity
utilization of a Huawei OceanStor all-flash storage system with 25 disks is improved by
20% on this basis.
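As a quick check of these figures, the capacity utilization of an N+M parity scheme is N / (N + M). The calculation below uses the 4+2 RAID 6 layout from the text; the 22+3 RAID-TP stripe width is an assumed example purely for illustration, as actual stripe widths depend on the configuration.

def utilization(data_disks, parity_disks):
    """Capacity utilization of an N+M parity layout."""
    return data_disks / (data_disks + parity_disks)

print(f"RAID 6, 4+2:   {utilization(4, 2):.1%}")    # about 67%
print(f"RAID-TP, 22+3: {utilization(22, 3):.1%}")   # illustrative wider stripe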
1.2.4 Quiz
1. What is the difference between a strip and a stripe?
2. Which RAID level would you recommend if a user focuses on reliability and random
write performance?
3. (True or false) Data access will remain unaffected if any disk in a RAID 10 array fails.
4. (Multiple-answer question) What are the advantages of RAID 2.0+?
A. Automatic load balancing to reduce fault rate
B. Fast thin reconstruction to reduce dual-disk failure rate
C. Intelligent fault troubleshooting to maintain system reliability
D. Virtual storage pools to simplify storage planning and management
5. (Multiple-answer question) Which of the following storage pool technologies use
extents of RAID 2.0+ as basic units?
A. Applying for space
B. Space release
C. Statistics collection
D. Data migration
− Bus architecture
− Hi-Star architecture
− Direct-connection architecture
− Virtual matrix architecture
Mission-critical storage architecture evolution:
In 1990, EMC launched Symmetrix, a full bus architecture. A parallel bus connected
front-end interface modules, cache modules, and back-end disk interface modules
for data and signal exchange in time-division multiplexing mode.
In 2000, HDS adopted the switching architecture for Lightning 9900 products. Front-
end interface modules, cache modules, and back-end disk interface modules were
connected on two redundant switched networks, increasing communication channels
to dozens of times more than that of the bus architecture. The internal bus was no
longer a performance bottleneck.
In 2003, EMC launched the DMX series based on a full-mesh architecture. All modules
were connected in point-to-point mode, which theoretically provided larger internal
bandwidth but added system complexity and limited scalability.
In 2009, to reduce hardware development costs, EMC launched the distributed
switching architecture by connecting a separated switch module to the tightly
coupled dual-controller of mid-range storage systems. This achieved a balance
between costs and scalability.
In 2012, Huawei launched the Huawei OceanStor 18000 series, a mission-critical
storage product also based on distributed switching architecture.
Storage Software Technology Evolution:
A storage system combines unreliable and low-performance disks to provide high-
reliability and high-performance storage through effective management. Storage systems
provide sharing, easy-to-manage, and convenient data protection functions. Storage
system software has evolved from basic RAID and cache to data protection features such
as snapshot and replication, to dynamic resource management with improved data
management efficiency, and deduplication and tiered storage with improved storage
efficiency.
Distributed Storage Architecture:
A distributed storage system organizes local HDDs and SSDs of general-purpose
servers into a large-scale storage resource pool, and then distributes data to multiple
data storage servers.
Huawei's current distributed storage draws on Google's approach: a distributed file
system is built across multiple servers, and storage services are then implemented on
top of that file system.
Most storage nodes are general-purpose servers. Huawei OceanStor 100D is
compatible with multiple general-purpose x86 servers and Arm servers.
− Protocol: storage protocol layer. The block, object, HDFS, and file services
support local mounting access over iSCSI or VSC, S3/Swift access, HDFS access,
and NFS access respectively.
− VBS: block access layer of FusionStorage Block. User I/Os are delivered to VBS
over iSCSI or SCSI.
− EDS-B: provides block services with enterprise features, and receives and
processes I/Os from VBS.
− EDS-F: provides the HDFS service.
− Metadata Controller (MDC): The metadata control device controls distributed
cluster node status, data distribution rules, and data rebuilding rules.
− Object Storage Device (OSD): the component that stores user data in the
distributed cluster.
− Cluster Manager (CM): manages cluster information.
Port consistency: In a loop, the EXP (P1) port of an upper-level disk enclosure is
connected to the PRI (P0) port of a lower-level disk enclosure.
Dual-plane networking: Expansion board A connects to controller A, while expansion
board B connects to controller B.
Symmetric networking: On controllers A and B, symmetric ports and slots are
connected to the same disk enclosure.
Forward connection networking: Both expansion modules A and B use forward
connection.
Cascading depth: The number of cascaded disk enclosures in a loop cannot exceed
the upper limit.
IP scale-out is used for Huawei OceanStor V3 and V5 entry-level and mid-range series,
Huawei OceanStor V5 Kunpeng series, and Huawei OceanStor Dorado V6 series. IP scale-
out integrates TCP/IP, Remote Direct Memory Access (RDMA), and Internet Wide Area
RDMA Protocol (iWARP) to implement service switching between controllers, which
complies with the all-IP trend of the data center network.
PCIe scale-out is used for Huawei OceanStor 18000 V3 and V5 series, and Huawei
OceanStor Dorado V3 series. PCIe scale-out integrates PCIe channels and the RDMA
technology to implement service switching between controllers.
PCIe scale-out: features high bandwidth and low latency.
IP scale-out: employs standard data center technologies (such as ETH, TCP/IP, and
iWARP) and infrastructure, and boosts the development of Huawei's proprietary chips for
entry-level and mid-range products.
Next, let's move on to I/O read and write processes of the host. The scenarios are as
follows:
Local Write Process
− A host delivers write I/Os to engine 0.
− Engine 0 writes the data into the local cache, implements mirror protection, and
returns a message indicating that data is written successfully.
− Engine 0 flushes dirty data onto a disk. If the target disk belongs to the local
engine, engine 0 directly delivers the write I/Os.
− If the target disk is on a remote device, engine 0 transfers the I/Os to the engine
(engine 1 for example) where the disk resides.
− Engine 1 writes dirty data onto disks.
Non-local Write Process
− A host delivers write I/Os to engine 2.
− After detecting that the LUN is owned by engine 0, engine 2 transfers the write
I/Os to engine 0.
− Engine 0 writes the data into the local cache, implements mirror protection, and
returns a message to engine 2, indicating that data is written successfully.
− Engine 2 returns the write success message to the host.
− Engine 0 flushes dirty data onto a disk. If the target disk belongs to the local
engine, engine 0 directly delivers the write I/Os.
− If the target disk is on a remote device, engine 0 transfers the I/Os to the engine
(engine 1 for example) where the disk resides.
− Engine 1 writes dirty data onto disks.
Local Read Process
− A host delivers read I/Os to engine 0.
− If the read I/Os are hit in the cache of engine 0, engine 0 returns the data to the
host.
− If the read I/Os are not hit in the cache of engine 0, engine 0 reads data from
the disk. If the target disk belongs to the local engine, engine 0 reads data from
that disk directly.
− After the read I/Os are hit locally, engine 0 returns the data to the host.
− If the target disk is on a remote device, engine 0 transfers the I/Os to the engine
(engine 1 for example) where the disk resides.
− Engine 1 reads data from the disk.
− Engine 1 accomplishes the data read.
− Engine 1 returns the data to engine 0 and then engine 0 returns the data to the
host.
Non-local Read Process
− The LUN is not owned by the engine that delivers read I/Os, and the host
delivers the read I/Os to engine 2.
− After detecting that the LUN is owned by engine 0, engine 2 transfers the read
I/Os to engine 0.
− If the read I/Os are hit in the cache of engine 0, engine 0 returns the data to
engine 2.
− Engine 2 returns the data to the host.
− If the read I/Os are not hit in the cache of engine 0, engine 0 reads data from
the disk. If the target disk belongs to the local engine, engine 0 reads data from
that disk directly.
− After the read I/Os are hit locally, engine 0 returns the data to engine 2 and then
engine 2 returns the data to the host.
− If the target disk is on a remote device, engine 0 transfers the I/Os to engine 1
where the disk resides.
− Engine 1 reads data from the disk.
− Engine 1 completes the data read.
− Engine 1 returns the data to engine 0, engine 0 returns the data to engine 2, and
then engine 2 returns the data to the host.
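All four flows above follow one routing rule: the engine that receives an I/O forwards it to the LUN's owning engine if necessary; the owning engine serves writes from its mirrored cache and reads from its cache when possible, and disk access is delegated to the engine where the target disk resides. The sketch below models only this routing decision; the function and its parameters are illustrative, not actual storage system code.

def handle_io(receiving_engine, owning_engine, disk_engine, cache_hit, op):
    """Trace the engines an I/O passes through (illustrative model only)."""
    path = [f"host -> engine {receiving_engine}"]
    if receiving_engine != owning_engine:
        path.append(f"forward to owning engine {owning_engine}")
    if op == "write":
        path.append(f"engine {owning_engine}: write cache + mirror, ack host")
        path.append(f"flush dirty data via engine {disk_engine}")
    else:  # read
        if cache_hit:
            path.append(f"engine {owning_engine}: cache hit, return data")
        else:
            path.append(f"read from disk via engine {disk_engine}, return data")
    return path

print(handle_io(2, 0, 1, cache_hit=False, op="read"))   # non-local read, remote disk
print(handle_io(0, 0, 0, cache_hit=True, op="read"))    # local read, cache hit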
Symmetric architecture
− All products support host access in active-active mode. Requests can be evenly
distributed to each front-end link.
− They eliminate LUN ownership by controllers, making LUNs easier to use and
balancing loads. They accomplish this by dividing a LUN into multiple slices that
are then evenly distributed to all controllers using the DHT algorithm.
− Mission-critical products reduce latency with intelligent FIMs that divide LUNs
into slices for host I/Os and send the requests to their target controllers.
Shared port
− A single port is shared by four controllers in a controller enclosure.
− Loads are balanced without host multipathing.
Global cache
− The system directly writes received I/Os (in one or two slices) to the cache of the
corresponding controller and sends an acknowledgement to the host.
− The intelligent read cache of all controllers participates in prefetch and cache hit
of all LUN data and metadata.
FIMs of Huawei OceanStor Dorado 8000 and 18000 V6 series storage adopt Huawei-
developed Hi1822 chip to connect to all controllers in a controller enclosure via four
internal links and each front-end port provides a communication link for the host. If any
controller restarts during an upgrade, services are seamlessly switched to the other
controller without impacting hosts and interrupting links. The host is unaware of
controller faults. Switchover is completed within 1 second.
The FIM has the following features:
Failure of a controller will not disconnect the front-end link, and the host is unaware
of the controller failure.
When a controller fails, the PCIe link between the FIM and the controller is
disconnected, and the FIM detects the controller failure.
Service switchover is performed between the controllers, and the FIM redistributes
host requests to other controllers.
The switchover time is about 1 second, which is much shorter than switchover
performed by multipathing software (10-30s).
In global cache mode, host data is directly written into linear space logs, and the logs
directly copy the host data to the memory of multiple controllers using RDMA based on a
preset copy policy. The global cache consists of two parts:
Global memory: memory of all controllers (four controllers in the figure). This is
managed in a unified memory address, and provides linear address space for the
upper layer based on a redundancy configuration policy.
WAL: a write-ahead log that serves as the cache for newly written data
The global pool uses RAID 2.0+, full-strip write of new data, and shared RAID groups
between multiple strips.
1.3.4 Quiz
1. (True or false) Scale-up is a method in which disk enclosures are continuously added
to existing storage systems to handle increasing data volumes.
2. What are the differences between scale-up and scale-out?
3. If any controller of an OceanStor V3 storage system is faulty, the other controller can
seamlessly take over services using the host multipathing software to ensure service
continuity.
4. (Multiple-answer question) Which of the following specifications can be selected
when RH2288 V3 is fully configured with disks?
A. 8 disks
B. 12 disks
C. 16 disks
D. 25 disks
5. (Multiple-answer question) Which operating systems are supported by Huawei
OceanStor 5300 V3 block storage?
A. Windows
B. Linux
C. FusionSphere
D. VMware
1.4.2 NAS
Enterprises need to store a large amount of data and share the data through a network.
Therefore, network-attached storage (NAS) is a good choice. NAS connects storage
devices to the live network to provide data and file services.
For a server or host, a NAS device is an external device and can be flexibly deployed
through a network. In addition, NAS provides file-level sharing rather than block-level
sharing, which makes it easier for clients to access NAS over a network. UNIX and
Microsoft Windows users can seamlessly share data using NAS or File Transfer Protocol
(FTP). When NAS is used for data sharing, UNIX uses NFS and Windows uses CIFS.
NAS has the following characteristics:
NAS provides storage resources through file-level data access and sharing, enabling
users to quickly share files with minimum storage management costs.
NAS is preferred for file sharing storage because it does not require multiple file
servers.
NAS also helps eliminate bottlenecks in access to general-purpose servers.
NAS uses network and file sharing protocols for archiving and storage. These
protocols include TCP/IP for data transmission as well as CIFS and NFS for remote
file services.
A general-purpose server runs a general-purpose operating system and can carry any
application. Unlike general-purpose servers, NAS is dedicated for file services and
provides file sharing services for other operating systems using open standard protocols.
NAS devices are optimized based on general-purpose servers in aspects such as file
service functions, storage, and retrieval. To improve the high availability of NAS devices,
some vendors also provide the NAS clustering function.
The components of a NAS device are as follows:
NAS engine (CPU and memory)
One or more NICs for network connection, for example, GE NIC and 10GE NIC
The namespace of the NFSv4 file system is changed. A root file system (fsid=0)
must be set on the server, and other file systems are mounted to the root file system
for export.
Compared with NFSv3, the cross-platform feature of NFSv4 is enhanced.
Working principles of NFS: Like other file sharing protocols, NFS also uses the C/S
architecture. However, NFS provides only the basic file processing function and does not
provide any TCP/IP data transmission function. The TCP/IP data transmission function can
be implemented only by using the Remote Procedure Call (RPC) protocol. NFS file
systems are completely transparent to clients. Accessing files or directories in an NFS file
system is the same as accessing local files or directories.
One program can use RPC to request a service from a program located in another
computer over a network without having to understand the underlying network
protocols. RPC assumes the existence of a transmission protocol such as Transmission
Control Protocol (TCP) or User Datagram Protocol (UDP) to carry the message data
between communicating programs. In the OSI network communication model, RPC
traverses the transport layer and application layer. RPC simplifies development of
applications.
RPC works based on the client/server model. The requester is a client, and the service
provider is a server. The client sends a call request with parameters to the RPC server and
waits for a response. On the server side, the process remains in a sleep state until the call
request arrives. Upon receipt of the call request, the server obtains the process
parameters, outputs the calculation results, and sends the response to the client. Then,
the server waits for the next call request. The client receives the response and obtains call
results.
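The call-and-response model described above can be tried out with Python's standard XML-RPC modules. This is a generic RPC illustration only; it is not the ONC RPC variant that NFS actually uses.

# Server side: registers a procedure and waits for call requests.
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):
    return a + b

server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
server.register_function(add, "add")
# server.serve_forever()   # blocks, waiting for incoming call requests

# Client side: sends a call request with parameters and waits for the response.
import xmlrpc.client
proxy = xmlrpc.client.ServerProxy("http://localhost:8000")
# result = proxy.add(2, 3)   # returns 5 once the server is running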
One of the typical applications of NFS is using the NFS server as internal shared storage
in cloud computing. The NFS client is optimized based on cloud computing to provide
better performance and reliability. Cloud virtualization software (such as VMware)
optimizes the NFS client, so that the VM storage space can be created on the shared
space of the NFS server.
NDMP Protocol
The backup process of the traditional NAS storage is as follows:
A NAS device is a closed storage system. The Client Agent of the backup software can
only be installed on the production system instead of the NAS device. In the traditional
network backup process, data is read from a NAS device through the CIFS or NFS sharing
protocol, and then transferred to a backup server over a network.
Such a mode occupies network, production system and backup server resources, resulting
in poor performance and an inability to meet the requirements for backing up a large
amount of data.
The NDMP protocol is designed for the data backup system of NAS devices. It enables
NAS devices, without any backup client agent, to send data directly to the connected disk
devices or the backup servers on the network for backup.
There are two networking modes for NDMP:
On a 2-way network, the backup medium is connected directly to a NAS storage system
instead of to a backup server. In a backup process, the backup server sends a backup
command to the NAS storage system through the Ethernet. The system then directly
backs up data to the tape library it is connected to.
In the NDMP 2-way backup mode, data flows are transmitted directly to backup media,
greatly improving the transmission performance and reducing server resource usage.
However, a tape library is connected to a NAS storage device, so the tape library can
back up data only for the NAS storage device to which it is connected.
Tape libraries are expensive. To enable different NAS storage devices to share tape
devices, NDMP also supports the 3-way backup mode.
In the 3-way backup mode, a NAS storage system can transfer backup data to a NAS
storage device connected to a tape library through a dedicated backup network. Then,
the storage device backs up the data to the tape library.
Working principles of CIFS: CIFS runs on top of TCP/IP and allows Windows computers to
access files on UNIX computers over a network.
CIFS Protocol
In 1996, Microsoft renamed SMB to CIFS and added many new functions. Now, CIFS
includes SMB1, SMB2, and SMB3.0.
CIFS has high requirements on network transmission reliability, so it usually uses TCP/IP.
CIFS is mainly used for the Internet and by Windows hosts to access files or other
resources over the Internet. CIFS allows Windows clients to identify and access shared
resources. With CIFS, clients can quickly read, write, and create files in storage systems as
on local PCs. CIFS helps maintain a high access speed and a fast system response even
when many users simultaneously access the same shared file.
The CIFS protocol applies to file sharing. Two typical application scenarios are as follows:
File sharing service
− CIFS is commonly used in file sharing service scenarios such as enterprise file
sharing.
Hyper-V VM application scenario
− SMB can be used to share mirrors of Hyper-V virtual machines promoted by
Microsoft. In this scenario, the failover feature of SMB 3.0 is required to ensure
service continuity upon a node failure and to ensure the reliability of VMs.
File protocol comparison
Transmission
Type Application scenario Work Mode
Protocol
Transmission
Type Application scenario Work Mode
Protocol
C/S architecture,
with client
FTP No restrictions on operating systems TCP software integrated
into operating
systems
b: NIS is a domain environment in Linux and can centrally manage the directory service
of system databases.
1.4.3 SAN
1.4.3.1 IP SAN Technologies
NIC + Initiator software: Host devices such as servers and workstations use standard NICs
to connect to Ethernet switches. iSCSI storage devices are also connected to the Ethernet
switches or to the NICs of the hosts. The initiator software installed on hosts virtualizes
NICs into iSCSI cards. The iSCSI cards are used to receive and transmit iSCSI data packets,
implementing iSCSI and TCP/IP transmission between the hosts and iSCSI devices. This
mode uses standard Ethernet NICs and switches, eliminating the need for adding other
adapters. Therefore, this mode is the most cost-effective. However, the mode occupies
host resources when converting iSCSI packets into TCP/IP packets, increasing host
operation overheads and degrading system performance. The NIC + initiator software
mode is applicable to scenarios with relatively low requirements on I/O and bandwidth
performance for data access.
TOE NIC + initiator software: The TOE NIC processes the functions of the TCP/IP protocol
layer, and the host processes the functions of the iSCSI protocol layer. Therefore, the TOE
NIC significantly improves the data transmission rate. Compared with the pure software
mode, this mode reduces host operation overheads and requires minimal network
construction expenditure. This is a trade-off solution.
iSCSI HBA:
An iSCSI HBA is installed on the host to implement efficient data exchange between
the host and the switch and between the host and the storage device. Functions of
the iSCSI protocol layer and TCP/IP protocol stack are handled by the host HBA,
occupying the least CPU resources. This mode delivers the best data transmission
performance but requires high expenditure.
The iSCSI communication system inherits some of SCSI's features. The iSCSI
communication involves an initiator that sends I/O requests and a target that
responds to the I/O requests and executes I/O operations. After a connection is set
up between the initiator and target, the target controls the entire process as the
primary device. An iSCSI target is usually an iSCSI disk array or iSCSI tape library.
The iSCSI protocol defines a set of naming and addressing methods for iSCSI
initiators and targets. All iSCSI nodes are identified by their iSCSI names. This method
distinguishes iSCSI names from host names.
iSCSI uses iSCSI Qualified Name (IQN) to identify initiators and targets. Addresses
change with the relocation of initiator or target devices, but their names remain
unchanged. When setting up a connection, an initiator sends a request. After the
target receives the request, it checks whether the iSCSI name contained in the
request is consistent with that bound with the target. If the iSCSI names are
consistent, the connection is set up. Each iSCSI node has a unique iSCSI name. One
iSCSI name can be used in the connections from one initiator to multiple targets.
Multiple iSCSI names can be used in the connections from one target to multiple
initiators.
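An iSCSI qualified name follows the fixed format iqn.yyyy-mm.reversed-domain-name:unique-string. The helper below simply composes such names; the domain and device strings are made-up examples, not real device names.

def make_iqn(year_month, reversed_domain, unique_name):
    """Compose an iSCSI Qualified Name (IQN) in the standard format."""
    return f"iqn.{year_month}.{reversed_domain}:{unique_name}"

initiator = make_iqn("1991-05", "com.microsoft", "host01")        # example initiator
target    = make_iqn("2006-08", "com.example",   "storage.lun0")  # hypothetical target
print(initiator)   # iqn.1991-05.com.microsoft:host01
print(target)      # iqn.2006-08.com.example:storage.lun0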
Logical ports are created based on bond ports, VLAN ports, or Ethernet ports. Logical
ports are virtual ports that carry host services. A unique IP address is allocated to each
logical port for carrying its services.
Bond port: To improve reliability of paths for accessing file systems and increase
bandwidth, you can bond multiple Ethernet ports on the same interface module to
form a bond port.
VLAN: VLANs logically divide the physical Ethernet ports or bond ports of a storage
system into multiple broadcast domains. On a VLAN, when service data is being sent
or received, a VLAN ID is configured for the data so that the networks and services
of VLANs are isolated, further ensuring service data security and reliability.
Ethernet port: Physical Ethernet ports on an interface module of a storage system.
Bond ports, VLANs, and logical ports are created based on Ethernet ports.
IP address failover: A logical IP address fails over from a faulty port to an available port.
In this way, services are switched from the faulty port to the available port without
interruption. The faulty port takes over services back after it recovers. This task can be
completed automatically or manually. IP address failover applies to IP SAN and NAS.
During the IP address failover, services are switched from the faulty port to an available
port, ensuring service continuity and improving the reliability of paths for accessing file
systems. Users are not aware of this process.
The essence of IP address failover is a service switchover between ports. The ports can be
Ethernet ports, bond ports, or VLAN ports.
Ethernet port–based IP address failover: To improve the reliability of paths for
accessing file systems, you can create logical ports based on Ethernet ports.
Host services are running on logical port A of Ethernet port A. The corresponding IP
address is "a". Ethernet port A fails and thereby cannot provide services. After IP
address failover is enabled, the storage system will automatically locate available
Ethernet port B, delete the configuration of logical port A that corresponds to
Ethernet port A, and create and configure logical port A on Ethernet port B. In this
way, host services are quickly switched to logical port A on Ethernet port B. The
service switchover is executed quickly. Users are not aware of this process.
Bond port-based IP address failover: To improve the reliability of paths for accessing
file systems, you can bond multiple Ethernet ports to form a bond port. When an
Ethernet port that is used to create the bond port fails, services are still running on
the bond port. The IP address fails over only when all Ethernet ports that are used to
create the bond port fail.
Multiple Ethernet ports are bonded to form bond port A. Logical port A created
based on bond port A can provide high-speed data transmission. When both Ethernet
ports A and B fail due to various causes, the storage system will automatically locate
bond port B, delete logical port A, and create the same logical port A on bond port B.
In this way, services are switched from bond port A to bond port B. After Ethernet
ports A and B recover, services will be switched back to bond port A if failback is
enabled. The service switchover is executed quickly, and users are not aware of this
process.
VLAN-based IP address failover: You can create VLANs to isolate different services.
− To implement VLAN-based IP address failover, you must create VLANs, allocate
a unique ID to each VLAN, and use the VLANs to isolate different services. When
an Ethernet port on a VLAN fails, the storage system will automatically locate an
available Ethernet port with the same VLAN ID and switch services to the
available Ethernet port. After the faulty port recovers, it takes over the services.
− VLAN names, such as VLAN A and VLAN B, are automatically generated when
VLANs are created. The actual VLAN names depend on the storage system
version.
− Ethernet ports and their corresponding switch ports are divided into multiple
VLANs, and different IDs are allocated to the VLANs. The VLANs are used to
isolate different services. VLAN A is created on Ethernet port A, and the VLAN
ID is 1. Logical port A that is created based on VLAN A can be used to isolate
services. When Ethernet port A fails due to various causes, the storage system
will automatically locate VLAN B and the port whose VLAN ID is 1, delete logical
port A, and create the same logical port A based on VLAN B. In this way, the
port where services are running is switched to VLAN B. After Ethernet port A
recovers, the port where services are running will be switched back to VLAN A if
failback is enabled.
− An Ethernet port can belong to multiple VLANs. When the Ethernet port fails, all
VLANs will fail. Services must be switched to ports of other available VLANs. The
service switchover is executed quickly, and users are not aware of this process.
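Regardless of the port type, the failover logic described above is the same: find an available port of the same kind (and, for VLAN ports, with the same VLAN ID), recreate the logical port there, and fail back when the original port recovers if failback is enabled. The sketch below models that decision under these assumptions; it is not the storage system's actual failover code.

def fail_over(logical_port, ports):
    """Move a logical port to the first healthy port of matching type/VLAN."""
    current = ports[logical_port["home"]]
    if current["healthy"]:
        return logical_port          # home port still works; nothing to do
    for name, port in ports.items():
        # Candidate must be healthy and, for VLAN ports, carry the same VLAN ID.
        if port["healthy"] and port.get("vlan") == current.get("vlan"):
            logical_port["home"] = name   # recreate the logical port on the new port
            break
    return logical_port

ports = {
    "ethA": {"healthy": False, "vlan": 1},
    "ethB": {"healthy": True,  "vlan": 1},
}
lp = {"name": "logical_port_A", "ip": "192.168.10.10", "home": "ethA"}
print(fail_over(lp, ports))   # services now run on ethB with the same IP address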
LUN: a namespace resource described by a SCSI target. A target may include multiple
LUNs, and attributes of the LUNs may be different. For example, LUN#0 may be a disk,
and LUN#1 may be another device.
The initiator and target of SCSI constitute a typical C/S model. Each instruction is
implemented through the request/response mode. The initiator sends SCSI requests. The
target responds to the SCSI requests, provides services through LUNs, and provides a task
management function.
SCSI Initiator Model
SCSI initiator logical layers in different operating systems:
On Windows, a SCSI initiator includes three logical layers: storage/tape driver, SCSI port,
and mini port. The SCSI port implements the basic framework processing procedures for
SCSI, such as device discovery and namespace scanning.
On Linux, a SCSI initiator includes three logical layers: SCSI device driver, scsi_mod middle
layer, and SCSI adapter driver (HBA). The scsi_mod middle layer processes SCSI device-
irrelevant and adapter-irrelevant processes, such as exceptions and namespace
maintenance. The HBA driver provides link implementation details, such as SCSI
instruction packaging and unpacking. The device driver implements specific SCSI device
drivers, such as the famous SCSI disk driver, SCSI tape driver, and SCSI CD-ROM device
driver.
The structure of Solaris comprises the SCSI device driver, SSA middle layer, and SCSI
adapter driver, which is similar to the structure of Linux/Windows.
The AIX architecture is structured in three layers: SCSI device driver, SCSI middle layer,
and SCSI adaptation driver.
SCSI Target Model
Based on the SCSI architecture, a target includes three layers: port layer, middle layer,
and device layer.
A PORT model in a target packages or unpackages SCSI instructions on links. For
example, a PORT can package instructions into FCP, iSCSI, or SAS, or unpackage
instructions from those formats.
A device model in a target serves as a SCSI instruction analyser. It tells the initiator what
device the current LUN is by processing INQUIRY, and processes I/Os through
READ/WRITE.
The middle layer of a target maintains models such as LUN space, task set, and task
(command). There are two ways to maintain LUN space. One is to maintain a global LUN
for all PORTs, and the other is to maintain a LUN space for each PORT.
SCSI Protocol and Storage Systems
The SCSI protocol is the basic protocol used for communication between hosts and
storage devices.
The controller sends a signal to the bus processor requesting to use the bus. After the
request is accepted, the controller's high-speed cache sends data. During this process, the
bus is occupied by the controller and other devices connected to the same bus cannot use
it. However, the bus processor can interrupt the data transfer at any time and allow other
devices to use the bus for operations of a higher priority.
A SCSI controller is like a small CPU with its own command set and cache. The special
SCSI bus architecture can dynamically allocate resources to tasks run by multiple devices
in a computer. In this way, multiple tasks can be processed at the same time.
SCSI Protocol Addressing
A traditional SCSI controller is connected to a single bus. Therefore, only one bus ID is
allocated. An enterprise-level server may be configured with multiple SCSI controllers, so
there may be multiple SCSI buses. In a storage network, each FC HBA or iSCSI network
adapter is connected to a bus. A bus ID must therefore be allocated to each bus to
distinguish between them.
To address devices connected to a SCSI bus, SCSI device IDs and LUNs are used. Each
device on the SCSI bus must have a unique device ID. The HBA on the server also has its
own device ID: 7. Each bus, including the bus adapter, supports a maximum of 8 or 16
device IDs. The device ID is used to address devices and identify the priority of the devices
on the bus.
Each storage device may include sub-devices, such as virtual disks and tape drives. As a
result, LUN IDs are used to address sub-devices in a storage device.
A ternary description (bus ID, target device ID, and LUN ID) is used to identify a SCSI
target.
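A SCSI address can therefore be written as a simple three-part tuple; the numbers below are made-up example values.

# (bus ID, target device ID, LUN ID) uniquely identifies a SCSI logical unit.
scsi_address = (0, 5, 1)   # bus 0, target device 5, LUN 1 (example values)
bus_id, target_id, lun_id = scsi_address
print(f"Bus {bus_id}, target {target_id}, LUN {lun_id}")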
SAN network. FCoE runs on the Ethernet. Therefore, the Ethernet link layer replaces
the preceding two layers.
Different environments: The FC protocol runs on the traditional FC SAN storage
network, while FCoE runs on the Ethernet.
Different channels: The FC protocol runs on the FC network, and all packets are
transmitted through FC channels. There are various protocol packets, such as IP and
ARP packets, on the Ethernet. To transmit FCoE packets, a virtual FC needs to be
created on the Ethernet.
Compared with the FC protocol, the FIP initialization protocol is used for FCoE to
obtain the VLAN, establish a virtual channel with an FCF, and maintain virtual links.
FCoE requires the support of other protocols. The Ethernet tolerates packet loss, but the
FC protocol does not. As the FC protocol for transmission on the Ethernet, FCoE inherits
this feature that packet loss is not allowed. To ensure that FCoE runs properly on an
Ethernet network, the Ethernet needs to be enhanced to prevent packet loss. The
enhanced Ethernet is called Converged Enhanced Ethernet (CEE).
1.4.5 Quiz
6. (Multiple-answer question) Which of the following networks are included in a
distributed storage network topology?
A. Management network
B. Front-end service network
C. Front-end storage network
D. Back-end storage network
7. (Multiple-answer question) Which of the following protocols are commonly used for
SAN?
A. Fibre Channel
B. iSCSI
C. CIFS
D. NFS
8. (Multiple-answer question) Which of the following statements are true about SAN
and NAS?
Loops are not allowed in a SAS domain; this ensures that end devices can be identified
correctly.
In practice, to ensure high bandwidth, the number of end devices connected to an
expander is kept far below 128.
SAS cables and connections:
Most storage vendors now use SAS cables for connections between disk enclosures
and between disk and controller enclosures. A SAS cable aggregates four
independent physical links (narrow ports) into a wide port to provide higher
bandwidth. Each of the four independent links can transmit at 12 Gbit/s. As a result,
a wide port can transmit at 48 Gbit/s. To keep the data flow over a SAS cable under
its maximum bandwidth, the total number of disks connected to a SAS loop must be
limited (see the sizing sketch at the end of this list).
For Huawei storage devices, a SAS loop comprises up to 168 disks, or seven 24-slot
disk enclosures, provided that all disks in the loop are traditional SAS disks. However,
SSDs are becoming more common, as they provide much higher transfer speeds
than traditional SAS disks. As a best practice, a loop comprises up to 96 SSDs, or four
24-slot disk enclosures.
SAS cables include mini-SAS cables (6 Gbit/s per link) and mini-SAS High Density
cables (12 Gbit/s per link).
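As a minimal back-of-the-envelope sketch of why the disk count per loop must be limited (the per-disk throughput figure is an assumption chosen only for illustration), the arithmetic looks like this:

# Rough sizing sketch for a SAS loop; the per-disk throughput is hypothetical.
LINK_RATE_GBPS = 12           # each narrow physical link in the cable
LINKS_PER_WIDE_PORT = 4       # a SAS cable bundles four links into one wide port

wide_port_gbps = LINK_RATE_GBPS * LINKS_PER_WIDE_PORT      # 48 Gbit/s

assumed_disk_gbps = 0.25      # assumed sustained throughput per disk (example only)
disks_before_saturation = int(wide_port_gbps / assumed_disk_gbps)

print(f"wide port: {wide_port_gbps} Gbit/s, "
      f"disks before saturation under the assumed workload: {disks_before_saturation}")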
1.5.1.2 SATA
SATA, also called Serial ATA, is a type of computer bus used to transfer data between a
mainboard and storage devices, such as disks and CD-ROM drives. SATA uses a
brand-new bus architecture rather than being a mere improvement over PATA, also
called Parallel ATA.
At the physical layer, the SAS interface is compatible with the SATA interface. SATA is
effectively a subordinate interface standard of SAS, and a SAS controller can directly
control SATA disks, so SATA disks can be used directly in a SAS environment. The
reverse is not true: SAS disks cannot be used in a SATA environment, because a SATA
controller cannot control SAS disks.
At the protocol layer, SAS supports three types of protocols that are used to transfer data
between specific types of interconnected devices.
Serial SCSI Protocol (SSP): for transmitting SCSI commands
Serial Management Protocol (SMP): for managing and maintaining interconnected
devices
Serial ATA Tunneled Protocol (STP): for data transfer between SAS and SATA
With these three protocols, SAS works seamlessly with SATA and some SCSI devices.
The parallel bus provides poor scalability and limits the number of connected
devices.
Connections between multiple devices considerably undermine the effective bus
bandwidth and slow down data transfer.
Advances in processor technology are driving the replacement of the parallel bus with
the high-speed differential bus in the interconnection field. In contrast to single-ended
parallel signals, high-speed differential signals can run at much higher clock rates. This
led to the birth of the PCIe bus.
PCIe, also called PCI Express, is a high-performance and high-bandwidth serial
communication and interconnection standard. It was first proposed by Intel and then
developed by the Peripheral Component Interconnect Special Interest Group (PCI-SIG) to
replace the bus-based communication architecture.
PCIe improves over the traditional PCI bus in the following ways:
Dual simplex channel, high bandwidth, and high transfer speeds:
− A transmission mode (separated RX and TX) similar to the full duplex mode is
implemented.
− Higher transfer speeds are provided per lane: 2.5 Gbit/s in PCIe 1.0, 5 Gbit/s in
PCIe 2.0, 8 Gbit/s in PCIe 3.0, 16 Gbit/s in PCIe 4.0, and up to 32 Gbit/s in PCIe
5.0 (see the bandwidth sketch after this list).
− Bandwidth increases linearly with the number of lanes in a link.
Compatibility:
− PCIe is compatible with PCI at the application layer. Upgraded PCIe versions are
also compatible with existing PCI software.
Ease of use:
− PCIe provides hot-swap functionality. A PCIe slot carries a hot-swap detection
signal, so devices can be hot swapped in much the same way as USB devices.
Error processing and reporting:
− A PCIe bus uses a layered architecture where the application layer processes and
reports errors.
Multiple virtual channels in each physical link:
− Each physical link can carry multiple virtual channels (up to eight, each
controlled independently). PCIe can therefore apply QoS to each virtual channel,
providing fine-grained traffic control.
Lower I/O pin count, smaller physical footprint, and reduced crosstalk:
− A typical PCI bus data cable has at least 50 I/O pins, while a PCIe x1 link has
only four I/O pins. Fewer I/O pins mean a smaller physical footprint, greater
clearance between I/O pins, and reduced crosstalk.
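The sketch below shows how the per-lane rates listed earlier scale with link width (raw signaling rates only; encoding overhead, 8b/10b for PCIe 1.0/2.0 and 128b/130b from PCIe 3.0 onward, is deliberately ignored to keep the arithmetic simple):

# Raw per-lane rates scaled by link width (illustrative; encoding overhead ignored).
PER_LANE_GBPS = {"PCIe 1.0": 2.5, "PCIe 2.0": 5, "PCIe 3.0": 8,
                 "PCIe 4.0": 16, "PCIe 5.0": 32}

def link_bandwidth(generation, lanes):
    """Raw one-direction bandwidth of an xN link; the dual simplex design
    provides the same amount again in the opposite direction."""
    return PER_LANE_GBPS[generation] * lanes

for lanes in (1, 4, 16):
    print(f"PCIe 3.0 x{lanes}: {link_bandwidth('PCIe 3.0', lanes)} Gbit/s per direction")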
Why PCIe? The PCIe standard is still evolving to provide higher throughput for future
systems by leveraging the latest technologies. PCIe also facilitates the transition from
PCI by remaining backward compatible with existing PCI software through layered
protocols and drivers. PCIe features point-to-point connections, high reliability, a tree
topology, full duplex operation, and frame-based transmission.
The PCIe architecture comprises the physical layer, data link layer, transaction layer, and
application layer.
The physical layer determines the physical characteristics of the bus. In future
development, the performance of a PCIe bus can be further improved by increasing
the speed or changing the encoding or decoding scheme. Such changes only affect
the physical layer, facilitating upgrades.
The data link layer plays a vital role in ensuring the correctness and reliability of data
packets transmitted over a PCIe bus. It checks whether a data packet is completely and
correctly encapsulated, adds a sequence number (SN) and CRC to the data, and then
uses the ACK/NACK handshake protocol for error detection and correction.
The transaction layer receives read and write requests from the application layer, or
creates a request packet itself, encapsulates it, and transmits it to the data link layer.
This type of data packet is called a transaction layer packet (TLP). The transaction
layer also receives data link layer packets (DLLPs) from the link layer, associates
them with the related software requests, and transmits them to the application layer
for processing.
The application layer is designed by users based on actual needs. Other layers must
comply with the protocol requirements.
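To make this division of labour concrete, here is a deliberately simplified sketch of a request passing from the transaction layer through the data link layer (field names and sizes are not spec-accurate; the real TLP header and LCRC are defined by the PCIe specification):

# Simplified sketch of PCIe layering; framing is not spec-accurate.
import zlib
from dataclasses import dataclass

@dataclass
class Tlp:                   # built by the transaction layer from a software request
    kind: str                # e.g. a memory read or memory write request
    address: int
    payload: bytes = b""

def data_link_wrap(tlp, seq):
    """Data link layer: prepend a sequence number and append a CRC so the
    receiver can ACK/NACK and trigger retransmission on error."""
    body = f"{tlp.kind}@{tlp.address:#x}".encode() + tlp.payload
    framed = seq.to_bytes(2, "big") + body
    return framed + zlib.crc32(framed).to_bytes(4, "big")

packet = data_link_wrap(Tlp("MemWr", 0x1000, b"\x01\x02\x03\x04"), seq=1)
print(len(packet), "bytes handed to the physical layer for encoding")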
1.5.2.2 NVMe
NVMe, also called Non-Volatile Memory Express, is designed for PCIe SSDs. A direct
connection between the native PCIe lanes and the CPU eliminates the latency
introduced when the SATA and SAS interfaces communicate with the CPU through an
external controller (the PCH).
NVMe serves as a logical protocol interface, a command standard, and a protocol
throughout the entire storage process. By fully utilizing the low latency and parallelism of
PCIe lanes as well as the parallelism of modern processors, platforms, and applications,
NVMe aims to markedly improve the read and write performance of SSDs, reduce the
high latency caused by the Advanced Host Controller Interface (AHCI) at a controllable
cost, and ultimately unleash performance that SSDs cannot reach in a SATA environment.
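One concrete way NVMe exploits this parallelism is its multi-queue model: many independent submission/completion queue pairs, typically one pair per CPU core, instead of the single command queue that AHCI funnels everything through. The sketch below is purely illustrative; the queue count and depth are example values, not protocol limits:

# Illustrative sketch of NVMe's multi-queue model (numbers are examples only).
from collections import deque

class QueuePair:
    """One submission/completion queue pair, typically bound to one CPU core."""
    def __init__(self, depth=1024):
        self.submission = deque(maxlen=depth)
        self.completion = deque(maxlen=depth)

    def submit(self, command):
        self.submission.append(command)

    def drain(self):                      # stands in for the SSD processing commands
        while self.submission:
            self.completion.append(("done", self.submission.popleft()))

queues = [QueuePair() for _ in range(8)]  # e.g. one queue pair per core
queues[3].submit(("read", 0x2000, 8))
queues[3].drain()
print(queues[3].completion.popleft())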
NVMe protocol stack:
I/O transmission path
− A SAS-based all-flash storage array: I/Os are transmitted from a front-end server
to a CPU through a front-end interface protocol (Fibre Channel or IP), then from
the CPU to a SAS chip through a PCIe link and a PCIe switch, and further to a
SAS expander and a SAS SSD.
− A Huawei NVMe-based all-flash storage system that supports end-to-end NVMe:
Data I/Os are transmitted from a front-end server to a CPU through a front-end
interface protocol (FC-NVMe or NVMe over RDMA). Then, data is transmitted
directly to an NVMe SSD through 100G RDMA. CPUs of the NVMe-based all-
flash storage system communicate directly with NVMe SSDs via a shorter
transmission path, reducing latency and improving transmission efficiency.
Software protocol parsing
Currently, there are three types of RDMA networks: IB, RoCE, and iWARP. IB is designed
for RDMA from the ground up and ensures reliable transmission in hardware. RoCE and
iWARP run over Ethernet. All three are accessed through verbs APIs.
IB is a next-generation network protocol that has supported RDMA since its
emergence. NICs and switches that support this technology are required.
RoCE is a network protocol that allows RDMA over an Ethernet network. In a RoCE
packet, the lower-layer headers are Ethernet headers and the upper-layer headers are IB
headers (which carry the data). This allows RDMA on standard Ethernet infrastructure
(switches); the NICs, however, are special and must support RoCE. RoCE v1 is an RDMA
protocol implemented over the Ethernet link layer. Switches must support flow control
technologies, such as priority-based flow control (PFC), to ensure reliable transmission
at the physical layer. RoCE v2 is implemented over UDP in the Ethernet TCP/IP protocol
stack.
iWARP is layered on the Transmission Control Protocol (TCP) to implement RDMA.
Some functions supported by IB and RoCE are not supported by iWARP. iWARP also
allows RDMA on standard Ethernet infrastructure (switches). NICs must support iWARP
if CPU offload is to be used; otherwise, the whole iWARP stack can be implemented in
software, and most of the RDMA performance advantages are lost.
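The transport differences among the three options can be condensed into a small lookup, useful when reasoning about which NICs and switches a deployment needs (a rough summary of the text above, not a configuration reference):

# Rough summary of the three RDMA network options described above.
RDMA_OPTIONS = {
    "IB":      {"carrier": "InfiniBand fabric",
                "needs": "IB NICs and IB switches"},
    "RoCE v1": {"carrier": "Ethernet link layer",
                "needs": "RoCE NICs; switches with PFC for lossless transport"},
    "RoCE v2": {"carrier": "UDP/IP over Ethernet",
                "needs": "RoCE NICs; switches with PFC for lossless transport"},
    "iWARP":   {"carrier": "TCP over Ethernet",
                "needs": "iWARP NICs for CPU offload"},
}

for name, props in RDMA_OPTIONS.items():
    print(f"{name:8s} -> {props['carrier']:22s} ({props['needs']})")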
1.5.3.2 IB
The IB technology is designed for interconnections and communications between servers
(for example, replication and distributed work), between a server and a storage device
(for example, SAN and DAS), and between a server and a network (for example, LAN,
WAN, and the Internet).
IB defines a set of devices used for system communications, including channel adapters
(CAs), switches, and routers. CAs are used to connect to other devices and include host
channel adapters (HCAs) and target channel adapters (TCAs).
Characteristics of IB:
Standard-based protocol: IB is an open standard designed by the InfiniBand Trade
Association (IBTA), which was established in 1999 and has 225 member companies.
Principal IBTA members include Agilent, Dell, HP, IBM, InfiniSwitch, Intel, Mellanox,
Network Appliance, and Sun Microsystems. More than 100 other members also
support the development and promotion of IB.
Speed: IB provides high link speeds; the physical layer defines 1X, 4X, and 12X links,
as described below.
Memory: Servers that support IB use HCAs to convert the IB protocol to the internal
PCI-X or PCI Express bus. An IB HCA supports RDMA, a capability also known as
kernel bypass. RDMA also applies to clusters. It uses a virtual addressing scheme that
allows a server to identify and use memory resources provided by other servers without
involving any OS kernel.
Transport offload: RDMA facilitates transport offload, which moves data packet routing
from the operating system to a chip, reducing the workload on the processor. Without
such offload, processing data at a transmission speed of 10 Gbit/s in the operating
system would require the equivalent of an 80 GHz processor.
An IB system comprises CAs, a switch, a router, a repeater, and connected links. CAs
include HCAs and TCAs.
An HCA connects a host processor to the IB architecture.
A TCA connects an I/O adapter to the IB architecture.
IB in storage: The IB front-end network is used to exchange data with clients and
transmit data through the IPoIB protocol. The IB back-end network is used for data
interactions between nodes in a storage system. The RPC module uses RDMA to
synchronize data between nodes.
The IB architecture comprises the application layer, transport layer, network layer, link
layer, and physical layer. The following describes the functions of each layer:
The transport layer is responsible for in-order packet delivery, partitioning, channel
multiplexing, and transport services. The transport layer also sends, receives, and
reassembles data packet segments.
The network layer handles routing of packets from one subnet to another. Each packet
sent between source and destination nodes contains a Global Route Header (GRH)
carrying 128-bit IPv6-format addresses. A standard 64-bit global identifier, unique
across all subnets, is also embedded at the network layer. By exchanging such identifier
values, data can be transmitted across multiple subnets.
The link layer encompasses packet layout, point-to-point link operations, and
switching within a local subnet. There are two types of packets at the link layer:
data transmission packets and network management packets. Network management
packets provide operational control for device enumeration, subnet routing, and fault
tolerance. Data transmission packets carry the actual data, and the maximum size of
each packet is 4 KB. Within a subnet, packet forwarding and switching are handled
using 16-bit local identifier addresses assigned by the local subnet manager.
The physical layer defines three link widths: 1X, 4X, and 12X, with signaling rates of
2.5 Gbit/s, 10 Gbit/s, and 30 Gbit/s, respectively. IBA therefore allows connections of
up to 30 Gbit/s. In the full duplex serial communication mode, a 1X bidirectional
connection requires only 4 wires and a 12X connection requires only 48 wires.
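A small sketch of the physical-layer arithmetic (signaling rates as listed above; the usable data rate is lower once line encoding is taken into account, which this sketch does not model):

# IB link widths versus aggregate signaling rate and wire count (illustrative).
PER_LANE_SIGNALING_GBPS = 2.5      # one physical lane (1X)
WIRES_PER_LANE = 4                 # a 1X bidirectional connection uses 4 wires

for width in (1, 4, 12):
    print(f"{width:2d}X link: {width * PER_LANE_SIGNALING_GBPS:4.1f} Gbit/s signaling, "
          f"{width * WIRES_PER_LANE} wires for a bidirectional connection")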
1.5.4 Quiz
1. (Multiple-answer question) Which of the following are Fibre Channel network
topologies?
A. Fibre Channel arbitrated loop (FC-AL)
B. Fibre Channel point-to-point (FC-P2P)
C. Fibre Channel switch (FC-SW)
D. Fibre Channel dual-switch
2. (Multiple-answer question) Which PCIe versions are available?
A. PCIe 1.0
B. PCIe 2.0
C. PCIe 3.0
D. PCIe 4.0
3. (Multiple-answer question) Which of the following are file sharing protocols?
A. HTTP
B. iSCSI
C. NFS
D. CIFS
4. (Multiple-answer question) Which of the following operations are involved in the
CIFS protocol?
A. Protocol handshake
B. Security authentication
C. Shared connection
D. File operation
E. Disconnection