
SMRstore: A Storage Engine for Cloud Object Storage

on HM-SMR Drives
Su Zhou, Erci Xu, Hao Wu, Yu Du, Jiacheng Cui, Wanyu Fu, Chang Liu, Yingni
Wang, Wenbo Wang, Shouqu Sun, Xianfei Wang, Bo Feng, Biyun Zhu, Xin Tong,
Weikang Kong, Linyan Liu, Zhongjie Wu, Jinbo Wu, Qingchao Luo, and
Jiesheng Wu, Alibaba Group
https://www.usenix.org/conference/fast23/presentation/zhou

This paper is included in the Proceedings of the 21st USENIX Conference on File and Storage Technologies, February 21–23, 2023, Santa Clara, CA, USA. ISBN 978-1-939133-32-8.
SMR STORE: A Storage Engine for Cloud Object Storage on HM-SMR Drives
Su Zhou, Erci Xu*, Hao Wu, Yu Du, Jiacheng Cui, Wanyu Fu, Chang Liu, Yingni Wang, Wenbo Wang,
Shouqu Sun, Xianfei Wang, Bo Feng, Biyun Zhu, Xin Tong, Weikang Kong, Linyan Liu, Zhongjie Wu,
Jinbo Wu, Qingchao Luo, Jiesheng Wu

Alibaba Group

Abstract

Cloud object storage vendors are always in pursuit of better cost efficiency. Emerging Shingled Magnetic Recording (SMR) drives are becoming economically favorable in archival storage systems due to significantly improved areal density. However, for standard-class object storage, previous studies and our preliminary exploration revealed that the existing SMR drive solutions can experience severe throughput variations due to garbage collection (GC).

In this paper, we introduce SMR STORE, an SMR-based storage engine for standard-class object storage without compromising performance or durability. The key features of SMR STORE include directly implementing chunk store interfaces over SMR drives, using a complete log-structured design, and applying guided data placement to reduce GC for consistent performance. The evaluation shows that SMR STORE delivers comparable performance to Ext4 on Conventional Magnetic Recording (CMR) drives, and can be up to 2.16x faster than F2FS on SMR drives. By switching to SMR drives, we have decreased the total cost by up to 15% and provided performance on par with the prior system for customers. Currently, we have deployed SMR STORE in standard-class Alibaba Cloud Object Storage Service (OSS) to store hundreds of PBs of data. We plan to use SMR drives for all classes of OSS in the near future.

1 Introduction

Object storage is a "killer app" in the cloud era. Users can use the service to persist and retrieve objects with high scalability, elasticity and reliability. Typical usage scenarios of object storage include Binary Large OBjects (BLOBs) storage [10, 23], data lakes [2] and cloud archives [1]. Object storage systems usually employ a large fleet of HDDs. Therefore, a key challenge of building a competitive cloud object storage is cost efficiency.

Emerging Shingled Magnetic Recording (SMR) drives [5] are economically attractive [29], but they may not serve as a simple drop-in replacement for traditional CMR drives [12]. SMR drives, via overlapping tracks, have a higher areal density [14] (i.e., 25% more than CMR drives) and hence better cost efficiency. However, shingling tracks has a byproduct: random writes are not allowed [8, 15]. This characteristic in return may require the upper-level software stack, such as storage engines, to make the corresponding adaptations.

A possible direction is to use Host Managed SMR (HM-SMR) drives, where the host OS manages the I/Os and communicates with HM-SMR drives via the Zoned Block Device (ZBD) subsystem [6, 7]. There are mainly three types of approaches to designing HM-SMR-based storage systems. First, the Linux kernel can expose HM-SMR drives as standard block devices by employing a shingled translation layer (STL), such as dm-zoned [21]. Second, file systems with a log-structured [28] or copy-on-write design can directly support HM-SMR drives (e.g., F2FS [17] and Btrfs [27]). Further, developers can modify their applications to accommodate HM-SMR drives (e.g., GearDB [32], SMORE [19], and SMRDB [26]) or directly employ them in archival-class object storage systems such as Alibaba Archive Storage Service [1] and Huawei Object Store [18].

Unfortunately, these existing HM-SMR solutions cannot be applied to standard-class Alibaba Cloud Object Storage Service (OSS). First, exposing the HM-SMR drive as a block device (i.e., via dm-zoned [21]) can suffer a significant throughput drop due to frequent buffer zone reclaiming after random updates (e.g., a 56.1% drop under a sustained write workload [22]). Second, employing log-structured file systems to manage HM-SMR drives can lead to throughput variations due to GC in the file systems. For example, our evaluation shows that the throughput of F2FS on an HM-SMR drive can drop by 61.5% due to frequent F2FS GCs triggered by random deletions. Third, though archival-class and standard-class OSS share the same data abstraction (i.e., the object), they have drastically different Service Level Objectives (SLOs). Therefore, a design that works well in the archival class may not deliver satisfying performance in the standard class.

Our benchmarks show that existing SMR translation layers or file systems can incur severe overhead, most likely due to garbage collection. Moreover, a log-structured design offers direct support for SMR drives and can achieve high throughput when not affected by GC. Besides, GC, which is inevitable due to the append-only nature of SMR zones, can be alleviated via workload-aware data placement.

In this paper, we describe SMR STORE, a high-performance HM-SMR storage engine co-designed with Alibaba Cloud standard-class OSS.

* Corresponding author. [email protected]

There are three key features in SMR STORE. First, SMR STORE is a user-space storage engine that does not require local file system support and directly implements the chunk interfaces of the PANGU distributed file system.

Second, SMR STORE strictly follows a log-structured design to organize the HM-SMR on-disk layout. In SMR STORE, the basic building block is a variable-length customized log format called the Record. We use records to persist data and to form various metadata structures (e.g., checkpoints and journals).

Third, we design a series of workload-aware zone allocation strategies to reduce the interleaving of different types of OSS data and metadata in zones. These efforts help us effectively lower the overhead of GC on HM-SMR drives.

We extensively evaluate SMR STORE under various scenarios. The results show that a PANGU chunkserver with SMR STORE achieves more than 110MB/s throughput in highly concurrent write workloads, 30% higher than the previous-generation design (i.e., chunkserver with Ext4 on CMR drives). Moreover, on a storage server (60 HDDs and 2 cache SSDs), the chunkserver with SMR STORE provides a steady 4GB/s write throughput in macro benchmarks, 2.16x higher than F2FS. Third, in the OSS deployment, the performance of the HM-SMR cluster is comparable to the CMR cluster in all aspects.

The rest of the paper is organized as follows. We describe standard-class OSS in Alibaba and the HM-SMR drives in §2. Then, we analyze the pros and cons of existing solutions (§3). Further, we demonstrate the design choices (§4), the detailed implementation of SMR STORE (§5) and the evaluation (§6). We conclude with discussions on the limitations of SMR STORE (§7), the related work (§8) and a short conclusion (§9).

2 Background

2.1 Alibaba Cloud OSS

Alibaba Cloud OSS offers four classes of services, including standard, infrequent access, archive, and cold archive (prices in descending order and retrieval times in ascending order). Standard-class OSS, usually for hosting hot data, offers the fastest SLOs at the highest cost.

[Figure 1: Architecture overview of OSS (§2.1). The red shaded HDD engines refer to traditional Ext4-based storage engines. The green shaded SSD engines refer to user-space storage engines.]

[Figure 2: Semantics of PANGU file and chunk. This figure shows a PANGU file consisting of four chunks. Only chunk 4 (the last chunk) is not sealed (writable). PANGU keeps multiple replicas of each chunk across chunkservers to protect data against failures.]

Architecture. Figure 1 illustrates the three layers in the Alibaba Cloud standard-class OSS stack: an OSS frontend layer, an OSS service layer, and a persistence layer. The frontend layer pre-processes users' HTTP requests and dispatches them to the service layer. The service layer, consisting of multiple KV servers, has two functionalities. First, the service layer writes the objects to PANGU files. Second, the service layer maintains the objects' metadata (the mapping from objects' names to locations within the corresponding PANGU files) using an LSM-tree based KV store [25], and writes these metadata to additional PANGU files. The persistence layer is our distributed file system PANGU.

PANGU overview. PANGU is an HDFS-like distributed file system; each PANGU cluster comprises a set of masters (handling PANGU files' metadata, not objects' metadata) and up to thousands of chunkservers (storing the data of PANGU files). Each chunkserver exclusively owns a physical storage server consisting of 60 HDDs for persistence and two high-performance SSDs for caching.¹ We leverage the Linux kernel storage stack (the Ext4 file system and libaio with O_DIRECT) for the HDD storage engines and build a user-space storage file system for the SSD engines.

PANGU data abstractions. Figure 2 illustrates the two levels of abstraction in PANGU, file and chunk. Each file is append-only and can be further split into multiple chunks. Each chunk has a Chunk ID (a 24-byte UUID) and is replicated via copies or erasure coding. PANGU can create, write (append), read, delete, and seal a chunk. Similar to the "extent seal" in Windows Azure Storage [13], PANGU seals a chunk when: i) the size of the chunk (including data and the corresponding checksum) reaches the limit; ii) the application closes the PANGU file when writing is finished; iii) in the face of failures (e.g., network timeout). Due to cases ii) and iii), chunks can be of variable sizes. Note that only the last chunk of a PANGU file can be appended (not sealed), and only sealed chunks can be flushed from the cache SSDs to HDDs for persistence.

¹PANGU also supports other services (e.g., block storage and big data) and can have various modifications. In this paper, our discussion on PANGU and the corresponding software/hardware setups only applies to the OSS scenario.

The storage engine does not provide any redundancy against failures. Instead, we rely on PANGU to provide fault tolerance for each chunk across chunkservers via replication or erasure coding.

[Figure 3: Dataflow with the traditional storage engine (§2.1). This figure illustrates the write path of an object in the traditional CS-Ext4 stack (red shaded). The yellow shaded area is related to the object "Foo". The data is split into a series of 4048B segments, each attached with a 48B footer for the checksum.]

Table 1: Characteristics of streams on a chunkserver (§2.1).
Type         Latency (ms)   Concurrency   iosize (Byte)   Lifespan (Day)
OSS Data W   <1             ~1500         512K-1M         <7
OSS Data R   <20            -             512K-1M         -
OSS Meta W   <1             ~2000         4K-128K         <60
OSS Meta R   <20            -             4K-128K         -
OSS GC       <20            ~100          512K-1M         <90

I/O Path. Figure 3 presents the high-level write flow in OSS with the traditional Ext4-based storage engine. We use an example object called "Foo" to highlight the write flow. The KV server chooses PANGU file A for storing "Foo". Then the KV server uses the PANGU SDK or contacts the PANGU master to locate the tail chunk (i.e., chunk 1) and its respective chunkservers. Further, the chunkserver appends the object's data (with checksums) to the corresponding Ext4 file. To verify data integrity, we break the object's data into a series of 4048-byte segments. We attach to each segment a 48-byte footer which includes the checksum and the segment's location in the chunk. Note that each chunk in the chunkserver is an Ext4 file with the chunk ID as its filename.
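To make the segment framing concrete, the following is a minimal sketch (not the production code) of how an Ext4-based engine could split a chunk append into 4048-byte data segments, each followed by a 48-byte footer. The exact footer field layout and the use of CRC32 are assumptions for illustration; only the 4048B/48B split and the "checksum plus location in chunk" contents come from the text above.

```python
import struct
import zlib

SEGMENT_DATA = 4048   # bytes of payload per segment
FOOTER_SIZE = 48      # bytes of footer; data + footer = one 4096B sector

def frame_segments(chunk_id: bytes, chunk_offset: int, data: bytes) -> bytes:
    """Split `data` into 4048B segments and append a 48B footer to each.

    Footer layout here is illustrative: 24B chunk ID + 8B offset-in-chunk +
    4B valid-data length + 4B CRC32 + 8B reserved padding = 48B.
    """
    assert len(chunk_id) == 24
    out = bytearray()
    for pos in range(0, len(data), SEGMENT_DATA):
        seg = data[pos:pos + SEGMENT_DATA].ljust(SEGMENT_DATA, b"\x00")
        footer = struct.pack("<24sQII8x", chunk_id, chunk_offset + pos,
                             min(SEGMENT_DATA, len(data) - pos), zlib.crc32(seg))
        assert len(footer) == FOOTER_SIZE
        out += seg + footer
    return bytes(out)

# Each (data + footer) pair occupies exactly one 4096B sector, so a segment's
# integrity can be verified by reading a single aligned sector.
```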
Workloads. A KV server can open multiple PANGU files to store or retrieve the data and metadata of objects, and perform GC on deleted objects in PANGU files. From the perspective of chunkservers, we define an active PANGU file as a stream. The stream starts when the PANGU file is opened by a KV server for read or write, and ends when the KV server closes the PANGU file. Based on the operations (read or write) and types (metadata, data or GC), we can categorize the workloads issued by KV servers into five types of streams. Table 1 lists the characteristics of each type of stream. "Concurrency" refers to the PANGU file concurrency, namely the number of PANGU files opened on a chunkserver to append data. "Lifespan" refers to the expected lifespan of the data on disk (from being persisted to being deleted), NOT the duration of the streams.
• OSS Data Write Stream. Persisting object data requires low latency to achieve quick responses. Therefore, object data are first written to the SSD caches and later moved to HDDs. Normally, hundreds of PANGU chunks are open for writing on a chunkserver, thereby yielding high concurrency.
• OSS Data Read Stream. OSS directly reads object data from HDDs to achieve high throughput. Due to space limits, object data for reads are not cached in the SSDs.
• OSS Metadata Write Stream. The OSS metadata stream includes objects' metadata formatted as Write-Ahead Logs and Sorted String Table files from the KV store. The metadata are first flushed to the SSD cache and then migrated to HDDs. The metadata account for around 2% of the total capacity used (around 24TB per chunkserver).
• OSS Metadata Read Stream. The KV server maintains a cache for the object index. In most cases, a metadata read directly hits the KV index cache and returns. On a cache miss, OSS routes the read to SSTFiles in PANGU.
• OSS GC Stream. Garbage collection in the OSS service layer (referred to as OSS GC) reclaims garbage space in PANGU files. A PANGU file can hold multiple objects. When a certain amount of objects have been deleted in a PANGU file, OSS re-allocates the remaining objects to another PANGU file and deletes the old file. The chunks written by OSS GC streams account for more than 80% of the total capacity used in one chunkserver. OSS GC streams run in the background and are directly routed to HDDs for persistence.

2.2 Host-Managed HM-SMR

HM-SMR drives overlap the tracks to achieve higher areal density but consequently sacrifice random write support. Specifically, HM-SMR drives organize the address space as multiple fixed-size zones, including sequential zones (referred to as zones) and a few (around 1%) conventional zones (referred to as czones). For example, the Seagate SMR drive we use in this paper has a capacity of 20TB and 74508 zones (including 800 czones). The size of each zone is 256MB, and it takes around 20ms, 24ms and 22ms to open, close and erase a zone, respectively. Note that, for certain SMR HDD models (e.g., Western Digital DC HC650), there is a limit on the number of zones that can be opened concurrently.
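As a sanity check on the drive geometry quoted above, the zone count and zone size multiply out to the advertised capacity; the decimal-TB convention in the comparison is our assumption.

```python
ZONES = 74508                # total zones reported by the drive, incl. ~800 czones
ZONE_SIZE = 256 * 2**20      # 256MB per zone

capacity_bytes = ZONES * ZONE_SIZE
print(capacity_bytes / 10**12)   # ~20.0 decimal TB, matching the 20TB drive
print(ZONES * 0.01)              # ~745, i.e., the 800 czones are roughly 1%
```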
Device mapper translation. A straightforward solution is to insert a shim layer, called a shingled translation layer (STL), such as dm-zoned [21], to provide a dynamic mapping from logical block addresses to physical sectors and hence achieve random-to-sequential translation. The major advantage of this approach is allowing users (e.g., the chunkserver process) to adopt HM-SMR drives as a cost-efficient drop-in replacement for CMR drives.
SMR-aware file systems. Log-structured file systems (e.g., F2FS) are an ideal match for the append-only zone design of HM-SMR disks. For example, F2FS has supported zoned block devices since kernel 4.10. Users can mount F2FS on an HM-SMR drive and utilize the F2FS GC mechanism to support random writes. Similarly, Btrfs, a file system based on the copy-on-write principle, currently provides experimental support for zoned block devices since kernel 5.12.

End-to-End Co-design. Instead of relying on dm-zoned or general file systems, applications that perform mostly sequential writes can be modified to adopt HM-SMR. The benefits of end-to-end integration have been proved by several recent works, such as GearDB [32], ZenFS [11], SMORE [19], etc. Applications can eliminate the block/fs-level overhead and achieve predictable performance by managing on-disk data placement and garbage collection at the application level [24, 30].

3 Evaluating Existing Solutions

We evaluate running F2FS atop HM-SMR drives with a microbenchmark and a macrobenchmark (i.e., simulated OSS workloads). We compare the performance of chunkservers with Ext4 on CMR drives (referred to as CS-Ext4) and F2FS on HM-SMR drives (referred to as CS-F2FS).

[Figure 4: High Concurrency Write Throughput (§3.2). The figure shows the write throughput of CS-Ext4 and CS-F2FS in microbenchmarks.]

[Figure 5: F2FS Access Pattern (§3.2). The figure shows the zone IDs accessed by F2FS over a few seconds. F2FS writes all data into one zone during a period of time and switches to the next zone only when the current one is full.]

3.1 Evaluation Configurations

Table 2: Configurations of the storage servers in the evaluation. An SMR server has the exact same setup as a CMR server, except that the HDDs are 20TB SMR HDDs. The raw performance comparison with queue-depth-1 random reads and queue-depth-32 sequential reads/writes is listed in the last row.

OS    Linux 4.19.91 (both servers)
CPU   2*Intel(R) Xeon(R) Platinum 8331C, 48 physical cores / 96 threads (both servers)
SSD   2*INTEL SSDPF21Q800GB (both servers)
Mem   512G (both servers)
HDD   CMR: 60*ST16000NM001G-2KK103 (Rand. 4KB: 113 IOPS; Seq. 512KB: 254.8 MB/s write, 254.5 MB/s read)
      SMR: 60*ST20000NM001J-2U6101 (Rand. 4KB: 121 IOPS; Seq. 512KB: 255.7 MB/s write, 255.6 MB/s read)

Environment Setup. Table 2 lists the configurations of the storage servers in the evaluation. Note that F2FS does not support devices with a capacity larger than 16TB. Therefore, we format the disk with a 6TB capacity. Moreover, in all cases we disable the disk write cache with the hdparm [4] tool to prevent data loss upon crashes, a mandatory setting in OSS.

Workloads. For both micro- and macro-benchmarks, we use Fio [3] (modified to use the PANGU SDK) as the workload generator. For the microbenchmark, we start a chunkserver with one disk and focus on testing write throughput with different I/O sizes in the clean state (no F2FS GC). We start a Fio with 4 numjobs, 4 iodepth, and 128 nrfiles to simulate high write pressure.

For the macrobenchmark, we evaluate the chunkserver with all disks loaded (60 HDDs and 2 cache SSDs) and run four Fio processes to simulate different types of write streams. Table 3 lists the detailed configurations. Note that we use two Fio processes to simulate two kinds of OSS GC streams (i.e., OSS GC Wr 1 and 2). For OSS GC Wr 2, we use a smaller chunk size and rate (64MB and 20MB/s) to simulate situations where the chunks are sealed before reaching the size limit (due to reaching the end of a PANGU file or encountering I/O failures).

The macrobenchmark generates a stable 4GB/s throughput to simulate a typical high-pressure workload. There are two phases in this test. In the first phase, we simply let the four streams fill the HDDs and there are no file deletions. In the second phase, the capacity utilization reaches around 80% (around 12 hours after the first phase started) and triggers random deletions to maintain the utilization rate at around 80%. The average chunk deletion rate on a chunkserver ranges from 4 operations per second (ops/s) to 15 ops/s.

Stream Type #Fio Target numjobs iodepth iosize nrfiles chunk size rate
OSS GC Wr 1 1 HDDs 8 32 1MB 25 256MB 400MB/s
OSS GC Wr 2 1 HDDs 8 32 1MB 25 64MB 20MB/s
OSS Data Wr 1 SSDs 3 32 1MB 300 256MB 200MB/s
OSS Meta Wr 1 SSDs 1 8 4KB-128KB 500 4MB 80MB/s
Table 3: Macro benchmark setups (§3.1). OSS GC Wr 1 refers to OSS GC streams with large chunks. OSS GC Wr 2 refers to OSS GC
streams with small chunks. OSS Data Wr refers to OSS object data write streams. OSS Meta Wr refers to OSS metadata write streams.

[Figure 6: CS-Ext4 vs CS-F2FS in the macrobenchmark (§3.2). The test starts on empty disks with a steady 4GB/s throughput. At hour 12, the capacity utilization reaches 80% and random deletions occur.]

[Figure 7: F2FS GC related metrics (§3.2). This figure illustrates the status of F2FS. The dirty segment count on the left axis reflects the generation of garbage space. The increasing accumulated GC count (right Y axis) indicates the continuing GC activities, which are the immediate cause of the performance drop.]

3.2 Performance Comparison

Microbenchmark Performance. Figure 4 shows that CS-F2FS on HM-SMR drives achieves 1.3x-12.9x higher throughput compared to CS-Ext4 on CMR drives. This is because F2FS writes data from different streams to one zone at a time and thus always performs sequential writes. Figure 5 shows the distribution of SMR zone IDs accessed by CS-F2FS. We can see that F2FS fills up one SMR zone at a time (e.g., Zone 233 from second 0). This allocation strategy avoids the overhead of jumping between zones.

Macrobenchmark performance. Figure 6 shows the throughput of CS-F2FS and CS-Ext4 over time. Initially, we can observe that both maintain a stable throughput from hour 0 to 12. Then, after 12 hours, CS-F2FS quickly drops and remains at a low throughput for the rest of the time. This is because the random deletions start and the GC in F2FS kicks in to handle the increasing amount of obsolete data (see Figure 7). Note that F2FS puts chunks from different types of streams into one zone. Due to random deletions, severe F2FS GC can be frequently triggered and influence the OSS metadata/data streams, resulting in a performance drop.

We are aware that F2FS provides multi-head logging to separate streams on disk, but this technique cannot separate chunks from the same type of stream. In practice, the PANGU file concurrency in each type of OSS stream can range from tens to hundreds, and F2FS would write all the chunks from the same type of stream into one zone. Therefore, random deletions of those chunks (a common scenario in standard-class OSS) still trigger severe F2FS GCs.

4 SMR STORE Design Choices

No local file system. We build SMR STORE to support chunk semantics (including chunk_create, chunk_append, chunk_read, chunk_seal, and chunk_delete) on the SMR zoned namespace. There are three functionalities in SMR STORE that support this feature. First, SMR STORE directly manages the disk address space for persisting metadata (i.e., checkpoints and journals) and data (i.e., the chunks). Second, SMR STORE manages a mapping table between chunks and SMR zones to translate a logical range in a chunk (specified by chunkId, offset, and length) to its physical location on disk (i.e., zoneId, offset, and length). Third, SMR STORE orchestrates the lifecycle of zones and the data placement strategies in the zones.
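As a rough illustration of the chunk semantics that SMR STORE exposes, the following sketch spells out the five chunk interfaces named above. The method names follow the paper; the argument and return types are assumptions for illustration, not the engine's actual API.

```python
from abc import ABC, abstractmethod

class ChunkStore(ABC):
    """Chunk interfaces implemented directly over the SMR zoned namespace."""

    @abstractmethod
    def chunk_create(self, chunk_id: bytes) -> None:
        """Register a new (unsealed) chunk identified by a 24-byte UUID."""

    @abstractmethod
    def chunk_append(self, chunk_id: bytes, data: bytes) -> int:
        """Append data to the tail of an unsealed chunk; return the new size."""

    @abstractmethod
    def chunk_read(self, chunk_id: bytes, offset: int, length: int) -> bytes:
        """Read a logical range of the chunk, translated to (zone, offset, length)."""

    @abstractmethod
    def chunk_seal(self, chunk_id: bytes) -> None:
        """Make the chunk immutable (size limit reached, file closed, or failure)."""

    @abstractmethod
    def chunk_delete(self, chunk_id: bytes) -> None:
        """Drop the chunk; its records become garbage to be reclaimed by SMR GC."""
```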
Everything is log. SMR STORE stores both metadata and data as logs in the SMR sequential zones. Specifically, SMR STORE uses a basic structural unit, called the record, to form the different types of metadata and data. To avoid wasting space, a record is of variable length and enforces 4KB alignment with the disk's physical sectors.

Guided data placement. Since SMR zones are append-only (except for a few czones), and chunks from different PANGU files can be interleaved in SMR zones, deleting PANGU chunks can leave zones with obsolete data. This requires SMR STORE to migrate valid data from old zones to new ones, termed SMR GC in this paper. SMR STORE reduces SMR GC by: i) only allowing chunks to be mixed in a zone if they come from the same type of stream (i.e., have similar lifespans); ii) trying to allocate an exclusive zone for each large chunk if possible. Note that SMR GC is different from OSS GC. In OSS GC, after objects are deleted by users, the corresponding PANGU files can be partially filled with obsolete objects. KV servers would then create new PANGU files to store the valid objects collected from the old PANGU files (i.e., generating OSS GC streams).

[Figure 8: Overview of CS-Ext4 and CS-SMR STORE (§5.1). SMR STORE is integrated in the chunkserver, runs in the user space and communicates with HM-SMR drives directly through the ZBD interface.]

[Figure 9: Dataflow in the SMR STORE engine (§5.1). Compared to Figure 3, the storage engine is SMR STORE (green shaded) and the disk is an HM-SMR drive. FT: slice footer.]

[Figure 10: On-disk Data Layout of SMR STORE (§5.2). SMR STORE divides a disk into three partitions. Both metadata and data are implemented based on a unified data structure called the record. Records are of variable length and have different types. The payload of a data record is divided into several slices to support partial reads.]

5 SMR STORE Design & Implementation

5.1 Architecture Overview

Figure 8 shows a side-by-side comparison between running the chunkserver with Ext4 on CMR disks (CS-Ext4) and with SMR STORE on SMR disks (CS-SMR STORE). The main difference is the addition of SMR STORE to the chunkserver, sitting in the user space and communicating with the SMR disks via the Zoned Block Device (ZBD) subsystem. Next, we discuss the key functionalities of SMR STORE:
• On-disk data layout. SMR STORE divides an HM-SMR drive into three fixed-size areas, namely the superzone, the metazones, and the datazones. SMR STORE uses the "record" as the basic unit for metazones and datazones.
• Data index. SMR STORE employs three levels of in-memory data structures, including chunk metadata, index groups, and record indexes, to map a chunk to a series of records on the disk.
• Zone Management. SMR STORE uses a state machine to manage the lifecycle of zones, and keeps the metadata (e.g., status) of each zone in memory. Further, SMR STORE adopts three workload-aware zone allocation strategies to achieve low SMR GC overhead.
• Garbage Collection. SMR STORE periodically performs SMR GC to reclaim areas with stale data at the granularity of zones. There are three steps in the SMR GC procedure: victim zone selection, data migration, and metadata update.
• Recovery. Upon crashes, SMR STORE restores itself through four steps: recovering the metazone table, loading the latest checkpoint, replaying journals, and completing the chunk metadata table by scanning the opened datazones.

CS-SMR STORE I/O Path. When replacing the storage engine with an SMR STORE engine, the KV server and PANGU master follow the same procedures shown in Figure 3. As illustrated in Figure 9, SMR STORE no longer relies on local file system support and uses an in-memory chunk metadata table for mapping. SMR STORE first locates the table entry and its index group linked list by using the chunk ID as the index. SMR STORE further identifies the targeted index group or creates a new one. Then, SMR STORE appends the data to the datazone (indicated by the targeted index group) as record(s), and updates the record index(es) in the corresponding index group.

5.2 On-Disk Data Layout

Overview. Figure 10 shows the three partitions of an HM-SMR drive under SMR STORE, including one superzone, multiple metazones, and multiple datazones. All partitions are fixed-size and statically allocated. In other words, we place the superzone on the first SMR zone, the metazones occupy the next 400 SMR zones, and the rest of the SMR zones are assigned as datazones. We do not allow metazones and datazones to be interleaved along the disk address space, to facilitate the metazone scanning during recovery.

[Figure 11: Data Index of SMR STORE (§5.3). In the chunk metadata table, each chunk has a pointer to an index group list. Each index group can have multiple record indexes in a zone. The index groups and record indexes are all sorted by their offsets in the chunk.]

Superzone. The superzone stores the information for initialization, including the format version, the format timestamp, and other system configurations.

Metazone. Inside a metazone, there are three types of metadata: the zonehead, the checkpoint, and the journal. Note that the metazones only store the metadata of SMR STORE, not the metadata of OSS (i.e., data from the OSS metadata stream). The metadata are composed of different types of records. The zonehead record stores zone-related information, such as the zone type and the timestamp of zone allocation (used for recovery). The checkpoint is a full snapshot of the in-memory data structures, while the journals contain key operations on chunks and zones, which we further introduce in §5.6.

Inside each record, there are three fields: the header, the payload, and the padding. The header specifies the type of the record (i.e., zonehead, checkpoint, or journal record), the length of the record, and the CRC checksum of the payload. The payload contains the serialized metadata. An optional padding is appended at the end of the record as the SMR drive is 4KB-aligned.

Datazone. The datazones occupy the rest of the disk. In each zone, there are two types of records, the zonehead record and the data record. The zonehead record is similar to the metazone zonehead record except for the zone type.

Data record & slice. The payload of a data record hosts the user's data (i.e., a portion of the chunk). The padding at the tail of a data record is used to bring it to a multiple of 4096 bytes (i.e., 4KB-aligned). However, the payload field of a data record is different from that of the other record types (see the bottom right of Figure 10). To avoid read amplification, the payload is further divided into 4096-byte slices, with a 32-byte slice footer appended to each slice. The slice footer contains the chunk ID (24 bytes), the logical offset in the chunk (4 bytes) and the checksum of the slice data (4 bytes). Without payload slicing, reading 4KB from a 512KB record would require SMR STORE to fetch the whole record to verify the payload with the record's checksum. With slices, reading 4KB only needs to read at most two slices, and SMR STORE can use the footer in each slice for checksum verification.
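The arithmetic behind the slice layout can be sketched as follows; the field order inside the 32-byte footer and the CRC32 choice are assumptions for illustration (the text only fixes the 24B chunk ID, 4B logical offset and 4B checksum).

```python
import struct
import zlib

SLICE_DATA = 4096    # bytes of user data per slice
SLICE_FOOTER = 32    # 24B chunk ID + 4B logical offset + 4B checksum

def pack_slice(chunk_id: bytes, logical_offset: int, data_4k: bytes) -> bytes:
    """Append the 32B footer to one 4096B slice of user data."""
    assert len(chunk_id) == 24 and len(data_4k) == SLICE_DATA
    footer = struct.pack("<24sII", chunk_id, logical_offset, zlib.crc32(data_4k))
    return data_4k + footer

def slices_for_read(payload_offset: int, length: int) -> range:
    """Slice indices (within a record's payload) touched by a partial read.

    A 4KB read crosses at most one slice boundary, so at most two slices are
    fetched and verified instead of the whole (e.g., 512KB) record.
    """
    first = payload_offset // SLICE_DATA
    last = (payload_offset + length - 1) // SLICE_DATA
    return range(first, last + 1)
```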
5.3 Data Index

SMR STORE uses an in-memory data structure, called the record index, to manage the metadata of each record. The record index includes the Chunk ID, the logical location of the user's data in the chunk (i.e., the chunk offset and the size of the user's data) and the record's physical location in the datazone (i.e., the offset in the datazone and the size of the record).

A chunk usually has multiple records that are distributed among several datazones. Note that SMR STORE appends the data of a chunk to only one datazone at a time until that datazone is full. This guarantees two properties: i) the records in each datazone together must cover a consecutive range of the chunk; ii) the chunk ranges covered in different datazones do not overlap with each other.

Therefore, we group the record indexes of a chunk in each datazone as an index group. Based on i), inside each index group, we can sort the record indexes by their chunk offsets. The index group also includes the corresponding datazone ID. Moreover, due to property ii), we can further sort the index groups of a chunk into a list, based on the chunk offset of the first record index in each group.

Then, SMR STORE organizes the metadata of chunks as a table (see the left of Figure 11). Each entry of the table, indexed by the Chunk ID (a 24-byte UUID), contains the chunk size (the total length of the chunk on this disk), the chunk status (sealed or not, not illustrated in the figure), and the corresponding sorted list of index groups.

When receiving a read request (specified by the chunk ID, the chunk offset, and the data length), SMR STORE can locate the chunk metadata with the Chunk ID, find the target index group in the sorted list with the chunk offset, and locate the corresponding record index(es) with the chunk offset and data length.

For a write request, SMR STORE always locates the last index group of the target chunk. If there is enough space left in the corresponding datazone, SMR STORE appends the data to that datazone as a new record and adds the new record index to the index group. If not, SMR STORE allocates a new datazone, appends the data, and adds the record index to a new index group.
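A minimal sketch of this three-level lookup is shown below. The data structures mirror the chunk table, index group, and record index described above, while the field names and the binary-search details are our own simplifications (e.g., a read spanning multiple index groups is not handled).

```python
import bisect
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class RecordIndex:
    chunk_offset: int    # logical offset of the record's data within the chunk
    length: int          # length of the user data in this record
    zone_offset: int     # physical offset of the record inside the datazone
    record_size: int     # on-disk size (header + payload + padding)

@dataclass
class IndexGroup:
    zone_id: int
    records: List[RecordIndex] = field(default_factory=list)  # sorted by chunk_offset

@dataclass
class ChunkMeta:
    size: int = 0
    sealed: bool = False
    groups: List[IndexGroup] = field(default_factory=list)    # sorted by first offset

def lookup(chunk_table: Dict[bytes, ChunkMeta], chunk_id: bytes,
           offset: int, length: int) -> List[RecordIndex]:
    """Translate (chunk_id, offset, length) into the record indexes to read."""
    meta = chunk_table[chunk_id]
    # Find the index group whose covered range contains `offset`.
    starts = [g.records[0].chunk_offset for g in meta.groups]
    group = meta.groups[max(0, bisect.bisect_right(starts, offset) - 1)]
    # Collect the record indexes overlapping [offset, offset + length).
    return [r for r in group.records
            if r.chunk_offset + r.length > offset and r.chunk_offset < offset + length]
```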
5.4 Zone Management

[Figure 12: Zone State Transitions of SMR STORE (§5.4). States: FREE, OPENED (opened in the background), ACTIVATED (allocated for writes), CLOSED (filled), GARBAGE (all data have become stale), back to FREE after a reset. SMR STORE maintains a pool of opened zones for fast allocation. When a zone is assigned to a new chunk, it transitions to the ACTIVATED status. If a zone is closed, it will not be reopened for writes before a reset.]

Zone state machine. SMR STORE employs a state machine to manage the status of datazones, as shown in Figure 12. SMR STORE maintains a pool of opened zones (55 zones by default) for fast allocation. SMR STORE only resets GARBAGE zones to FREE zones when the number of FREE zones is not enough. Metazones follow a similar state machine except that there is no pool of opened metazones.
Zone table. SMR STORE maintains a zone table in memory. Each entry of the zone table includes the zone ID, the zone status (OPENED, CLOSED, etc.), a list of live index groups, and a write pointer. We further introduce the usage of the per-zone live index group list when discussing SMR GC (§5.5) and recovery (§5.6).

[Figure 13: The effectiveness of the SMR STORE zone allocation strategies (§5.4). Chunks A-D come from four different OSS streams and are shaded with the corresponding colors. Subfigure (a) represents a possible scenario under random allocation (no optimization) and (b) illustrates a possible layout with the SMR STORE strategies enabled. Each data block may be composed of one or more records.]

Zone allocation. Earlier, in the F2FS discussion (see §3), we showed that allocating chunks from different types of OSS streams to the same datazone can result in high overhead led by frequent F2FS GC. One could allocate a single datazone for each chunk to reduce such GC. However, this can in return waste considerable space. For example, chunks from the OSS metadata stream are usually just several megabytes large, much smaller than the size of a datazone (256MB).

Hence, a more practical solution is to only pursue "one chunk per zone" for large chunks and let small chunks with similar lifespans be mixed together. A challenge here is that the size of a chunk is only determined after it is sealed. In other words, when allocating datazones for incoming OSS streams, SMR STORE does not know the sizes of the chunks. Therefore, we design the following zone allocation strategies.
• (1) Separating streams by types. Different types of OSS streams can have disparate characteristics (see Table 1). Therefore, we modify the OSS KV store to embed the type of the OSS stream (i.e., OSS Metadata, OSS Data or OSS GC) along with the data. SMR STORE only allows chunks from the same type of stream to share a datazone.
• (2) Adapting the chunk size limit to the datazone. Recall that a chunk is sealed when it reaches the size limit, the end of the PANGU file, or an I/O failure. Hence, we configure the size limit of a chunk (including its checksum) to match the size of one datazone (256MB). A chunk may still be sealed well under 256MB (e.g., due to I/O errors). In that case, the leftover space would be shared with other chunks from the same type of stream if necessary. Note that we still use the default size limit (64MB) for chunks from the OSS metadata stream as the corresponding PANGU files are usually small (several to tens of MBs each).
• (3) Zone pool & round-robin allocation. SMR STORE pre-opens and reserves zones for the different types of OSS streams. Specifically, we prepare 40, 10 and 5 opened zones for the OSS GC, Data and Metadata streams, respectively. The rationale is that the OSS GC stream is the main contributor of the I/O traffic. The OSS Metadata and Data streams have high PANGU file concurrency but can be throttled by the cache SSDs. Moreover, SMR STORE allocates zones for new chunks in a round-robin fashion to reduce the chance of chunks being mixed together.

In Figure 13, we use an example to showcase the effectiveness of our strategies. Consider four OSS streams: two OSS GC streams (green and red), one OSS Data stream (yellow) and one OSS Metadata stream (blue). If we do not enable any strategies, SMR STORE would allocate datazones one by one for the incoming chunks. As a result, we can expect datazones to be interleaved as shown in Figure 13(a), similar to the F2FS scenario in §3. In this case, for example, if chunk A is deleted, all three datazones would contain chunk A's stale data and require further SMR GC to reclaim the space.

Now, in Figure 13(b), due to Strategy (1), chunks from different types of streams are no longer mixed together. Moreover, since we reconfigure the size limit of chunks (Strategy (2)) and use round-robin allocation (Strategy (3)), we can see that chunks A and B can each own a zone exclusively and fill the entire zone. The three strategies achieve our goal by allocating large chunks to exclusive zones. Now, if chunk A is deleted, SMR STORE can directly reset zone 1 (i.e., no SMR GC is needed).
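A condensed sketch of the three allocation strategies is given below. The pool sizes (40/10/5) and the 256MB zone size come from the text; the class name, the `open_zone` callback and the round-robin bookkeeping are illustrative assumptions.

```python
from collections import deque

ZONE_SIZE = 256 << 20                                          # 256MB datazone
POOL_SIZES = {"OSS_GC": 40, "OSS_DATA": 10, "OSS_META": 5}     # Strategy (3)

class ZoneAllocator:
    """Per-stream-type zone pools with round-robin allocation (Strategies 1-3)."""

    def __init__(self, open_zone):
        # One pool of pre-opened zones per stream type: Strategy (1) ensures
        # chunks from different stream types never share a datazone.
        self.pools = {t: deque(open_zone() for _ in range(n))
                      for t, n in POOL_SIZES.items()}
        self.open_zone = open_zone

    def zone_for_new_chunk(self, stream_type: str):
        """Round-robin over the stream's pool so each new chunk tends to start
        in its own zone; with the chunk size limit set to the zone size
        (Strategy (2)), a full 256MB chunk fills that zone exclusively."""
        pool = self.pools[stream_type]
        zone = pool.popleft()
        pool.append(zone)          # rotate: the next chunk gets the next zone
        return zone

    def on_zone_filled(self, stream_type: str, zone):
        """Replace a filled (CLOSED) zone with a freshly opened one."""
        pool = self.pools[stream_type]
        pool.remove(zone)
        pool.append(self.open_zone())
```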
5.5 Garbage Collection

SMR STORE performs garbage collection in three steps.

Victim zone selection. SMR STORE first chooses a victim zone among the CLOSED ones to perform SMR GC. We use a greedy algorithm to select the zone with the most garbage.

Data migration. For the selected victim zone, by scanning the live index group list from the zone table, SMR STORE can identify the valid data in this zone and migrate it to an available zone which is activated only for garbage collection. Moreover, SMR STORE enables a throttle module that dynamically limits the throughput of SMR GC to alleviate interference with the foreground I/O.

Metadata update. During the migration, SMR STORE creates index groups with new record indexes for the migrated data. After the SMR GC finishes, SMR STORE replaces the old index groups in the linked list with the new ones. Finally, SMR STORE updates the zone table by marking the victim zone as GARBAGE.
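The three GC steps can be sketched as follows. The helper callbacks (reading records, appending to the GC zone, throttling) and the attributes on zone and chunk entries are placeholders standing in for engine internals, not the real interfaces.

```python
def smr_gc_once(zone_table, chunk_table, read_record, append_to_gc_zone, throttle):
    """One round of SMR GC: greedy victim selection, data migration, metadata update.

    `read_record`, `append_to_gc_zone` and `throttle` are hypothetical callbacks;
    zone entries are assumed to expose .state, .garbage_bytes and .live_groups.
    """
    # 1. Victim zone selection: the CLOSED zone with the most garbage.
    closed = [z for z in zone_table.values() if z.state == "CLOSED"]
    if not closed:
        return None
    victim = max(closed, key=lambda z: z.garbage_bytes)

    # 2. Data migration: copy live records to a zone activated only for GC,
    #    throttled to limit interference with foreground I/O.
    new_groups = {}
    for group in victim.live_groups:              # per-chunk live index group
        for rec in group.records:
            throttle()                            # rate-limit background GC traffic
            data = read_record(victim, rec)
            new_groups.setdefault(group.chunk_id, []).append(
                append_to_gc_zone(group.chunk_id, rec.chunk_offset, data))

    # 3. Metadata update: swap old index groups for new ones, mark victim GARBAGE.
    for chunk_id, new_records in new_groups.items():
        chunk_table[chunk_id].replace_group(victim.zone_id, new_records)
    victim.state = "GARBAGE"
    return victim.zone_id
```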
5.6 Recovery

SMR STORE relies on journals and checkpoints to restore the in-memory data structures. In this section, we first introduce the detailed design of the journal and the checkpoint. Then, we discuss the four steps of recovery.

Checkpoint design. The checkpoint of SMR STORE is a full snapshot of the in-memory data structures, including the chunk metadata table (§5.3) and the zone table (§5.4). SMR STORE periodically creates a checkpoint and persists it into the metazones as a series of records. The zone table is usually small and can be stored in one record. The chunk metadata table is much larger (including all the index groups and record indexes, see Figure 11) and requires multiple records to store. Therefore, we also use two records to mark the start and end of a checkpoint, called the checkpoint start/end records.

Journal design. In SMR STORE, only the create, seal and delete operations on chunks, and the resetting of zones, need to be recorded by journals. Note that SMR STORE does not journal the write operation (i.e., chunk append) as this can severely impact the latency. Instead, we can restore the latest data locations by scanning the previously opened zones. SMR STORE journals the zone reset operation to handle the case where the same zone may be opened, closed and reused multiple times between two checkpoints. Note that the checkpoint of SMR STORE is non-blocking, hence the journal records and checkpoint records can be interleaved in the metazones.

Recovery process. The four steps of recovery are as follows:
• Identifying the latest valid checkpoint. The first step is to scan the zonehead record of each metazone. Recall that, when opened, each metazone is assigned a timestamp that is stored in its zonehead record. Now, by sorting the timestamps, we can scan the metazones from the latest to the earliest to locate the most recent checkpoint end record and further obtain the corresponding checkpoint start record.
• Loading the latest checkpoint. By scanning the records between the checkpoint start and end records, SMR STORE can recover the zone table and the chunk metadata table (including index groups and record indexes) from the most recent checkpoint.
• Replaying journals. Next, after the checkpoint start record, SMR STORE replays each journal record up to the checkpoint end record to update the zone table and the chunk metadata table.
• Scanning datazones. Recall that the journals do not log the write (i.e., chunk_append()) operations, in order to reduce the impact on write latency. Therefore, the last step of recovery is to check the datazones that have not been covered by the checkpoint and journals for yet-to-be-recovered writes. SMR STORE checks the validity of datazones (i.e., whether they were allocated for writes before the crash) by reading their zonehead records. For each valid datazone, SMR STORE verifies the data records one by one with the per-record checksums. Finally, SMR STORE updates the in-memory chunk metadata table (including index groups and record indexes).
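The four recovery steps map onto a sketch like the following. The arguments are hypothetical helpers over the on-disk record formats; only the step ordering and what each step rebuilds come from the text above.

```python
def recover(metazones, datazones, load_checkpoint, replay_journals, scan_datazone):
    """Rebuild the in-memory state after a crash (the four steps of §5.6).

    Zones are assumed to expose `.zonehead` with `.timestamp` and `.valid`;
    the three callbacks parse checkpoint, journal, and datazone records.
    """
    # Step 1: identify the latest valid checkpoint. Metazones are scanned from
    # the newest to the oldest using the allocation timestamp in the zonehead.
    ordered = sorted(metazones, key=lambda z: z.zonehead.timestamp, reverse=True)

    # Step 2: load the checkpoint (zone table + chunk metadata table) found
    # between the most recent checkpoint start/end records.
    zone_table, chunk_table = load_checkpoint(ordered)

    # Step 3: replay journal records written after the checkpoint start
    # (chunk create/seal/delete and zone resets; appends are never journaled).
    replay_journals(ordered, zone_table, chunk_table)

    # Step 4: scan the datazones not covered by the checkpoint/journals and
    # verify their data records one by one with the per-record checksums.
    for zone in datazones:
        if zone.zonehead.valid and not zone_table.covered(zone.zone_id):
            scan_datazone(zone, chunk_table)   # rebuilds index groups/record indexes
    return zone_table, chunk_table
```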
6 Evaluation

Software/Hardware setup. We evaluate the end-to-end performance of three types of candidates, including the chunkserver with CMR drives (i.e., CS-Ext4), the chunkserver with F2FS on SMR drives (i.e., CS-F2FS), and SMR STORE as the storage engine for the chunkserver on SMR drives (i.e., CS-SMR STORE). Additionally, we set up two alternative versions of CS-SMR STORE: CS-SMR STORE-20T shows the performance with the full-disk 20TB capacity, and CS-SMR STORE-OneZone imitates the data placement strategy of F2FS (i.e., mixing data from different streams into one zone). Our node configurations are listed in Table 2.

Workloads setup. We use Fio (modified to use the PANGU SDK) to generate workloads. Our experiments evaluate the following aspects of SMR STORE.
• High concurrency micro benchmark. We extend the microbenchmark in §3 to further evaluate the candidates under highly concurrent random read workloads.
• OSS simulation macro benchmark. We also repeat the multi-stream OSS simulation in §3 to evaluate the candidates with multiple write streams, random file deletions and a high disk utilization rate.
• Garbage collection performance. We evaluate the SMR GC overhead in SMR STORE and further examine the effectiveness of the data placement strategies by comparing the corresponding SMR GC overheads under different strategy setups.
• Recovery. To evaluate the recovery performance, we restart the chunkserver on 20TB SMR drives with 60% capacity utilization, then analyze the time consumed in recovery.
• Resource consumption. We compare the resources, such as CPU and memory usage, between CS-Ext4 and CS-SMR STORE (i.e., the two generations of the storage stack for standard-class OSS), under a similar setup.
• Field deployment. Both CS-Ext4 and CS-SMR STORE are currently deployed in standard-class OSS. We summarize, demonstrate, and compare key performance statistics of a CS-Ext4 cluster and a CS-SMR STORE cluster in the field.

6.1 High Concurrency Microbenchmark

In this microbenchmark, we evaluate the candidates on one disk (SMR or CMR) with two types of workloads: High Concurrency Write (HC-W) and High Concurrency Random Read (HC-RR).

[Figure 14: High Concurrency Write Throughput (§6.1). This figure presents the comparison of write throughput between different storage engines. CS-F2FS (green) and CS-SMR STORE-OneZone (black) achieve rather high throughputs as they place all incoming chunks onto the same zones, which can incur high F2FS/SMR GC overhead later.]

[Figure 15: SMR STORE Access Pattern (§6.1). The figure presents the distribution of zones accessed under SMR STORE during a few seconds. Each red dot represents an access to the corresponding zone (zone ID on the Y axis). This shows the effectiveness of the round-robin allocation, rendering a clear contrast to the zone accesses of CS-F2FS (Figure 5).]

[Figure 16: High Concurrency RandRead Throughput (§6.1).]

[Figure 17: Throughput Comparison of the Multi-Stream Benchmark (§6.2).]

Note that in this experiment, the disk is in the clean state and thus would not trigger F2FS or SMR GC. Figure 14 shows the HC-W throughput of each candidate under different I/O sizes (from 4KB to 1MB). We can see that CS-SMR STORE-OneZone and CS-F2FS always have much higher throughput. As discussed in §3.2, flushing data from different streams to enforce the "one zone at a time" policy can significantly benefit the throughput in the clean state (no deletions and no F2FS/SMR GC).

For the remaining three candidates, their performance gradually increases with the I/O size. We notice that, for small I/O sizes (i.e., <32KB), SMR STORE shows low throughput. This is caused by the round-robin zone allocation strategy, which tends to allocate a new zone for each new chunk to avoid mixed placement, thereby generating random writes on the disk (see Figure 15). As the I/O size increases, the throughput of SMR STORE gradually catches up at 128KB, finally reaches 110MB/s and exceeds CS-Ext4 by 30% at a 1MB I/O size. This is acceptable as most writes in standard-class OSS are larger than 128KB (see Table 1).

Figure 16 shows the performance comparison of the candidates with HC-RR. We can see that CS-SMR STORE manages to deliver comparable performance to CS-Ext4. Moreover, in both the HC-W and HC-RR experiments, we can observe that the full-disk version (i.e., CS-SMR STORE-20T) does not suffer severe performance drops.

6.2 Multi-Stream Benchmark

Next, same as the multi-stream experiment in §3.2, we evaluate the candidates under a more realistic setup with multiple data streams, random deletions, and the subsequent F2FS/SMR GC. We reuse the set of parameters from Table 3. In Figure 17, all candidates begin with a stable throughput of around 4GB/s. After reaching 80% capacity, random deletions start, and then the GC kicks in. Recall our discussion in §3.2: CS-F2FS experiences a considerable performance drop due to frequent F2FS GC led by mixed data allocation. CS-Ext4 is hardly affected by the random deletions as Ext4 does not incur GC. Finally, CS-SMR STORE continues to offer high throughput under random deletions. The main reason is that CS-SMR STORE adopts several strategies to reduce the frequency and overhead of SMR GC.

Now, we take a closer look to understand the reason behind CS-SMR STORE's performance. In Figure 18, we plot the CDF of zone utilization under SMR STORE. We can see that most zones are 100% used (i.e., the 100 on the X axis) and only a few zones are occupied by small chunks, thereby indicating that SMR GC is rarely needed.
[Figure 18: Zone utilizations (CDF) of CS-SMR STORE (§6.2).]

[Figure 19: Throughput with different data placement strategies (§6.3). No optimization: no separating, 64MB chunk size, random allocation over 55 opened zones. Strategy 1: separating streams by types. Strategy 2: adapting the chunk size limit to the datazone. Strategy 3: zone pool & round-robin allocation.]

[Figure 20: Zone Space Utilizations (CDF) Comparison (§6.3). The results show that SMR STORE can maintain high space efficiency by enabling the three end-to-end data placement strategies.]

[Figure 21: Recovery Performance (§6.4). The figure shows the breakdown of the recovery time. CS-SMR STORE-INIT refers to the initial version of SMR STORE without the fixed metazone partition. "Recovering zone table" refers to the step "Identifying the latest valid checkpoint" (§5.6). Replaying journals is negligible and not shown.]

6.3 Effectiveness of Placement Strategy

The concentrated distribution of high-utilization zones is the joint effect of the different data placement strategies. Figure 19 shows various combinations of the individual strategies and the corresponding effect on write throughput. Here, we run the same multi-stream experiment with four different combinations.

From Figure 19, when only "separating streams by types" is enabled, the SMR GC overhead is quite obvious and the performance is close to that of no optimization on allocation. Moreover, "adapting chunk size limit" and "zone pool & round-robin allocation" each contribute around half of the speedup, and these phenomena are also reflected by the zone utilization CDFs in Figure 20.

6.4 Recovery Performance

In this experiment, we measure the time consumed in the recovery of a 20TB SMR drive with 60% of its capacity occupied. Figure 21 shows that CS-Ext4 with a 16TB CMR drive costs less than 20 seconds. CS-Ext4 only has two steps in recovery: loading the checkpoint (which takes 3.27 seconds) and data scanning (which takes 16.3 seconds). CS-SMR STORE completes the recovery in 94.4 seconds: it takes around 19 seconds to load the checkpoint, less than 1 second to replay a few journals, and the remaining 75 seconds are for scanning the previously opened zones.

Note that we also include a previous implementation, CS-SMR STORE-INIT, which takes more than 4 minutes to recover. The major reason is that in this version the on-disk layout is dynamic, meaning the metazones and datazones can be interleaved. As a result, SMR STORE needs to scan all zone headers (both metazones and datazones) for recovery. Therefore, we switched to static zone allocation.

6.5 Resource Consumption

Memory. In a single server (60 HDDs and 2 SSD caches), CS-SMR STORE occupies 49.3GB of memory, around two times more than CS-Ext4. The memory growth is mainly contributed by the in-memory data structures of SMR STORE. Specifically, the metadata of each chunk occupies around 200 bytes, and each record index in memory needs 8 bytes. The record indexes can be further compressed, which we do not discuss in this paper due to space limits.
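As a back-of-the-envelope sketch (not the engine's actual accounting), the per-disk index footprint implied by these per-entry sizes can be estimated as follows; the chunk and record counts used in the example are illustrative assumptions, not measured values.

```python
def index_memory_bytes(n_chunks: int, n_records: int) -> int:
    """Approximate in-memory index footprint for one disk, using the per-entry
    sizes given in the text (~200B per chunk's metadata, 8B per record index).
    Caches, buffers and other per-process state are not included."""
    return n_chunks * 200 + n_records * 8

# Example: a 20TB drive filled mostly with 256MB chunks, each assumed to be
# written in ~1MB appends (so ~256 records per chunk).
chunks = (20 * 10**12) // (256 * 2**20)
records = chunks * 256
print(index_memory_bytes(chunks, records) / 2**20, "MiB per disk")
# Roughly 160 MiB per disk for this mix; the full server's 49.3GB figure also
# covers small chunks, caches, and structures not modeled here.
```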
CPU. CS-SMR STORE uses around 19 cores, which is 26.7% more than CS-Ext4. We use 8 cores for the 8 partitions of the two cache SSDs (polling with SPDK). We use another 4 cores for the user-space network threads. SMR STORE uses another 7 cores for processing requests, memory copies, checksum calculation, and the background GC tasks of the 60 SMR drives. With the increased areal density and comparable performance, the extra overhead on CPU and memory usage is acceptable.

cluster to another. Certain clusters can have regular patterns
on object creations and deletions while others perform more
randomly. Currently, we are exploring more efficient SMR
GC algorithms to better serve a variety of OSS workloads
based on the accumulated statistics.

8 Related Work
Figure 22: Performance comparison in OSS benchmark (§6.6).
Figure(a) compares key metrics of KV servers, including throughput Enabling HM-SMR drives. There are mainly three fashions
of object write, object read and OSS GC. Figure(b) compares the of solutions in enabling HM-SMR, including adding a shim
corresponding read and write throughput of chunkservers. layer between the host and the ZBD subsystem [20,21], adopt-
ing local file systems to provide support [16, 17], and modify-
ing applications to efficiently utilize SMR devices [19,26,32].
6.6 Field Deployment

Figure 22: Performance comparison in the OSS benchmark (§6.6). Figure (a) compares key metrics of the KV servers, including the throughput of object writes, object reads, and OSS GC. Figure (b) compares the corresponding read and write throughput of the chunkservers.

In the OSS full-stack benchmark, all of the key metrics in the SMR cluster are on par with those in the CMR cluster. The two clusters are both deployed with 13 KV store servers, 13 chunkservers, and 780 HDDs in total. Figure 22 shows that, at the OSS service layer, each KV server in the SMR cluster achieves 374.2MB/s object write throughput, 227.7MB/s object read throughput, and 394.8MB/s OSS GC throughput. Each chunkserver in the SMR cluster provides 1898.6MB/s write throughput and 752.8MB/s read throughput. Similarly, in the CMR cluster, each chunkserver provides 1888.3MB/s write throughput and 723MB/s read throughput. This suggests that, from an end-to-end perspective, we are able to replace the CMR drives in standard-class OSS with SMR drives with no performance penalty, thanks to SMRstore.
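
As a back-of-the-envelope check (a naive average that ignores imbalance across drives): with 780 HDDs spread over 13 chunkservers, each chunkserver hosts 60 drives, so the SMR cluster sustains about

\[
\tfrac{1898.6\,\mathrm{MB/s}}{60} \approx 31.6\,\mathrm{MB/s}\ \text{of writes and}\ \tfrac{752.8\,\mathrm{MB/s}}{60} \approx 12.5\,\mathrm{MB/s}\ \text{of reads per SMR drive},
\]

essentially the same per-drive load as the CMR cluster (about 31.5MB/s and 12.1MB/s, respectively).
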
7 Limitation & Future Work

CZone. SMRstore follows a strictly log-structured design and thus does not require random-write support from czones. The use of the czones is still under discussion. We could use czones as szones by maintaining a write pointer in memory and a sequence number for each czone; the sequence number is used to identify valid records when the czone is reused.
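
To make this czone-as-szone idea concrete, below is a minimal sketch in C++. All of the type and member names are ours for illustration and do not come from SMRstore; a real implementation would also persist the sequence number and checksum every header.

// Minimal sketch of using a conventional zone (czone) as a sequential zone
// (szone): the host keeps the write pointer in memory and stamps every record
// with the czone's current sequence number. Reusing the czone only bumps the
// sequence number, so records left over from the previous incarnation are
// recognized and skipped when the zone is scanned.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

struct RecordHeader {
    uint64_t zone_seq;  // sequence number of the incarnation that wrote the record
    uint32_t length;    // payload bytes following the header
};

class CZoneAsSZone {
public:
    explicit CZoneAsSZone(size_t capacity) : zone_(capacity, 0) {}

    // Append-only writes: the host-side write pointer enforces sequentiality.
    bool Append(const void* data, uint32_t len) {
        if (wp_ + sizeof(RecordHeader) + len > zone_.size()) return false;
        RecordHeader h{seq_, len};
        std::memcpy(zone_.data() + wp_, &h, sizeof h);
        std::memcpy(zone_.data() + wp_ + sizeof h, data, len);
        wp_ += sizeof h + len;
        return true;
    }

    // Reuse the czone: no device-side reset is needed; bumping the sequence
    // number logically invalidates everything the old incarnation wrote.
    void Reuse() { ++seq_; wp_ = 0; }

    // Scan as done during recovery: valid records of the current incarnation
    // form a prefix, so stop at the first header whose sequence number does
    // not match (stale data) or whose length is zero (unwritten space).
    size_t CountValidRecords() const {
        size_t off = 0, valid = 0;
        while (off + sizeof(RecordHeader) <= zone_.size()) {
            RecordHeader h;
            std::memcpy(&h, zone_.data() + off, sizeof h);
            if (h.zone_seq != seq_ || h.length == 0) break;
            ++valid;
            off += sizeof h + h.length;
        }
        return valid;
    }

private:
    std::vector<uint8_t> zone_;  // stands in for the czone's LBA range
    uint64_t seq_ = 1;           // per-czone sequence number (persisted in practice)
    size_t wp_ = 0;              // host-maintained write pointer
};

int main() {
    CZoneAsSZone zone(1 << 20);
    zone.Append("old", 3);
    zone.Append("records", 7);
    zone.Reuse();                // start a new incarnation over the same LBAs
    zone.Append("new", 3);
    // Only the record written after Reuse() is counted as valid.
    std::cout << zone.CountValidRecords() << std::endl;  // prints 1
    return 0;
}

In this sketch, reusing a czone leaves the old records on disk; a scan simply stops at the first header whose sequence number does not match, which identifies the valid records without erasing or resetting the zone.
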
Ad hoc to Alibaba standard OSS. At the moment, SMRstore is dedicated to serving standard-class OSS in Alibaba Cloud. However, SMRstore can easily be adapted to other zoned block devices, such as ZNS SSDs. In fact, adapting SMRstore to ZNS SSD devices is in progress and will serve other services (e.g., Alibaba EBS).

Garbage collection. The expected on-disk lifespans of OSS data, OSS metadata, and OSS GC are different from one OSS cluster to another. Certain clusters have regular patterns of object creation and deletion, while others behave more randomly. Currently, we are exploring more efficient SMR GC algorithms that use the accumulated statistics to better serve the variety of OSS workloads.

8 Related Work

Enabling HM-SMR drives. There are mainly three classes of solutions for enabling HM-SMR drives: adding a shim layer between the host and the ZBD subsystem [20, 21], adapting local file systems to provide support [16, 17], and modifying applications to efficiently utilize SMR devices [19, 26, 32]. SMRstore differs from the above in two aspects. First, SMRstore completely discards random writes by building everything as logs, and hence avoids both the constraints of the limited conventional zones and the tax imposed by random-to-sequential translation. Second, SMRstore significantly reduces SMR GC overhead through end-to-end data placement strategies guided by the workloads.

Storage engine designs. To avoid the indirection overheads of general-purpose file systems [17, 27], the storage engines of cloud storage systems [18] and distributed file systems [31] tend to evolve toward user space, special-purpose designs [9], and end-to-end integration [11, 32]. SMRstore follows and further explores this path by being built in user space and implementing the semantics of PANGU chunks, which are much simpler than general file semantics (e.g., directory operations, file hardlinks). Further, the scope of the end-to-end integration in SMRstore is much wider than host-device: it spans the OSS service layer, the PANGU distributed file system layer, the storage-engine persistence layer, and a novel but backward-incompatible device (i.e., the HM-SMR drive). The results of SMRstore showcase benefits that can inspire future storage system designs under similar circumstances.

9 Conclusion

This paper describes our efforts in understanding, designing, evaluating, and deploying HM-SMR disks for standard-class OSS at Alibaba. By directly bridging the semantics between PANGU and the HM-SMR zoned namespace, enforcing an all-logs layout, and adopting guided placement strategies, SMRstore achieves our goal: HM-SMR drives deployed in standard-class OSS with performance comparable to CMR disks yet with much better cost efficiency.

Acknowledgments

The authors thank our shepherd, Prof. Peter Desnoyers, and the anonymous reviewers for their meticulous reviews and insightful suggestions. We also thank the OSS and PANGU teams for their tremendous support of the SMRstore project and this paper. We sincerely thank Yikang Xu, who pioneered the SMRstore prototype development. This research was partly supported by the Alibaba AIR program and NSFC (62102424).
References

[1] Archival-class OSS on Alibaba Cloud. https://www.alibabacloud.com/solutions/backup_archive.

[2] Data Lake on Alibaba Cloud. https://www.alibabacloud.com/solutions/data-lake.

[3] Fio. https://github.com/axboe/fio.

[4] hdparm. https://www.man7.org/linux/man-pages/man8/hdparm.8.html.

[5] Shingled Magnetic Recording. https://zonedstorage.io/docs/introduction/smr.

[6] INCITS T13 Technical Committee. Information technology - Zoned Device ATA Command Set (ZAC). Draft Standard T13/BSR INCITS 537, 2015.

[7] INCITS T10 Technical Committee. Information technology - Zoned Block Commands (ZBC). Draft Standard T10/BSR INCITS 536, 2017.

[8] A. Aghayev and P. Desnoyers. Skylight—A window on shingled disk operation. In Proceedings of 13th USENIX Conference on File and Storage Technologies (FAST), 2015.

[9] A. Aghayev, S. Weil, M. Kuchnik, M. Nelson, G. R. Ganger, and G. Amvrosiadis. File systems unfit as distributed storage backends: lessons from 10 years of Ceph evolution. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP), 2019.

[10] D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel. Finding a needle in Haystack: Facebook's photo storage. In Proceedings of 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2010.

[11] M. Bjørling, A. Aghayev, H. Holmberg, A. Ramesh, D. L. Moal, G. R. Ganger, and G. Amvrosiadis. ZNS: Avoiding the Block Interface Tax for Flash-based SSDs. In Proceedings of USENIX Annual Technical Conference (USENIX ATC), 2021.

[12] E. Brewer, L. Ying, L. Greenfield, R. Cypher, and T. Ts'o. Disks for Data Centers. https://research.google/pubs/pub44830.pdf, 2016.

[13] B. Calder, J. Wang, A. Ogus, N. Nilakantan, A. Skjolsvold, S. McKelvie, Y. Xu, S. Srivastav, J. Wu, H. Simitci, J. Haridas, C. Uddaraju, H. Khatri, A. Edwards, V. Bedekar, S. Mainali, R. Abbasi, A. Agarwal, M. F. u. Haq, M. I. u. Haq, D. Bhardwaj, S. Dayanand, A. Adusumilli, M. McNett, S. Sankaran, K. Manivannan, and L. Rigas. Windows Azure Storage: A highly available cloud storage service with strong consistency. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP), 2011.

[14] T. R. Feldman and G. A. Gibson. Shingled Magnetic Recording: Areal Density Increase Requires New Data Management. ;login: The USENIX Magazine, 2013.

[15] G. Gibson and G. Ganger. Principles of operation for shingled disk devices. Carnegie Mellon Parallel Data Laboratory, CMU-PDL-11-107, 2011.

[16] C. Jin, W.-Y. Xi, Z.-Y. Ching, F. Huo, and C.-T. Lim. HiSMRfs: A high performance file system for shingled storage array. In Proceedings of 30th Symposium on Mass Storage Systems and Technologies (MSST), 2014.

[17] C. Lee, D. Sim, J. Hwang, and S. Cho. F2FS: A new file system for flash storage. In Proceedings of 13th USENIX Conference on File and Storage Technologies (FAST), 2015.

[18] Q. Luo. Implement object storage with SMR based key-value store. In Proceedings of Storage Developer Conference (SDC), 2015.

[19] P. Macko, X. Ge, J. Haskins, J. Kelley, D. Slik, K. A. Smith, and M. G. Smith. SMORE: A Cold Data Object Store for SMR Drives (Extended Version). https://arxiv.org/abs/1705.09701, 2017.

[20] A. Manzanares, N. Watkins, C. Guyot, D. LeMoal, C. Maltzahn, and Z. Bandic. ZEA, a data management approach for SMR. In Proceedings of 8th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage), 2016.

[21] D. L. Moal. dm-zoned: Zoned Block Device device mapper. https://lwn.net/Articles/714387/, 2017.

[22] D. L. Moal. Linux SMR Support Status. https://events.static.linuxfound.org/sites/events/files/slides/lemoal-Linux-SMR-vault-2017.pdf, 2017.

[23] S. Muralidhar, W. Lloyd, S. Roy, C. Hill, E. Lin, W. Liu, S. Pan, S. Shankar, V. Sivakumar, L. Tang, and S. Kumar. f4: Facebook's warm BLOB storage system. In Proceedings of 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.

[24] G. Oh, J. Yang, and S. Ahn. Efficient Key-Value Data Placement for ZNS SSD. Applied Sciences, 2021.

[25] Z. Pang, Q. Lu, S. Chen, R. Wang, Y. Xu, and J. Wu. ArkDB: A Key-Value Engine for Scalable Cloud Storage Services. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD), 2021.

[26] R. Pitchumani, J. Hughes, and E. L. Miller. SMRDB: Key-Value Data Store for Shingled Magnetic Recording Disks. In Proceedings of the 8th ACM International Systems and Storage Conference (SYSTOR), 2015.

[27] O. Rodeh, J. Bacik, and C. Mason. Btrfs: The Linux B-tree filesystem. ACM Transactions on Storage (TOS), 2013.

[28] M. Rosenblum and J. K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems (TOCS), 1992.

[29] A. Suresh, G. A. Gibson, and G. R. Ganger. Shingled Magnetic Recording for Big Data Applications. Technical Report CMU-PDL-11-107, 2012.

[30] Q. Wang, J. Li, P. P. C. Lee, T. Ouyang, C. Shi, and L. Huang. Separating data via block invalidation time inference for write amplification reduction in log-structured storage. In Proceedings of 20th USENIX Conference on File and Storage Technologies (FAST), 2022.

[31] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proceedings of 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006.

[32] T. Yao, J. Wan, P. Huang, Y. Zhang, Z. Liu, C. Xie, and X. He. GearDB: A GC-free Key-Value Store on HM-SMR Drives with Gear Compaction. In Proceedings of 17th USENIX Conference on File and Storage Technologies (FAST), 2019.
