SMRSTORE on HM-SMR Drives (USENIX FAST '23)

Su Zhou, Erci Xu, Hao Wu, Yu Du, Jiacheng Cui, Wanyu Fu, Chang Liu, Yingni Wang, Wenbo Wang, Shouqu Sun, Xianfei Wang, Bo Feng, Biyun Zhu, Xin Tong, Weikang Kong, Linyan Liu, Zhongjie Wu, Jinbo Wu, Qingchao Luo, and Jiesheng Wu, Alibaba Group

https://www.usenix.org/conference/fast23/presentation/zhou
[Figure: OSS architecture. A RESTful front-end layer of KV servers sits above the persistence layer (PANGU); PANGU chunkservers run storage engines on HDDs and SSDs, and objects are appended to PANGU files. Caption not recovered.]

Figure 2: Semantics of PANGU file and chunk. This figure shows a PANGU file consisting of four chunks. Only chunk 4 (the last chunk) is not sealed (writable). PANGU keeps multiple replicas of each chunk across chunkservers to protect data against failures.
Type         Latency (ms)   Concurrency   iosize (Byte)   Lifespan (Day)
OSS Data W   <1             ~1500         512K-1M         <7
OSS Data R   <20            -             512K-1M         -
OSS Meta W   <1             ~2000         4K-128K         <60
OSS Meta R   <20            -             4K-128K         -
OSS GC       <20            ~100          512K-1M         <90

Table 1: Characteristics of streams on a chunkserver (§2.1).

Figure 3: Dataflow with the traditional storage engine (§2.1). This figure illustrates the write path of an object in the traditional CS-Ext4 stack (red shaded). The yellow shaded area means it is related to the object "Foo". The data is split as a series of 4048B segments where each is attached with a 48B footer for checksum.
flushed from cache SSDs to HDDs for persistence. The storage engine does not provide any redundancy against failures. Instead, we rely on PANGU to provide fault tolerance for each chunk across chunkservers via replication or erasure coding.

I/O Path. Figure 3 presents the high-level write flow in OSS with the traditional Ext4-based storage engine. We use an example object called "Foo" to highlight the write flow. The KV server chooses PANGU file A for storing "Foo". Then the KV server uses the PANGU SDK or contacts the PANGU master to locate the tail chunk (i.e., chunk-1) and its respective chunkservers. Further, the chunkserver appends the object's data (with checksums) to the corresponding Ext4 file. To verify data integrity, we break the object's data into a series of 4048-Byte segments. We attach to each segment a 48-Byte footer which includes the checksum and the segment's location in the chunk. Note that each chunk in the chunkserver is an Ext4 file with the chunk ID as its filename.
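The per-sector framing above (4048B of data plus a 48B footer in every 4096B Ext4 sector) can be captured in a few constants. Below is a minimal C++ sketch; the footer's internal field layout is an assumption, since the paper only states that the footer carries the checksum and the segment's location in the chunk.

```cpp
// Sketch of the 4096B on-disk sector used by the Ext4-based engine:
// 4048B of object data followed by a 48B footer. Footer field names and
// widths are illustrative, not taken from the paper.
#include <cstddef>
#include <cstdint>

constexpr size_t kSectorSize  = 4096;
constexpr size_t kSegmentSize = 4048;
constexpr size_t kFooterSize  = 48;
static_assert(kSegmentSize + kFooterSize == kSectorSize,
              "one segment plus its footer fills exactly one sector");

struct SegmentFooter {        // 48 bytes (hypothetical layout)
  uint64_t chunk_offset;      // location of this segment inside the chunk
  uint32_t crc32;             // checksum of the 4048B segment
  uint32_t data_len;          // valid bytes in the segment (<= 4048)
  uint8_t  reserved[32];      // pads the footer up to 48B
};
static_assert(sizeof(SegmentFooter) == kFooterSize, "footer must be 48B");

// Number of sectors an object of `object_bytes` occupies in the chunk file.
size_t sectors_needed(size_t object_bytes) {
  return (object_bytes + kSegmentSize - 1) / kSegmentSize;
}
```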
Workloads. A KV server can open multiple PANGU files to store or retrieve the data and metadata of objects, and perform GC on deleted objects in PANGU files. From the perspective of chunkservers, we define an active PANGU file as a stream. A stream starts when the PANGU file is opened by a KV server for read or write, and ends when the KV server closes the PANGU file. Based on the operations (read or write) and types (metadata, data or GC), we categorize the workloads issued by KV servers into five types of streams. Table 1 lists the characteristics of each type of stream. "Concurrency" refers to the PANGU file concurrency, namely the number of PANGU files opened on a chunkserver to append data. "Lifespan" refers to the expected lifespan of the data on disk (from being persisted to deleted), NOT the duration of the streams.
• OSS Data Write Stream. Persisting object data requires low latency to achieve quick responses. Therefore, object data are first written to SSD caches and later moved to HDDs. Normally, hundreds of PANGU chunks are opened for writing on a chunkserver, thereby yielding high concurrency.
• OSS Data Read Stream. OSS directly reads object data from HDDs to achieve high throughput. Due to space limits, object data for read are not cached in the SSDs.
• OSS Metadata Write Stream. The OSS metadata stream includes objects' metadata formatted as Write-Ahead Logs and Sorted String Table files from the KV store. The metadata are first flushed to the SSD cache and then migrated to HDDs. The metadata accounts for around 2% of the total capacity used (around 24TB per chunkserver).
• OSS Metadata Read Stream. The KV server maintains a cache for the object index. In most cases, a metadata read directly hits the KV index cache and returns. If the cache misses, OSS routes to the SSTFiles in PANGU.
• OSS GC Stream. Garbage collection in the OSS service layer (referred to as OSS GC) reclaims garbage space in PANGU files. A PANGU file can hold multiple objects. When a certain amount of objects are deleted in a PANGU file, OSS re-allocates the rest to another PANGU file and deletes the old file. The chunks written by OSS GC streams account for more than 80% of the total capacity used in one chunkserver. OSS GC streams run in the background and are directly routed to HDDs for persistence.

2.2 Host-Managed HM-SMR

HM-SMR drives overlap adjacent tracks to achieve higher areal density but consequently sacrifice random write support. Specifically, HM-SMR drives organize the address space as multiple fixed-size zones, including sequential zones (referred to as zones) and a few (around 1%) conventional zones (referred to as czones). For example, the Seagate SMR drive we use in this paper has a capacity of 20TB and 74508 zones (including 800 czones). The size of each zone is 256MB, and it takes around 20ms, 24ms and 22ms to open, close and erase a zone, respectively. Note that, for certain SMR HDD models (e.g., Western Digital DC HC650), there is a limit on the number of zones that can be opened concurrently.
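For context, zones, czones and their write pointers are what the host observes through the Linux zoned block device (ZBD) interface that SMRSTORE later builds on. The sketch below shows generic zone enumeration via the BLKREPORTZONE ioctl; it is not SMRSTORE's internal code, the device path is a placeholder, and error handling is omitted.

```cpp
// Minimal sketch: enumerate zones of an HM-SMR drive through the Linux
// zoned block device interface. Assumes a hypothetical device /dev/sdb.
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/blkzoned.h>
#include <cstdio>
#include <vector>

int main() {
  int fd = open("/dev/sdb", O_RDONLY);
  if (fd < 0) return 1;

  constexpr unsigned kBatch = 128;                      // zones per ioctl call
  std::vector<char> buf(sizeof(blk_zone_report) + kBatch * sizeof(blk_zone));
  auto* rep = reinterpret_cast<blk_zone_report*>(buf.data());
  rep->sector   = 0;                                    // start from the disk's beginning
  rep->nr_zones = kBatch;

  if (ioctl(fd, BLKREPORTZONE, rep) == 0) {
    // The zone descriptors follow the report header in the buffer.
    auto* zones = reinterpret_cast<blk_zone*>(rep + 1);
    for (unsigned i = 0; i < rep->nr_zones; i++) {
      std::printf("zone %u: start=%llu len=%llu wp=%llu cond=%u\n", i,
                  (unsigned long long)zones[i].start,
                  (unsigned long long)zones[i].len,
                  (unsigned long long)zones[i].wp,
                  (unsigned)zones[i].cond);
    }
  }
  close(fd);
  return 0;
}
```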
Device mapper translation. A straightforward solution is to insert a shim layer, called a shingled translation layer (STL), such as dm-zoned [21], to provide dynamic mapping from logical block addresses to physical sectors and hence achieve random-to-sequential translation. The major advantage of this approach is that it allows the users (e.g., the chunkserver process) to adopt HM-SMR drives as a cost-efficient drop-in replacement for CMR drives.

SMR-aware file systems. The log-structured design of file systems such as F2FS makes them an ideal match for the append-only zone design of HM-SMR disks. For example, F2FS has supported zoned block devices since kernel 4.10. Users can mount F2FS on an HM-SMR drive and rely on the F2FS GC mechanism to support random writes. Similarly, Btrfs, a file system based on the copy-on-write principle, currently provides experimental support for zoned block devices as of kernel 5.12.

End-to-End Co-design. Instead of relying on dm-zoned or general file systems, applications that perform mostly sequential writes can be modified to adopt HM-SMR. The benefits of end-to-end integration have been proven by several recent works, such as GearDB [32], ZenFS [11], SMORE [19], etc. Applications can eliminate the block/fs-level overhead and achieve predictable performance by managing on-disk data placement and garbage collection at the application level [24, 30].

Figure 4: High Concurrency Write Throughput (§3.2). The figure shows the write throughput of CS-Ext4 and CS-F2FS in microbenchmarks.
Stream Type   #Fio   Target   numjobs   iodepth   iosize      nrfiles   chunk size   rate
OSS GC Wr 1   1      HDDs     8         32        1MB         25        256MB        400MB/s
OSS GC Wr 2   1      HDDs     8         32        1MB         25        64MB         20MB/s
OSS Data Wr   1      SSDs     3         32        1MB         300       256MB        200MB/s
OSS Meta Wr   1      SSDs     1         8         4KB-128KB   500       4MB          80MB/s

Table 3: Macro benchmark setups (§3.1). OSS GC Wr 1 refers to OSS GC streams with large chunks. OSS GC Wr 2 refers to OSS GC streams with small chunks. OSS Data Wr refers to OSS object data write streams. OSS Meta Wr refers to OSS metadata write streams.
[Figure 8: Side-by-side comparison of the CS-Ext4 stack (PANGU chunkserver with Ext4 engines over SPDK and the block layer, on SSDs and CMR HDDs) and the CS-SMRSTORE stack (chunkserver with SMRSTORE engines over SPDK and the zoned block layer, on SSDs and SMR HDDs). SMRSTORE functionality covers the data index, garbage collection, zone management, and recovery. Caption not recovered.]

[Figure 9: Dataflow of CS-SMRSTORE. In memory, the chunk metadata (Chunk ID, Status, Size, Index Group List Ptr) points to index groups of record indexes; on disk (SMR), records (header, payload of 4096B slices each followed by a 32B slice footer, padding) are appended to a datazone. Caption not recovered.]
files can be partially filled with obsolete objects. KV servers would create new PANGU files to store the valid objects collected from old PANGU files (i.e., generating OSS GC streams).

5 SMRSTORE Design & Implementation

5.1 Architecture Overview

Figure 8 shows a side-by-side comparison between running the chunkserver with Ext4 on CMR disks (CS-Ext4) and with SMRSTORE on SMR disks (CS-SMRSTORE). The main difference is the addition of SMRSTORE to the chunkserver, sitting in user space and communicating with the SMR disks via the Zoned Block Device (ZBD) subsystem. Next, we discuss the key functionalities of SMRSTORE:
• On-disk data layout. SMRSTORE divides an HM-SMR drive into three fixed-size areas, namely the superzone, the metazones, and the datazones. SMRSTORE uses the "record" as the basic unit for metazones and datazones.
• Data index. SMRSTORE employs three levels of in-memory data structures, including the chunk metadata, index group, and record index, to map a chunk to a series of records on the disk.
• Zone Management. SMRSTORE uses a state machine to manage the lifecycle of zones, and keeps the metadata (e.g., status) of each zone in memory. Further, SMRSTORE adopts three workload-aware zone allocation strategies to achieve low SMR GC overhead.
• Garbage Collection. SMRSTORE periodically performs SMR GC to reclaim areas with stale data at the granularity of zones. There are three steps in the SMR GC procedure: victim zone selection, data migration, and metadata update.
• Recovery. Upon crashes, SMRSTORE restores itself through four steps: recovering the metazone table, loading the latest checkpoint, replaying journals, and completing the chunk metadata table by scanning opened datazones.

CS-SMRSTORE I/O Path. When replacing the storage engine with a SMRSTORE engine, the KV server and PANGU master follow the same procedures shown in Figure 3. As illustrated in Figure 9, SMRSTORE no longer relies on local system support and uses an in-memory chunk metadata table for mapping. SMRSTORE first locates the table entry and its index group linked list by using the chunk ID as the index. SMRSTORE then identifies the targeted index group or creates a new one. Finally, SMRSTORE appends the data to the datazone (indicated by the targeted index group) as record(s), and updates the record index(es) in the corresponding index group.
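To make the append path above concrete, here is a compact C++ sketch of the in-memory mapping it walks (chunk table entry, index group list, record index). The types and the chunk_append() helper are illustrative simplifications, not SMRSTORE's actual interfaces; zone allocation and the on-disk write itself are elided.

```cpp
// Simplified sketch of the CS-SMRSTORE append path (§5.1, §5.3).
#include <cstdint>
#include <list>
#include <map>
#include <string>
#include <vector>

struct RecordIndex {             // logical + physical location of one record
  uint64_t chunk_off, data_len;  // where the user data sits in the chunk
  uint64_t zone_off,  rec_len;   // where the record sits in its datazone
};
struct IndexGroup {              // record indexes of one chunk within one datazone
  uint32_t zone_id;
  std::vector<RecordIndex> records;  // kept sorted by chunk_off
};
struct ChunkMeta {
  uint64_t size = 0;                 // total chunk length on this disk
  bool sealed = false;
  std::list<IndexGroup> groups;      // sorted by the first chunk_off of each group
};

std::map<std::string, ChunkMeta> chunk_table;  // keyed by the 24-byte chunk ID

// Append `len` bytes of a chunk; `cur_zone`/`write_ptr` come from zone management.
RecordIndex chunk_append(const std::string& chunk_id, uint64_t len,
                         uint32_t cur_zone, uint64_t write_ptr) {
  ChunkMeta& cm = chunk_table[chunk_id];               // 1. locate (or create) table entry
  if (cm.groups.empty() || cm.groups.back().zone_id != cur_zone)
    cm.groups.push_back(IndexGroup{cur_zone, {}});     // 2. target or create an index group
  RecordIndex ri{cm.size, len, write_ptr, /*rec_len=*/len};  // rec_len simplified
  cm.groups.back().records.push_back(ri);              // 3. record appended to the datazone,
  cm.size += len;                                      //    record index updated in memory
  return ri;
}
```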
5.2 On-Disk Data Layout

Figure 10: On-disk Data Layout of SMRSTORE (§5.2). SMRSTORE divides a disk into three partitions. Both metadata and data are implemented on top of a unified data structure called the record (header, payload, padding). Records can be variable in length and of different types. The payload of a data record is divided into several slices to support partial reads.

Overview. Figure 10 shows the three partitions of an HM-SMR drive under SMRSTORE: one superzone, multiple metazones, and multiple datazones. All partitions are fixed-sized and statically allocated. In other words, we place the superzone on the first SMR zone, the metazones occupy the next 400 SMR zones, and the rest of the SMR zones are assigned as datazones. We do not allow metazones and datazones to be interleaved along the disk address space, which facilitates the metazone scanning during recovery.

Superzone. The superzone stores the information needed for initialization, including the format version, the format timestamp, and other system configurations.

Metazone. Inside the metazones, there are three types of metadata: the zonehead, the checkpoint, and the journal. Note that the metazones only store the metadata of SMRSTORE, not the metadata of OSS (i.e., data from the OSS metadata stream). The metadata are composed of different types of records. The zonehead record stores zone-related information, such as the zone type and the timestamp of zone allocation (used for recovery). The checkpoint is a full snapshot of the in-memory data structures, while the journals contain key operations on chunks and zones, which we further introduce in §5.6.

Inside each record, there are three fields: the header, the payload, and the padding. The header specifies the type of the record (i.e., zonehead, checkpoint, or journal record), the length of the record and the CRC checksum of the payload. The payload contains the serialized metadata. An optional padding is appended at the end of the record as the SMR drive is 4KB-aligned.

Datazone. The datazones occupy the rest of the disk. In each zone, there are two types of records, the zonehead record and the data record. The zonehead record is similar to the metazone zonehead record except for the zone type.

Data record & slice. The payload of a data record hosts the user's data (i.e., a portion of the chunk). The padding at the tail of a data record is used to bring it to a multiple of 4096 bytes (i.e., 4KB-aligned). However, the payload field of a data record is different from that of other record types (see the bottom right of Figure 10). To avoid read amplification, the payload is further divided into 4096-Byte slices, with a 32-Byte slice footer appended to each slice. The slice footer contains the chunk ID (24 bytes), the logical offset to the chunk (4 bytes) and the checksum of the slice data (4 bytes). Without payload slicing, reading 4KB from a 512KB record would require SMRSTORE to fetch the whole record to verify the payload with the record's checksum. With slices, reading 4KB only needs to read at most two slices, and SMRSTORE can use the footer in each slice for checksum verification.
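A small sketch of the data-record framing just described. The 32B slice footer fields follow §5.2; the data-record header size is not given in the paper, so a 64B placeholder is assumed.

```cpp
// Data-record layout of Figure 10: header, payload split into 4096B slices
// each followed by a 32B slice footer, and tail padding to 4KB alignment.
#include <cstddef>
#include <cstdint>

constexpr size_t kSliceData   = 4096;  // user data per slice
constexpr size_t kSliceFooter = 32;    // 24B chunk ID + 4B chunk offset + 4B CRC
constexpr size_t kAlign       = 4096;  // SMR drive is 4KB-aligned

struct SliceFooter {                   // 32 bytes (widths per §5.2)
  uint8_t  chunk_id[24];
  uint32_t chunk_offset;               // logical offset of this slice in the chunk
  uint32_t crc32;                      // checksum of the 4096B slice data
};
static_assert(sizeof(SliceFooter) == kSliceFooter, "slice footer must be 32B");

// On-disk size of a data record carrying `payload` bytes of user data,
// assuming a fixed-size header (64B is a placeholder, not from the paper).
size_t record_size(size_t payload, size_t header = 64) {
  size_t slices = (payload + kSliceData - 1) / kSliceData;
  size_t raw = header + payload + slices * kSliceFooter;
  return (raw + kAlign - 1) / kAlign * kAlign;   // pad up to the 4KB boundary
}

// A 4KB read at any payload offset straddles at most one slice boundary,
// hence touches at most two slices (the "at most two slices" property above).
size_t slices_touched(size_t offset_in_payload, size_t len = 4096) {
  size_t first = offset_in_payload / kSliceData;
  size_t last  = (offset_in_payload + len - 1) / kSliceData;
  return last - first + 1;
}
```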
5.3 Data Index

Figure 11: Data Index of SMRSTORE (§5.3). In the chunk metadata table, each chunk has a pointer to an index group list. Each index group can have multiple record indexes in a zone. The index groups and record indexes are all sorted by their offsets in the chunk.

SMRSTORE uses an in-memory data structure, called the record index, to manage the metadata of each record. The record index includes the chunk ID, the logical location of the user's data in the chunk (i.e., the chunk offset and size of the user's data) and the record's physical location in the datazone (i.e., the offset in the datazone and the size of the record).

A chunk usually has multiple records that are distributed among several datazones. Note that SMRSTORE appends the data of a chunk to only one datazone at a time, until that datazone is full. This guarantees two properties: i) the records in each datazone together must cover a consecutive range of the chunk; ii) the covered chunk ranges of different datazones do not overlap with each other.

Therefore, we group the record indexes of a chunk in each datazone as an index group. Based on i), inside each index group, we can sort record indexes by their chunk offsets. The index group also includes the corresponding datazone ID. Moreover, due to property ii), we can further sort the index groups of a chunk into a list, based on the chunk offset of the first record index in each group.

Then, SMRSTORE organizes the metadata of chunks as a table (see the left of Figure 11). Each entry of the table, indexed by the chunk ID (a 24-byte UUID), contains the chunk size (the total length of the chunk on this disk), the chunk status (sealed or not, not illustrated in the figure), and the corresponding sorted list of index groups.

When receiving a read request (specified by the chunk ID, the chunk offset, and the data length), SMRSTORE can locate the chunk metadata with the chunk ID, find the target index group in the sorted list with the chunk offset, and locate the corresponding record index(es) with the chunk offset and data length.

For a write request, SMRSTORE always locates the last index group of the target chunk. If there is enough space left in the corresponding datazone, SMRSTORE appends the data to the datazone as a new record and adds the new record index to the index group. If not, SMRSTORE allocates a new datazone, appends the data, and adds the record index to a new index group.
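The read lookup described above amounts to a two-level search over sorted structures. The following is a hedged sketch; containers and names are illustrative rather than SMRSTORE's actual code, and each group is assumed non-empty.

```cpp
// §5.3 read lookup: binary-search the sorted index-group list by chunk offset,
// then collect the record indexes that overlap the requested range.
#include <algorithm>
#include <cstdint>
#include <vector>

struct RecordIndex { uint64_t chunk_off, data_len, zone_off, rec_len; };
struct IndexGroup  { uint32_t zone_id; std::vector<RecordIndex> records; };  // sorted by chunk_off

std::vector<const RecordIndex*> lookup(const std::vector<IndexGroup>& groups,  // sorted by first offset
                                       uint64_t off, uint64_t len) {
  std::vector<const RecordIndex*> out;
  // Find the first group starting after `off`, then step back to the group
  // that may contain `off`.
  auto g = std::upper_bound(groups.begin(), groups.end(), off,
      [](uint64_t o, const IndexGroup& ig) { return o < ig.records.front().chunk_off; });
  if (g != groups.begin()) --g;
  for (; g != groups.end() && g->records.front().chunk_off < off + len; ++g) {
    for (const RecordIndex& r : g->records) {
      if (r.chunk_off >= off + len) break;                       // past the requested range
      if (r.chunk_off + r.data_len > off) out.push_back(&r);     // record overlaps the request
    }
  }
  return out;
}
```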
5.4 Zone Management

Zone state machine. SMRSTORE employs a state machine to manage the status of datazones, as shown in Figure 12. SMRSTORE maintains a pool of opened zones (55 zones by default) for fast allocation. SMRSTORE only resets GARBAGE zones to FREE zones when the number of FREE zones is not enough. Metazones follow a similar state machine, except that there is no pool of opened metazones.

Figure 12: Zone State Transition of SMRSTORE (§5.4). SMRSTORE maintains a pool of opened zones for fast allocation (FREE zones are opened in the background). When a zone is assigned to a new chunk, it transitions to the ACTIVATED status (allocated for writes); a filled zone becomes CLOSED, turns into GARBAGE once all its data become stale, and returns to FREE upon reset. If a zone is closed, it will not be reopened for writes before a reset.
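The datazone lifecycle in Figure 12 can be summarized as a five-state machine. A minimal sketch follows, with the transition triggers as comments; this is a summary of §5.4, not SMRSTORE's actual implementation.

```cpp
// Datazone state machine of Figure 12.
enum class ZoneState { FREE, OPENED, ACTIVATED, CLOSED, GARBAGE };

ZoneState advance(ZoneState s) {
  switch (s) {
    case ZoneState::FREE:      return ZoneState::OPENED;    // opened in background into the zone pool
    case ZoneState::OPENED:    return ZoneState::ACTIVATED; // allocated to a chunk for writes
    case ZoneState::ACTIVATED: return ZoneState::CLOSED;    // the zone is filled
    case ZoneState::CLOSED:    return ZoneState::GARBAGE;   // all data in the zone have become stale
    case ZoneState::GARBAGE:   return ZoneState::FREE;      // reset, only when FREE zones run low
  }
  return s;  // unreachable
}
```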
Zone table. SMRSTORE maintains a zone table in memory. Each entry of the zone table includes the zone ID, the zone status (OPENED, CLOSED, etc.), a list of live index groups, and a write pointer. We further introduce the usage of the per-zone live index group list when discussing SMR GC (§5.5) and recovery (§5.6).

Zone allocation. Earlier, for F2FS (see §3), we showed that allocating chunks from different types of OSS streams to the same datazone can result in high overhead led by frequent F2FS GC. One can allocate a single datazone for each chunk to reduce such GC. However, this can in return waste considerable space. For example, chunks from the OSS metadata stream are usually just several megabytes large, much smaller than the size of a datazone (256MB).

Hence, a more practical solution is to only pursue "one chunk per zone" for large chunks and let small chunks with similar lifespans be mixed together. A challenge here is that the size of a chunk is only determined after it is sealed. In other words, when allocating datazones for incoming OSS streams, SMRSTORE does not know the sizes of the chunks. Therefore, we design the following zone allocation strategies (a sketch follows the list):

① Separating streams by types. Different types of OSS streams can have disparate characteristics (see Table 1). Therefore, we modify the OSS KV store to embed the type of the OSS stream (i.e., OSS Metadata, OSS Data or OSS GC) along with the data. SMRSTORE only allows chunks from the same type of stream to share a datazone.

② Adapting the chunk size limit to the datazone. Recall that a chunk is sealed when it reaches the size limit, at the end of the PANGU file, or upon I/O failures. Hence, we configure the size limit of a chunk (including its checksums) to match the size of one datazone (256MB). A chunk may still be sealed well under 256MB (e.g., due to I/O errors). In that case, the leftover space is shared with other chunks from the same type of stream if necessary. Note that we still use the default size limit (64MB) for chunks from the OSS metadata stream, as the corresponding PANGU files are usually small (several to tens of MBs each).

③ Zone pool & round-robin allocation. SMRSTORE pre-opens and reserves zones for the different types of OSS streams. Specifically, we prepare 40, 10 and 5 opened zones for the OSS GC, Data and Metadata streams, respectively. The rationale is that the OSS GC stream is the main contributor to the I/O traffic, while the OSS Metadata and Data streams have high PANGU file concurrency but can be throttled by the cache SSDs. Moreover, SMRSTORE allocates zones for new chunks in a round-robin fashion to reduce the chance of chunks being mixed together.
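The sketch referenced above: per-stream-type pools of pre-opened zones with round-robin assignment, and per-type chunk size limits. The pool sizes and size limits follow §5.4; everything else (pool refill, handling of full zones) is simplified and illustrative.

```cpp
// Sketch of the three zone allocation strategies.
#include <array>
#include <cstdint>
#include <deque>

enum class StreamType { OSS_GC = 0, OSS_DATA = 1, OSS_META = 2 };  // embedded by the OSS KV store

struct ZonePools {
  std::array<std::deque<uint32_t>, 3> opened;   // pre-opened zone IDs per stream type
  std::array<size_t, 3> target = {40, 10, 5};   // pool sizes for GC / Data / Metadata streams
  std::array<size_t, 3> cursor = {0, 0, 0};     // round-robin position per pool

  // Strategy 1: never mix stream types in a zone, so pick only from this type's pool.
  // Strategy 3: rotate through the pool so concurrent chunks land in different zones.
  uint32_t allocate_for_chunk(StreamType t) {
    auto& pool = opened[static_cast<size_t>(t)];
    auto& cur  = cursor[static_cast<size_t>(t)];
    if (pool.empty()) return UINT32_MAX;        // pool refill elided in this sketch
    uint32_t zone = pool[cur % pool.size()];
    cur++;
    return zone;
  }
};

// Strategy 2: the chunk size limit matches one datazone (256MB) for data/GC
// streams so a large chunk can fill its zone exclusively; metadata chunks keep 64MB.
constexpr uint64_t kDataChunkLimit = 256ull << 20;
constexpr uint64_t kMetaChunkLimit = 64ull << 20;
```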
Figure 13: The effectiveness of the SMRSTORE zone allocation strategies (§5.4). Chunks A-D come from four different OSS streams and are shaded with the corresponding colors. Subfigure (a) represents a possible scenario under random allocation (no optimization) and (b) illustrates a possible layout with the SMRSTORE strategies enabled. Each data block may be composed of one or more records.

In Figure 13, we use an example to showcase the effectiveness of our strategies. Consider four OSS streams: two OSS GC streams (green and red), one OSS Data stream (yellow) and one OSS Metadata stream (blue). If we do not enable any strategies, SMRSTORE would allocate datazones one by one for the incoming chunks. As a result, we can expect datazones to be interleaved as shown in Figure 13(a), similar to the F2FS scenario in §3. In this case, if, for example, chunk A is deleted, all three datazones would hold chunk A's stale data and require further SMR GC to reclaim the space. Now, in Figure 13(b), due to Strategy ①, chunks from different types of streams are no longer mixed together. Moreover, since we reconfigure the size limit of chunks (Strategy ②) and use round-robin allocation (Strategy ③), chunks A and B can each own a zone exclusively and fill the entire zone. The three strategies achieve our goal by allocating large chunks to exclusive zones. Now, if chunk A is deleted, SMRSTORE can directly reset zone 1 (i.e., no SMR GC is needed).

5.5 Garbage Collection

SMRSTORE performs garbage collection in three steps:

Victim zone selection. SMRSTORE first chooses a victim zone among the CLOSED ones to perform SMR GC. We use a greedy algorithm to select the zone with the most garbage.

Data migration. For the selected victim zone, by scanning the live index group list from the zone table, SMRSTORE can identify the valid data in this zone and migrate it to an available zone which is activated only for garbage collection. Moreover, SMRSTORE enables a throttle module that dynamically limits the throughput of SMR GC to alleviate interference with foreground I/O.
Metadata update. During migration, SMRSTORE creates index groups with new record indexes for the migrated data. After the SMR GC finishes, SMRSTORE replaces the old index groups in the linked list with the new ones. Finally, SMRSTORE updates the zone table by marking the victim zone as GARBAGE.
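Putting the three GC steps together, here is a simplified sketch of the control flow: greedy victim selection over CLOSED zones, then migration and metadata update represented as comments. Throttling and the actual I/O are omitted, and the types are illustrative rather than SMRSTORE's own.

```cpp
// Sketch of the three-step SMR GC in §5.5.
#include <cstdint>
#include <vector>

struct ZoneEntry {
  uint32_t id;
  enum { OPENED, CLOSED, GARBAGE, FREE } state = CLOSED;
  uint64_t live_bytes = 0;              // derived from the zone's live index groups
  uint64_t capacity   = 256ull << 20;   // 256MB datazone
};

// Step 1: pick the CLOSED zone with the most garbage (greedy).
ZoneEntry* select_victim(std::vector<ZoneEntry>& zones) {
  ZoneEntry* victim = nullptr;
  for (auto& z : zones)
    if (z.state == ZoneEntry::CLOSED &&
        (!victim || (z.capacity - z.live_bytes) > (victim->capacity - victim->live_bytes)))
      victim = &z;
  return victim;
}

void smr_gc(std::vector<ZoneEntry>& zones) {
  ZoneEntry* victim = select_victim(zones);
  if (!victim) return;
  // Step 2: migrate the valid records (found via the live index groups) into a
  //         zone activated only for garbage collection. [I/O and throttling omitted]
  // Step 3: build new index groups for the migrated data, swap them into the
  //         chunks' index-group lists, then mark the victim zone as GARBAGE.
  victim->state = ZoneEntry::GARBAGE;
  victim->live_bytes = 0;
}
```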
5.6 Recovery

SMRSTORE relies on journals and checkpoints to restore the in-memory data structures. In this section, we first introduce the detailed design of the journal and checkpoint. Then, we discuss the four steps of recovery.

Checkpoint design. The checkpoint of SMRSTORE is a full snapshot of the in-memory data structures, including the chunk metadata table (§5.3) and the zone table (§5.4). SMRSTORE periodically creates a checkpoint and persists it into the metazones as a series of records. The zone table is usually small and can be stored in one record. The chunk metadata table is much larger (including all the index groups and record indexes, see Figure 11) and requires multiple records to store. Therefore, we also use two records to mark the start and end of a checkpoint, called the checkpoint start/end records.

Journal design. In SMRSTORE, only the create, seal, and delete operations on chunks, and the resetting of zones, need to be recorded by journals. Note that SMRSTORE does not journal the write operation (i.e., chunk append) as this can severely impact the latency. Instead, we can restore the latest data locations by scanning the previously opened zones. SMRSTORE journals the zone reset operation to handle the case where the same zone may be opened, closed and reused multiple times between two checkpoints. Note that checkpointing in SMRSTORE is non-blocking, hence journal records and checkpoint records can be interleaved in the metazones.

Recovery process. The four steps of recovery are as follows (a sketch follows the list):
• Identifying the latest valid checkpoint. The first step is to scan the zonehead record of each metazone. Recall that, when opened, each metazone is assigned a timestamp that is stored in its zonehead record. By sorting the timestamps, we can scan the metazones from the latest to the earliest to locate the most recent checkpoint end record and further obtain the corresponding checkpoint start record.
• Loading the latest checkpoint. By scanning the records between the checkpoint start and end records, SMRSTORE can recover the zone table and the chunk metadata table (including index groups and record indexes) from the most recent checkpoint.
• Replaying journals. Next, starting after the checkpoint start record, SMRSTORE replays each journal record up to the checkpoint end record to update the zone table and the chunk metadata table.
• Scanning datazones. Recall that the journals do not log the write (i.e., chunk_append()) operations in order to reduce the impact on write latency. Therefore, the last step of recovery is to check the datazones that have not been covered by the checkpoint and journals for yet-to-be-recovered writes. SMRSTORE checks the validity (i.e., allocated for writes before the crash) of datazones by reading their zonehead records. For each valid datazone, SMRSTORE verifies the data records one by one with the per-record checksums. Finally, SMRSTORE updates the in-memory chunk metadata table (including index groups and record indexes).
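The sketch referenced above shows the four recovery steps in order. The helper functions are hypothetical stubs standing in for SMRSTORE's metazone/datazone readers; only the step ordering and the timestamp-based checkpoint search are meant to follow §5.6.

```cpp
// Sketch of the four-step recovery in §5.6 (helpers are stubs).
#include <algorithm>
#include <cstdint>
#include <vector>

struct MetazoneInfo { uint32_t zone_id; uint64_t alloc_timestamp; };
struct Tables { /* zone table + chunk metadata table (index groups, record indexes) */ };

std::vector<MetazoneInfo> read_metazone_heads() { return {}; }                   // stub
bool load_checkpoint(uint32_t /*metazone*/, Tables* /*out*/) { return false; }   // stub
void replay_journals(uint32_t /*metazone*/, Tables* /*t*/) {}                    // stub
void scan_uncovered_datazones(Tables* /*t*/) {}                                  // stub

Tables recover() {
  Tables t;
  // Step 1: identify the latest valid checkpoint by sorting metazones on the
  //         allocation timestamp stored in their zonehead records, newest first.
  auto metas = read_metazone_heads();
  std::sort(metas.begin(), metas.end(),
            [](const MetazoneInfo& a, const MetazoneInfo& b) {
              return a.alloc_timestamp > b.alloc_timestamp;
            });
  for (const auto& m : metas) {
    if (!load_checkpoint(m.zone_id, &t)) continue;  // Step 2: rebuild zone + chunk tables
    replay_journals(m.zone_id, &t);                 // Step 3: chunk create/seal/delete,
    break;                                          //         zone resets after the checkpoint start
  }
  scan_uncovered_datazones(&t);  // Step 4: verify data records by checksum and
                                 //         complete the chunk metadata table
  return t;
}
```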
6 Evaluation

Software/Hardware setup. We evaluate the end-to-end performance of three candidates: the chunkserver with CMR drives (i.e., CS-Ext4), the chunkserver with F2FS on SMR drives (i.e., CS-F2FS), and SMRSTORE as the storage engine of the chunkserver on SMR drives (i.e., CS-SMRSTORE). Additionally, we set up two alternative versions of CS-SMRSTORE: CS-SMRSTORE-20T shows the performance with the full-disk 20TB capacity, and CS-SMRSTORE-OneZone imitates the data placement strategy of F2FS (i.e., mixing data from different streams into one zone). Our node configurations are listed in Table 2.

Workloads setup. We use Fio (modified to use the PANGU SDK) to generate workloads. Our experiments evaluate the following aspects of SMRSTORE.
• High concurrency micro benchmark. We extend the microbenchmark in §3 to further evaluate the candidates under highly concurrent random read workloads.
• OSS simulation macro benchmark. We also repeat the multi-stream OSS simulation in §3 to evaluate the candidates with multiple write streams, random file deletions and a high disk utilization rate.
• Garbage collection performance. We evaluate the SMR GC overhead in SMRSTORE and further examine the effectiveness of the data placement strategies by comparing the corresponding SMR GC overheads under different strategy setups.
• Recovery. To evaluate the recovery performance, we restart the chunkserver on 20TB SMR drives with 60% capacity utilization, then analyze the time consumption of recovery.
• Resource consumption. We compare the resources, such as CPU and memory usage, between CS-Ext4 and CS-SMRSTORE (i.e., the two generations of the storage stack for standard-class OSS), under a similar setup.
• Field deployment. Both CS-Ext4 and CS-SMRSTORE are currently deployed in standard-class OSS. We summarize, demonstrate, and compare key performance statistics of a CS-Ext4 cluster and a CS-SMRSTORE cluster in the field.

6.1 High Concurrency Microbenchmark

In this microbenchmark, we evaluate the candidates on one disk (SMR or CMR) with two types of workloads: High Concurrency Write (HC-W) and High Concurrency RandRead.
Figure 14: High Concurrency Write Throughput (§6.1). This figure presents the comparison of write throughput between different storage engines. CS-F2FS (green) and CS-SMRSTORE-OneZone (black) achieve rather high throughputs as they place all incoming chunks onto the same zones, which can incur high F2FS/SMR GC overhead later.

Figure 16: High Concurrency RandRead Throughput (§6.1).

Figure 18: Zone utilizations (CDF) of CS-SMRSTORE (§6.2).

Figure 19: Throughput with different data placement strategies (§6.3). No optimization: no separating, 64MB chunk size, random allocation on 55 opened zones. Strategy 1: separating streams by types. Strategy 2: adapting chunk size limit for datazone. Strategy 3: zone pool & round-robin allocation.

Figure 20: Zone Space Utilizations (CDF) Comparison (§6.3). The results show that SMRSTORE can maintain high space efficiency by enabling the three end-to-end data placement strategies.

Figure 21: Recovery Performance (§6.4). The figure shows the breakdown of recovery time. CS-SMRSTORE-INIT refers to the initial version of SMRSTORE without the fixed metazone partition. Recovering zone table refers to the step "Identifying the latest valid checkpoint" (§5.6). Replaying journals is negligible and not shown.
the extra overhead on CPU and memory usage is acceptable.

Space efficiency. Apart from persisting data, SMRSTORE further requires extra space for record headers, record paddings, and slice footers. For large IOs (512KB-1MB), a common scenario in SMRSTORE (i.e., the OSS GC/data streams, see Table 1), SMRSTORE requires another 1-2% of space relative to the IO size. The percentage increases for smaller writes, but these are rather uncommon on the HDDs due to IO merging in the cache SSDs.
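As a back-of-the-envelope check of the 1-2% figure, assume the 32B-per-4096B slice footers plus at most one 4KB-aligned header-and-padding overhead per record (the exact header size is not stated in the paper):

```latex
\underbrace{\tfrac{32}{4096}}_{\text{slice footers}} \approx 0.78\%, \qquad
\underbrace{\tfrac{\le 4\,\text{KB}}{512\,\text{KB to }1\,\text{MB}}}_{\text{header + padding}} \le 0.8\%
\;\Longrightarrow\; \text{total overhead} \approx 0.8\% \text{ to } 1.6\% \text{ per record.}
```

This rough estimate is consistent with the 1-2% reported above for 512KB-1MB IOs.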
6.6 Field Deployment

Figure 22: Performance comparison in the OSS benchmark (§6.6). Figure (a) compares key metrics of the KV servers, including the throughput of object writes, object reads and OSS GC. Figure (b) compares the corresponding read and write throughput of the chunkservers.

In the OSS full-stack benchmark, all of the key metrics in the SMR cluster are on par with the CMR cluster. The two clusters are both deployed with 13 KV store servers, 13 chunkservers, and 780 HDDs in total. Figure 22 shows that, at the OSS service layer, each KV server in the SMR cluster achieves 374.2MB/s object write throughput, 227.7MB/s object read throughput, and 394.8MB/s OSS GC throughput. Each chunkserver in the SMR cluster provides 1898.6MB/s write throughput and 752.8MB/s read throughput. Similarly, in the CMR cluster, each chunkserver provides 1888.3MB/s write throughput and 723MB/s read throughput. This suggests that, from an end-to-end perspective, we are able to replace the CMR drives in standard-class OSS with SMR drives with no performance penalty, thanks to SMRSTORE.

7 Limitation & Future Work

CZone. SMRSTORE follows a strictly log-structured design and thus does not require random write support from czones. The use of the czones is under discussion. We could use czones as szones by maintaining a write pointer in memory and a sequence number for each czone. The sequence number would be used to identify valid records when the czone is reused.

Ad hoc to Alibaba standard OSS. At the moment, SMRSTORE is dedicated to serving standard-class OSS in Alibaba Cloud. However, SMRSTORE can easily adopt other zoned block devices, such as ZNS SSDs. In fact, adapting SMRSTORE to ZNS SSD devices is in progress and will serve other services (e.g., Alibaba EBS).

Garbage collection. The expected on-disk lifespans of OSS data, OSS metadata and OSS GC data differ from one OSS cluster to another. Certain clusters can have regular patterns of object creations and deletions while others behave more randomly. Currently, we are exploring more efficient SMR GC algorithms to better serve a variety of OSS workloads based on the accumulated statistics.

8 Related Work

Enabling HM-SMR drives. There are mainly three fashions of solutions for enabling HM-SMR: adding a shim layer between the host and the ZBD subsystem [20, 21], adapting local file systems to provide support [16, 17], and modifying applications to efficiently utilize SMR devices [19, 26, 32]. SMRSTORE differs from the above in two aspects. First, SMRSTORE completely discards random writes by building everything as logs and hence avoids the potential constraints of using the limited conventional zones or the tax imposed by random-to-sequential translation. Second, SMRSTORE significantly minimizes SMR GC overhead through end-to-end data placement strategies guided by the workloads.

Storage engine designs. To avoid the indirect overheads of general-purpose file systems [17, 27], the storage engines of cloud storage systems [18] and distributed file systems [31] tend to evolve towards user space, special-purpose designs [9], and end-to-end integration [11, 32]. SMRSTORE follows and further explores this path by building in user space and implementing the semantics of PANGU chunks, which are much simpler than general file semantics (e.g., directory operations, file hardlinks). Further, the range of the end-to-end integration in SMRSTORE is much wider than host-device: it includes the OSS service layer, the PANGU distributed file system layer, the storage engine persistence layer, and a novel but backward-incompatible device (i.e., the HM-SMR drive). The results of SMRSTORE showcase benefits that can inspire future storage system designs under similar circumstances.

9 Conclusion

This paper describes our efforts in understanding, designing, evaluating, and deploying HM-SMR disks for standard-class OSS in Alibaba. By directly bridging the semantics between PANGU and the HM-SMR zoned namespace, enforcing an all-logs layout and adopting guided placement strategies, SMRSTORE achieves our goal of deploying HM-SMR drives in standard-class OSS and providing comparable performance against CMR disks yet with much better cost efficiency.

Acknowledgments

The authors thank our shepherd Prof. Peter Desnoyers and the anonymous reviewers for their meticulous reviews and insightful suggestions. We also thank the OSS and PANGU teams for their tremendous support of the SMRSTORE project and this paper. We sincerely thank Yikang Xu who pioneered the SMRSTORE prototype development. This research was partly supported by the Alibaba AIR program and NSFC (62102424).
References

[1] Archival-class OSS on Alibaba Cloud. https://www.alibabacloud.com/solutions/backup_archive.

[2] Data Lake on Alibaba Cloud. https://www.alibabacloud.com/solutions/data-lake.

[3] Fio. https://github.com/axboe/fio.

[4] hdparm. https://www.man7.org/linux/man-pages/man8/hdparm.8.html.

[5] Shingled Magnetic Recording. https://zonedstorage.io/docs/introduction/smr.

[6] INCITS T13 Technical Committee. Information technology - Zoned Device ATA Command Set (ZAC). Draft Standard T13/BSR INCITS 537, 2015.

[7] INCITS T10 Technical Committee. Information technology - Zoned Block Commands (ZBC). Draft Standard T10/BSR INCITS 536, 2017.

[8] A. Aghayev and P. Desnoyers. Skylight—A window on shingled disk operation. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST), 2015.

[9] A. Aghayev, S. Weil, M. Kuchnik, M. Nelson, G. R. Ganger, and G. Amvrosiadis. File systems unfit as distributed storage backends: lessons from 10 years of Ceph evolution. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP), 2019.

[10] D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel. Finding a needle in Haystack: Facebook's photo storage. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2010.

[11] M. Bjørling, A. Aghayev, H. Holmberg, A. Ramesh, D. L. Moal, G. R. Ganger, and G. Amvrosiadis. ZNS: Avoiding the Block Interface Tax for Flash-based SSDs. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC), 2021.

[12] E. Brewer, L. Ying, L. Greenfield, R. Cypher, and T. Ts'o. Disks for Data Centers. https://research.google/pubs/pub44830.pdf, 2016.

[13] B. Calder, J. Wang, A. Ogus, N. Nilakantan, A. Skjolsvold, S. McKelvie, Y. Xu, S. Srivastav, J. Wu, H. Simitci, J. Haridas, C. Uddaraju, H. Khatri, A. Edwards, V. Bedekar, S. Mainali, R. Abbasi, A. Agarwal, M. F. u. Haq, M. I. u. Haq, D. Bhardwaj, S. Dayanand, A. Adusumilli, M. McNett, S. Sankaran, K. Manivannan, and L. Rigas. Windows Azure Storage: A highly available cloud storage service with strong consistency. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2011.

[14] T. R. Feldman and G. A. Gibson. Shingled Magnetic Recording: Areal Density Increase Requires New Data Management. USENIX ;login: Magazine, 2013.

[15] G. Gibson and G. Ganger. Principles of operation for shingled disk devices. Carnegie Mellon University Parallel Data Laboratory, CMU-PDL-11-107, 2011.

[16] C. Jin, W.-Y. Xi, Z.-Y. Ching, F. Huo, and C.-T. Lim. HiSMRfs: A high performance file system for shingled storage array. In Proceedings of the 30th Symposium on Mass Storage Systems and Technologies (MSST), 2014.

[17] C. Lee, D. Sim, J. Hwang, and S. Cho. F2FS: A new file system for flash storage. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST), 2015.

[18] Q. Luo. Implement object storage with SMR based key-value store. In Proceedings of the Storage Developer Conference (SDC), 2015.

[19] P. Macko, X. Ge, J. Haskins, J. Kelley, D. Slik, K. A. Smith, and M. G. Smith. SMORE: A Cold Data Object Store for SMR Drives (Extended Version). https://arxiv.org/abs/1705.09701, 2017.

[20] A. Manzanares, N. Watkins, C. Guyot, D. LeMoal, C. Maltzahn, and Z. Bandic. ZEA, a data management approach for SMR. In Proceedings of the 8th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage), 2016.

[21] D. L. Moal. dm-zoned: Zoned Block Device device mapper. https://lwn.net/Articles/714387/, 2017.

[22] D. L. Moal. Linux SMR Support Status. https://events.static.linuxfound.org/sites/events/files/slides/lemoal-Linux-SMR-vault-2017.pdf, 2017.

[23] S. Muralidhar, W. Lloyd, S. Roy, C. Hill, E. Lin, W. Liu, S. Pan, S. Shankar, V. Sivakumar, L. Tang, and S. Kumar. f4: Facebook's warm BLOB storage system. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.

[24] G. Oh, J. Yang, and S. Ahn. Efficient Key-Value Data Placement for ZNS SSD. Applied Sciences, 2021.

[25] Z. Pang, Q. Lu, S. Chen, R. Wang, Y. Xu, and J. Wu. ArkDB: A Key-Value Engine for Scalable Cloud Storage Services. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD), 2021.

[26] R. Pitchumani, J. Hughes, and E. L. Miller. SMRDB: Key-Value Data Store for Shingled Magnetic Recording Disks. In Proceedings of the 8th ACM International Systems and Storage Conference (SYSTOR), 2015.

[27] O. Rodeh, J. Bacik, and C. Mason. BTRFS: The Linux B-tree filesystem. ACM Transactions on Storage (TOS), 2013.

[28] M. Rosenblum and J. K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems (TOCS), 1992.

[29] A. Suresh, G. A. Gibson, and G. R. Ganger. Shingled Magnetic Recording for Big Data Applications. Technical Report CMU-PDL-11-107, 2012.

[30] Q. Wang, J. Li, P. P. C. Lee, T. Ouyang, C. Shi, and L. Huang. Separating data via block invalidation time inference for write amplification reduction in log-structured storage. In Proceedings of the 20th USENIX Conference on File and Storage Technologies (FAST), 2022.

[31] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006.

[32] T. Yao, J. Wan, P. Huang, Y. Zhang, Z. Liu, C. Xie, and X. He. GearDB: A GC-free Key-Value Store on HM-SMR Drives with Gear Compaction. In Proceedings of the 17th USENIX Conference on File and Storage Technologies (FAST), 2019.