
Ethane: An Asymmetric File System for Disaggregated Persistent Memory

Miao Cai†, Junru Shen‡, Baoliu Ye§
† College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics
‡ College of Computer Science and Software Engineering, Hohai University
§ State Key Laboratory for Novel Software Technology, Nanjing University

This paper is included in the Proceedings of the 2024 USENIX Annual Technical Conference.
July 10-12, 2024, Santa Clara, CA, USA. ISBN 978-1-939133-41-0.
https://www.usenix.org/conference/atc24/presentation/cai

Abstract

The ultra-fast persistent memories (PMs) promise a practical solution towards high-performance distributed file systems. This paper examines and reveals a cascade of three performance and cost issues in the current PM provision scheme, namely expensive cross-node interaction, weak single-node capability, and costly scale-out performance, which not only underutilizes fast PM devices but also magnifies their limited storage capacity and high price. To remedy this, we introduce Ethane, a file system built on disaggregated persistent memory (DPM). Through resource separation using fast connectivity technologies, DPM achieves efficient and cost-effective PM sharing while retaining low-latency memory access. To unleash such hardware potential, Ethane incorporates an asymmetric file system architecture inspired by the imbalanced resource provision feature of DPM. It splits a file system into a control-plane FS and a data-plane FS and designs these two planes to make the best use of the respective hardware resources. Evaluation results demonstrate that Ethane reaps the DPM hardware benefits, performs up to 68x better than modern distributed file systems, and improves data-intensive application throughputs by up to 17x.

1 Introduction

Distributed file systems (DFSs) are the backbone of modern data center storage. To meet the unprecedented performance demands posed by data center applications [17, 20, 77], distributed file systems heavily rely on high-speed storage devices like persistent memories [12, 36, 41, 49, 72]. In contrast to large, cheap, slow storage devices (e.g., solid-state or hard-disk drives), persistent memory is a disruptive storage technology with three distinctive features, namely ultra-fast speed (~300 ns latency), limited storage capacity (<=512 GB per DIMM slot), and expensive price ($3.27/GB). In current monolithic data centers [61], every server machine is equipped with a number of PM modules, dubbed the symmetric PM architecture in this paper. This egalitarian PM provision, however, leads to a cascade of performance and cost issues.

For the symmetric PM architecture, file system data are scattered over a cluster of machines. When serving a client request, the server node has to interact with other nodes, resulting in excessive, expensive network round trips. More severely, when distributed data meets non-uniform access patterns, the load imbalance problem arises: the server node storing the hot files inevitably becomes a performance bottleneck, crippling the overall system performance. To remedy the bottlenecked node, administrators have to purchase more machines to amortize the hotspot pressure. Unfortunately, besides the necessary PM devices, additional expenses have to be paid for the accompanying processors and other peripheral devices, which significantly increases the total cost of ownership (TCO) for data center vendors.

To summarize, this hardware usage magnifies PM's drawbacks of limited capacity and high price as well as underutilizes precious PM resources. Furthermore, our analysis in Section 2.1 reveals that the expensive cross-node interaction, weak single-node capability, and costly scale-out performance caused by the symmetric PM architecture result in unpredictable request latency, degraded overall performance, and high monetary costs for distributed file systems, making them hardly able to meet the stringent Service Level Objectives (SLOs) in the regime of the "Killer Microsecond" [16, 23].

To tackle these issues, we propose Ethane (1), a file system built on disaggregated persistent memory. Persistent memory disaggregation is a key enabling technique for next-generation high-performance data centers [34, 43, 61, 65, 75]. DPM separates CPU and PM resources and assembles them into dedicated compute nodes (CNs) and memory nodes (MNs) connected with fast data connectivity technologies (e.g., RDMA [31, 40, 60] and CXL [44, 45, 52]), which delivers both large storage capacity and high aggregated bandwidth in a cost-efficient manner. The DPM architecture is appealing due to the fast evolution of high-speed memory and network technologies [31, 44, 45, 52, 60, 73].

To drive the DPM system, we depolymerize the compound file system architecture with a key insight.

(1) Ethane (C2H6) is an organic chemical compound whose structural formula resembles a DPM system with RDMA-connected CNs and MNs.



[Figure 1: (a) Cross-node Interaction; (b) Weak-node Capability; (c) Symmetric vs. Disaggregated PM Architecture -- plot data omitted]
Figure 1: Performance Issues in Symmetric PM File Systems and the Disaggregated PM Architecture. (a) demonstrates that cross-node interaction occupies a large portion of the total execution time in CephFS [69]; (b) shows that the overall system performance of Octopus [49] is crippled by the single-node performance limitation; (c) presents the symmetric and disaggregated PM architectures.

Particularly, the DPM features imbalanced resource provision between CNs and MNs. CNs yield superior computing capability to MNs but own only a few gigabytes of DRAM. In contrast, MNs are equipped with tera- or even peta-bytes of PM but are supplied with less powerful processing units. This characteristic inspires us to design an asymmetric file system architecture, which splits file system functionalities into a control-plane FS and a data-plane FS, making the best use of the respective strong computing and memory resources.

• The control-plane FS is responsible for handling complicated system control and management logic such as concurrency control and crash consistency. Leveraging the centralized view of the shared MN, we delegate intricate control-plane FS functionalities to a simplified, lightweight shared log [13-15, 25, 38, 48, 67]. For instance, linearizable system call execution is turned into a log ordering problem. Extracting file system semantics, we propose a variety of techniques to improve log insert scalability, reduce log playback latency, and achieve strong operation durability.

• The data-plane FS is responsible for storage management and processing data requests. It aims to harvest the large capacity and aggregated bandwidth of parallel-connected MNs. Towards this end, we design a unified storage paradigm that translates a variety of dependence-coupled file system data structures into unified, access-disentangled key-value tuples, and we propose mechanisms to achieve parallel metadata and data paths.

We build Ethane for an RDMA-capable disaggregated PM system and evaluate it on an emulation platform with a rack of four Intel Optane DC persistent memory machines connected with a 100 GbE Mellanox switch. We compare Ethane with three modern distributed PM file systems: Octopus [49], Assise [12], and CephFS [69]. Evaluations show promising results. Ethane delivers much better NIC and PM bandwidth utilization. It achieves up to 68x higher throughput and up to 1.71x lower monetary costs with synthetic benchmarks. When running a replicated key-value store, Redis Cluster [9], and a MapReduce application, Metis [24], Ethane improves their performance by up to 16x.

To sum up, this paper makes the following contributions.

• We examine current PM use in distributed file systems and reveal three issues. To tackle these issues, we advocate disaggregating PM and propose an asymmetric file system architecture with a novel functionality separation.

• Leveraging the centralized view of the shared memory node, we define the control-plane FS based on the shared log abstraction for efficient functionality delegation.

• To harvest the aggregated bandwidth, we design the data-plane FS as a key-value store with a unified storage paradigm and dependence-disentangled data access.

• We demonstrate the performance benefits and cost efficiency of the prototyped Ethane with extensive experiments.

2 Background and Motivation

2.1 Symmetric PM Architecture

The commercialized PM device is a paramount storage technology for hunting the "Killer Microseconds" [16, 23], owing to its hundreds of nanoseconds of latency and large bandwidth. Hence, commodity distributed file systems [12, 33, 36, 41, 49, 72] use PMs extensively to fulfill the strict performance requirements of data center applications. From the hardware perspective, data centers manage resources in the unit of monolithic servers: every server machine is full-featured and hosts both CPU and PM resources. From the file system perspective, this PM usage induces a series of correlated issues, as described below.

Expensive cross-node interaction. A distributed file system usually stores a large volume of application data [30, 56, 69]. In a symmetric PM architecture, data are scattered across and managed by independent server nodes. These nodes are self-managed individuals that run customized, deep storage and network stacks and communicate using general-purpose RPCs [36, 63, 69, 72]. When a server node receives a client request, it has to interact with other nodes to serve the request. An interaction includes cross-node communication and a (meta-)data fetch from the target node. In current DFSs, general-purpose communication mechanisms and long data paths lead to a high end-to-end interaction latency.



We use file path resolution in CephFS [69] to describe and quantify interaction costs. CephFS partitions the namespace tree among a number of metadata servers (MDSs) [70]. When resolving a file path, the client accesses multiple MDSs to fetch directory entries (dentries) and inodes. We conduct an experiment that runs an RDMA-enabled CephFS with PM-based OSD storage. We use MDTest [7] to generate a large directory tree with 65535 entries across four MDSs. A client issues stat to access files in the namespace. Figure 1a shows the latency breakdown and the number of network round trips.

CephFS incorporates a message-based RPC and a BlueStore storage backend [11] with a layer of RocksDB, BlueFS, and PMDK. For a remote dentry read, the sender encapsulates the request in a message, copies and serializes data into the message buffer, and transmits it over the network transport (i.e., RDMA over RoCE). The receiver deserializes the message, loads data with BlueStore, copies them to the NIC buffer, and transmits the response. As a result, a complete interaction includes costly data movement, encapsulation, and de-/serialization, and passes through several storage layers; it takes ~162 us and occupies 60.24% of the component resolution time.

Furthermore, the serialized component resolution design in CephFS incurs excessive sequential cross-node interactions. The linear growth of expensive interactions significantly affects the overall syscall latency. The interaction time even occupies 91.71% of the total syscall time when resolving a six-component path.

Weak single-node capability. Due to manufacturing restrictions, a machine can only be equipped with a few PM DIMMs [73], which limits the total PM capacity to at most a few terabytes. Moreover, commodity PM devices have limited bandwidth: performance studies show that only four parallel writers with an IO size of 256 B saturate the bandwidth [32, 73]. This small performance and storage upper bound raises serious concerns for production data-intensive applications because of their non-uniform access popularity [17, 20, 74]. Skewed data access causes a server node to easily become a bottleneck, crippling the overall file system performance.

Figure 1b demonstrates the load imbalance issue of a PM-based distributed file system, Octopus [49]. This experiment creates a set of 4 KB files and distributes them over four server nodes. Ten clients issue read requests to the four server nodes. The IO size is 256 B and the data access popularity follows a Zipf distribution [20]. For a uniform request distribution (i.e., alpha = 0), all server nodes deliver almost the same throughput. When alpha increases, Node2's throughput increases but this node becomes bottlenecked. More severely, the other nodes' PM devices become underutilized and their throughputs drop dramatically. The total throughput also decreases significantly, by up to two times. The weak-node deficiency is an inevitable consequence of scattered data distribution, and this problem can hardly be resolved as hot data keeps shifting and changing [20, 74].

Costly scale-out performance. To remedy the single-node weakness and keep pace with the exponential growth of application requirements, data center vendors have to purchase more PM machines. Unfortunately, the symmetric PM provision makes this scale-out paradigm costly and inefficient, especially for distributed data processing applications with highly elastic resource requirements [24]. For example, Hadoop [1], a well-known MapReduce implementation, is designed with two procedures: map and reduce. Each procedure consists of two independent phases: a computing-intensive phase for data processing and an IO-intensive phase for data loading/writing. Hadoop uses HDFS [63] as its primary storage system.

Suppose an HDFS node is running the IO phase and its PM devices are under-provisioned. After adding a PM machine, the other HDFS nodes are unable to enjoy the added capacity and performance benefits directly. Besides that, the coupled CPUs in the new machine may be over-provisioned for the computing phase. Our evaluation in §5.4 shows that this low-elasticity resource scaling significantly increases the total purchase budget and maintenance costs for MapReduce applications.

2.2 Disaggregated PM Architecture

To tackle the aforementioned issues, decoupling PMs from monolithic machines and aggregating them in a dedicated memory pool is a promising solution. The disaggregated PM architecture enables efficient and cost-effective PM sharing with fast data access [43, 46, 51, 65, 78]. As depicted in Figure 1c, either a CN or an MN is a specialized machine that assembles blades of computing or memory resources, exhibiting much stronger hardware capability than a monolithic machine in the symmetric PM architecture. Moreover, the shared PM pool embraces fast-evolving data connectivity technologies like RDMA [60] and CXL [64]; it supports low-latency data connections with highly aggregated bandwidth. Finally, DPM allows independent scaling of the two types of hardware resources: the administrator can provision or de-provision a specific type of hardware resource flexibly and on demand. Recently, the DPM architecture has been propelled rapidly thanks to ultra-fast memory and high-speed network technologies [31, 44, 45, 52, 60, 73].

Besides performance and cost advantages, DPM has a unique feature: resource asymmetry. In particular, CNs and MNs exhibit respectively strong computation and memory capabilities. A CN is equipped with powerful processing units and limited memory capacity; its small-sized memory can only be used for hot data caching or running performance-critical tasks locally. In contrast, MNs manage a large memory pool consisting of tens or even hundreds of PM modules [78]. However, every MN is only equipped with weak computing units (e.g., an ARM SoC or ASIC) that support running necessary system tasks like memory scrubbing [43]. Given this characteristic, the challenge is how to design a file system that fully drives such a distinctive hardware architecture.



3 Asymmetric File System Architecture

To respond to this question, we introduce an asymmetric file system architecture. Inspired by the resource asymmetry characteristic, we split the functionalities of a file system into two planes: (1) a control-plane FS, which is responsible for managing and controlling file system states, and (2) a data-plane FS, which is responsible for storage management and processing data requests. We run these two FS planes on the CNs and MNs, respectively.

[Figure 2: control plane -- cacheFS instances with local state running on CN1...CNk; fast data connectivity (RDMA, CXL); data plane -- a shared PM pool on MN1...MNk holding the global state (shared log, namespace, file blocks) managed by the sharedFS]
Figure 2: Asymmetric File System Architecture. The control plane consists of a set of cacheFS instances running on compute nodes, whereas the data plane provides a centralized sharedFS atop the memory nodes.

Separation of control and data plane. File systems abstract, define, and store various objects, such as dentries, inodes, and file blocks. These objects are classified into two categories: meta objects and data objects. File system operations manipulate these objects; e.g., a chmod changes the inode permission fields. The conventional DFS architecture is built upon the principle of separate object management [30, 63, 69], i.e., meta and data objects are stored and manipulated by different nodes. This design is well-suited for file systems built on the symmetric PM architecture, whereas it is ill-suited for DPM, as neither a CN nor an MN has sufficient PM or CPU resources for object storage or manipulation.

We propose a design principle of separating FS object manipulation from storage. Guided by this principle, we split a file system into a control-plane FS and a data-plane FS. The data-plane FS stores both meta and data objects and provides efficient mechanisms to access them. The control-plane FS fetches objects from the data-plane FS and handles complex and intricate object manipulation logic, such as namespace queries, crash consistency, and concurrency control.

Best use of hardware resources. The primary goal of this functionality separation is to make the best use of the available hardware in each server node. The CNs exhibit superior computing capability. We deploy the control-plane FS on the CNs to handle complex and compute-intensive system management tasks. We incarnate the control-plane FS as a set of cacheFSes that run atop the available CNs. Each cacheFS instance maintains a cached, partial view in its local, small DRAM.

On the other side, MNs provide a shared memory pool with PB-scale PM modules. Memory blades are connected in parallel through high-performance NICs [34, 46] or CXL controllers [45, 52]. Because the MNs offer a global view of the whole file system, we design the data-plane FS as a sharedFS which shards data over disjoint PM modules and parallelizes data access paths with the hardware-provided parallelism.

Shared-log-based control-plane FS. The control-plane FS is built upon the shared log abstraction [13-15, 25, 38, 48, 67]. We identify the shared log as a good fit for the control-plane FS for two reasons. First, the memory node provides a centralized view for all compute nodes, which natively supports efficient data sharing. Second, implementing control-plane FS functionalities is complicated and requires considerably sophisticated techniques [12, 57, 69, 71]. The shared log provides an elegant and efficient means to achieve them simultaneously. We delegate the control-plane FS functionalities to the shared log and propose a range of techniques to support strong persistence guarantees, efficient concurrency control, and low-cost state coherence for cacheFS instances.

Access-disentangled data-plane FS. An endemic problem in conventional DFSs is entangled data paths, i.e., data access is tightly coupled with data processing inside file system operations [19]. For example, a path resolution resolves a number of path components, and a component resolution includes a dentry read and much other coupled dentry processing, such as sanity checks. This sequential, entangled data path squanders the large DPM bandwidth and magnifies the latency inefficiency of the RDMA network. To deal with it, we propose a DPM-friendly data path which disentangles the data access from the other coupled operations. It overlaps data accesses to reap the aggregated bandwidth of the parallel-connected PM devices.

4 Ethane: Design and Implementation

Applying this architecture, we build Ethane, a file system for RDMA-enabled disaggregated persistent memory. We present the design of its control plane (§4.1) and data plane (§4.2), as well as the implementation (§4.3).

4.1 Control-plane FS

The control-plane FS consists of a set of cacheFS instances. Every cacheFS maintains volatile, partial state. The local state is small, so it fits in the small-sized local DRAM, and it is volatile, so it can be rebuilt from the remote sharedFS. The cacheFS mainly consists of two components: (1) a namespace cache, which stores recently accessed namespace entries and is structured as a chain-based hash table; and (2) a block cache, which caches data block metadata (e.g., remote addresses) and is organized as an AVL tree.

The core functionalities of the control-plane FS, such as cacheFS operation durability, concurrency control of cacheFS operations, and coherence among cacheFS instances, are delegated to the shared log. Illustrated with the example in Figure 3a, the following sections elaborate on our delegation mechanisms and techniques for optimizing shared log persistence, insertion, and playback.
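For illustration, a minimal sketch of the cacheFS-local state described above is shown below. Only the data-structure choices (a chain-based hash table for the namespace cache and an AVL tree for the block cache) come from the text; all struct and field names are assumptions of this sketch rather than Ethane's actual definitions.

```c
/* Illustrative sketch of per-cacheFS volatile state (cf. Section 4.1).
 * Only the chained hash table / AVL tree choices follow the paper;
 * names and field layout are hypothetical. */
#include <stdint.h>

struct ns_entry {                   /* namespace cache entry                    */
    char            *path;          /* full path, used as the hash key          */
    uint64_t         meta_obj_addr; /* remote address of the meta object        */
    uint64_t         parent_addr;   /* remote address of the parent meta object */
    struct ns_entry *next;          /* collision chain                          */
};

struct ns_cache {                   /* chain-based hash table                   */
    struct ns_entry *buckets[1 << 16];
};

struct blk_node {                   /* block cache node, kept in an AVL tree    */
    uint64_t         file_off;      /* logical offset within the file           */
    uint64_t         remote_addr;   /* remote PM address of the block           */
    uint32_t         len;
    int              height;        /* AVL balance information                  */
    struct blk_node *left, *right;
};

struct cachefs {
    struct ns_cache  ns_cache;      /* recently accessed namespace entries       */
    struct blk_node *blk_cache;     /* cached data-block metadata                */
    uint64_t         replayed_pos;  /* last position replayed from the shared log */
};
```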



[Figure 3: (a) the arena layout and syscall execution flow -- a dlog holds opcode, reuse, uid, gid, path, and meta_obj_addr; an mlog holds CID, fgprt, offset, size, and flag; the flow for ethane_mkdir is (1) dlog_persist, (2) mlog_insert, (3) oplog_playback, (4) cachefs_mkdir; (b) three cases of operation log insertion into an arena by concurrent cacheFS instances C1-C4]
Figure 3: Log Arena Design.
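For reference, the two log records depicted in Figure 3 can be written down as the following sketch. The field widths are those given in §4.1.1 below; the struct names, the C bit-field encoding, and the exact field types are assumptions of this sketch (a real implementation would likely pack the 8-byte mlog manually to guarantee its layout).

```c
#include <stdint.h>

/* Sketch of the log record formats from Figure 3a and Section 4.1.1 (names illustrative). */
struct mlog {                 /* 8 bytes, written into the global log order array */
    uint64_t cid    : 12;     /* cacheFS ID                                       */
    uint64_t fgprt  : 16;     /* 2-byte fingerprint of the file path              */
    uint64_t offset : 26;     /* offset of the dlog in the private dlog region    */
    uint64_t size   : 9;      /* dlog size                                        */
    uint64_t flag   : 1;      /* set if the oplog belongs to a rename/symlink     */
};

struct dlog {                 /* variable-sized, stored in a per-cacheFS PM region */
    uint16_t opcode;          /* which syscall this oplog records                  */
    uint8_t  reuse;           /* set once the log has been played (collaborative playback) */
    uint32_t uid, gid;        /* credentials                                       */
    uint64_t meta_obj_addr;   /* address of the touched meta object                */
    char     path[];          /* full file path                                    */
};
```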

4.1.1 Delegating Durability to Log Persistence

We decouple log persistence from log ordering [25]. A syscall has an operation log (oplog) which consists of a data log (dlog) and a meta log (mlog). Every cacheFS has a private PM region in the MN for storing dlogs. A dlog contains an opcode, a file path, credentials, the meta object address, and a reuse field used in the collaborative log playback. Moreover, there is a global log order array for storing mlogs. An 8-byte mlog packs a 12-bit cacheFS ID (CID), a 2-byte path fingerprint, a 26-bit dlog region offset, a 9-bit dlog size, and a 1-bit flag indicating whether this oplog is associated with a rename/symlink syscall or another syscall.

As shown in Figure 3a, the cacheFS creates a dlog and an mlog for a mkdir syscall. It first writes the dlog into the private region via an RDMA_WRITE (step 1). Then it issues an RDMA_READ on the queue pair that issued the RDMA_WRITE in order to flush the MN's PCIe buffer [68]. Persisting the dlog ensures the data durability of this syscall. Leveraging the in-order delivery property provided by commodity RNICs [66], we issue these two RDMA requests simultaneously.

4.1.2 Delegating Linearizability to Log Ordering

The control-plane FS provides a compatible linearizability model instead of a relaxed consistency model for clients [48]. The shared log approach turns concurrent syscall execution into a sequential history of oplogs. The cacheFS writes the mlog into the global log order array (step 2). The mlog order in the order array reflects the order of the corresponding syscall executions. Every cacheFS instance replays the same log sequence as if these syscalls took place locally. Thus, producing a sequence of mlogs that respects linearizable syscall executions is the key to a linearizable cacheFS design.

To achieve a valid log sequence, a naive solution is to use RDMA_CAS to append mlogs to a list one by one [14, 15]. Imposing a strict order with RDMA_CAS is expensive, because modern RNICs use an internal lock to serialize concurrent RDMA_CASes [40, 66]. High RNIC contention renders the mlog list tail a severe scalability bottleneck.

Our insight is that producing a sequence of mlogs for linearizable syscalls does not require linearizable mlog appends. We propose the log arena mechanism, which dramatically reduces mlog insert contention while still enforcing a valid log order with respect to linearizable syscall executions.

Figure 3a shows that the mlog region is partitioned into a series of arenas. An arena consists of a number of slots for storing mlogs. An arena insertion consists of two steps: (1) an mlog insertion and (2) filling the preceding empty slots. In step (1), every cacheFS randomly picks an empty slot in the current active arena, uses an RDMA_CAS to insert the mlog, and persists it with an RDMA_READ (case I in Figure 3b). In the ideal case, no preceding empty slots exist when C1-C4 finish their arena insertions. Thus, step (2) is omitted and these arena insertions have no contention. Also, they generate a valid log sequence. The trick is that C1-C4 are concurrent cacheFS instances; hence, there are no order restrictions among the associated mlog insertions.

When a cacheFS finishes the arena insertion, it should ensure that there are no empty slots preceding the inserted mlog_x. Otherwise, if a subsequent cacheFS inserts an mlog_y into one of those empty slots, mlog_y precedes mlog_x in the log history and linearizability is violated. To prevent this, in case II, C4 scans the preceding slots and fills empty slots with pseudo mlogs. A pseudo mlog represents a null operation. The empty slot belongs to the in-flight C1. C1 and C4 are concurrent instances; if C4 precedes C1, C4's pseudo mlog insertion may cause C1's insertion to fail, and C1 would then re-insert its mlog.

Scanning and filling empty slots increases the arena insertion latency and causes contention for concurrent insert operations. We introduce two optimizations. First, the number of slots in an arena is set to be smaller than the number of concurrent threads. Thus, all empty slots are likely to be filled by threads and pseudo log insertion rarely happens. Second, we perform step (2) after log playback (step 3), which creates a time window for the in-flight arena insertions. Both optimizations try to minimize the likelihood of empty slots for concurrent arena insertions.

In case III, C3 is non-concurrent with the other three cacheFS instances. There are no empty slots after the other three cacheFSes insert their mlogs; thus, C3's mlog is located behind their mlogs in the arena. When C3 finishes its mlog insertion, there exist two empty slots ahead of its mlog. It fills these empty slots to complete the log history.
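The following sketch illustrates the two-step arena insertion described above. The one-sided RDMA operations are hidden behind hypothetical helpers (rdma_cas64, rdma_read_flush); these stand in for whatever verbs wrapper a real system would use and are not Ethane's API.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>

#define ARENA_SLOTS  16           /* fewer slots than concurrent threads (optimization 1) */
#define MLOG_EMPTY    0ULL
#define MLOG_PSEUDO (~0ULL)       /* encodes the null operation                           */

/* Hypothetical one-sided RDMA helpers (assumed, not Ethane's API). */
extern bool rdma_cas64(uint64_t remote_addr, uint64_t expect, uint64_t desired);
extern void rdma_read_flush(uint64_t remote_addr);  /* RDMA_READ that flushes the MN PCIe buffer */

/* Step (1): place the 8-byte mlog into a randomly chosen empty slot of the
 * active arena (case I). A failed CAS means the slot was taken, possibly by a
 * pseudo mlog from another instance's step (2) (case II), so we simply retry. */
static int arena_insert(uint64_t arena_base, uint64_t mlog)
{
    for (;;) {
        int slot = rand() % ARENA_SLOTS;
        uint64_t addr = arena_base + (uint64_t)slot * sizeof(uint64_t);
        if (rdma_cas64(addr, MLOG_EMPTY, mlog)) {
            rdma_read_flush(addr);        /* persist the mlog before acknowledging */
            return slot;
        }
    }
}

/* Step (2), deferred until after log playback: fill every still-empty slot that
 * precedes our mlog with a pseudo mlog, so that no later insertion can be
 * ordered ahead of it (cases II and III). */
static void arena_fill_preceding(uint64_t arena_base, int my_slot)
{
    for (int s = 0; s < my_slot; s++)
        (void)rdma_cas64(arena_base + (uint64_t)s * sizeof(uint64_t),
                         MLOG_EMPTY, MLOG_PSEUDO);   /* no-op if already occupied */
}
```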



4.1.3 Delegating Coherence to Log Playback

Every cacheFS maintains a coherent state via replication. The coherence among the replicated cacheFS instances is achieved via log playback. File system interfaces are not nilext [29]; thus, to return a correct value, a syscall should externalize its effect and modify file system state immediately. Therefore, the cacheFS first forwards its local state to the newest one by scanning and replaying oplogs (step 3). Afterwards, it executes the mkdir locally and returns the execution result (step 4).

The non-nilext interface property forces the log playback onto the syscall execution path. To resolve this bottleneck, we propose two techniques: file-lineage-based log dependence check and collaborative log playback.

Log dependence check. Replaying a long sequence of logs significantly increases the total latency. To reduce the log playback sequence length, the cacheFS aims to replay only dependent logs. To this end, we need to answer two questions: (a) which oplogs are dependent? and (b) how can dependent oplogs be identified quickly?

[Figure 4: namespace trees for mkdir(/a/c/g), rename(/a/c, /b/e), and symlink(/a, /b) -- (a) Direct Lineage, (b) Step Lineage, (c) Remote Lineage]
Figure 4: Direct, Step, and Remote Lineage.

To answer (a), we validate the dependence of two file system operations based on file lineage. The direct lineage of a file f is defined as the set of files whose paths are prefixes of f's path. Figure 4(a) illustrates an example: /a/c/g's direct lineage includes /, /a, and /a/c. An operation on f depends on operations whose files and directories belong to f's lineage. For instance, removing /a or disabling /a's read permission causes /a/c/g to become inaccessible.

A directory rename operation changes f's existing lineage. Figure 4(b) shows that rename(/a/c, /b/e) moves files from /a/c to /b/e. Files in /a/c change their lineage. The new lineage is called the step lineage. Dependence checking of oplogs behind the rename's oplog uses the step lineage.

A symbolic link (symlink) adds a new file lineage. If a source directory is a symlink and points to a directory which does not belong to its lineage, the lineage of the target directory becomes a new lineage for children of the source directory. We call this new lineage the remote lineage. Figure 4(c) shows that /a/c is a symlink that points to /b; /a/c/g has a remote lineage /b. Dependence checking of oplogs behind the symlink's oplog uses both the direct and the remote lineage.

To answer (b), we design an mlog skip table to reduce the log playback range. Each cacheFS has a volatile, private mlog skip table. Every file has a corresponding table entry which records the final slot position in the global log order array during the last playback. It helps skip logs which have already been replayed in the last playback. As shown in Figure 5, the cacheFS calculates the lineage for the target file /a/b/c (step 1), queries the skip table with the fingerprint of every file path (step 2), and finds the associated playback range of every file. The final playback range is the union of all ranges.

[Figure 5: the cacheFS (1) computes the lineage of /a/b/c, (2) queries the skip table with each path fingerprint to derive the playback range, (3) reads mlogs and compares fingerprints, and (4.1/4.2) performs a full or a partial (reused) replay]
Figure 5: Fast, Collaborative Log Playback.

During log playback, the cacheFS reads mlogs one by one and checks their dependence (step 3). In particular, we calculate the path fingerprints in the lineage and compare each of them with the path fingerprint stored in the mlog. If one matches, the oplog is dependent; the cacheFS then reads the dlog and performs the associated operation to update the cacheFS state. In addition, if an mlog is associated with a rename or a symlink syscall, the associated step or direct-and-remote lineage is used for checking log dependence.

Collaborative log playback. Log playback performs history operations to derive a coherent cacheFS state. A complete file syscall has a long execution path. Collaborative log playback accelerates syscall execution by reusing partial execution results of other log playback routines. Specifically, almost all metadata syscalls are composed of two parts: a file path walk and the final file modification (e.g., changing file credentials). The file path walk is lengthy, as it consists of a series of path component resolutions which occupy a large portion of the total execution time [19, 50].

We aim to reuse the path walk results of other log playbacks. This reuse mechanism is feasible: suppose the current log playback contains a log; this log has a deterministic order in history as well as a set of dependent logs. If another cacheFS has played this log before, the dependent logs have already been replayed in its local state, and the file path walk of this oplog in the two playbacks produces the same result.

We add a reuse field to the dlog. A set field indicates that this log has been played before. Hence, the current playback routine performs a partial log replay (step 4.2) by fetching the path resolution result and modifying the file directly. Otherwise, it performs a complete log replay (step 4.1).
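A minimal sketch of the fingerprint-based dependence check follows. fp16() and the buffer sizes are placeholders, and the step/remote lineages of rename and symlink logs are omitted for brevity; only the direct-lineage rule described above is modeled.

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

extern uint16_t fp16(const char *path);   /* assumed 2-byte path fingerprint, as stored in mlogs */

/* Build the direct lineage of `path`: the fingerprints of "/", every proper
 * prefix directory, and the path itself. Returns the number of entries. */
static int direct_lineage(const char *path, uint16_t *out, int max)
{
    char prefix[4096];
    int n = 0;
    out[n++] = fp16("/");
    for (size_t i = 1; path[i] != '\0' && n < max; i++) {
        if (path[i] == '/') {
            memcpy(prefix, path, i);
            prefix[i] = '\0';
            out[n++] = fp16(prefix);
        }
    }
    if (n < max)
        out[n++] = fp16(path);
    return n;
}

/* An mlog is treated as dependent if its fingerprint matches any lineage entry;
 * a fingerprint collision only causes a harmless extra dlog read. */
static bool mlog_is_dependent(uint16_t mlog_fgprt, const uint16_t *lineage, int n)
{
    for (int i = 0; i < n; i++)
        if (lineage[i] == mlog_fgprt)
            return true;
    return false;
}
```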



4.2 Data-plane FS

The data-plane FS provides a shared, DPM-friendly storage layer. It unifies data management for diverse file system structures with a key-value-based storage paradigm and a vector access interface. Furthermore, it provides disentangled, parallelized (meta-)data access paths to harvest the DPM bandwidth.

4.2.1 Data Storage Paradigm

Metadata and data in file systems are managed with various data structures. For example, dentries are organized in a namespace tree [57]. Ethane designs a unified key-value storage paradigm for (meta-)data indexing and management. The key-value storage paradigm is advantageous for PM file systems. It is expressive and provides efficient support for structured file system data [37, 58]. Also, it offers fine-grained and easy-to-use interfaces which effectively exploit PM byte-addressability to avoid access amplification [42].

Type           Key                                      Value
Meta object    /a/b                                     [obj1_addr, obj0_addr]
Meta object    /a/b/c                                   [obj2_addr, obj1_addr]
Meta object    /a/hardlink                              [obj2_addr, obj0_addr]
Meta object    /a/b/symlink                             [obj3_addr, obj1_addr]
Data section   [obj2_addr, start_addr, section_size]    extent_addr
Figure 6: Key-value Data Storage Paradigm.

Translating FS objects to KV tuples. The data-plane FS mainly includes three types of objects: a superblock, meta objects, and data sections. The superblock records global file system state. Every file or directory has a meta object which stores its metadata, including the file type, file size, full path, etc. Figure 6 shows that every meta object is associated with a key-value tuple. The key is the unique full path. The value is a combination of the meta object address itself and the meta object address of its parent directory. A hard link to a target file has no meta object; its value stores a pointer to the target file's meta object. In contrast, a symlink has an independent meta object.

The data-plane FS uses extents to organize data blocks. An extent is a mapping from contiguous logical blocks to contiguous physical blocks. To translate a file offset to a physical block number -- file mapping -- file systems often use an extent tree [47, 53]. This translation method is unfriendly to DPM due to the cascade of pointer chasing during tree traversal. To overcome this, we propose a data-section-based file mapping design.

Every file has a logically contiguous linear space. A data section represents an aligned range of this linear space and has one of three fixed sizes: 1 GB, 2 MB, or 4 KB. Moreover, a data section is associated with an extent and its range is included in that extent's range. A large extent may have several data sections of different sizes. For example, an extent with a mapping range of [0, 2113536] has one 2 MB section and four 4 KB sections. A data section has a key that is the concatenation of three fields: the memory address of its file's meta object, the section start address, and the section size. The value is a backward pointer to the associated extent. For a file mapping, we find the data section first and use the pointer to get the associated extent. The detailed file mapping procedure is presented later in §4.2.2.

Hash-based data management. We use cuckoo hash tables to manage the key-value tuples for each type of file system object. We choose the cuckoo hash table because its constant number of slot probes per lookup facilitates designing parallel data search. We use global instead of per-file block management [53], i.e., the key-value tuples of all files' data sections are managed by global hash tables. The key space of a hash table is split, and each shard of key-value tuples is stored in a cuckoo hash table. The cuckoo hash table is organized as a linear array of slots. We stripe the linear array across all available MNs.

Access interface. The data-plane FS provides a vector-based key-value get interface: int vec_kv_get(key_t *k_vec, val_t *v_vec). This interface is vectorized: it accepts a batch of keys. It is beneficial for batched file system operations. For example, a file path walk needs to search a collection of meta objects for component resolutions. These meta object lookups can be performed at once via this interface.

In addition, this interface is approximate: it returns a number of possibly correct values. Due to hash collisions, a hash table lookup needs to validate keys. To saturate the DPM bandwidth, this interface delays key validation and instead returns all possibly correct values to the caller at once. The caller at the CN side filters out the desired values afterward. The data-plane FS also provides a vector-based key-value put interface.

4.2.2 Data Path Disentanglement

The data path in traditional file system calls includes entangled, sequential data access and data processing. We introduce a disentangled data path and demonstrate it with two examples. It separates the data access from the data processing, so as to leverage the vector lookup interface to saturate the aggregated DPM bandwidth and hide the RDMA network latency.

Parallel, pipelined hash lookup. The vector lookup interface is implemented via parallel, pipelined cuckoo hash lookups. A cuckoo hash lookup involves two slot probes, and there is no memory access dependence between these two probes. Exploiting this feature, we propose a pipelined lookup mechanism. A slot probe includes a sequence of computational tasks, which calculate the hash value, the target MN, etc., and a one-sided RDMA_READ. We overlap these computational tasks with the remote memory access to pipeline slot probes.

Furthermore, our cuckoo hash table supports optimistic concurrency control [28]. Every slot contains a version number. Readers are lock-free: a reader reads and compares the versions of the two slots before and after reading the data. Only if these two versions are unchanged does the lookup succeed; otherwise the reader retries. Therefore, a lookup requires at least two rounds of slot probes. Only the first round of slot probes is optimized with the pipeline mechanism.
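To make the batched lookup concrete, the sketch below constructs the three candidate data-section keys that may cover a file offset and issues them through the vectorized get. Only the three fixed section sizes and the vec_kv_get interface come from the text; the key layout, type names, and helper are assumptions of this sketch.

```c
#include <stdint.h>

/* Illustrative key/value types (renamed to avoid the POSIX key_t). */
typedef struct { uint64_t meta_obj_addr, start, size; } kv_key_t;
typedef uint64_t kv_val_t;               /* backward pointer to the owning extent */

extern int vec_kv_get(kv_key_t *k_vec, kv_val_t *v_vec);  /* batched, approximate lookup (Section 4.2.1) */

#define SIZE_1GB (1ULL << 30)
#define SIZE_2MB (1ULL << 21)
#define SIZE_4KB (1ULL << 12)
#define ALIGN_DOWN(x, a) ((x) & ~((uint64_t)(a) - 1))

/* Look up the data section covering file offset `addr`: because the section
 * size is unknown, probe all three possible sections in one batched request;
 * the caller later filters the candidates by validating their ranges. */
static int lookup_sections(uint64_t meta_obj_addr, uint64_t addr,
                           kv_val_t v_vec[3])
{
    kv_key_t k_vec[3] = {
        { meta_obj_addr, ALIGN_DOWN(addr, SIZE_1GB), SIZE_1GB },
        { meta_obj_addr, ALIGN_DOWN(addr, SIZE_2MB), SIZE_2MB },
        { meta_obj_addr, ALIGN_DOWN(addr, SIZE_4KB), SIZE_4KB },
    };
    return vec_kv_get(k_vec, v_vec);     /* one round of parallel cuckoo lookups */
}
```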



File path walk. A metadata syscall first performs a path walk and then modifies the meta object. The file path walk is an iterative process which consists of a number of path component resolutions. Traditionally, each path component resolution includes a dentry fetch (data access) and a series of coupled operations (data processing), such as permission checks.

[Figure 7: the path walk is decomposed into one batched vec_kv_get that fetches all component meta objects at once, a filter step on the compute node, and the remaining per-component processing]
Figure 7: Disentangled File Path Walk.

We decompose the path walk into a batch of dentry lookups and the remaining operations. A dentry lookup equals a meta object search in our data-plane FS. Suppose a syscall unlink(/a/b/c); the associated path /a/b/c contains three components, so the file path has three prefix paths. In Figure 7, to find the associated meta objects for the three prefix paths, the cacheFS issues three lookups via one vec_kv_get invocation. The vec_kv_get computes six hash values for the three keys, performs six parallel, dependence-free hash lookups, and returns six lookup results. The return values may contain false meta objects, i.e., objects whose file paths do not belong to /a/b/c's lineage. To filter out the correct objects quickly, we perform a swift sanity check on the returned meta objects before exact filename comparisons.

Assume we check the returned meta objects for the prefix path /a. We compare the parent directories of the returned meta objects with the root directory and abandon the objects with an incorrect parent. We use the correct meta object as the new parent directory and repeat this sanity check with the next prefix path /a/b. Note that the root directory has an empty parent directory. After finding all correct meta objects, the cacheFS performs the remaining operations at once.

File data read. For a read(int* fd, void* buf, off_t offset, size_t count), the file system performs a file mapping by traversing the extent tree, visiting tree node entries (data access), performing a binary search (data processing), and comparing the extent range with the requested range (data processing). Then it reads the block data (data access) and repeats this process until all requested data are fetched. This data read consists of a number of sequential extent tree lookups and block reads, resulting in a relatively long read path.

Our data-plane FS decomposes the data IO path into a series of disentangled file mappings and parallel data reads, as shown in Algorithm 1. First, our file mapping performs batched data section lookups to find extents. The endpoint addr is initialized as the left point of the lookup range [offset, offset+count]. Because the target data section size is unknown, we calculate the start addresses of the three possible data sections in lines 4-6. Next, we compute three key tuples in line 7 and send the data section lookup requests via a vec_kv_get invocation in line 8.

Algorithm 1: read(int* fd, void* buf, size_t count, off_t offset)
 1  uint64 union_min = offset, union_max = 0, addr = offset, i = 0;
 2  struct extent* extents[N]; struct block* blks[M]; vec_t v_vec;
 3  while union_min >= offset && union_max <= (offset + count) do
 4      uint64 section1_start = ALIGN_DOWN(addr, SIZE_1GB);
 5      uint64 section2_start = ALIGN_DOWN(addr, SIZE_2MB);
 6      uint64 section3_start = ALIGN_DOWN(addr, SIZE_4KB);
 7      key_t* k_vec = vectorize_key(section1_start, section2_start, section3_start);  // get a vector of keys
 8      vec_kv_get(k_vec, v_vec);                   // batched section lookup
 9      extents[i] = filter_sections(v_vec);
10      union_max += extents[i]->range_size;        // extend the union range
11      addr += extents[i]->range_size; i++;        // update the lookup endpoint
12  blks = get_blocks(extents);                     // get the blocks of all extents
13  buf = read_blocks(blks, offset, count);         // parallel block reads

The vec_kv_get performs DPM-friendly, batched lookups and returns a vector of candidate data sections. We filter out the correct section by validating the candidate sections' ranges and get the associated extent in line 9. After that, we extend the union range of the found extents in line 10 and update the lookup endpoint in line 11. If the union range does not yet cover the requested range, we perform another extent lookup; otherwise, we get all blocks of the found extents in line 12. The data-plane FS stripes a file in units of extents across the PM devices in the memory pool, which facilitates parallel data reads/writes of file blocks belonging to disjoint extents. Hence we parallelize the data block reads in line 13.

4.2.3 Log Ingestion

The sharedFS ingests shared logs to update its state. Log ingestion is split into two phases. First, every CN runs a log ingestion worker; suppose there are N workers in total. Logs belonging to the same file are dependent. Every worker i scans all shared logs and gathers those logs whose fingerprint % N == i into its working set. This ensures that dependent logs are processed by the same worker. The worker fetches and replays operation logs to forward its cacheFS state. When it finishes replaying logs, it translates the data in the namespace cache and block cache into the corresponding key-value tuples and feeds these key-value data to the sharedFS via a vector put invocation. Second, the sharedFS ingests these data by creating, inserting, or updating the associated key-value tuples.

4.3 Implementation

We developed the Ethane prototype from scratch. Its source code is available at https://github.com/miaogecm/Ethane.git. It consists of 10910 lines of C code. The CN runs the Linux operating system to provide POSIX-compatible interfaces, efficient resource management, and data protection. The cacheFS is implemented as a user-level library which includes 4922 lines of C code.
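As a sketch of the ingestion sharding rule in §4.2.3 (worker i claims the logs whose path fingerprint maps to i), the loop below shows the scan-and-gather step; the record type and the two helpers are hypothetical.

```c
#include <stdint.h>

struct mlog_view { uint16_t fgprt; uint32_t offset; uint16_t size; };   /* illustrative */

/* Assumed helpers for iterating the shared log and queuing work. */
extern int  shared_log_next(struct mlog_view *out);     /* returns 0 at the end of the log */
extern void worker_enqueue(int worker, const struct mlog_view *m);

/* Worker `i` of `nworkers` gathers its working set. Logs of the same file share
 * a path fingerprint, so all of a file's dependent logs land on the same worker. */
static void gather_working_set(int i, int nworkers)
{
    struct mlog_view m;
    while (shared_log_next(&m)) {
        if (m.fgprt % nworkers == i)
            worker_enqueue(i, &m);
    }
}
```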



[Figure 8: (a) Log Arena Scalability -- throughput vs. number of clients for Log+Arena and Log+CAS; (b) Log Replay Performance -- throughput vs. K for Baseline, +DepCheck, and +DepCheck+Reuse; (c) Path Walk Latency -- latency and NIC IOPS vs. number of path components for Para-Walk and Seq-Walk; (d) Data Read Throughput -- throughput vs. file size for Disent-Read and Ent-Read]
Figure 8: Control-Plane FS and Data-Plane FS Evaluation Results.

Every cacheFS instance runs as a state machine which replays logs to transition its local state. Multiple cacheFS instances are isolated from each other. CacheFS instances are classified into two categories: external cacheFSes and internal cacheFSes. An external cacheFS is linked with a client and serves user requests. Internal cacheFSes are used for log ingestion; there are no clients for internal cacheFS instances.

The MN has less computational power and is unable to execute a full-fledged operating system kernel. Instead, every MN runs a thin sharedFS daemon which is responsible for PM pool management and log garbage collection. Ethane runs on the reliable connected RDMA transport. We use ZooKeeper [2] for managing the namespace and maintaining configuration information for CNs and MNs.

We assume the sharedFS belongs to the trusted computing base. Every cacheFS instance runs in user space. Its state and data are volatile and can be rebuilt based on the sharedFS and the shared logs. Any data corruption or stray writes [26, 57, 76] that happen in the cacheFS or client applications are unable to affect the data integrity of the sharedFS. The log ingestion of internal cacheFS instances is performed by dedicated kernel threads at the CN.

5 Evaluation

This section tries to answer three questions: (1) Is Ethane friendly to the disaggregated persistent memory architecture? (§5.1-5.2) (2) Does Ethane perform better than conventional distributed file systems? (§5.3) (3) How does Ethane perform with real-world data-intensive applications? (§5.4)

We set up two platforms with a rack of four blade servers for emulating disaggregated PM and symmetric PM. Every blade server has two NUMA nodes and is equipped with two Intel Xeon Gold 5220 CPUs @ 2.20 GHz, 128 GB (4x32 GB) of SK Hynix DDR4 DRAM, 512 GB (4x128 GB) of Intel Optane DC Persistent Memory (DCPMM), a 512 GB Samsung PM981 NVMe SSD, and two Mellanox ConnectX-6 100 GbE NICs. Every blade server runs Ubuntu 18.04 and is connected to a 100 Gb Ethernet Mellanox switch.

Table 1 lists the hardware configurations and estimated prices of the disaggregated PM and symmetric PM systems. Hardware device prices are collected from the Amazon and HPE CDW websites. In the experiment, a disaggregated PM system consists of two CNs and two MNs, and a symmetric PM system includes four symmetric nodes (SNs). The total prices of the two systems are $14764 and $15156, respectively.

Table 1: Hardware Configuration of Disaggregated and Symmetric PM Systems
      CPU        DRAM           PMEM             SSD               NIC                Price
CN    32 cores   8 GB DDR4      -                512 GB NVMe SSD   2x ConnectX-6 NIC  $3919
MN    1 core     8 GB DDR4      4x128 GB DCPMM   512 GB NVMe SSD   2x ConnectX-6 NIC  $3463
SN    16 cores   2x32 GB DDR4   2x128 GB DCPMM   512 GB NVMe SSD   ConnectX-6 NIC     $3789

We compare Ethane with CephFS [69], Octopus [49], and Assise [12]. These three DFSs run on the symmetric PM system. We configure an RDMA-enabled CephFS with BlueStore as the OSD storage backend. The OSD uses PMDK to manage the DCPMM. CephFS runs four MDS daemons and four OSD daemons; we pin a pair of MDS and OSD daemons to each SN. Due to the limited number of physical cores, the experiments also use a coroutine library [6].

5.1 Control-plane FS Evaluation

The control-plane FS functionalities are delegated to the shared log. We evaluate its performance by analyzing three shared log techniques: log persistence, log arena, and log playback.

Log persistence latency. We measure the oplog persistence latency with different log sizes. An oplog consists of an mlog and a dlog. We evaluate three dlog sizes (64 B, 1024 B, 4096 B) by varying the file path length. Inserting and persisting an mlog/dlog needs one RDMA_CAS/RDMA_WRITE and one RDMA_READ. The RDMA_CAS/RDMA_READ latency is 4.8 us and the RDMA_WRITE latency is 3.2 us. These two RDMA requests are transmitted in parallel. Thanks to this optimization, the 8-byte mlog insertion and persistence takes 5.48 us. Inserting and persisting a small- and a medium-sized dlog costs 4.32 us and 5.53 us, respectively. The 4096 B dlog insertion and persistence takes 7.46 us.

Log arena scalability. This experiment studies log insert scalability. Every client has a private working directory and repeatedly creates files in this directory. We compare our log arena approach (Log+Arena) with an RDMA_CAS-based solution (Log+CAS). The number of slots in an arena is set to half the number of clients. The experiment varies the number of clients. Figure 8a demonstrates that Log+Arena scales much better than Log+CAS: its throughput increases linearly as the number of clients increases.



[Figure 9: number of RDMA_CAS operations and PCIe data transfer rate vs. number of clients for Log+Arena and Log+CAS]
Figure 9: # RDMA_CAS and PCIe Data Transfer Rate.

We collect the number of RDMA_CAS operations and report them in Figure 9. Log+CAS introduces far more atomic RDMA operations than Log+Arena; in contrast, Log+Arena incurs only a few RDMA_CAS operations. Moreover, Log+CAS has low PM bandwidth utilization: RDMA_CAS causes heavy NIC lock contention which prevents the throughput from increasing. We use the Intel PMWatch tool [5] to measure the number of write operations received by the memory controller, i.e., ddrt_write_ops. The RNIC accesses the DCPMM via the PCIe bus, and the on-chip PCIe controller forwards requests to the memory controller. Thus, the increase rate of ddrt_write_ops approximates the PCIe data transfer rate. Log+Arena achieves a much higher PCIe data transfer rate; its peak rate is 3x higher than Log+CAS, leading to better PM bandwidth utilization.

Log playback efficiency. Our log playback consists of two techniques: lineage-based log dependence check and collaborative log playback. We first evaluate the effectiveness of the log dependence check. In the experiment, every client creates files in the directory /ethane-<X> where 1 <= X <= K. The experiments vary K from 64 to 256. Before a client creates a file, it must replay all dependent logs; hence, a small K causes more dependent logs.

The baseline replays logs one by one without any dependence check or reuse optimizations. Figure 8b shows that the baseline throughput is only 5 Kops/s. The log dependence check optimization is effective. When K increases, the number of dependent logs is reduced; the check filters out a large number of unrelated logs during log replay. +DepCheck improves the throughput to up to 1.7 Mops/s. The log reuse mechanism brings further performance benefits. Because all creat syscalls share a common prefix path, the collaborative log playback exploits this to accelerate the execution path. It performs 42.21% better than +DepCheck on average.

5.2 Data-plane FS Evaluation

The data-plane FS incorporates DPM-friendly, disentangled data paths. This section analyzes two representative data paths: the file path walk and the file data read.

Path walk latency. This experiment preloads a Linux 4.15 source code repository. It then creates 256 clients, and each client accesses a file in the directory tree via stat. We change the file path length and measure the syscall latency. We compare the parallel path walk design in Ethane (Para-Walk) with the traditional sequential path walk (Seq-Walk). Figure 8c demonstrates that Para-Walk delivers a much more stable latency than Seq-Walk. Seq-Walk is not scalable: its total path walk latency is proportional to the number of path components. Its path walk latency for a 9-component path takes 263 us, which is 3.09x higher than for a 1-component path.

Para-Walk decouples component access from component processing. It utilizes the vector lookup interface to perform parallel component lookups, which hides the RDMA latency and delivers high network bandwidth utilization. To confirm this, we profile the NIC IOPS at the receiver side. Figure 8c shows that Para-Walk achieves up to 2.43x higher IOPS than Seq-Walk.

Data read throughput. This experiment creates a client that performs random data reads to a file. The IO size is set to 4 KB. We change the file size from 4 KB to 64 GB. We compare two data path designs: the disentangled read path (Disent-Read) and the entangled read path (Ent-Read). The entangled read path uses extent-tree-based block management. Figure 8d shows that Disent-Read performs much better than Ent-Read, especially for large file sizes. When the file size is 64 GB, its throughput is 5.76x higher than that of Ent-Read. This is because the file mapping dominates the large-file read time.

Table 2: # Pointer Chasing and % File Mapping Time
                 # Pointer Chasing            % File Mapping Time
File Size    Disent-Read    Ent-Read      Disent-Read    Ent-Read
4 KB         2.0            2.0           28.2%          49.7%
256 KB       2.0            2.76          31.4%          62.2%
16 MB        2.0            3.89          29.9%          77.8%
1 GB         2.0            4.38          29.7%          80.2%
64 GB        2.0            5.41          30.5%          86.7%

When the file size is large, the extent tree is tall. As a result, Ent-Read needs to traverse many tree levels to find a data block. To analyze the performance overheads more deeply, Table 2 reports the number of pointer chasings along the two data paths. It shows that Ent-Read requires 5.41 pointer chasings in the extent tree traversal for a 64 GB file read, and the file mapping time occupies 86.7% of the total time. Thanks to the data-section-based block management, our Disent-Read incurs only two pointer chasings per file read regardless of the file size. Moreover, it spends only 28.2%-30.5% of the total time in file mapping.

5.3 Macrobenchmark Performance

End-to-end latency. We measure the end-to-end latency of system calls. Eight clients run on four nodes and send mkdir and stat requests to the backend file system servers. Figure 10 reports the measured latencies. In the first experiment, every client creates directories in /ethane-<X> where X=rand(1,K). We vary K from 1 to 10. As shown in Figure 10a, the deep software stack and distributed namespace tree in CephFS cause over eight hundred microseconds of mkdir latency.


As shown in Figure 10a, the deep software stack and distributed namespace tree in CephFS cause over eight hundred microseconds of mkdir latency. Octopus delivers approximately 60 µs latency. Assise achieves 31.27% lower latency than Octopus on average owing to its client-local NVM design. Ethane delivers similar latency to Assise when K is small. Fortunately, when K increases, the number of dependent logs decreases, and the log playback latency in Ethane drops accordingly. It outperforms Assise by up to 33.54%.

Figure 10: System call Latency. (a) mkdir, latency (µs) vs. K; (b) stat, latency (µs) vs. α.

In the second experiment, every client accesses files via stat. The file access pattern follows a Zipfian distribution with a parameter α. A large α indicates a skewed access pattern. Both CephFS and Octopus suffer from the load imbalance issue; their stat latencies increase linearly as α becomes larger. Assise replicates hot data in client-local NVMs. Thus, skewed file access improves its performance. Similarly, the namespace cache design in cacheFS also helps Ethane avoid the load imbalance issue.
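The skewed pattern can be reproduced with a standard Zipfian sampler. The snippet below is only a sketch of the workload driver (file counts, the seed, and the stat target paths are placeholders), not the exact harness used in this experiment.

    import numpy as np

    def zipf_indices(num_files, alpha, num_ops, seed=0):
        # Sample file indices following a Zipfian popularity with skew alpha.
        # alpha = 0 is uniform; larger alpha concentrates accesses on a few hot files.
        rng = np.random.default_rng(seed)
        ranks = np.arange(1, num_files + 1).astype(float)
        probs = ranks ** (-alpha)
        probs /= probs.sum()
        return rng.choice(num_files, size=num_ops, p=probs)

    # Each sampled index i is then stat()'ed, e.g. os.stat(f"/mnt/ethane/f{i}").
    print(zipf_indices(num_files=100_000, alpha=0.99, num_ops=10))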
Metadata scalability. We use MDTest [7] to evaluate metadata performance. We run an MPICH framework [8] to generate MDTest processes across server nodes. Before the experiment, every process creates a private directory hierarchy for each client. Then, every client creates two million files, accesses them, and removes these files. Figure 11 shows the metadata performance of the four file systems.

Figure 11: MDTest Performance. (a) creat and (b) unlink, throughput (ops/s) vs. # clients; (c) stat (random), throughput vs. # path components; (d) stat (skew), throughput vs. access skewness (α).

For file creation and removal, Ethane outperforms the other three distributed file systems. Octopus only assigns one worker per data server. It is unable to process massive client requests efficiently, which causes a severe weak node capability issue. The total throughput of Octopus stops increasing at 48 clients. CephFS yields orders of magnitude lower throughput than Ethane. The MDS in CephFS relies on OSDs to store metadata. A creat or unlink syscall causes frequent cross-node communication between the MDS and OSDs. Moreover, the BlueStore in an OSD has a deep software stack for managing PM devices, which includes RocksDB, BlueFS, and PMDK. It leads to high latency, aggravating the interaction issue. Similar to Octopus, the MDS is single-threaded, which also impedes CephFS scalability. Assise performs worse than Ethane. Even though Assise localizes data access with client-local PMs, its chain replication protocol incurs high remote write overheads. Specifically, every time a client creates a file, it needs to propagate this data update to all other remote nodes in sequence.

We evaluate two settings of file stat. In the first setting, file systems randomly access files and the experiment varies the file path length. Octopus achieves approximately 1 Mops/s regardless of the number of path components. The path resolution in Octopus is not POSIX-compliant: it hashes the whole path without resolving individual path components. In contrast, Ethane faithfully resolves every component and is still 1.8× faster than Octopus thanks to the parallel path walk design. Assise's throughput is close to that of Ethane. This is because there is no data replication during file stat and all data access happens on local devices. However, random file access causes numerous PM misses for Assise, and loading data from SSD-based storage delivers a similar or even longer latency than an RDMA_READ.
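One way to realize such a parallel path walk is sketched below purely for illustration; Ethane's actual metadata layout and lookup primitive may differ. The idea is to key each dentry by its full prefix path and issue the per-component lookups concurrently, so a d-component path costs roughly one batch of lookups instead of d dependent round-trips.

    from concurrent.futures import ThreadPoolExecutor

    def prefixes(path):
        # Split /a/b/c into its component prefixes: /a, /a/b, /a/b/c.
        parts = path.strip("/").split("/")
        return ["/" + "/".join(parts[:i + 1]) for i in range(len(parts))]

    def parallel_path_walk(lookup, path):
        # Resolve every component of `path` concurrently. `lookup` stands in
        # for a remote dentry fetch (e.g., a one-sided read of a hash bucket).
        with ThreadPoolExecutor() as pool:
            dentries = list(pool.map(lookup, prefixes(path)))
        if any(d is None for d in dentries):     # a missing component fails the walk
            raise FileNotFoundError(path)
        return dentries[-1]                      # dentry of the final component

    # A sequential walk would instead issue len(prefixes(path)) dependent round-trips.
    namespace = {"/a": 1, "/a/b": 2, "/a/b/c": 3}
    print(parallel_path_walk(namespace.get, "/a/b/c"))   # -> 3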
The second setting uses a non-uniform access popularity. Both Octopus and CephFS configure one worker per SN. Therefore, their total throughput decreases dramatically when the file access becomes more skewed.

Figure 12: Fio Throughput. (a) read; (b) write. Throughput (GB/s) vs. IO size (KB).

Data IO throughput. We use fio [4] to evaluate the data R/W performance. We spawn 32 clients per CN/SN. All clients perform data reads on a shared file. We measure the data throughput with different IO sizes in Figure 12. CephFS performs worst. Data de-/serialization, message encapsulation, and extra data copies in its message-based RPC lead to low throughput; its peak throughput only approaches 6 GB/s. Octopus has load imbalance issues. When the IO size increases, its total throughput is bounded by a single node. Besides, Octopus achieves better performance than CephFS for small-sized IOs.
Its client-active IO uses one-sided RDMA_READ to reduce data transfer overheads.
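The read experiment can be approximated with a stock fio invocation. The driver below is a sketch under assumed paths and parameters (mount point, file name, run time, and the random-read pattern are placeholders), not the exact configuration used here.

    import subprocess

    def run_fio_read(mountpoint, bs_kb, numjobs=32, runtime=30):
        # Shared-file reads with `numjobs` clients and a given IO size,
        # mirroring the setup described above (paths are placeholders).
        cmd = [
            "fio", "--name=shared-read",
            f"--filename={mountpoint}/shared.file",
            "--rw=randread", f"--bs={bs_kb}k",
            f"--numjobs={numjobs}", "--group_reporting",
            "--time_based", f"--runtime={runtime}",
            "--size=4g",
        ]
        return subprocess.run(cmd, capture_output=True, text=True).stdout

    for bs in (4, 16, 32, 64, 128):
        print(run_fio_read("/mnt/ethane", bs))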
For Ethane, both the file mapping and the block reading in the data path fully exploit the aggregated bandwidth provided by the remote PM pool. Its total throughput decreases when the IO size exceeds 64KB. Every time Ethane reads file data, it needs to initiate one extra remote read to check whether any new dependent logs exist. These additional data reads consume network and PM bandwidth. When running 32 threads with an IO size of 128KB, the bandwidth is exhausted and the total throughput decreases. Assise's read throughput is higher than Ethane's. Every client in Assise caches the file in its local NVM, and a local NVM read is 10× lower in latency than a remote NVM read, which accounts for the performance difference.

We spawn four clients per CN/SN for the write experiment and clients write a shared file. Analogously to the read experiment, when the IO size is large, the total file system performance of CephFS and Octopus is bottlenecked by the PM capability of an SN. They deliver a peak throughput of 0.63 GB/s and 4.4 GB/s, respectively. The write throughput in Assise is much worse than its read throughput. For a file write, it has to send the updated data to all remaining SNs. Such an expensive data coherence protocol greatly degrades the write throughput. Ethane's throughput scales linearly thanks to its parallel data path design and reaches a peak of 15.52 GB/s.

Cost efficiency. This experiment uses the video server workload from filebench [3]. This workload prepares a set of 1 GB video files. During experiments, a vidwriter writes new videos and fifteen clients read these video files with different IO sizes. This workload is read-intensive, which stresses the file system IO performance.

Figure 13: Performance-cost Evaluation. [MB/s]/$ vs. IO size (KB).

A single PM device only offers a peak of 6 GB/s bandwidth. For large IO sizes, file systems need to add more PM devices to serve the IO requests. Symmetric PM file systems add more monolithic SNs, while the disaggregated PM file system plugs more PM modules into the MN. For different IO sizes, we choose the most cost-effective hardware configuration for each file system, i.e., the configuration that is just sufficient to run the workload. For example, for an IO size of 32 KB, CephFS requires two SNs. These two SNs contain four PM devices whose total bandwidth is sufficient for running the workload; one SN would be under-provisioned and three SNs over-provisioned.
Figure 13 shows the performance-cost efficiency. Ethane yields the highest throughput (i.e., MB/s) per dollar. The reasons are twofold. First, the disentangled data path design in Ethane brings better PM utilization than the other file systems. Second, when PM resources become scarce due to increased IO size, Ethane only needs to add a new PM DIMM without purchasing an entire machine as other PM file systems must. Two Intel DCPMMs cost $838 while an SN machine including two DCPMMs costs $3789. Disaggregating PM thus significantly reduces monetary cost compared with symmetric PM.
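The cost argument can be made concrete with the figures above (a 6 GB/s peak per PM device, $838 for two DCPMMs, and $3789 for an SN machine containing two DCPMMs); the small calculation below merely restates them.

    # Adding ~12 GB/s of PM bandwidth (two DCPMMs at ~6 GB/s peak each):
    added_bw_gbs = 2 * 6

    dpm_cost = 838            # plug two DCPMMs into an existing MN
    symmetric_cost = 3789     # buy a whole SN machine that contains two DCPMMs

    print(added_bw_gbs / dpm_cost)        # ~0.0143 GB/s per dollar
    print(added_bw_gbs / symmetric_cost)  # ~0.0032 GB/s per dollar
    print(symmetric_cost / dpm_cost)      # ~4.5x cheaper to scale bandwidth via DPM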
5.4 Application Performance

Redis cluster. We evaluate the data persistence performance of a distributed key-value store: Redis cluster [9]. Redis cluster shards data and replicates Redis nodes to manage these shards. A Redis node supports two data persistence modes: (1) AOF, which persists every operation in a log and flushes the log periodically to disk; (2) RDB, which snapshots the database state and checkpoints it to disk.

Experiments create sixteen shards and each client operates a shard by putting 100 million keys and executing the SAVE command to dump the database into an RDB file. We run the Redis cluster atop the four file systems and measure both AOF throughput and RDB latency in Figure 14a. Ethane achieves 6.77%/17.98×/41.55% higher AOF throughput than Octopus/CephFS/Assise. Replicated Redis is unfriendly to Assise because of its expensive data coherence mechanism. CephFS is 10× slower than Ethane due to its heavyweight software stack. When clients dump ten 10 GB RDB files, Ethane achieves 27.56%/8.21×/4.71× lower latency than the others.
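The per-shard persistence workload can be sketched with redis-py as below. The shard address, key count, and value size are placeholders; the actual experiment drives sixteen shards with 100 million keys each.

    import redis

    def drive_shard(host, port, nkeys=1_000_000, value=b"x" * 64):
        # Put keys into one shard, then force an RDB dump with SAVE.
        # AOF persistence is enabled so every write also lands in the log.
        r = redis.Redis(host=host, port=port)
        r.config_set("appendonly", "yes")        # AOF mode: log every operation
        for i in range(nkeys):
            r.set(f"key:{i}", value)
        r.save()                                 # blocking SAVE -> RDB checkpoint

    drive_shard("10.0.0.1", 6379)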
Figure 14: Application Performance. (a) Redis Cluster, AOF throughput and RDB latency; (b) Metis, computing and IO phase latency.

Metis. We run a multicore-optimized MapReduce application, Metis [18]. We use Metis to run WordCount with a 16GB input file. We configure two SNs for the symmetric PM file systems. Besides, we configure 0.5 CN and 1.5 MNs for Ethane. The half CN and MN are realized by using one NUMA node of the corresponding machine. This ensures that the total costs of 2 SNs and 0.5 CN + 1.5 MNs are approximately equal. The two SNs have twice as many cores as the half CN, but they deliver similar computing phase latency. It suggests that computing resources are over-provisioned for symmetric PM file systems. On the other hand, Ethane has a shorter IO phase latency; four PM devices are under-provisioned for the symmetric PM file systems. Thanks to its elastic resource scaling, Ethane yields superior performance to the others with the same hardware cost.

6 Related Works
Distributed file systems. For the past decades, DFSs have played a critical role in large-scale data storage. Conventional file systems decouple metadata from data management, e.g., HDFS [63], PVFS [21], and Lustre [10]. This is a reasonable design for a monolithic data center, since a single machine is capable of storing and manipulating the file data. A long line of research efforts has been devoted to improving the capabilities of handling metadata requests and processing data IOs. CephFS [69] improves metadata scalability via namespace tree partitioning. GIGA+ [56] adopts a hash-based directory partitioning scheme. To avoid system-wide synchronization, GIGA+ disables client caching for high concurrency. IndexFS [59] and HopsFS [54] use NoSQL and relational databases for efficient small-sized metadata storage and indexing. HDFS [63] uses block replication to improve data availability and achieve aggregated IO throughput. QFS [55] reduces replication-incurred storage consumption via erasure coding.

PM-based file systems. For local file systems, researchers exploit PM characteristics to redesign various file system modules, such as the namespace hierarchy [19], data IO [39], and the journaling mechanism [57, 71]. For distributed file systems, PM can be incorporated in several ways. BlueStore in Ceph [11] uses PM for OSD storage. However, it has a deep software stack for PM management, resulting in a long software latency. NVFS [36] proposes a PM-based write-ahead log design for HDFS. SingularFS [33] deploys all PM devices in one server machine. It only scales to billions of files due to the PM limitation of a single machine.

Octopus [49] and Orion [72] couple high-speed RDMA and PM. Octopus proposes a shared PM pool abstraction by unifying disjoint PM devices across multiple nodes. The weak node issue easily arises as it lacks an efficient load balancing mechanism. Orion [72] configures PMs in client machines, metadata servers, and data stores. Its scattered data introduces substantial node interactions during request processing. Assise [12] is a client-local PM file system. It achieves superior system performance by placing PM near client applications. However, this architecture design transfers PM expenses to users. These file systems extensively leverage PM's strengths but overlook PM's drawbacks.

Disaggregated PM system. Memory and storage disaggregation has gained increasing research interest recently [22, 27, 34, 35, 43, 46, 61, 62, 65, 75]. Resource disaggregation effectively overcomes the inherent storage capacity and cost deficiencies of persistent memories. Moreover, commodity RDMA networks [40] and forthcoming fast CXL protocols [45, 52] retain the latency advantage of PMs. This paper argues that disaggregated PM provides an attractive and competitive solution towards future high-performance DFSs, and to the best of our knowledge, Ethane is the first file system that unleashes such hardware potential with a novel asymmetric architecture and efficient functionality separation.

7 Conclusion

This paper revisits the PM usage in existing distributed file systems and reveals three correlated issues. To leverage PM's performance strengths as well as overcome its capacity and cost weaknesses, we propose a DPM-based file system, Ethane. Ethane features an asymmetric file system architecture which decouples an FS into two planes running on distinct server nodes in DPM. Compared with modern PM-based distributed file systems, Ethane yields significantly better performance for data-intensive applications with much lower monetary costs.

Acknowledgments

We thank the reviewers for their helpful feedback. This paper is supported by the Fundamental Research Funds for the Central Universities (Grant No. NS2024057) and the Natural Science Foundation of Jiangsu Province (Grant No. BK20220973). Baoliu Ye is the corresponding author.

References

[1] Apache Hadoop. https://hadoop.apache.org/, 2023.

[2] Apache ZooKeeper. https://zookeeper.apache.org/, 2023.

[3] Filebench. https://github.com/filebench/filebench, 2023.

[4] Flexible I/O Tester. https://github.com/axboe/fio, 2023.

[5] Intel PMWatch. https://github.com/intel/intel-pmwatch, 2023.

[6] libaco. https://github.com/hnes/libaco, 2023.

[7] MDTest. https://github.com/LLNL/mdtest, 2023.

[8] MPICH. https://www.mpich.org/, 2023.

[9] Scale with Redis Cluster. https://redis.io/docs/management/scaling/, 2023.

[10] The Lustre file system. https://www.lustre.org/, 2023.

[11] Abutalib Aghayev, Sage A. Weil, Michael Kuchnik, Mark Nelson, Gregory R. Ganger, and George Amvrosiadis. File Systems Unfit as Distributed Storage Backends: Lessons from 10 years of Ceph Evolution. In 27th ACM Symposium on Operating Systems Principles, Huntsville, ON, Canada, October 27-30, 2019, pages 353–369.

[12] Thomas E. Anderson, Marco Canini, Jongyul Kim, Dejan Kostic, Youngjin Kwon, Simon Peter, Waleed Reda, Henry N. Schuh, and Emmett Witchel. Assise: Performance and Availability via Client-local NVM in a Distributed File System. In 14th USENIX Symposium on Operating Systems Design and Implementation, Virtual Event, November 4-6, 2020, pages 1011–1027.
[13] Mahesh Balakrishnan, Jason Flinn, Chen Shen, Mihir Dharamshi, Ahmed Jafri, Xiao Shi, Santosh Ghosh, Hazem Hassan, Aaryaman Sagar, Rhed Shi, Jingming Liu, Filip Gruszczynski, Xianan Zhang, Huy Hoang, Ahmed Yossef, Francois Richard, and Yee Jiun Song. Virtual Consensus in Delos. In 14th USENIX Symposium on Operating Systems Design and Implementation, Virtual Event, November 4-6, 2020, pages 617–632.

[14] Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, Michael Wei, and John D. Davis. CORFU: A Shared Log Design for Flash Clusters. In 9th USENIX Symposium on Networked Systems Design and Implementation, San Jose, CA, USA, April 25-27, 2012, pages 1–14.

[15] Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, Tao Zou, and Aviad Zuck. Tango: Distributed Data Structures over a Shared Log. In ACM SIGOPS 24th Symposium on Operating Systems Principles, Farmington, PA, USA, November 3-6, 2013, pages 325–340.

[16] Luiz André Barroso, Mike Marty, David A. Patterson, and Parthasarathy Ranganathan. Attack of the Killer Microseconds. Communications of the ACM, 60(4):48–54, 2017.

[17] Benjamin Berg, Daniel S. Berger, Sara McAllister, Isaac Grosof, Sathya Gunasekar, Jimmy Lu, Michael Uhlar, Jim Carrig, Nathan Beckmann, Mor Harchol-Balter, and Gregory R. Ganger. The CacheLib Caching Engine: Design and Experiences at Scale. In 14th USENIX Symposium on Operating Systems Design and Implementation, Virtual Event, November 4-6, 2020, pages 753–768.

[18] Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Tappan Morris, and Nickolai Zeldovich. An Analysis of Linux Scalability to Many Cores. In 9th USENIX Symposium on Operating Systems Design and Implementation, October 4-6, 2010, Vancouver, BC, Canada, pages 1–16.

[19] Miao Cai, Junru Shen, Bin Tang, Hao Huang, and Baoliu Ye. FlatFS: Flatten Hierarchical File System Namespace on Non-volatile Memories. In 2022 USENIX Annual Technical Conference, Carlsbad, CA, USA, July 11-13, 2022, pages 899–914.

[20] Zhichao Cao, Siying Dong, Sagar Vemuri, and David H. C. Du. Characterizing, Modeling, and Benchmarking RocksDB Key-Value Workloads at Facebook. In 18th USENIX Conference on File and Storage Technologies, Santa Clara, CA, USA, February 24-27, 2020, pages 209–223.

[21] Philip H. Carns, Walter B. Ligon III, Robert B. Ross, and Rajeev Thakur. PVFS: A Parallel File System for Linux Clusters. In 4th Annual Linux Showcase & Conference, Atlanta, Georgia, USA, October 10-14, 2000.

[22] Zongzhi Chen, Xinjun Yang, Feifei Li, Xuntao Cheng, Qingda Hu, Zheyu Miao, Rongbiao Xie, Xiaofei Wu, Kang Wang, Zhao Song, Haiqing Sun, Zechao Zhuang, Yuming Yang, Jie Xu, Liang Yin, Wenchao Zhou, and Sheng Wang. CloudJump: Optimizing Cloud Databases for Cloud Storages. Proc. VLDB Endow., 15(12):3432–3444, 2022.

[23] Shenghsun Cho, Amoghavarsha Suresh, Tapti Palit, Michael Ferdman, and Nima Honarmand. Taming the Killer Microsecond. In 51st Annual IEEE/ACM International Symposium on Microarchitecture, Fukuoka, Japan, October 20-24, 2018, pages 627–640.

[24] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In 6th Symposium on Operating System Design and Implementation, San Francisco, California, USA, December 6-8, 2004, pages 137–150.

[25] Cong Ding, David Chu, Evan Zhao, Xiang Li, Lorenzo Alvisi, and Robbert van Renesse. Scalog: Seamless Reconfiguration and Total Order in a Scalable Shared Log. In 17th USENIX Symposium on Networked Systems Design and Implementation, Santa Clara, CA, USA, February 25-27, 2020, pages 325–338.

[26] Mingkai Dong, Heng Bu, Jifei Yi, Benchao Dong, and Haibo Chen. Performance and Protection in the ZoFS User-space NVM File System. In 27th ACM Symposium on Operating Systems Principles, Huntsville, ON, Canada, October 27-30, 2019, pages 478–493.

[27] Siying Dong, Shiva Shankar P., Satadru Pan, Anand Ananthabhotla, Dhanabal Ekambaram, Abhinav Sharma, Shobhit Dayal, Nishant Vinaybhai Parikh, Yanqin Jin, Albert Kim, Sushil Patil, Jay Zhuang, Sam Dunster, Akanksha Mahajan, Anirudh Chelluri, Chaitanya Datye, Lucas Vasconcelos Santana, Nitin Garg, and Omkar Gawde. Disaggregating RocksDB: A Production Experience. Proc. ACM Manag. Data, 1(2):1–24, 2023.

[28] Bin Fan, David G. Andersen, and Michael Kaminsky. MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing. In Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation, Lombard, IL, USA, April 2-5, 2013, pages 371–384.
[29] Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Exploiting nil-externality for fast replicated storage. In ACM SIGOPS 28th Symposium on Operating Systems Principles, Virtual Event / Koblenz, Germany, October 26-29, 2021, pages 440–456.

[30] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In 19th ACM Symposium on Operating Systems Principles, Bolton Landing, NY, USA, October 19-22, 2003, pages 29–43.

[31] Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G. Shin. Efficient Memory Disaggregation with Infiniswap. In 14th USENIX Symposium on Networked Systems Design and Implementation, Boston, MA, USA, March 27-29, 2017, pages 649–667.

[32] Shashank Gugnani, Arjun Kashyap, and Xiaoyi Lu. Understanding the Idiosyncrasies of Real Persistent Memory. Proc. VLDB Endow., 14(4):626–639, 2020.

[33] Hao Guo, Youyou Lu, Wenhao Lv, Xiaojian Liao, Shaoxun Zeng, and Jiwu Shu. SingularFS: A Billion-Scale Distributed File System Using a Single Metadata Server. In 2023 USENIX Annual Technical Conference, Boston, MA, USA, July 10-12, 2023, pages 915–928.

[34] Zhiyuan Guo, Yizhou Shan, Xuhao Luo, Yutong Huang, and Yiying Zhang. Clio: a Hardware-software Co-designed Disaggregated Memory System. In 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February - 4 March, 2022, pages 417–433.

[35] Haoyu Huang and Shahram Ghandeharizadeh. Nova-LSM: A Distributed, Component-based LSM-tree Key-value Store. In International Conference on Management of Data, Virtual Event, China, June 20-25, 2021, pages 749–763.

[36] Nusrat Sharmin Islam, Md. Wasi-ur-Rahman, Xiaoyi Lu, and Dhabaleswar K. Panda. High Performance Design for HDFS with Byte-Addressability of NVM and RDMA. In Proceedings of the 2016 International Conference on Supercomputing, Istanbul, Turkey, June 1-3, 2016, pages 1–14.

[37] William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, John Esmet, Yizheng Jiao, Ankur Mittal, Prashant Pandey, Phaneendra Reddy, Leif Walsh, Michael A. Bender, Martin Farach-Colton, Rob Johnson, Bradley C. Kuszmaul, and Donald E. Porter. BetrFS: A Right-Optimized Write-Optimized File System. In 13th USENIX Conference on File and Storage Technologies, Santa Clara, CA, USA, February 16-19, 2015, pages 301–315.

[38] Zhipeng Jia and Emmett Witchel. Boki: Stateful Serverless Computing with Shared Logs. In ACM SIGOPS 28th Symposium on Operating Systems Principles, Virtual Event / Koblenz, Germany, October 26-29, 2021, pages 691–707.

[39] Rohan Kadekodi, Se Kwon Lee, Sanidhya Kashyap, Taesoo Kim, Aasheesh Kolli, and Vijay Chidambaram. SplitFS: Reducing Software Overhead in File Systems for Persistent Memory. In 27th ACM Symposium on Operating Systems Principles, Huntsville, ON, Canada, October 27-30, 2019, pages 494–508.

[40] Anuj Kalia, Michael Kaminsky, and David G. Andersen. Design Guidelines for High Performance RDMA Systems. In 2016 USENIX Annual Technical Conference, Denver, CO, USA, June 22-24, 2016, pages 437–450.

[41] Jongyul Kim, Insu Jang, Waleed Reda, Jaeseong Im, Marco Canini, Dejan Kostic, Youngjin Kwon, Simon Peter, and Emmett Witchel. LineFS: Efficient SmartNIC Offload of a Distributed File System with Pipeline Parallelism. In ACM SIGOPS 28th Symposium on Operating Systems Principles, Virtual Event, Koblenz, Germany, October 26-29, 2021, pages 756–771.

[42] Jinhyung Koo, Junsu Im, Jooyoung Song, Juhyung Park, Eunji Lee, Bryan S. Kim, and Sungjin Lee. Modernizing File System through In-Storage Indexing. In 15th USENIX Symposium on Operating Systems Design and Implementation, July 14-16, 2021, pages 75–92.

[43] Se Kwon Lee, Soujanya Ponnapalli, Sharad Singhal, Marcos K. Aguilera, Kimberly Keeton, and Vijay Chidambaram. DINOMO: An Elastic, Scalable, High-Performance Key-Value Store for Disaggregated Persistent Memory. Proc. VLDB Endow., 15(13):4023–4037, 2022.

[44] Ang Li, Shuaiwen Leon Song, Jieyang Chen, Jiajia Li, Xu Liu, Nathan R. Tallent, and Kevin J. Barker. Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect. IEEE Transactions on Parallel Distributed Systems, 31(1):94–110, 2020.

[45] Huaicheng Li, Daniel S. Berger, Lisa Hsu, Daniel Ernst, Pantea Zardoshti, Stanko Novakovic, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, Mark D. Hill, Marcus Fontoura, and Ricardo Bianchini. Pond: CXL-Based Memory Pooling Systems for Cloud Platforms. In 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vancouver, BC, Canada, March 25-29, 2023, pages 574–587.
[46] Pengfei Li, Yu Hua, Pengfei Zuo, Zhangyu Chen, and Jiajie Sheng. ROLEX: A Scalable RDMA-oriented Learned Key-Value Store for Disaggregated Memory Systems. In 21st USENIX Conference on File and Storage Technologies, Santa Clara, CA, USA, February 21-23, 2023, pages 99–114.

[47] Ruibin Li, Xiang Ren, Xu Zhao, Siwei He, Michael Stumm, and Ding Yuan. ctFS: Replacing File Indexing with Hardware Memory Translation through Contiguous File Allocation for Persistent Memory. In 20th USENIX Conference on File and Storage Technologies, Santa Clara, CA, USA, February 22-24, 2022, pages 35–50.

[48] Joshua Lockerman, Jose M. Faleiro, Juno Kim, Soham Sankaran, Daniel J. Abadi, James Aspnes, Siddhartha Sen, and Mahesh Balakrishnan. The FuzzyLog: A Partially Ordered Shared Log. In 13th USENIX Symposium on Operating Systems Design and Implementation, Carlsbad, CA, USA, October 8-10, 2018, pages 357–372.

[49] Youyou Lu, Jiwu Shu, Youmin Chen, and Tao Li. Octopus: an RDMA-enabled Distributed Persistent Memory File System. In 2017 USENIX Annual Technical Conference, Santa Clara, CA, USA, July 12-14, 2017, pages 773–785.

[50] Wenhao Lv, Youyou Lu, Yiming Zhang, Peile Duan, and Jiwu Shu. InfiniFS: An Efficient Metadata Service for Large-Scale Distributed Filesystems. In 20th USENIX Conference on File and Storage Technologies, Santa Clara, CA, USA, February 22-24, 2022, pages 313–328.

[51] Teng Ma, Mingxing Zhang, Kang Chen, Zhuo Song, Yongwei Wu, and Xuehai Qian. AsymNVM: An Efficient Framework for Implementing Persistent Data Structures on Asymmetric NVM Architecture. In 25th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, March 16-20, 2020, pages 757–773.

[52] Hasan Al Maruf, Hao Wang, Abhishek Dhanotia, Johannes Weiner, Niket Agarwal, Pallab Bhattacharya, Chris Petersen, Mosharaf Chowdhury, Shobhit O. Kanaujia, and Prakash Chauhan. TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory. In 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vancouver, BC, Canada, March 25-29, 2023, pages 742–755.

[53] Ian Neal, Gefei Zuo, Eric Shiple, Tanvir Ahmed Khan, Youngjin Kwon, Simon Peter, and Baris Kasikci. Rethinking File Mapping for Persistent Memory. In 19th USENIX Conference on File and Storage Technologies, February 23-25, 2021, pages 97–111.

[54] Salman Niazi, Mahmoud Ismail, Seif Haridi, Jim Dowling, Steffen Grohsschmiedt, and Mikael Ronström. HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases. In 15th USENIX Conference on File and Storage Technologies, Santa Clara, CA, USA, February 27 - March 2, 2017, pages 89–104.

[55] Michael Ovsiannikov, Silvius Rus, Damian Reeves, Paul Sutter, Sriram Rao, and Jim Kelly. The Quantcast File System. Proc. VLDB Endow., 6(11):1092–1101, 2013.

[56] Swapnil Patil and Garth A. Gibson. Scale and Concurrency of GIGA+: File System Directories with Millions of Files. In 9th USENIX Conference on File and Storage Technologies, San Jose, CA, USA, February 15-17, 2011, pages 177–190.

[57] Dulloor Subramanya Rao, Sanjay Kumar, Anil S. Keshavamurthy, Philip Lantz, Dheeraj Reddy, Rajesh Sankaran, and Jeff Jackson. System Software for Persistent Memory. In Ninth Eurosys Conference on Computer Systems, Amsterdam, The Netherlands, April 13-16, 2014, pages 1–15.

[58] Kai Ren and Garth A. Gibson. TABLEFS: Enhancing Metadata Efficiency in the Local File System. In 2013 USENIX Annual Technical Conference, San Jose, CA, USA, June 26-28, 2013, pages 145–156.

[59] Kai Ren, Qing Zheng, Swapnil Patil, and Garth A. Gibson. IndexFS: Scaling File System Metadata Performance with Stateless Caching and Bulk Insertion. In International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, LA, USA, November 16-21, 2014, pages 237–248.

[60] Zhenyuan Ruan, Malte Schwarzkopf, Marcos K. Aguilera, and Adam Belay. AIFM: High-Performance, Application-Integrated Far Memory. In 14th USENIX Symposium on Operating Systems Design and Implementation, Virtual Event, November 4-6, 2020, pages 315–332.

[61] Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang. LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation. In 13th USENIX Symposium on Operating Systems Design and Implementation, Carlsbad, CA, USA, October 8-10, 2018, pages 69–87.

[62] Junyi Shu, Ruidong Zhu, Yun Ma, Gang Huang, Hong Mei, Xuanzhe Liu, and Xin Jin. Disaggregated RAID Storage in Modern Datacenters. In 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vancouver, BC, Canada, March 25-29, 2023, pages 147–163.
[63] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop Distributed File System. In IEEE 26th Symposium on Mass Storage Systems and Technologies, Lake Tahoe, Nevada, USA, May 3-7, 2010, pages 1–10.

[64] Yan Sun, Yifan Yuan, Zeduo Yu, Reese Kuper, Chihun Song, Jinghan Huang, Houxiang Ji, Siddharth Agarwal, Jiaqi Lou, Ipoom Jeong, Ren Wang, Jung Ho Ahn, Tianyin Xu, and Nam Sung Kim. Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices. In 56th Annual IEEE/ACM International Symposium on Microarchitecture, Toronto, ON, Canada, 28 October - 1 November, 2023, pages 105–121.

[65] Shin-Yeh Tsai, Yizhou Shan, and Yiying Zhang. Disaggregating Persistent Memory and Controlling Them Remotely: An Exploration of Passive Disaggregated Key-Value Stores. In 2020 USENIX Annual Technical Conference, July 15-17, 2020, pages 33–48.

[66] Qing Wang, Youyou Lu, and Jiwu Shu. Sherman: A Write-Optimized Distributed B+ Tree Index on Disaggregated Memory. In International Conference on Management of Data, Philadelphia, PA, USA, June 12-17, 2022, pages 1033–1048.

[67] Michael Wei, Amy Tai, Christopher J. Rossbach, Ittai Abraham, Maithem Munshed, Medhavi Dhawan, Jim Stabile, Udi Wieder, Scott Fritchie, Steven Swanson, Michael J. Freedman, and Dahlia Malkhi. vCorfu: A Cloud-Scale Object Store on a Shared Log. In 14th USENIX Symposium on Networked Systems Design and Implementation, Boston, MA, USA, March 27-29, 2017, pages 35–49.

[68] Xingda Wei, Xiating Xie, Rong Chen, Haibo Chen, and Binyu Zang. Characterizing and Optimizing Remote Persistent Memory with RDMA and NVM. In 2021 USENIX Annual Technical Conference, July 14-16, 2021, pages 523–536.

[69] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. Ceph: A Scalable, High-Performance Distributed File System. In 7th Symposium on Operating Systems Design and Implementation, November 6-8, 2006, Seattle, WA, USA, pages 307–320.

[70] Sage A. Weil, Kristal T. Pollack, Scott A. Brandt, and Ethan L. Miller. Dynamic Metadata Management for Petabyte-Scale File Systems. In Proceedings of the ACM/IEEE Conference on High Performance Networking and Computing, 6-12 November 2004, Pittsburgh, PA, USA, pages 1–12.

[71] Jian Xu and Steven Swanson. NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories. In 14th USENIX Conference on File and Storage Technologies, Santa Clara, CA, USA, February 22-25, 2016, pages 323–338.

[72] Jian Yang, Joseph Izraelevitz, and Steven Swanson. Orion: A Distributed File System for Non-Volatile Main Memory and RDMA-Capable Networks. In 17th USENIX Conference on File and Storage Technologies, Boston, MA, February 25-28, 2019, pages 221–234.

[73] Jian Yang, Juno Kim, Morteza Hoseinzadeh, Joseph Izraelevitz, and Steven Swanson. An Empirical Guide to the Behavior and Use of Scalable Persistent Memory. In 18th USENIX Conference on File and Storage Technologies, Santa Clara, CA, USA, February 24-27, 2020, pages 169–182.

[74] Juncheng Yang, Yao Yue, and K. V. Rashmi. A large scale analysis of hundreds of in-memory cache clusters at Twitter. In 14th USENIX Symposium on Operating Systems Design and Implementation, Virtual Event, November 4-6, 2020, pages 191–208.

[75] Ming Zhang, Yu Hua, Pengfei Zuo, and Lurong Liu. FORD: Fast One-sided RDMA-based Distributed Transactions for Disaggregated Persistent Memory. In 20th USENIX Conference on File and Storage Technologies, Santa Clara, CA, USA, February 22-24, 2022, pages 51–68.

[76] Diyu Zhou, Vojtech Aschenbrenner, Tao Lyu, Jian Zhang, Sudarsun Kannan, and Sanidhya Kashyap. Enabling High-Performance and Secure Userspace NVM File Systems with the Trio Architecture. In 29th Symposium on Operating Systems Principles, Koblenz, Germany, October 23-26, 2023, pages 150–165.

[77] Hang Zhu, Kostis Kaffes, Zixu Chen, Zhenming Liu, Christos Kozyrakis, Ion Stoica, and Xin Jin. RackSched: A Microsecond-Scale Scheduler for Rack-Scale Computers. In 14th USENIX Symposium on Operating Systems Design and Implementation, Virtual Event, November 4-6, 2020, pages 1225–1240.

[78] Pengfei Zuo, Jiazhao Sun, Liu Yang, Shuangwu Zhang, and Yu Hua. One-sided RDMA-Conscious Extendible Hashing for Disaggregated Memory. In 2021 USENIX Annual Technical Conference, July 14-16, 2021, pages 15–29.