Ethane - An Asymmetric File System For Disaggregated Persistent Memory
Figure 1: Performance Issues in Symmetric PM File Systems and Disaggregated PM Architecture. (a) Cross-node Interaction: cross-node interaction occupies a large portion of the total execution time in CephFS [69]; (b) Weak-node Capability: the overall system performance of Octopus [49] is crippled by the single-node performance limitation; (c) the symmetric and disaggregated PM architectures.
file system architecture with a key insight. Particularly, the DPM features imbalanced resource provision between CNs and MNs. CNs yield superior computing capability to MNs but own only a few gigabytes of DRAM. In contrast, MNs are equipped with tera- or peta-bytes of PM but are supplied with less powerful processing units. This characteristic inspires us to design an asymmetric file system architecture, which splits file system functionalities into a control-plane FS and a data-plane FS, making the best use of the respective strong computing and memory resources.

• The control-plane FS is responsible for handling complicated system control and management logic such as concurrency control and crash consistency. Leveraging the centralized view of the shared MN, we delegate intricate control-plane FS functionalities to a simplified, lightweight shared log [13–15, 25, 38, 48, 67]. For instance, linearizable system call execution is turned into a log ordering problem. Extracting file system semantics, we propose a variety of techniques to improve log insert scalability, reduce log playback latency, and achieve strong operation durability.

• The data-plane FS is responsible for storage management and processing data requests. It aims to harvest the large capacity and aggregated bandwidth benefits of parallel-connected MNs. Towards this end, we design a unified storage paradigm for translating a variety of dependence-coupled file system data structures into unified, access-disentangled key-value tuples and propose mechanisms to achieve parallel metadata and data paths.

We build Ethane for an RDMA-capable disaggregated PM system and evaluate it on an emulation platform with a rack of four Intel Optane DC persistent memory machines connected with a 100 GbE Mellanox switch. We compare Ethane with three modern distributed PM file systems: Octopus [49], Assise [12], and CephFS [69]. Evaluations show promising results. Ethane delivers much better NIC and PM bandwidth utilization. It achieves up to 68× higher throughput and up to 1.71× lower monetary costs with synthetic benchmarks. When running a replicated key-value store, Redis Cluster [9], and a MapReduce application, Metis [24], Ethane improves their performance by up to 16×.

To sum up, this paper makes the following contributions.

• We examine current PM use in distributed file systems and reveal three issues. To tackle these issues, we advocate disaggregating PM and propose an asymmetric file system architecture with a novel functionality separation.

• Leveraging the centralized view of the shared memory node, we define the control-plane FS based on a shared log abstraction for efficient functionality delegation.

• To harvest the aggregated bandwidth, we design the data-plane FS as a key-value store with a unified storage paradigm and dependence-disentangled data access.

• We demonstrate the performance benefits and cost efficiency of the prototyped Ethane with extensive experiments.

2 Background and Motivation

2.1 Symmetric PM Architecture
The commercialized PM device is a paramount storage technology to hunt the “Killer Microseconds” [16, 23] for its hundreds of nanoseconds of latency and large bandwidth. Hence, commodity distributed file systems [12, 33, 36, 41, 49, 72] extensively and intensively use PMs to fulfill the strict performance requirements of data center applications. From the hardware perspective, data centers manage resources in the unit of monolithic servers. Every server machine is full-featured and hosts both CPU and PM resources. From the file system perspective, this PM usage induces a series of correlated issues, as described below.

Expensive cross-node interaction. A distributed file system usually stores a large volume of application data [30, 56, 69]. In a symmetric PM architecture, data are scattered across and managed by independent server nodes. These nodes are self-managed individuals that run customized, deep storage and network stacks and communicate using general-purpose RPCs [36, 63, 69, 72]. When a server node receives a client request, it has to interact with other nodes to serve the request. An interaction includes cross-node communications and a (meta-)data fetch from the target node. Considering current DFSs, general communication mechanisms and long data paths lead to a high end-to-end interaction latency.
[Architecture figure: cacheFS instances with local state run on compute nodes CN1…CNk, each with an RNIC, and connect over fast data connectivity (RDMA, CXL) to memory nodes that host the namespace, shared log, and data plane.]

67]. We identify that the shared log is a good fit for the control-plane FS for two reasons. First, the memory node provides a centralized view for all compute nodes, which natively supports efficient data sharing. Second, implementing control-plane FS functionalities is complicated and requires considerably sophisticated techniques [12, 57, 69, 71]. The shared

Figure 3: Log Arena Design. (a) The Arena Layout and Syscall Execution Flow; (b) Three Cases of Operation Log Insertion.
operations, and coherence among cacheFS instances, are delegated to the shared log. Illustrated with the example in Figure 3a, the following sections elaborate on our delegation mechanisms and techniques for optimizing shared log persistence, insertion, and playback.

4.1.1 Delegating Durability to Log Persistence
We decouple log persistence from log ordering [25]. A syscall has an operation log (oplog) which includes a data log (dlog) and a meta log (mlog). Every cacheFS has a private PM region in the MN for storing dlogs. A dlog contains an opcode, a file path, credentials, the meta object address, and a reuse field used in the collaborative log playback. Moreover, there is a global log order array for storing mlogs. An 8-byte mlog packs a 12-bit cacheFS ID (CID), a 2-byte path fingerprint, a 26-bit dlog region offset, a 9-bit dlog size, and a 1-bit flag indicating whether this oplog is associated with a rename/symlink syscall or another syscall.
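To make the 8-byte layout concrete, the following is a minimal C sketch of how such an mlog could be packed. The field names and the bitfield packing order are illustrative assumptions rather than Ethane's actual definitions; only the bit widths come from the description above (12 + 16 + 26 + 9 + 1 = 64 bits).

    #include <stdint.h>

    /* Illustrative packing of the 8-byte mlog described above.
     * Field names are hypothetical; widths follow the text. */
    struct mlog {
        uint64_t cid          : 12;  /* cacheFS ID (CID)                         */
        uint64_t fingerprint  : 16;  /* 2-byte path fingerprint                  */
        uint64_t dlog_off     : 26;  /* offset into the per-cacheFS dlog region  */
        uint64_t dlog_size    : 9;   /* dlog size                                */
        uint64_t rename_link  : 1;   /* 1 if the oplog is a rename/symlink       */
    };

Keeping the mlog within 8 bytes lets a single 8-byte RDMA_CAS install it atomically into an order-array slot, which the log arena insertion below relies on.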
As shown in Figure 3a, the cacheFS creates a dlog and a mlog for a mkdir syscall. It first writes the dlog to the private region via an RDMA_WRITE ( 1 ). Then it issues an RDMA_READ on the queue pair that issued the RDMA_WRITE in order to flush the MN's PCIe buffer [68]. Persisting the dlog ensures the data durability of this syscall. Leveraging the in-order delivery property provided by commodity RNICs [66], we issue these two RDMA requests simultaneously.
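A minimal sketch of this write-then-read persistence pattern with libibverbs is shown below. It assumes an already-connected RC queue pair, pre-registered memory regions, and the MN-side dlog address/rkey; persist_dlog and its parameters are illustrative names, not Ethane's API.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* Post an RDMA_WRITE of the dlog followed by an RDMA_READ on the same QP.
     * RC QPs execute requests in order, so the READ completes only after the
     * WRITE has reached the MN; reading back through the NIC flushes the MN's
     * PCIe buffer so the dlog becomes durable in PM (the technique from [68]). */
    static int persist_dlog(struct ibv_qp *qp,
                            void *dlog, uint32_t len, uint32_t lkey,
                            uint64_t remote_addr, uint32_t rkey,
                            void *flush_buf, uint32_t flush_lkey)
    {
        struct ibv_sge wsge = {
            .addr = (uint64_t)(uintptr_t)dlog, .length = len, .lkey = lkey };
        struct ibv_send_wr wwr = {
            .wr_id = 1, .sg_list = &wsge, .num_sge = 1,
            .opcode = IBV_WR_RDMA_WRITE,
            .wr.rdma.remote_addr = remote_addr, .wr.rdma.rkey = rkey,
        };

        struct ibv_sge rsge = {
            .addr = (uint64_t)(uintptr_t)flush_buf, .length = 1, .lkey = flush_lkey };
        struct ibv_send_wr rwr = {
            .wr_id = 2, .sg_list = &rsge, .num_sge = 1,
            .opcode = IBV_WR_RDMA_READ, .send_flags = IBV_SEND_SIGNALED,
            .wr.rdma.remote_addr = remote_addr, .wr.rdma.rkey = rkey,
        };

        wwr.next = &rwr;                 /* post both requests with one doorbell */
        struct ibv_send_wr *bad = NULL;
        return ibv_post_send(qp, &wwr, &bad);
        /* The caller later polls the CQ for wr_id 2 before treating the dlog as durable. */
    }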
4.1.2 Delegating Linearizability to Log Ordering
The control-plane FS provides a compatible linearizability model instead of a relaxed consistency model for clients [48]. The shared log approach turns concurrent syscall execution into a sequential history of oplogs. The cacheFS writes the mlog to the global log order array ( 2 ). The mlog order in the order array reflects the order of the corresponding syscall executions. Every cacheFS instance replays the same log sequence as if these syscalls took place locally. Thus, producing a sequence of mlogs with respect to linearizable syscall executions is the key to a linearizable cacheFS design.

To achieve a valid log sequence, a naive solution is using RDMA_CAS to append mlogs to a list one by one [14, 15]. Imposing a strict order with RDMA_CAS is expensive because modern RNICs use an internal lock to serialize concurrent RDMA_CASes [40, 66]. High RNIC contention renders the mlog list tail a severe scalability bottleneck.

Our insight is that producing a sequence of mlogs for linearizable syscalls does not require linearizable mlog appends. We propose the log arena mechanism, which dramatically reduces mlog insert contention while still enforcing a valid log order with respect to linearizable syscall executions.

Figure 3a shows that the mlog region is partitioned into a series of arenas. An arena consists of a number of slots for storing mlogs. An arena insertion consists of two steps: (1) a mlog insertion and (2) filling preceding empty slots. In step (1), every cacheFS randomly picks an empty slot in the current active arena, uses an RDMA_CAS to insert the mlog, and persists it with an RDMA_READ (case I in Figure 3b). In an ideal case, no preceding empty slots exist when C1-C4 finish their arena insertions. Thus, step (2) is omitted and these arena insertions have no contention. Also, they generate a valid log sequence. The trick is that C1-C4 are concurrent cacheFS instances; hence, there are no order restrictions for the associated mlog insertions.

When a cacheFS finishes the arena insertion, it should ensure that there are no empty slots preceding the inserted mlog_x. Otherwise, if a subsequent cacheFS inserts a mlog_y into one of those empty slots, mlog_y precedes mlog_x in the log history and linearizability is violated. To prevent this, in case II, C4 scans the preceding slots and fills empty slots with pseudo mlogs. A pseudo mlog represents a null operation. The empty slot belongs to the in-flight C1. C1 and C4 are concurrent instances. If C4 precedes C1, C4's pseudo mlog insertion may cause C1's insertion to fail; C1 would then re-insert the mlog.

Scanning and filling empty slots increases the arena insertion latency and causes contention for concurrent insert operations. We introduce two optimizations. First, the number of slots in an arena is set to be smaller than the number of concurrent threads. Thus, all empty slots are likely to be filled by threads and pseudo log insertion rarely happens. Second, we perform step (2) after log playback ( 3 ), which creates a time window for those in-flight arena insertions. Both optimizations minimize the likelihood of empty slots for concurrent arena insertions.
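The following C-style sketch summarizes the arena insertion path: step (1) plus the deferred step (2). The slot encodings, the arena layout, and the rdma_cas/rdma_read_persist helpers are hypothetical stand-ins for Ethane's actual RDMA wrappers, shown only to make the two-step protocol explicit.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define EMPTY_SLOT   0ULL   /* assumed encoding of an unused order-array slot */
    #define PSEUDO_MLOG  1ULL   /* assumed encoding of a null (pseudo) mlog       */

    /* Hypothetical one-sided RDMA helpers over the MN-resident order array. */
    extern bool rdma_cas(uint64_t slot_addr, uint64_t expect, uint64_t new_val);
    extern void rdma_read_persist(uint64_t slot_addr); /* read-back flushes the PCIe buffer */

    /* Step (1): place the mlog into a randomly chosen empty slot of the active arena. */
    static int arena_insert(uint64_t arena_base, int nr_slots, uint64_t mlog)
    {
        for (;;) {                                       /* a full arena would advance to the
                                                            next arena (omitted in this sketch) */
            int slot = rand() % nr_slots;
            uint64_t addr = arena_base + (uint64_t)slot * sizeof(uint64_t);
            if (rdma_cas(addr, EMPTY_SLOT, mlog)) {      /* contend only on this one slot */
                rdma_read_persist(addr);                 /* make the mlog durable */
                return slot;
            }
            /* slot already taken: retry with another random slot */
        }
    }

    /* Step (2), deferred until after log playback: fill any still-empty preceding
     * slots with pseudo mlogs so no later insertion can be ordered before ours. */
    static void fill_preceding(uint64_t arena_base, int my_slot)
    {
        for (int s = 0; s < my_slot; s++) {
            uint64_t addr = arena_base + (uint64_t)s * sizeof(uint64_t);
            (void)rdma_cas(addr, EMPTY_SLOT, PSEUDO_MLOG); /* losing the race means a real mlog won */
        }
    }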
In case III, C3 is not concurrent with the other three cacheFS instances. There are no empty slots after the other three cacheFSs insert their mlogs. Thus, C3's mlog is located behind their mlogs in the arena. When C3 finishes its mlog insertion, there exist two empty slots ahead of its mlog. It fills these
the mkdir locally and returns the execution result ( 4 ).

The non-nilext interface property forces the log playback to appear on the syscall execution path. To resolve the bottleneck, we propose two techniques: file-lineage-based log dependence check and collaborative log playback.

Log dependence check. Replaying a long sequence of logs significantly increases the total latency. To reduce the log playback sequence length, the cacheFS aims to play only dependent logs. To this end, we need to answer two questions: (a) which oplogs are dependent? and (b) how to identify dependent oplogs quickly?

Figure 4: Direct, Step, and Remote Lineage. (a) Direct Lineage, e.g., mkdir(/a/c/g); (b) Step Lineage, e.g., rename(/a/c, /b/e); (c) Remote Lineage, e.g., symlink(/a, /b).
To answer (a), we validate the dependence of two file system operations based on file lineage. The direct lineage of a file f is defined as the set of files whose paths are a prefix of f's path. Figure 4(a) illustrates an example: /a/c/g's direct lineage includes /, /a, and /a/c. An operation on f depends on operations whose files and directories belong to f's lineage. For instance, removing /a or disabling /a's read permission causes /a/c/g to become inaccessible.
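As a concrete illustration of the direct-lineage rule, the sketch below checks whether one oplog's path lies in another file's direct lineage, i.e., whether it is a whole-component path prefix. The helper name and its exact semantics are our own simplification of the dependence check, not Ethane's code.

    #include <stdbool.h>
    #include <string.h>

    /* Returns true if 'ancestor' is in the direct lineage of 'path', i.e., it is
     * "/" or a whole-component prefix of 'path' (e.g., "/a/c" for "/a/c/g",
     * but not "/a/cx"). An oplog on 'ancestor' is then a dependency of 'path'. */
    static bool in_direct_lineage(const char *ancestor, const char *path)
    {
        size_t alen = strlen(ancestor);

        if (strcmp(ancestor, "/") == 0)
            return true;                      /* the root precedes every path        */
        if (strncmp(path, ancestor, alen) != 0)
            return false;                     /* not a byte-wise prefix              */
        return path[alen] == '/';             /* must end on a component boundary    */
    }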
A directory rename operation changes f's existing lineage. Figure 4(b) shows that rename(/a/c, /b/e) moves files from /a/c to /b/e. Files in /a/c change their lineage. The new lineage is called the step lineage. Dependence checking of oplogs behind the rename's oplog uses the step lineage.

The symbolic link (symlink) adds a new file lineage. If a source directory is a symlink and points to a directory which does not belong to its lineage, the lineage of the target directory is the new lineage for children in the source directory. We call the new lineage the remote lineage. Figure 4(c) shows that /a/c is a symlink and points to /b. /a/c/g has a remote lineage /b. Dependence checking of oplogs behind the symlink's oplog uses both the direct and remote lineage.
To answer (b), we design a mlog skip table to reduce the log playback range. Each cacheFS has a volatile, private mlog skip table. Every file has a corresponding table entry which

Figure 5: Fast, Collaborative Log Playback

During log playback, the cacheFS reads mlogs one by one and checks their dependence ( 3 ). In particular, we calculate the path fingerprints in the lineage and compare each of them with the path fingerprint stored in the mlog. If one matches, this oplog is dependent. Thus, the cacheFS reads the dlog and performs the associated operation to update the cacheFS state. In addition, if an mlog is associated with a rename or a symlink syscall, the associated step or direct and remote lineage is used for checking log dependence.

Collaborative log playback. The log playback performs history operations to derive a coherent cacheFS state. A complete file syscall has a long execution path. The collaborative log playback accelerates syscall execution by reusing partial execution results of other log playback routines. Specifically, almost all metadata syscalls are composed of two parts: a file path walk and the final file modification (e.g., changing file credentials). The file path walk is lengthy as it consists of a series of path component resolutions, which occupies a large portion of the total execution time [19, 50].

We aim to reuse the path walk results of other log playbacks. This reuse mechanism is feasible. Suppose the current log playback contains a log. This log has a deterministic order in history as well as a set of dependent logs. If another cacheFS has played this log before, these dependent logs have already been replayed in its local state. The file path walk of this oplog in these two playbacks produces the same result.

We add a reuse field in the dlog. A set field indicates that this log has been played before. Hence, the current playback routine performs a partial log replay ( 4.2 ) by fetching the path resolution result and modifying the file directly. Otherwise, it performs a complete log replay ( 4.1 ).
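Putting the dependence check and the reuse field together, the playback loop could look like the following sketch. The mlog/dlog accessors, lineage_fingerprints, and the replay helpers are hypothetical names used for illustration; only the overall flow (fingerprint match, then dlog fetch, then partial or full replay) follows the description above.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct mlog { uint16_t fingerprint; /* other packed fields omitted */ };
    struct dlog { bool reuse;           /* opcode, path, credentials, meta addr ... */ };

    /* Hypothetical helpers. */
    extern struct mlog  read_mlog(size_t idx);                    /* from the order array  */
    extern struct dlog *fetch_dlog(const struct mlog *m);         /* from the dlog region  */
    extern size_t lineage_fingerprints(uint16_t *fp, size_t max); /* lineage path fingerprints */
    extern void partial_replay(struct dlog *d);                   /* reuse cached path walk */
    extern void full_replay(struct dlog *d);                      /* walk the path, then apply */

    static void playback(size_t from, size_t to)
    {
        uint16_t fp[64];
        size_t n = lineage_fingerprints(fp, 64);

        for (size_t i = from; i < to; i++) {
            struct mlog m = read_mlog(i);
            bool dependent = false;
            for (size_t j = 0; j < n; j++)           /* lineage-based dependence check */
                if (m.fingerprint == fp[j]) { dependent = true; break; }
            if (!dependent)
                continue;                            /* skip unrelated oplogs */

            struct dlog *d = fetch_dlog(&m);
            if (d->reuse)
                partial_replay(d);                   /* another cacheFS played it before */
            else
                full_replay(d);
        }
    }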
4.2 Data-plane FS
The data-plane FS provides a shared, DPM-friendly storage layer. It unifies data management for diverse file system structures with a key-value-based storage paradigm and a vector
Figure 7: Disentangled File Path Walk
We decompose the path walk into a batch of dentry lookups and the remaining operations. A dentry lookup equals a meta object search in our data-plane FS. Suppose a syscall unlink(/a/b/c); the associated path /a/b/c contains three components, so the file path has three prefix paths. In Figure 7, to find the associated meta objects for the three prefix paths, the cacheFS issues three lookups via one vec_kv_get invocation. The vec_kv_get computes six hash values for the three keys, performs six parallel, dependence-free hash lookups, and returns six lookup results. The return values may contain false meta objects, i.e., their file paths do not belong to /a/b/c's lineage. To filter out the correct objects quickly, we perform a swift sanity check on the returned meta objects before exact filename comparisons.

Assume we check the returned meta objects for the prefix path /a. We compare the parent directories of the returned meta objects with the root directory and abandon those objects with an incorrect parent. We use the correct meta object as the new parent directory and repeat this sanity check with the next prefix path /a/b. Note that the root directory has an empty parent directory. After finding all correct meta objects, the cacheFS performs the remaining operations at once.
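A condensed sketch of this disentangled path walk is shown below. It reuses the vec_kv_get name from Algorithm 1, but the key/value types and the parent-checking helper are illustrative assumptions about the interface (one candidate per key, all prefixes precomputed) rather than its actual definition.

    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_COMP 64

    struct meta_obj { const char *path; struct meta_obj *parent; /* ... */ };

    /* Hypothetical vector-lookup interface of the data-plane FS: one invocation
     * issues all key lookups in parallel and may return false candidates. */
    extern void vec_kv_get(const char **keys, struct meta_obj **cands, size_t n);

    /* Resolve every prefix path (e.g., "/a", "/a/b", "/a/b/c") with one batched
     * lookup, then filter candidates with the parent sanity check.
     * Assumes n <= MAX_COMP. */
    static bool path_walk(const char **prefixes, size_t n, struct meta_obj **out)
    {
        struct meta_obj *cands[MAX_COMP];
        vec_kv_get(prefixes, cands, n);      /* parallel, dependence-free lookups   */

        struct meta_obj *parent = NULL;      /* the root has an empty parent        */
        for (size_t i = 0; i < n; i++) {
            if (cands[i] == NULL || cands[i]->parent != parent)
                return false;                /* false candidate or missing dentry   */
            out[i] = cands[i];
            parent = cands[i];               /* next component must hang off this one */
        }
        return true;                         /* remaining operations run afterwards */
    }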
File data read. For a read(int fd, void* buf, off_t offset, size_t count), the file system performs a file mapping by traversing the extent tree, visiting tree node entries (data access), performing a binary search (data processing), and comparing the extent range with the requested range (data processing). Then it reads the block data (data access) and repeats this process until all requested data are fetched. This data read consists of a number of sequential extent tree lookups and block reads, resulting in a relatively long read path.

Our data-plane FS decomposes the data IO path into a series of disentangled file mappings and parallel data reads, as shown in Algorithm 1. First, our file mapping performs batched data section lookups to find extents. The endpoint addr is initialized as the left point of the lookup range
Algorithm 1 (excerpt): Disentangled data read

     section3_start);                        // get a vector of keys
 8   vec_kv_get(k_vec, v_vec);               // batch section lookup
 9   extents[i] = filter_sections(v_vec);
10   union_max += extents[i]->range_size;    // extend the union range
11   addr += extents[i]->range_size; i++;    // update lookup endpoint
12   blks = get_blocks(extents);             // get blocks in all extents
13   buf = read_blocks(blks, offset, count); // parallel block reads

The vec_kv_get performs DPM-friendly, batched lookups and returns a vector of data sections. We identify the correct section by validating the candidate sections' ranges and get the associated extent in line 9. After that, we extend the union range of the found extents in line 10 and update the lookup endpoint in line 11. If the union range cannot cover the requested range, we perform another extent lookup. Otherwise, we get all blocks for the found extents in line 12. The data-plane FS stripes a file in units of extents across the PM devices in the memory pool, which facilitates parallel data reads and writes to file blocks belonging to disjoint extents. Hence we parallelize the data block reads in line 13.
4.2.3 Log Ingestion
The sharedFS ingests shared logs to update its states. The log ingestion is split into two phases. First, every CN runs a log ingestion worker; suppose there are N workers in total. Logs belonging to the same file are dependent. Every worker i scans all shared logs and gathers those logs whose fingerprint % N == i into its working set. This ensures that dependent logs are processed by the same worker. The worker fetches and replays operation logs to forward the cacheFS states. When it finishes replaying logs, it translates the data in the namespace cache and block cache into corresponding key-value tuples and feeds these key-value data to the sharedFS via a vector put invocation. Second, the sharedFS ingests these data by creating, inserting, or updating the associated key-value tuples.
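The fingerprint-based sharding of the first phase can be sketched as follows; the scan/replay/vector-put helpers and the worker structure are hypothetical, illustrating only the fingerprint % N partitioning rule described above.

    #include <stddef.h>
    #include <stdint.h>

    struct oplog { uint16_t fingerprint; /* mlog fields + dlog payload */ };

    /* Hypothetical helpers around the shared log and the sharedFS KV interface. */
    extern size_t scan_shared_logs(struct oplog *logs, size_t max);
    extern void   replay_into_caches(const struct oplog *log);   /* namespace/block caches */
    extern void   vector_put_caches_to_sharedfs(void);           /* phase 1 -> phase 2 handoff */

    /* Phase 1: worker 'i' of 'n' ingests only the logs hashed to it, so that all
     * logs of the same file (same path fingerprint) land on the same worker. */
    static void ingestion_worker(unsigned i, unsigned n)
    {
        struct oplog logs[4096];
        size_t cnt = scan_shared_logs(logs, 4096);

        for (size_t k = 0; k < cnt; k++)
            if (logs[k].fingerprint % n == i)    /* dependent logs share a worker */
                replay_into_caches(&logs[k]);

        vector_put_caches_to_sharedfs();         /* feed key-value tuples to sharedFS */
    }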
4.3 Implementation
We develop the Ethane prototype from scratch. Its source code is available at https://github.com/miaogecm/Ethane.git. It consists of 10910 lines of C code. The CN runs the Linux operating system to provide POSIX-compatible interfaces, efficient resource management, and data protection. The cacheFS is implemented as a user-level library which includes 4922 lines of C code.
Figure 8: Control-Plane FS and Data-Plane FS Evaluation Results. (a) Log Arena Scalability; (b) Log Replay Performance; (c) Path Walk Latency; (d) Data Read Throughput.
Every cacheFS instance runs as a state machine which replays logs to transition its local state. Multiple cacheFS instances are isolated from each other. CacheFS instances are classified into two categories: external cacheFSs and internal cacheFSs. An external cacheFS is linked with a client and serves user requests. Internal cacheFSs are used for log ingestion; there are no clients for internal cacheFS instances.

Table 1: Hardware Configuration of Disaggregated and Symmetric PM Systems

       CPU        DRAM           PMEM             SSD               NIC                Price
  CN   32 cores   8 GB DDR4      -                512 GB NVMe SSD   2×ConnectX-6 NIC   $3919
  MN   1 core     8 GB DDR4      4×128 GB DCPMM   512 GB NVMe SSD   2×ConnectX-6 NIC   $3463
  SN   16 cores   2×32 GB DDR4   2×128 GB DCPMM   512 GB NVMe SSD   ConnectX-6 NIC     $3789
Figure 9: # RDMA_CAS and PCIe Data Transfer Rate (Log+CAS vs. Log+Arena, varying the number of clients).

We collect the number of RDMA_CAS operations and report them in Figure 9. The Log+CAS introduces many more atomic RDMA operations than Log+Arena. In contrast, the Log+Arena incurs only a few RDMA_CAS operations. Moreover, Log+CAS has a low PM bandwidth utilization. RDMA_CAS causes heavy NIC lock contention which prevents the throughput from increasing. We use the Intel PMWatch tool [5] to measure the number of write operations received from the memory controller, i.e., ddrt_write_ops. The RNIC accesses the DCPMM via the PCIe bus and the on-chip PCIe controller forwards requests to the memory controller. Thus, the increase rate of ddrt_write_ops approximates the PCIe data transfer rate. The Log+Arena achieves a much higher PCIe data transfer rate. Its peak rate is 3× higher than Log+CAS, leading to better PM bandwidth utilization.

Log playback efficiency. Our log playback consists of two techniques: lineage-based log dependence check and collaborative log playback. We first evaluate the effectiveness of the log dependence check. In the experiment, every client creates files in the directory /ethane-<X> where 1 ≤ X ≤ K. Experiments vary K from 64 to 256. Before a client creates a file, it must replay all dependent logs. Hence, a small K causes more dependent logs.

The baseline replays logs one by one without any dependence check or reuse optimizations. Figure 8b shows that the baseline throughput is only 5 Kops/s. The log dependence check optimization is effective. When K increases, the number of dependent logs reduces, filtering out a large number of unrelated logs during log replay. The +DepCheck improves the throughput by up to 1.7 Mops/s. The log reuse mechanism brings more performance benefits. Because all creat syscalls share a common prefix path, the collaborative log playback utilizes this to accelerate the execution path. It performs 42.21% better than +DepCheck on average.

5.2 Data-plane FS Evaluation
The data-plane FS incorporates DPM-friendly, disentangled data paths. This section analyzes two representative data paths: file path walk and file data read.

Path walk latency. This experiment preloads a Linux-4.15 source code repository. Then it creates 256 clients and each client accesses a file in the directory tree via stat. We change
Para-Walk delivers a much more stable latency than Seq-Walk. The Seq-Walk is non-scalable: its total path walk latency is proportional to the number of path components. Its path walk latency at 9 components takes 263 µs, which is 3.09× higher than that at 1 component.

The Para-Walk decouples the component access from component processing. It utilizes the vector lookup interface to perform parallel component lookups, which hides the RDMA latency and delivers high network bandwidth utilization. To confirm this, we profile the NIC IOPS at the receiver side. Figure 8c shows that Para-Walk achieves up to 2.43× higher IOPS than Seq-Walk.

Data read throughput. This experiment creates a client to perform random data reads to a file. The IO size is set to 4 KB. We change the file size from 4 KB to 64 GB. We compare two data path designs: the disentangled read path (Disent-Read) and the entangled read path (Ent-Read). The entangled read path uses extent tree-based block management. Figure 8d shows that Disent-Read performs much better than Ent-Read, especially for large file sizes. When the file size is 64 GB, its throughput is 5.76× higher than that of Ent-Read. This is because the file mapping dominates the large file read time.

Table 2: # Pointer Chasing and % File Mapping Time

               # Pointer Chasing         % File Mapping Time
  File Size    Disent-Read  Ent-Read     Disent-Read  Ent-Read
  4 KB         2.0          2.0          28.2%        49.7%
  256 KB       2.0          2.76         31.4%        62.2%
  16 MB        2.0          3.89         29.9%        77.8%
  1 GB         2.0          4.38         29.7%        80.2%
  64 GB        2.0          5.41         30.5%        86.7%

When the file size is large, the extent tree is tall. As a result, Ent-Read needs to traverse many tree levels to find a data block. To analyze the performance overheads more deeply, Table 2 reports the number of pointer chasings during the two data paths. It shows that Ent-Read requires 5.41 pointer chasings in the extent tree traversal for a 64 GB file read, and the file mapping occupies 86.7% of the total time. Thanks to the data section-based block management, our Disent-Read only incurs two pointer chasings per file read regardless of the file size. Moreover, it only spends 28.2%-30.5% of the total time in file mapping.

5.3 Macrobenchmark Performance
End-to-end latency. We measure the end-to-end latency of system calls. Eight clients run on four nodes and send mkdir and stat requests to the backend file system servers. Figure 10 reports the measured latencies. In the first experiment, every client creates directories in /ethane-<X> where X=rand(1,K). We vary K from 1 to 10. As shown in Figure 10a, the deep software stack and distributed namespace tree in CephFS causes over
Figure 10: System call Latency. (a) mkdir; (b) stat.

with client-local PMs, its chain replication protocol incurs high remote write overheads. Specifically, every time a client creates a file, it needs to propagate this data update to all other remote nodes in sequence.
In the second experiment, every client accesses files via stat. The file access pattern follows a Zipfian distribution with a parameter α. A large α indicates a skewed access pattern. Both CephFS and Octopus suffer from the load imbalance issue: their stat latencies increase linearly as α becomes larger. Assise replicates hot data in client-local NVMs; thus, skewed file access improves its performance. Similarly, the namespace cache design in cacheFS also helps Ethane avoid the load imbalance issue.

Metadata scalability. We use MDTest [7] to evaluate metadata performance. We run an MPICH framework [8] to generate MDTest processes across server nodes. Before the experiment, every process creates a private directory hierarchy for each client. Then, every client creates two million files, accesses them, and removes these files. Figure 11 shows the metadata performance of the four file systems.
Figure 11: MDTest Performance. (a) creat; (b) unlink; (c) stat (random); (d) stat (skew).

For file creation and removal, Ethane outperforms the other three distributed file systems. Octopus only assigns one worker per data server. It is unable to process massive client requests efficiently, which causes severe weak-node-capability issues. The total throughput of Octopus stops increasing at 48

We evaluate two settings of file stat. In the first setting, the file systems randomly access files and the experiment varies the file path length. Octopus achieves approximately 1 Mops/s regardless of the number of path components. The path resolution in Octopus is not POSIX-compliant: it hashes the whole path without any individual path component resolution. In contrast, Ethane faithfully resolves every component and is still 1.8× faster than Octopus thanks to the parallel path walk design. Assise's throughput is close to that of Ethane because there is no data replication during file stat and all data access happens on local devices. However, random file access causes numerous PM misses for Assise. Loading data from SSD-based storage delivers a similar or even longer latency than the RDMA_READ latency.

The second setting uses a non-uniform access popularity. Both Octopus and CephFS configure one worker per SN. Therefore, their total throughput decreases dramatically when the file access becomes more skewed. Fortunately, Ethane has no such load imbalance concern: all stat requests are evenly distributed among cacheFS instances.

Figure 12: Fio Throughput. (a) read; (b) write.
Data IO throughput. We use fio [4] to evaluate the data R/W performance. We spawn 32 clients per CN/SN. All clients perform data reads to a shared file. We measure the data throughput with different IO sizes in Figure 12. CephFS performs worst: data de-/serialization, message encapsulation, and extra data copies in its message-based RPC lead to low throughput. Its peak throughput only approaches 6 GB/s. Octopus has load imbalance issues. When the IO size increases, its total throughput is bounded by a single node. Besides, Octopus achieves better performance than CephFS for small-sized IOs. Its client-active IO uses one-sided RDMA_READ to
1 GB video files. During the experiments, a vidwriter writes new videos and fifteen clients read these video files with different IO sizes. This workload is read-intensive and stresses the file system IO performance.

Figure 13: Performance-cost Evaluation ([MB/s]/$ under different IO sizes).

A single PM device only offers a peak of 6 GB/s bandwidth. For large IO sizes, file systems need to add more PM devices to serve the IO requests. Symmetric PM file systems add more monolithic SNs, while the disaggregated PM file system plugs more PM modules into the MN. For different IO sizes, we choose the most cost-effective hardware configuration for each file system, i.e., the configuration that just fits this workload. For example, for an IO size of 32 KB, CephFS requires two SNs. These two SNs contain four PM devices whose total bandwidth is sufficient for running the workload; one SN or three SNs would be under- or over-provisioned.
Figure 13 shows the performance-cost efficiency. Ethane yields the highest throughput (i.e., MB/s) per dollar. The reasons are twofold. First, the disentangled data path design in Ethane brings better PM utilization than the other file systems. Second, when PM resources become scarce due to increased

Figure 14: Application Performance. (a) Redis Cluster (AOF throughput, RDB latency); (b) Metis (computing and IO phase latency).

Metis. We run a multicore-optimized MapReduce application, Metis [18]. We use Metis to run WordCount with a 16 GB input file. We configure two SNs for the symmetric PM file systems and 0.5 CN plus 1.5 MNs for Ethane. The half CN and MN are enabled by using one NUMA node of the corresponding machine. This ensures that the total costs of 2 SNs and 0.5 CN + 1.5 MNs are approximately equal. The two SNs have twice as many cores as the half CN but deliver a similar computing phase latency, which suggests that the computing resource is over-provisioned for the symmetric PM file systems. On the other side, Ethane has a shorter IO phase latency; the four PM devices are under-provisioned for the symmetric PM file systems. Thanks to its elastic resource scaling, Ethane yields superior performance to the others at the same hardware cost.

6 Related Works
Distributed file systems. For the past decades, DFSs have played a critical role in large-scale data storage. Conventional file systems decouple metadata from data management, e.g.,
[64] Yan Sun, Yifan Yuan, Zeduo Yu, Reese Kuper, Chihun Song, Jinghan Huang, Houxiang Ji, Siddharth Agarwal, Jiaqi Lou, Ipoom Jeong, Ren Wang, Jung Ho Ahn, Tianyin Xu, and Nam Sung Kim. Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices. In 56th Annual IEEE/ACM International Symposium on Microarchitecture, Toronto, ON, Canada, 28 October - 1 November, 2023, pages 105–121.

[65] Shin-Yeh Tsai, Yizhou Shan, and Yiying Zhang. Disaggregating Persistent Memory and Controlling Them Remotely: An Exploration of Passive Disaggregated Key-Value Stores. In 2020 USENIX Annual Technical Conference, July 15-17, 2020, pages 33–48.

[66] Qing Wang, Youyou Lu, and Jiwu Shu. Sherman: A Write-Optimized Distributed B+ Tree Index on Disaggregated Memory. In International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, pages 1033–1048.

[67] Michael Wei, Amy Tai, Christopher J. Rossbach, Ittai Abraham, Maithem Munshed, Medhavi Dhawan, Jim Stabile, Udi Wieder, Scott Fritchie, Steven Swanson, Michael J. Freedman, and Dahlia Malkhi. vCorfu: A Cloud-Scale Object Store on a Shared Log. In 14th USENIX Symposium on Networked Systems Design and Implementation, Boston, MA, USA, March 27-29, 2017, pages 35–49.

[68] Xingda Wei, Xiating Xie, Rong Chen, Haibo Chen, and Binyu Zang. Characterizing and Optimizing Remote Persistent Memory with RDMA and NVM. In 2021 USENIX Annual Technical Conference, July 14-16, 2021, pages 523–536.

[69] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. Ceph: A Scalable, High-Performance Distributed File System. In 7th Symposium on Operating Systems Design and Implementation, November 6-8, 2006, Seattle, WA, USA, pages 307–320.

[70] Sage A. Weil, Kristal T. Pollack, Scott A. Brandt, and Ethan L. Miller. Dynamic Metadata Management for Petabyte-Scale File Systems. In Proceedings of the ACM/IEEE Conference on High Performance Networking and Computing, 6-12 November 2004, Pittsburgh, PA, USA, pages 1–12.

[72] Jian Yang, Joseph Izraelevitz, and Steven Swanson. Orion: A Distributed File System for Non-Volatile Main Memory and RDMA-Capable Networks. In 17th USENIX Conference on File and Storage Technologies, Boston, MA, February 25-28, 2019, pages 221–234.

[73] Jian Yang, Juno Kim, Morteza Hoseinzadeh, Joseph Izraelevitz, and Steven Swanson. An Empirical Guide to the Behavior and Use of Scalable Persistent Memory. In 18th USENIX Conference on File and Storage Technologies, Santa Clara, CA, USA, February 24-27, 2020, pages 169–182.

[74] Juncheng Yang, Yao Yue, and K. V. Rashmi. A large scale analysis of hundreds of in-memory cache clusters at Twitter. In 14th USENIX Symposium on Operating Systems Design and Implementation, Virtual Event, November 4-6, 2020, pages 191–208.

[75] Ming Zhang, Yu Hua, Pengfei Zuo, and Lurong Liu. FORD: Fast One-sided RDMA-based Distributed Transactions for Disaggregated Persistent Memory. In 20th USENIX Conference on File and Storage Technologies, Santa Clara, CA, USA, February 22-24, 2022, pages 51–68.

[76] Diyu Zhou, Vojtech Aschenbrenner, Tao Lyu, Jian Zhang, Sudarsun Kannan, and Sanidhya Kashyap. Enabling High-Performance and Secure Userspace NVM File Systems with the Trio Architecture. In 29th Symposium on Operating Systems Principles, Koblenz, Germany, October 23-26, 2023, pages 150–165.

[77] Hang Zhu, Kostis Kaffes, Zixu Chen, Zhenming Liu, Christos Kozyrakis, Ion Stoica, and Xin Jin. RackSched: A Microsecond-Scale Scheduler for Rack-Scale Computers. In 14th USENIX Symposium on Operating Systems Design and Implementation, Virtual Event, November 4-6, 2020, pages 1225–1240.

[78] Pengfei Zuo, Jiazhao Sun, Liu Yang, Shuangwu Zhang, and Yu Hua. One-sided RDMA-Conscious Extendible Hashing for Disaggregated Memory. In 2021 USENIX Annual Technical Conference, July 14-16, 2021, pages 15–29.