
NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories

Jian Xu and Steven Swanson, University of California, San Diego
https://www.usenix.org/conference/fast16/technical-sessions/presentation/xu

This paper is included in the Proceedings of the 14th USENIX Conference on File and Storage Technologies (FAST '16), February 22–25, 2016, Santa Clara, CA, USA. ISBN 978-1-931971-28-7.

Open access to the Proceedings of the 14th USENIX Conference on File and Storage Technologies is sponsored by USENIX.

Abstract

Fast non-volatile memories (NVMs) will soon appear on the processor memory bus alongside DRAM. The resulting hybrid memory systems will provide software with sub-microsecond, high-bandwidth access to persistent data, but managing, accessing, and maintaining consistency for data stored in NVM raises a host of challenges. Existing file systems built for spinning or solid-state disks introduce software overheads that would obscure the performance that NVMs should provide, but proposed file systems for NVMs either incur similar overheads or fail to provide the strong consistency guarantees that applications require.

We present NOVA, a file system designed to maximize performance on hybrid memory systems while providing strong consistency guarantees. NOVA adapts conventional log-structured file system techniques to exploit the fast random access that NVMs provide. In particular, it maintains separate logs for each inode to improve concurrency, and stores file data outside the log to minimize log size and reduce garbage collection costs. NOVA's logs provide metadata, data, and mmap atomicity and focus on simplicity and reliability, keeping complex metadata structures in DRAM to accelerate lookup operations. Experimental results show that in write-intensive workloads, NOVA provides 22% to 216× throughput improvement compared to state-of-the-art file systems, and 3.1× to 13.5× improvement compared to file systems that provide equally strong data consistency guarantees.

1. Introduction

Emerging non-volatile memory (NVM) technologies such as spin-torque transfer, phase change, and resistive memories [2, 28, 52] and Intel and Micron's 3D XPoint [1] technology promise to revolutionize I/O performance. Researchers have proposed several approaches to integrating NVMs into computer systems [11, 13, 19, 31, 36, 41, 58, 67], and the most exciting proposals place NVMs on the processor's memory bus alongside conventional DRAM, leading to hybrid volatile/non-volatile main memory systems [4, 51, 72, 78]. Combining faster, volatile DRAM with slightly slower, denser non-volatile main memories (NVMMs) offers the possibility of storage systems that combine the best characteristics of both technologies.

Hybrid DRAM/NVMM storage systems present a host of opportunities and challenges for system designers. These systems need to minimize software overhead if they are to fully exploit NVMM's high performance and efficiently support more flexible access patterns, and at the same time they must provide the strong consistency guarantees that applications require and respect the limitations of emerging memories (e.g., limited program cycles).

Conventional file systems are not suitable for hybrid memory systems because they are built for the performance characteristics of disks (spinning or solid state) and rely on disks' consistency guarantees (e.g., that sector updates are atomic) for correctness [47]. Hybrid memory systems differ from conventional storage systems on both counts: NVMMs provide vastly improved performance over disks while DRAM provides even better performance, albeit without persistence. And memory provides different consistency guarantees (e.g., 64-bit atomic stores) from disks.

Providing strong consistency guarantees is particularly challenging for memory-based file systems because maintaining data consistency in NVMM can be costly. Modern CPU and memory systems may reorder stores to memory to improve performance, breaking consistency in case of system failure. To compensate, the file system needs to explicitly flush data from the CPU's caches to enforce orderings, adding significant overhead and squandering the improved performance that NVMM can provide [6, 76].

Overcoming these problems is critical since many applications rely on atomic file system operations to ensure their own correctness. Existing mainstream file systems use journaling, shadow paging, or log-structuring techniques to provide atomicity. However, journaling wastes bandwidth by doubling the number of writes to the storage device, and shadow paging file systems require a cascade of updates from the affected leaf nodes to the root. Implementing either technique imposes strict ordering requirements that reduce performance.

Log-structured file systems (LFSs) [55] group small random write requests into a larger sequential write that hard disks and NAND flash-based solid state drives (SSDs) can process efficiently. However, conventional LFSs rely on the availability of contiguous free regions, and maintaining those regions requires expensive garbage collection operations. As a result, recent research [59] shows that LFSs perform worse than journaling file systems on NVMM.

To overcome all these limitations, we present the NOn-Volatile memory Accelerated (NOVA) log-structured file system. NOVA adapts conventional log-structured file system techniques to exploit the fast random access provided by hybrid memory systems. This allows NOVA to support massive concurrency, reduce log size, and minimize garbage collection costs while providing strong consistency guarantees for conventional file operations and mmap-based load/store accesses.

Several aspects of NOVA set it apart from previous log-structured file systems. NOVA assigns each inode a separate log to maximize concurrency during normal operation and recovery. NOVA stores the logs as linked lists, so they do not need to be contiguous in memory, and it uses atomic updates to a log's tail pointer to provide atomic log append. For operations that span multiple inodes, NOVA uses lightweight journaling.

NOVA does not log data, so the recovery process only needs to scan a small fraction of the NVMM. This also allows NOVA to immediately reclaim pages when they become stale, significantly reducing garbage collection overhead and allowing NOVA to sustain good performance even when the file system is nearly full.

In describing NOVA, this paper makes the following contributions:
• It extends existing log-structured file system techniques to exploit the characteristics of hybrid memory systems.
• It describes atomic mmap, a simplified interface for exposing NVMM directly to applications with a strong consistency guarantee.
• It demonstrates that NOVA outperforms existing journaling, shadow paging, and log-structured file systems running on hybrid memory systems.
• It shows that NOVA provides these benefits across a range of proposed NVMM technologies.

We evaluate NOVA using a collection of micro- and macro-benchmarks on a hardware-based NVMM emulator. We find that NOVA is significantly faster than existing file systems in a wide range of applications and outperforms file systems that provide the same data consistency guarantees by between 3.1× and 13.5× in write-intensive workloads. We also measure garbage collection and recovery overheads, and we find that NOVA provides stable performance under high NVMM utilization levels and fast recovery in case of system failure.

The remainder of the paper is organized as follows. Section 2 describes NVMMs, the challenges they present, and related work on NVMM file system design. Section 3 gives an overview of the NOVA architecture and Section 4 describes the implementation in detail. Section 5 evaluates NOVA, and Section 6 concludes.

2. Background

NOVA targets memory systems that include emerging non-volatile memory technologies along with DRAM. This section first provides a brief survey of NVM technologies and the opportunities and challenges they present to system designers. Then, we discuss how other file systems have provided atomic operations and consistency guarantees. Finally, we discuss previous work on NVMM file systems.

2.1. Non-volatile memory technologies

Emerging non-volatile memory technologies, such as spin-torque transfer RAM (STT-RAM) [28, 42], phase change memory (PCM) [10, 18, 29, 52], resistive RAM (ReRAM) [22, 62], and 3D XPoint memory technology [1], promise to provide fast, non-volatile, byte-addressable memories. Suzuki et al. [63] provide a survey of these technologies and their evolution over time.

These memories have different strengths and weaknesses that make them useful in different parts of the memory hierarchy. STT-RAM can meet or surpass DRAM's latency and it may eventually appear in on-chip, last-level caches [77], but its large cell size limits capacity and its feasibility as a DRAM replacement. PCM and ReRAM are denser than DRAM, and may enable very large, non-volatile main memories. However, their relatively long latencies make it unlikely that they will fully replace DRAM as main memory. The 3D XPoint memory technology recently announced by Intel and Micron is rumored to be one of these and to offer performance up to 1,000 times faster than NAND flash [1]. It will appear in both SSDs and on the processor memory bus. As a result, we expect to see hybrid volatile/non-volatile memory hierarchies become common in large systems.

2.2. Challenges for NVMM software

NVMM technologies present several challenges to file system designers. The most critical of these focus on balancing the memories' performance against software overheads, enforcing ordering among updates to ensure consistency, and providing atomic updates.

Performance The low latencies of NVMMs alter the trade-offs between hardware and software latency. In conventional storage systems, the latency of slow storage devices (e.g., disks) dominates access latency, so software efficiency is not critical. Previous work has shown that with fast NVMM, software costs can quickly dominate memory latency, squandering the performance that NVMMs could provide [7, 12, 68, 74].

Since NVMM memories offer low latency and will be on the processor's memory bus, software should be able to access them directly via loads and stores.
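As a concrete illustration of this direct-access model, the following minimal user-space sketch maps a file from a DAX-capable file system and updates it with ordinary stores; the path, sizes, and omission of error handling are illustrative assumptions, not part of any particular system's API.

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical file on a DAX-mounted NVMM file system. */
        int fd = open("/mnt/nvmm/data", O_RDWR);
        size_t len = 4096;

        /* Map the NVMM pages directly into the address space. */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        memcpy(p, "hello", 6);     /* update persistent data with ordinary stores */
        msync(p, len, MS_SYNC);    /* ask the file system to make the update durable */

        munmap(p, len);
        close(fd);
        return 0;
    }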

Recent NVMM-based file systems [21, 71, 73] bypass the DRAM page cache and access NVMM directly using a technique called Direct Access (DAX) or eXecute In Place (XIP), avoiding extra copies between NVMM and DRAM in the storage stack. NOVA is a DAX file system and we expect that all NVMM file systems will provide these (or similar) features. We describe currently available DAX file systems in Section 2.4.

Write reordering Modern processors and their caching hierarchies may reorder store operations to improve performance. The CPU's memory consistency protocol makes guarantees about the ordering of memory updates, but existing models (with the exception of research proposals [20, 46]) do not provide guarantees on when updates will reach NVMMs. As a result, a power failure may leave the data in an inconsistent state.

NVMM-aware software can avoid this by explicitly flushing caches and issuing memory barriers to enforce write ordering. The x86 architecture provides the clflush instruction to flush a CPU cacheline, but clflush is strictly ordered and needlessly invalidates the cacheline, incurring a significant performance penalty [6, 76]. Also, clflush only sends data to the memory controller; it does not guarantee the data will reach memory. Memory barriers such as Intel's mfence instruction enforce order on memory operations before and after the barrier, but mfence only guarantees all CPUs have the same view of the memory. It does not impose any constraints on the order of data writebacks to NVMM.

Intel has proposed new instructions that fix these problems, including clflushopt (a more efficient version of clflush), clwb (to explicitly write back a cache line without invalidating it) and PCOMMIT (to force stores out to NVMM) [26, 79]. NOVA is built with these instructions in mind. In our evaluation we use a hardware NVMM emulation system that approximates the performance impacts of these instructions.

Atomicity POSIX-style file system semantics require many operations to be atomic (i.e., to execute in an "all or nothing" fashion). For example, the POSIX rename requires that if the operation fails, neither the file with the old name nor the file with the new name shall be changed or created [53]. Renaming a file is a metadata-only operation, but some atomic updates apply to both file system metadata and data. For instance, appending to a file atomically updates the file data and changes the file's length and modification time. Many applications rely on atomic file system operations for their own correctness.

Storage devices typically provide only rudimentary guarantees about atomicity. Disks provide atomic sector writes and processors guarantee only that 8-byte (or smaller), aligned stores are atomic. To build the more complex atomic updates that file systems require, programmers must use more complex techniques.

2.3. Building complex atomic operations

Existing file systems use a variety of techniques like journaling, shadow paging, or log-structuring to provide atomicity guarantees. These work in different ways and incur different types of overheads.

Journaling Journaling (or write-ahead logging) is widely used in journaling file systems [24, 27, 32, 71] and databases [39, 43] to ensure atomicity. A journaling system records all updates to a journal before applying them and, in case of power failure, replays the journal to restore the system to a consistent state. Journaling requires writing data twice: once to the log and once to the target location, and to improve performance journaling file systems usually only journal metadata. Recent work has proposed back pointers [17] and decoupling ordering from durability [16] to reduce the overhead of journaling.

Shadow paging Several file systems use a copy-on-write mechanism called shadow paging [20, 8, 25, 54]. Shadow paging file systems rely heavily on their tree structure to provide atomicity. Rather than modifying data in-place during a write, shadow paging writes a new copy of the affected page(s) to an empty portion of the storage device. Then, it splices the new pages into the file system tree by updating the nodes between the pages and root. The resulting cascade of updates is potentially expensive.

Log-structuring Log-structured file systems (LFSs) [55, 60] were originally designed to exploit hard disk drives' high performance on sequential accesses. LFSs buffer random writes in memory and convert them into larger, sequential writes to the disk, making the best of hard disks' strengths.

Although LFS is an elegant idea, implementing it efficiently is complex, because LFSs rely on writing sequentially to contiguous free regions of the disk. To ensure a consistent supply of such regions, LFSs constantly clean and compact the log to reclaim space occupied by stale data.

Log cleaning adds overhead and degrades the performance of LFSs [3, 61]. To reduce cleaning overhead, some LFS designs separate hot and cold data and apply different cleaning policies to each [69, 70]. SSDs also perform best under sequential workloads [9, 14], so LFS techniques have been applied to SSD file systems as well. SFS [38] classifies file blocks based on their update likelihood, and writes blocks with similar "hotness" into the same log segment to reduce cleaning overhead. F2FS [30] uses multi-head logging, writes metadata and data to separate logs, and writes new data directly to free space in dirty segments at high disk utilization to avoid frequent garbage collection.
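To connect the write-reordering discussion in Section 2.2 to code, the following sketch persists a single 8-byte value using the flush-and-fence pattern described above, written with the standard x86 compiler intrinsics. The variables are illustrative and assumed to reside in NVMM; on hardware with clflushopt/clwb and PCOMMIT, the flush and commit steps would use those instructions instead.

    #include <emmintrin.h>   /* _mm_clflush, _mm_sfence */
    #include <stdint.h>

    /* Illustrative persistent variables, assumed to live in NVMM. */
    extern uint64_t pmem_data;
    extern uint64_t pmem_valid;

    void persist_update(uint64_t v)
    {
        pmem_data = v;              /* write the data */
        _mm_clflush(&pmem_data);    /* push the cache line toward the memory controller */
        _mm_sfence();               /* order the flush before the commit flag */

        pmem_valid = 1;             /* 8-byte aligned store publishes the update atomically */
        _mm_clflush(&pmem_valid);
        _mm_sfence();
        /* Note: as discussed above, clflush alone does not guarantee the data has
         * reached NVMM; that is what PCOMMIT-style support is meant to address. */
    }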

RAMCloud [44] is a DRAM-based storage system that keeps all its data in DRAM to service reads and maintains a persistent version on hard drives. RAMCloud applies log structure to both DRAM and disk: it allocates DRAM in a log-structured way, achieving higher DRAM utilization than other memory allocators [56], and stores the backup data in logs on disk.

2.4. File systems for NVMM

Several groups have designed NVMM-based file systems that address some of the issues described in Section 2.2 by applying one or more of the techniques discussed in Section 2.3, but none meet all the requirements that modern applications place on file systems.

BPFS [20] is a shadow paging file system that provides metadata and data consistency. BPFS proposes a hardware mechanism to enforce store durability and ordering. BPFS uses short-circuit shadow paging to reduce shadow paging overheads in common cases, but certain operations that span a large portion of the file system tree (e.g., a move between directories) can still incur large overheads.

PMFS [21, 49] is a lightweight DAX file system that bypasses the block layer and file system page cache to improve performance. PMFS uses journaling for metadata updates. It performs writes in place, so they are not atomic.

Ext4-DAX [71] extends Ext4 with DAX capabilities to directly access NVMM, and uses journaling to guarantee metadata update atomicity. The normal (non-DAX) Ext4 file system has a data-journal mode to provide data atomicity. Ext4-DAX does not support this mode, so data updates are not atomic.

SCMFS [73] utilizes the operating system's virtual memory management module and maps files to large contiguous virtual address regions, making file accesses simple and lightweight. SCMFS does not provide any consistency guarantee for metadata or data.

Aerie [66] implements the file system interface and functionality in user space to provide low-latency access to data in NVMM. It has an optimization that improves performance by relaxing POSIX semantics. Aerie journals metadata but does not support data atomicity or the mmap operation.

3. NOVA Design Overview

NOVA is a log-structured, POSIX file system that builds on the strengths of LFS and adapts them to take advantage of hybrid memory systems. Because it targets a different storage technology, NOVA looks very different from conventional log-structured file systems that are built to maximize disk bandwidth.

We designed NOVA based on three observations. First, logs that support atomic updates are easy to implement correctly in NVMM, but they are not efficient for search operations (e.g., directory lookup and random access within a file). Conversely, data structures that support fast search (e.g., tree structures) are more difficult to implement correctly and efficiently in NVMM [15, 40, 65, 75]. Second, the complexity of cleaning logs stems primarily from the need to supply contiguous free regions of storage, but this is not necessary in NVMM, because random access is cheap. Third, using a single log makes sense for disks (where there is a single disk head and improving spatial locality is paramount), but it limits concurrency. Since NVMMs support fast, highly concurrent random accesses, using multiple logs does not negatively impact performance.

Based on these observations, we made the following design decisions in NOVA.

Keep logs in NVMM and indexes in DRAM. NOVA keeps log and file data in NVMM and builds radix trees [35] in DRAM to quickly perform search operations, making the in-NVMM data structures simple and efficient. We use a radix tree because there is a mature, well-tested, widely-used implementation in the Linux kernel. The leaves of the radix tree point to entries in the log, which in turn point to file data.

Give each inode its own log. Each inode in NOVA has its own log, allowing concurrent updates across files without synchronization. This structure allows for high concurrency both in file access and during recovery, since NOVA can replay multiple logs simultaneously. NOVA also guarantees that the number of valid log entries is small (on the order of the number of extents in the file), which ensures that scanning the log is fast.

Use logging and lightweight journaling for complex atomic updates. NOVA is log-structured because this provides cheaper atomic updates than journaling and shadow paging. To atomically write data to a log, NOVA first appends data to the log and then atomically updates the log tail to commit the updates, thus avoiding both the duplicate-write overhead of journaling file systems and the cascading update costs of shadow paging systems.

Some directory operations, such as a move between directories, span multiple inodes, and NOVA uses journaling to atomically update multiple logs. NOVA first writes data at the end of each inode's log, and then journals the log tail updates to update them atomically. NOVA journaling is lightweight since it only involves log tails (as opposed to file data or metadata) and no POSIX file operation operates on more than four inodes.

Implement the log as a singly linked list. The locality benefits of sequential logs are less important in NVMM-based storage, so NOVA uses a linked list of 4 KB NVMM pages to hold the log and stores the next-page pointer at the end of each log page.
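A minimal sketch of what such a 4 KB log page might look like, with the next-page pointer stored at the end; the struct and field names are illustrative assumptions, not NOVA's actual on-NVMM layout.

    #include <stdint.h>

    #define LOG_PAGE_SIZE 4096

    /* Pointer area at the end of each 4 KB log page. */
    struct log_page_tail {
        uint64_t next_page;   /* NVMM offset of the next log page; 0 if this is the last */
        uint64_t reserved;
    };

    #define LOG_ENTRY_SPACE (LOG_PAGE_SIZE - sizeof(struct log_page_tail))

    /* Illustrative 4 KB log page: variable-length log entries fill the page,
     * and the link to the next page in the inode's log sits at the end. */
    struct log_page {
        uint8_t entries[LOG_ENTRY_SPACE];
        struct log_page_tail tail;
    };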

Allowing for non-sequential log storage provides three advantages. First, allocating log space is easy since NOVA does not need to allocate large, contiguous regions for the log. Second, NOVA can perform log cleaning at fine-grained, page-size granularity. Third, reclaiming log pages that contain only stale entries requires just a few pointer assignments.

Do not log file data. The inode logs in NOVA do not contain file data. Instead, NOVA uses copy-on-write for modified pages and appends metadata about the write to the log. The metadata describe the update and point to the data pages. Section 4.4 describes the file write operation in more detail.

Using copy-on-write for file data is useful for several reasons. First, it results in a shorter log, accelerating the recovery process. Second, it makes garbage collection simpler and more efficient, since NOVA never has to copy file data out of the log to reclaim a log page. Third, reclaiming stale pages and allocating new data pages are both easy, since they just require adding and removing pages from in-DRAM free lists. Fourth, since it can reclaim stale data pages immediately, NOVA can sustain performance even under heavy write loads and high NVMM utilization levels.

The next section describes the implementation of NOVA in more detail.

4. Implementing NOVA

We have implemented NOVA in the Linux kernel version 4.0. NOVA uses the existing NVMM hooks in the kernel and has passed the Linux POSIX file system test suite [50]. The source code is available on GitHub: https://github.com/NVSL/NOVA. In this section we first describe the overall file system layout and its atomicity and write ordering mechanisms. Then, we describe how NOVA performs atomic directory, file, and mmap operations. Finally we discuss garbage collection, recovery, and memory protection in NOVA.

4.1. NVMM data structures and space management

Figure 1 shows the high-level layout of NOVA data structures in a region of NVMM it manages. NOVA divides the NVMM into four parts: the superblock and recovery inode, the inode tables, the journals, and log/data pages. The superblock contains global file system information, the recovery inode stores recovery information that accelerates NOVA remount after a clean shutdown (see Section 4.7), the inode tables contain inodes, the journals provide atomicity to directory operations, and the remaining area contains NVMM log and data pages. We designed NOVA with scalability in mind: NOVA maintains an inode table, journal, and NVMM free page list at each CPU to avoid global locking and scalability bottlenecks.

Figure 1: NOVA data structure layout. NOVA has per-CPU free lists, journals and inode tables to ensure good scalability. Each inode has a separate log consisting of a singly linked list of 4 KB log pages; the tail pointer in the inode points to the latest committed entry in the log.

Inode table NOVA initializes each inode table as a 2 MB block array of inodes. Each NOVA inode is aligned on a 128-byte boundary, so that given the inode number NOVA can easily locate the target inode. NOVA assigns new inodes to each inode table in a round-robin order, so that inodes are evenly distributed among inode tables. If the inode table is full, NOVA extends it by building a linked list of 2 MB sub-tables. To reduce the inode table size, each NOVA inode contains a valid bit and NOVA reuses invalid inodes for new files and directories. Per-CPU inode tables avoid inode allocation contention and allow for parallel scanning in failure recovery.

A NOVA inode contains pointers to the head and tail of its log. The log is a linked list of 4 KB pages, and the tail always points to the latest committed log entry. NOVA scans the log from head to tail to rebuild the DRAM data structures when the system accesses the inode for the first time.

Journal A NOVA journal is a 4 KB circular buffer and NOVA manages each journal with a <enqueue, dequeue> pointer pair. To coordinate updates that span multiple inodes, NOVA first appends log entries to each log, and then starts a transaction by appending all the affected log tails at the current CPU's journal enqueue and updating the enqueue pointer. After propagating the updates to the target log tails, NOVA sets the dequeue equal to the enqueue to commit the transaction. For a create operation, NOVA journals the directory's log tail pointer and the new inode's valid bit. During power failure recovery, NOVA checks each journal and rolls back any updates between the journal's dequeue and enqueue. NOVA only allows one open transaction at a time on each core, and per-CPU journals allow for concurrent transactions.
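To illustrate how such a journal might be used, here is a simplified sketch of a multi-inode commit. It assumes the per-inode log entries have already been appended; the helper names, structure layout, and the elided cache flushes and fences are assumptions for illustration only, not NOVA's actual code.

    #include <stdint.h>

    /* Illustrative journal record: the address of an inode's log tail pointer
     * and the value it should take -- 16 bytes per affected inode. */
    struct journal_entry {
        uint64_t *tail_addr;
        uint64_t  tail_value;
    };

    #define JOURNAL_SLOTS (4096 / sizeof(struct journal_entry))

    struct journal {
        struct journal_entry *buf;   /* 4 KB circular buffer in NVMM */
        uint64_t enqueue;            /* next free slot */
        uint64_t dequeue;            /* everything before this is committed */
    };

    /* Commit new log tails for several inodes as one transaction (sketch). */
    static void commit_log_tails(struct journal *j,
                                 const struct journal_entry *updates, int n)
    {
        /* 1. Record every affected log tail in the journal, then advance
         *    enqueue to open the transaction. */
        for (int i = 0; i < n; i++)
            j->buf[(j->enqueue + i) % JOURNAL_SLOTS] = updates[i];
        j->enqueue += n;

        /* 2. Propagate the new tails to the target inodes. */
        for (int i = 0; i < n; i++)
            *updates[i].tail_addr = updates[i].tail_value;

        /* 3. Setting dequeue equal to enqueue commits the transaction; recovery
         *    rolls back any updates between dequeue and enqueue. */
        j->dequeue = j->enqueue;
    }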

For each directory operation, the kernel's virtual file system (VFS) layer locks all the affected inodes, so concurrent transactions never modify the same inode.

NVMM space management To make NVMM allocation and deallocation fast, NOVA divides NVMM into pools, one per CPU, and keeps lists of free NVMM pages in DRAM. If no pages are available in the current CPU's pool, NOVA allocates pages from the largest pool, and uses per-pool locks to provide protection. This allocation scheme is similar to scalable memory allocators like Hoard [5]. To reduce the allocator size, NOVA uses a red-black tree to keep the free list sorted by address, allowing for efficient merging and providing O(log n) deallocation. To improve performance, NOVA does not store the allocator state in NVMM during operation. On a normal shutdown, it records the allocator state to the recovery inode's log; it restores the allocator state by scanning all the inodes' logs in case of a power failure.

NOVA allocates log space aggressively to avoid the need to frequently resize the log. Initially, an inode's log contains one page. When the log exhausts the available space, NOVA allocates sufficient new pages to double the log space and appends them to the log. If the log length is above a given threshold, NOVA appends a fixed number of pages each time.

4.2. Atomicity and enforcing write ordering

NOVA provides fast atomicity for metadata, data, and mmap updates using a technique that combines log structuring and journaling. This technique uses three mechanisms.

64-bit atomic updates Modern processors support 64-bit atomic writes for volatile memory and NOVA assumes that 64-bit writes to NVMM will be atomic as well. NOVA uses 64-bit in-place writes to directly modify metadata for some operations (e.g., the file's atime for reads) and uses them to commit updates to the log by updating the inode's log tail pointer.

Logging NOVA uses the inode's log to record operations that modify a single inode. These include operations such as write, msync and chmod. The logs are independent of one another.

Lightweight journaling For directory operations that require changes to multiple inodes (e.g., create, unlink and rename), NOVA uses lightweight journaling to provide atomicity. At any time, the data in any NOVA journal are small—no more than 64 bytes: the most complex POSIX rename operation involves up to four inodes, and NOVA only needs 16 bytes to journal each inode: 8 bytes for the address of the log tail pointer and 8 bytes for the value.

Enforcing write ordering NOVA relies on three write ordering rules to ensure consistency. First, it commits data and log entries to NVMM before updating the log tail. Second, it commits journal data to NVMM before propagating the updates. Third, it commits new versions of data pages to NVMM before recycling the stale versions. If NOVA is running on a system that supports the clflushopt, clwb and PCOMMIT instructions, it uses the code in Figure 2 to enforce the write ordering.

    new_tail = append_to_log(inode->tail, entry);
    // writes back the log entry cachelines
    clwb(inode->tail, entry->length);
    sfence(); // orders subsequent PCOMMIT
    PCOMMIT(); // commits entry to NVMM
    sfence(); // orders subsequent store
    inode->tail = new_tail;

Figure 2: Pseudocode for enforcing write ordering. NOVA commits the log entry to NVMM strictly before updating the log tail pointer. The persistency of the tail update is not shown in the figure.

First, the code appends the entry to the log. Then it flushes the affected cache lines with clwb. Next, it issues an sfence and a PCOMMIT instruction to force all previous updates to the NVMM controller. A second sfence prevents the tail update from occurring before the PCOMMIT. The write-back and commit of the tail update are not shown in the figure.

If the platform does not support the new instructions, NOVA uses movntq, a non-temporal move instruction that bypasses the CPU cache hierarchy to perform direct writes to NVMM, and uses a combination of clflush and sfence to enforce the write ordering.

4.3. Directory operations

NOVA pays close attention to directory operations because they have a large impact on application performance [37, 33, 64]. NOVA includes optimizations for all the major directory operations, including link, symlink and rename.

NOVA directories comprise two parts: the log of the directory's inode in NVMM and a radix tree in DRAM. Figure 3 shows the relationship between these components. The directory's log holds two kinds of entries: directory entries (dentries) and inode update entries. Dentries include the name of the child file/directory, its inode number, and a timestamp. NOVA uses the timestamp to atomically update the directory inode's mtime and ctime with the operation. NOVA appends a dentry to the log when it creates, deletes, or renames a file or subdirectory under that directory. A dentry for a delete operation has its inode number set to zero to distinguish it from a create dentry.

NOVA adds inode update entries to the directory's log to record updates to the directory's inode (e.g., for chmod and chown). These operations modify multiple fields of the inode, and the inode update entry provides atomicity.
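For illustration, a rough sketch of the two kinds of directory log entries just described; the field names, sizes, and layout are assumptions, not NOVA's actual on-NVMM format.

    #include <stdint.h>

    /* Illustrative dentry appended on create, delete, or rename.
     * An inode number of zero marks a delete dentry. */
    struct dentry_sketch {
        uint64_t inode_number;
        uint64_t timestamp;     /* also becomes the directory's new mtime/ctime */
        uint16_t name_len;
        char     name[];        /* name of the child file or subdirectory */
    };

    /* Illustrative inode update entry recording, e.g., a chmod or chown
     * applied to the directory inode itself. */
    struct inode_update_sketch {
        uint32_t mode;
        uint32_t uid;
        uint32_t gid;
        uint64_t timestamp;
    };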

Figure 3: NOVA directory structure. Dentry is shown in <name, inode_number> format. To create a file, NOVA first appends the dentry to the directory's log (step 1), updates the log tail as part of a transaction (step 2), and updates the radix tree (step 3).

Figure 4: NOVA file structure. An 8 KB (i.e., 2-page) write to page two (<2, 2>) of a file requires five steps. NOVA first writes a copy of the data to new pages (step 1) and appends the file write entry (step 2). Then it updates the log tail (step 3) and the radix tree (step 4). Finally, NOVA returns the old version of the data to the allocator (step 5).

To speed up dentry lookups, NOVA keeps a radix tree in DRAM for each directory inode. The key is the hash value of the dentry name, and each leaf node points to the corresponding dentry in the log. The radix tree makes search efficient even for large directories. Below, we use file creation and deletion to illustrate these principles.

Creating a file Figure 3 illustrates the creation of file zoo in a directory that already contains file bar. The directory has recently undergone a chmod operation and used to contain another file, foo. The log entries for those operations are visible in the figure. NOVA first selects and initializes an unused inode in the inode table for zoo, and appends a create dentry of zoo to the directory's log. Then, NOVA uses the current CPU's journal to atomically update the directory's log tail and set the valid bit of the new inode. Finally NOVA adds the file to the directory's radix tree in DRAM.

Deleting a file In Linux, deleting a file requires two updates: the first decrements the link count of the file's inode, and the second removes the file from the enclosing directory. NOVA first appends a delete dentry log entry to the directory inode's log and an inode update entry to the file inode's log, and then uses the journaling mechanism to atomically update both log tails. Finally it propagates the changes to the directory's radix tree in DRAM.

4.4. Atomic file operations

The NOVA file structure uses logging to provide metadata and data atomicity with low overhead, and it uses copy-on-write for file data to reduce the log size and make garbage collection simple and efficient. Figure 4 shows the structure of a NOVA file. The file inode's log records metadata changes, and each file has a radix tree in DRAM to locate data in the file by the file offset.

A file inode's log contains two kinds of log entries: inode update entries and file write entries that describe file write operations and point to data pages the write modified. File write entries also include the timestamp and file size, so that write operations atomically update the file's metadata. The DRAM radix tree maps file offsets to file write entries.

If the write is large, NOVA may not be able to describe it with a single write entry. If NOVA cannot find a large enough set of contiguous pages, it breaks the write into multiple write entries and appends them all to the log to satisfy the request. To maintain atomicity, NOVA commits all the entries with a single update to the log tail pointer.

For a read operation, NOVA updates the file inode's access time with a 64-bit atomic write, locates the required page using the file's radix tree, and copies the data from NVMM to the user buffer.

Figure 4 illustrates a write operation. The notation <file pgoff, num pages> denotes the page offset and number of pages a write affects. The first two entries in the log describe two writes, <0, 1> and <1, 2>, of 4 KB and 8 KB (i.e., 1 and 2 pages), respectively. A third, 8 KB write, <2, 2>, is in flight.

To perform the <2, 2> write, NOVA fills data pages and then appends the <2, 2> entry to the file's inode log. Then NOVA atomically updates the log tail to commit the write, and updates the radix tree in DRAM, so that offset "2" points to the new entry. The NVMM page that holds the old contents of page 2 returns to the free list immediately. During the operation, a per-inode lock protects the log and the radix tree from concurrent updates. When the write system call returns, all the updates are persistent in NVMM.
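Putting the steps of Figure 4 together, a high-level sketch of this copy-on-write write path follows. The struct and helper functions are hypothetical stand-ins for NOVA internals, and the per-inode lock and the cache flushes and fences of Section 4.2 are elided.

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    #define PAGE_SIZE 4096

    /* Hypothetical in-DRAM inode bookkeeping (illustrative only). */
    struct inode_sketch {
        uint64_t log_tail;   /* NVMM offset of the committed log tail */
        void    *radix;      /* DRAM radix tree: file page offset -> write entry */
    };

    /* Hypothetical helpers standing in for NOVA internals. */
    void     *alloc_data_pages(uint64_t num_pages);
    void      copy_from_user_to_nvmm(void *dst, const void *src, size_t len);
    uint64_t  append_write_entry(struct inode_sketch *inode, uint64_t pgoff,
                                 uint64_t num_pages, void *pages);
    void      radix_tree_update(void *radix, uint64_t pgoff, uint64_t num_pages,
                                uint64_t entry);
    void      free_old_data_pages(struct inode_sketch *inode, uint64_t pgoff,
                                  uint64_t num_pages);

    ssize_t cow_file_write(struct inode_sketch *inode, uint64_t pgoff,
                           uint64_t num_pages, const void *buf)
    {
        /* Step 1: copy the new data into freshly allocated NVMM pages. */
        void *new_pages = alloc_data_pages(num_pages);
        copy_from_user_to_nvmm(new_pages, buf, num_pages * PAGE_SIZE);

        /* Step 2: append a file write entry describing <pgoff, num_pages>. */
        uint64_t new_tail = append_write_entry(inode, pgoff, num_pages, new_pages);

        /* Step 3: a single 64-bit store of the log tail commits the write. */
        inode->log_tail = new_tail;

        /* Step 4: point the DRAM radix tree at the new write entry. */
        radix_tree_update(inode->radix, pgoff, num_pages, new_tail);

        /* Step 5: the pages holding the old version go back to the free list. */
        free_old_data_pages(inode, pgoff, num_pages);

        return (ssize_t)(num_pages * PAGE_SIZE);
    }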

4.5. Atomic mmap

DAX file systems allow applications to access NVMM directly via load and store instructions by mapping the physical NVMM file data pages into the application's address space. This DAX-mmap exposes the NVMM's raw performance to the applications and is likely to be a critical interface in the future.

While DAX-mmap bypasses the file system page cache and avoids paging overheads, it presents challenges for programmers. DAX-mmap provides raw NVMM, so the only atomicity mechanisms available to the programmer are the 64-bit writes, fences, and cache flush instructions that the processor provides. Using these primitives to build robust non-volatile data structures is very difficult [19, 67, 34], and expecting programmers to do so will likely limit the usefulness of direct-mapped NVMM.

To address this problem, NOVA proposes a direct NVMM access model with stronger consistency called atomic-mmap. When an application uses atomic-mmap to map a file into its address space, NOVA allocates replica pages from NVMM, copies the file data to the replica pages, and then maps the replicas into the address space. When the application calls msync on the replica pages, NOVA handles it as a write request as described in the previous section, uses the movntq operation to copy the data from replica pages to data pages directly, and commits the changes atomically.

Since NOVA uses copy-on-write for file data and reclaims stale data pages immediately, it does not support DAX-mmap. Atomic-mmap has higher overhead than DAX-mmap but provides a stronger consistency guarantee. The normal DRAM mmap is not atomic because the operating system might eagerly write back a subset of dirty pages to the file system, leaving the file data inconsistent in the event of a system failure [45]. NOVA could support atomic mmap in DRAM by preventing the operating system from flushing dirty pages, but we leave this feature as future work.

4.6. Garbage collection

NOVA's logs are linked lists and contain only metadata, making garbage collection simple and efficient. This structure also frees NOVA from the need to constantly move data to maintain a supply of contiguous free regions.

NOVA handles garbage collection for stale data pages and stale log entries separately. NOVA collects stale data pages immediately during write operations (see Section 4.4).

Cleaning inode logs is more complex. A log entry is dead in NOVA if it is not the last entry in the log (because the last entry records the inode's latest ctime) and any of the following conditions is met:
• A file write entry is dead if it does not refer to valid data pages.
• An inode update that modifies metadata (e.g., mode or mtime) is dead if a later inode update modifies the same piece of metadata.
• A dentry update is dead if it is marked invalid.

NOVA marks dentries invalid in certain cases. For instance, file creation adds a create dentry to the log. Deleting the file adds a delete dentry, and it also marks the create dentry as invalid. (If the NOVA garbage collector reclaimed the delete dentry but left the create dentry, the file would seem to reappear.)

These rules determine which log entries are alive and dead, and NOVA uses two different garbage collection (GC) techniques to reclaim dead entries.

Figure 5: NOVA log cleaning. The linked list structure of the log provides simple and efficient garbage collection. Fast GC reclaims invalid log pages by deleting them from the linked list (a), while thorough GC copies live log entries to a new version of the log (b).

Fast GC Fast GC emphasizes speed over thoroughness and it does not require any copying. NOVA uses it to quickly reclaim space when it extends an inode's log. If all the entries in a log page are dead, fast GC reclaims it by deleting the page from the log's linked list. Figure 5(a) shows an example of fast log garbage collection. Originally the log has four pages and page 2 contains only dead log entries. NOVA atomically updates the next-page pointer of page 1 to point to page 3 and frees page 2.

Thorough GC During the fast GC log scan, NOVA tallies the space that live log entries occupy. If the live entries account for less than 50% of the log space, NOVA applies thorough GC after fast GC finishes, copies live entries into a new, compacted version of the log, updates the DRAM data structure to point to the new log, then atomically replaces the old log with the new one, and finally reclaims the old log.

Figure 5(b) illustrates thorough GC after fast GC is complete. NOVA allocates a new log page 5, and copies valid log entries in pages 1 and 3 into it. Then, NOVA links page 5 to page 4 to create a new log and replace the old one.

NOVA does not copy the live entries in page 4 to avoid updating the log tail, so that NOVA can atomically replace the old log by updating the log head pointer.

4.7. Shutdown and Recovery

When NOVA mounts the file system, it reconstructs the in-DRAM data structures it needs. Since applications may access only a portion of the inodes while the file system is running, NOVA adopts a policy called lazy rebuild to reduce the recovery time: it postpones rebuilding the radix tree and the inode until the system accesses the inode for the first time. This policy accelerates the recovery process and reduces DRAM consumption. As a result, during remount NOVA only needs to reconstruct the NVMM free page lists. The algorithm NOVA uses to recover the free lists is different for "clean" shutdowns than for system failures.

Recovery after a normal shutdown On a clean unmount, NOVA stores the NVMM page allocator state in the recovery inode's log and restores the allocator during the subsequent remount. Since NOVA does not scan any inode logs in this case, the recovery process is very fast: our measurement shows that NOVA can remount a 50 GB file system in 1.2 milliseconds.

Recovery after a failure In case of an unclean dismount (e.g., a system crash), NOVA must rebuild the NVMM allocator information by scanning the inode logs. NOVA log scanning is fast because of two design decisions. First, per-CPU inode tables and per-inode logs allow for vast parallelism in log recovery. Second, since the logs do not contain data pages, they tend to be short. The number of live log entries in an inode log is roughly the number of extents in the file. As a result, NOVA only needs to scan a small fraction of the NVMM during recovery. NOVA failure recovery consists of two steps:

First, NOVA checks each journal and rolls back any uncommitted transactions to restore the file system to a consistent state.

Second, NOVA starts a recovery thread on each CPU and scans the inode tables in parallel, performing log scanning for every valid inode in the inode table. NOVA uses different recovery mechanisms for directory inodes and file inodes: for a directory inode, NOVA scans the log's linked list to enumerate the pages it occupies, but it does not inspect the log's contents. For a file inode, NOVA reads the write entries in the log to enumerate the data pages.

During the recovery scan NOVA builds a bitmap of occupied pages, and rebuilds the allocator based on the result. After this process completes, the file system is ready to accept new requests.

4.8. NVMM Protection

Since the kernel maps NVMM into its address space during NOVA mount, the NVMM is susceptible to corruption by errant stores from the kernel. To protect the file system and prevent permanent corruption of the NVMM from stray writes, NOVA must make sure it is the only system software that accesses the NVMM.

NOVA uses the same protection mechanism that PMFS does. Upon mount, the whole NVMM region is mapped as read-only. Whenever NOVA needs to write to NVMM pages, it opens a write window by disabling the processor's write protect control (CR0.WP). When CR0.WP is clear, kernel software running on ring 0 can write to pages marked read-only in the kernel address space. After the NVMM write completes, NOVA resets CR0.WP to close the write window. CR0.WP is not saved across interrupts, so NOVA disables local interrupts during the write window. Opening and closing the write window does not require modifying the page tables or the TLB, so it is inexpensive.

5. Evaluation

In this section we evaluate the performance of NOVA and answer the following questions:
• How does NOVA perform against state-of-the-art file systems built for disks, SSDs, and NVMM?
• What kind of operations benefit most from NOVA?
• How do underlying NVMM characteristics affect NOVA performance?
• How efficient is NOVA garbage collection compared to other approaches?
• How expensive is NOVA recovery?

We first describe the experimental setup and then evaluate NOVA with micro- and macro-benchmarks.

5.1. Experimental setup

To emulate different types of NVMM and study their effects on NVMM file systems, we use the Intel Persistent Memory Emulation Platform (PMEP) [21]. PMEP is a dual-socket Intel Xeon processor-based platform with special CPU microcode and firmware. The processors on PMEP run at 2.6 GHz with 8 cores and 4 DDR3 channels. The BIOS marks the DRAM memory on channels 2 and 3 as emulated NVMM. PMEP supports configurable latencies and bandwidth for the emulated NVMM, allowing us to explore NOVA's performance on a variety of future memory technologies. PMEP emulates the clflushopt, clwb, and PCOMMIT instructions with processor microcode.

In our tests we configure the PMEP with 32 GB of DRAM and 64 GB of NVMM.
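As a brief aside before the evaluation, the CR0.WP write window of Section 4.8 can be sketched as follows for a kernel-4.0-era x86 kernel. This mirrors the PMFS approach the text describes; the function name is illustrative and the snippet is not NOVA's actual code.

    #include <linux/types.h>
    #include <linux/irqflags.h>        /* local_irq_save/restore */
    #include <asm/special_insns.h>     /* read_cr0/write_cr0 */
    #include <asm/processor-flags.h>   /* X86_CR0_WP */

    /* Perform one store to NVMM that is mapped read-only in the kernel. */
    static void nvmm_protected_store(u64 *nvmm_dst, u64 value)
    {
        unsigned long flags;

        local_irq_save(flags);                 /* CR0.WP is not saved across interrupts */
        write_cr0(read_cr0() & ~X86_CR0_WP);   /* clear WP: open the write window */

        *nvmm_dst = value;                     /* the protected NVMM write */

        write_cr0(read_cr0() | X86_CR0_WP);    /* restore WP: close the write window */
        local_irq_restore(flags);
    }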

To emulate different NVMM technologies, we choose two configurations for PMEP's memory emulation system (Table 1): for STT-RAM we use the same read latency and bandwidth as DRAM, and configure PCOMMIT to take 200 ns; for PCM we use 300 ns for the read latency and reduce the write bandwidth to 1/8th of DRAM, and PCOMMIT takes 500 ns.

    NVMM      Read latency   Write bandwidth   clwb latency   PCOMMIT latency
    STT-RAM   100 ns         Full DRAM         40 ns          200 ns
    PCM       300 ns         1/8 DRAM          40 ns          500 ns

Table 1: NVMM emulation characteristics. STT-RAM emulates fast NVMs that have access latency and bandwidth close to DRAM, and PCM emulates NVMs that are slower than DRAM.

We evaluate NOVA on Linux kernel 4.0 against seven file systems. Two of these, PMFS and Ext4-DAX, are the only available open-source NVMM file systems that we know of. Both of them journal metadata and perform in-place updates for file data. Two others, NILFS2 and F2FS, are log-structured file systems designed for HDD and flash-based storage, respectively. We also compare to Ext4 in default mode (Ext4) and in data journal mode (Ext4-data), which provides data atomicity. Finally, we compare to Btrfs [54], a state-of-the-art copy-on-write Linux file system. Except for Ext4-DAX and Ext4-data, all the file systems are mounted with default options. Btrfs and Ext4-data are the only two file systems in the group that provide the same, strong consistency guarantees as NOVA.

PMFS and NOVA manage NVMM directly and do not require a block device interface. For the other file systems, we use the Intel persistent memory driver [48] to emulate an NVMM-based, ramdisk-like device. The driver does not provide any protection from stray kernel stores, so we disable the CR0.WP protection in PMFS and NOVA in the tests to make the comparison fair. We add clwb and PCOMMIT instructions to flush data where necessary in each file system.

5.2. Microbenchmarks

We use a single-thread micro-benchmark to evaluate the latency of basic file system operations. The benchmark creates 10,000 files, makes sixteen 4 KB appends to each file, calls fsync to persist the files, and finally deletes them.

Figures 6(a) and 6(b) show the results on STT-RAM and PCM, respectively. The latency of fsync is amortized across the append operations. NOVA provides the lowest latency for each operation, outperforms other file systems by between 35% and 17×, and improves append performance by 7.3× and 6.7× compared to Ext4-data and Btrfs, respectively. PMFS is closest to NOVA in terms of append and delete performance. NILFS2 performs poorly on create operations, suggesting that naively using log-structured, disk-oriented file systems on NVMM is unwise.

NOVA is more sensitive to NVMM performance than the other file systems because NOVA's software overheads are lower, and so overall performance more directly reflects the underlying memory performance. Figure 6(c) shows the latency breakdown of NOVA file operations on STT-RAM and PCM. For create and append operations, NOVA only accounts for 21%–28% of the total latency. On PCM the NOVA delete latency increases by 76% because NOVA reads the inode log to free data and log blocks and PCM has higher read latency. For the create operation, the VFS layer accounts for 49% of the latency on average. The memory copy from the user buffer to NVMM consumes 51% of the append execution time on STT-RAM, suggesting that the POSIX interface may be the performance bottleneck on high-speed memory devices.

5.3. Macrobenchmarks

We select four Filebench [23] workloads—fileserver, webproxy, webserver and varmail—to evaluate the application-level performance of NOVA. Table 2 summarizes the characteristics of the workloads. For each workload we test two dataset sizes by changing the number of files. The small dataset will fit entirely in DRAM, allowing file systems that use the DRAM page cache to cache the entire dataset. The large dataset is too large to fit in DRAM, so the page cache is less useful. We run each test five times and report the average. Figure 7 shows the Filebench throughput with different NVMM technologies and dataset sizes.

    Workload     Average file size   I/O size (r/w)   Threads   R/W ratio   # of files (Small/Large)
    Fileserver   128 KB              16 KB/16 KB      50        1:2         100K/400K
    Webproxy     32 KB               1 MB/16 KB       50        5:1         100K/1M
    Webserver    64 KB               1 MB/8 KB        50        10:1        100K/500K
    Varmail      32 KB               1 MB/16 KB       50        1:1         100K/1M

Table 2: Filebench workload characteristics. The selected four workloads have different read/write ratios and access patterns.

In the fileserver workload, NOVA outperforms other file systems by between 1.8× and 16.6× on STT-RAM, and between 22% and 9.1× on PCM for the large dataset. NOVA outperforms Ext4-data by 11.4× and Btrfs by 13.5× on STT-RAM, while providing the same consistency guarantees. NOVA on STT-RAM delivers twice the throughput compared to PCM, because of PCM's lower write bandwidth. PMFS performance drops by 80% between the small and large datasets, indicating its poor scalability.

Webproxy is a read-intensive workload. For the small dataset, NOVA performs similarly to Ext4 and Ext4-DAX, and 2.1× faster than Ext4-data. For the large workload, NOVA performs between 36% and 53% better than F2FS and Ext4-DAX.

Figure 6: File system operation latency on different NVMM configurations. The single-thread benchmark performs create, append and delete operations on a large number of files.

Figure 7: Filebench throughput with different file system patterns and dataset sizes on STT-RAM and PCM. Each workload has two dataset sizes so that the small one can fit in DRAM entirely while the large one cannot. The standard deviation is less than 5% of the value.

PMFS performs directory lookup by linearly searching the directory entries, and NILFS2's directory lock design is not scalable [57], so their performance suffers since webproxy puts all the test files in one large directory.

Webserver is a read-dominated workload and does not involve any directory operations. As a result, non-DAX file systems benefit significantly from the DRAM page cache and the workload size has a large impact on performance. Since STT-RAM has the same latency as DRAM, small-workload performance is roughly the same for all the file systems, with NOVA enjoying a small advantage. On the large dataset, NOVA performs 10% better on average than Ext4-DAX and PMFS, and 63% better on average than non-DAX file systems. On PCM, NOVA's performance is about the same as the other DAX file systems. For the small dataset, non-DAX file systems are 33% faster on average due to DRAM caching. However, for the large dataset, NOVA's performance remains stable while non-DAX performance drops by 60%.

Varmail emulates an email server with a large number of small files and involves both read and write operations. NOVA outperforms Btrfs by 11.1× and Ext4-data by 3.1× on average, and outperforms the other file systems by between 2.2× and 216×, demonstrating its capabilities in write-intensive workloads and its good scalability with large directories. NILFS2 and PMFS still suffer from poor directory operation performance.

Duration 10s 30s 120s 600s 3600s Dataset File size Number of files Dataset size I/O size
NILFS2 Fail Fail Fail Fail Fail Videoserver 128 MB 400 50 GB 1 MB
F2FS 37,979 23,193 18,240 Fail Fail Fileserver 1 MB 50,000 50 GB 64 KB
NOVA 222,337 222,229 220,158 209,454 205,347 Mailserver 128 KB 400,000 50 GB 16 KB
# GC pages
Fast 0 255 17,385 159,406 1,170,611 Table 4: Recovery workload characteristics. The number of
Thorough 102 2,120 9,633 27,292 72,727 files and typical I/O size both affect NOVA’s recovery performance.

Table 3: Performance of a full file system. The test runs a 30 GB


fileserver workload under 95% NVMM utilization with different dura- Dataset Videoserver Fileserver Mailserver
tions, and reports the results in operations per second. The bottom STTRAM-normal 156 µs 313 µs 918 µs
three rows show the number of pages that NOVA garbage collector PCM-normal 311 µs 660 µs 1197 µs
reclaimed in the test. STTRAM-failure 37 ms 39 ms 72 ms
PCM-failure 43 ms 50 ms 116 ms
Overall, NOVA achieves the best performance in almost
all cases and provides data consistency guarantees that are Table 5: NOVA recovery time on different scenarios. NOVA is
able to recover 50 GB data in 116ms in case of power failure.
as strong or stronger than the other file systems. The perfor-
mance advantages of NOVA are largest on write-intensive
workloads with large number of files. keeping the logs short, and performing log scanning in paral-
5.4. Garbage collection efficiency lel.
To measure the recovery overhead, we use the three work-
NOVA resolves the issue that many LFSs suffer from, i.e. loads in Table 4. Each workload represents a different use
they have performance problems under heavy write loads, case for the file systems: Videoserver contains a few large
especially when the file system is nearly full. NOVA reduces files accessed with large-size requests, mailserver includes
the log cleaning overhead by reclaiming stale data pages a large number of small files and the request size is small,
immediately, keeping log sizes small, and making garbage fileserver is in between. For each workload, we measure the
collection of those logs efficient. cost of mounting after a normal shutdown and after a power
To evaluate the efficiency of NOVA's garbage collection when NVMM is scarce, we run a 30 GB write-intensive fileserver workload under 95% NVMM utilization for different durations, and compare with the other log-structured file systems, NILFS2 and F2FS. We run the test with PMEP configured to emulate STT-RAM.
Table 3 shows the result. NILFS2 could not finish the 10-second test due to garbage collection inefficiencies. F2FS fails after running for 158 seconds, and its throughput drops by 52% between the 10s and 120s tests due to log cleaning overhead. In contrast, NOVA outperforms F2FS by 5.8× and successfully runs for the full hour. NOVA's throughput also remains stable, dropping by less than 8% between the 10s and one-hour tests.
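Both figures follow directly from the throughput rows of Table 3:

\[
\frac{37{,}979 - 18{,}240}{37{,}979} \approx 52\%,
\qquad
\frac{222{,}337 - 205{,}347}{222{,}337} \approx 7.6\% < 8\%.
\]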
The bottom half of Table 3 shows the number of pages that the NOVA garbage collector reclaimed. In the 30s test, fast GC reclaims 11% of the stale log pages. As the running time rises, fast GC becomes more efficient and is responsible for 94% of the reclaimed pages in the one-hour test. The result shows that in long-running use, the simple and low-overhead fast GC is efficient enough to reclaim the majority of stale log pages.
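These shares can be read off the page counts in Table 3:

\[
\frac{255}{255 + 2{,}120} \approx 11\%,
\qquad
\frac{1{,}170{,}611}{1{,}170{,}611 + 72{,}727} \approx 94\%.
\]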
5.5. Recovery overhead

NOVA uses DRAM to maintain the NVMM free page lists that it must rebuild when it mounts a file system. NOVA accelerates recovery by rebuilding inode information lazily, keeping the logs short, and performing log scanning in parallel.
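As a rough illustration of the parallel scan, the sketch below divides the inodes among a fixed number of recovery threads; each thread walks its share of the inode logs and marks the NVMM pages they reference as in use, after which the unmarked pages can be rebuilt into the DRAM free page lists. The data structures, the pthread-based threading, and names such as parallel_recovery and scan_logs are assumptions made for this example and do not mirror NOVA's actual kernel implementation.

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define NR_RECOVERY_THREADS 4

struct inode_log {
    uint64_t *page_numbers;    /* NVMM page numbers referenced by this log */
    size_t    nr_entries;
};

struct recovery_ctx {
    struct inode_log *logs;        /* all inode logs in the file system */
    atomic_char      *page_in_use; /* one in-use flag per NVMM page     */
    size_t            first, last; /* half-open range of inodes to scan */
};

/* Each recovery thread walks its slice of the inode logs and marks every
 * page those logs reference as allocated. */
static void *scan_logs(void *arg)
{
    struct recovery_ctx *ctx = arg;

    for (size_t i = ctx->first; i < ctx->last; i++) {
        struct inode_log *log = &ctx->logs[i];

        for (size_t e = 0; e < log->nr_entries; e++)
            atomic_store(&ctx->page_in_use[log->page_numbers[e]], 1);
    }
    return NULL;
}

/* Launch the scanners, each responsible for a contiguous slice of inodes.
 * After they join, every page left unmarked goes back onto a DRAM free list
 * (that final step is omitted here). */
static void parallel_recovery(struct inode_log *logs, size_t nr_inodes,
                              atomic_char *page_in_use)
{
    pthread_t tids[NR_RECOVERY_THREADS];
    struct recovery_ctx ctx[NR_RECOVERY_THREADS];
    size_t chunk = (nr_inodes + NR_RECOVERY_THREADS - 1) / NR_RECOVERY_THREADS;

    for (int t = 0; t < NR_RECOVERY_THREADS; t++) {
        size_t first = (size_t)t * chunk;
        size_t last  = first + chunk < nr_inodes ? first + chunk : nr_inodes;

        ctx[t] = (struct recovery_ctx){
            .logs = logs, .page_in_use = page_in_use,
            .first = first < nr_inodes ? first : nr_inodes,
            .last  = last,
        };
        pthread_create(&tids[t], NULL, scan_logs, &ctx[t]);
    }
    for (int t = 0; t < NR_RECOVERY_THREADS; t++)
        pthread_join(tids[t], NULL);
}

Because each thread scans a disjoint slice of the inode logs and the per-page flags are set atomically, the scan needs no locking, and its cost scales with the total length of the logs; this is why short logs and lazy inode rebuilding keep recovery fast.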
To measure the recovery overhead, we use the three workloads in Table 4. Each workload represents a different use case for the file systems: Videoserver contains a few large files accessed with large requests, Mailserver includes a large number of small files accessed with small requests, and Fileserver is in between. For each workload, we measure the cost of mounting after a normal shutdown and after a power failure.

Table 5 summarizes the results. With a normal shutdown, NOVA recovers the file system in 1.2 ms, since it does not need to scan the inode logs. After a power failure, NOVA's recovery time increases with the number of inodes (because the number of logs increases) and as the I/O operations that created the files become smaller (because file logs become longer as files become fragmented). Recovery runs faster on STT-RAM than on PCM because NOVA reads the logs to reconstruct the NVMM free page lists, and PCM has higher read latency than STT-RAM. On both PCM and STT-RAM, NOVA is able to recover 50 GB of data in 116 ms, achieving a failure-recovery bandwidth higher than 400 GB/s.
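The bandwidth figure follows from the dataset size and the worst-case failure-recovery time in Table 5:

\[
\frac{50\ \text{GB}}{116\ \text{ms}} \approx 431\ \text{GB/s}.
\]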
6. Conclusion

We have implemented and described NOVA, a log-structured file system designed for hybrid volatile/non-volatile main memories. NOVA extends the ideas of LFS to leverage NVMM, yielding a simpler, high-performance file system that supports fast and efficient garbage collection and quick recovery from system failures. Our measurements show that NOVA outperforms existing NVMM file systems by a wide margin on a wide range of applications while providing stronger consistency and atomicity guarantees.

Acknowledgments

This work was supported by STARnet, a Semiconductor Research Corporation program, sponsored by MARCO and DARPA. We would like to thank John Ousterhout, Niraj Tolia, Isabella Furth, and the anonymous reviewers for their insightful comments and suggestions. We are also thankful to Subramanya R. Dulloor from Intel for his support and hardware access.
