
Virtual-Memory Assisted Buffer Management

Preprint accepted for publication at SIGMOD 2023


Viktor Leis, Technische Universität München
Adnan Alhomssi, Friedrich-Alexander-Universität Erlangen-Nürnberg
Tobias Ziegler, Technische Universität Darmstadt
Yannick Loeck, Technische Universität Hamburg
Christian Dietrich, Technische Universität Hamburg

ABSTRACT

Most database management systems cache pages from storage in a main memory buffer pool. To do this, they either rely on a hash table that translates page identifiers into pointers, or on pointer swizzling, which avoids this translation. In this work, we propose vmcache, a buffer manager design that instead uses hardware-supported virtual memory to translate page identifiers to virtual memory addresses. In contrast to existing mmap-based approaches, the DBMS retains control over page faulting and eviction. Our design is portable across modern operating systems, supports arbitrary graph data, enables variable-sized pages, and is easy to implement. One downside of relying on virtual memory is that with fast storage devices the existing operating system primitives for manipulating the page table can become a performance bottleneck. As a second contribution, we therefore propose exmap, which implements scalable page table manipulation on Linux. Together, vmcache and exmap provide flexible, efficient, and scalable buffer management on multi-core CPUs and fast storage devices.

CCS CONCEPTS
• Information systems → Data management systems; Record and buffer management.

KEYWORDS
Database Management Systems; Operating Systems; Caching; Buffer Management

ACM Reference Format:
Viktor Leis, Adnan Alhomssi, Tobias Ziegler, Yannick Loeck, and Christian Dietrich. 2023. Virtual-Memory Assisted Buffer Management: Preprint accepted for publication at SIGMOD 2023. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

DBMS vs. OS. Database management systems (DBMS) and operating systems (OS) have always had an uneasy relationship. OSs provide process isolation by virtualizing hardware access, whereas DBMSs want full control over hardware for optimal efficiency. At the same time, OSs offer services (e.g., caching pages from storage) that are almost exactly what database systems require – but for performance and semantic reasons, DBMSs often re-implement this functionality. The mismatch between the services offered by operating systems and the requirements of database systems was raised four decades ago [40], and the situation has not improved much since then.

OS-controlled caching. The big advantage the OS has over a DBMS is that it runs in kernel mode and therefore has access to privileged instructions. In particular, the OS has direct control over the virtual memory page table, and can therefore do things user space processes cannot. For example, using virtual memory and the memory management unit (MMU) of the processor, the OS implements transparent page caching and exposes this by mapping storage into virtual memory through the mmap system call. With mmap, in-memory operations (cache hits) are fast, thanks to the Translation Lookaside Buffer (TLB). Nevertheless, as Crotty et al. [13] recently discussed, mmap is generally not a good fit for database systems. Two major problems of mmap are that (1) the DBMS loses control over page faulting and eviction, and that (2) the virtual memory implementation in Linux is too slow for modern NVMe SSDs [13]. The properties of mmap and alternative buffer manager designs are summarized in Table 1.

DBMS-controlled caching. In order to have full control, most DBMSs therefore avoid file-backed mmap and implement explicit buffer management in user space. Traditionally, this has been done using a hash table that contains all pages that are currently in cache [15]. Recent, more efficient buffer manager designs rely on pointer swizzling [16, 23, 33]. Both approaches have downsides: the former has non-trivial hash table translation overhead, and the latter is more difficult to implement and does not support cyclical page references (e.g., graph data). Rather than compromising on either performance or functionality, this work proposes hardware-supported virtual memory as a fundamental building block of buffer management.

Contribution 1: vmcache. The first contribution of this paper is vmcache, a novel buffer pool design that relies on virtual memory, but retains control over faulting and eviction within the DBMS,
unlike solutions based on file-backed mmap. The key idea is to map the storage device into anonymous (rather than file-backed) virtual memory and use the MADV_DONTNEED hint to explicitly control eviction. This enables fast in-memory page accesses through TLB-supported translations without handing control to the OS. Page-table-based translation also allows vmcache to support arbitrary graph data and variable-sized pages.

Contribution 2: exmap. While vmcache has excellent in-memory performance, every page fault and eviction involves manipulating the page table. Unfortunately, existing OS page table manipulation primitives have scalability problems that become visible with high-performance NVMe SSDs [13]. Therefore, as a second contribution, we propose exmap, an OS extension for efficiently manipulating virtual memory mappings. exmap is implemented as a Linux kernel module and is an example of DBMS/OS co-design. By providing new OS-level abstractions, we simplify and accelerate data-processing systems. Overall, as Table 1 shows, combining exmap with vmcache results in a design that is not only fast (in-memory and out-of-memory) but also offers important functionality.

Table 1: Conceptual comparison of buffer manager designs

            mmap [13]   tradi. [15]   pointer swiz. [16, 23]   Umbra [33]   vmcache (Sec. 3)   +exmap (Sec. 4)
transl.     page tbl.   hash tbl.     invasive                 invasive     page tbl.          page tbl.
control     OS          DBMS          DBMS                     DBMS         DBMS               DBMS
var. size   easy        hard          hard                     med. (*)     easy               easy
graphs      yes         yes           no                       no           yes                yes
implem.     med. (**)   easy          hard                     hard         easy               easy
in-mem.     fast        slow          fast                     fast         fast               fast
out-mem.    slow        fast          fast                     fast         med.               fast

(*) only powers of 2 [33]   (**) read-only easy, transactions hard [13]

2 BACKGROUND: DATABASE PAGE CACHING

Buffer management. Most DBMSs cache fixed-size pages (usually 4-64 KB) from secondary storage in a main memory pool. The basic problem of such a cache is to efficiently translate a page identifier (PID), which uniquely determines the physical location of each page on secondary storage, into a pointer to the cached data content. In the following, we describe known ways of doing that, including the six designs shown in Table 1.

Hash table-based translation. Figure 1a illustrates the traditional way [15] of implementing a buffer pool: a hash table indexes all cached pages by their PID. A page is addressed using its PID, which always involves a hash-table lookup. On a miss, the page is read from secondary storage and added to the hash table. This approach is simple and flexible. The hash table is the single source of truth of the caching state, and pages can reference each other arbitrarily through PIDs. The downside is suboptimal in-memory performance, as even cache hits have to pay the hash table lookup cost. Also note that there are two levels of translation: from PID to virtual memory pointer (at the DBMS level), and from virtual memory pointer to physical memory pointer (at the OS/MMU level).
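To make the first of these two translation steps concrete, the following minimal sketch (not taken from any particular system) shows hash table-based fixing; PID, Page, the global cache table, and readFromStorage are illustrative placeholders:

// Illustrative sketch of hash table-based PID translation (single-threaded).
#include <cstdint>
#include <unordered_map>

using PID = uint64_t;
struct Page { char data[4096]; };
std::unordered_map<PID, Page*> cache;              // single source of truth

void readFromStorage(PID pid, Page* dst) { /* e.g., pread(fd, dst, 4096, pid*4096) */ }

Page* fixPage(PID pid) {
    auto it = cache.find(pid);                     // every access pays a hash lookup
    if (it != cache.end())
        return it->second;                         // cache hit
    Page* p = new Page();                          // cache miss: load and register page
    readFromStorage(pid, p);
    cache.emplace(pid, p);
    return p;
}

In a real buffer manager this table must additionally be latched or partitioned, which is exactly the lookup and synchronization overhead that the designs discussed next try to avoid.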
Main-memory DBMS. One way to avoid the overhead of traditional buffer managers is to forego caching altogether and keep all data in main memory. While pure in-memory database systems can be very fast, in the past decade DRAM prices have almost stopped decreasing [18]. Storage in the form of NVMe flash SSDs, on the other hand, has become cheap (20-50× cheaper per byte than DRAM [18]) and fast (>1 million random 4 KB reads per second per SSD [4]). This makes pure in-memory systems economically unattractive [29], and implies that modern storage engines should combine DRAM and SSD. The challenge is supporting very large data sets on NVMe SSDs with their high I/O throughput and making cache hits almost as fast as in main-memory systems.

Pointer swizzling (invasive translation). An efficient technique for implementing buffer managers is pointer swizzling. The technique has originally been proposed for object-oriented DBMSs [20], but has recently been applied to several high-performance storage engines [16, 23, 33]. As Figure 1b illustrates, the idea is to replace the PID of a cached page with its virtual memory pointer within the data structure. Page hits can therefore directly dereference a pointer instead of having to translate it through a hash table first. One way to think about this is that pointer swizzling gets rid of explicit hash table-based translation by invasively modifying the data structure itself. Pointer swizzling offers very good in-memory performance. However, it requires adaptations for every buffer-managed data structure, and its internal synchronization is quite intricate: to unswizzle a page, for example, one needs to find and lock its parent, and storing a parent pointer on each node presents synchronization challenges during node splits. Another downside is that pointer swizzling-based systems generally do not support having more than one incoming reference to any particular page. In other words, only tree data structures are directly supported. Graph data, next pointers in B+tree leaf pages, and multiple incoming tuple references (e.g., from secondary indexes) require inelegant and sometimes inefficient workarounds.

Hardware-supported page translation. Traditional buffer managers and pointer swizzling present an unsatisfactory and seemingly inescapable tradeoff: either one pays the performance cost of the hash table indirection, or one loses the ability to support graph-like data. Instead of getting rid of the translation (as pointer swizzling does), another way of achieving efficiency is to make PID-to-pointer translation efficient through hardware support. All modern operating systems use virtual memory and, together with hardware support from the CPU, transparently translate virtual to physical addresses. Page table entries are cached within the CPU, in particular the TLB, which makes virtual memory translation fast. Figure 1c shows how hardware-supported page translation can be used for caching pages from secondary storage.

OS-driven caching with file-backed mmap. Unix offers the mmap system call to access storage via virtual memory. After mapping a file or device into virtual memory, a memory access will trigger a page fault. The OS will then install that page in the page table, making succeeding page accesses as fast as ordinary memory accesses. Some systems therefore eschew implementing a buffer pool and instead rely on the OS page cache by mmaping the database file/device. While this approach makes cache hits very fast, it has major problems that were recently analyzed by Crotty et al. [13]: (1) Ensuring transactional safety is difficult and potentially inefficient because the DBMS loses control over eviction. (2) There is no interface for asynchronous I/O, and I/O stalls are unpredictable. (3) I/O error handling is cumbersome. (4) OS-implemented page faulting and eviction is too slow to fully exploit modern NVMe storage devices.
[Figure 1: Buffer pool page translation schemes – (a) hash table, (b) invasive (pointer swizzling), (c) page table (virtual memory). Example with 6 pages on storage (P0-P5), 3 of which are cached (P1, P3, P5).]

The lack of control over eviction for file-backed mmap approaches is a fundamental problem. Notably, it prevents the implementation of ARIES-style transactions. ARIES uses in-place writes and prevents the eviction of a dirty page before its corresponding log entry is flushed – impossible with existing OS interfaces [13]. Without explicit control over eviction, it is also impossible to implement DBMS-optimized page replacement algorithms. Thus, one is at the whim of whatever algorithm the OS currently in use implements, which is unlikely to be optimized for DBMS workloads.

DBMS-driven, virtual-memory assisted caching. While OS-managed caching using mmap may not be a good solution for most DBMSs, the OS has one big advantage: instead of having to use an explicit hash table for page translation, it can rely on hardware support (the TLB) for page translation. This raises the following question: Is it possible to exploit the virtual memory subsystem without losing control over eviction and page fault handling? One contribution of this paper is to answer this question affirmatively. In Section 3, we describe how widely-supported OS features (anonymous memory and the MADV_DONTNEED hint) can be exploited to implement hardware-supported page translation while retaining full control over faulting and eviction within the DBMS.

Variable-sized pages. Besides making page translation fast, using a page table also makes implementing multiple page sizes much easier. Having dynamic page sizes is obviously very useful, e.g., for storing objects that are larger than one page [33]. Nevertheless, many buffer managers only support one particular page size (e.g., 4 KB) because multiple sizes lead to complex allocation and fragmentation issues. In these systems, larger objects need to be implemented by splitting them across pages, which complicates and slows down the code accessing such objects. With control over the page table, on the other hand, a larger (e.g., 12 KB) page can be created by mapping multiple (e.g., 3) non-contiguous physical pages to a contiguous virtual memory range. This is easy to implement within the OS and no fragmentation occurs in main memory. One system that allows multiple (albeit only power-of-two) page sizes is Umbra [33]. It implements this by allocating multiple buffer pool-sized virtual memory areas – one for each page size. To allocate a page of a particular size, one can simply fault the memory from that class. To free a page, the buffer manager uses the MADV_DONTNEED OS hint. This approach gets rid of fragmentation from different page sizes, but Umbra's page translation is still based on pointer swizzling rather than the page table. Umbra therefore inherits the disadvantages of pointer swizzling (difficult implementation, no graph data), while potentially encountering OS scalability issues.

Fast virtual memory manipulation. While OS-supported approaches offer very fast access to cached pages and enable variable-sized pages, they unfortunately may suffer from performance problems. One problem is that each CPU core has its own TLB, which can get out of sync with the page table.¹ When the page table changes, the OS therefore generally has to interrupt all CPU cores and force them to invalidate their TLB ("TLB shootdown"). Another issue is that intra-kernel data structures can become the scalability bottleneck on systems with many cores. Crotty et al. [13] observed that because of these issues mmap can be slow in out-of-memory workloads. For random reads from one SSD, they measured that it achieves less than half the achievable I/O throughput. With sequential scans from ten SSDs, the gap between mmap and explicit asynchronous I/O is roughly 20×. Any virtual memory-based approach (including our basic vmcache design) will run into these kernel issues. Section 4 therefore describes a novel, specialized virtual memory subsystem for Linux called exmap, which solves these performance problems.

¹ The page table, which is an in-memory data structure, is itself coherent across CPU cores. However, a CPU core accessing memory caches virtual-to-physical pointer translations in a per-core hardware cache called the TLB. If the page table is changed, the hardware does not automatically update or invalidate existing TLB entries.

Persistent memory. In this work, we focus on block storage rather than byte-addressable persistent memory, for which multiple specialized caching designs have been proposed [8, 21, 28, 41, 43].

3 VMCACHE: VIRTUAL-MEMORY ASSISTED BUFFER MANAGEMENT

The POSIX system call mmap usually maps a file or storage device into virtual memory, as is illustrated in Figure 1c. The advantage of file-backed mmap is that, due to hardware support for page translation, accessing cached pages becomes as fast as ordinary memory accesses. If the page translation is cached in the TLB and the data happens to be in the L1 cache, an access can take as little as 1 ns. The big downside is that the DBMS loses control over page faulting and eviction. If the page is not cached but resides on storage, dereferencing a pointer may suddenly take 10 ms because the OS
will cause a page fault that is transparent to the DBMS. Thus, from the point of view of the DBMS, eviction and page faulting are totally unpredictable and can happen at any point in time. In this section, we describe vmcache, a buffer manager design that – like file-backed mmap – uses virtual memory to translate page identifiers into pointers (see Figure 1c). However, unlike mmap, in vmcache the DBMS retains control over page faults and eviction.

3.1 Page Table Manipulation

Setting up virtual memory. Like the file-backed mmap approach, vmcache allocates a virtual memory area with (at least) the same size as the backing storage. However, unlike with file-backed mmap, this allocation is not directly backed by storage. Such an "unbacked" allocation is called anonymous and, confusingly, is done through mmap as well, but using the MAP_ANONYMOUS flag:

int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE;
int prot = PROT_READ | PROT_WRITE;
char* virtMem = mmap(0, vmSize, prot, flags, -1, 0);

Note that no file descriptor has been specified here (the fifth argument is -1). Storage is handled explicitly and could be a file (multiple applications share one file system) or multiple block devices (in a RAID setup). Moreover, the allocation will initially not be backed by physical memory, which is important because storage capacity is usually much larger than main memory.

Adding pages to the cache. To add a page to the cache, the buffer manager explicitly reads it from storage to the corresponding position in virtual memory. For example, we can use the pread system call to explicitly read P3 as follows:

uint64_t offset = 3 * pageSize;
pread(fd, virtMem + offset, pageSize, offset);

Once pread completes, a physical memory page will be installed in the page table and the data becomes visible to the DBMS process. In contrast to mmap, which handles page misses transparently without involving the DBMS, with the vmcache approach the buffer manager controls I/O. For example, we can use either the synchronous pread system call or asynchronous I/O interfaces such as libaio or io_uring.

Removing pages from the cache. After mapping more and more pages, the buffer pool will eventually run out of physical memory, causing failing allocations or swapping. Before that happens, the DBMS needs to start evicting pages, which on Linux can be done as follows²:

madvise(virtMem + offset, pageSize, MADV_DONTNEED);

This call will remove the physical page from the page table and make its physical memory available for future allocations. If the page is dirty (i.e., has been changed), it first needs to be written back to storage, e.g., using pwrite:

pwrite(fd, virtMem + offset, pageSize, offset);

With the primitives described above, the DBMS can control all buffer management decisions: how to read pages, which pages to evict³, whether and when to write back a page, and when to remove a page from the page table.

² On Windows these primitives are available as VirtualAlloc(..., MEM_RESERVE, ...) and VirtualFree(..., MEM_RELEASE).
³ Strictly speaking, the OS could decide to evict vmcache pages – but this does not affect the correctness of our design. OS-triggered eviction can be prevented by disabling swapping or by mlocking the virtual memory range.
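Combining the pwrite and madvise primitives from above, evicting a single page in the basic design boils down to the following sketch (error handling omitted; fd, virtMem, pageSize, and the dirty flag are assumed to be DBMS-maintained state):

// Sketch: evicting one page with the Section 3.1 primitives (illustrative only).
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>

void evictOne(int fd, char* virtMem, uint64_t pid, uint64_t pageSize, bool dirty) {
    uint64_t ofs = pid * pageSize;
    if (dirty)
        pwrite(fd, virtMem + ofs, pageSize, ofs);        // write back before dropping it
    madvise(virtMem + ofs, pageSize, MADV_DONTNEED);      // release the physical page
}

The actual implementation additionally has to lock the page and update its state entry, which is the topic of the next subsections.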
3.2 Page States and Synchronization Basics

In terms of the buffer manager implementation, the most difficult aspect is synchronization, e.g., managing races to the same page. Buffer managers must not only use scalable synchronization internally, they should also provide efficient and scalable synchronization primitives to the upper DBMS layers. After all, most database data structures (e.g., relations, indexes) are stored on top of cacheable pages.

Buffer pool state. In a traditional buffer manager (see Figure 1a), the translation hash table is used as a single source of truth for the caching state. Because all accesses go through the hash table, synchronization is fairly straightforward (but usually not efficient). Our approach, in contrast, needs an additional data structure for synchronization because not all page accesses traverse the page table⁴ and because the page table cannot be directly manipulated from user space. Therefore, we allocate a contiguous array with as many page state entries as we have pages on storage at corresponding positions, as the following figure illustrates:

[Illustration: one state entry per page on storage, e.g., P0: Evicted, P1: Locked, P2: Evicted, P3: Unlocked, P4: Evicted; only the cached pages are backed by physical memory.]

⁴ If a page translation is cached in the TLB of a particular thread, the thread does not have to consult the page table.

Page states. After startup, all pages are in the Evicted state. Page access operations first check their state entry and proceed according to the following state diagram:

[State diagram: Evicted --fix--> Locked --unfix--> Unlocked; Unlocked --fix--> Locked; Unlocked --eviction candidate--> Marked; Marked --fix--> Locked; Marked --evict--> Evicted.]

1  fix(uint64_t pid): // fix page exclusively
2    uint64_t ofs = pid * pageSize
3    while (true) // retry until success
4      PageState s = state[pid]
5      if (s.isEvicted())
6        if (state[pid].CAS(s, Locked))
7          pread(fd, virtMem+ofs, pageSize, ofs)
8          return virtMem+ofs // page miss
9      else if (s.isMarked() || s.isUnlocked())
10       if (state[pid].CAS(s, Locked))
11         return virtMem+ofs // page hit
12 unfix(uint64_t pid):
13   state[pid].setUnlocked()

Listing 1: Pseudo code for exclusive page access
Listing 1 shows pseudo code for the fix and unfix operations, which provide exclusive page access. Suppose we have a page that is currently in Evicted state (line 5 in the code). If a thread wants to access that page, it calls fix, which will transition to the Locked state using a compare-and-swap operation (line 6). The thread is then responsible to read the page from storage and implicitly (via pread) install it to the page table (line 7). After that, it can access the page itself and finally unfix it, which causes a transition to the Unlocked state (line 13). If another thread concurrently wants to fix the same page, it waits until it is unlocked. This serializes page misses and prevents the same page from being read multiple times. The fourth state, Marked, helps to implement a clock replacement strategy – though arbitrary other algorithms could be implemented as well. Cached pages are selected for eviction by setting their state to Marked. If the page is accessed, it transitions back to the Locked state, which clears the mark (line 10). Otherwise, the page can be evicted and eventually transitions to the Evicted state.

3.3 Advanced Synchronization

So far, we discussed how to lock pages exclusively. To enable scalable and efficient read operations, vmcache also provides shared locks (multiple concurrent readers on the same page) and optimistic (lock-free) reads.

Shared locks. To implement shared locks for read-only operations, we count the number of concurrent readers within the page state. If the page is not locked exclusively, read-only operations atomically increment/decrement that counter [9] when fixing/unfixing the page. Exclusive accesses have to wait until the counter is 0 before acquiring the lock.
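As a rough illustration, shared locking can be implemented as a compare-and-swap loop on the combined state/version word; the bit placement below (state in the top byte) and the helper names are assumptions for this sketch, not the paper's actual layout:

// Sketch of shared (reader) locking on a 64-bit page state word.
// 1-252 = reader count, 253 = Locked, 254 = Marked, 255 = Evicted (see Sec. 3.3).
#include <atomic>
#include <cstdint>

constexpr uint64_t MaxShared = 252;

uint64_t stateOf(uint64_t w)          { return w >> 56; }
uint64_t versionOf(uint64_t w)        { return w & ((1ull << 56) - 1); }
uint64_t pack(uint64_t s, uint64_t v) { return (s << 56) | v; }

bool tryFixShared(std::atomic<uint64_t>& e) {
    uint64_t w = e.load();
    uint64_t s = stateOf(w);
    if (s > MaxShared)                        // Locked, Marked, or Evicted:
        return false;                         // caller retries, falls back, or handles a miss
    return e.compare_exchange_strong(w, pack(s + 1, versionOf(w)));
}

void unfixShared(std::atomic<uint64_t>& e) {
    uint64_t w = e.load();                    // version stays unchanged for shared unlocks
    while (!e.compare_exchange_strong(w, pack(stateOf(w) - 1, versionOf(w)))) {}
}

An exclusive fix would additionally wait until the reader count has dropped to zero, as described above.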
Optimistic reads. Both exclusive and shared locks write to shared memory when acquiring or releasing the lock, which invalidates cache entries in other CPU cores. For tree data structures such as B-trees this results in suboptimal scalability, because the page states of inner nodes are constantly invalidated. An elegant alternative to locks are optimistic, lock-free page reads that validate whether the read was correct. To do that, locks contain an update version that is incremented whenever an exclusively locked page is unlocked [9, 25, 30]. We store this version counter together with the page state within the same 64-bit value, ensuring that both are always changed atomically. As the pseudo code in Listing 2 shows, an optimistic reader retrieves the state and if it equals Unlocked (line 4 in the code), it reads from the page (line 5). After that we retrieve the page state again and make sure that the page is still not locked and that the version has not changed (line 6). If this check fails, the operation is restarted. Note that the version counter is incremented not just when a page changes but also when it is evicted. This is crucial for correctness and, for example, ensures that an optimistic read of a marked page that is evicted before validation will fail. To prevent starvation due to repeated restarts, it is also possible to fall back to pessimistic lock-based operations (not shown in the code). Finally, let us note that optimistic reads can be interleaved across multiple pages, enabling lock coupling-like synchronization of complex data structures like B-trees [24]. This approach has been shown to be highly scalable and outperform lock-free data structures [42].

1  optimisticRead(uint64_t pid, Function fn):
2    while (true) // retry until success
3      PageState s = state[pid] // incl. version
4      if (s.isUnlocked())
5        // optimistic read:
6        fn(virtMem + (pid*pageSize))
7        if (state[pid] == s) // validate version
8          return // success
9      else if (s.isMarked())
10       // clear mark:
11       state[pid].CAS(s, Unlocked)
12     else if (s.isEvicted())
13       fix(pid); unfix(pid) // handle page miss

Listing 2: Pseudo code for optimistic read

64-bit state entry. Overall, we use 64 bits for the page state, of which 8 bits encode the Unlocked (0), LockedShared (1-252), Locked (253), Marked (254), and Evicted (255) states. This leaves us with 56 bits for the version counter – which are enough to never overflow in practice. 64 bits are also a convenient size that allows atomic operations such as compare-and-swap (CAS).
Memory reclamation and optimistic reads. In general, lock-free data structures require special care when freeing memory [25, 27, 30]. Techniques such as epoch-based memory reclamation [30] or hazard pointers [31] have been proposed to address this problem. All these techniques incur overhead and may cause additional memory consumption due to unnecessarily long reclamation delays. Interestingly, vmcache – despite supporting optimistic reads – can sidestep these problems completely. Indeed, vmcache does not prevent the eviction/reclamation of a page that is currently read optimistically. However, this is not a problem because after the page is removed from the page table using the MADV_DONTNEED hint, it is replaced by the zero page. In that situation the optimistic read will proceed loading 0s from the page without crashing, and will detect that eviction occurred during the version check. (The check fails because eviction first locks and then unlocks the page, which increments the version.) Therefore, vmcache does not need any additional memory reclamation scheme.

Parking lot. To avoid exclusive and shared locks from wasting CPU cycles and ensure fairness under lock contention, one can use the Parking Lot [9, 36] technique. The key idea is that if a thread fails to acquire the lock (potentially after trying several times), it can "park" itself, which will block the thread until it is woken up by the thread holding the lock. Parking itself is implemented using a fixed-size hash table storing standard OS-supported condition variables [9]. Within the page state, we only need one additional bit that indicates whether there are threads that are currently waiting for that page lock to be released. The big advantage of parking lots is very low space overhead per page, which is only 1 bit instead of 64 bytes for pthread (rw)locks [9].

3.4 Replacement Strategy

Clock implementation. In principle, arbitrary replacement strategies can be implemented on top of vmcache. As mentioned earlier, our current implementation uses the clock algorithm. Before the buffer pool runs out of memory, we change the state of Unlocked
pages to Marked. All page accesses, including optimistic reads, clear the Marked state, ensuring that hot pages will not be evicted. To implement clock, one needs to be able to iterate over all pages in the buffer pool. One approach to do that would be to iterate over the state array while ignoring evicted pages. However, this would be quite expensive if the state array is very sparse (i.e., storage is much larger than main memory). We implement a more robust approach that stores all page identifiers that are currently cached in a hash table. The size of the hash table is equal to the number of pages in DRAM (rather than storage) and our page replacement algorithm iterates over this much smaller data structure. We use a fixed-size open addressing hash table, which makes iteration cache efficient. Note that, in contrast to traditional buffer managers, this hash table is not accessed during cache hits, but only during page faults and eviction.

Batch eviction. For efficiency reasons, our implementation evicts batches of 64 pages. To minimize exclusive locking and exploit efficient bulk I/O, eviction is done in five steps:

(1) get batch of marked candidates from hash table, lock dirty pages in shared mode
(2) write dirty pages (using libaio)
(3) try to lock (upgrade) clean (dirty) page candidates
(4) remove locked pages from page table using madvise
(5) remove locked pages from eviction hash table, unlock them

After step 3, all pages must be locked exclusively to avoid race conditions during eviction. For dirty pages, we already obtained shared locks in step 1, which is why step 3 performs a lock upgrade. Clean pages have not been locked, so step 3 tries to acquire the exclusive lock directly. Both operations can fail because another thread accessed the page, in which case eviction skips it (i.e., the page stays in the pool). With the basic vmcache design, step 4 is simply calling madvise once for every page. With exmap, we will be able to exploit bulk removal of pages from the page table.
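The following sketch outlines the I/O side of this batch (steps 2 and 4); candidate selection and the locking in steps 1, 3, and 5 are only summarized as comments, and synchronous pwrite stands in for the libaio writes used by the actual implementation:

// Sketch of the batch-eviction I/O path (illustrative, not the paper's code).
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <vector>

void evictBatch(int fd, char* virtMem, uint64_t pageSize,
                const std::vector<uint64_t>& dirty,   // step 1: marked candidates,
                const std::vector<uint64_t>& clean) { //         dirty ones locked shared
    for (uint64_t pid : dirty)                        // step 2: write back dirty pages
        pwrite(fd, virtMem + pid*pageSize, pageSize, pid*pageSize);
    // step 3: upgrade dirty pages to exclusive, try-lock clean pages; skip failures
    std::vector<uint64_t> victims(dirty);
    victims.insert(victims.end(), clean.begin(), clean.end());
    for (uint64_t pid : victims)                      // step 4: drop the mappings
        madvise(virtMem + pid*pageSize, pageSize, MADV_DONTNEED);
    // step 5: remove victims from the eviction hash table and unlock them
}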
3.5 Page Sizes

Default page size. Most processors use 4 KB virtual memory pages by default, and conveniently this granularity also works well with flash SSDs. It therefore makes sense to set the default buffer pool page size to 4 KB as well. x86 (ARM) also supports 2 MB (1 MB) pages, which might be a viable alternative in systems that primarily read larger blocks. With vmcache, OLTP systems should generally use 4 KB pages and for OLAP systems both 4 KB and 2 MB pages are suitable.

Supporting larger pages. vmcache also makes it easy to support any buffer pool page size that is a multiple of 4 KB. Figure 2 shows an example where page P3 spans two physical pages. For data structures implemented on top of the buffer manager this fact is completely transparent, i.e., the memory appears to be contiguous. Accesses to large pages only use the page state of the head page (P3, not P4 in the figure). The advantage of relying on virtual memory to implement multiple page sizes is that it avoids main memory fragmentation. Note that fragmentation is not simply moved from user to kernel space, but the page table indirection allows the OS to always deal with 4 KB pages rather than having to maintain different allocation classes. As a consequence, as Figure 2 illustrates, a contiguous virtual memory range will in general not be physically contiguous.

[Figure 2: vmcache enables DBMS page sizes that are multiples of the VM page size – a large page (P3) occupies a contiguous virtual memory range backed by non-contiguous physical pages.]

Advantages of large pages. Although most DBMSs rely on fixed-size pages, supporting different page sizes has many advantages. One case where variable-size pages simplify and accelerate the DBMS is string processing. With variable-size pages one can, for example, simply call external string processing libraries with a pointer into the buffer pool. Without this feature, any string operation (comparison, LIKE, regexp search, etc.) needs to explicitly deal with strings chunked across several pages. Because few existing libraries support chunking, one would have to copy larger strings into contiguous memory before being able to use them. Another case is compressed columnar storage where each column chunk has the same number of tuples but a different size. In both cases it is indeed possible to split the data across multiple fixed-size pages (and many systems have to do it due to a lack of variable-size support), but it leads to complex code and/or slower performance. Finally, let us mention that, in contrast to systems like Umbra [33], vmcache supports arbitrary page sizes as long as they are a multiple of 4 KB. This reduces memory waste for larger objects. Overall, we argue that this feature can substantially simplify the implementation of the DBMS and lead to better performance.

3.6 Discussion

State access. As mentioned earlier, every page access must retrieve the page state – often causing a cache miss – before it can read the page data itself. One may therefore wonder whether this is just as inefficient as traditional hash table-based buffer managers. However, these two approaches are very different from each other in terms of their memory access patterns. In the hash table approach, the page data pointer is retrieved from the hash table itself, i.e., there is a data dependency between the two pointers and one usually pays the price of two cache miss latencies. In our approach, in contrast, both the page state pointer and the data content pointer are known upfront. As a consequence, the out-of-order execution of modern CPUs will perform both accesses in parallel, hiding the additional overhead of the state retrieval.

Memory consumption. vmcache comes with some DRAM overhead in the form of page tables and the page state array: For configuring the virtual-memory mapping, vmcache requires 8.016 bytes for each 4 KB of storage to set up a 5-level page table. Besides this cost, which is inherent to any mmap-like buffer manager, vmcache requires an additional 8 bytes for the page state: 8 bits for the exclusive/shared lock and 56 bits for the optimistic-read version counter. So in total, vmcache requires around 16 bytes of DRAM per 4 KB on storage. Thus, for example, for a 1 TB flash SSD, one needs 4 GB of DRAM for the internal buffer manager state, which is a reasonable 1/256th of SSD capacity. Economically speaking, as Flash is approximately 50× cheaper per byte than DRAM, the additional memory costs 50/256 ≈ 20 % of the flash price. While this is low enough in most use cases, there are ways to reduce this cost: (1) Compress the 64-bit page state at the expense of optimistic reads (-56 bits) and shared locking (-6 bits) down to two bits per storage page (evicted, exclusive locked), leaving us with a total of 2.07 GB for a 1 TB flash SSD (+10.11 % cost). (2) Place the page state within the buffered page and keep the corresponding 8 bytes on the storage page unused, leaving us with the unavoidable 2 GB of DRAM overhead. Thus, the memory overhead is reasonable in terms of overall cost for the system and could be reduced even further.
[Figure 3: Linux page (de)allocation performance – alloc+free operations per second for 1 to 128 threads on an anonymous mapping; throughput stagnates around 1.5-2.0 M op/s because of TLB shootdowns.]
[Figure 4: CPU time profile for Figure 3 with 128 threads – a flame graph dominated by madvise/zap_page_range/flush_tlb_mm_range (TLB shootdowns via on_each_cpu_cond_mask and smp_call_function_many_cond), with the remaining time spent in page allocation (page fault, alloc page) and freeing (free pages).]

Address space. Existing 64-bit CPUs generally support at least 48-bit virtual memory addresses. On Linux, half of that is reserved for the kernel, and user-space virtual memory allocations are therefore limited to 2^47 = 128 TB. Starting with Ice Lake, Intel processors support 57-bit virtual memory addresses, enabling a user-space address space size of 2^56 = 64 PB. Thus, the address space is large enough for our approach, and will be so for the foreseeable future.

4 EXMAP: SCALABLE AND EFFICIENT VIRTUAL MEMORY MANIPULATION

vmcache exploits hardware-supported virtual memory with explicit control over eviction while supporting flexible locking modes, variable-sized pages, and arbitrary reference patterns (i.e., graphs). This is achieved by relying on two widely-available OS primitives: anonymous memory mappings and an explicit memory-release system call. Although vmcache is a practical and useful design, with some workloads it can run into OS kernel performance problems. In this section, we describe a Linux kernel extension called exmap that solves this weakness. We first motivate why the existing OS implementation is not always sufficient, then provide a high-level overview of the design, and finally describe implementation details.

4.1 Motivation

Why Change the OS? With vmcache, (de)allocating 4 KB pages is as frequent as page misses and evict operations, i.e., the OS' memory subsystem becomes part of the hot path in out-of-memory workloads. Unfortunately, Linux' implementation of page allocation and deallocation does not scale. As a consequence, workloads that have a high page turn-over rate can become bottlenecked by the OS's virtual memory subsystem rather than the storage device. To quantify the situation on Linux, we allocate pages on a single anonymous mapping by triggering a page fault and evict them again with MADV_DONTNEED. As Figure 3 shows, vanilla Linux only achieves 1.51M OP/s with 128 threads. Incidentally, a single modern PCIe 4.0 SSD can achieve 1.5M random 4 KB reads per second [4]. In other words, a 128-thread CPU would be completely busy manipulating virtual memory for one SSD – not leaving any CPU cycles for actual work.
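The loop behind this measurement is essentially the following single-threaded sketch (the multi-threaded harness, timing, and error handling are omitted; the iteration count and mapping size are arbitrary):

// Sketch of the allocate/evict loop measured here: touching a page of an
// anonymous mapping allocates it via a page fault, MADV_DONTNEED evicts it.
#include <sys/mman.h>
#include <cstdint>

int main() {
    const uint64_t pageSize = 4096, numPages = 1ull << 20;            // 4 GB mapping
    char* mem = (char*)mmap(nullptr, numPages * pageSize, PROT_READ | PROT_WRITE,
                            MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE, -1, 0);
    for (uint64_t op = 0; op < 100'000'000; op++) {
        uint64_t i = op % numPages;
        mem[i * pageSize] = 1;                                        // page fault: allocate
        madvise(mem + i * pageSize, pageSize, MADV_DONTNEED);         // evict again
    }
}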
Problem 1: TLB shootdowns. To investigate this poor scalability, we used the perf profiling tool and show a flame graph [17] in Figure 4. Linux spends 79% of all CPU time in the flush_tlb_mm_range function. It implements TLB shootdowns, which are an explicit coherency measure that prevents outdated TLB entries, which otherwise could lead to data inconsistencies or security problems. On changing the page table, the OS sends an interprocessor interrupt (IPI) to all other (N-1) cores running application threads, which then clear their TLB. This is fundamentally unscalable as it requires N-1 IPIs for every evicted page.

Problem 2: Page allocation. After shootdowns, the next major performance problem in Linux is the intra-kernel page allocator (free pages and alloc page in the flame graph). The Linux page allocator relies on a centralized, unscalable data structure and, for security reasons, has to zero out each page after eviction. Therefore, once the larger TLB shootdown bottleneck is solved, workloads with high page turn-over rates will be bound by the page allocator.

Why a New Page Table Manipulation API? The two performance problems described above cannot be solved by some low-level changes within Linux, but are fundamentally caused by the existing decades-old virtual memory API and semantics: The TLB shootdowns are unavoidable with a synchronous page-at-a-time API, and page allocation is slowed down by the fact that physical memory pages can be shared between different user processes. Achieving efficient and scalable page table manipulation therefore requires a different virtual memory API and modified semantics.

4.2 Design Principles

exmap. exmap is a specialized Linux kernel extension that enables fast and scalable page table manipulation through a new API and efficient kernel-level implementation. We co-designed exmap for use with vmcache, but as we discuss in Section 4.5, it could also be used to accelerate other applications. exmap comes as a Linux kernel module that the user can load into any recent Linux kernel without rebooting. Like the POSIX interface, exmap provides primitives for setting up virtual memory, allocating, and freeing pages. However, as outlined below, exmap has new semantics to eliminate the bottlenecks provoked by the POSIX interface.
Solving the TLB shootdown problem. An effective way of reducing the number of TLB shootdowns is to batch multiple page evictions and thereby reduce the number of shootdowns by the batch size. To achieve this, exmap provides a batching interface to free multiple pages with a single system call. While batching is easy to exploit for a buffer manager when evicting pages, it can be problematic to batch page allocations because these are often latency critical. To avoid TLB shootdowns on allocation, exmap therefore ensures that allocation does not require shootdowns at all. To do this, exmap always read-protects the page table entry of a freed page (by setting a specific bit in the page table entry). Linux, in contrast, sets that entry to a write-protected but not read-protected zero page – potentially causing invalid TLB entries that have to be explicitly invalidated on allocation. This subtle change eliminates the need for shootdowns on allocation completely.

Solving the page allocation problem. Another important difference between Linux and exmap is the page allocation mechanism. In Linux, when a page is freed, it is returned to a system-wide pool (and thereby potentially to other processes). This has two drawbacks: (1) page allocation does not scale well and (2) pages are repeatedly zeroed out for security reasons. exmap, in contrast, pre-allocates physical memory at creation and keeps it in scalable thread-local memory pools – thereby avoiding both bottlenecks.

4.3 Overview and Usage

Implementation overview. Figure 5 illustrates the three major components of an exmap object: (A) its surface within the virtual memory (VM); (B) a number of control interfaces to interact with the object; and (C) a private memory pool of physical DRAM pages, which exists as interface-local free lists spread over all interfaces.

[Figure 5: exmap implementation overview: The VM surface (A) is manipulated with explicit free, alloc, read, or write system calls. Each per-thread control interface (B) owns part of the exmap-local memory pool, which exists as interface-local free lists of physical pages (C). If an interface runs out of pages (1), it steals pages from another interface (2). Pages only circulate (X) between the surface and the interface.]

Creation. On creation (lines 4-8 in Listing 3), the user configures these components: She specifies the number of interfaces that the kernel should allocate (line 5). Usually, each thread should use its own interface (e.g., thread id = interface id) to maximize scalability. The user also specifies the number of memory pool pages (line 6), which exmap will drain from Linux' page allocator for the lifetime of the exmap object. As the third parameter, the user can specify a file descriptor as backing storage for read operations (line 7).

1  // Open device/file as backing storage
2  int fd = open("/dev/...", O_RDWR|O_DIRECT);
3  // Create a new exmap object
4  struct exmap_setup_params params = {
5    .max_interfaces = 8,    // # of control interfaces
6    .pool_size = 262144,    // # of pages in pool (1 GB)
7    .backing_fd = fd};      // storage device
8  int exmap_fd = exmap_create(&params);
9  // Make the exmap visible in the VM
10 Page* pages = (Page*)mmap(vmSize, exmap_fd, ...);
11 // Allocate and evict memory using interface 5
12 exmap_interface_t iface = 5;
13 // Scattered I/O vector: P1, P3-P5
14 struct iovec vec[] = {
15   { .iov_base = &pages[1], .iov_len = pageSize },
16   { .iov_base = &pages[3], .iov_len = pageSize * 3 }};
17 exmap_action(exmap_fd, iface, EXMAP_ALLOC, &vec, 2);
18 exmap_action(exmap_fd, iface, EXMAP_FREE, &vec, 2);
19 // Read pages from fd into the exmap
20 // Use exmap_fd as a proxy file descriptor.
21 pread(exmap_fd, &pages[13], pageSize, iface); // P13
22 preadv(exmap_fd, &vec, 2, iface); // P1, P3-P5
23 // Write-backs are explicit and without proxy fd
24 pwrite(fd, &pages[7], pageSize, 7 * pageSize); // P7

Listing 3: exmap usage example

Operations. After creation, the process makes the exmap surface visible within its VM via mmap (line 10). While an exmap can have an arbitrary VM extent, it can be mapped exactly once in the whole system. On the mapped surface, we allow the vectorized and scattered allocation of pages on the exmap surface (line 11 and Figure 5 (X)). For this, one specifies a vector of page ranges within the mapped surface and issues an EXMAP_ALLOC command at an explicitly-addressed interface. The required physical pages are first drawn from the specified interface (Figure 5 (1)), before we steal memory from other interfaces (Figure 5 (2)). Once allocated, pages are never swapped out and, therefore, accesses will never lead to a page fault, providing deterministic access times. With the free operation (line 18), we free the page ranges and release the removed physical pages to the specified interface.

Read I/O. In contrast to file-backed mmap, we do not page in or write back data transparently, but the user (e.g., vmcache) explicitly invokes read and write operations on the surface. To speed up these operations, we integrated exmap with the regular Linux I/O subsystem, whereby an exmap file descriptor becomes a proxy for the specified backing device (lines 19-22). This allows combining page allocation and read operations in a single system call: On read, exmap first populates the specified page range with memory before it uses the regular Linux VFS interface to perform the actual read. Since we derive the disk offset from the on-surface offset, we can use the offset parameter to specify the allocation interface. With this integration, exmap supports synchronous (pread) and asynchronous (libaio and io_uring) reads. Furthermore, as the on-surface offset determines the disk offset, vectorized reads (preadv, IORING_OP_READV) implicitly become scattered operations (line 22), which Linux currently allows with no other system call.
Write I/O. On the write side, we actively decided against a write-proxy interface, which would, for example, bundle the write back and page evict. While such a bundling is not necessary as the user can already write surface pages to disk (line 24), freeing pages for each write individually could, if not used correctly, lead to unnecessary overheads. Therefore, we decoupled write back and (batched) freeing of pages.

4.4 Implementation Details

Scalable page allocator. Usually, when the kernel unmaps a page, it returns the page to the system-wide buddy allocator, which possibly merges it into larger chunks of physical memory. On allocation, these chunks are broken down again into pages, which have to be zeroed before mapping them to the user space. Therefore, with a high VM turn-over rate, memory is constantly zeroed and circles between the VM subsystem and the buddy allocator. To optimize VM operations for vmcache, we decided to use per-exmap memory pools to bypass the system allocator. This also allows us to avoid proactive page zeroing since pages only circulate between the surface and the memory pool within the same process, whereby information leakage to other processes is impossible. Only during the initial exmap creation, we zero the pages in our memory pool.

Thread-local control interfaces and page stealing. Furthermore, exmap's control interfaces not only allow the application to express allocation/eviction locality, but they also reduce contention and false sharing that come with a centralized allocator. For this, we distribute the memory pool as local lists of free 4 KB pages over the interfaces, whereby the need for page stealing comes up. After the interface-local free list is drained, we use a three-tiered page-stealing strategy: (1) steal from the interface from which we have successfully stolen the last time, (2) randomly select two interfaces and steal from the interface with more free pages, and (3) iterate over all interfaces until we have gathered enough pages. To minimize the number of steal operations, we steal more pages than required for the current operation. If we remove pages from the surface, we always push them to the specified interface. Thereby, for workloads in which per-interface allocation and eviction are in balance, steal operations are rarely necessary.
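A simplified user-space sketch of this three-tiered strategy could look as follows (FreeList, the random selection, and the refill policy are stand-ins for the kernel module's actual data structures):

// Sketch of three-tiered page stealing over interface-local free lists.
#include <cstdlib>
#include <vector>

struct FreeList { std::vector<void*> pages; };

size_t lastVictim = 0;                                   // per-interface state in practice

bool stealFrom(FreeList& src, FreeList& dst, size_t want) {
    size_t got = 0;
    while (!src.pages.empty() && got < want) {           // steal more than strictly needed
        dst.pages.push_back(src.pages.back());           // to amortize future refills
        src.pages.pop_back();
        got++;
    }
    return got > 0;
}

void refill(std::vector<FreeList>& ifaces, size_t self, size_t want) {
    if (stealFrom(ifaces[lastVictim], ifaces[self], want)) return;        // tier 1
    size_t a = rand() % ifaces.size(), b = rand() % ifaces.size();        // tier 2
    size_t pick = ifaces[a].pages.size() > ifaces[b].pages.size() ? a : b;
    if (stealFrom(ifaces[pick], ifaces[self], want)) { lastVictim = pick; return; }
    for (size_t i = 0; i < ifaces.size(); i++)                            // tier 3
        if (stealFrom(ifaces[i], ifaces[self], want)) { lastVictim = i; return; }
}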
Lock-free page-table manipulation. For page-table manipulations, Linux uses a fine-grained locking scheme that locks the last level of the page table tree to update page-table entries therein. However, such entries have machine-word size on most architectures, and we can update them directly with atomic instructions. While Linux leaves this opportunity open for portability reasons, we integrated an atomic-exchange-based hot path: If an operation manipulates only an individual page-table entry on a last-level page table, we install (or remove) the VM mapping with a single compare-and-exchange.
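In other words, the fast path amounts to a single compare-and-exchange on the last-level page-table entry. The following user-space analogy only sketches that idea; the 64-bit entry value and the empty-slot convention are simplifications, not Linux's actual pte handling:

// User-space analogy of the lock-free fast path: install or clear a single
// last-level entry with one compare-and-exchange (simplified encoding).
#include <atomic>
#include <cstdint>

bool installMapping(std::atomic<uint64_t>& pte, uint64_t newEntry) {
    uint64_t expected = 0;                             // only succeed if the slot is empty
    return pte.compare_exchange_strong(expected, newEntry);
}

bool removeMapping(std::atomic<uint64_t>& pte, uint64_t oldEntry) {
    return pte.compare_exchange_strong(oldEntry, 0);   // only if the entry is unchanged
}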
I/O subsystem integration. For read operations, the Linux I/O subsystem is optimized for sequential reads into destination buffers that are already populated with physical memory. For example, without exmap, Linux does not provide a scattered read operation that takes multiple offsets; such a read request has to be split into multiple (unrelated) reads. On a lower level, Linux expects the VM to be populated and calls the page-fault handler for each missing page before issuing the actual device operation. Hence, Linux cannot fully exploit scattered request patterns, but handles them as individual requests, which provokes unnecessary overheads (i.e., repeated page-table locking, allocator invocations). To avoid this, exmap provides vectorized and scattered reads with the proxy file descriptor. This allows us to (1) pre-populate the VM with memory, which avoids the page-fault handler path, and (2) cut down the system-call overhead, as we issue only a single system call per request batch.

Multiple exmaps. A process can create multiple exmap objects, which are mapped as separate non-overlapping virtual-memory areas (VMAs) into the process address space. These VMAs come with their own VM subsystem and are largely isolated from each other and from the rest of the kernel while still ensuring consistency and privilege isolation. As already noted, each exmap can be mapped exactly once, whereby we avoid the bookkeeping overhead of general-purpose solutions.⁵

⁵ For example, Linux usually maintains a reverse mapping from physical to virtual addresses that is necessary to implement features such as copy-on-write fork.

4.5 Discussion

OS customization. exmap is a new low-level OS interface for manipulating virtual memory efficiently. Seemingly minor semantic changes such as batching and avoiding zero pages result in very high performance without sacrificing security. One analogy is that exmap is for VM what O_DIRECT is for I/O: a specialized tool for systems that want to manage and control hardware resources themselves as efficiently as possible. Two design decisions of exmap require further discussion.

Functionality. We largely decoupled the exmap surface and its memory pool from the rest of Linux. As a consequence of this lean design, exmap is efficient but does not support copy-on-write forking and swapping. Few buffer pool implementations rely on such functionality. Indeed, it is actually a benefit that exmap behavior is simple and predictable, as it allows buffer managers to precisely track memory consumption and ensure robust performance.

Portability. Another important aspect is generalizability to other operating systems and architectures. Since our kernel module comes with its own specialized VM subsystem, it only has few dependencies on the rest of the Linux kernel. This makes exmap easily portable between Linux versions and suggests that the concept can be implemented for other operating systems such as Windows and FreeBSD. Except for our architecture-dependent lock-free short-cut for small page table modifications, the exmap implementation is also independent of the used ISA and MMU, as it reuses Linux' MMU abstractions. In other words, our Linux implementation is easily portable across CPU architectures that support Linux.

Other Applications of exmap. Although we explicitly designed exmap for caching, it has other use cases as well: (1) Due to its high VM-modification performance (see Figure 8), a heap manager could use a large exmap surface to coalesce free pages into large contiguous buffers, which is useful for DBMS query processing [14]. (2) With a page-move extension, a language run-time system could use exmap as a base for a copying garbage collector for pools of page-aligned objects. (3) For large-scale graph processing, workers request, often with a high fan-out (e.g., for breadth-first search) and with high parallelism, randomly-placed data from the backing store, which can easily be serviced by exmap. (4) For user-space file
Figure 6: In-memory scalability (128 GB pool, random lookup: 100 M entries ≈ 20 GB, TPC-C: 200 warehouses ≈ 40 GB)

5 EVALUATION
The goal of this section is to show experimentally that vmcache is competitive with state-of-the-art swizzling-based buffer managers for in-memory workloads and that exmap enables the vmcache design to exploit modern storage devices. However, let us emphasize here that we see the main benefits of vmcache as qualitative rather than quantitative, as we summarized in Table 1. Specifically, despite being easy to implement, vmcache supports arbitrary (graph) data and variable-size pages.

5.1 Experimental Setup
Implementation. Our buffer manager is implemented in C++ and uses a B+tree with variable-size keys/payloads and optimistic lock coupling. We compare two variants: (1) vmcache uses the regular and unmodified OS primitives described in Section 3. (2) vmcache+exmap is based on the vmcache code, except that it uses the exmap kernel module and the interface proposed in Section 4. Both variants use 4 KB pages and perform reads through the blocking pread system call. Therefore, there is at most one outstanding read I/O operation per thread. Dirty pages are written in batches of up to 64 pages using libaio. vmcache frees those pages individually with madvise, while vmcache+exmap batches them into a single EXMAP_FREE call. Page allocations are not batched to avoid increasing latencies (we use one EXMAP_ALLOC call per allocation).
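The eviction path of the plain vmcache variant can be sketched with standard system calls only (a simplified illustration: pwrite stands in for the batched libaio write-back, and the arena layout is our assumption):

    // Sketch of the vmcache eviction path: write dirty pages back (pwrite here
    // instead of the batched libaio writes used in the implementation), then
    // release the physical memory page by page with madvise(MADV_DONTNEED).
    // The virtual addresses stay valid; only the backing memory is dropped.
    // The exmap variant batches the frees into a single call instead.
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr size_t pageSize = 4096;

    void evictPages(int fd, char* vm, const std::vector<uint64_t>& pageIds,
                    const std::vector<bool>& isDirty) {
      for (size_t i = 0; i < pageIds.size(); i++) {
        char* virt = vm + pageIds[i] * pageSize;
        if (isDirty[i])  // write back before dropping the page
          pwrite(fd, virt, pageSize, pageIds[i] * pageSize);
        madvise(virt, pageSize, MADV_DONTNEED);  // free the physical page
      }
    }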
Competitors. We use three state-of-the-art open-source storage engines based on B+trees as competitors: (1) LeanStore [1], (2) WiredTiger 3.2.1 [5], and (3) LMDB 0.9.24 [2]. For caching, LeanStore and WiredTiger rely on pointer swizzling, whereas LMDB [2] uses mmap with out-of-place writes. Since the focus of this work is buffer management, in all systems we disable write-ahead logging and run in the lowest transactional isolation level offered. LMDB and LeanStore use 4 KB pages, whereas WiredTiger uses 32 KB pages for leaf nodes on storage. We configured LeanStore to use 8 page provider threads that handle page replacement [19], which resulted in the best performance.

Hardware, OS. We ran all experiments on a single-socket server with an AMD EPYC 7713 processor (64 cores, 128 hardware threads) and 512 GB main memory, of which we use 128 GB for caching. For storage, we use a 3.8 TB Samsung PM1733 SSD. The system is running unmodified Linux 5.16, except when we run vmcache+exmap, which uses our exmap kernel module.

Workloads. We use TPC-C as well as a key/value workload that consists of random point lookups, 8 byte uniformly-distributed keys, and 120 byte values. The two benchmarks are obviously very different from each other: TPC-C combines complex access patterns and is write-heavy, while the lookup benchmark is simple and read-only. Both are implemented as standalone C++ programs linked against the storage engines, i.e., there is no network overhead.

5.2 End-To-End In-Memory Comparison
vmcache performance and scalability. In the first experiment we investigate the performance and scalability in situations where the data set fits into main memory. The results are shown in Figure 6. The two vmcache approaches are faster than the other systems and scale very well – achieving almost 90 M lookups/s and around 3 M TPC-C transactions/s, respectively. Because no page eviction happens for in-memory workloads, we see that exmap does not offer major performance benefits over the basic vmcache design.

Competitor performance. LeanStore comes closest to vmcache in performance, while WiredTiger trails significantly. LMDB is competitive to LeanStore for the lookup benchmark but does not scale on the write-heavy TPC-C benchmark. This is because LMDB uses a single-writer model with out-of-place writes, which means that reads do not have to synchronize, but only a single writer is admitted at any point in time. Overall, the results show that the vmcache design has excellent scalability and high absolute performance.

5.3 End-To-End Out-of-Memory Comparison
Workload. Figure 7 shows the out-of-memory performance (upper plot) over time. In this experiment, the data sets are larger than the buffer pool by one order of magnitude, which means that page misses happen frequently. We start measuring right after loading the data for both workloads. Therefore, in all systems it takes some time for the performance to converge to the steady state because the buffer pool state needs to adjust to the switch from loading to the actual workload.

vmcache and exmap. For the random lookup benchmark, we see that exmap improves performance over basic vmcache by about 60%. This is caused by Linux scalability issues during page eviction. For TPC-C, the difference between vmcache and vmcache+exmap is small because even vmcache manages to become I/O bound. For both workloads, the exmap variant manages to become fully I/O bound, as is illustrated by the lower part of the figure.⁶

⁶We measured the I/O bound for this experiment using the fio benchmarking tool and 128 threads doing synchronous random I/O operations.
Figure 7: Out-of-memory performance and I/O statistics (128 GB buffer pool, 128 threads, random lookup: 5 B entries ≈ 1 TB, TPC-C: 5000 warehouses ≈ 1 TB)

LeanStore. When we compare LeanStore with vmcache and exmap, we see that vmcache is substantially slower than LeanStore for random lookups in steady state (again due to vmcache being bound by the kernel). Only by using the exmap module can it become competitive with LeanStore. Eventually, exmap+vmcache performs similarly to LeanStore and both become I/O bound in steady state. The performance differences are largely due to minor implementation differences: vmcache+exmap has slightly higher steady-state performance due to a more compact B-tree (less I/O per transaction), and LeanStore temporarily (40s to 90s) outperforms vmcache+exmap due to more aggressive dirty page eviction using dedicated background threads.

WiredTiger and LMDB. WiredTiger and the mmap-based LMDB are significantly slower than vmcache and LeanStore. The performance of WiredTiger suffers from the 32 KB page size, whereas LMDB is bound by kernel overhead (random lookups) and the single-writer model (TPC-C). Overall, we see that while basic vmcache offers solid out-of-memory performance, as the number of I/O operations per second increases it requires the help of exmap to unlock the full potential of fast storage devices.

5.4 vmcache Ablation Study
To better understand the performance of virtual-memory assisted buffer management and compare it against a hash table-based design, we evaluated page access time using a microbenchmark. We focus on the in-memory case, which is why all page accesses in this experiment are hits. For all designs, we read random 4 KB pages of main memory and report the average number of instructions, cache misses, and the access latency. We report numbers for 32 KB and 128 GB of data. The former corresponds to very hot CPU-cache resident pages and the latter to colder pages in DRAM.

Table 2: Random page access microbenchmark

                                       32 KB                         128 GB
  #                       instruc.  cache miss  time [ns]   instruc.  cache miss  time [ns]
  1    read                  3.0        0          1.6         3.3       1.0        219
  2.1  read (1 TB range)     3.0        0          1.6         3.3       1.0        235
  2.2  + page state          7.0        0          1.7         7.4       2.0        236
  2.3  + version check      10.0        0          1.8        10.4       2.0        236
  3    hash table           26.1        0         10.7        27.9       2.6        336

Line #1 in Table 2 simply shows the random access time in a 32 KB/128 GB array and therefore represents the lower bound for any buffer manager design. The next three lines incrementally show the steps (described in Section 3.1, Section 3.2, and Section 3.3) necessary in the vmcache design. In line #2.1, we randomly read from a virtual memory range of 1 TB (instead of 128 GB), which increases latency by 7% due to additional TLB pressure. In line #2.2, in addition to accessing the pages themselves, we also access the page state array as is required by the vmcache design. As mentioned in Section 3.6, this additional cache miss does not noticeably increase access latency because both memory accesses are independent and the CPU therefore performs them in parallel. In line #2.3, we also include the version validation, which results in the full vmcache page access logic. Overall, this experiment shows that a full optimistic read in vmcache incurs less than 8% overhead in comparison with a simple random memory read. We measured that an exclusive, uncontended page access (fix & unfix) on 128 GB of RAM takes 238 ns (not shown in the table). The last line in the table shows the performance of a hash table-based implementation based on open addressing. Even such a fast hash table results in substantially higher latencies because the page pointer is only obtained after the hash table lookup. Note that our hash table implementation is not synchronized, and the shown overhead is therefore actually a lower bound for the true cost of any hash table-based design.
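Lines #2.2 and #2.3 correspond to an optimistic page access along the lines of the following sketch (a simplification on our part; the actual page-state encoding and synchronization protocol are described in Section 3):

    // Sketch of the optimistic page access measured in lines #2.2 and #2.3:
    // translate the page identifier into a virtual address by arithmetic (no
    // hash table), read the per-page state word, access the page, and validate
    // that the version did not change. The state encoding is a simplification.
    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    constexpr size_t   pageSize  = 4096;
    constexpr uint64_t lockedBit = 1ull << 63;

    struct BufferManager {
      char* vm;                          // large virtual-memory area ("surface")
      std::atomic<uint64_t>* pageState;  // one state word per page

      template <typename Fn>
      auto optimisticRead(uint64_t pid, Fn fn) {
        while (true) {
          uint64_t preVersion = pageState[pid].load();  // independent cache miss (#2.2)
          if (preVersion & lockedBit) continue;         // writer active: retry
          auto result = fn(vm + pid * pageSize);        // access the page itself
          if (pageState[pid].load() == preVersion)      // version check (#2.3)
            return result;
        }
      }
    };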
5.5 exmap Allocation Performance
Allocation benchmark. The end-to-end results presented so far have shown that exmap is more efficient than the standard Linux page table manipulation primitives. However, because we were I/O bound, we have yet to evaluate how fast exmap actually is. To quantify the performance of exmap, we used similar allocation benchmark scenarios as in Figure 3, i.e., we constantly allocate and free pages in batches.

Baselines. The results are shown in Figure 8. For these, we always use batched allocations/evictions of 512 individual 4 KB pages. As a baseline, we use process_madvise with TLB batching, which already requires kernel changes. For reference, we also show the maximal DRAM read rate, which we achieved using the pmbw benchmarking tool and 64 threads (144.56 GiB/s). If the OS provides memory faster than this threshold, we can be sure that memory allocation will not be the bottleneck.

Page stealing scenarios. exmap uses page stealing, and its performance therefore depends on the specific inter-thread allocation pattern.
We therefore investigate three workload scenarios at different degrees of page stealing: For exmap (1 IF), no stealing occurs as each thread allocates 512 pages and then evicts them again at the same interface. exmap (2 IF) is like 1 IF but each thread has two interfaces, one for allocations and one for evicting pages. Due to a large enough memory pool (1 GB), stealing rarely occurs and, when it does, we regularly steal more than 512 pages, but eventually each page must be stolen once per allocation. For exmap (Pool), half of the threads allocate pages as fast as possible while the other half free those pages again. Here, the memory pool is always close to depletion and stealing happens frequently but often does not return more than 512 pages. Thereby, this scenario is the most challenging workload for exmap.
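For reference, one thread's allocate/free loop in the simplest off-the-shelf Linux variant can be sketched as follows (our own illustration; the process_madvise baseline additionally batches the frees, and the exmap variants replace both steps, including the per-page TLB shootdowns, with batched calls to the kernel module):

    // Sketch of one thread's allocate/free loop using only stock Linux
    // primitives: fault in a batch of 512 pages by touching them, then return
    // the physical memory with madvise(MADV_DONTNEED).
    #include <sys/mman.h>
    #include <cstddef>
    #include <cstdio>

    constexpr size_t pageSize = 4096;
    constexpr size_t batch    = 512;

    int main() {
      void* mem = mmap(nullptr, batch * pageSize, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (mem == MAP_FAILED) { perror("mmap"); return 1; }
      char* area = static_cast<char*>(mem);

      for (unsigned iter = 0; iter < 100000; iter++) {
        for (size_t i = 0; i < batch; i++)
          area[i * pageSize] = 1;                        // allocate: one fault per page
        madvise(area, batch * pageSize, MADV_DONTNEED);  // free the whole batch
      }
      return 0;
    }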
Results. Figure 8 shows that exmap outperforms the current state of the art in Linux significantly in all scenarios, and we reach up to 301M OP/s, which is equivalent to providing memory at 1,150 GiB/s and way beyond current DRAM speeds. We also see that page stealing has a moderate effect in the low-memory-pressure scenario (2 IF), while a high memory pressure (Pool) reduces the rate by 73 percent. This also demonstrates the success of our interface-local free lists and suggests that applications should try to roughly balance out their allocations and frees at each interface for optimal performance.

Figure 8: Linux memory allocation performance with exmap. The three exmap lines show different page stealing scenarios (1 IF: no stealing, 2 IF: pair-wise stealing, pool: stealing across all threads)

5.6 exmap Read I/O Performance
I/O libraries. Both vmcache and exmap support both synchronous (pread) and asynchronous I/O (io_uring and libaio). With asynchronous I/O, one can achieve high I/O rates using fewer threads. In the final experiment, we quantify the read throughput of different user-space I/O strategies. For this, N threads randomly read 4 KB blocks in O_DIRECT mode from our Samsung PM1733 SSD with the synchronous pread system call and via Linux' modern asynchronous io_uring interface. The target memory is either vmcache, exmap, or thread-local fixed buffers (FB). As these FBs have a fixed address, this variant does not include VM-manipulation overheads and marks the upper bound of Linux' I/O subsystem for the respective system-call interface. For the io_uring variant, we use thread-local submission queues and allow each thread to have 256 outstanding in-flight operations. We submit each read as an individual operation and do not use exmap's scattered and vectorized read capability. For vmcache, we use process_madvise with TLB batching for eviction, and for exmap, we read and evict at the same exmap interface. Since the SSD handles up to 128 parallel requests and has a maximum random-read throughput of 6 GiB/s, we are interested in which strategy can saturate it and how many threads it requires for this.
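With liburing, the asynchronous setup described here looks roughly like the following single-thread sketch (queue handling only; the actual integration with vmcache, exmap, or the fixed buffers is omitted, and all names besides the liburing calls are ours):

    // Sketch of one thread's random-read loop with liburing: keep up to 256
    // reads in flight, submit them individually, and reap completions as they
    // arrive. Error handling and the buffer-pool integration are omitted.
    #include <liburing.h>
    #include <cstddef>
    #include <cstdint>
    #include <cstdlib>

    constexpr size_t   pageSize   = 4096;
    constexpr unsigned queueDepth = 256;

    void randomReadLoop(int fd, char* buffers, uint64_t numBlocks, uint64_t numReads) {
      io_uring ring;
      io_uring_queue_init(queueDepth, &ring, 0);

      uint64_t submitted = 0, completed = 0;
      while (completed < numReads) {
        // top up the submission queue until the in-flight limit is reached
        while (submitted - completed < queueDepth && submitted < numReads) {
          io_uring_sqe* sqe = io_uring_get_sqe(&ring);
          if (!sqe) break;  // submission ring full
          uint64_t block = static_cast<uint64_t>(rand()) % numBlocks;
          char* dst = buffers + (submitted % queueDepth) * pageSize;
          io_uring_prep_read(sqe, fd, dst, pageSize, block * pageSize);
          submitted++;
        }
        io_uring_submit(&ring);

        io_uring_cqe* cqe;
        io_uring_wait_cqe(&ring, &cqe);  // wait for at least one completion
        io_uring_cqe_seen(&ring, cqe);
        completed++;
      }
      io_uring_queue_exit(&ring);
    }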
Figure 9: Read performance for synchronous (pread) and asynchronous (uring) I/O operations. Both vmcache and exmap support asynchronous I/O using uring, which allows achieving full I/O bandwidth using a few threads

I/O performance. In Figure 9, we see that the pread variants, where each thread has at most one read operation in flight, cannot saturate the SSD. Nevertheless, both vmcache and exmap closely follow the throughput of the fixed-buffer variant, and we can conclude that our vmcache concept is not the limiting factor here. When using io_uring, where a single thread could already submit enough parallel reads to theoretically saturate the SSD, all three variants reach the maximum of 6 GiB/s at some point. With fixed buffers, 3 threads already saturate the SSD with 1.58 MIOP/s random reads. When using the regular Linux system-call interface to implement a vmcache, we require 11 threads to reach the same level. With a single thread, we reach 40 percent of the fixed-buffer performance. Even better, with exmap and io_uring, we only require 4 threads to reach 6 GiB/s and with three threads it is already at 96 percent. With a single thread, exmap achieves 66 percent of the single-threaded FB variant. We thus argue that in the modern hardware landscape in which multiple SSDs can be used, exmap is a perfect fit for buffer management. Both vmcache and exmap work with off-the-shelf asynchronous I/O in Linux. Furthermore, exmap minimizes virtual memory overhead and follows the performance of the upper-bound implementation (FB) very closely.

5.7 exmap Ablation Study
VM optimizations. Let us now quantify the impact of the exmap optimizations we presented in Section 4.4. For this, we perform an ablation study that is representative for scenarios with a high VM turn-over rate. The left-hand side of Figure 10 shows how the individual techniques contribute to exmap's performance.
For the most basic variant (see Figure 10), exmap has a similar performance to madvise. However, for this variant, TLB shootdowns make up 98.16 percent of the total CPU time, which explains why TLB batching has a significantly higher impact on the end-to-end performance. With TLB batching, exmap reaches 92.34M OP/s. Our next optimization is to add the private scalable page allocator. With TLB batching and the private memory pool, exmap reaches 267.32M OP/s. Finally, we enable the lock-free page-table manipulation, which further speeds up random page-sized surface manipulations, and with this we reach 286.38M OP/s (see Figure 10). With our final variant of exmap, we outperform off-the-shelf madvise by a factor of ≈ 190.

Figure 10: Impact of techniques on allocation (left) and read performance from null_blk (right).

I/O optimization. Our I/O-integration techniques contribute to exmap's performance as well. To quantify their contributions, we measure the read performance from a null_blk device (irqmode=0, queue_mode=0) onto an exmap surface. Due to scalability issues of the null_blk driver, we only show single-threaded performance. From the baseline, where reads are issued individually and provoke one page fault each, we first pre-populate the surface in batches of 512 pages and achieve 67 % more reads. Finally, by combining allocation with the actual I/O request through the proxy file descriptor, we gain another 35 % with batches of 512 scattered reads.

6 RELATED WORK
We already described prior work on buffer management in Section 2, so let us now discuss related work on virtual memory and operating systems.

Exploiting VM in DBMS. Besides caching, virtual memory manipulation has also been shown to be useful in other database use cases such as query processing [37], and for implementing dynamic data structures such as Packed Memory Arrays [26]. In multi-threaded situations, these applications may run into kernel scalability issues and would therefore likely benefit from the optimizations we propose in Section 4.

DBMS/OS co-design. Let us mention two recent DBMS/OS co-design projects. MxKernel [3] is a runtime system [32] for data-intensive systems on many-core CPUs. The focus of DBOS [38] is on cloud orchestration (i.e., managing and coordinating multiple instances) and on using database concepts and systems to simplify this task. Again, a technique like exmap is orthogonal to both designs and could be exploited by them.

Optimizing TLB shootdowns. The operating systems community has identified TLB shootdowns as a major performance problem and has proposed several techniques, including batching, for mitigating them [6, 7, 22]. exmap uses the same batching idea to speed up VM manipulation.

Incremental VM improvements. Existing work on improving the Linux VM subsystem can be split into two general categories: (1) speed up the existing infrastructure and (2) provide new VM management systems. For the first, Song et al. [39] modify the allocation strategy in the page fault handler. Freed pages are saved in application-local lists instead of being directly returned to the system, which enables the recycling of pages within an application. With exmaps, we extend this to explicitly-addressed free lists to avoid contention within the allocation path. Additionally, they batch write-back operations to mitigate the overhead of the write I/O path. Choi et al. [10] cache removed VMAs for future use instead of deleting them immediately on munmap. They also extend the memory hinting system of madvise, adding new functionality like asynchronous map-ahead. Overall, the speedups of both these incremental approaches are limited because of the complex and general nature of the Linux VM subsystem. Another bottleneck of Linux' VM system is the management of the VMA list, which is stored in a lock-protected red-black tree. Bonsai [11] uses an RCU-based binary tree to provide lock-free page faults. In follow-up work, RadixVM [12] speeds up mapping operations in non-overlapping address ranges. As exmap and vmcache only use a single long-living VMA and memory is not implicitly allocated through page faults, we do not expect significant speedups, although they are orthogonal to our approach.

New VM subsystems. An alternative to incremental changes is to develop a specialized Linux VM subsystem. In UMap [35], memory-mapped I/O is handled entirely in user-space using userfaultfd. With memory hints for prefetching, caching and evicting, as well as configurable page sizes, they achieve a speedup of up to 2.5 times compared to unmodified Linux. UMap, similar to our exmap approach, manages separate regions that bypass the memory management of Linux. The approach also gives the application more control by providing configurable thresholds to influence the eviction strategy. Unlike vmcache, however, the kernel still controls page eviction. Furthermore, user-level page-fault handling introduces system-call overheads that run counter to the goal of improving VM speeds. Papagiannis et al. identify bottlenecks in Linux's VM system and propose FastMap [34] as an mmap alternative for implicit memory-mapped file I/O. They alleviate lock contention through per-core free page lists as well as separate clean and dirty page trees. They also identify TLB invalidation as a limiting factor to scalability, which they also solve via batched TLB shootdowns. Overall, their implementation is up to 5 times faster than unmodified Linux, and provides up to 11.8 times more random IOPS. Though significantly faster than Linux' mmap, both UMap and FastMap offer no explicit control over page eviction, which makes them unattractive for database systems.
7 SUMMARY
vmcache. In this paper, we propose virtual-memory assisted, but DBMS-controlled buffer management. By exploiting virtual memory, vmcache is not only fast and scalable, but is also easy to implement, enables variable-size pages, and supports graph data. The basic vmcache design only relies on widely-available OS features and is therefore portable. This combination of features makes vmcache applicable to a wide variety of data management systems. vmcache is available at https://github.com/viktorleis/vmcache.

exmap. With fast storage devices, the page table manipulation primitives that vmcache relies on can become a performance bottleneck. To solve this problem, we propose exmap, a specialized OS interface for page table manipulation. We implemented exmap as a Linux kernel module that is highly efficient and scalable. When one combines vmcache with exmap, one can fully exploit even very fast storage devices. exmap is available at https://github.com/tuhhosg/exmap.

ACKNOWLEDGMENTS
The roots of this project lie in discussions at Dagstuhl Seminar 21283 "Data Structures for Modern Memory and Storage Hierarchies". This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – 447457559, 468988364, 501887536.

REFERENCES
[1] 2022. LeanStore - A High-Performance Storage Engine for Modern Hardware. https://leanstore.io/.
[2] 2022. Lightning Memory-Mapped Database Manager (LMDB). http://www.lmdb.tech/doc/.
[3] 2022. MxKernel - A Bare-Metal Runtime System for Database Operations on Heterogeneous Many-Core Hardware. https://ess.cs.uos.de/research/projects/MxKernel/.
[4] 2022. Samsung PCIe Gen 4-enabled PM1733 SSD. https://semiconductor.samsung.com/ssd/enterprise-ssd/pm1733-pm1735/mzwlj3t8hbls-00007/.
[5] 2022. WiredTiger Storage Engine. https://docs.mongodb.com/manual/core/wiredtiger/.
[6] Nadav Amit. 2017. Optimizing the TLB Shootdown Algorithm with Page Access Tracking. In USENIX ATC. 27–39.
[7] Nadav Amit, Amy Tai, and Michael Wei. 2020. Don't shoot down TLB shootdowns!. In EuroSys. 1–14.
[8] Lawrence Benson, Hendrik Makait, and Tilmann Rabl. 2021. Viper: An Efficient Hybrid PMem-DRAM Key-Value Store. PVLDB 14, 9 (2021), 1544–1556.
[9] Jan Böttcher, Viktor Leis, Jana Giceva, Thomas Neumann, and Alfons Kemper. 2020. Scalable and robust latches for database systems. In DaMoN.
[10] Jungsik Choi, Jiwon Kim, and Hwansoo Han. 2017. Efficient Memory Mapped File I/O for In-Memory File Systems. In HotStorage.
[11] Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich. 2012. Scalable Address Spaces Using RCU Balanced Trees. In ASPLOS. 199–210.
[12] Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich. 2013. RadixVM: Scalable Address Spaces for Multithreaded Applications. In EuroSys. 211–224.
[13] Andrew Crotty, Viktor Leis, and Andrew Pavlo. 2022. Are You Sure You Want to Use MMAP in Your Database Management System?. In CIDR.
[14] Dominik Durner, Viktor Leis, and Thomas Neumann. 2019. Experimental Study of Memory Allocation for High-Performance Query Processing. In ADMS. 1–9.
[15] Wolfgang Effelsberg and Theo Härder. 1984. Principles of Database Buffer Management. ACM Trans. Database Syst. 9, 4 (1984).
[16] Goetz Graefe, Haris Volos, Hideaki Kimura, Harumi A. Kuno, Joseph Tucek, Mark Lillibridge, and Alistair C. Veitch. 2014. In-Memory Performance for Big Data. PVLDB 8, 1 (2014), 37–48.
[17] Brendan Gregg. 2016. The flame graph. Commun. ACM 59, 6 (2016), 48–57.
[18] Gabriel Haas, Michael Haubenschild, and Viktor Leis. 2020. Exploiting Directly-Attached NVMe Arrays in DBMS. In CIDR.
[19] Michael Haubenschild, Caetano Sauer, Thomas Neumann, and Viktor Leis. 2020. Rethinking Logging, Checkpoints, and Recovery for High-Performance Storage Engines. In SIGMOD. 877–892.
[20] Alfons Kemper and Donald Kossmann. 1995. Adaptable Pointer Swizzling Strategies in Object Bases: Design, Realization, and Quantitative Analysis. VLDB Journal 4, 3 (1995), 519–566.
[21] Hideaki Kimura. 2015. FOEDUS: OLTP Engine for a Thousand Cores and NVRAM. In SIGMOD. 691–706.
[22] Mohan Kumar, Steffen Maass, Sanidhya Kashyap, Ján Veselý, Zi Yan, Taesoo Kim, Abhishek Bhattacharjee, and Tushar Krishna. 2018. LATR: Lazy Translation Coherence. In ASPLOS. 651–664.
[23] Viktor Leis, Michael Haubenschild, Alfons Kemper, and Thomas Neumann. 2018. LeanStore: In-Memory Data Management beyond Main Memory. In ICDE. 185–196.
[24] Viktor Leis, Michael Haubenschild, and Thomas Neumann. 2019. Optimistic Lock Coupling: A Scalable and Efficient General-Purpose Synchronization Method. IEEE Data Eng. Bull. 42, 1 (2019), 73–84.
[25] Viktor Leis, Florian Scheibner, Alfons Kemper, and Thomas Neumann. 2016. The ART of practical synchronization. In DaMoN.
[26] Dean De Leo and Peter A. Boncz. 2019. Packed Memory Arrays - Rewired. In ICDE. 830–841.
[27] Justin J. Levandoski, David B. Lomet, and Sudipta Sengupta. 2013. The Bw-Tree: A B-tree for new hardware platforms. In ICDE. 302–313.
[28] Gang Liu, Leying Chen, and Shimin Chen. 2021. Zen: a High-Throughput Log-Free OLTP Engine for Non-Volatile Main Memory. PVLDB 14, 5 (2021), 835–848.
[29] David B. Lomet. 2019. Data Caching Systems Win the Cost/Performance Game. IEEE Data Eng. Bull. 42, 1 (2019), 3–5.
[30] Yandong Mao, Eddie Kohler, and Robert Tappan Morris. 2012. Cache craftiness for fast multicore key-value storage. In EuroSys. 183–196.
[31] Maged M. Michael. 2004. Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects. IEEE Trans. Parallel Distributed Syst. 15, 6 (2004), 491–504.
[32] Jan Mühlig and Jens Teubner. 2021. MxTasks: How to Make Efficient Synchronization and Prefetching Easy. In SIGMOD. 1331–1344.
[33] Thomas Neumann and Michael Freitag. 2020. Umbra: A Disk-Based System with In-Memory Performance. In CIDR.
[34] Anastasios Papagiannis, Giorgos Xanthakis, Giorgos Saloustros, Manolis Marazakis, and Angelos Bilas. 2020. Optimizing Memory-mapped I/O for Fast Storage Devices. In USENIX ATC. 813–827.
[35] Ivy Peng, Marty McFadden, Eric Green, Keita Iwabuchi, Kai Wu, Dong Li, Roger Pearce, and Maya Gokhale. 2019. UMap: Enabling application-driven optimizations for page management. In IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC). 71–78.
[36] Filip Pizlo. 2016. Locking in WebKit. https://webkit.org/blog/6161/locking-in-webkit/.
[37] Felix Martin Schuhknecht, Jens Dittrich, and Ankur Sharma. 2016. RUMA has it: Rewired User-space Memory Access is Possible! PVLDB 9, 10 (2016), 768–779.
[38] Athinagoras Skiadopoulos, Qian Li, Peter Kraft, Kostis Kaffes, Daniel Hong, Shana Mathew, David Bestor, Michael J. Cafarella, Vijay Gadepally, Goetz Graefe, Jeremy Kepner, Christos Kozyrakis, Tim Kraska, Michael Stonebraker, Lalith Suresh, and Matei Zaharia. 2021. DBOS: A DBMS-oriented Operating System. PVLDB 15, 1 (2021), 21–30.
[39] Nae Young Song, Yongseok Son, Hyuck Han, and Heon Young Yeom. 2016. Efficient memory-mapped I/O on fast storage device. ACM Transactions on Storage (TOS) 12, 4 (2016), 1–27.
[40] Michael Stonebraker. 1981. Operating System Support for Database Management. Commun. ACM 24, 7 (1981), 412–418.
[41] Alexander van Renen, Viktor Leis, Alfons Kemper, Thomas Neumann, Takushi Hashida, Kazuichi Oe, Yoshiyasu Doi, Lilian Harada, and Mitsuru Sato. 2018. Managing Non-Volatile Memory in Database Systems. In SIGMOD. 1541–1555.
[42] Ziqi Wang, Andrew Pavlo, Hyeontaek Lim, Viktor Leis, Huanchen Zhang, Michael Kaminsky, and David G. Andersen. 2018. Building a Bw-Tree Takes More Than Just Buzz Words. In SIGMOD. 473–488.
[43] Xinjing Zhou, Joy Arulraj, Andrew Pavlo, and David Cohen. 2021. Spitfire: A Three-Tier Buffer Manager for Volatile and Non-Volatile Memory. In SIGMOD. 2195–2207.
