Vmcache

ABSTRACT

Most database management systems cache pages from storage in a main memory buffer pool. To do this, they either rely on a hash table that translates page identifiers into pointers, or on pointer swizzling, which avoids this translation. In this work, we propose vmcache, a buffer manager design that instead uses hardware-supported virtual memory to translate page identifiers to virtual memory addresses. In contrast to existing mmap-based approaches, the DBMS retains control over page faulting and eviction. Our design is portable across modern operating systems, supports arbitrary graph data, enables variable-sized pages, and is easy to implement. One downside of relying on virtual memory is that with fast storage devices the existing operating system primitives for manipulating the page table can become a performance bottleneck. As a second contribution, we therefore propose exmap, which implements scalable page table manipulation on Linux. Together, vmcache and exmap provide flexible, efficient, and scalable buffer management on multi-core CPUs and fast storage devices.

CCS CONCEPTS

• Information systems → Data management systems; Record and buffer management.

KEYWORDS

Database Management Systems; Operating Systems; Caching; Buffer Management

ACM Reference Format:
Viktor Leis, Adnan Alhomssi, Tobias Ziegler, Yannick Loeck, and Christian Dietrich. 2023. Virtual-Memory Assisted Buffer Management: Preprint accepted for publication at SIGMOD 2023. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 14 pages.

1 INTRODUCTION

DBMS vs. OS. Database management systems (DBMS) and operating systems (OS) have always had an uneasy relationship. OSs provide process isolation by virtualizing hardware access, whereas DBMSs want full control over hardware for optimal efficiency. At the same time, OSs offer services (e.g., caching pages from storage) that are almost exactly what database systems require – but for performance and semantic reasons, DBMSs often re-implement this functionality. The mismatch between the services offered by operating systems and the requirements of database systems was raised four decades ago [40], and the situation has not improved much since then.

OS-controlled caching. The big advantage the OS has over a DBMS is that it runs in kernel mode and therefore has access to privileged instructions. In particular, the OS has direct control over the virtual memory page table, and can therefore do things user space processes cannot. For example, using virtual memory and the memory management unit (MMU) of the processor, the OS implements transparent page caching and exposes this by mapping storage into virtual memory through the mmap system call. With mmap, in-memory operations (cache hits) are fast, thanks to the Translation Lookaside Buffer (TLB). Nevertheless, as Crotty et al. [13] recently discussed, mmap is generally not a good fit for database systems. Two major problems of mmap are that (1) the DBMS loses control over page faulting and eviction, and that (2) the virtual memory implementation in Linux is too slow for modern NVMe SSDs [13]. The properties of mmap and alternative buffer manager designs are summarized in Table 1.

DBMS-controlled caching. In order to have full control, most DBMSs therefore avoid file-backed mmap and implement explicit buffer management in user space. Traditionally, this has been done using a hash table that contains all pages that are currently in cache [15]. Recent, more efficient buffer manager designs rely on pointer swizzling [16, 23, 33]. Both approaches have downsides: the former has non-trivial hash table translation overhead, and the latter is more difficult to implement and does not support cyclical page references (e.g., graph data). Rather than compromising on either the performance or the functionality benefits of translation, this work proposes hardware-supported virtual memory as a fundamental building block of buffer management.
Contribution 1: vmcache. The first contribution of this paper is vmcache, a novel buffer pool design that relies on virtual memory, but retains control over faulting and eviction within the DBMS, unlike solutions based on file-backed mmap. The key idea is to map the storage device into anonymous (rather than file-backed) virtual memory and use the MADV_DONTNEED hint to explicitly control eviction. This enables fast in-memory page accesses through TLB-supported translations without handing control to the OS. Page-table-based translation also allows vmcache to support arbitrary graph data and variable-sized pages.

Contribution 2: exmap. While vmcache has excellent in-memory performance, every page fault and eviction involves manipulating the page table. Unfortunately, existing OS page table manipulation primitives have scalability problems that become visible with high-performance NVMe SSDs [13]. Therefore, as a second contribution, we propose exmap, an OS extension for efficiently manipulating virtual memory mappings. exmap is implemented as a Linux kernel module and is an example of DBMS/OS co-design. By providing new OS-level abstractions, we simplify and accelerate data-processing systems. Overall, as Table 1 shows, combining exmap with vmcache results in a design that is not only fast (in-memory and out-of-memory) but also offers important functionality.

Table 1: Conceptual comparison of buffer manager designs

            mmap [13]   tradi. [15]   pointer swiz. [16, 23]   Umbra [33]   vmcache (Sec. 3)   +exmap (Sec. 4)
transl.     page tbl.   hash tbl.     invasive                 invasive     page tbl.          page tbl.
control     OS          DBMS          DBMS                     DBMS         DBMS               DBMS
var. size   easy        hard          hard                     med. (*)     easy               easy
graphs      yes         yes           no                       no           yes                yes
implem.     med. (**)   easy          hard                     hard         easy               easy
in-mem.     fast        slow          fast                     fast         fast               fast
out-mem.    slow        fast          fast                     fast         med.               fast

(*) only powers of 2 [33]   (**) read-only easy, transactions hard [13]

2 BACKGROUND: DATABASE PAGE CACHING

Buffer management. Most DBMSs cache fixed-size pages (usually 4-64 KB) from secondary storage in a main memory pool. The basic problem of such a cache is to efficiently translate a page identifier (PID), which uniquely determines the physical location of each page on secondary storage, into a pointer to the cached data content. In the following, we describe known ways of doing that, including the six designs shown in Table 1.

Figure 1: Buffer pool page translation schemes. Example with 6 pages on storage (P0-P5), 3 of which are cached (P1, P3, P5): (a) hash table, (b) invasive (pointer swizzling), (c) page table (virtual memory).

Hash table-based translation. Figure 1a illustrates the traditional way [15] of implementing a buffer pool: a hash table indexes all cached pages by their PID. A page is addressed using its PID, which always involves a hash-table lookup. On a miss, the page is read from secondary storage and added to the hash table. This approach is simple and flexible. The hash table is the single source of truth of the caching state, and pages can reference each other arbitrarily through PIDs. The downside is suboptimal in-memory performance, as even cache hits have to pay the hash table lookup cost. Also note that there are two levels of translation: from PID to virtual memory pointer (at the DBMS level), and from virtual memory pointer to physical memory pointer (at the OS/MMU level).

Main-memory DBMS. One way to avoid the overhead of traditional buffer managers is to forego caching altogether and keep all data in main memory. While pure in-memory database systems can be very fast, in the past decade DRAM prices have almost stopped decreasing [18]. Storage in the form of NVMe flash SSDs, on the other hand, has become cheap (20-50× cheaper per byte than DRAM [18]) and fast (>1 million random 4 KB reads per second per SSD [4]). This makes pure in-memory systems economically unattractive [29], and implies that modern storage engines should combine DRAM and SSD. The challenge is supporting very large data sets on NVMe SSDs with their high I/O throughput while making cache hits almost as fast as in main-memory systems.

Pointer swizzling (invasive translation). An efficient technique for implementing buffer managers is pointer swizzling. The technique was originally proposed for object-oriented DBMSs [20], but has recently been applied to several high-performance storage engines [16, 23, 33]. As Figure 1b illustrates, the idea is to replace the PID of a cached page with its virtual memory pointer within the data structure. Page hits can therefore directly dereference a pointer instead of having to translate it through a hash table first. One way to think about this is that pointer swizzling gets rid of explicit hash table-based translation by invasively modifying the data structure itself. Pointer swizzling offers very good in-memory performance. However, it requires adaptations for every buffer-managed data structure, and its internal synchronization is quite intricate. For example, to unswizzle a page, one needs to find and lock its parent, and storing a parent pointer on each node presents synchronization challenges during node splits. Another downside is that pointer swizzling-based systems generally do not support having more than one incoming reference to any particular page. In other words, only tree data structures are directly supported. Graph data, next pointers in B+tree leaf pages, and multiple incoming tuple references (e.g., from secondary indexes) require inelegant and sometimes inefficient workarounds.

Hardware-supported page translation. Traditional buffer managers and pointer swizzling present an unsatisfactory and seemingly inescapable tradeoff: either one pays the performance cost of the hash table indirection, or one loses the ability to support graph-like data. Instead of getting rid of the translation (as pointer swizzling does), another way of achieving efficiency is to make PID-to-pointer translation efficient through hardware support. All modern operating systems use virtual memory and, together with hardware support from the CPU, transparently translate virtual to physical addresses. Page table entries are cached within the CPU, in particular the TLB, which makes virtual memory translation fast. Figure 1c shows how hardware-supported page translation can be used for caching pages from secondary storage.

OS-driven caching with file-backed mmap. Unix offers the mmap system call to access storage via virtual memory. After mapping a file or device into virtual memory, a memory access will trigger a page fault. The OS will then install that page in the page table, making succeeding page accesses as fast as ordinary memory accesses. Some systems therefore eschew implementing a buffer pool and instead rely on the OS page cache by mmaping the database file/device. While this approach makes cache hits very fast, it has major problems that were recently analyzed by Crotty et al. [13]: (1) Ensuring transactional safety is difficult and potentially inefficient because the DBMS loses control over eviction. (2) There is no interface for asynchronous I/O, and I/O stalls are unpredictable. (3) I/O error handling is cumbersome. (4) OS-implemented page faulting and eviction is too slow to fully exploit modern NVMe storage devices.
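For concreteness, the file-backed approach boils down to a few lines of code. The following sketch is ours (not taken from the paper); the file name, dbSize, and pageSize are placeholder names, and error handling as well as the <fcntl.h>/<sys/mman.h> includes are omitted:

int fd = open("database.file", O_RDWR);
char* db = (char*)mmap(0, dbSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
// a plain memory access; on a cache miss the OS transparently faults the page in
char firstByte = db[3 * pageSize]; // access page P3

All of the downsides listed above follow from the fact that, after these two calls, paging decisions are made by the kernel rather than by the DBMS.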
The lack of control over eviction for file-backed mmap approaches is a fundamental problem. Notably, it prevents the implementation of ARIES-style transactions. ARIES uses in-place writes and prevents the eviction of a dirty page before its corresponding log entry is flushed – impossible with existing OS interfaces [13]. Without explicit control over eviction, it is also impossible to implement DBMS-optimized page replacement algorithms. Thus, one is at the whim of whatever algorithm the currently used OS implements, which is unlikely to be optimized for DBMS workloads.

DBMS-driven, virtual-memory assisted caching. While OS-managed caching using mmap may not be a good solution for most DBMSs, the OS has one big advantage: instead of having to use an explicit hash table for page translation, it can rely on hardware support (the TLB) for page translation. This raises the following question: Is it possible to exploit the virtual memory subsystem without losing control over eviction and page fault handling? One contribution of this paper is to answer this question affirmatively. In Section 3, we describe how widely-supported OS features (anonymous memory and the MADV_DONTNEED hint) can be exploited to implement hardware-supported page translation while retaining full control over faulting and eviction within the DBMS.

Variable-sized pages. Besides making page translation fast, using a page table also makes implementing multiple page sizes much easier. Having dynamic page sizes is obviously very useful, e.g., for storing objects that are larger than one page [33]. Nevertheless, many buffer managers only support one particular page size (e.g., 4 KB) because multiple sizes lead to complex allocation and fragmentation issues. In these systems, larger objects need to be implemented by splitting them across pages, which complicates and slows down the code accessing such objects. With control over the page table, on the other hand, a larger (e.g., 12 KB) page can be created by mapping multiple (e.g., 3) non-contiguous physical pages to a contiguous virtual memory range. This is easy to implement within the OS and no fragmentation occurs in main memory. One system that allows multiple (albeit only power-of-two) page sizes is Umbra [33]. It implements this by allocating multiple buffer pool-sized virtual memory areas – one for each page size. To allocate a page of a particular size, one can simply fault the memory from that class. To free a page, the buffer manager uses the MADV_DONTNEED OS hint. This approach gets rid of fragmentation from different page sizes, but Umbra's page translation is still based on pointer swizzling rather than the page table. Umbra therefore inherits the disadvantages of pointer swizzling (difficult implementation, no graph data), while potentially encountering OS scalability issues.

Fast virtual memory manipulation. While OS-supported approaches offer very fast access to cached pages and enable variable-sized pages, they unfortunately may suffer from performance problems. One problem is that each CPU core has its own TLB, which can get out of sync with the page table¹. When the page table changes, the OS therefore generally has to interrupt all CPU cores and force them to invalidate their TLB ("TLB shootdown"). Another issue is that intra-kernel data structures can become the scalability bottleneck on systems with many cores. Crotty et al. [13] observed that because of these issues mmap can be slow in out-of-memory workloads. For random reads from one SSD, they measured that it achieves less than half the achievable I/O throughput. With sequential scans from ten SSDs, the gap between mmap and explicit asynchronous I/O is roughly 20×. Any virtual memory-based approach (including our basic vmcache design) will run into these kernel issues. Section 4 therefore describes a novel, specialized virtual memory subsystem for Linux called exmap, which solves these performance problems.

¹ The page table, which is an in-memory data structure, is itself coherent across CPU cores. However, a CPU core accessing memory caches virtual-to-physical translations in a per-core hardware cache called the TLB. If the page table is changed, the hardware does not automatically update or invalidate existing TLB entries.

Persistent memory. In this work, we focus on block storage rather than byte-addressable persistent memory, for which multiple specialized caching designs have been proposed [8, 21, 28, 41, 43].

3 VMCACHE: VIRTUAL-MEMORY ASSISTED BUFFER MANAGEMENT
The POSIX system call mmap usually maps a file or storage device into virtual memory, as is illustrated in Figure 1c. The advantage of file-backed mmap is that, due to hardware support for page translation, accessing cached pages becomes as fast as ordinary memory accesses. If the page translation is cached in the TLB and the data happens to be in the L1 cache, an access can take as little as 1 ns. The big downside is that the DBMS loses control over page faulting and eviction. If the page is not cached but resides on storage, dereferencing a pointer may suddenly take 10 ms because the OS will cause a page fault that is transparent to the DBMS. Thus, from the point of view of the DBMS, eviction and page faulting are totally unpredictable and can happen at any point in time. In this section, we describe vmcache, a buffer manager design that – like file-backed mmap – uses virtual memory to translate page identifiers into pointers (see Figure 1c). However, unlike mmap, in vmcache the DBMS retains control over page faults and eviction.

3.1 Page Table Manipulation

Setting up virtual memory. Like the file-backed mmap approach, vmcache allocates a virtual memory area with (at least) the same size as the backing storage. However, unlike with file-backed mmap, this allocation is not directly backed by storage. Such an "unbacked" allocation is called anonymous and, confusingly, is done through mmap as well, but using the MAP_ANONYMOUS flag:

int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE;
int prot = PROT_READ | PROT_WRITE;
char* virtMem = mmap(0, vmSize, prot, flags, -1, 0);

Note that no file descriptor has been specified here (the fifth argument, the file descriptor, is -1). Storage is handled explicitly and could be a file (multiple applications share one file system) or multiple block devices (in a RAID setup). Moreover, the allocation will initially not be backed by physical memory, which is important because storage capacity is usually much larger than main memory.

Adding pages to the cache. To add a page to the cache, the buffer manager explicitly reads it from storage to the corresponding position in virtual memory. For example, we can use the pread system call to explicitly read P3 as follows:

uint64_t offset = 3 * pageSize;
pread(fd, virtMem + offset, pageSize, offset);

Once pread completes, a physical memory page will be installed in the page table and the data becomes visible to the DBMS process. In contrast to mmap, which handles page misses transparently without involving the DBMS, with the vmcache approach the buffer manager controls I/O. For example, we can use either the synchronous pread system call or asynchronous I/O interfaces such as libaio or io_uring.

Removing pages from the cache. After mapping more and more pages, the buffer pool will eventually run out of physical memory, causing failing allocations or swapping. Before that happens, the DBMS needs to start evicting pages, which on Linux can be done as follows²:

madvise(virtMem + offset, pageSize, MADV_DONTNEED);

This call will remove the physical page from the page table and make its physical memory available for future allocations. If the page is dirty (i.e., has been changed), it first needs to be written back to storage, e.g., using pwrite:

pwrite(fd, virtMem + offset, pageSize, offset);

With the primitives described above, the DBMS can control all buffer management decisions: how to read pages, which pages to evict³, whether and when to write back a page, and when to remove a page from the page table.

² On Windows these primitives are available as VirtualAlloc(..., MEM_RESERVE, ...) and VirtualFree(..., MEM_RELEASE).
³ Strictly speaking, the OS could decide to evict vmcache pages – but this does not affect the correctness of our design. OS-triggered eviction can be prevented by disabling swapping or by mlocking the virtual memory range.

1  fix(uint64_t pid): // fix page exclusively
2    uint64_t ofs = pid * pageSize
3    while (true) // retry until success
4      PageState s = state[pid]
5      if (s.isEvicted())
6        if (state[pid].CAS(s, Locked))
7          pread(fd, virtMem+ofs, pageSize, ofs)
8          return virtMem+ofs // page miss
9      else if (s.isMarked() || s.isUnlocked())
10       if (state[pid].CAS(s, Locked))
11         return virtMem+ofs // page hit
12 unfix(uint64_t pid):
13   state[pid].setUnlocked()

Listing 1: Pseudo code for exclusive page access

3.2 Page States and Synchronization Basics

In terms of the buffer manager implementation, the most difficult aspect is synchronization, e.g., managing races to the same page. Buffer managers must not only use scalable synchronization internally, they should also provide efficient and scalable synchronization primitives to the upper DBMS layers. After all, most database data structures (e.g., relations, indexes) are stored on top of cacheable pages.

Buffer pool state. In a traditional buffer manager (see Figure 1a), the translation hash table is used as a single source of truth for the caching state. Because all accesses go through the hash table, synchronization is fairly straightforward (but usually not efficient). Our approach, in contrast, needs an additional data structure for synchronization because not all page accesses traverse the page table⁴ and because the page table cannot be directly manipulated from user space. Therefore, we allocate a contiguous array with as many page state entries as we have pages on storage, at corresponding positions (e.g., P0: Evicted, P1: Locked, P2: Evicted, P3: Unlocked, P4: Evicted).

⁴ If a page translation is cached in the TLB of a particular thread, the thread does not have to consult the page table.

Page states. After startup, all pages are in the Evicted state. Page access operations first check their state entry and proceed according to the following state transitions: an Evicted page becomes Locked on fix; a Locked page becomes Unlocked on unfix; an Unlocked page can be turned into an eviction candidate by setting it to Marked; a Marked or Unlocked page returns to Locked when it is fixed again; and a Marked page that is evicted goes back to Evicted.
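A natural representation of such a state entry is a single 64-bit word per page, which Section 3.3 below refines into 8 state bits plus a 56-bit version counter. The following C++ sketch is ours and only illustrative; the names, layout, and helper functions are assumptions, not the paper's implementation:

struct PageState { // one entry per page on storage; needs <atomic>, <cstdint>
   // low 8 bits: state (0 = Unlocked, 1-252 = LockedShared, 253 = Locked,
   // 254 = Marked, 255 = Evicted); high 56 bits: version counter
   std::atomic<uint64_t> word{255}; // all pages start out Evicted
   static uint64_t stateOf(uint64_t w)   { return w & 0xFF; }
   static uint64_t versionOf(uint64_t w) { return w >> 8; }
   bool CAS(uint64_t expected, uint64_t newState, uint64_t newVersion) {
      return word.compare_exchange_strong(expected, (newVersion << 8) | newState);
   }
};
// the state array is sized by storage capacity, not by DRAM:
// std::vector<PageState> state(storageSize / pageSize);

Because the state and the version share one word, every state transition used by Listing 1 and Listing 2 (locking, unlocking, marking, evicting) can be performed with a single compare-and-swap on that word.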
Listing 1 shows pseudo code for the fix and unfix operations, which provide exclusive page access. Suppose we have a page that is currently in the Evicted state (line 5 in the code). If a thread wants to access that page, it calls fix, which will transition to the Locked state using a compare-and-swap operation (line 6). The thread is then responsible for reading the page from storage and implicitly (via pread) installing it in the page table (line 7). After that, it can access the page itself and finally unfix it, which causes a transition to the Unlocked state (line 13). If another thread concurrently wants to fix the same page, it waits until it is unlocked. This serializes page misses and prevents the same page from being read multiple times. The fourth state, Marked, helps to implement a clock replacement strategy – though arbitrary other algorithms could be implemented as well. Cached pages are selected for eviction by setting their state to Marked. If the page is accessed, it transitions back to the Locked state, which clears the mark (line 10). Otherwise, the page can be evicted and eventually transitions to the Evicted state.

3.3 Advanced Synchronization

So far, we discussed how to lock pages exclusively. To enable scalable and efficient read operations, vmcache also provides shared locks (multiple concurrent readers on the same page) and optimistic (lock-free) reads.

Shared locks. To implement shared locks for read-only operations, we count the number of concurrent readers within the page state. If the page is not locked exclusively, read-only operations atomically increment/decrement that counter [9] when fixing/unfixing the page. Exclusive accesses have to wait until the counter is 0 before acquiring the lock.

Optimistic reads. Both exclusive and shared locks write to shared memory when acquiring or releasing the lock, which invalidates cache entries in other CPU cores. For tree data structures such as B-trees this results in suboptimal scalability, because the page states of inner nodes are constantly invalidated. An elegant alternative to locks are optimistic, lock-free page reads that validate whether the read was correct. To do that, locks contain an update version that is incremented whenever an exclusively locked page is unlocked [9, 25, 30]. We store this version counter together with the page state within the same 64-bit value, ensuring that both are always changed atomically. As the pseudo code in Listing 2 shows, an optimistic reader retrieves the state and, if it equals Unlocked (line 4 in the code), it reads from the page (line 5). After that we retrieve the page state again and make sure that the page is still not locked and that the version has not changed (line 6). If this check fails, the operation is restarted. Note that the version counter is incremented not just when a page changes but also when it is evicted. This is crucial for correctness and, for example, ensures that an optimistic read of a marked page that is evicted before validation will fail. To prevent starvation due to repeated restarts, it is also possible to fall back to pessimistic lock-based operations (not shown in the code). Finally, let us note that optimistic reads can be interleaved across multiple pages, enabling lock coupling-like synchronization of complex data structures like B-trees [24]. This approach has been shown to be highly scalable and to outperform lock-free data structures [42].

1  optimisticRead(uint64_t pid, Function fn):
2    while (true) // retry until success
3      PageState s = state[pid] // incl. version
4      if (s.isUnlocked())
5        fn(virtMem + (pid*pageSize)) // optimistic read
6        if (state[pid] == s) // validate version
7          return // success
8      else if (s.isMarked())
9        state[pid].CAS(s, Unlocked) // clear mark
10     else if (s.isEvicted())
11       fix(pid); unfix(pid) // handle page miss

Listing 2: Pseudo code for optimistic read

64-bit state entry. Overall, we use 64 bits for the page state, of which 8 bits encode the Unlocked (0), LockedShared (1-252), Locked (253), Marked (254), and Evicted (255) states. This leaves us with 56 bits for the version counter – which are enough to never overflow in practice. 64 bits are also a convenient size that allows atomic operations such as compare-and-swap (CAS).

Memory reclamation and optimistic reads. In general, lock-free data structures require special care when freeing memory [25, 27, 30]. Techniques such as epoch-based memory reclamation [30] or hazard pointers [31] have been proposed to address this problem. All these techniques incur overhead and may cause additional memory consumption due to unnecessarily long reclamation delays. Interestingly, vmcache – despite supporting optimistic reads – can sidestep these problems completely. Indeed, vmcache does not prevent the eviction/reclamation of a page that is currently read optimistically. However, this is not a problem because after the page is removed from the page table using the MADV_DONTNEED hint, it is replaced by the zero page. In that situation the optimistic read will proceed loading 0s from the page without crashing, and will detect that eviction occurred during the version check. (The check fails because eviction first locks and then unlocks the page, which increments the version.) Therefore, vmcache does not need any additional memory reclamation scheme.

Parking lot. To keep exclusive and shared locks from wasting CPU cycles and to ensure fairness under lock contention, one can use the Parking Lot [9, 36] technique. The key idea is that if a thread fails to acquire the lock (potentially after trying several times), it can "park" itself, which will block the thread until it is woken up by the thread holding the lock. Parking itself is implemented using a fixed-size hash table storing standard OS-supported condition variables [9]. Within the page state, we only need one additional bit that indicates whether there are threads currently waiting for that page lock to be released. The big advantage of parking lots is the very low space overhead per page, which is only 1 bit instead of 64 bytes for pthread (rw)locks [9].
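To make the shared-lock protocol described above concrete, the following sketch adds reader-counted fixShared/unfixShared operations on top of the 64-bit state word. It is our illustration, reusing the PageState helpers sketched in Section 3.2 and the fix/unfix, state, virtMem, and pageSize names from Listing 1; it is not the paper's implementation and omits parking:

char* fixShared(uint64_t pid) {
   while (true) {
      uint64_t s  = state[pid].word.load();
      uint64_t st = PageState::stateOf(s);
      if (st < 252) {                                   // Unlocked or LockedShared with room
         if (state[pid].word.compare_exchange_strong(s, s + 1)) // one more reader, same version
            return virtMem + pid * pageSize;
      } else if (st == 254) {                           // Marked: clear the mark, become first reader
         if (state[pid].word.compare_exchange_strong(s, (s & ~0xFFull) | 1))
            return virtMem + pid * pageSize;
      } else if (st == 255) {                           // Evicted: fault the page in, then retry
         fix(pid); unfix(pid);
      }                                                 // 253 (Locked): retry; real code would park
   }
}
void unfixShared(uint64_t pid) {
   state[pid].word.fetch_sub(1); // last reader leaves the page Unlocked; version unchanged
}

Note that shared unlocking does not bump the version counter; only the release of an exclusive lock does, which is exactly what optimistic readers validate against.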
3.4 Replacement Strategy

Clock implementation. In principle, arbitrary replacement strategies can be implemented on top of vmcache. As mentioned earlier, our current implementation uses the clock algorithm. Before the buffer pool runs out of memory, we change the state of Unlocked pages to Marked. All page accesses, including optimistic reads, clear the Marked state, ensuring that hot pages will not be evicted. To implement clock, one needs to be able to iterate over all pages in the buffer pool. One approach would be to iterate over the state array while ignoring evicted pages. However, this would be quite expensive if the state array is very sparse (i.e., storage is much larger than main memory). We implement a more robust approach that stores all page identifiers that are currently cached in a hash table. The size of the hash table is equal to the number of pages in DRAM (rather than storage) and our page replacement algorithm iterates over this much smaller data structure. We use a fixed-size open addressing hash table, which makes iteration cache efficient. Note that, in contrast to traditional buffer managers, this hash table is not accessed during cache hits, but only during page faults and eviction.

Batch eviction. For efficiency reasons, our implementation evicts batches of 64 pages. To minimize exclusive locking and exploit efficient bulk I/O, eviction is done in five steps:

(1) get a batch of marked candidates from the hash table, lock dirty pages in shared mode
(2) write dirty pages (using libaio)
(3) try to lock (upgrade) clean (dirty) page candidates
(4) remove locked pages from the page table using madvise
(5) remove locked pages from the eviction hash table, unlock them

After step 3, all pages must be locked exclusively to avoid race conditions during eviction. For dirty pages, we already obtained shared locks in step 1, which is why step 3 performs a lock upgrade. Clean pages have not been locked, so step 3 tries to acquire the exclusive lock directly. Both operations can fail because another thread accessed the page, in which case eviction skips it (i.e., the page stays in the pool). With the basic vmcache design, step 4 simply calls madvise once for every page. With exmap, we will be able to exploit bulk removal of pages from the page table.

3.5 Page Sizes

Default page size. Most processors use 4 KB virtual memory pages by default, and conveniently this granularity also works well with flash SSDs. It therefore makes sense to set the default buffer pool page size to 4 KB as well. x86 (ARM) also supports 2 MB (1 MB) pages, which might be a viable alternative in systems that primarily read larger blocks. With vmcache, OLTP systems should generally use 4 KB pages, and for OLAP systems both 4 KB and 2 MB pages are suitable.

Supporting larger pages. vmcache also makes it easy to support any buffer pool page size that is a multiple of 4 KB. Figure 2 shows an example where page P3 spans two physical pages. For data structures implemented on top of the buffer manager this fact is completely transparent, i.e., the memory appears to be contiguous. Accesses to large pages only use the page state of the head page (P3, not P4, in the figure). The advantage of relying on virtual memory to implement multiple page sizes is that it avoids main memory fragmentation. Note that fragmentation is not simply moved from user to kernel space; rather, the page table indirection allows the OS to always deal with 4 KB pages instead of having to maintain different allocation classes. As a consequence, as Figure 2 illustrates, a contiguous virtual memory range will in general not be physically contiguous.

Figure 2: vmcache enables DBMS page sizes that are a multiple of the VM page size. (In the example, a large page occupies the virtual-memory slots of P3 and P4 and is backed by two non-adjacent physical pages.)

Advantages of large pages. Although most DBMSs rely on fixed-size pages, supporting different page sizes has many advantages. One case where variable-size pages simplify and accelerate the DBMS is string processing. With variable-size pages one can, for example, simply call external string processing libraries with a pointer into the buffer pool. Without this feature, any string operation (comparison, LIKE, regexp search, etc.) needs to explicitly deal with strings chunked across several pages. Because few existing libraries support chunking, one would have to copy larger strings into contiguous memory before being able to use them. Another case is compressed columnar storage, where each column chunk has the same number of tuples but a different size. In both cases it is indeed possible to split the data across multiple fixed-size pages (and many systems have to do it due to a lack of variable-size support), but it leads to complex code and/or slower performance. Finally, let us mention that, in contrast to systems like Umbra [33], vmcache supports arbitrary page sizes as long as they are a multiple of 4 KB. This reduces memory waste for larger objects. Overall, we argue that this feature can substantially simplify the implementation of the DBMS and lead to better performance.
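As an illustration of the mechanism from Section 3.5, the following sketch (ours, not the paper's code) fixes a large page that occupies several consecutive 4 KB slots of the virtual-memory area; where the page size is stored and how errors are handled are left open here:

// `pid` is the head page; `size` is a multiple of 4 KB (e.g., 3 * pageSize = 12 KB)
char* fixLarge(uint64_t pid, uint64_t size) {
   // acquire state[pid] exclusively as in Listing 1; tail slots (e.g., P4) have no state of their own
   uint64_t ofs = pid * pageSize;
   pread(fd, virtMem + ofs, size, ofs); // one read; the OS backs the range with
                                        // possibly non-contiguous physical frames
   return virtMem + ofs;                // appears contiguous to the caller
}
// evicting the large page later releases the whole range at once:
// madvise(virtMem + ofs, size, MADV_DONTNEED);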
3.6 Discussion

State access. As mentioned earlier, every page access must retrieve the page state – often causing a cache miss – before it can read the page data itself. One may therefore wonder whether this is just as inefficient as traditional hash table-based buffer managers. However, these two approaches are very different from each other in terms of their memory access patterns. In the hash table approach, the page data pointer is retrieved from the hash table itself, i.e., there is a data dependency between the two pointers and one usually pays the price of two cache miss latencies. In our approach, in contrast, both the page state pointer and the data content pointer are known upfront. As a consequence, the out-of-order execution of modern CPUs will perform both accesses in parallel, hiding the additional overhead of the state retrieval.

Memory consumption. vmcache comes with some DRAM overhead in the form of page tables and the page state array: For configuring the virtual-memory mapping, vmcache requires 8.016 bytes for each 4 KB of storage to set up a 5-level page table. Besides this cost, which is inherent to any mmap-like buffer manager, vmcache requires an additional 8 bytes for the page state: 8 bits for the exclusive/shared lock and 56 bits for the optimistic-read version counter. So in total, vmcache requires around 16 bytes of DRAM per 4 KB on storage. Thus, for example, for a 1 TB flash SSD, one needs 4 GB of DRAM for the internal buffer manager state, which is a reasonable 1/256th of SSD capacity. Economically speaking, as flash is approximately 50× cheaper per byte than DRAM, the additional memory costs 50/256 ≈ 20% of the flash price. While this is low enough in most use cases, there are ways to reduce this cost: (1) Compress the 64-bit page state at the expense of optimistic reads (-56 bits) and shared locking (-6 bits) down to two bits per storage page (evicted, exclusively locked), leaving us with a total of 2.07 GB for a 1 TB flash SSD (+10.11% cost). (2) Place the page state within the buffered page and keep the corresponding 8 bytes on the storage page unused, leaving us with the unavoidable 2 GB of DRAM overhead. Thus, the memory overhead is reasonable in terms of overall cost for the system and could be reduced even further.

Address space. Existing 64-bit CPUs generally support at least 48-bit virtual memory addresses. On Linux, half of that is reserved for the kernel, and user-space virtual memory allocations are therefore limited to 2^47 = 128 TB. Starting with Ice Lake, Intel processors support 57-bit virtual memory addresses, enabling a user-space address space size of 2^56 = 64 PB. Thus, the address space is large enough for our approach, and will be so for the foreseeable future.

4 EXMAP: SCALABLE AND EFFICIENT VIRTUAL MEMORY MANIPULATION

vmcache exploits hardware-supported virtual memory with explicit control over eviction while supporting flexible locking modes, variable-sized pages, and arbitrary reference patterns (i.e., graphs). This is achieved by relying on two widely-available OS primitives: anonymous memory mappings and an explicit memory-release system call. Although vmcache is a practical and useful design, with some workloads it can run into OS kernel performance problems. In this section, we describe a Linux kernel extension called exmap that solves this weakness. We first motivate why the existing OS implementation is not always sufficient, then provide a high-level overview of the design, and finally describe implementation details.

4.1 Motivation

Why Change the OS? With vmcache, (de)allocating 4 KB pages is as frequent as page misses and evict operations, i.e., the OS' memory subsystem becomes part of the hot path in out-of-memory workloads. Unfortunately, Linux' implementation of page allocation and deallocation does not scale. As a consequence, workloads that have a high page turn-over rate can become bottlenecked by the OS's virtual memory subsystem rather than the storage device. To quantify the situation on Linux, we allocate pages on a single anonymous mapping by triggering a page fault and evict them again with MADV_DONTNEED. As Figure 3 shows, vanilla Linux only achieves 1.51M OP/s with 128 threads. Incidentally, a single modern PCIe 4.0 SSD can achieve 1.5M random 4 KB reads per second [4]. In other words, a 128-thread CPU would be completely busy manipulating virtual memory for one SSD – not leaving any CPU cycles for actual work.

Figure 3: Linux page (de)allocation performance (pages allocated and freed per second for 1 to 128 threads).

Figure 4: CPU time profile for Figure 3 with 128 threads (flame graph; the madvise path is dominated by flush_tlb_mm_range, plus page allocation and freeing).

Problem 1: TLB shootdowns. To investigate this poor scalability, we used the perf profiling tool and show a flame graph [17] in Figure 4. Linux spends 79% of all CPU time in the flush_tlb_mm_range function. It implements TLB shootdowns, which are an explicit coherency measure that prevents outdated TLB entries, which otherwise could lead to data inconsistencies or security problems. On changing the page table, the OS sends an interprocessor interrupt (IPI) to all other (N-1) cores running application threads, which then clear their TLB. This is fundamentally unscalable as it requires N-1 IPIs for every evicted page.

Problem 2: Page allocation. After shootdowns, the next major performance problem in Linux is the intra-kernel page allocator (free pages and alloc page in the flame graph). The Linux page allocator relies on a centralized, unscalable data structure and, for security reasons, has to zero out each page after eviction. Therefore, once the larger TLB shootdown bottleneck is solved, workloads with high page turn-over rates will be bound by the page allocator.

Why a New Page Table Manipulation API? The two performance problems described above cannot be solved by some low-level changes within Linux, but are fundamentally caused by the existing decades-old virtual memory API and semantics: The TLB shootdowns are unavoidable with a synchronous page-at-a-time API, and page allocation is slowed down by the fact that physical memory pages can be shared between different user processes. Achieving efficient and scalable page table manipulation therefore requires a different virtual memory API and modified semantics.

4.2 Design Principles

exmap. exmap is a specialized Linux kernel extension that enables fast and scalable page table manipulation through a new API and an efficient kernel-level implementation. We co-designed exmap for use with vmcache, but as we discuss in Section 4.5, it could also be used to accelerate other applications. exmap comes as a Linux kernel module that the user can load into any recent Linux kernel without rebooting. Like the POSIX interface, exmap provides primitives for setting up virtual memory, allocating, and freeing pages. However, as outlined below, exmap has new semantics to eliminate the bottlenecks provoked by the POSIX interface.
Solving the TLB shootdown problem. An effective way of reducing the number of TLB shootdowns is to batch multiple page evictions and thereby reduce the number of shootdowns by the batch size. To achieve this, exmap provides a batching interface to free multiple pages with a single system call. While batching is easy to exploit for a buffer manager when evicting pages, it can be problematic to batch page allocations because these are often latency critical. To avoid TLB shootdowns on allocation, exmap therefore ensures that allocation does not require shootdowns at all. To do this, exmap always read-protects the page table entry of a freed page (by setting a specific bit in the page table entry). Linux, in contrast, sets that entry to a write-protected but not read-protected zero page – potentially causing invalid TLB entries that have to be explicitly invalidated on allocation. This subtle change eliminates the need for shootdowns on allocation completely.

Solving the page allocation problem. Another important difference between Linux and exmap is the page allocation mechanism. In Linux, when a page is freed, it is returned to a system-wide pool (and thereby potentially to other processes). This has two drawbacks: (1) page allocation does not scale well and (2) pages are repeatedly zeroed out for security reasons. exmap, in contrast, pre-allocates physical memory at creation and keeps it in scalable thread-local memory pools – thereby avoiding both bottlenecks.

4.3 Overview and Usage

Implementation overview. Figure 5 illustrates the three major components of an exmap object: (A) its surface within the virtual memory (VM); (B) a number of control interfaces to interact with the object; and (C) a private memory pool of physical DRAM pages, which exists as interface-local free lists spread over all interfaces.

Figure 5: exmap implementation overview: The VM Surface (A) is manipulated with explicit free, alloc, read, or write system calls. Each per-thread control interface (B) owns part of the exmap-local memory pool, which exists as interface-local free lists of physical pages (C). If an interface runs out of pages (1), it steals pages from another interface (2). Pages only circulate (X) between the surface and the interface.

1  // Open device/file as backing storage
2  int fd = open("/dev/...", O_RDWR|O_DIRECT);
3  // Create a new exmap object
4  struct exmap_setup_params params = {
5    .max_interfaces = 8,    // # of control interfaces
6    .pool_size = 262144,    // # of pages in pool (1 GB)
7    .backing_fd = fd};      // storage device
8  int exmap_fd = exmap_create(&params);
9  // Make the exmap visible in the VM
10 Page* pages = (Page*)mmap(vmSize, exmap_fd, ...);
11 // Allocate and evict memory using interface 5
12 exmap_interface_t iface = 5;
13 // Scattered I/O vector: P1, P3-P5
14 struct iovec vec[] = {
15   { .iov_base = &pages[1], .iov_len = pageSize },
16   { .iov_base = &pages[3], .iov_len = pageSize * 3}};
17 exmap_action(exmap_fd, iface, EXMAP_ALLOC, &vec, 2);
18 exmap_action(exmap_fd, iface, EXMAP_FREE, &vec, 2);
19 // Read pages from fd into the exmap;
20 // use exmap_fd as a proxy file descriptor.
21 pread(exmap_fd, &pages[13], pageSize, iface);  // P13
22 preadv(exmap_fd, &vec, 2, iface);              // P1, P3-P5
23 // Write-backs are explicit and without proxy fd
24 pwrite(fd, &pages[7], pageSize, 7 * pageSize); // P7

Listing 3: exmap usage example

Creation. On creation (lines 4-8 in Listing 3), the user configures these components: She specifies the number of interfaces that the kernel should allocate (line 5). Usually, each thread should use its own interface (e.g., thread id = interface id) to maximize scalability. The user also specifies the number of memory pool pages (line 6), which exmap will drain from Linux' page allocator for the lifetime of the exmap object. As the third parameter, the user can specify a file descriptor as backing storage for read operations (line 7).

Operations. After creation, the process makes the exmap surface visible within its VM via mmap (line 10). While an exmap can have an arbitrary VM extent, it can be mapped exactly once in the whole system. On the mapped surface, we allow the vectorized and scattered allocation of pages (line 11 and Figure 5 (X)). For this, one specifies a vector of page ranges within the mapped surface and issues an EXMAP_ALLOC command at an explicitly-addressed interface. The required physical pages are first drawn from the specified interface (Figure 5 (1)), before we steal memory from other interfaces (Figure 5 (2)). Once allocated, pages are never swapped out and, therefore, accesses will never lead to a page fault, providing deterministic access times. With the free operation (line 18), we free the page ranges and release the removed physical pages to the specified interface.

Read I/O. In contrast to file-backed mmap, we do not page in or write back data transparently; instead, the user (e.g., vmcache) explicitly invokes read and write operations on the surface. To speed up these operations, we integrated exmap with the regular Linux I/O subsystem, whereby an exmap file descriptor becomes a proxy for the specified backing device (lines 19-22). This allows combining page allocation and read operations in a single system call: On read, exmap first populates the specified page range with memory before it uses the regular Linux VFS interface to perform the actual read. Since we derive the disk offset from the on-surface offset, we can use the offset parameter to specify the allocation interface. With this integration, exmap supports synchronous (pread) and asynchronous (libaio and io_uring) reads. Furthermore, as the on-surface offset determines the disk offset, vectorized reads (preadv, IORING_OP_READV) implicitly become scattered operations (line 22), which Linux currently allows with no other system call.
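Assuming the interface from Listing 3, the page-miss path of vmcache (Listing 1, line 7) can route its read through the proxy descriptor, so that installing the physical page and reading it from storage happen in a single system call, and step 4 of the eviction protocol from Section 3.4 can free a whole victim batch at once. This is our sketch of that wiring, not code from the paper:

// page miss in fix(): allocate + read in one call via the proxy descriptor;
// virtMem is the exmap surface mapped in Listing 3 (line 10), the target address
// encodes the disk offset, and the offset argument names the control interface
uint64_t ofs = pid * pageSize;
pread(exmap_fd, virtMem + ofs, pageSize, iface);

// batched eviction (step 4 in Section 3.4): one call instead of 64 madvise calls;
// vec is an iovec array describing the victim pages, n is its length
exmap_action(exmap_fd, iface, EXMAP_FREE, &vec, n);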
Write I/O. On the write side, we actively decided against a write-proxy interface, which would, for example, bundle the write back and the page evict. Such a bundling is not necessary, as the user can already write surface pages to disk (line 24), and freeing pages for each write individually could, if not used correctly, lead to unnecessary overheads. Therefore, we decoupled write back and (batched) freeing of pages.

4.4 Implementation Details

Scalable page allocator. Usually, when the kernel unmaps a page, it returns the page to the system-wide buddy allocator, which possibly merges it into larger chunks of physical memory. On allocation, these chunks are broken down again into pages, which have to be zeroed before mapping them to user space. Therefore, with a high VM turn-over rate, memory is constantly zeroed and circles between the VM subsystem and the buddy allocator. To optimize VM operations for vmcache, we decided to use per-exmap memory pools to bypass the system allocator. This also allows us to avoid proactive page zeroing, since pages only circulate between the surface and the memory pool within the same process, whereby information leakage to other processes is impossible. Only during the initial exmap creation do we zero the pages in our memory pool.

Thread-local control interfaces and page stealing. Furthermore, exmap's control interfaces not only allow the application to express allocation/eviction locality, but they also reduce the contention and false sharing that come with a centralized allocator. For this, we distribute the memory pool as local lists of free 4 KB pages over the interfaces, whereby the need for page stealing comes up. After the interface-local free list is drained, we use a three-tiered page-stealing strategy: (1) steal from the interface from which we have successfully stolen the last time, (2) randomly select two interfaces and steal from the interface with more free pages, and (3) iterate over all interfaces until we have gathered enough pages. To minimize the number of steal operations, we steal more pages than required for the current operation. If we remove pages from the surface, we always push them to the specified interface. Thereby, for workloads in which per-interface allocation and eviction are in balance, steal operations are rarely necessary.

Lock-free page-table manipulation. For page-table manipulations, Linux uses a fine-grained locking scheme that locks the last level of the page table tree to update the page-table entries therein. However, such entries have machine-word size on most architectures, and we can update them directly with atomic instructions. While Linux leaves this opportunity open for portability reasons, we integrated an atomic-exchange-based hot path: If an operation manipulates only an individual page-table entry on a last-level page table, we install (or remove) the VM mapping with a single compare-and-exchange.

I/O subsystem integration. For read operations, the Linux I/O subsystem is optimized for sequential reads into destination buffers that are already populated with physical memory. For example, without exmap, Linux does not provide a scattered read operation that takes multiple offsets; such a read request has to be split into multiple (unrelated) reads. On a lower level, Linux expects the VM to be populated and calls the page-fault handler for each missing page before issuing the actual device operation. Hence, Linux cannot fully exploit scattered request patterns, but handles them as individual requests, which provokes unnecessary overheads (i.e., repeated page-table locking, allocator invocations). To avoid this, exmap provides vectorized and scattered reads with the proxy file descriptor. This allows us to (1) pre-populate the VM with memory, which avoids the page-fault handler path, and (2) cut down the system-call overhead, as we issue only a single system call per request batch.

Multiple exmaps. A process can create multiple exmap objects, which are mapped as separate non-overlapping virtual-memory areas (VMAs) into the process address space. These VMAs come with their own VM subsystem and are largely isolated from each other and from the rest of the kernel while still ensuring consistency and privilege isolation. As already noted, each exmap can be mapped exactly once, whereby we avoid the bookkeeping overhead of general-purpose solutions⁵.

⁵ For example, Linux usually maintains a reverse mapping from physical to virtual addresses that is necessary to implement features such as copy-on-write fork.

4.5 Discussion

OS customization. exmap is a new low-level OS interface for manipulating virtual memory efficiently. Seemingly minor semantic changes such as batching and avoiding zero pages result in very high performance without sacrificing security. One analogy is that exmap is for VM what O_DIRECT is for I/O: a specialized tool for systems that want to manage and control hardware resources themselves as efficiently as possible. Two design decisions of exmap require further discussion.

Functionality. We largely decoupled the exmap surface and its memory pool from the rest of Linux. As a consequence of this lean design, exmap is efficient but does not support copy-on-write forking and swapping. Few buffer pool implementations rely on such functionality. Indeed, it is actually a benefit that exmap behavior is simple and predictable, as it allows buffer managers to precisely track memory consumption and ensure robust performance.

Portability. Another important aspect is generalizability to other operating systems and architectures. Since our kernel module comes with its own specialized VM subsystem, it has only few dependencies on the rest of the Linux kernel. This makes exmap easily portable between Linux versions and suggests that the concept can be implemented for other operating systems such as Windows and FreeBSD. Except for our architecture-dependent lock-free short-cut for small page table modifications, the exmap implementation is also independent of the used ISA and MMU, as it reuses Linux' MMU abstractions. In other words, our Linux implementation is easily portable across CPU architectures that support Linux.

Other Applications of exmap. Although we explicitly designed exmap for caching, it has other use cases as well: (1) Due to its high VM-modification performance (see Figure 8), a heap manager could use a large exmap surface to coalesce free pages into large contiguous buffers, which is useful for DBMS query processing [14]. (2) With a page-move extension, a language run-time system could use exmap as a base for a copying garbage collector for pools of page-aligned objects. (3) For large-scale graph processing, workers request randomly-placed data from the backing store, often with a high fan-out (e.g., for breadth-first search) and with high parallelism, which can easily be serviced by exmap. (4) For user-space file […]
Hardware, OS. We ran all experiments on a single-socket server with an AMD EPYC 7713 processor (64 cores, 128 hardware threads) and 512 GB main memory, of which we use 128 GB for caching. For storage, we use a 3.8 TB Samsung PM1733 SSD. […]

Figure 7: Out-of-memory performance and I/O statistics (128 GB buffer pool, 128 threads, random lookup: 5 B entries ≈ 1 TB, TPC-C: 5000 warehouses ≈ 1 TB). The figure plots transactions/s as well as read, write, and total I/O over a 200-second run for both workloads.

[…] Only by using the exmap module can it become competitive to LeanStore. Eventually, exmap+vmcache performs similarly to LeanStore and both become I/O bound in steady state. The performance differences are largely due to minor implementation differences: vmcache+exmap has slightly higher steady-state performance due to a more compact B-tree (less I/O per transaction), and LeanStore temporarily (40s to 90s) outperforms vmcache+exmap due to more aggressive dirty page eviction using dedicated background threads.

WiredTiger and LMDB. WiredTiger and the mmap-based LMDB are significantly slower than vmcache and LeanStore. The performance of WiredTiger suffers from the 32 KB page size, whereas LMDB is bound by kernel overhead (random lookups) and the single-writer model (TPC-C). Overall, we see that while basic vmcache offers solid out-of-memory performance, as the number of I/O operations per second increases it requires the help of exmap to unlock the full potential of fast storage devices.

5.4 vmcache Ablation Study

To better understand the performance of virtual-memory assisted buffer management and compare it against a hash table-based design, we evaluated page access time using a microbenchmark. We focus on the in-memory case, which is why all page accesses in this experiment are hits. For all designs, we read random 4 KB pages of main memory and report the average number of instructions, cache misses, and the access latency. We report numbers for 32 KB and 128 GB of data. The former corresponds to very hot CPU-cache resident pages and the latter to colder pages in DRAM. Line #1 in Table 2 simply shows the random access time in a 32 KB/128 GB array and therefore represents the lower bound for any buffer manager design. The next three lines incrementally show the steps (described in Section 3.1, Section 3.2, and Section 3.3) necessary in the vmcache design. In line #2.1, we randomly read from a virtual memory range of 1 TB (instead of 128 GB), which increases latency by 7% due to additional TLB pressure. In line #2.2, in addition to accessing the pages themselves, we also access the page state array as is required by the vmcache design. As mentioned in Section 3.6, this additional cache miss does not noticeably increase access latency because both memory accesses are independent and the CPU therefore performs them in parallel. In line #2.3, we also include the version validation, which results in the full vmcache page access logic. Overall, this experiment shows that a full optimistic read in vmcache incurs less than 8% overhead in comparison with a simple random memory read. We measured that an exclusive, uncontended page access (fix & unfix) on 128 GB of RAM takes 238 ns (not shown in the table). The last line in the table shows the performance of a hash table-based implementation based on open addressing. Even such a fast hash table results in substantially higher latencies because the page pointer is only obtained after the hash table lookup. Note that our hash table implementation is not synchronized, and the shown overhead is therefore actually a lower bound for the true cost of any hash table-based design.

5.5 exmap Allocation Performance

Allocation benchmark. The end-to-end results presented so far have shown that exmap is more efficient than the standard Linux page table manipulation primitives. However, because we were I/O bound, we have yet to evaluate how fast exmap actually is. To quantify the performance of exmap, we used similar allocation benchmark scenarios as in Figure 3, i.e., we constantly allocate and free pages in batches.

Baselines. The results are shown in Figure 8. For these, we always use batched allocations/evictions of 512 individual 4 KB pages. As a baseline, we use process_madvise with TLB batching, which already requires kernel changes. For reference, we also show the maximal DRAM read rate, which we achieved using the pmbw benchmarking tool and 64 threads (144.56 GiB/s). If the OS provides memory faster than this threshold, we can be sure that memory allocation will not be the bottleneck.

Page stealing scenarios. exmap uses page stealing, and its performance therefore depends on the specific inter-thread allocation […]

Figure 8: Linux memory allocation performance with exmap. The three exmap lines show different page stealing scenarios (1 IF: no stealing, 2 IF: pair-wise stealing, pool: stealing across all threads).

Figure 9: Read performance for synchronous (pread) and asynchronous (uring) I/O operations. Both vmcache and exmap support asynchronous I/O using uring, which allows achieving full I/O bandwidth using a few threads.

[…] has identified TLB shootdowns as a major performance problem and […] them [6, 7, 22]. exmap uses the same batching idea to speed up VM manipulation.

Incremental VM improvements. Existing work on improving the Linux VM subsystem can be split into two general categories: (1) speed up the existing infrastructure and (2) provide new VM management systems. For the first, Song et al. [39] modify the allocation strategy in the page fault handler. Freed pages are saved […]