Art of Latency in DB

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/sfu-dis/mosaicdb.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 17, No. 3 ISSN 2150-8097. doi:10.14778/3632093.3632117

1 INTRODUCTION

Various kinds of latency exist in modern database engines [13, 30, 41, 46, 55, 57] that target machines with large main memory, fast SSDs and multicore CPUs. When the working set fits in memory, memory is the primary home of data and of various important data structures (e.g., indexes and version chains [62]) where memory blocks at random locations are chained using pointers. As a result, data stalls caused by pointer chasing in turn become a major bottleneck [27]. Since hardware prefetchers are ineffective for pointer chasing [5, 32, 54], modern CPUs offer prefetching instructions [24] that allow software to proactively move data from memory to CPU caches in advance. Importantly, such instructions are asynchronous, allowing software prefetching approaches to overlap computation and data fetching [5, 27, 32, 48]. Coupled with lightweight coroutines [25], latency-optimized OLTP engines [21] use software prefetching to hide such data stalls.

1.1 Just Hiding Memory Latency Is Not Enough

Despite the significant performance improvement observed by end-to-end latency-optimized engines [21], these systems are still inadequate and leave many unexplored opportunities in further hiding more types of latency, as shown in Figure 1.

First, as data size grows, it becomes necessary to support larger-than-memory databases. As a result, in Figure 1(c–d) a transaction may access both memory- and storage-resident data. Yet existing latency-optimized OLTP engines based on software prefetching are mostly designed to hide data stalls caused by memory accesses (0.1µs level) shown in Figure 1(c), without considering different levels of data movement latency, especially storage accesses at the 10µs to ms level. As we elaborate later, a direct addition of storage accesses to an end-to-end OLTP engine optimized for hiding memory latency would cancel out the benefit of software prefetching, or even yield worse throughput than not using prefetching at all.

Second, in addition to the complexity caused by a mix of different data access latencies, the software architecture of a database engine can also induce latency during forward processing. As shown in Figure 1(a), CPU cores may be oversubscribed to run more threads than the degree of hardware multiprogramming, causing OS scheduling activities. The use of synchronization primitives can also lead to additional delays. For example, in Figure 1(b) the number of (re)tries (and consequently the latency) to acquire a contended spinlock [53] could be arbitrarily long depending on the workload, making system behavior highly unpredictable.
These issues limit the applicability of latency-optimized database engines. Future engines should further address other sources of latency and, more importantly, do so while preserving the benefits of existing latency-hiding techniques, such as software prefetching.

1.2 MosaicDB: Latency Hiding at the Next Level

This paper presents MosaicDB, a multi-versioned, latency-optimized OLTP engine that hides latency from multiple sources, including memory, I/O, synchronization and scheduling as identified earlier. To reach this goal, MosaicDB consists of a set of techniques that could also be applied separately in existing systems.

MosaicDB builds on the coroutine-to-transaction paradigm [21] to hide memory access latency. On top of that, we observe that the key to efficiently hiding I/O latency without hurting the performance of memory-resident transactions is carefully scheduling transactions such that the CPU can be kept busy while I/O is in progress. To this end, MosaicDB proposes a pipelined scheduling policy for coroutine-oriented database engines. The basic ideas are to (1) keep admitting new requests in a pipelined fashion such that each worker thread always works with a full batch of requests, and (2) admit more I/O operations to the system only when there is enough I/O capacity (measured by bandwidth consumption or IOPS). This way, once the storage devices are saturated, MosaicDB only accepts memory-resident requests, which can benefit from software prefetching. By carefully examining alternatives in later sections, we show how these seemingly simple ideas turned out to be effective and became our eventual design decisions.

To avoid latency caused by synchronization primitives and OS scheduling, MosaicDB leverages the coroutine-to-transaction paradigm to regulate contention and eliminate the need for background threads. Specifically, each worker thread can work on multiple transactions concurrently, but only one transaction per thread will be active at any given time. This avoids oversubscribing the system by limiting the degree of multiprogramming to the amount of hardware parallelism (e.g., the number of hardware threads). Consequently, the OS scheduler will largely be kept out of the critical path of the OLTP engine because context switching only happens in user space as transactions are suspended and resumed by the worker threads. Using this architecture, MosaicDB further removes the need for dedicated background threads (e.g., log flushers) that were required by pipelined commit [26], which is necessary to achieve high transaction throughput without sacrificing durability: cleanup work such as log flushes can be done using asynchronous I/O upon transaction commit; the transaction then suspends and is resumed and fully committed only when the I/O request has finished.

We implemented MosaicDB on top of CoroBase [21], a latency-optimized OLTP engine that hides memory latency using software prefetching. Compared to baselines, on a 48-core server, MosaicDB maintains high throughput for memory-resident transactions while allowing additional storage-resident transactions to fully leverage the storage device. Overall, MosaicDB achieves up to 33× higher throughput for larger-than-memory workloads; with a given number of CPU cores, MosaicDB is free of oversubscription and outperforms CoroBase by 1.7× under TPC-C; and MosaicDB has better scalability under high-contention workloads, with up to 18% less contention and 2.38× the throughput of the state of the art.

Although MosaicDB is implemented and evaluated on top of CoroBase, the techniques can be separately applied in other systems. For example, contention regulation can be adopted by systems that use event-driven connection handling, where the total number of worker threads will never exceed the number of hardware threads. We leave it as future work to explore how MosaicDB techniques can be applied in other systems.

1.3 Contributions

This paper makes four contributions. (1) We quantify the impact of various sources of latency identified in memory-optimized OLTP engines, beyond memory access latency, which has received the most attention in the past. (2) We propose design principles that preserve the benefits of software prefetching to hide memory latency and hide storage access latency at the same time. (3) In addition to memory and storage I/O, we show how the database engine's software architecture can be modified to avoid the latency impact of synchronization and OS scheduling. (4) We build and evaluate MosaicDB on top of an existing latency-optimized OLTP engine to showcase the effectiveness of MosaicDB techniques in practice. MosaicDB is open-source at https://github.com/sfu-dis/mosaicdb.

2 BACKGROUND

We begin by clarifying the type of systems our work is based upon and our assumptions. We then elaborate on the main sources of latency that MosaicDB attempts to hide, followed by existing latency-optimized designs that motivated our work.

2.1 System Architectures and Assumptions

We target memory-optimized OLTP engines that both (1) leverage large DRAM when data fits in memory and (2) support larger-than-memory databases when the working set goes beyond memory.

Larger-Than-Memory Database Engines. There are mainly two approaches to realizing this. One is to craft better buffer pool designs [37, 46], which use techniques like pointer swizzling [18, 28] and low-overhead page eviction algorithms [37, 58] to approach in-memory performance when data fits in DRAM, while otherwise providing graceful degradation and fully utilizing storage bandwidth. The other approach employs a "hot-cold" architecture [14, 31] that does not use a buffer pool and separates hot and cold data, whose primary homes are respectively main memory and secondary storage (e.g., SSDs). In essence, a hot-cold database engine consists of a "hot store" that is memory-resident (although persistence is still guaranteed) and an add-on "cold store" in storage. A transaction then could access data from both stores. Note, however, that both "stores" use the same mechanisms for such functionality as concurrency control, indexing and checkpointing, inside a single database engine without requiring cross-engine capabilities [64]. In this paper, we focus on the hot-cold architecture and leave it as future work to explore the buffer pool based approach.
Hot-Cold Storage Engines. Figure 2 shows the design of ERMIA [30], a typical hot-cold system that employs multi-versioned concurrency control, in-memory indexing and indirection arrays [50] to support in-memory data (hot store) and storage-resident data (cold store). Many other systems [13, 14, 39, 40] follow similar designs.

Figure 2: The "hot-cold" store architecture of MosaicDB. Indexes map keys to RIDs. Per-table indirection arrays are indexed by RIDs and store record locations.

An update to the database will append a new in-memory record version to the indirection array and generate a log record in the log buffer, which will later be flushed to the storage device. That is, data records are permanently stored in secondary storage in the form of logs, which are periodically compacted and consolidated, i.e., the log is the database. Within a table, each record is uniquely identified by a record ID (RID). Each table is represented by an indirection array, a resizable array where each entry carries the address (in memory or storage) of a unique data record. An index then maps keys to RIDs, which in turn are indexes into the indirection array. Note that, different from RIDs in traditional systems [49], RIDs here are logical and carry no record address information; the address must be retrieved through the indirection array.

With such an architecture, a transaction can access a record in three steps: (1) traverse the index to obtain the RID i, (2) examine the table's indirection array using i as the index, and (3) access the desired record data in memory or from storage. Many such designs are also multi-versioned [30, 36, 41, 47, 61, 62] to extract more concurrency. As a result, in Figure 2, an indirection array entry can point to a version chain (a linked list of versions) or an address in storage. In the former case, in step (3) the transaction traverses the version chain to retrieve the desired record version as dictated by the concurrency control protocol, such as snapshot isolation. In the latter case, the transaction must issue an I/O to access the cold record, which can be cached using various strategies. Some systems [14] use per-thread caching of cold data. Such design decisions are orthogonal to our work.

Since the hot store assumes DRAM is large enough to hold at least the working set, recent memory-optimized OLTP engines [21, 30, 31, 57] also often employ redo-only logging without using a buffer pool. This way, updates (log records) by aborted transactions are discarded without ever reaching storage. The logs then store actual data generated by committed transactions. The system can recover by replaying the logs after applying a previous checkpoint (if any). For cold records, the recovery process mainly needs to fill the indirection arrays with record addresses in storage (the log), without having to materialize any data record. During forward processing, accesses to cold records can be done on demand by reading the log at the addresses carried by the corresponding indirection array entries.
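To make this access path concrete, below is a minimal C++ sketch of steps (1)–(3). All names here (Version, IndirectionEntry, probe_index, fetch_from_log) and the tagged 64-bit entry encoding are our own illustration under the assumptions above, not the actual ERMIA or MosaicDB code.

    #include <cstdint>
    #include <vector>

    // Illustration only: hypothetical types for the hot-cold access path.
    struct Version {           // one node in an in-memory version chain
      uint64_t csn;            // commit timestamp used for visibility checks
      Version* next;           // pointer to the older version
      // ... record payload follows
    };

    // An indirection array entry holds either a pointer to the newest in-memory
    // version or the record's permanent location (an offset) in the log.
    struct IndirectionEntry {
      static constexpr uint64_t kColdBit = 1ull << 63;
      uint64_t word;
      bool is_cold() const { return word & kColdBit; }
      Version* chain() const { return reinterpret_cast<Version*>(word); }
      uint64_t log_offset() const { return word & ~kColdBit; }
    };

    using RID = uint32_t;

    RID probe_index(uint64_t key);                 // step (1): index gives an RID
    Version* fetch_from_log(uint64_t log_offset);  // step (3b): cold record, needs I/O

    Version* read_record(std::vector<IndirectionEntry>& table, uint64_t key,
                         uint64_t read_csn) {
      RID rid = probe_index(key);                   // (1) traverse the index
      IndirectionEntry& entry = table[rid];         // (2) indirection array lookup
      if (entry.is_cold())                          // (3b) storage-resident record
        return fetch_from_log(entry.log_offset());
      for (Version* v = entry.chain(); v; v = v->next)  // (3a) walk the version chain
        if (v->csn <= read_csn)                     //      visible under snapshot isolation
          return v;
      return nullptr;
    }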
2.2 Where Can Latencies Come From?

We identify and analyze four main sources of latency, given the hot-cold architecture and assumptions described in Section 2.1. We focus on hiding these latencies in current mainstream database servers and discuss latencies that may arise in other environments.

Pointer Chasing. Modern database engines use various in-memory data structures that are directly addressed by virtual memory pointers. Memory blocks used by these data structures are usually allocated from the heap and chained together using pointers. For example, the nodes of an in-memory B+-tree are allocated and deallocated as the tree grows and shrinks. To traverse a tree from its root node to the target leaf node, the accessing thread must dereference multiple pointers from the root to the leaf node, forming a random access pattern. Unfortunately, such patterns are very difficult (if not impossible) for hardware prefetchers to predict accurately, leading to a very high likelihood of last-level cache misses¹ upon pointer dereference. As a result, in main-memory systems data stalls often dominate total CPU execution time. Beyond indexes, version chains in multi-versioned systems as described in Section 2.1 can also cause a significant amount of data stalls when the accessing thread searches for a particular version. In some systems, over 50% of the total CPU cycles can be spent on waiting for data to be loaded from memory to CPU caches [21]. Hiding such stalls can potentially lead to much higher overall performance.

Synchronization. In-memory data structures require proper synchronization for correctness. For example, shared index node states are often protected by latches (spinlocks or mutexes). Under contention, a small number of latches may be accessed by a large number of threads. Compared to low-contention or uncontended cases, acquiring a latch that is under contention costs more CPU cycles due to retrying atomic instructions such as compare-and-swap (CAS) [24]. Such retries can be unbounded, without guaranteed progress or success in a finite number of steps, causing long delays and unfairness among transactions.

Storage I/O. As data size grows, it is necessary to allow transactions to access cold records or a mix of hot and cold records. Naturally, this would require the accessing transaction to issue I/O requests to load data from storage (if not already cached). Despite recent advances in fast storage, such as NVMe SSDs [22], storage I/O is still orders of magnitude slower than memory accesses. Worse, I/O requests are often done on the critical path via synchronous I/O primitives (e.g., pread) as the data is needed right away by the requesting transaction. Some systems, such as MySQL, alleviate this issue by prefetching data in the background using dedicated I/O threads and asynchronous I/O primitives (e.g., AIO [45]). On the one hand, it is difficult to accurately predict the workload and prefetch exactly the needed data from storage.² On the other hand, using dedicated threads can also cause (1) inter-thread communication overhead, because the transaction worker thread must notify and be notified by the I/O threads to initiate and complete I/O requests, and (2) oversubscription overhead, which we elaborate on next.

Oversubscription/OS Scheduling. CPU cycles are precious resources that must be well used. In systems that must handle larger-than-memory databases, it is desirable to overlap on-CPU compute (e.g., in-memory transactional logic) with I/O operations in the background. A straightforward and widely adopted approach is to oversubscribe the hardware and leverage the OS for scheduling, i.e., spawning more software threads than the number of hardware threads (hyperthreads) or physical cores. The OS then transparently schedules worker threads depending on whether they are handling I/O. While a thread is asleep waiting for (synchronous) I/O to finish, another thread can be scheduled to run on the CPU, improving overall CPU utilization. The downside is that the OS scheduler can require a non-trivial amount of CPU cycles for itself, canceling out the benefits of overlapping compute and I/O of different threads. This is especially true for modern fast NVMe SSDs, which exhibit a narrowing performance gap with memory [22]. Systems that use dedicated I/O threads have similar issues. Although these threads run "in the background," they still consume actual CPU cycles (especially on fast SSDs, which often prefer spinning rather than interrupts [22]) and may compete with transaction worker threads in environments with a fixed budget (e.g., in the cloud), again triggering OS scheduler activities.

Other Latencies and Environments. We find the above latencies are dominant in memory-optimized hot-cold engines. Beyond such engines, other important latencies can arise. For example, a system that uses pessimistic concurrency control may suffer from latency caused by logical lock contention, in addition to synchronization latency caused by latches. In contrast, memory-optimized systems typically use optimistic approaches to concurrency control, mitigating the impact of lock-induced latency. Networking delays can also lead to extra latency in distributed database engines. We focus on single-node, memory-optimized engines built for mainstream database servers that are typically dual- or quad-socket. Latencies caused by memory accesses and synchronization can be even more visible at larger scales (e.g., on servers with over 1000 cores across tens of sockets [1]). We leave it as future work to hide other latencies and explore other environments at larger scales.

¹ Unless otherwise specified, throughout the paper cache misses refers to last-level misses that mandate accessing memory.
² Not to be confused with prefetching from memory to CPU caches.

2.3 Coroutine-to-Transaction

Lightweight coroutines allow a transaction to suspend execution and be resumed later by a user-space scheduler as needed. This gives the opportunity for OLTP engines to overlap compute and memory accesses of different transactions, leading to the recent coroutine-to-transaction execution paradigm [21]. Each worker thread runs a user-space scheduling logic that takes incoming transactions in batches, and each transaction is modeled as a coroutine that will suspend execution upon possible cache misses. For example, in Figure 3(a), a transaction may invoke a coroutine to read records in memory and suspend its execution (line 3) after issuing an asynchronous prefetching instruction [24] (line 2). The scheduler in Figure 3(b) serves transactions in a round-robin manner and switches to and resumes the next transaction T in the batch, hoping to overlap its compute with the just-suspended transaction's prefetching. T may suspend later for the same reason.

(a) An example coroutine:
1. promise<void> coroutine(…) {
2.   __mm_prefetch(p, …)
3.   co_await suspend_always();
4.   data = *p; // Cache hit
5.   __mm_prefetch(q, …)
6.   co_await suspend_always();
7.   data = *q; // Cache hit
8. }

(b) Scheduler logic:
1. void scheduler(coros) {
2.   while (!batch_done) {
3.     for (c : coros) {
4.       if (!c.is_done())
5.         c.resume();
6.     }
7.   }
8. }

(c) Call stack changes as the example coroutine suspends and resumes (diagram).

Figure 3: Software prefetching using C++20 coroutines. A coroutine (a) can suspend and be resumed by the scheduler (b). Coroutine frames are popped from/pushed onto the caller thread's stack as they are suspended/resumed (c).

Importantly, C++20 coroutines are stackless. As Figure 3(c) shows, the coroutine frame directly (re)uses the underlying thread's stack. This allows fast switching with overhead cheaper than a last-level cache miss, whereas traditional solutions [52] use their own stacks; switching between those involves high overhead that defeats the purpose of software prefetching.

Compared to the traditional sequential execution model where a thread executes one transaction at a time, coroutine-to-transaction keeps multiple transactions open per thread. This can increase the conflict window and even cause deadlocks on a single thread, due to transactions on the same thread conflicting with each other. In practice, however, many memory-optimized systems [30, 31, 41, 42, 57] already use optimistic concurrency [34, 38] that avoids these issues, making coroutine-to-transaction a natural fit. Our work also assumes optimistic concurrency, following prior work [21].

Coroutine-to-transaction has proven to be effective for hiding memory access latency, but still lacks the ability to hide storage I/O latency, which is typically done by using asynchronous I/O primitives (e.g., io_uring [9]). As we discuss later, blindly combining asynchronous I/O and coroutine-to-transaction can lead to poor performance. Some systems leverage asynchronous I/O to overlap storage and compute, but they typically do so using heavyweight mechanisms, such as stackful coroutines [11, 43, 52] and OS threads. Compared to stackless coroutines, these mechanisms bring non-trivial overheads that cancel out the benefits of software prefetching [27]. These issues call for a new approach that allows asynchronous I/O to co-exist with, and not cancel out the effect of, memory latency hiding techniques. It was also unclear how synchronization and oversubscription/OS scheduling overheads can be mitigated under the coroutine-to-transaction paradigm. MosaicDB thus aims to enable latency hiding for both memory and storage accesses, and at the same time leverage coroutines to mitigate the impact of oversubscription and synchronization.
3 MOSAICDB OVERVIEW

MosaicDB is designed to (1) hide storage access latency using asynchronous I/O without the extra inter-thread communication overhead mentioned in Section 2.2, (2) do so without sacrificing the benefits of software prefetching, and (3) naturally avoid oversubscription overhead and regulate contention. As shown in Figure 4, MosaicDB ensures that each core (or hyperthread) runs only one software thread to avoid oversubscription and heavyweight OS scheduler activities. ① With the coroutine-to-transaction paradigm, each worker thread, instead of directly running one transaction at a time, takes multiple transactions, which are modeled as coroutines, and switches between them as needed. ② Each transaction coroutine in turn invokes the corresponding coroutines for handling data accesses, which may traverse indexes and version chains. Depending on whether the target data is from the hot or the cold store, the transaction may only need to suspend its execution upon possible cache misses, or ③ further issue asynchronous I/O requests (e.g., using io_uring or Linux AIO).³ ④–⑤ After the current transaction is suspended, control is returned to the scheduler, which ⑥ continues to handle the next request and repeats this process. When the scheduler resumes a previously suspended transaction, it may continue executing the transaction by dereferencing a pointer to the data in memory (if the requested data is in the hot store), or by checking whether the asynchronous I/O operation has finished.

Figure 4: Overview of MosaicDB. Each core/hyperthread runs one thread to schedule transactions in a pipelined manner. Transactions (coroutines) suspend and resume upon cache misses and I/O to overlap compute and data access latency.

MosaicDB inherits the relevant designs in CoroBase [21] to guarantee durability and maintain ACID properties. This is done using redo-only logging as described in Section 2.1 and pipelined/group commit [26]. During forward processing, transactions generate log records in memory and are only committed after the log records are persisted. Upon commit, the underlying thread still decouples the transaction and places it on a (partitioned) commit queue. Different from prior approaches, however, MosaicDB does not use background threads to monitor and release transactions from the commit queue (i.e., to truly commit a transaction with results returned to the application). Instead, this is checked by the worker thread itself lazily in between requests: a log flush is issued (using asynchronous I/O) whenever the log buffer is full or times out. The worker thread then checks the log I/O status the next time it accesses the log buffer. If the I/O has finished, it then examines the commit queue to release transactions whose log records are persisted. This way, MosaicDB also avoids oversubscribing the CPU by not using additional background threads like previous work [26, 64].

³ We refer to I/O mechanisms that do not block the issuing thread (e.g., io_uring and Linux AIO) as "asynchronous I/O." This is not to be confused with "asynchronous commit," which allows transactions to commit without persisting their log records.

4 MITIGATING DATA ACCESS LATENCY

Prior coroutine-to-transaction engines focused on hiding stalls caused by in-memory transactions [21] (i.e., those that only access the hot store). The in-memory setup mandated flattening nested function/coroutine calls, yet allowed simple batch scheduling policies. However, the presence of cold-store transactions requires a departure from these designs. The rest of this section describes how MosaicDB caters to such transactions while maintaining the effect of software prefetching for accessing memory-resident data.

4.1 Selective Coroutine Nesting

Database engines use nested function call chains to modularize their implementations. Given an engine implementation, one may simply change the function and return types (e.g., using co_return instead of return) to convert the necessary functions to coroutines, in addition to adding suspend/resume calls. Nested function call chains then become nested coroutine call chains. With proper coroutine library support, any suspend operation inside a coroutine will cause the calling coroutine to be popped off the stack, with control eventually returned to the scheduler, which then switches to the next request. Although transforming function call chains into fully-nested coroutines is simple, it brings non-trivial overhead that cancels out the benefits of software prefetching. To reduce coroutine switching overhead, recent approaches [21] suggest a two-level design where function call chains in the storage engine are "flattened" to become a single coroutine, which can be called by transaction coroutines. While such two-level coroutine-to-transaction increases the size of a single coroutine and can increase instruction cache misses, these costs are outweighed by the benefit: since fewer cycles are spent on coroutine switching, coroutine-to-transaction can effectively hide data stalls.

Hiding I/O latency makes the function call chains even deeper by involving additional functions that handle I/O requests (e.g., issuing and checking completion of asynchronous I/O, followed by deserialization). In theory, such additional call chains should be inlined to maintain the aforementioned two-level coroutine-to-transaction structure for the lowest overhead, given that it is also relatively straightforward to do so for in-memory data accesses, which mostly only involve traversing an index and a version chain to access a record in the hot store. However, we observe that this is neither easy nor necessary. On the one hand, such flattening is usually done manually and the additional storage-related call chain can be complex. Inlining all of it can further bloat code with worse instruction cache utilization and make the code hard to maintain. On the other hand, storage access latency is still orders of magnitude longer than memory latency. This means switches into coroutines that access storage happen much less frequently, amortizing the switching overhead.
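To illustrate the structural difference, here is a structure-only sketch (not runnable code): task<T> stands for whatever awaitable coroutine type the engine defines, the helper names are hypothetical, and bodies are elided. It contrasts a fully nested read path with the two-level/selective design this section argues for.

    // Fully nested: every helper is a coroutine, so each co_await below may
    // pop the current frame and bounce through the scheduler.
    task<RID>      probe_index(uint64_t key);        // coroutine
    task<Version*> get_version(RID rid);             // coroutine
    task<Version*> read_record_nested(uint64_t key) {
      RID rid = co_await probe_index(key);           // switch #1
      co_return co_await get_version(rid);           // switch #2
    }

    // Two-level design [21]: the whole in-memory path is flattened into one
    // coroutine that only suspends at prefetch points. Selective nesting
    // (this section) keeps that flattening for memory accesses and
    // re-introduces a nested coroutine only on the storage path.
    task<Version*> access_storage(uint64_t log_offset);   // nested coroutine (I/O)
    task<Version*> read_record_selective(uint64_t key) {
      // ...inlined index traversal: prefetch node, co_await a suspend point,
      //    dereference, repeat down to the leaf and the indirection entry...
      IndirectionEntry& entry = entry_for(key);           // hypothetical helper
      if (entry.is_cold())
        co_return co_await access_storage(entry.log_offset());  // only for I/O
      co_return newest_visible(entry.chain());             // hypothetical helper
    }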
These observations lead to a simple but useful selective coroutine nesting approach that allows nested coroutines only for storage-related functions: based on an existing two-level coroutine-to-transaction design, we instrument the storage-related functions to become nested coroutines and keep the two-level flattened coroutines for memory accesses. The result is that a transaction works exactly the same as before until it accesses storage, by invoking the inlined record access coroutines in Figure 4. If the transaction accesses storage-resident data, as shown in Figure 4, the corresponding indirection array entry points to a location in storage, and the record access coroutine will issue asynchronous I/O and suspend. Control is then returned to the scheduler, which will later check whether the I/O has finished and resume the transaction coroutine if so. After that, the data read from storage is converted into a node in the in-memory version chain for future accesses.

Our implementation uses io_uring [9]; other asynchronous I/O libraries can also work with MosaicDB as long as they present an asynchronous interface that allows the application to separately issue and check for I/O completion. With io_uring, each worker thread has a thread-local I/O module where there is a "ring" for I/O operations. I/O requests are processed using submission/completion queues. The submission queue in each ring has a fixed number of slots for submission queue entries (SQEs), defined when the ring is initialized. All transactions on the same worker thread therefore share this ring to access storage. An I/O request is issued by creating an SQE, and completions are indicated by completion queue entries (CQEs). CQEs can complete out of order with respect to the order in which SQEs were submitted. We therefore tag each SQE with the issuing transaction's ID to distinguish different transactions' CQEs.

Now we walk through the process of reading a record using the example shown in Figure 4. (a) Worker thread 1 first starts Transaction 1 (a coroutine) and begins to read the record matching key A (step ① in the figure). (b) It then probes the index, during which we issue prefetch instructions followed by suspend statements (②). (c) Control then returns to the scheduler (⑤). (d) Depending on the scheduler's logic, it resumes Transaction 2 to continue from where it left off (⑥), (e) until it encounters the next suspend or concludes. At some point during the execution of Transaction 1, Thread 1 gets the pointer corresponding to RID N and finds that it points to a location in storage, (f) so the thread issues an asynchronous I/O (③), immediately suspends the transaction after submitting an SQE, and returns to the scheduler (④–⑤). Each time the scheduler wants to resume Transaction 1, it first peeks at the completion queue (in user space) to look for the next completed I/O request and resumes the transaction that the completed request belongs to, which is not necessarily Transaction 1. (g) The worker thread eventually gets the completed I/O request belonging to Transaction 1, retrieves the result, and marks the CQE as consumed. Next, we discuss the scheduling policy used in the above process.

4.2 Basic Storage-Aware Batch Scheduling

Traditional coroutine-to-transaction was designed for in-memory OLTP workloads that exhibit similar transaction profiles, without considering the fact that modern OLTP workloads are increasingly heterogeneous, with varying transaction types (e.g., operational reporting [4] and storage accesses) in addition to short, memory-only transactions. As a result, the worker thread simply gathers a batch of requests at a time and then switches between them upon pre-defined suspend calls. A new batch cannot be admitted until all the transactions in the current batch are concluded (committed or aborted). Such traditional batch scheduling [21, 27, 48] has been widely adopted by existing coroutine-based systems, but exhibits two fundamental issues for mixed memory/storage workloads, which we address in the rest of this section. We start by proposing storage-aware batch scheduling, a simple adaptation of traditional batch scheduling, and then discuss how we evolve it towards the desired scheduling policies.

At first glance, it may seem trivial to extend traditional batch scheduling without any change by simply suspending the transaction coroutine once an asynchronous I/O is issued. The first issue with this approach is that, since I/O request status is part of the transaction state, the scheduler has to resume the coroutine to find out whether a previous in-progress I/O associated with the transaction has finished. If the I/O is still in progress, the scheduler would have to switch to the next transaction, in essence wasting the work of resuming the transaction. This was not a significant problem in pure-memory environments because modern x86 processors typically allow only a small number of outstanding asynchronous memory loads, which makes it possible to make very accurate "educated guesses" about when the data will be fetched into the cache and about the right batch size. Thus, in most cases, after a transaction is resumed, its requested data is indeed in the CPU cache, making the (already low) switching overhead worthwhile. Yet for the I/O case, recall that it is impractical to inline the I/O stack and that I/O latency is still orders of magnitude higher than memory access latency. The former increases the cost of coroutine switching. Coupled with the latter, the two properties mean that when I/O and memory accesses are mixed in a batch of transactions, most of the cycles spent on checking asynchronous I/O completion state are wasted.

Our first scheduling policy (storage-aware batch scheduling) resolves this issue by decoupling I/O status tracking and transaction context. As shown in Figure 5(a), for each worker thread we separately allocate an array of I/O status tracking structures,⁴ each of which corresponds to a transaction in the current batch being processed by the thread. The scheduler then still works in the same way as before, but before resuming a transaction, it first checks whether the transaction was suspended because of an asynchronous I/O request. If so, it directly checks the thread-local I/O status without resuming the transaction, and only resumes the transaction if the I/O has completed; otherwise it directly proceeds to the next transaction. If the transaction was suspended due to a potential CPU last-level cache miss, the scheduler still directly resumes it, because software currently cannot query CPU cache status.

⁴ io_uring contexts in our current implementation. Each I/O request is tagged with the issuing transaction ID to distinguish I/O completions for different transactions.
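The check-before-resume idea can be sketched as follows. This is a minimal illustration with hypothetical Slot bookkeeping; the actual implementation tracks io_uring completion state per slot as noted in footnote 4, and the polling that sets io_done is omitted.

    #include <array>
    #include <coroutine>
    #include <cstddef>

    // Why a transaction last suspended: a possible cache miss after a prefetch,
    // or an asynchronous storage I/O it is waiting on.
    enum class Wait { Prefetch, Io };

    struct Slot {
      std::coroutine_handle<> txn;   // the suspended transaction coroutine
      Wait reason = Wait::Prefetch;  // updated by the transaction when it suspends
      bool io_done = false;          // set when the tracked I/O request completes
      bool done = false;             // transaction concluded (committed/aborted)
    };

    template <std::size_t N>
    void run_batch(std::array<Slot, N>& batch) {
      std::size_t remaining = N;
      while (remaining > 0) {
        for (Slot& s : batch) {
          if (s.done) continue;
          // Storage-aware check: never resume a transaction that is still
          // waiting on I/O; its status is inspected outside the coroutine.
          if (s.reason == Wait::Io && !s.io_done) continue;
          s.txn.resume();            // prefetch case: resume unconditionally
          if (s.txn.done()) { s.done = true; --remaining; }
        }
        // Polling the completion queue, which sets io_done and clears the
        // Wait::Io reason, is omitted here.
      }
    }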
Figure 5: Scheduling policies. (a) Storage-aware batch scheduling can selectively resume storage-bound transactions but can lead to low CPU utilization and throughput. (b) Vanilla (single-queue) pipelined scheduling improves CPU utilization but can starve in-memory transactions. (c) Dual-queue pipelined scheduling isolates memory- and storage-bound transactions to fully utilize storage bandwidth while maintaining high in-memory transaction performance.

4.3 Pipelined Scheduling

Storage-aware batch scheduling avoids wasting CPU cycles on resuming/suspending storage-bound transactions. However, since it uses a fixed-size batch and only processes transactions by full batches, new (memory-resident) transactions may not be scheduled to run in a timely manner even when there are in fact enough CPU cycles. Consider a batch of ten (which is typical) where four of the transactions are I/O-bound and six are memory-bound. Since I/O exhibits much higher latency, it is likely that the more lightweight memory-bound transactions will conclude earlier than the storage-bound ones. With batch scheduling, six out of ten slots in the batch will then sit vacant until the storage-bound transactions conclude much later; new memory-bound transactions would have to wait to be admitted in the next batch, lowering CPU utilization and overall throughput and hurting the performance of memory-resident transactions.

Vanilla Transaction Pipelining. To improve CPU utilization, MosaicDB instead processes transactions in a pipelined fashion that allows individual new transactions to be admitted as long as there is a free slot in the batch. This allows more memory-bound transactions to be processed while I/O is in progress for the rest of the batch. As Figure 5(b) shows, whenever a transaction is concluded, a new request can be admitted to the queue. Such vanilla transaction pipelining was first discussed in CoroBase [21], but was deemed unnecessary for pure in-memory workloads. Under mixed memory/storage workloads, however, it lets more in-memory transactions slip through to maintain high throughput. The downside of this policy is that there is no admission control for transactions and the system can be easily dominated by storage-bound transactions, starving memory-resident data accesses. For example, as shown in Figure 5(b), storage-bound transactions can occupy their slots for a longer time, and as they keep getting admitted, they will eventually dominate all the slots even though they only constitute a small portion of the workload mix. As a stop-gap solution, the system can prioritize memory-bound transactions by reserving a certain number of slots. This way, if the number of storage-bound transactions exceeds a threshold, they will not be admitted and must wait until a slot is freed up.⁵ However, modern SSDs exhibit much higher bandwidth and often require a high queue depth (i.e., number of parallel requests) for software to extract their full potential. As a result, the queue can become very long, requiring the scheduler logic to scan through a long (sub)array of I/O states. This essentially increases the memory prefetching distance, eventually canceling out the benefits of software prefetching for memory-bound transactions. Thus, it is desirable to keep the queue short, while still providing enough I/O slots to saturate the storage device.

Dual-Queue Transaction Pipelining. To solve these issues, we advocate a pipelined design that decouples memory- and storage-bound transactions. Each thread is associated with two queues, for memory- and storage-bound transactions respectively, in addition to a staging area, as shown in Figure 5(c). If a previously memory-bound (storage-bound) transaction on the memory (storage) queue turns to access the cold (hot) store, it is then moved to the corresponding queue. In our current implementation, all new transactions start with a hot-store access because the indexes are in-memory. The worker thread always works on the memory queue unless (1) it finds a cold request, (2) a pre-defined interval is up, or (3) the memory queue is empty. Algorithm 1 shows the details.

Algorithm 1 Dual-Queue Pipelined Scheduling.
 1  def DualQueuePipeline():
 2    [hot] = get_transaction_requests()
 3    while not shutdown:
 4      if hot[i].is_done():
 5        hot_txn_done++
 6        if interval.is_up() or hot.empty():
 7          hot_txn_done = 0
 8          for j = 0 to cold_queue_size - 1:
 9            if cold[j].is_done():
10              idle_slots_in_cold_queue.push(j)
11            else if cold[j].is_hot():
12              staging.push(cold[j])
13              idle_slots_in_cold_queue.push(j)
14            else:
15              tid, read_size = peek()
16              if valid_tid(tid) and
17                 read_size = expected_size:
18                cold[tid].resume()
19        fetch_transaction()
20      else if hot[i].is_cold():
21        cold[idle_slots_in_cold_queue.pop()] = hot[i]
22        fetch_transaction()
23      else:
24        hot[i].resume()
25      i = i % hot_queue_size

In case (1), the worker thread moves the storage-bound transaction to the storage queue and then admits another transaction to fill up the slot. Cases (2) and (3) both happen after a transaction is concluded (line 7 in Algorithm 1). The predefined interval in case (2) counts the number of transactions processed from the hot queue (hot_txn_done in Algorithm 1) and is configurable. Each time a transaction is completed in the memory queue, hot_txn_done is incremented. After that, if the counter reaches a predefined threshold or the memory queue is empty, the worker thread will iterate over the cold queue and process its transactions by checking whether their in-progress I/Os have completed (the same as the storage-aware batch scheduling policy in Section 4.2). This way, the worker thread can attend to memory-bound transactions more promptly, while storage-bound transactions are checked and resumed less frequently to match storage I/O capabilities.

⁵ Most systems already feature similar admission control mechanisms, which can be modified to work with MosaicDB, so we do not repeat them here.
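The peek() call in Algorithm 1 corresponds to a non-blocking check of the thread-local completion queue. A minimal sketch of the tag-and-peek idea using liburing is shown below; it is our own illustration under the assumption that plain liburing calls are used, not MosaicDB's exact code.

    #include <liburing.h>
    #include <cstdint>

    // Issue an asynchronous read of a cold record from the log, tagging the SQE
    // with the issuing transaction's ID so its CQE can be matched back later.
    void issue_cold_read(io_uring* ring, int log_fd, void* buf, unsigned len,
                         uint64_t log_offset, uint64_t txn_id) {
      io_uring_sqe* sqe = io_uring_get_sqe(ring);   // grab a free SQE slot
                                                    // (nullptr if the SQ is full;
                                                    //  error handling omitted)
      io_uring_prep_read(sqe, log_fd, buf, len, log_offset);
      io_uring_sqe_set_data(sqe, reinterpret_cast<void*>(txn_id));  // tag
      io_uring_submit(ring);                        // does not block the thread
    }

    // Non-blocking peek: report the next completed I/O (if any), returning the
    // owning transaction's ID and the number of bytes read.
    bool peek_completion(io_uring* ring, uint64_t* txn_id, int* read_size) {
      io_uring_cqe* cqe = nullptr;
      if (io_uring_peek_cqe(ring, &cqe) != 0)       // nothing has finished yet
        return false;
      *txn_id = reinterpret_cast<uint64_t>(io_uring_cqe_get_data(cqe));
      *read_size = cqe->res;                        // bytes read, or -errno
      io_uring_cqe_seen(ring, cqe);                 // mark the CQE as consumed
      return true;
    }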
Since a transaction can transition between memory-bound and storage-bound, a transaction on the memory queue may need to find a slot on the storage queue when the latter is already full. In this case, we abort the transaction so as to ensure there are enough slots left to process memory requests. Conversely, a storage-bound transaction may become memory-bound again after processing I/O. It will then be placed back on the memory queue if there is a vacant slot. Otherwise, we place it in the staging area and prioritize transactions from there (over other requests) to be admitted to the memory queue once it has space. This allows us to keep the benefits of software prefetching while extracting the full potential of the storage device. The sizes of the queues need to be tuned according to the target workload and the underlying storage device's capability (bandwidth and IOPS), but we believe this is not a major limiting factor of MosaicDB as the parameter space is not large: the memory queue should be kept similar to the batch sizes used by previous in-memory coroutine-to-transaction engines (8–10) for the two-level coroutine switching mechanism to work well, and the storage queue only needs to be re-tuned when storage capability (bandwidth/IOPS) changes (e.g., by upgrading to a faster SSD).

5 MITIGATING OVERSUBSCRIPTION AND SYNCHRONIZATION LATENCY

Without user-space scheduling like MosaicDB's, many systems leave it to the OS to schedule transactions and oversubscribe the system to improve CPU utilization. Some memory-optimized database engines already try their best not to oversubscribe the system. They typically use at most one software worker thread per hardware hyperthread (or per physical core). Even so, oversubscription can still happen with various background threads, such as log flushers, group commit threads, and checkpointing threads [26]. With a fixed CPU budget (e.g., in the cloud), these background threads still require a non-trivial amount of CPU cycles and can cause OS scheduler activities that lower throughput. Note that although some background threads only run periodically (e.g., checkpointing), others run frequently. We take the widely used parallel logging [60, 63] and pipelined commit [26] mechanisms as an example. With parallel logging, each worker thread accumulates log records in a local buffer and then flushes them to storage upon (group) commit. Further, pipelined commit decouples transactions and threads, such that log flushing does not block the thread from taking the next request. The system then needs to keep pre-committed transactions whose log records have not reached storage, and only notify the client after the log records are persisted. This process is often accomplished using background flusher/committer threads: after the worker thread finishes running the transaction logic, it places the transaction on a commit queue (or a partitioned queue to ease contention) and continues to process the next request. The commit queue is monitored by a committer thread that periodically checks log status and retires transactions from the commit queue as their log records become persistent. Log status is maintained by another flusher thread that invokes I/O primitives (e.g., pwrite) to flush log records submitted by worker threads. As we will see in Section 6, this design can significantly lower throughput under high load, as the background threads compete with foreground worker threads.

MosaicDB leverages the coroutine-to-transaction architecture to eliminate background threads altogether, avoiding oversubscription. All threads in MosaicDB are worker threads, and transaction logging/commit operations are processed in exactly the same way as "normal" I/O operations during forward processing. Upon commit, the worker thread appends the transaction's log records to a thread-local log buffer and, if necessary (e.g., when the buffer is full), issues an asynchronous I/O to flush the log. The scheduler logic then switches to the next transaction while the I/O is in progress, and when it resumes the same transaction, it additionally checks whether the transaction's log records have been made persistent (by comparing the current durable log sequence number with the transaction's commit log sequence number). If so, it resumes the transaction to finish post-processing (e.g., finalizing its newly generated versions). This way, all the additional I/O requests associated with transaction commit are handled by worker threads. Moreover, the scheduling queues in Section 4 replace the commit queue in past systems, saving resources and avoiding a potential source of contention. Other work (e.g., checkpointing) can be performed using system transactions [17], which are then processed in similar ways.

MosaicDB's coroutine-oriented architecture also makes it possible to limit contention while maintaining a high degree of concurrency. Each worker thread can keep multiple transactions open, yet only one of the transactions will be actively running at any time, while the others are suspended waiting for data to be fetched to the CPU cache or memory. Compared to systems with oversubscription, MosaicDB limits contention naturally to the degree of multiprogramming, i.e., at most there will be as many contenders for a shared resource as there are hyperthreads/cores, keeping the OS scheduler out of the critical path. Without oversubscription and excessive contention, individual data structures, e.g., indexes, can then focus on handling contention only up to the physical resources available. Various solutions already exist and MosaicDB can easily adopt them [3].
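The worker-side commit check described above can be sketched as follows, with hypothetical names (LogBuffer, commit_lsn, a simple durable_lsn counter); the engine's actual log manager interface differs.

    #include <cstdint>

    // Hypothetical per-thread log state: durable_lsn is advanced when an
    // asynchronous log flush, issued by the worker thread itself, completes.
    struct LogBuffer {
      uint64_t durable_lsn = 0;   // highest LSN known to be persistent on storage
    };

    struct Txn {
      uint64_t commit_lsn = 0;    // LSN of this transaction's last log record
      bool pre_committed = true;  // logic finished, waiting for log durability
    };

    // Called by the worker thread in between requests; there is no background
    // committer thread. Returns true when the transaction can be fully committed
    // (post-processing done, results returned to the application).
    bool try_finalize_commit(const LogBuffer& log, Txn& t) {
      if (!t.pre_committed) return false;
      if (t.commit_lsn <= log.durable_lsn) {
        // Log records are persistent: finish post-processing, e.g., finalize the
        // transaction's newly generated versions, then report the commit.
        t.pre_committed = false;
        return true;
      }
      // Otherwise keep the transaction pre-committed; the scheduler will check
      // again after the in-flight flush completes and durable_lsn advances.
      return false;
    }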
6 EVALUATION

In this section, we evaluate MosaicDB under various workloads that exhibit latency from memory/storage accesses, CPU core oversubscription and synchronization. Through experiments, we confirm:

• MosaicDB can hide storage access latency without drastically impacting in-memory transaction throughput. Meanwhile, it yields higher throughput for storage-bound transactions than the traditional thread-to-transaction execution model does.
• MosaicDB can avoid CPU core oversubscription by removing background threads altogether, improving performance.
• MosaicDB effectively mitigates synchronization overhead under high contention by limiting contention levels.

6.1 Experimental Setup

Hardware and Software. We use a dual-socket server equipped with two 24-core Intel Xeon Gold 6342 CPUs clocked at 2.80 GHz (up to 3.50 GHz with turbo boost). Each CPU has 36MB of cache. In total the server has 256GB of main memory, occupying all six channels per socket to maximize memory bandwidth. We use three SSDs in the server to understand the impact of storage devices: a 500GB Samsung 980 PRO [51], a 375GB Intel Optane SSD DC P4800X [23], and a 480GB Dell SATA SSD [12]. Using fio, we observe that for random accesses the Samsung/Intel/Dell SSDs can deliver up to 860K/490K/125K IOPS, respectively. Unless otherwise specified, we use the Samsung SSD. All data is persisted on the storage device, with data in the hot store also kept in memory. For all experiments, hyperthreading is disabled and direct I/O (O_DIRECT) is enabled to simplify the interpretation of results. The server runs Ubuntu Linux 22.04.3 LTS with kernel 5.15. In our experiments, we set the I/O page size to 2KB and scale the number of threads up to the point where the SSD is saturated. For asynchronous I/O we use io_uring, and we compiled all code using GCC 11 with all optimizations enabled.

Benchmarks. To focus on evaluating the effectiveness of our designs in hiding latencies, we implement benchmarks that directly interface with the engine via C++ APIs, bypassing the SQL and networking layers. We run different end-to-end benchmarks where the database engine is dominated by latencies generated from (1) storage accesses, (2) CPU oversubscription and (3) synchronization, respectively. For (1), we use microbenchmarks to stress test MosaicDB. Microbenchmarks allow us to closely control the hot and cold data ratios to evaluate different scheduling policies and understand MosaicDB's performance behavior. For (2), we run TPC-C [56] to evaluate the impact of oversubscription caused by background threads and show the effectiveness of MosaicDB. For (3), we run a high-contention microbenchmark that issues single-step insert-only transactions with monotonically increasing keys to study how well MosaicDB can handle synchronization latency. We give the detailed workload setups in the corresponding sections below.

6.2 Larger-Than-Memory Performance

Our first experiment evaluates the performance of memory- and storage-bound transactions.

Workloads. To stress test MosaicDB, we separately load two tables, one as the hot store with all data in memory, and the other as the cold store with all data only in secondary storage (the log in our design). Accesses to both stores are done by (1) traversing the index to obtain an RID and (2) using the RID to access the indirection array. For hot accesses, in step (2) the transaction traverses an in-memory version chain, while for cold accesses the transaction in step (2) obtains the permanent address of the record in storage and issues an I/O request to access it. Note that we do not cache cold records in memory, i.e., accessing a storage-resident record will always lead to an I/O. This allows us to reliably interpret experimental results and independently scale the sizes of both stores. Implementing a cache is promising but orthogonal future work. We use 8-byte keys and an 8-byte value for each data record. All experiments start with a freshly loaded database with 300 million hot records and 3 million cold records, in total taking ∼16GB of storage space (including padding as required by direct I/O). The number of cold records does not matter much here because cold records are not cached, so we can constantly create storage-bound transactions during the experiments. We vary the percentage of storage-bound transactions and follow a uniform distribution to choose the records to access. We evaluate two types of workloads: read-only and read-write. For read-only, an in-memory transaction reads 10 records in the hot table, and a storage-bound transaction reads 8/2 records in the hot/cold tables. We use 40 threads, which can saturate the SSD. For read-write workloads, there are 50% read-only transactions (10 reads) and 50% read-modify-write (RMW) transactions (10 updates); cold read-only transactions have 2/8 cold/hot reads; cold RMW transactions have 2/8 cold/hot updates, where a cold update is a cold read followed by an update. Based on the IOPS and bandwidth that our storage device can offer, we use 10 threads to saturate the storage device with the I/Os needed for reading cold records and flushing log records.

Variants. We vary the scheduling policy discussed in Section 4.

• Sequential: Baseline that uses traditional thread-to-transaction to process transactions sequentially, but still uses prefetch instructions as a best-effort optimization for memory accesses. Cold store accesses are done using synchronous io_uring calls: the thread spins until the I/O completes.
• Batch: The storage-aware batch scheduling policy described in Section 4.2. We tuned the batch size and set it to 8.
• Pipeline: The vanilla pipelined scheduling policy in Section 4.3. The queue size is set to 16, with memory- and storage-bound transactions each using half the capacity.
• MosaicDB: Same as Pipeline but uses the dual-queue pipelined scheduling policy. Unless otherwise noted, we set the memory/storage queue sizes to 8/16 and check the storage queue once after eight transactions are concluded.

Read-Only Throughput. Figure 6 shows the throughput and latency of MosaicDB and the baselines, along with SSD IOPS, as the workload mix features more storage-bound transactions. When the workload is purely in-memory, as expected, the three coroutine-to-transaction variants all outperform Sequential, corroborating results obtained by prior work [21]. This performance improvement comes from the increased interleaving of transactions, which allows us to hide memory data stalls by overlapping software prefetching and computation. As we start to increase storage-bound transactions in the mix, the throughput of Sequential drops drastically, mainly because of waiting for I/O completions via spinning. For Batch, the storage-bound transactions stall the admission of new transactions, which eventually runs into the same problem as Sequential. That is, the scheduling logic eventually degrades into a Sequential-equivalent that waits synchronously for the storage-bound transactions.
Sequential Batch Pipeline MosaicDB Sequential Batch Pipeline MosaicDB
1600
Total KTPS
Total KTPS
6000
4000 1200
800
2000 400
0 0 5 10 25 50 75 0 0 5 10 25 50 75
6000
Hot KTPS
1600
Hot KTPS
4000 1200
2000 800
0 400
0 5 10 25 50 75 0 0 5 10 25 50 75
600
Cold KTPS
250
Cold KTPS
400 200
200 150
0 100
0 5 10 25 50 75 50
0 0 5 10 25 50 75
900
KIOPS
0.4 0 5 10 25 50 75
0.3
0.2 900
0.1 600
0 0 5 10 25 50 75 300
Percentage of storage-bound transactions 0 0 5 10 25 50 75
0.3
Figure 6: Read-only microbenchmark results under varying 0.2
percentages of storage-bound transactions (40 threads). 0.1
0 0 5 10 25 50 75
Percentage of storage-bound transactions
issue and check I/O states still led to reduced interleaving of in-memory transactions. For coroutine variants in general, a single worker thread can handle multiple I/O requests, thus hiding storage latency with interleaving and improving SSD utilization by up to a factor of two (measured in IOPS at 10% storage-bound transactions). When the workload is still memory-bound (i.e., 5% storage-bound transactions), the total throughput of Batch/Pipeline/MosaicDB is 1.6×/1.9×/1.9× higher than Sequential's. Across benchmarks, MosaicDB commits 1.55×–33× more in-memory transactions than Sequential and achieves up to 33× speedup over Batch. When the workload becomes more storage-bound, the hot throughput of Pipeline cannot keep up with MosaicDB's, because Pipeline always needs to check the storage-bound transactions, which is unnecessary when the storage device is saturated. In contrast, MosaicDB has two queues, allowing threads to work on the in-memory queue most of the time. The loss of total throughput of MosaicDB is kept at 16% when the SSD is just fully utilized (i.e., at 10% storage-bound transactions).
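A minimal sketch of this dual-queue scheduling idea follows. It is a deliberately simplified model, not MosaicDB's actual scheduler: the DualQueueScheduler name and the poll_every knob are our own, and transactions are modeled as resumable callbacks rather than C++ coroutines. It only illustrates how a worker drains the hot queue on every iteration and polls the cold queue for I/O completions occasionally.

    #include <deque>
    #include <functional>
    #include <iostream>

    // Each transaction is modeled as a resumable step that returns true once it has
    // committed and false when it must be resumed later (e.g., its I/O is still pending).
    using Txn = std::function<bool()>;

    struct DualQueueScheduler {
      std::deque<Txn> hot;   // in-memory transactions: reserved slots, checked every iteration
      std::deque<Txn> cold;  // storage-bound transactions: polled only every poll_every rounds
      int poll_every = 8;    // hypothetical knob controlling how often the cold queue is visited

      void run() {
        int round = 0;
        while (!hot.empty() || !cold.empty()) {
          if (!hot.empty()) {                        // spend most cycles on the in-memory queue
            Txn t = std::move(hot.front()); hot.pop_front();
            if (!t()) hot.push_back(std::move(t));   // not done yet: resume it later
          }
          if (++round % poll_every == 0 && !cold.empty()) {  // occasionally check cold I/Os
            Txn t = std::move(cold.front()); cold.pop_front();
            if (!t()) cold.push_back(std::move(t));
          }
        }
      }
    };

    int main() {
      DualQueueScheduler s;
      for (int i = 0; i < 4; ++i)
        s.hot.push_back([i] { std::cout << "hot txn " << i << " committed\n"; return true; });
      int io_polls = 3;  // pretend the cold transaction's I/O needs three polls to complete
      s.cold.push_back([&io_polls] {
        if (--io_polls > 0) return false;            // I/O still in flight
        std::cout << "cold txn committed\n"; return true;
      });
      s.run();
    }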
Read-Only Latency. Interleaving-based approaches like MosaicDB may trade latency for throughput. As shown in Figure 6, Sequential consistently exhibits the lowest latency. Pipeline uses twice as many slots as Batch and MosaicDB's hot queue, leading to ∼2× the latency of Sequential for in-memory transactions. The latency of MosaicDB always stands in between Batch's and Pipeline's: compared to Batch, MosaicDB is also affected by the storage queue, although the scheduler does not always check the cold queue. With more storage-bound transactions, Batch's latency is bounded by the batch size, which is smaller than MosaicDB's cold queue size, so we observe a growing gap between Batch and MosaicDB. Compared to Pipeline, MosaicDB spends more cycles on the hot queue, leading to slightly lower latency than Pipeline.

Figure 7: Read-write (50% read-only, 50% update-only) microbenchmark results under 10 threads and a varying percentage of storage-bound transactions.

Read-Write Performance. Figure 7 shows the performance of MosaicDB under the RMW workload where records are fetched from storage, updated in memory, and then persisted in the log. This workload is write-heavy with 50% RMW transactions. In contrast to cold reads, which use 2KB I/Os, log records are first buffered in memory and persisted to the SSD in batches (8MB per thread in our setup), which makes write IOPS an order of magnitude lower than read IOPS. For the in-memory workload, Batch/Pipeline/MosaicDB is 1.39×/1.29×/1.29× faster than Sequential. This is because log flushes are asynchronous and we use double buffering (sketched below) to avoid blocking writers while a write I/O is in progress; therefore, prefetching still plays a role in improving throughput. However, since RMW transactions contribute more I/O API calls for log flushes, fewer cycles can be dedicated to software prefetching, leading to lower speedups for in-memory transactions. With more storage-bound transactions, the coroutine-oriented variants overall follow a trend similar to the read-only benchmarks, with MosaicDB maintaining high performance for both in-memory and storage-bound transactions. MosaicDB is up to 2.1× faster than Sequential in terms of overall performance, with comparable storage-bound transaction processing speed. Batch usually has the highest cold throughput as it treats hot and storage-bound transactions equally, which means more storage-bound transactions get processed while in-memory transactions are starved. From 0% to 50% storage-bound transactions, the total throughput of MosaicDB drops by 21%, but it is still on par with Sequential's in-memory throughput. In terms of latency, we observed results similar to those from the read-only workloads, except that the cold queue size of MosaicDB is doubled in this workload to accommodate more storage-bound transactions, which become slower due to the increased storage access latency.
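The double-buffering scheme for log flushes can be sketched as follows. The buffer size matches the 8MB-per-thread setup described above, but the structure and the submit_async_write stub are our own simplifications: the real engine hands the full buffer to an asynchronous I/O interface instead of printing a message, and waits for that write's completion before reusing the buffer.

    #include <array>
    #include <cstddef>
    #include <iostream>
    #include <string_view>
    #include <vector>

    // Two log buffers per thread: writers append to the active one while the other
    // is (conceptually) being written to the SSD in the background.
    struct DoubleBufferedLog {
      static constexpr size_t kBufferBytes = 8u << 20;  // 8MB flush batches per thread
      std::array<std::vector<std::byte>, 2> buffers;
      int active = 0;

      // Stand-in for an asynchronous write submission (e.g., via io_uring or libaio).
      void submit_async_write(const std::vector<std::byte>& buf) {
        std::cout << "submitted " << buf.size() << " bytes to the SSD\n";
      }

      void append(std::string_view record) {
        auto& buf = buffers[active];
        if (buf.size() + record.size() > kBufferBytes) {
          submit_async_write(buf);   // hand the full buffer to the I/O layer...
          active ^= 1;               // ...and keep appending to the other buffer,
          buffers[active].clear();   // assuming its earlier write has already completed
        }
        auto* p = reinterpret_cast<const std::byte*>(record.data());
        buffers[active].insert(buffers[active].end(), p, p + record.size());
      }
    };

    int main() {
      DoubleBufferedLog log;
      std::vector<char> rec(4u << 20, 'x');             // a 4MB log record
      for (int i = 0; i < 3; ++i)                       // the third append triggers a flush
        log.append(std::string_view(rec.data(), rec.size()));
    }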
Figure 8: Tuning storage queue size for a read-only workload (total KTPS for storage queue sizes QS = 16, 32, 64, and 128 under 5%, 25%, and 50% storage-bound transactions).
The overhead grows with the depth of the coroutine chain; the same effect was also documented elsewhere [21]. With storage access, selectively nested coroutines have deeper coroutine chains than flattened coroutines on the storage path, but the performance is still comparable: storage latency dominates, so the extra cost of the deeper chain is largely hidden.
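To illustrate what chain depth means here, the sketch below contrasts a nested coroutine chain with a flattened one using a minimal C++20 task type with symmetric transfer. The task type and the get/probe/fetch decomposition are illustrative, not CoroBase's or MosaicDB's actual coroutine infrastructure.

    #include <coroutine>
    #include <iostream>

    struct task {
      struct promise_type {
        std::coroutine_handle<> continuation = std::noop_coroutine();
        task get_return_object() {
          return task(std::coroutine_handle<promise_type>::from_promise(*this));
        }
        std::suspend_always initial_suspend() noexcept { return {}; }
        struct final_awaiter {
          bool await_ready() noexcept { return false; }
          std::coroutine_handle<> await_suspend(std::coroutine_handle<promise_type> h) noexcept {
            return h.promise().continuation;  // resume whoever awaited this coroutine
          }
          void await_resume() noexcept {}
        };
        final_awaiter final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() { std::terminate(); }
      };

      explicit task(std::coroutine_handle<promise_type> handle) : h(handle) {}
      task(const task&) = delete;
      ~task() { if (h) h.destroy(); }

      // Awaiting a task suspends the parent and symmetrically transfers control to the child.
      bool await_ready() { return false; }
      std::coroutine_handle<> await_suspend(std::coroutine_handle<> parent) {
        h.promise().continuation = parent;
        return h;  // one extra suspend/resume hop per nesting level
      }
      void await_resume() {}

      std::coroutine_handle<promise_type> h;
    };

    task fetch_record() { std::cout << "  fetch record\n"; co_return; }
    task probe_index()  { std::cout << "  probe index\n"; co_await fetch_record(); }

    // Nested: get -> probe_index -> fetch_record, a chain of three coroutine frames.
    task get_nested() { co_await probe_index(); }

    // Flattened: the same steps inlined into a single coroutine frame.
    task get_flat() { std::cout << "  probe index\n  fetch record\n"; co_return; }

    int main() {
      std::cout << "nested:\n"; { task t = get_nested(); t.h.resume(); }
      std::cout << "flat:\n";   { task t = get_flat();   t.h.resume(); }
    }

Each co_await of a child task adds one suspend/resume hop, which is exactly the per-level cost that flattening removes; when the work in the chain is storage-bound, that hop is small relative to the I/O itself.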
REFERENCES
[1] Tiemo Bang, Norman May, Ilia Petrov, and Carsten Binnig. 2022. The Full Story of 1000 Cores: An Examination of Concurrency Control on Real(ly) Large Multi-Socket Hardware. The VLDB Journal 31, 6 (apr 2022), 1185–1213.
[2] Adrian M. Caulfield, Todor I. Mollov, Louis Alex Eisner, Arup De, Joel Coburn, and Steven Swanson. 2012. Providing Safe, User Space Access to Fast, Solid State Disks. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVII). 387–400.
[3] Hokeun Cha, Xiangpeng Hao, Tianzheng Wang, Huanchen Zhang, Aditya Akella, and Xiangyao Yu. 2023. Blink-Hash: An Adaptive Hybrid Index for In-Memory Time-Series Databases. Proc. VLDB Endow. 16, 6 (apr 2023), 1235–1248.
[4] Surajit Chaudhuri, Umeshwar Dayal, and Vivek Narasayya. 2011. An Overview of Business Intelligence Technology. Commun. ACM 54, 8 (aug 2011), 88–98.
[5] Shimin Chen, Anastassia Ailamaki, Phillip B. Gibbons, and Todd C. Mowry. 2004. Improving Hash Join Performance through Prefetching. In Proceedings of the 20th International Conference on Data Engineering (ICDE ’04). 116.
[6] Shimin Chen, Phillip B. Gibbons, and Todd C. Mowry. 2001. Improving Index Performance through Prefetching. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (SIGMOD ’01). 235–246.
[7] Shimin Chen, Phillip B. Gibbons, Todd C. Mowry, and Gary Valentin. 2002. Fractal Prefetching B+-Trees: Optimizing Both Cache and Disk Performance. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (SIGMOD ’02). 157–168.
[8] Jeremy Condit, Edmund B Nightingale, Christopher Frost, Engin Ipek, Benjamin Lee, Doug Burger, and Derrick Coetzee. 2009. Better I/O through byte-addressable, persistent memory. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. 133–146.
[9] Jonathan Corbet. 2019. Ringing in a new asynchronous I/O API. https://lwn.net/Articles/776703/ Linux Weekly News.
[10] Justin DeBrabant, Andrew Pavlo, Stephen Tu, Michael Stonebraker, and Stan Zdonik. 2013. Anti-caching: A new approach to database management system architecture. Proceedings of the VLDB Endowment 6, 14 (2013), 1942–1953.
[11] K. Delaney and C. Freeman. 2013. Microsoft SQL Server 2012 Internals. Pearson Education.
[12] Dell. 2023. Dell 480GB SSD SATA Mixed Use 6Gbps 512e 2.5in Hot-Plug. https://www.dell.com/en-ca/shop/dell-480gb-ssd-sata-mixed-use-6gbps-512e-25in-hot-plug/apd/345-befn/
[13] Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Ake Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, and Mike Zwilling. 2013. Hekaton: SQL Server’s Memory-Optimized OLTP Engine. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD ’13). 1243–1254.
[14] Ahmed Eldawy, Justin Levandoski, and Per-Åke Larson. 2014. Trekking through Siberia: Managing cold data in a memory-optimized database. Proceedings of the VLDB Endowment 7, 11 (2014), 931–942.
[15] Zhuhe Fang, Beilei Zheng, and Chuliang Weng. 2019. Interleaved Multi-Vectorizing. Proc. VLDB Endow. 13, 3 (Nov. 2019), 226–238.
[16] Florian Funke, Alfons Kemper, and Thomas Neumann. 2012. Compacting Transactional Data in Hybrid OLTP&OLAP Databases. Proc. VLDB Endow. 5, 11 (jul 2012), 1424–1435.
[17] Goetz Graefe. 2012. A Survey of B-Tree Logging and Recovery Techniques. ACM Trans. Database Syst. 37, 1, Article 1 (mar 2012), 35 pages.
[18] Goetz Graefe, Haris Volos, Hideaki Kimura, Harumi Kuno, Joseph Tucek, Mark Lillibridge, and Alistair Veitch. 2014. In-Memory Performance for Big Data. PVLDB 8, 1 (sep 2014), 37–48.
[19] Gabriel Haas, Michael Haubenschild, and Viktor Leis. 2020. Exploiting Directly-Attached NVMe Arrays in DBMS. In CIDR (Conference on Innovative Data Systems Research).
[20] Gabriel Haas and Viktor Leis. 2023. What Modern NVMe Storage Can Do, and How to Exploit It: High-Performance I/O for High-Performance Storage Engines. Proc. VLDB Endow. 16, 9 (may 2023), 2090–2102.
[21] Yongjun He, Jiacheng Lu, and Tianzheng Wang. 2020. CoroBase: coroutine-oriented main-memory database engine. Proceedings of the VLDB Endowment 14, 3 (2020), 431–444.
[22] Kaisong Huang, Darien Imai, Tianzheng Wang, and Dong Xie. 2022. SSDs Striking Back: The Storage Jungle and Its Implications to Persistent Indexes. In 12th Conference on Innovative Data Systems Research, CIDR 2022, Chaminade, CA, USA, January 9-12, 2022.
[23] Intel. 2023. Intel® Optane™ SSD DC P4800X Series. https://www.intel.com/content/www/us/en/products/sku/97161/intel-optane-ssd-dc-p4800x-series-375gb-2-5in-pcie-x4-3d-xpoint/specifications.html
[24] Intel Corporation. 2016. Intel 64 and IA-32 Architectures Software Developer Manuals. (Oct. 2016).
[25] ISO/IEC. 2017. Technical Specification — C++ Extensions for Coroutines. https://www.iso.org/standard/73008.html
[26] Ryan Johnson, Ippokratis Pandis, Radu Stoica, Manos Athanassoulis, and Anastasia Ailamaki. 2010. Aether: A Scalable Approach to Logging. PVLDB 3, 1 (Sept. 2010), 681–692.
[27] Christopher Jonathan, Umar Farooq Minhas, James Hunter, Justin Levandoski, and Gor Nishanov. 2018. Exploiting coroutines to attack the killer nanoseconds. Proceedings of the VLDB Endowment 11, 11 (2018), 1702–1714.
[28] Alfons Kemper and Donald Kossmann. 1995. Adaptable Pointer Swizzling Strategies in Object Bases: Design, Realization, and Quantitative Analysis. The VLDB Journal 4, 3 (jul 1995), 519–567.
[29] Alfons Kemper and Thomas Neumann. 2011. HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering (ICDE ’11). 195–206.
[30] Kangnyeon Kim, Tianzheng Wang, Ryan Johnson, and Ippokratis Pandis. 2016. ERMIA: Fast memory-optimized database system for heterogeneous workloads. In Proceedings of the 2016 International Conference on Management of Data. 1675–1687.
[31] Hideaki Kimura. 2015. FOEDUS: OLTP Engine for a Thousand Cores and NVRAM. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD ’15). 691–706.
[32] Onur Kocberber, Babak Falsafi, and Boris Grot. 2015. Asynchronous memory access chaining. Proceedings of the VLDB Endowment 9, 4 (2015), 252–263.
[33] Sridhar K.T. and M.A. Sakkeer. 2014. Optimizing Database Load and Extract for Big Data Era. 8422 (04 2014), 503–512.
[34] H. T. Kung and John T. Robinson. 1981. On Optimistic Methods for Concurrency Control. ACM Trans. Database Syst. 6, 2 (June 1981), 213–226.
[35] Harald Lang, Tobias Mühlbauer, Florian Funke, Peter A Boncz, Thomas Neumann, and Alfons Kemper. 2016. Data blocks: Hybrid OLTP and OLAP on compressed storage using both vectorization and compilation. In Proceedings of the 2016 International Conference on Management of Data. 311–326.
[36] Per-Åke Larson, Spyros Blanas, Cristian Diaconu, Craig Freedman, Jignesh M. Patel, and Mike Zwilling. 2011. High-Performance Concurrency Control Mechanisms for Main-Memory Databases. PVLDB 5, 4 (Dec. 2011), 298–309.
[37] Viktor Leis, Michael Haubenschild, Alfons Kemper, and Thomas Neumann. 2018. LeanStore: In-Memory Data Management beyond Main Memory. In 2018 IEEE 34th International Conference on Data Engineering (ICDE). 185–196.
[38] Viktor Leis, Florian Scheibner, Alfons Kemper, and Thomas Neumann. 2016. The ART of Practical Synchronization. In Proceedings of the 12th International Workshop on Data Management on New Hardware (DaMoN ’16). Article 3, 8 pages.
[39] Justin Levandoski, David Lomet, and Sudipta Sengupta. 2013. LLAMA: A Cache/Storage Subsystem for Modern Hardware. PVLDB 6, 10 (aug 2013), 877–888. https://doi.org/10.14778/2536206.2536215
[40] Justin J. Levandoski, David B. Lomet, and Sudipta Sengupta. 2013. The Bw-Tree: A B-Tree for New Hardware Platforms. In Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE ’13). 302–313.
[41] Hyeontaek Lim, Michael Kaminsky, and David G. Andersen. 2017. Cicada: Dependably Fast Multi-Core In-Memory Transactions. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD ’17). 21–35.
[42] Yandong Mao, Eddie Kohler, and Robert Tappan Morris. 2012. Cache craftiness for fast multicore key-value storage. In Proceedings of the 7th ACM European Conference on Computer Systems. 183–196.
[43] Microsoft. 2018. Windows Technical Documentation. https://docs.microsoft.com/en-us/windows/win32/procthread/fibers?redirectedfrom=MSDN
[44] Todd C. Mowry, Monica S. Lam, and Anoop Gupta. 1992. Design and Evaluation of a Compiler Algorithm for Prefetching. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS V). 62–73.
[45] MySQL 8.0 Reference Manual. 2023. Using Asynchronous I/O on Linux. https://dev.mysql.com/doc/refman/8.0/en/innodb-linux-native-aio.html
[46] Thomas Neumann and Michael J Freitag. 2020. Umbra: A Disk-Based System with In-Memory Performance. In 10th Conference on Innovative Data Systems Research, CIDR 2020, Amsterdam, The Netherlands, January 12-15, 2020, Online Proceedings.
[47] Thomas Neumann, Tobias Mühlbauer, and Alfons Kemper. 2015. Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD ’15). 677–689.
[48] Georgios Psaropoulos, Thomas Legler, Norman May, and Anastasia Ailamaki. 2019. Interleaving with coroutines: a systematic and practical approach to hide memory latency in index joins. The VLDB Journal 28, 4 (2019), 451–471.
[49] Raghu Ramakrishnan and Johannes Gehrke. 2003. Database Management Systems (3rd ed.).
[50] Mohammad Sadoghi, Kenneth A. Ross, Mustafa Canim, and Bishwaranjan Bhattacharjee. 2013. Making Updates Disk-I/O Friendly Using SSDs. PVLDB 6, 11 (2013), 997–1008.
[51] Samsung. 2023. 980 PRO PCIe 4.0 NVMe SSD 500GB. https://www.samsung.com/us/computing/memory-storage/solid-state-drives/980-pro-pcie-4-0-nvme-ssd-500gb-mz-v8p500b-am/
[52] Boris Schling. 2011. The Boost C++ Libraries.
[53] Michael Lee Scott. 2013. Shared-memory synchronization.
[54] Utku Sirin, Pinar Tözün, Danica Porobic, and Anastasia Ailamaki. 2016. Micro-architectural Analysis of In-memory OLTP. In Proceedings of the 2016 International Conference on Management of Data. 387–402.
[55] Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. 2007. The End of an Architectural Era: (It’s Time for a Complete Rewrite). (2007), 1150–1160.
[56] TPC. 2010. TPC Benchmark C (OLTP) Standard Specification, revision 5.11. http://www.tpc.org/tpcc
[57] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden. 2013. Speedy Transactions in Multicore In-Memory Databases. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP ’13). 18–32.
[58] Demian Vöhringer and Viktor Leis. 2023. Write-Aware Timestamp Tracking: Effective and Efficient Page Replacement for Modern Hardware. Proc. VLDB Endow. 16, 11 (jul 2023), 3323–3334. https://doi.org/10.14778/3611479.3611529
[59] Leonard von Merzljak, Philipp Fent, Thomas Neumann, and Jana Giceva. 2022. What Are You Waiting For? Use Coroutines for Asynchronous I/O to Hide I/O Latencies and Maximize the Read Bandwidth! In International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures, ADMS@VLDB 2022.
[60] Tianzheng Wang and Ryan Johnson. 2014. Scalable Logging through Emerging Non-Volatile Memory. PVLDB 7, 10 (June 2014), 865–876.
[61] Tianzheng Wang, Ryan Johnson, Alan Fekete, and Ippokratis Pandis. 2017. Efficiently Making (Almost) Any Concurrency Control Mechanism Serializable. The VLDB Journal 26, 4 (Aug. 2017), 537–562.
[62] Yingjun Wu, Joy Arulraj, Jiexi Lin, Ran Xian, and Andrew Pavlo. 2017. An Empirical Evaluation of In-Memory Multi-Version Concurrency Control. PVLDB 10, 7 (March 2017), 781–792.
[63] Yu Xia, Xiangyao Yu, Andrew Pavlo, and Srinivas Devadas. 2020. Taurus: Lightweight Parallel Logging for In-Memory Database Management Systems. PVLDB 14, 2 (Oct. 2020), 189–201.
[64] Jianqiu Zhang, Kaisong Huang, Tianzheng Wang, and King Lv. 2022. Skeena: Efficient and Consistent Cross-Engine Transactions. In Proceedings of the 2022 ACM SIGMOD International Conference on Management of Data (SIGMOD ’22).