Cache Is King
Tal Zussman*,1 , Ioannis Zarkadas*,1 , Jeremy Carin1 , Andrew Cheng1 , Hubertus Franke2 , Jonas Pfefferle2 ,
and Asaf Cidon1
1 Columbia University, 2 IBM, * denotes equal contribution
Abstract

The page cache is a central part of an OS. It reduces repeated accesses to storage by deciding which pages to retain in memory. As a result, the page cache has a significant impact on the performance of many applications. However, its one-size-fits-all eviction policy performs poorly in many workloads. While the systems community has experimented with a plethora of new and adaptive eviction policies in non-OS settings (e.g., key-value stores, CDNs), it is very difficult to implement such policies in the page cache, due to the complexity of modifying kernel code. To address these shortcomings, we design a novel eBPF-based framework for the Linux page cache, called cachebpf, that allows developers to customize the page cache without modifying the kernel. cachebpf enables applications to customize the page cache policy for their specific needs, while also ensuring that different applications' policies do not interfere with each other and preserving the page cache's ability to share memory across different processes. We demonstrate the flexibility of cachebpf's interface by using it to implement several eviction policies. Our evaluation shows that it is indeed beneficial for applications to customize the page cache to match their workloads' unique properties, and that they can achieve up to 70% higher throughput and 58% lower tail latency.

1 Introduction

In his seminal 1981 paper on OS support for database management, Michael Stonebraker described how existing OS buffer cache mechanisms were ill-suited for the needs of databases at the time [66]. He observed that the buffer cache's one-size-fits-all eviction policy, approximate least-recently used (LRU), cannot possibly address the heterogeneity of database workloads. Nevertheless, in the intervening decades, despite wide-ranging efforts to rethink the UNIX/Linux OS page cache [17, 31, 73], design customizable file systems [10, 11, 41, 52], and build clean-slate extensible kernels [8, 60, 62], applications by and large still contend with Linux's opaque and inflexible OS page cache policy.

At the same time, the diversity of applications and workloads running on Linux has only increased, from enterprise file systems and large-scale distributed datacenter ML training, to multimedia-rich applications running on an Android phone. All of these applications must use Linux's decades-old approximate LRU policy, despite the fact that it is widely known to be inadequate for many workloads and scenarios (e.g., large scans [40, 57, 59], multi-core applications [73]). For example, an application that searches through files in a codebase (a scan-based workload) would benefit from using a most-recently used (MRU) policy, while a key-value store running a fixed, skewed-distribution workload would improve under least-frequently used (LFU). However, both of these workloads currently run with the default Linux eviction policy.

The reasons applications are "stuck" with the same old eviction policy are twofold. First, modifying the Linux page cache is a hard task, requiring extensive kernel knowledge and attention to detail. Second, upstreaming changes to the page cache is difficult, because the changes have to work well for the wide range of applications that run on Linux, forcing a lowest common denominator. For instance, it took Google years to upstream its proposed Multi-Generational LRU (MGLRU) algorithm into the Linux kernel, and even after several years, it is still disabled by default in upstream [17, 19].

In this paper, we attempt to finally answer Stonebraker's plea for better OS support for buffer management, within Linux. To this end, we design a novel framework, cachebpf, which provides visibility and control of the OS page cache, without requiring the application to make kernel changes. cachebpf takes advantage of eBPF [26], a Linux (and Windows) supported runtime that allows safely running application code inside the kernel. We take a cue from sched_ext, an eBPF-based framework that allows applications to customize the OS scheduler [38, 42] and has been adopted by Linux [15].

cachebpf's design is motivated by four main insights. First, modern storage devices are very fast and support millions of IOPS, so custom page cache policies must run with low overhead. Therefore, we design cachebpf so that its eBPF-based policies run in the kernel, avoiding expensive and frequent synchronization between the kernel and userspace. Second, caching algorithms are very diverse and may use complex data structures. To address this challenge, cachebpf exposes a simple yet flexible interface that allows applications to define one or more variable-sized lists of pages, and a set of policy functions (e.g., admission, eviction) that operate on these lists, which can be used to express a wide range of eviction policies. Third, in order for cachebpf to be useful in multi-tenant scenarios, it should allow each application to use its own policy without interfering with others. We identify cgroups as a natural isolation boundary. Thus, cachebpf allows each cgroup to implement its own eviction policy without interfering with other cgroups.
Finally, custom policies determine which pages to evict and return page references to the kernel. However, these references may be invalid, which could lead to kernel crashes or security breaches. To solve this, cachebpf maintains a registry of valid page references, which is used to validate the page references returned by the user-defined policies.

We demonstrate cachebpf's utility and flexibility by implementing four custom eviction policies, which include both "classic" and recently-designed policies: most-recently used (MRU), least-frequently used (LFU), S3-FIFO [70], and least hit density (LHD) [5]. We also show how cachebpf enables application-informed policies with only minor policy changes, allowing applications to design policies that take into account application-level insights. For example, a database can implement a custom policy that prioritizes point queries over scans, yielding higher throughput for point queries. We compare these cachebpf policies with the kernel's default eviction policy and its different options (e.g., fadvise()), and with the recently-upstreamed MGLRU algorithm. We show that with cachebpf, developers can significantly improve their applications' performance far beyond the existing algorithms provided by the Linux page cache. In general, we find that there is no one-size-fits-all policy that improves all workloads – customization is necessary in order to maximize performance. In particular, applications can use cachebpf to improve throughput by up to 38% using "generic" policies, and achieve up to 70% higher throughput and 58% lower P99 latency with application-informed policies.

We will open source cachebpf and all our implemented policies upon publication. A key benefit of cachebpf is that any publicly available eviction policy can be used by other developers, lowering the barrier to using the system and experimenting with eviction policies on different workloads.

Our primary contributions are:

• cachebpf, a flexible, scalable, and safe eBPF-based framework for running custom eviction policies in the Linux kernel page cache.

• A suite of custom eviction policies and userspace libraries allowing developers to easily create new policies.

• An evaluation of cachebpf across various applications, demonstrating how they benefit from customized policies.

2 Background and Motivation

By default, the page cache buffers write and read operations to and from storage devices. In Linux, the page cache tracks pages and stores them in lists (see §2.1), on which it approximates the LRU algorithm. While this scheme works reasonably well for some workloads, it is inadequate for many others. The classic example is scan-heavy workloads, which perform poorly with LRU or its approximations [5, 50, 66]. While Linux provides some interfaces (e.g., fadvise() or sysctl) through which the page cache behavior can be tweaked on a global or per-application basis, these interfaces are opaque and do not perform as intended, as we describe in §2.1 and evaluate in §5.1.5.

Therefore, to avoid compromising performance, some applications implement their own userspace-based caches [4, 29, 54, 56]. However, userspace-based caches are not a panacea. First, they require significant developer effort to implement. Second, they typically require the application to specify in advance how much memory will be allocated for the cache. However, the amount of memory available to an application may change over time (e.g., when multiple applications run on the same physical server). Third, application-specific caches are very hard to share across processes, due to security and compatibility issues. Ultimately, even applications that implement their own userspace-based cache often still rely on the page cache by default as a "second-tier" cache [4, 29, 54], allowing operators to fully utilize the server's memory and share memory across processes. As such, despite the page cache's limitations, it is still used extensively by storage-optimized workloads, such as key-value stores [33, 54, 65], databases [34, 56], and ML inference and training systems [13, 53].

Unfortunately, these factors yield a status quo where potential performance gains are left on the table. Properly customizing the page cache is not an easy task – it is deeply intertwined with other performance- and correctness-critical memory management and filesystem code paths. While work to modernize the page cache is ongoing, it does not yet seem to have achieved this goal. In particular, MGLRU, an alternative LRU implementation for the page cache, has still not been enabled by default in upstream Linux several years after it was introduced, and it does not provide customization interfaces [17, 19]. Indeed, in §5 we show that MGLRU sometimes underperforms and sometimes outperforms the default LRU algorithm, and that in general there is no single eviction policy that performs best across a wide range of workloads.

We now provide a primer on the Linux page cache. We also describe the eBPF framework, which cachebpf uses to allow applications to write custom page cache policies.

2.1 Linux Page Cache

The page cache is a core component of the Linux kernel, responsible for accelerating access to storage. While anonymous memory pages are stored similarly to file-backed memory, in this paper we focus specifically on file-backed memory. The kernel's default eviction policy is an LRU approximation algorithm which uses two FIFO lists.¹ As shown in Figure 1, when a page is first fetched from storage, it is added to the tail of the inactive list. If that page is accessed again, it will eventually be promoted to the active list. The goal of this policy is to use the inactive list as a preliminary filter and keep frequently accessed pages in the active list. When eviction is triggered, pages are removed from the head of the inactive list. If necessary, the page cache will balance the lists by demoting pages from the head of the active list to the tail of the inactive list. Notably, during balancing or shrinking, pages in the active list that have been referenced are typically demoted to the inactive list, rather than being given another chance in the active list, as is typical for LRU or CLOCK algorithms.

¹The Linux page cache algorithm description is based on Linux v6.6.8.
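To make the two-list scheme concrete, the following is a minimal userspace model of the policy described above. It is an illustration only, not kernel code; the type and function names (model_page, admit, reclaim) are ours.

    /* Minimal model of the two-list policy: new pages enter the tail of the
     * inactive list, a later access sets a referenced bit, and reclaim either
     * promotes referenced pages to the active list or evicts from the head
     * of the inactive list. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <sys/queue.h>

    struct model_page {
        long id;
        bool referenced;                  /* set when the page is accessed again */
        TAILQ_ENTRY(model_page) link;
    };

    TAILQ_HEAD(page_list, model_page);
    static struct page_list inactive = TAILQ_HEAD_INITIALIZER(inactive);
    static struct page_list active   = TAILQ_HEAD_INITIALIZER(active);

    /* First fetch from storage: add to the tail of the inactive list. */
    static void admit(struct model_page *p) {
        p->referenced = false;
        TAILQ_INSERT_TAIL(&inactive, p, link);
    }

    /* Reclaim one page: promote referenced inactive pages, evict the rest. */
    static struct model_page *reclaim(void) {
        struct model_page *p;
        while ((p = TAILQ_FIRST(&inactive)) != NULL) {
            TAILQ_REMOVE(&inactive, p, link);
            if (p->referenced) {          /* accessed again: promote */
                p->referenced = false;
                TAILQ_INSERT_TAIL(&active, p, link);
            } else {
                return p;                 /* evict from head of inactive list */
            }
        }
        return NULL;                      /* caller would then shrink the active list */
    }

In the real kernel the referenced state is tracked via page flags and list balancing is driven by reclaim heuristics, but the admission, promotion, and eviction flow follows the same shape.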
[Figure 1: Overview of the current Linux page cache eviction policy. Pages are admitted to the tail of the inactive list, promoted to the active list, demoted back to the inactive list, and evicted from the head of the inactive list.]

Importantly, active and inactive lists are segmented by cgroup. cgroups are a Linux feature which isolate resource usage for groups of processes [35]. Each cgroup has its own set of page cache lists which count toward its memory allocation, allowing for cgroup-specific eviction when its memory threshold is reached. Processes in cgroup A can access a page "owned" by cgroup B – such an access will update the page's metadata (affecting its placement in cgroup B's lists), but will not count against cgroup A's memory limit. The combination of these per-cgroup lists makes up the page cache as a whole.²

²Technically, each NUMA node has its own set of per-cgroup lists, but this does not affect our design.

The page cache also keeps track of "shadow entries" in order to mitigate thrashing. These entries keep track of metadata enabling calculation of a page's refault distance (i.e., the time elapsed between eviction and the new request). If a page has been evicted and then fetched again recently enough, the kernel may decide to insert it directly into the active list instead of the inactive list. There are several additional edge cases and heuristics in the kernel's implementation, but these are the broad strokes of the existing policy.

Folios. Linux developers are in the process of converting various usages of struct page to folios, which represent either zero-order pages (a single page) or the head page of a compound page (a group of contiguous physical pages that can be treated as a single larger page) [16]. While the page cache now largely uses folios, we use the terms "folio" and "page" interchangeably, as in our workloads all folios represent a single page.

Userspace interfaces. While LRU is a commonly-used eviction policy that works well across many workloads, there are many applications that would benefit from a different policy for part or the entirety of their I/O requests. For example, LRU is notoriously bad for scan-like access patterns. This gap between applications and the kernel can be partially mitigated by the madvise() and fadvise() system calls. These interfaces allow userspace applications to give hints to the kernel about how to handle certain ranges of memory or files. While these hints may help in simple cases, we show in our evaluation that they don't function as expected for some workloads. Additionally, while the hints may have a semantic meaning, their actual behavior is highly dependent on the kernel implementation, which is opaque, may change across versions, and can yield unexpected results [9, 49]. Advice values may also be ignored by the kernel for a range of reasons, or may have restrictions on what memory they can be applied to. Most importantly, these hints are still subject to the basic inflexible structure of the kernel's approximate LRU policy.
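For reference, this is how an application issues such hints today, using the standard POSIX interface (a minimal sketch; the file name is a placeholder):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);   /* placeholder file */
        if (fd < 0) { perror("open"); return 1; }

        /* Hint: we will scan this file sequentially (may enlarge readahead). */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        /* ... read the file ... */

        /* Hint: drop the cached pages for this file, we will not reuse them. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        close(fd);
        return 0;
    }

As noted above, these calls are advisory: the kernel may ignore them or interpret them differently across versions, and they cannot change the underlying eviction policy.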
2.2 eBPF

eBPF [26] is a sandboxing technology that enables userspace functions to run in the Linux kernel in a safe and controlled manner. eBPF has found many use cases, including observability [37], security [39, 51], scheduling [15, 38, 42], and I/O acceleration [36, 71, 74–76]. eBPF programs are verified by the kernel before they can be run, ensuring, for example, that the programs don't contain illegal memory accesses, and that they will terminate within a fixed number of instructions.

3 Challenges

There are several challenges in allowing applications to customize the page cache using eBPF. We describe them below.

1. Scalability. Modern SSDs support millions of IOPS [23, 63], requiring the page cache to efficiently handle millions of events a second. Any changes to the page cache in order to enable custom policies must incur a low overhead, and the policies themselves must also be efficient.

2. Flexibility. Researchers have proposed many different caching algorithms for different use cases. These algorithms often require custom data structures. Any interface for custom policies must be flexible enough to accommodate the diversity of existing caching algorithms.

3. Isolation and sharing. The page cache is shared by many applications. Therefore, we must avoid a situation where one application's policy interferes with those of other applications, while still allowing applications to benefit from the shared nature of the page cache.

4. Memory safety. Custom eviction policies return page references to the kernel to indicate which pages to evict. This must not lead to unsafe memory references.
Workload   Baseline       Userspace-dispatch   % Degradation
YCSB A     82,808 op/s    69,089 op/s          -16.6%
YCSB C     76,166 op/s    62,578 op/s          -17.8%
Uniform    44,618 op/s    35,443 op/s          -20.6%
Search     42.3 s         44.4 s               -4.7%

Table 1: Performance of workloads without and with userspace-dispatch.

[Figure: Application (userspace), Linux Page Cache with folios (kernel), connected by events and actions.]
// Policy function hooks
struct cachebpf_ops {
    s32 (*policy_init)(struct mem_cgroup *memcg);
    // Propose folios to evict
    void (*evict_folios)(struct eviction_ctx *ctx,
                         struct mem_cgroup *memcg);
    void (*folio_added)(struct folio *folio);
    void (*folio_accessed)(struct folio *folio);
    // Folio was removed: clean up metadata
    void (*folio_removed)(struct folio *folio);
    char name[CACHEBPF_OPS_NAME_LEN];
};

struct eviction_ctx {
    u64 nr_candidates_requested; /* Input */
    u64 nr_candidates_proposed;  /* Output */
    struct folio *candidates[32];
};

Figure 3: struct_ops for cachebpf and eviction context.

Eviction list API
u64 list_create(struct mem_cgroup *memcg)
int list_add(u64 list, struct folio *f, bool tail)
int list_move(u64 list, struct folio *f, bool tail)
int list_del(struct folio *f)
int list_iterate(struct mem_cgroup *memcg, u64 list,
                 s64 (*iter_fn)(int id, struct folio *f),
                 struct iter_opts *opts,
                 struct eviction_ctx *ctx)

Table 2: cachebpf eviction list API.

³The actual functions have a "cachebpf" prefix to prevent name collisions, but we omit it for brevity.

cachebpf's policy interface exposes hooks for five events: policy initialization, requests for eviction, folio admission, folio access, and folio removal. The policy function interface is implemented using eBPF's recent struct_ops kernel interface [46], as shown in Figure 3.

These five events are central to caching decisions in the page cache. Notably, requests for eviction and folio removal are different: the former involves the kernel asking the policy to propose folios to evict, and the latter is the kernel informing the policy that a folio was actually evicted. This distinction exists for the following two reasons. A folio can be evicted in circumvention of the "normal" eviction path if, for example, the file containing it is deleted. Conversely, in rare cases, proposing a folio for eviction does not guarantee that it will be evicted (e.g., the folio is in active use by the kernel).

We use eBPF's struct_ops feature in order to minimize the verifier changes needed to add new eBPF hooks. struct_ops was designed to allow kernel subsystems to expose modular interfaces to eBPF components, and has already been used for TCP congestion control algorithms, FUSE eBPF filesystems, handling HID driver quirks, and sched_ext [15, 18, 27, 45]. struct_ops programs are loaded into the kernel like any other eBPF program. Using struct_ops also makes it much easier to extend cachebpf and add new hooks. For example, we implemented an extension to cachebpf that added a page cache admission filter with only 15 additional lines of verifier-related code.

4.2.2 Eviction Lists

Eviction algorithms are implemented on a wide range of data structures. Nevertheless, we observe that many of these policies can be implemented either exactly or approximately using linked lists, where the policy iterates over one or more lists and evicts items based on a calculated per-item score. For example, the "classic" eviction policies (e.g., LRU, MRU) are all based on lists, with items inserted or evicted from the head or tail of a list. Similarly, families of policies like ARC [50] or segmented LRU [43] can be implemented using multiple variable-sized lists, where items are inserted into any list or moved between lists. Even recent "state-of-the-art" policies, such as LHD, S3-FIFO, or LRB, either store data directly in a list [70, 72], or simply select a sample of objects and evict the ones with the lowest score [5, 64].

In order to facilitate an interface flexible enough for all these policies, cachebpf is built around an eviction list API, a simple interface for policies to construct and manipulate a set of variable-sized linked lists. Each node in the list corresponds to a single folio, and stores a pointer to that folio, rather than the folio itself. Importantly, the actual folios are still stored and maintained by the default kernel page cache implementation, in order to minimize changes to the kernel.

This API is implemented as a set of eBPF kfuncs (in-kernel functions that are exposed to eBPF programs) and is shown in Table 2.³ For example, policy_init() will typically call list_create() to create a new eviction list, and folio_added() will call list_add() to add the folio to a list. Newly-created lists are added to a "registry", an internal per-policy hash table which maps between the list IDs (exposed to eBPF) and the lists themselves. Notably, these lists are indexed – that is, given a folio pointer, the APIs can directly obtain that folio's list node. This property is necessary for operations such as deletion from the list, and is facilitated using a per-policy hash table which maps from folios to list nodes. We discuss the use of this hash table further in §4.4.
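As an illustration of how a policy plugs into this interface, a minimal FIFO-style skeleton might look as follows. This is a sketch under assumptions: the header, SEC() section names, and kfunc declarations are modeled on how other struct_ops users (e.g., sched_ext, TCP congestion control) are written, and are not taken from cachebpf's released code.

    #include "vmlinux.h"               /* from a cachebpf-enabled kernel (assumption) */
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    /* Eviction-list kfuncs from Table 2; the real names carry a cachebpf prefix. */
    extern u64 cachebpf_list_create(struct mem_cgroup *memcg) __ksym;
    extern int cachebpf_list_add(u64 list, struct folio *f, bool tail) __ksym;

    u64 fifo_list;

    SEC("struct_ops/policy_init")
    s32 BPF_PROG(fifo_init, struct mem_cgroup *memcg)
    {
        fifo_list = cachebpf_list_create(memcg);   /* one list for all folios */
        return 0;
    }

    SEC("struct_ops/folio_added")
    void BPF_PROG(fifo_added, struct folio *folio)
    {
        cachebpf_list_add(fifo_list, folio, true); /* append in arrival order */
    }

    SEC(".struct_ops.link")
    struct cachebpf_ops fifo_ops = {
        .policy_init = (void *)fifo_init,
        .folio_added = (void *)fifo_added,
        .name        = "fifo",
    };

    char LICENSE[] SEC("license") = "GPL";

The userspace side then loads and attaches this object like any other struct_ops program, scoped to a cgroup as discussed in §4.3.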
4.2.3 Eviction Candidate Interface

To facilitate eviction, policy functions iterate over their eviction lists in order to determine which folios to evict. Note that policies do not directly evict folios – rather, they propose eviction candidates to the kernel, which checks if the folios are indeed valid eviction targets (i.e., not pinned or in other use by the kernel) and evicts them if so.

The eBPF framework currently does not provide a clean way to iterate over the eviction lists, so cachebpf provides a new iteration kfunc which allows policy functions to specify how to iterate over a list and make decisions for each node. Specifically, list_iterate() takes a list to iterate over, an options struct, an eviction context, and a callback function. The callback function, which is also an eBPF program, is called on each node, and the policy decides whether to keep or evict that folio. Folios chosen for eviction are added to the candidates array in the eviction_ctx struct. The options struct specifies how the interface should treat evaluated folios. For example, they can be left in place, moved to the tail of the list, or moved to a different list. This enables implementing policies that make use of multiple lists and require balancing the lists, such as S3-FIFO or ARC. We also provide a "batch scoring" mode for this interface, where the callback function is used to compute scores for N folios, with the C folios with the lowest score chosen for eviction. This mode can be used for policies such as LFU.

In order to ensure correct verification of our callback functions, we added ∼80 lines to the eBPF verifier to "register" our iteration interfaces, on top of eBPF's existing support for callback functions. Additionally, to protect against eBPF program misbehavior, this interface performs the requisite bounds-checking and enforces loop termination.

4.2.4 eBPF Limitations

We ran into a number of challenges when implementing cachebpf's eviction list API. eBPF maps, the standard way to maintain state in eBPF programs, do not provide interfaces that both store items in a specified order and provide random access, both of which are necessary in order to implement eviction properly. Specifically, eBPF provides maps such as BPF_MAP_TYPE_QUEUE and BPF_MAP_TYPE_STACK, which provide pop() and push() operations, but do not allow deleting or accessing elements from the "middle" of the map. Conversely, BPF_MAP_TYPE_HASH provides random access, but no method to easily maintain an ordering of elements (e.g., MRU order). A notable exception is BPF_MAP_TYPE_LRU_HASH, which provides both an LRU structure and random access, but is too deeply tied to its specific algorithm for our purposes [22]. This necessitated the development of a custom data structure for cachebpf. While eBPF has started to introduce experimental support for custom data structures and more complex locking, this support is not yet mature enough for our use case [2, 20, 25]. As such, we designed our list API to be managed by the kernel and exposed to eBPF via kfuncs. Additionally, in order to avoid concurrency issues and verifier limitations around locking, the provided API is concurrency-safe and makes use of locks under the hood, in the kernel implementations. As eBPF matures, new features could further reduce overhead and provide even more flexibility for eBPF policies.
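Continuing the FIFO sketch above, the candidate interface in §4.2.3 can be exercised with a scoring callback that prefers the oldest folios. Again, this is our illustration: the map layout, the contents of iter_opts, and the callback's exact return-value convention are simplified assumptions.

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1 << 20);
        __type(key, struct folio *);
        __type(value, u64);                  /* insertion sequence number */
    } seq_map SEC(".maps");

    static u64 next_seq;

    SEC("struct_ops/folio_added")
    void BPF_PROG(fifo_seq_added, struct folio *folio)
    {
        u64 seq = __sync_fetch_and_add(&next_seq, 1);
        bpf_map_update_elem(&seq_map, &folio, &seq, BPF_ANY);
        cachebpf_list_add(fifo_list, folio, true);
    }

    /* Callback run on each of N evaluated nodes; lower score = evicted sooner,
     * so older folios (smaller sequence numbers) are proposed first. */
    static s64 score_by_age(int id, struct folio *folio)
    {
        u64 *seq = bpf_map_lookup_elem(&seq_map, &folio);
        return seq ? (s64)*seq : 0;
    }

    SEC("struct_ops/evict_folios")
    void BPF_PROG(fifo_seq_evict, struct eviction_ctx *ctx, struct mem_cgroup *memcg)
    {
        struct iter_opts opts = { /* select batch-scoring mode here (assumed) */ };
        cachebpf_list_iterate(memcg, fifo_list, score_by_age, &opts, ctx);
    }

The kernel then validates the proposed candidates and performs the actual eviction, as described above.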
4.2.5 Example: LFU Policy

To get a better sense of how cachebpf's policy functions can be used to implement custom policies, we walk through implementing a simple eviction policy, LFU, using cachebpf. LFU evicts the least-frequently accessed item in the list, which requires storing additional metadata. Our LFU implementation uses a single list and an eBPF map to store folio access frequencies. It approximates LFU using cachebpf's batch scoring mode to select the C (e.g., 32) least-frequently accessed folios out of N (e.g., 512) folios.

    u64 lfu_list;

    int lfu_policy_init(struct mem_cgroup *cg) {
        lfu_list = list_create(cg);
        return 0;
    }

    void lfu_folio_added(struct folio *folio) {
        u64 freq = 1;
        list_add(lfu_list, folio, true); // Add to tail
        bpf_map_update_elem(&freq_map, &folio, &freq, BPF_ANY);
    }

    void lfu_folio_accessed(struct folio *folio) {
        u64 *freq = bpf_map_lookup_elem(&freq_map, &folio);
        if (freq)
            __sync_fetch_and_add(freq, 1); // Increment freq
    }

    long score_lfu(int id, struct folio *folio) {
        u64 *freq = bpf_map_lookup_elem(&freq_map, &folio);
        return freq ? *freq : 0; // Lowest scores are evicted first
    }

    void lfu_evict_folios(struct eviction_ctx *ctx, struct mem_cgroup *cg) {
        struct iter_opts opts = { /* Set scoring mode */ };
        list_iterate(cg, lfu_list, score_lfu, &opts, ctx);
    }

    void lfu_folio_removed(struct folio *folio) {
        bpf_map_delete_elem(&freq_map, &folio);
    }

Figure 4: Simplified LFU implementation with cachebpf.

A simplified version of the policy is shown in Figure 4. When the policy is loaded and lfu_policy_init() is called, we create a new eviction list. When a folio is added, lfu_folio_added() adds the folio to the tail of the list using list_add() and initializes its frequency to 1 in the freq_map eBPF map (not shown). When a folio is accessed, we increment its frequency. When eviction is triggered, lfu_evict_folios() calls list_iterate(), which calls the score_lfu() callback function on N nodes in the list. The score function returns the frequency of each folio as its score. cachebpf then selects the C folios with the lowest scores as eviction candidates. When a folio is evicted by the kernel, lfu_folio_removed() is called, and the folio's metadata is removed from the map. Additionally, it is not necessary to remove the folio from the list on eviction, as this is taken care of by cachebpf. We discuss this point further in §4.4.

4.3 Isolation

We now tackle the third challenge from §3: how to allow applications to deploy their own policy functions without interfering with other applications' policies, while preserving the sharing property of the page cache, whereby applications can avoid having to load duplicate pages into memory.

We make the observation that implementing policies within a cgroup can address this challenge. This is due to the fact that within a cgroup, processes have the same custom eviction policy, and different cgroups running on the same server can each use their own eviction policy. In addition, deploying policy functions per-cgroup fits the common pattern of deploying modern applications via containers, which isolate each application in its own memory cgroup. Note that processes from cgroup A can still access page cache memory managed by cgroup B, and benefit from accessing shared data.

To support per-cgroup policies, we extend eBPF's struct_ops functionality to support cgroup-specific struct_ops (currently, it only supports system-wide policies). This involved adding a cgroup identifier (in the form of a file descriptor) to the kernel's struct_ops loading interface, along with corresponding libbpf interfaces in userspace.
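A userspace loader for a per-cgroup policy could then look roughly like the following. Note that the cgroup-scoped attach call shown here is hypothetical: it stands in for the libbpf extension described above, which is not part of upstream libbpf, and the skeleton name and cgroup path are examples.

    #include <fcntl.h>
    #include <stdio.h>
    #include "fifo_policy.skel.h"        /* generated by bpftool gen skeleton */

    int main(void)
    {
        struct fifo_policy *skel = fifo_policy__open_and_load();
        if (!skel) { fprintf(stderr, "failed to load policy\n"); return 1; }

        /* The target cgroup is identified by a file descriptor, as described above. */
        int cg_fd = open("/sys/fs/cgroup/mydb", O_RDONLY);   /* example path */
        if (cg_fd < 0) { perror("open cgroup"); return 1; }

        /* Hypothetical cgroup-scoped attach, standing in for the authors'
         * libbpf extension for per-cgroup struct_ops. */
        struct bpf_link *link =
            cachebpf_attach_struct_ops(skel->maps.fifo_ops, cg_fd);
        if (!link) { fprintf(stderr, "failed to attach policy\n"); return 1; }

        /* ... run the workload; destroying the link detaches the policy ... */
        return 0;
    }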
4.4 Memory Safety

We must ensure that cachebpf does not allow unsafe memory accesses when interacting with folio pointers (challenge 4 from §3). Specifically, cachebpf must ensure that eBPF programs return valid pointers to the kernel (i.e., in the eviction candidate interface). Otherwise, a malicious eBPF program could return invalid values, leading to memory corruption or a kernel crash. Frameworks like sched_ext solve this in part by using PIDs as identifiers for processes in userspace dispatch. However, folios do not have analogous easily-obtainable unique identifiers, so we resort to using folio pointers.

In order to validate these pointers, we implement a registry of "valid folios" in the system. When a folio enters the page cache, cachebpf adds it to the registry. When a folio is evicted, it is removed from the registry. When a cachebpf policy proposes a set of folio eviction candidates, the kernel uses the registry to verify that each candidate is indeed a valid folio before proceeding with eviction. This registry is implemented as a hash table with a per-bucket lock, which also stores each folio's list node (as described in §4.2.2), mapping from folio pointer to list node. We find that this design incurs minimal overhead, which we evaluate in §5.3. Future developments in eBPF may make it easier to keep track of "trusted" pointers, potentially allowing us to remove this check and further reduce overhead.

We also protect against adversarial behavior by providing a fallback for eviction. For example, if the kernel asks a faulty policy to evict 10 folios, but it only proposes 5 candidates, the kernel will fall back to its default policy and evict additional folios. Similarly, when a folio is evicted, the kernel ensures that it is removed from any eviction list it is present in, in order to release memory resources and minimize stale references lying around. Similar fallbacks are present in other frameworks, such as sched_ext, which implements a watchdog that forcibly removes misbehaving policies.

4.5 Kernel Implementation Complexity

Implementing cachebpf required adding ∼2000 lines to the kernel. Only a fraction of these lines touched core kernel code: 210 lines in the page cache (most of which are the new eBPF hooks), 80 lines in the verifier (supporting our callback functions), and 80 lines in cgroup code. Additionally, implementing per-cgroup struct_ops required 220 lines in the kernel and 75 lines in libbpf. The remaining lines implemented pure cachebpf functionality: 750 lines for cachebpf's eviction list kfuncs, and 580 lines to implement registry operations and register cachebpf with the verifier.

5 Evaluation

We aim to answer the following questions:

Q1: Is cachebpf flexible enough to implement a variety of eviction policies? Can cachebpf policies improve application performance with low developer effort? (§5.1)
Q2: Can different applications use different policies without interfering with each other? (§5.2)
Q3: What is the overhead of cachebpf? (§5.3)

System configuration. We conduct our experiments on CloudLab [24] c6525-25g machines, with a 16-core AMD Rome CPU, 128GB of memory, and a 480GB SSD drive. We use CPU pinning and disable SMT, swap, and address space randomization to make our results more reproducible. We also drop the page cache before each test. We run Ubuntu 22.04 with Linux v6.6.8 as the kernel.
5.1 Case Studies: Custom Policies (Q1)

In this section, we describe several case studies on how applications can utilize cachebpf to achieve better performance. First, we show how cachebpf can be used to implement a wide range of eviction policies, tailored to different applications: from simple "classic" policies (MRU and LFU) to state-of-the-art policies such as LHD [5], which uses conditional probabilities to model different page features (e.g., age, frequency), and S3-FIFO [70]. We then explore how an application can make its eviction policy aware of application-specific information, such as assigning different priorities to specific types of requests.

5.1.1 Most-Recently Used (MRU)

Scan-like workloads do not perform well under LRU-like policies when the scan length is larger than the size of the LRU list. For example, one such scan-heavy workload is searching through files in a codebase. Consider a developer working on a large codebase, such as the Linux kernel, and continuously searching for certain terms. In such a scenario, an LRU-like policy would evict the files that were least-recently searched, but those are precisely the files that are required at the start of the next search. While readahead can help mitigate this issue for single-file scans by prefetching sequential blocks, it cannot predict which blocks to fetch across files.

We use cachebpf to develop an MRU eviction policy for this use case. In contrast to LRU, MRU will evict the folios that were most recently searched, which will be needed again furthest in the future. In order to facilitate this, our policy adds folios to the head of the list on insertion, and moves them back to the head on access. No metadata is required, as nodes are stored in the correct order in the eviction list. In a simplified version of the policy, when eviction is triggered, the first 32 nodes in the list are selected as eviction candidates using cachebpf's iterate interface. However, if the policy decides to evict folios right after they were added to the page cache, they may still be in use by the kernel to service the I/O request. This would lead to the kernel refusing to evict the folios and resorting to the fallback path to evict folios. Therefore, we skip a small fixed number of folios when iterating the eviction list before proposing eviction candidates.
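A sketch of this MRU policy in terms of the eviction-list API is shown below. This is our illustration, not cachebpf's released code; the kfunc externs match the earlier FIFO sketch, and the iter_opts field used to skip the most recently inserted folios, as well as the callback's return-value convention, are assumptions.

    u64 mru_list;   /* kfunc externs as in the earlier sketch omitted */

    SEC("struct_ops/policy_init")
    s32 BPF_PROG(mru_init, struct mem_cgroup *memcg)
    {
        mru_list = cachebpf_list_create(memcg);
        return 0;
    }

    SEC("struct_ops/folio_added")
    void BPF_PROG(mru_added, struct folio *folio)
    {
        cachebpf_list_add(mru_list, folio, false);    /* false = insert at head */
    }

    SEC("struct_ops/folio_accessed")
    void BPF_PROG(mru_accessed, struct folio *folio)
    {
        cachebpf_list_move(mru_list, folio, false);   /* move back to the head */
    }

    /* Visited folios are proposed in list order; assume a non-zero return
     * marks the folio for eviction (the real convention may differ). */
    static s64 propose_mru(int id, struct folio *folio)
    {
        return 1;
    }

    SEC("struct_ops/evict_folios")
    void BPF_PROG(mru_evict, struct eviction_ctx *ctx, struct mem_cgroup *memcg)
    {
        struct iter_opts opts = {
            /* skip a handful of just-inserted folios that may still be held
             * by in-flight I/O, as described above (field name assumed) */
        };
        cachebpf_list_iterate(memcg, mru_list, propose_mru, &opts, ctx);
    }

Because the list itself encodes recency, no per-folio metadata map is needed for this policy.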
Evaluation. To evaluate the policy, we construct a file search workload that searches the Linux kernel codebase (v6.6), using the multi-threaded ripgrep CLI tool [32]. More specifically, we perform 10 searches within a 1GiB cgroup, which is roughly 70% the size of the codebase (excluding Git history). We compare the cachebpf MRU policy with the default Linux kernel policy as well as the kernel's experimental MGLRU policy. The results in Figure 5 show that cachebpf is almost 2× faster than both baseline and MGLRU, since both policies suffer from the scan "pathology" of LRU.

[Figure 5: File search workload results (MRU policy). Runtime (s) for Default (Linux), MGLRU (Linux), and MRU (cachebpf).]

5.1.2 Least-Frequently Used (LFU)

An additional disadvantage of LRU-like algorithms is that they only take into account a single feature (recency) when making eviction decisions. However, other features, such as access frequency, may be better suited for certain workloads, especially skewed workloads where the access distribution is static or slow-changing. One such workload is the popular YCSB benchmark, made to model cloud OLTP applications, in which the probability of each key being requested is drawn independently from a static distribution (by default, Zipfian).

We use cachebpf to implement an LFU policy, which takes access frequency into account when evicting folios, by evicting those with the lowest access frequency among the eviction candidates. Our LFU policy is approximate: it does not evict the globally least-frequently used folio, but only the least-frequently used folios among the current batch considered for eviction. An exact policy would either yield higher overhead or require more complex data structures, which eBPF does not yet support. We described the implementation of our LFU policy in §4.2.5.

Evaluation. We evaluate our LFU policy by running LevelDB [33], a popular key-value store, on the YCSB (Zipfian) workloads, as well as against uniform and uniform-read-write workloads. We compare this custom policy against both the default and MGLRU Linux policies, using a 100GiB database with a 10GiB cgroup. Our results in Figure 6 show that cachebpf's LFU policy outperforms both the default and MGLRU policies for all the YCSB Zipfian workloads and the uniform workloads, except for YCSB D, which only uses the latest key-value pairs and as such is cached entirely in memory. cachebpf achieves up to 37% better throughput than the default Linux policy, and interestingly, it outperforms MGLRU by an even greater margin. We also measure the P99 read latency, for which cachebpf beats the default policy by up to 55%. Note that YCSB D's tail latency barely registers in the figure due to its lack of disk accesses. We also evaluated the YCSB workload with our other policies, but LFU outperformed those policies as well, so we omit those results.

[Figure 6: YCSB and uniform workload results with LevelDB. Throughput (ops/sec) and P99 read latency for Default (Linux), MGLRU (Linux), and LFU (cachebpf) across YCSB A–F, Unif. (100/0), and Unif. (50/50).]

Takeaway 1: cachebpf can significantly improve application performance even with simple policies (e.g., MRU, LFU) that match the application's access patterns.

5.1.3 S3-FIFO

S3-FIFO [70] is a recent caching policy designed for key-value caches, which uses three FIFO queues to quickly remove "one-hit wonders" (keys that are accessed only once). It has been shown to yield significant throughput gains on a number
of workloads. We implement S3-FIFO using cachebpf and evaluate it on Twitter production cache traces below.

S3-FIFO uses a main FIFO and a small FIFO to hold ∼90% and 10% of the objects, respectively. Upon insertion, objects are typically added to the small FIFO. It uses a ghost FIFO to track recently-evicted objects, in order to promote them quickly to the main FIFO on readmission. The small FIFO is used to filter out short-lived objects, while objects that are accessed more often are promoted from the small FIFO to the main FIFO. The access frequency of the objects is tracked, but is capped at a maximum of 3, in order to ensure that a burst of accesses does not prevent objects from being evicted.

We implement the main and small FIFOs as two eviction lists, and the ghost FIFO as a BPF_MAP_TYPE_LRU_HASH map. The map will then automatically remove entries from the ghost FIFO in LRU order when it hits capacity. When a folio is evicted, we create a ghost entry using a pointer to its struct address_space (which represents a file's contents), along with the folio's offset in the file, as the key. Note that we cannot use folio pointers as the key, as they are not persistent across evictions. While we cannot implement the ghost FIFO as an eviction list (as eviction lists operate on valid folios), it is more performant and simpler to use an existing eBPF map. We find that the combination of existing eBPF features and cachebpf is sufficiently flexible to implement complex eviction policies.

On folio addition, we set its access frequency in an eBPF map, and update it on access. We use eviction candidate requests to evict folios, but also to maintain the 90%-10% ratio between the main and small lists. If the small list is over-represented, we evict from it. We use cachebpf's eviction iteration interface: if a folio's access frequency is greater than 1, we move it to the tail of the main list, balancing the lists. Otherwise, we propose the folio for eviction, and move it to the tail of the small list so that it isn't considered again before it is evicted. When evicting from the main list, we use the iteration interface to find folios with access frequency of 0. If we cannot find enough of those, we search for folios with access frequency of 1, and then 2, and so on. All folios that are considered for eviction have their access frequency decremented and are moved to the tail of the main list.
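The ghost FIFO can be sketched as follows (our illustration; the map size and the CO-RE field accesses are assumptions, and the main/small eviction lists are omitted):

    #include <bpf/bpf_core_read.h>     /* other includes as in the earlier sketches */

    struct ghost_key {
        u64 mapping;   /* pointer to the file's struct address_space */
        u64 index;     /* folio's offset within the file */
    };

    struct {
        __uint(type, BPF_MAP_TYPE_LRU_HASH);   /* oldest ghosts aged out automatically */
        __uint(max_entries, 1 << 18);
        __type(key, struct ghost_key);
        __type(value, u8);
    } ghost_fifo SEC(".maps");

    SEC("struct_ops/folio_removed")
    void BPF_PROG(s3fifo_removed, struct folio *folio)
    {
        struct ghost_key key = {
            .mapping = (u64)BPF_CORE_READ(folio, mapping),
            .index   = BPF_CORE_READ(folio, index),
        };
        u8 one = 1;
        /* Remember the evicted object so a quick readmission can be promoted
         * directly to the main list. */
        bpf_map_update_elem(&ghost_fifo, &key, &one, BPF_ANY);
    }

On admission, the policy would look up the same key and, on a hit, delete the ghost entry and insert the folio directly into the main list.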
5.1.4 Least Hit Density (LHD)

LHD is a relatively sophisticated eviction policy that uses conditional probabilities to predict which objects are most likely to be accessed in the future [5]. LHD uses a metric called hit density in order to determine which objects should be evicted, along with a dynamic ranking approach which allows it to automatically tune its eviction policy over time. We implement LHD using cachebpf, based on the implementation in libcachesim [69, 70, 72].

Our implementation only uses one eviction list. However, folios are divided into classes partially based on when they were last accessed and their age at that time. Each class stores its own statistics (e.g., hits, evictions, hit densities) for different folio ages. Folios are not explicitly "owned" by classes – instead, they use metadata from classes based on which class they most closely correspond to at a given time. When a folio is added, we store its metadata, such as its last access time and age at that time, in an eBPF map. That metadata is updated on folio access, and removed when the folio is evicted.

Folios are selected as eviction candidates based on their hit density (or, more accurately, the hit density of the class and age they most closely correspond to). LHD iterates over the list and selects the folios with the lowest hit density as eviction candidates. While this process is fairly straightforward, it is enabled by accurate computation of hit densities over time. LHD requires periodically "reconfiguring" its hit densities and other statistics in order to ensure that its probability distributions are accurate and aged appropriately over time using an exponentially weighted moving average (EWMA).

This reconfiguration process needs to run every N folio admissions or insertions (where N is a relatively large number, e.g., 2^20). However, reconfiguration is a relatively expensive process, requiring iterating over all of the policy's metadata and adjusting it. In order to avoid performing this operation
in the page cache's insertion or access hot paths, we use an eBPF ring buffer to notify userspace that reconfiguration needs to take place. Userspace then calls an eBPF program of type BPF_PROG_TYPE_SYSCALL, which allows running an eBPF program without attaching it to a specific hook. This program then performs the required reconfiguration, including computing updated hit densities, and scaling or compressing distributions as necessary. We use atomic operations to ensure that the page cache can continue using these values, albeit with some potential inaccuracy, which we permit for the sake of performance. While we could implement this reconfiguration step in userspace, doing so would have required numerous syscalls to interact with eBPF maps, and atomic updates would not have been possible. Additionally, we note that in a standard LHD policy, hit densities and other parameters are stored as floating-point values. However, eBPF does not support floating-point operations, so we resort to scaling values by a large constant in order to approximate such calculations.
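The scaling trick is the standard fixed-point idiom; a generic illustration (with arbitrary constants, not the values used by cachebpf) is:

    #define FP_SCALE   (1ULL << 16)            /* 1.0 is represented as 65536   */
    #define EWMA_DECAY ((9 * FP_SCALE) / 10)   /* keep 90% of the old estimate  */

    /* Hit density ~ hits per unit of age, in fixed point. */
    static u64 fp_hit_density(u64 hits, u64 age_sum)
    {
        return (hits * FP_SCALE) / (age_sum ? age_sum : 1);
    }

    /* Exponentially weighted moving average, also in fixed point. */
    static u64 fp_ewma(u64 old_fp, u64 sample_fp)
    {
        return (old_fp * EWMA_DECAY + sample_fp * (FP_SCALE - EWMA_DECAY)) / FP_SCALE;
    }

Comparisons between scaled values behave like comparisons between the underlying fractions, which is all the eviction ranking requires.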
Evaluation. For many real-world workloads, it may not be obvious in advance which policy works best for a given workload. cachebpf makes experimentation easy, allowing developers to implement a set of policies and empirically choose the best one for each workload. We evaluate our LHD, LFU, and S3-FIFO policies on production traces taken from the Twitter cache workloads [69]. The workloads divide the traces by cluster ID. We compare these policies to Linux's default and MGLRU policies. Each cluster was evaluated with a cgroup size set to 10% of the

[Figure 7: Twitter workload results (LHD, S3-FIFO, and LFU policies) using LevelDB. Throughput (ops/sec) for clusters 17, 18, 24, 34, and 52, comparing Default (Linux), MGLRU (Linux), LHD, S3-FIFO, and LFU (cachebpf). No one policy performs best across the different clusters.]
handle this scenario well, leading to cache pollution due to a large number of SCAN folios.

Using cachebpf, we design a policy that is aware of the different request types. We observe that a folio accessed by a SCAN should not be worth the same as a folio accessed by a GET. To implement prioritization, the policy uses two eviction lists: one for folios inserted by GET requests, and the other for those inserted by SCAN requests. When loading the policy, the application initializes an eBPF map with the PIDs of the SCAN threads. When a folio is inserted, the policy checks whether the PID of the current task is present in the map to determine which eviction list to add the folio to. Each eviction list independently maintains an approximate LFU policy, as described in §5.1.2. When the kernel requests eviction candidates, the policy prioritizes evicting folios from the SCAN list. Figure 8 illustrates this policy.

[Figure 8: The GET/SCAN policy. LevelDB's GET and SCAN thread pools in userspace populate a scan_pids map, which the kernel-side policy reads to route folios into separate GET and SCAN eviction lists.]
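A sketch of the admission side of this policy is shown below (our illustration; the map sizing and list handles are placeholders, and the kfunc externs match the earlier sketches):

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, u32);     /* PID of a SCAN thread, written by the loader */
        __type(value, u8);
    } scan_pids SEC(".maps");

    u64 get_list, scan_list;   /* created in policy_init() */

    SEC("struct_ops/folio_added")
    void BPF_PROG(getscan_added, struct folio *folio)
    {
        /* Lower 32 bits of the helper's return value hold the calling thread's id. */
        u32 tid = (u32)bpf_get_current_pid_tgid();
        bool is_scan = bpf_map_lookup_elem(&scan_pids, &tid) != NULL;

        /* SCAN folios go to their own list, which is drained first on eviction. */
        cachebpf_list_add(is_scan ? scan_list : get_list, folio, true);
    }

The eviction side then mirrors the LFU example in Figure 4, but iterates the SCAN list before the GET list when proposing candidates.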
Evaluation. To evaluate the policy, we compare against Linux's default and MGLRU policies, and various fadvise() options: FADV_DONTNEED, FADV_NOREUSE, and FADV_SEQUENTIAL (on top of the default policy). We apply these fadvise() options to files used by SCAN requests, in order to inform the kernel in advance that we plan to read the files sequentially or only once (SEQUENTIAL and NOREUSE) or that we no longer need the folios after their use (DONTNEED).

[Figure 9: Mixed GET/SCAN workload results. (a) GET throughput (ops/sec) and (b) GET P99 latency (ms) for Default (Linux), FADV_NOREUSE, FADV_DONTNEED, FADV_SEQUENTIAL, MGLRU (Linux), and cachebpf.]

As shown in Figure 9, cachebpf's application-informed policy achieves 1.70× the throughput and 57% lower P99 latency for GET requests, while SCAN requests experience an 18% throughput decrease. In addition, the fadvise() options do not help much, demonstrating the inadequacy of existing kernel page cache interfaces compared to cachebpf. MGLRU performs even worse than the default LRU.

Takeaway 3: Even very simple application-aware eviction policies can significantly improve performance.

Takeaway 4: Existing Linux page cache customization interfaces are ineffective.

5.1.6 Implementation Complexity

Table 3 shows the lines of eBPF and userspace loader code necessary to implement each of the aforementioned policies, along with a simple FIFO policy. The policies are all implemented in at most a few hundred lines of code, a much smaller amount than would be necessary to implement them within the kernel (or even in userspace). We find that cachebpf reduces the complexity of developing new policies, by using the pre-defined list and policy function abstractions, as well as by relying on the kernel's existing page cache to actually store the folios. In addition, developer experience and velocity is greatly improved, since eBPF prevents kernel crashes and many types of bugs, enabling developers to focus on the policy logic. Thus, cachebpf allows developers to accelerate their applications with a relatively modest amount of effort.

Policy      eBPF LoC   Userspace LoC
FIFO        56         118
MRU         101        87
LFU         221        107
S3-FIFO     287        139
GET-SCAN    324        107
LHD         366        152

Table 3: Lines of eBPF and userspace loader code in cachebpf policies.

Additionally, we plan to open source all of our policies, allowing developers to easily try them with their applications, lowering the barrier to entry for using cachebpf.

5.2 Isolation (Q2)

The Linux page cache already provides a measure of isolation by giving each cgroup its own set of LRU lists. cachebpf takes advantage of this design by enabling each cgroup to have its own custom policy. We demonstrate that this is a useful capability by simulating and comparing against "global" policies, as opposed to cachebpf's per-cgroup policies. We create two cgroups, one running a YCSB C workload with LevelDB, and the other running a file search workload with ripgrep. The YCSB cgroup is allocated 10GiB and the file search cgroup is allocated 1GiB. We run these workloads under four configurations: both cgroups using the default policy, both using LFU, both using MRU, and a "tailored" setup: YCSB with LFU and file search with MRU.
[Figure 10: Throughput (ops/sec) vs. file search iterations for the Default (Linux), LFU, MRU, and tailored (cachebpf) configurations. Using cachebpf, multiple applications can run different eviction policies, yielding better performance for all.]

Evaluation. Figure 10 shows that the tailored setup beats the other three configurations, yielding 49.8% and 79.4% improvements for YCSB and file search, respectively, over the baseline configuration. While the other two cachebpf configurations provide performance improvements for the workloads corresponding to their policy, they can significantly degrade the performance of the other workload, demonstrating that global policies are indeed not a viable solution. Note that YCSB improves further in the tailored setup compared to the LFU configuration (and vice versa for file search compared to the MRU configuration). This is due to improved caching of the workloads yielding reduced disk contention. Additionally, the file search workload improves in the LFU configuration for the same reason.

Takeaway 5: Using cachebpf with per-cgroup policies allows for fine-grained control and improved performance.

5.3 Memory and CPU Overhead (Q3)

The advent of faster and larger storage devices means that the page cache (and cachebpf) must be able to handle millions of events per second. We run a number of micro-benchmarks to investigate cachebpf's memory and CPU overhead.

5.3.1 Memory Overhead

cachebpf's primary memory usage is the valid folios registry hash table (§4.4). In the worst case, we set up the hash table with as many buckets as there are 4KiB pages in the cgroup (based on its configured size). Each bucket requires 16 bytes to store the hash table's internal list pointers. Thus, the memory overhead for an empty registry is:

    (cgroup_size / page_size × 16) / cgroup_size = 0.4%

Each filled entry in the hash table uses 32 bytes for the cachebpf list node. The full registry memory overhead is:

    (cgroup_size / page_size × (16 + 32)) / cgroup_size = 1.2%

Therefore, the memory overhead for cachebpf's registry is between 0.4% and 1.2% of a policy's cgroup's memory. We believe that this overhead could be significantly reduced with recent improvements to eBPF's handling of kernel objects, allowing eBPF to directly ensure that some pointers are trusted.
5.3.2 CPU Overhead

To measure the CPU overhead, we run the fio micro-benchmark [3] with 8 threads on a randread workload and record the IOPS and CPU usage. We do this for both the default Linux policy and a cachebpf no-op policy, meaning that it uses the default eviction policy while still maintaining cachebpf data structures. Then, we calculate and compare the CPU usage per I/O operation (measured in µCPUs, i.e., one-millionth of a CPU). Table 4 shows that the CPU overhead of cachebpf is at most 1.7%.

Cgroup Size   Baseline   cachebpf   Overhead (%)
5 GiB         234.80     236.51     0.72%
10 GiB        217.48     221.14     1.66%
30 GiB        197.67     198.01     0.17%

Table 4: cachebpf µCPU usage per I/O operation using fio.

Takeaway 6: cachebpf incurs relatively low CPU and memory overhead, while improving performance.

6 Related Work

There have been three predominant approaches to allow applications to customize the page cache. The first approach, which was explored in the 80's and 90's, was to design clean-slate extensible kernels [1, 8, 28, 60], which allow applications to customize kernel interfaces and policies. For example, VINO [60, 62] and SPIN [8] allow applications to customize buffer cache eviction, admission, and prefetching policies. These OS designs never achieved widespread use, even though some of their underlying ideas have become relevant again with the adoption of eBPF, which enables extensibility within monolithic kernels like Linux or Windows.

The second approach, introduced in the 90's, is to design customizable file systems, which allow applications to customize the page cache. ACFS [10, 11] is an application-controlled file system which enables customizing caching and prefetching. The XN [41] libOS file system enables running a userspace-level file system within the exokernel OS, which can be fully customized. More recent work in this vein is Bento [52], which allows custom file systems written in Rust to be installed in the kernel, without disrupting applications. None of these approaches would work with existing Linux or legacy file systems.

The third approach is for applications to simply implement their own userspace cache, with the option of bypassing the OS page cache with direct I/O. There are many examples of data systems that implement a userspace cache [14, 48, 65]. TriCache [30] is a recent framework that helps applications customize their own userspace caches. Nonetheless, many popular data systems still rely on the page cache, sometimes in conjunction with userspace caches [21, 34, 54, 58, 65].

There has been more recent work on customizing memory management policies using eBPF, such as huge page placement, page fault handling, and page table designs [55, 61, 77]. Most relevant to cachebpf, FetchBPF allows customizing Linux's memory prefetching policy, and could easily be integrated into cachebpf as an additional hook [12]. P2Cache
is conceptually similar to cachebpf, but only allows LRU or MRU ordering and changes the global page cache policy [47]. Additionally, the P2Cache paper is a work-in-progress workshop paper, is closed-source, and does not contain many details about its design, implementation, or evaluation.

7 Conclusion

This work explores the design of a new eBPF framework to implement custom eviction policies in the kernel, enabling applications to choose a policy according to their needs and making the latest caching research accessible to the kernel. We believe there are significant future research challenges in this area, such as exploring ML-based eviction algorithms and integrating more parts of the page cache into the cachebpf framework (e.g., writeback and prefetching). Furthermore, we are aware of efforts in the eBPF verifier to support more complex data structures and believe that cachebpf could benefit from these efforts.

References

[4] Nidhi Bansal. An overview of caching for PostgreSQL, 2020. https://severalnines.com/blog/overview-caching-postgresql/.

[5] Nathan Beckmann, Haoxian Chen, and Asaf Cidon. LHD: Improving cache hit rate by maximizing hit density. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pages 389–403, Renton, WA, April 2018. USENIX Association.

[6] Daniel S. Berger, Benjamin Berg, Timothy Zhu, Siddhartha Sen, and Mor Harchol-Balter. RobinHood: Tail latency aware caching – dynamic reallocation from cache-rich to cache-poor. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 195–212, Carlsbad, CA, October 2018. USENIX Association.

[7] Daniel S. Berger, Ramesh K. Sitaraman, and Mor Harchol-Balter. AdaptSize: Orchestrating the hot object memory cache in a content delivery network. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 483–498, Boston, MA, March 2017. USENIX Association.

[8] Brian N. Bershad, Stefan Savage, Przemyslaw Pardyak, Emin Gün Sirer, Marc E. Fiuczynski, David Becker, Craig Chambers, and Susan Eggers. Extensibility, safety and performance in the SPIN operating system. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, pages 267–283, 1995.

[14] Alexander Conway, Abhishek Gupta, Vijay Chidambaram, Martin Farach-Colton, Richard Spillane, Amy Tai, and Rob Johnson. SplinterDB: Closing the bandwidth gap for NVMe key-value stores. In 2020 USENIX Annual Technical Conference (USENIX ATC 20), pages 49–63. USENIX Association, July 2020.

[15] Jonathan Corbet. The extensible scheduler class. https://lwn.net/Articles/922405/.
[16] Jonathan Corbet. Clarifying memory management with page folios, 2021. https://lwn.net/Articles/849538/.
[17] Jonathan Corbet. The multi-generational LRU, 2021. https://lwn.net/Articles/851184/.
[18] Jonathan Corbet. BPF for HID drivers, 2022. https://lwn.net/Articles/909109/.
[19] Jonathan Corbet. Merging the multi-generational LRU, 2022. https://lwn.net/Articles/894859/.
[20] Jonathan Corbet. Red-black trees for BPF programs, 2023. https://lwn.net/Articles/924128/.
[21] Yifan Dai, Jing Liu, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau. Symbiosis: The art of application and kernel cache cooperation. In 22nd USENIX Conference on File and Storage Technologies (FAST 24), pages 51–69, Santa Clara, CA, February 2024. USENIX Association.
[22] Kernel development community. BPF_MAP_TYPE_HASH, with PERCPU and LRU variants. https://docs.kernel.org/bpf/map_hash.html.
[23] Western Digital. Western Digital PC SN8000S NVMe SSD, 2024. https://documents.westerndigital.com/content/dam/doc-library/en_us/assets/public/western-digital/product/internal-drives/pc-sn8000s-nvme-ssd/data-sheet-pc-sn8000s-nvme-ssd.pdf.
[24] Dmitry Duplyakin, Robert Ricci, Aleksander Maricq, Gary Wong, Jonathon Duerig, Eric Eide, Leigh Stoller, Mike Hibler, David Johnson, Kirk Webb, Aditya Akella, Kuangching Wang, Glenn Ricart, Larry Landweber, Chip Elliott, Michael Zink, Emmanuel Cecchet, Snigdhaswin Kar, and Prabodh Mishra. The design and operation of CloudLab. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 1–14, Renton, WA, July 2019. USENIX Association.
[25] Kumar Kartikeya Dwivedi, Rishabh Iyer, and Sanidhya Kashyap. Fast, flexible, and practical kernel extensions. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, SOSP ’24, page 249–264, New York, NY, USA, 2024. Association for Computing Machinery.
[26] eBPF.io authors. eBPF. https://ebpf.io/.
[27] Jake Edge. The FUSE BPF filesystem, 2023. https://lwn.net/Articles/937433/.
[28] Dawson R. Engler, M. Frans Kaashoek, and James O’Toole Jr. Exokernel: An operating system architecture for application-level resource management. ACM SIGOPS Operating Systems Review, 29(5):251–266, 1995.
[29] Facebook. Memory usage in RocksDB, 2024. https://github.com/facebook/rocksdb/wiki/memory-usage-in-rocksdb.
[30] Guanyu Feng, Huanqi Cao, Xiaowei Zhu, Bowen Yu, Yuanwei Wang, Zixuan Ma, Shengqi Chen, and Wenguang Chen. TriCache: A user-transparent block cache enabling high-performance out-of-core processing with in-memory programs. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 395–411, Carlsbad, CA, July 2022. USENIX Association.
[31] Wu Fengguang, Xi Hongsheng, and Xu Chenfeng. On the design of a new Linux readahead framework. SIGOPS Oper. Syst. Rev., 42(5):75–84, July 2008.
[32] Andrew Gallant. ripgrep is faster than {grep, ag, git grep, ucg, pt, sift}, 2016. https://blog.burntsushi.net/ripgrep/.
[33] Google. LevelDB. https://github.com/google/leveldb/.
[34] The PostgreSQL Global Development Group. PostgreSQL: The world’s most advanced open source relational database. https://www.postgresql.org/.
[35] Tejun Heo. cgroup-v2. https://docs.kernel.org/admin-guide/cgroup-v2.html.
[36] Toke Høiland-Jørgensen, Jesper Dangaard Brouer, Daniel Borkmann, John Fastabend, Tom Herbert, David Ahern, and David Miller. The express data path: fast programmable packet processing in the operating system kernel. In Proceedings of the 14th International Conference on Emerging Networking EXperiments and Technologies, CoNEXT ’18, page 54–66, New York, NY, USA, 2018. Association for Computing Machinery.
[37] Claire Huang, Stephen Blackburn, and Zixian Cai. Improving garbage collection observability with performance tracing. In Proceedings of the 20th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes, MPLR 2023, page 85–99, New York, NY, USA, 2023. Association for Computing Machinery.
[38] Jack Tigar Humphries, Neel Natu, Ashwin Chaugule, Ofir Weisse, Barret Rhoden, Josh Don, Luigi Rizzo, Oleg Rombakh, Paul Turner, and Christos Kozyrakis. GhOSt: Fast & flexible user-space delegation of Linux scheduling. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, SOSP ’21, page 588–604, New York, NY, USA, 2021. Association for Computing Machinery.
[39] Jinghao Jia, YiFei Zhu, Dan Williams, Andrea Arcangeli, Claudio Canella, Hubertus Franke, Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Daniel Gruss, and Tianyin Xu. Programmable system call security with eBPF, 2023.
[40] Song Jiang and Xiaodong Zhang. LIRS: an efficient low inter-reference recency set replacement policy to improve buffer cache performance. In Proceedings of the 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’02, page 31–42, New York, NY, USA, 2002. Association for Computing Machinery.
[41] M. Frans Kaashoek, Dawson R. Engler, Gregory R. Ganger, Hector M. Briceño, Russell Hunt, David Mazières, Thomas Pinckney, Robert Grimm, John Jannotti, and Kenneth Mackenzie. Application Performance and Flexibility on Exokernel Systems. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, SOSP ’97, page 52–65, New York, NY, USA, 1997. Association for Computing Machinery.
[42] Kostis Kaffes, Jack Tigar Humphries, David Mazières, and Christos Kozyrakis. Syrup: User-defined scheduling across the stack. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, SOSP ’21, page 605–620, New York, NY, USA, 2021. Association for Computing Machinery.
[43] Ramakrishna Karedla, J. Spencer Love, and Bradley G. Wherry. Caching strategies to improve disk system performance. Computer, 27(3):38–46, 1994.
[44] The kernel development community. bpf-ringbuf. https://www.kernel.org/doc/html/next/bpf/ringbuf.html.
[45] Martin Lau. BPF extensible network, 2020. https://lpc.events/event/7/contributions/687/attachments/537/1262/BPF_network_tcp-cc-hdr-sk-stg_LPC_2020.pdf.
[46] Martin KaFai Lau. struct-ops, 2020. https://lwn.net/Articles/809092/.
[47] Dusol Lee, Inhyuk Choi, Chanyoung Lee, Sungjin Lee, and Jihong Kim. P2Cache: An application-directed page cache for improving performance of data-intensive applications. In Proceedings of the 15th ACM Workshop on Hot Topics in Storage and File Systems, HotStorage ’23, page 31–36, New York, NY, USA, 2023. Association for Computing Machinery.
[48] Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Hariharan Gopalakrishnan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. WiscKey: Separating keys from values in SSD-conscious storage. ACM Trans. Storage, 13(1), March 2017.
[49] Linux man pages project. madvise – Linux manual page, 2024. https://man7.org/linux/man-pages/man2/madvise.2.html.
[50] N. Megiddo and D.S. Modha. Outperforming LRU with an adaptive replacement cache algorithm. Computer, 37(4):58–65, 2004.
[51] Sebastiano Miano, Matteo Bertrone, Fulvio Risso, Mauricio Vásquez Bernal, Yunsong Lu, and Jianwen Pi. Securing Linux with a faster and scalable iptables. SIGCOMM Comput. Commun. Rev., 49(3):2–17, November 2019.
[52] Samantha Miller, Kaiyuan Zhang, Mengqi Chen, Ryan Jennings, Ang Chen, Danyang Zhuo, and Thomas Anderson. High velocity kernel file systems with Bento. In 19th USENIX Conference on File and Storage Technologies (FAST 21), pages 65–79. USENIX Association, February 2021.
[53] Milvus. Milvus chunk cache. https://milvus.io/docs/chunk_cache.md.
[54] MongoDB. WiredTiger storage engine. https://www.mongodb.com/docs/manual/core/wiredtiger/.
[55] Konstantinos Mores, Stratos Psomadakis, and Georgios Goumas. ebpf-mm: Userspace-guided memory management in Linux with eBPF, 2024.
[56] MySQL. InnoDB Buffer Pool Optimization, 2024. https://dev.mysql.com/doc/refman/8.4/en/innodb-buffer-pool-optimization.html.
[57] Elizabeth J. O’Neil, Patrick E. O’Neil, and Gerhard Weikum. The LRU-K page replacement algorithm for database disk buffering. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, SIGMOD ’93, page 297–306, New York, NY, USA, 1993. Association for Computing Machinery.
[58] Yingjin Qian, Marc-André Vef, Patrick Farrell, Andreas Dilger, Xi Li, Shuichi Ihara, Yinjin Fu, Wei Xue, and Andre Brinkmann. Combining buffered I/O and direct I/O in distributed file systems. In 22nd USENIX Conference on File and Storage Technologies (FAST 24), pages 17–33, Santa Clara, CA, February 2024. USENIX Association.
[59] John T. Robinson and Murthy V. Devarakonda. Data cache management using frequency-based replacement. SIGMETRICS Perform. Eval. Rev., 18(1):134–142, April 1990.
[60] Margo I. Seltzer, Yasuhiro Endo, Christopher Small, and Keith A. Smith. Dealing with disaster: Surviving misbehaved kernel extensions. In Proceedings of the Second USENIX Symposium on Operating Systems Design and Implementation, OSDI ’96, page 213–227, New York, NY, USA, 1996. Association for Computing Machinery.
[61] Dimitrios Skarlatos and Kaiyang Zhao. Towards programmable memory management with eBPF, 2024. https://lpc.events/event/18/contributions/1932/attachments/1646/3414/Towards%20Programmable%20Memory%20Management%20with%20eBPF%20(LPC%202024).pdf.
[62] Christopher A. Small and Margo I. Seltzer. Vino: An integrated platform for operating system and database research. Harvard Computer Science Group Technical Report, 1994.
[63] Solidigm. Solidigm D7-PS1030. https://www.solidigm.com/products/data-center/d7/ps1030.html#configurator.
[64] Zhenyu Song, Daniel S. Berger, Kai Li, and Wyatt Lloyd. Learning relaxed Belady for content distribution network caching. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), pages 529–544, Santa Clara, CA, February 2020. USENIX Association.
[65] Meta Open Source. RocksDB. https://rocksdb.org/.
[66] Michael Stonebraker. Operating system support for database management. Commun. ACM, 24(7):412–418, July 1981.
[67] Linpeng Tang, Qi Huang, Wyatt Lloyd, Sanjeev Kumar, and Kai Li. RIPQ: Advanced photo caching on flash for Facebook. In 13th USENIX Conference on File and Storage Technologies (FAST 15), pages 373–386, Santa Clara, CA, February 2015. USENIX Association.
[68] Daniel Lin-Kit Wong, Hao Wu, Carson Molder, Sathya Gunasekar, Jimmy Lu, Snehal Khandkar, Abhinav Sharma, Daniel S. Berger, Nathan Beckmann, and Gregory R. Ganger. Baleen: ML admission & prefetching for flash caches. In 22nd USENIX Conference on File and Storage Technologies (FAST 24), pages 347–371, Santa Clara, CA, February 2024. USENIX Association.
[69] Juncheng Yang, Yao Yue, and K. V. Rashmi. A large scale analysis of hundreds of in-memory cache clusters at Twitter. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 191–208. USENIX Association, November 2020.
[70] Juncheng Yang, Yazhuo Zhang, Ziyue Qiu, Yao Yue, and Rashmi Vinayak. FIFO queues are all you need for cache eviction. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 130–149, New York, NY, USA, 2023. Association for Computing Machinery.
[71] Ioannis Zarkadas, Tal Zussman, Jeremy Carin, Sheng Jiang, Yuhong Zhong, Jonas Pfefferle, Hubertus Franke, Junfeng Yang, Kostis Kaffes, Ryan Stutsman, and Asaf Cidon. BPF-oF: Storage function pushdown over the network, 2023.
[72] Yazhuo Zhang, Juncheng Yang, Yao Yue, Ymir Vigfusson, and K.V. Rashmi. SIEVE is simpler than LRU: an efficient turn-key eviction algorithm for web caches. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 1229–1246, Santa Clara, CA, April 2024. USENIX Association.
[73] Da Zheng, Randal Burns, and Alexander S. Szalay. A parallel page cache: IOPS and caching for multicore systems. In 4th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 12), Boston, MA, June 2012. USENIX Association.
[74] Yuhong Zhong, Haoyu Li, Yu Jian Wu, Ioannis Zarkadas, Jeffrey Tao, Evan Mesterhazy, Michael Makris, Junfeng Yang, Amy Tai, Ryan Stutsman, and Asaf Cidon. XRP: In-kernel storage functions with eBPF. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 375–393, Carlsbad, CA, July 2022. USENIX Association.
[75] Yang Zhou, Zezhou Wang, Sowmya Dharanipragada, and Minlan Yu. Electrode: Accelerating distributed protocols with eBPF. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 1391–1407, Boston, MA, April 2023. USENIX Association.
[76] Yang Zhou, Xingyu Xiang, Matthew Kiley, Sowmya Dharanipragada, and Minlan Yu. DINT: Fast in-kernel distributed transactions with eBPF. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 401–417, Santa Clara, CA, April 2024. USENIX Association.
[77] Tal Zussman, Teng Jiang, and Asaf Cidon. Custom page fault handling with eBPF. In Proceedings of the ACM SIGCOMM 2024 Workshop on eBPF and Kernel Extensions, eBPF ’24, page 71–73, New York, NY, USA, 2024. Association for Computing Machinery.