The SIEVE Algorithm

∗ Equal contribution.
† Corresponding author: Juncheng Yang, [email protected].

Abstract

Cache eviction algorithms are primarily designed for efficiency, i.e., reducing the cache miss ratio. Many eviction algorithms have been designed in the past decades. However, they all trade off throughput, simplicity, or both for higher efficiency. Such a compromise often hinders adoption in production systems.

This work presents SIEVE, an algorithm that is simpler than LRU and provides better than state-of-the-art efficiency and scalability for web cache workloads. We implemented SIEVE in five production cache libraries, requiring fewer than 20 lines of code changes on average. Our evaluation on 1559 cache traces from 7 sources shows that SIEVE achieves up to 63.2% lower miss ratio than ARC. Moreover, SIEVE has a lower miss ratio than 9 state-of-the-art algorithms on more than 45% of the 1559 traces, while the next best algorithm only has a lower miss ratio on 15%. SIEVE's simplicity comes with superior scalability, as cache hits require no locking. Our prototype achieves twice the throughput of an optimized 16-thread LRU implementation. SIEVE is more than an eviction algorithm; it can be used as a cache primitive to build advanced eviction algorithms just like FIFO and LRU.

[Figure 1: SIEVE is simple and efficient. The figure plots the miss ratio reduction from FIFO over 1559 traces from 7 datasets (SIEVE 0.1883, ARC 0.1798, LIRS 0.1774, CLOCK 0.1369, LRU 0.1259) next to code snippets showing how FIFO-Reinsertion and SIEVE find eviction candidates; the SIEVE snippet reads:

    # SIEVE
    obj = hand
    while obj.visited:
        obj.visited = false
        # skip obj, do nothing
        obj = obj.prev
    hand = obj.prev

Minor code changes convert FIFO-Reinsertion to SIEVE, unleashing lower miss ratios than state-of-the-art algorithms.]

1 Introduction

Web caches, such as Content Delivery Networks (CDNs) and key-value caches, are widely deployed in today's digital landscape to reduce user request latency [14, 21, 22, 33, 69, 73, 76, 100], network bandwidth [54, 55, 79, 95], and repeated computation [28, 89, 97, 98]. As a critical component of modern infrastructure, these caches often have a large footprint. For example, Netflix used 18,000 servers for caching over 14 PB of application data in 2021 [68], while Twitter reportedly had 100s of clusters using 100s of TB of DRAM and 100,000s of CPU cores for in-memory caching in 2020 [96].

At the heart of a cache is the eviction algorithm, which plays a crucial role in managing limited cache space. Such algorithms are efficient when they can retain more valuable objects in the cache to achieve a lower miss ratio: the fraction of requested objects that must be fetched from the backend. The quest for high efficiency has spurred a long repertoire of clever algorithms, but most, if not all, trade off simplicity in exchange for efficiency gains. For example, ARC [67], SLRU [55], 2Q [60], and MQ [106] manage multiple least-recently-used (LRU) queues to achieve better efficiency. LHD [16], CACHEUS [75], LRB [79], and GL-Cache [93] use machine learning techniques that further increase system and lookup complexity. Furthermore, many of these algorithms require explicit or implicit parameter tuning to achieve good efficiency on a target workload.

The conventional wisdom among systems operators is that simple is beautiful: simplicity is a key appealing feature for an algorithm to be deployed in production since it commonly correlates with effectiveness, maintainability, scalability, and low overhead. To illustrate, note that most caching systems or libraries in use today, such as ATS [2], Varnish [11], Nginx [7], Redis [9], and groupcache [25], use only FIFO and LRU policies.

We have stumbled upon an easy improvement (Fig. 1) to a decades-old algorithm (FIFO-Reinsertion) that materially improves its efficiency across a wide range of web cache workloads. Instead of moving the to-be-evicted object that
has been accessed to the head of the queue, SIEVE keeps it in its original position. It should be noted that both SIEVE and FIFO-Reinsertion insert new objects at the head of the queue. The new algorithm is called SIEVE¹: a simple and efficient turn-key cache eviction policy. We implemented SIEVE in five production cache libraries, which required fewer than 20 lines of change on average, underscoring the ease of real-world deployment.

¹ SIEVE sifts out unpopular objects from the cache over time (§5).

Despite a simple design, SIEVE can quickly remove unpopular objects from the cache, achieving high efficiency compared to the state-of-the-art algorithms. By experimentally evaluating SIEVE on 1559 traces from five public and two proprietary datasets, we show that SIEVE achieves similar or higher efficiency than 9 state-of-the-art algorithms across traces. Compared to ARC [67], SIEVE reduces the miss ratio by up to 63.2% with a mean of 1.5%². As a comparison, ARC reduces LRU's miss ratio by up to 33.7% with a mean of 6.7%. Moreover, compared to the best of all algorithms, SIEVE has a lower miss ratio on over 45% of the 1559 traces. In comparison, the runner-up algorithm, TwoQ, only outperforms the other algorithms on 15% of the traces.

² Due to the large number of traces, the mean miss ratio reduction looks small.

SIEVE's design eliminates the need for locking during cache hits, resulting in a boost in multi-threaded throughput. Our prototype implementation in Cachelib [37] demonstrates that SIEVE achieves twice the throughput of an optimized LRU implementation when operating with 16 threads.

Through empirical evidence and analysis, we illustrate that SIEVE's efficiency stems from sifting out unpopular objects over time. SIEVE transcends a single standalone algorithm: it can also be embedded within other cache policies to design more advanced algorithms. We demonstrate the idea by replacing the LRU components in ARC, TwoQ, and LeCaR with SIEVE. The SIEVE-supported algorithms significantly outperform the original LRU-based algorithms. For example, ARC-SIEVE reduces ARC's miss ratio by up to 62.5% with a mean reduction of 3.7% across the 1559 traces.

Our work makes the following contributions.
• We present the design of SIEVE: an easy, fast, and surprisingly efficient cache eviction algorithm for web caches.
• We demonstrate SIEVE's simplicity by implementing it in five production cache libraries, changing less than 20 lines of code on average.
• Using 1559 traces from 7 datasets, we show that SIEVE outperforms all state-of-the-art eviction algorithms on more than 45% of the traces.
• We illustrate SIEVE's scalability using our Cachelib-based implementation, which achieves 17% and 125% higher throughput than optimized LRU at 1 and 16 threads, respectively.
• We show how SIEVE, as a turn-key cache primitive, opens new opportunities for designing advanced eviction algorithms, e.g., replacing the LRU in ARC, TwoQ, and LeCaR with SIEVE.

2 Background and Related Work

2.1 Web caches

Web caches are essential components of modern Internet infrastructure, playing a crucial role in reducing data access latency and network bandwidth. Key-value caches, e.g., Memcached [5], Pelikan [8], and Cachelib [37], are widely used in modern web services such as Twitter [97] and Meta [20] to reduce service latency. CDN caches are deployed close to users to reduce data access latency and high WAN bandwidth cost [14, 91, 95, 101].

Cache metrics. Caches are measured along two primary axes: efficiency and throughput performance. Cache efficiency measures how well the cache can store and serve the required data. A cache miss occurs when the requested data is not found in the cache, requiring access to the backend storage to retrieve the data. Common cache efficiency metrics include (1) object miss ratio: the fraction of requests that are cache misses; and (2) byte miss ratio: the fraction of bytes that are cache misses. A lower miss ratio indicates higher cache efficiency, as more requests are served directly from the cache, reducing backend load, access latency, and bandwidth costs.

Throughput performance, on the other hand, is as important as efficiency because the goal of a cache is to serve data quickly and help scale the application. Beyond throughput, scalability is also increasingly important [72, 98] as modern CPUs often surpass 100 cores. Scalability measures throughput growth with the number of threads accessing the cache. A more scalable cache can better harness the many cores in a modern CPU.

Access patterns. Web cache workloads typically follow power-law (generalized Zipfian) distributions [20, 26, 27, 34, 49, 52, 55, 81, 82, 97], where a small subset of objects accounts for a large proportion of requests. In detail, the ith most popular object has a relative frequency of 1/i^α, where α is a parameter that decides the skewness of the workload. Previous works find different α values: from 0.6 to 0.8 [26], 0.56 [49], 0.71–0.76 [51], 0.55–0.9 [20], and 0.6–1.5 [97]. The reasons for the large range of α include (1) the different types of workloads, such as web proxy and in-memory key-value cache workloads; (2) the layer of the cache, noting that many proxy/CDN caches are secondary or tertiary cache layers [55]; and (3) the popularity of the service, with the most popular objects receiving a greater volume of requests in more popular (widely used) web applications. Moreover, web caches often serve constantly growing datasets: new content and objects are created every second.

In contrast, the backend of enterprise storage caches or single-node caches, such as the page cache, often has a fixed size and does not regularly observe new objects. Further, many storage cache workloads have scan and loop patterns [75], in which a range of block addresses is sequentially requested in a short time. Such patterns are rare in web cache workloads according to our observation on 1559 traces from 7 datasets.
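To make the power-law model concrete, the following sketch (our own illustration, not part of the paper's artifact; the function name zipf_workload is hypothetical) samples a request stream in which the ith most popular object has relative frequency proportional to 1/i^α, the same model used for the synthetic workloads in §5.3:

    import numpy as np

    def zipf_workload(num_objects: int, num_requests: int,
                      alpha: float, seed: int = 42) -> np.ndarray:
        """Sample object IDs where the i-th most popular object
        (1-indexed) has relative frequency proportional to 1/i**alpha."""
        rng = np.random.default_rng(seed)
        ranks = np.arange(1, num_objects + 1)
        probs = ranks ** -alpha          # unnormalized Zipfian weights
        probs /= probs.sum()
        return rng.choice(num_objects, size=num_requests, p=probs)

    # Example: a moderately skewed workload (alpha = 1.0).
    requests = zipf_workload(num_objects=10_000,
                             num_requests=1_000_000, alpha=1.0)

Larger α concentrates more requests on the few most popular objects, which is the skewness knob varied in Fig. 9.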
2.2 Cache eviction policies

The cache eviction algorithm, which decides which objects to store in the limited cache space, governs the performance and efficiency of a cache. The field of cache eviction algorithms has a rich literature [12, 17–19, 23, 29, 32, 35, 36, 39, 41, 44–46, 53, 58, 62, 63, 71, 74, 78, 83, 86, 88, 90, 102].

Increasing complexity. Most works on cache eviction algorithms focused on improving efficiency, such as LRU-k [70], TwoQ [60], SLRU [61], GDSF [29], EELRU [77], LRFU [39], LIRS [59], ARC [67], MQ [105], CAR [15], CLOCK-Pro [58], TinyLFU [42, 43], LHD [16], LeCaR [84], LRB [79], CACHEUS [75], GL-Cache [93], and HALP [80]. Over the years, new cache eviction algorithms have gradually become more convoluted. Algorithms from the 1990s use two or more static LRU queues or use different recency metrics; algorithms from the 2000s employ size-adaptive LRU queues or use more complicated recency/frequency metrics; and algorithms from the 2010s and 2020s start to use machine learning to select eviction candidates. Each decade brought greater complexity to cache eviction algorithms. Nevertheless, as we show in §4, while the new algorithms excel on a few specific traces, they do not show a significant improvement (and some are even worse) compared to the traditional ones on a large number of workloads. The combination of limited improvement and high complexity explains why these algorithms have not been used in production systems.

The trouble with complexity. Multiple problems come with increasing complexity. First, complex cache eviction algorithms are difficult to debug due to their intricate logic. For example, we find that two open-source cache simulators used in previous works have two different bugs in the LIRS [59] implementation. Second, complexity may affect efficiency in surprising ways. For example, previous work reports that both LIRS and ARC exhibit Belady's anomaly [50, 85]: the miss ratio increases with the cache size on some workloads. It is worth noting that FIFO, although simple, also suffers from this anomaly. Third, complexity often negatively correlates with throughput performance. A more intricate algorithm performs more computation with potentially longer critical sections, reducing both throughput and scalability. Furthermore, many of these algorithms need to store more per-object metadata, which reduces the effective cache size that can be used for caching data. For example, the per-object metadata required by CACHEUS is 3.3× larger than that of LRU. Fourth, complex algorithms often have parameters that can be difficult to tune. For example, all the machine-learning-based algorithms include many parameters about learning. Although some algorithms do not have explicit parameters, e.g., LIRS, previous work shows that the implicit ghost queue size can impact efficiency [85].

Trade-offs in using simple eviction algorithms. Besides works focusing on improving cache efficiency, several other works have improved cache throughput and scalability. For example, MemC3 [47] uses Cuckoo hashing and CLOCK eviction to improve Memcached's throughput and scalability; MICA [64] uses log-structured storage, data partitioning, and a lossy hash table to improve key-value cache throughput and scalability. Segcache [98] uses segment-structured storage with a FIFO-based eviction algorithm and leverages macro management to improve scalability. Frozenhot [72] improves cache scalability by freezing hot objects in the cache to avoid locking. However, it is crucial to note that while these approaches excel in throughput and scalability, they often compromise on cache efficiency due to the use of simpler, weaker eviction algorithms such as CLOCK³ and FIFO.

³ CLOCK was recently shown to be more efficient than LRU [94].

2.3 Lazy promotion and quick demotion

Promotion and demotion are two cache internal operations used to maintain the logical ordering between objects⁴. Recent work [94] shows that "lazy promotion" and "quick demotion" are two important properties of efficient cache eviction algorithms.

⁴ Note that the terms "promotion" and "demotion" are also commonly used in the context of cache hierarchy. In this case, promotion refers to the process of moving data to a faster device, while demotion involves moving the data to a slower device [65, 87].

Lazy promotion refers to the strategy of promoting cached objects only at eviction time. It aims to retain popular objects with minimal effort. An example of lazy promotion is adding reinsertion to FIFO. In contrast, FIFO has no promotion, and LRU performs eager promotion: moving objects to the head of the queue on every cache hit. Lazy promotion can improve (1) throughput due to less computation and (2) efficiency due to more information about an object at eviction.

Quick demotion removes most objects quickly after they are inserted. Many previous works have discussed this idea in the context of evicting pages from a scan [16, 60, 67, 70, 75, 77]. Recent work also shows that not only storage workloads but also web cache workloads benefit from quick demotion [94] because object popularity follows a power-law distribution, and many objects are unpopular.

To the best of our knowledge, our proposed cache eviction algorithm, which we call SIEVE, is the simplest one that effectively achieves both lazy promotion and quick demotion.

3 Design and Implementation

3.1 SIEVE Design

In this section, we introduce SIEVE, a cache eviction algorithm that achieves both simplicity and efficiency.

Data structure. SIEVE requires only one FIFO queue and one pointer called the "hand". The queue maintains the insertion order between objects. Each object in the queue uses one bit to track its visited/non-visited status. The hand points to the next eviction candidate in the cache and moves from the tail to the head. Note that, unlike existing algorithms, e.g., LRU, FIFO, and CLOCK, in which the eviction candidate is always the tail object, the eviction candidate in SIEVE is an object
somewhere in the queue.

[Figure 2: An illustration of SIEVE, showing insertion at the head, the hand movement to identify the victim, "survived" objects, and newly inserted objects, compared with FIFO-Reinsertion's reinsert-at-head behavior. Note that FIFO-Reinsertion and CLOCK are different implementations of the same algorithm. We use FIFO-Reinsertion in the illustration but will use CLOCK in the rest of the text because it is more commonly used and is shorter.]

Algorithm 1 SIEVE
Input: The request x, doubly-linked queue T, cache size C, hand p
 1: if x is in T then            ▷ Cache hit
 2:     x.visited ← 1
 3: else                         ▷ Cache miss
 4:     if |T| = C then          ▷ Cache full
 5:         o ← p
 6:         if o is NULL then
 7:             o ← tail of T
 8:         while o.visited = 1 do
 9:             o.visited ← 0
10:             o ← o.prev
11:             if o is NULL then
12:                 o ← tail of T
13:         p ← o.prev
14:         Discard o in T       ▷ Eviction
15:     Insert x in the head of T
16:     x.visited ← 0            ▷ Insertion

SIEVE operations. A cache hit in SIEVE changes the visited bit of the accessed object to 1. For a popular object whose visited bit is already 1, SIEVE does not need to perform any operation. During a cache miss, SIEVE examines the object pointed to by the hand. If it has been visited, the visited bit is reset, and the hand moves to the next position (the retained object stays in its original position in the queue). It continues this process until it encounters an object whose visited bit is 0, and it evicts that object. After the eviction, the hand points to the next position (the previous object in the queue). While an evicted object is in the middle of the queue most of the time, a new object is always inserted at the head of the queue. In other words, the new objects and the retained objects are not mixed together.

At first glance, SIEVE is similar to CLOCK/Second Chance/FIFO-Reinsertion⁵. Each algorithm maintains a single queue in which each object is associated with a visited bit to track its access status. Visited objects are retained (also called "survived") during an eviction. Notably, new objects are inserted at the head of the queue in both SIEVE and FIFO-Reinsertion. However, the hand in SIEVE moves from the tail to the head over time, whereas the hand in FIFO-Reinsertion stays at the tail. The key difference is where a retained object is kept. SIEVE keeps it in the old position, while FIFO-Reinsertion inserts it at the head, together with newly inserted objects, as depicted in Fig. 2.

⁵ Note that Second Chance, CLOCK, and FIFO-Reinsertion are different implementations of the same eviction algorithm.

We detail the algorithm in Alg. 1. Line 1 checks whether there is a hit, and if so, line 2 sets the visited bit to one. In the case of a cache miss (line 3), lines 5-12 identify the object to be evicted.

Lazy promotion and quick demotion. Despite a simple design, SIEVE effectively incorporates both lazy promotion and quick demotion. As described in §2.3, an object is only promoted at eviction time in lazy promotion. SIEVE operates in a similar manner. However, rather than promoting the object to the head of the queue, SIEVE keeps the object at its original location. The "survived" objects are generally more popular than the evicted ones; thus, they are likely to be accessed again in the future. By gathering the "survived" objects, the hand in SIEVE can quickly move from the tail to the area near the head, where most objects are newly inserted. These newly inserted objects are quickly examined by the hand of SIEVE after they are admitted into the cache, thus achieving quick demotion. This eviction mechanism lets SIEVE achieve both lazy promotion and quick demotion without adding too much overhead.

The key ingredient of SIEVE is the moving hand, which functions like an adaptive filter that removes unpopular objects from the cache. This mechanism enables SIEVE to strike a balance between finding new popular objects and keeping old popular objects. We discuss this more in §5.

3.2 Implementation

Simulation. We implemented SIEVE in libCacheSim [92]. LibCacheSim is a high-performance cache simulator designed for running cache simulations and analyzing cache traces. It supports many state-of-the-art eviction algorithms, including ARC [67], LIRS [59], CACHEUS [75], LeCaR [84], TwoQ [60], LHD [16], Hyperbolic [24], FIFO-Reinsertion/CLOCK [35], B-LRU (Bloom filter LRU), LRU, LFU, and FIFO. For all state-of-the-art algorithms, we used the configurations from the original papers.

Prototype. Because of SIEVE's simplicity, it can be implemented on top of a FIFO, LRU, or CLOCK cache in just a few lines by adding, initializing, and tracking the "hand" pointer. The object pointed to by the hand is either evicted or retained, depending on whether it has been accessed.

We implemented SIEVE caching in five different open-source cache libraries: Cachelib [20], groupcache [25], mnemonist [6], lru-dict [3], and lru-rs [4]. These represent the most popular cache libraries of five different programming languages: C++, Golang, JavaScript, Python, and Rust. All five of these production cache libraries implement LRU as the eviction algorithm of choice. Aside from mnemonist, which uses arrays, they all use doubly-linked-list-based implementations of LRU. Adapting these LRU implementations to use SIEVE was a low effort, as mentioned earlier.
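For readers who want to trace through Alg. 1, here is a minimal single-threaded Python sketch that we provide for illustration (the class name SieveCache and the cache-aside get/put interface are our own choices, and treating an overwrite of an existing key as a hit is our assumption; Alg. 1 itself only specifies hit and miss handling):

    class Node:
        __slots__ = ("key", "value", "visited", "prev", "next")

        def __init__(self, key, value):
            self.key, self.value = key, value
            self.visited = False
            self.prev = self.next = None

    class SieveCache:
        """Sketch of Alg. 1: one FIFO queue and one "hand" pointer."""

        def __init__(self, capacity: int):
            self.capacity = capacity
            self.table = {}               # key -> Node
            self.head = self.tail = None
            self.hand = None              # next eviction candidate

        def _evict(self):
            obj = self.hand or self.tail  # Alg. 1, lines 5-7
            while obj.visited:            # retained objects stay in place
                obj.visited = False
                obj = obj.prev or self.tail
            self.hand = obj.prev          # Alg. 1, line 13
            # Unlink obj from the doubly-linked queue (Alg. 1, line 14).
            if obj.prev: obj.prev.next = obj.next
            else: self.head = obj.next
            if obj.next: obj.next.prev = obj.prev
            else: self.tail = obj.prev
            del self.table[obj.key]

        def get(self, key):
            node = self.table.get(key)
            if node is None:              # cache miss
                return None
            node.visited = True           # cache hit: only set a bit
            return node.value

        def put(self, key, value):
            node = self.table.get(key)
            if node is not None:          # overwrite: treat as a hit
                node.value, node.visited = value, True
                return
            if len(self.table) == self.capacity:
                self._evict()
            node = Node(key, value)       # new objects enter at the head
            node.next = self.head
            if self.head: self.head.prev = node
            self.head = node
            if self.tail is None: self.tail = node
            self.table[key] = node

Note that get only flips a Boolean and never touches the list, which is why cache hits in SIEVE require no locking and the promotion work of LRU disappears entirely.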
Table 1: Datasets used in this work. CDN 1 and 2 are proprietary, and all others are publicly available.

trace collection     | approx time | # traces | cache type | # requests (million) | # objects (million)
CDN 1                | 2021        | 1273     | object     | 37,460               | 2,652
CDN 2                | 2018        | 219      | object     | 3,728                | 298
Tencent Photo [103]  | 2018        | 2        | object     | 5,650                | 1,038
Wiki CDN [1]         | 2019        | 3        | object     | 2,863                | 56
Twitter KV [97]      | 2020        | 54       | KV         | 195,441              | 10,650
Meta KV [10]         | 2022        | 5        | KV         | 1,644                | 82
Meta CDN [10]        | 2023        | 3        | object     | 231                  | 76

4 Evaluation

In this section, we evaluate SIEVE to answer the following questions.
• Does SIEVE have higher efficiency than state-of-the-art cache eviction algorithms?
• Can SIEVE improve a cache's throughput and scalability?
• Is SIEVE simpler than other algorithms?

4.1 Experimental setup

Workloads. Our experiments use open-source traces from Twitter [97], Meta [10], Wikimedia [1], TencentPhoto [103, 104], and two proprietary CDN datasets. We list the dataset information in Table 1. It consists of 1559 traces that together contain 247,017 million requests to 14,852 million objects. Notably, our research is centered around web traces. We replayed the traces in the simulator and the prototypes as a closed system with instant on-demand fill.

Metrics. The miss ratio serves as a key performance indicator when evaluating the efficiency of a cache system. However, when analyzing different traces (even within the same dataset), the miss ratios can vary significantly, making direct comparisons and visualizations infeasible, as shown in Fig. 3. Therefore, we calculate the miss ratio reduction relative to a baseline method (FIFO in this work): $(mr_{FIFO} - mr_{algo}) / mr_{FIFO}$, where mr stands for miss ratio. If an algorithm's miss ratio is higher than FIFO's, we use $(mr_{FIFO} - mr_{algo}) / mr_{algo}$. This metric has a range between -1 and 1.

We measure throughput in millions of operations per second (Mops) to quantify a cache's performance. To evaluate scalability, we vary the number of trace replay threads from 1 to 16 and measure the throughput.

Testbed. Our evaluations were conducted on Cloudlab [40] and focused on two key aspects: simulation-based efficiency and prototype-based throughput and simplicity.

We used libCacheSim [92], a high-performance cache simulator, to evaluate the efficiency of different cache algorithms. These simulations ran on various node types at either the Clemson or Utah sites, subject to availability.

We evaluate the throughput and simplicity using prototypes, as described in §3.2. The prototype evaluations were conducted on the c6420 node type from the Clemson site. This node type has a dual-socket Intel Gold 6142 running at 2.6 GHz and is equipped with 384 GB DDR4 DRAM. We turned off turbo boost and pinned threads to CPU cores in one NUMA node in our evaluations. We validated the efficiency results from the simulator and prototype using 60 randomly selected traces and found the same conclusion.

4.2 Efficiency results

In this section, we compare the efficiency of different eviction algorithms. Because many caches today use slab-based space management, in which evictions happen among objects of similar sizes, we do not consider object size in this section. The cache sizes are determined as a percentage of the number of objects in a trace. We evaluate eight cache sizes using 1559 traces from the 7 datasets and present two representative cache sizes at 0.1% and 10% of the trace footprint (the number of unique objects in the trace).

Three large datasets: CDN1, CDN2, and Twitter. Fig. 3 shows the miss ratio reduction (from FIFO) of different algorithms across traces. The whiskers on the boxplots are defined using P10 and P90, allowing us to disregard extreme data and concentrate on the typical cases. At the large cache size, SIEVE demonstrates the most significant reductions across nearly all percentiles. For example, SIEVE reduces FIFO's miss ratio by more than 42% on 10% of the traces (top whisker) with a mean of 21% on the CDN1 dataset at the large cache size (Fig. 3a). As a comparison, all other algorithms have smaller reductions on this dataset. For example, CLOCK/FIFO-Reinsertion, which is conceptually similar to SIEVE, can only reduce FIFO's miss ratio by 15% on average. Compared to advanced algorithms, e.g., ARC, SIEVE reduces ARC's miss ratio by up to 63.2% with a mean of 1.5%. We remark that a 1.5% mean miss ratio reduction over this huge number of traces is significant. For example, ARC only reduces LRU's miss ratio by 6.3% on average (not shown). A similar observation can be made on the CDN2 dataset. Although LHD is the best algorithm on the Twitter dataset, SIEVE scores second and outperforms most other state-of-the-art algorithms.

When the cache is very small, TwoQ and LHD sometimes outperform SIEVE. This is because TwoQ and LHD can quickly remove newly-inserted low-value objects, similar to SIEVE. The primary reason for SIEVE's relatively poor performance is that new objects cannot demonstrate their popularity before being evicted when the cache size is very small. A similar problem also happens with ARC and LIRS. ARC's adaptive algorithm sometimes shrinks the recency queue to a very small size and yields a high miss ratio. LIRS, which uses a 1% queue for new objects, suffers the most when the cache size is small, as we see its miss ratio on some traces rise higher than FIFO's. In contrast, TwoQ does not suffer from small cache sizes because it reserves a fixed 25% of the cache space for new objects, preventing overly aggressive demotion. However, we remark that the production miss ratios reported in previous works [13, 55, 97, 98] are close to the miss ratios we observe at the large cache size.
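As a small aside on the metric defined in §4.1: the normalization switches denominators depending on the sign of the change, which is easy to misread, so the sketch below (our own helper with a hypothetical name, assuming mr_fifo > 0) spells it out and shows why the value stays within [-1, 1]:

    def miss_ratio_reduction(mr_algo: float, mr_fifo: float) -> float:
        """Miss ratio reduction relative to FIFO (see §4.1).

        Improvements are normalized by FIFO's miss ratio; regressions
        are normalized by the algorithm's own miss ratio, which bounds
        the metric to [-1, 1]."""
        if mr_algo <= mr_fifo:
            return (mr_fifo - mr_algo) / mr_fifo
        return (mr_fifo - mr_algo) / mr_algo

    # e.g., 0.18 vs. FIFO's 0.24 -> 0.25 (a 25% reduction from FIFO)
    assert abs(miss_ratio_reduction(0.18, 0.24) - 0.25) < 1e-9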
[Figure 3 (six panels of boxplots; y-axis: miss ratio reduction from FIFO; x-axis: the evaluated algorithms, including SIEVE, ARC, TwoQ, LIRS, LHD, CACHEUS, Hyperbolic, TinyLFU, B-LRU, CLOCK, LRU, and LFU): (a) CDN1 workloads, large cache, 1273 traces; (b) CDN2 workloads, large cache, 219 traces; (c) Twitter workloads, large cache, 54 traces; (d) CDN1 workloads, small cache, 1273 traces; (e) CDN2 workloads, small cache, 219 traces; (f) Twitter workloads, small cache, 54 traces.]

Figure 3: Miss ratio reduction from FIFO over all traces in each dataset. The box shows P25 and P75, the whiskers show P10 and P90, and the triangle shows the mean. The large cache uses 10% of the trace footprint, and the small cache uses 0.1% of the trace footprint. SIEVE achieves similar or better miss ratio reduction compared to state-of-the-art algorithms.
[Figure 4 (boxplots; y-axis: miss ratio reduction from FIFO; panels: Meta (KV), Meta (CDN), Tencent, and Wiki, at the large and small cache sizes).]

[Figure 5 (bars; legend: SIEVE, ARC, TwoQ, LHD, Others): the best-performing algorithm across traces at the large and small cache sizes.]
[Figure 6 (two panels; y-axis: throughput in Mops; x-axis: 1, 2, 4, 8, and 16 threads; curves include LRU (optimized), TwoQ (optimized), unoptimized LRU, and SIEVE): (a) Meta KV trace; (b) Twitter trace.]

Figure 6: Throughput scaling with CPU cores on two KV-cache workloads.

Table 2: Lines of code that require modification to add SIEVE to a production cache library.

Cache library    | Language   | Lines
groupcache [25]  | Golang     | 21
mnemonist [6]    | Javascript | 12
lru-rs [4]       | Rust       | 16
lru-dict [3]     | Python + C | 21

Table 3: Lines of code to implement cache hit, eviction, and insertion, and the per-object metadata size required to implement each algorithm in our simulator. We assume that frequency counters and timestamps use 4 bytes and pointers use 8 bytes.

Algorithm  | cache hit | eviction | insertion | metadata size
FIFO       | 1         | 4        | 3         | 16B
LRU        | 5         | 4        | 3         | 16B
ARC        | 64        | 108      | 20        | 17B
LIRS       | 96        | 120      | 64        | 17B
LHD        | 192       | 81       | 64        | 13B
LeCaR      | 72        | 76       | 20        | 40B
CACHEUS    | 168       | 140      | 150       | 54B
TwoQ       | 28        | 16       | 8         | 17B
Hyperbolic | 4         | 20       | 4         | 16B
CLOCK      | 4         | 9        | 3         | 17B
SIEVE      | 4         | 9        | 3         | 17B

4.3 Throughput and scalability

We evaluate throughput and scalability by varying the number of trace replay threads using two production traces from Meta and Twitter. To better emulate real-world deployments, in which the working set size (dataset size) grows with the hardware specs (#cores and DRAM sizes), we scale the cache size and working set size together with the number of threads. To scale the working set size, each thread plays the same trace with the object IDs transformed into a new space. For example, the benchmark sends 4× more requests to a 4× larger cache at 4 threads compared to the single-thread experiment. We set the cache size to 4 × n_thread GB for both traces, which gives miss ratios of 7% (Meta) and 2% (Twitter). We remark that these miss ratios are close to previous reports [13, 98].

The LRU and TwoQ in Cachelib use extensive optimizations to improve scalability. For example, objects that were promoted to the head of the queue in the last 60 seconds are not promoted again, which reduces lock contention without compromising the miss ratio. Cachelib further adds a lock combining technique to elide expensive coherence and synchronization operations to boost throughput [38]. As a result of the optimizations, both LRU and TwoQ show impressive scalability compared to the unoptimized LRU: the throughput is 6× higher at 16 threads than at a single thread on the Twitter trace. As a comparison, unoptimized LRU's throughput plateaus at 4 threads.

Compared to these LRU-based algorithms, SIEVE does not require "promotion" at each cache hit. Therefore, it is faster and more scalable. At a single thread, SIEVE is 16% (17%) faster than the optimized LRU (TwoQ) on both traces. At 16 threads, SIEVE shows more than 2× higher throughput than the optimized LRU and TwoQ on the Meta trace.

4.4 Simplicity

Prototype implementations. SIEVE not only achieves better efficiency, higher throughput, and better scalability, but is also very simple. We chose the most popular cache libraries/systems from five different languages: C++, Go, JavaScript, Python, and Rust, and replaced the LRU with SIEVE.

Although different libraries/systems have different implementations of LRU, e.g., most use a doubly-linked list and some use arrays, we find that implementing SIEVE is very easy. Table 2 shows the number of lines (not including tests) needed to replace LRU: all implementations require no more than 21 lines of code changes⁶.

⁶ While most LRU implementations are straightforward to adapt for SIEVE, Cachelib is an exception. Cachelib is highly optimized for LRU-based algorithms. Many optimizations are not needed for SIEVE, making it impractical to quantify code modifications for integration with SIEVE. Therefore, it is not included in Table 2.

Advanced algorithms in the simulator. Most of the complex algorithms we evaluated in §4.2 are not implemented in production systems. Therefore, we compare the lines of code needed to implement cache hit, insert, and evict in our simulator. Although we implemented our own linked list and hash table data structures in C for our simulator, we do not include the code lines related to list and hash table operations, i.e., appending to the list head or inserting into the hash table requires one line.

Table 3 shows that FIFO requires the fewest lines to implement. On top of FIFO, implementing LRU adds a few lines to promote an object upon cache hits. CLOCK and SIEVE require close to 10 lines to implement the eviction function because both need to find the first object that has not been visited. However, we remark that SIEVE is simpler than LRU and CLOCK because SIEVE does not require moving objects to the head of the queue on either a hit or a miss (evict). Besides these, all other algorithms require one to two orders of magnitude more lines of code to implement the three functions.

Per-object metadata. In addition to the implementation complexity, we also quantified the per-object metadata needed to implement each algorithm. FIFO does not require any metadata when implemented using a ring buffer. However, such an implementation does not support overwrite or delete, so a common FIFO implementation also uses a doubly-linked list with 16 bytes of per-object metadata, similar to LRU. CLOCK and SIEVE are similar, both requiring 1 bit to track object access status. When implemented using a doubly-linked list,
they use 17 bytes of per-object metadata. Compared to SIEVE, advanced algorithms often require more per-object metadata. Many key-value cache workloads have objects as small as 10s of bytes [66, 97], and large metadata wastes the precious cache space.

ZERO parameters. Besides being easy to implement and having less metadata, SIEVE also has no parameters. Except for FIFO, LRU, CLOCK, and Hyperbolic, all other algorithms have explicit or implicit parameters, e.g., the sizes of the queues in LIRS, the learning rate in LeCaR and CACHEUS, and the decay rate and age granularity in LHD. Note that although ARC has no explicit parameters, its adaptive algorithm uses implicit parameters in deciding when and how much space to move between the queues. As a comparison, SIEVE has no parameters and requires no tuning.

5 Distilling SIEVE's Effectiveness

Our empirical evaluation shows that SIEVE is simultaneously simple, fast, scalable, and efficient. In a well-trodden field like cache eviction, SIEVE's competitive performance was a genuine surprise to us as well. We next report our analysis that seeks to understand the secrets behind its efficiency.

5.1 Visualizing the sifting process

The workhorse of SIEVE is the "hand" that functions as a sieve: it sifts through the cache to filter out unpopular objects and retain the popular ones. We illustrate this process in Fig. 7a, where each column (queue) represents a snapshot of the cached objects over time from left to right. As the hand moves from the tail (the oldest object) to the head (the newest object), objects that have not been visited are evicted, the same sweeping mechanism that underlies CLOCK [30, 35]. For example, after the first round of sifting, objects at least as popular as A remain in the cache while others are evicted. The newly admitted objects are placed at the head of the queue, much like the CLOCK policy, but a departure from CLOCK, which does in-place replacements to emulate LRU. During the subsequent rounds of sifting, if objects that survived previous rounds remain popular, they will stay in the cache. In such a case, since most old objects are not evicted, the eviction hand quickly moves past the old popular objects to the queue positions close to the head. This allows newly inserted objects to be quickly assessed and evicted, putting greater eviction pressure on unpopular items (such as "one-hit wonders") than LRU and its variations [67]. As previous work has shown [16, 55, 94], quick demotion is crucial for achieving high cache efficiency.

[Figure 7: Left, (a): illustration of the sifting process over the first and subsequent rounds. Density of colors indicates inherent object popularity (blue: newly inserted objects; red: old objects in each round), and the letters represent object IDs. The first queue captures the state at the start of the first round, and the second queue captures the state at the end of the first round. Right: miss ratio over time of ARC, LRU, and SIEVE on (b) Trace 1, full trace (one week), and (c) Trace 2, first two days of a week-long trace. The gaps between SIEVE's miss ratio and the others enlarge over time.]

Fig. 7b and Fig. 7c show the cumulative miss ratio over time of different algorithms on two representative production traces. After the cache is warmed up, the miss ratio gaps between SIEVE and other algorithms widen over time, supporting the interpretation that SIEVE indeed sifts out unpopular objects and retains popular ones. A similar observation can be seen in Fig. 10a.

5.2 Analyzing the sifting process

We now analyze the popularity retention mechanism in SIEVE. To clarify the exposition, suppose the SIEVE cache can fit C equally sized objects. Since SIEVE always inserts new objects at the head, and objects that are retained remain in their original positions within the queue, the algorithm implicitly partitions the cache between new and old objects. This partition is dynamic, allowing SIEVE to strike a balance between exploration (finding new popular objects) and exploitation (enjoying hits on old popular objects).

SIEVE performs sifting by moving the hand from the tail to the head, evicting unpopular objects along the way, which we call one round of sifting. We use r to denote the number of rounds. We first enumerate the queue positions p from the tail (p = 0) to the head (p = C − 1). We then further denote that an object at position p in round r is examined (during eviction) or inserted at time $T_p^r$. Note that T effectively defines a logical timer for the examined objects: whenever an object is examined, T increases by 1, regardless of whether the examined object is evicted or retained. In addition, T changes once each round for an old object (retained from previous rounds).

For an old object x at position p, we define the "inter-examination time" $I_e(p^r) = T_p^r - T_{p'}^{r-1}$, where p′ was the position of x in round r − 1. Clearly, p′ ≥ p. For a new object inserted in the current round, the inter-examination time is defined as the time between its examination and insertion. We further define an old object x's "inter-arrival time" $I_a(x^r)$ as the time, measured again in the number of objects examined, between the first request to x in round r and the last
request to x in round r − 1. For a new object, the inter-arrival time is the time between its insertion and the second request. If an old object is not requested in the last round or a new object does not have a second request, its inter-arrival time is infinite.

In round r, consider two consecutive retained objects x₁ and x₂ at positions p₁ and p₂ = p₁ + 1. The inter-examination times are $I_e(p_1^r) = T_{p_1}^r - T_{p'_1}^{r-1}$ and $I_e(p_2^r) = T_{p_2}^r - T_{p'_2}^{r-1}$, respectively. The transition yields two invariants:

    $T_{p_2}^r - T_{p_1}^r = 1$
    $T_{p'_2}^{r-1} - T_{p'_1}^{r-1} \geq 1$

The first equation follows from x₁ and x₂ being consecutively retained objects; the second inequality expresses that other evictions may have taken place between x₁ and x₂ in the previous round. Together, these imply that $I_e(p_1^r) \geq I_e(p_2^r)$. The result generalizes further: for any two retained old objects in the queue, the object closer to the head has a smaller inter-examination time.

Moreover, if an object is retained, its inter-arrival time must be no greater than its inter-examination time. Therefore, for any retained object x at position $p_x$, its inter-arrival time $I_a(x^r)$ must be smaller than the tail object's inter-examination time:

    $I_a(x^r) \leq I_e(p_x^r) \leq I_e(p_0^r)$    (1)

Using the commonly assumed independent reference model [31, 48, 56, 57] with Poisson arrivals, we can expect any retained object to be more popular than some dynamic threshold set by the tail object's inter-examination time $I_e(p_0^r)$. Since evicting an object keeps the hand pointer at its original position (relative to the tail), the more objects are evicted during a round, the longer the inter-examination time. As a result, SIEVE effectively adapts the popularity threshold so that more objects are retained in the next round.

Following our sifting-process metaphor, the mesh size in SIEVE is determined by the tail object's inter-examination time $I_e(p_0^r)$, which is dynamically adjusted based on object popularity changes. If too few objects are retained in one round (mesh size too small), then we will have an increased tail inter-examination time $I_e(p_0^r)$ (a larger mesh size) in the next round.

5.3 Deeper study with synthetic workloads

A consistent finding across trace characterization work is that object popularity in web cache workloads invariably follows a heavy-tailed power-law (generalized Zipfian) distribution [27, 97]. Therefore, we opted for synthetic power-law workloads for our study. They allow us to easily modify workload features to better understand their impact on performance. Using these synthetic workloads, we further scrutinize SIEVE's effectiveness.

[Figure 8 (two panels; x-axis: cache size as X% of the working set, from 0 to 80; curves: SIEVE, LFU, ARC, LRU): (a) miss ratio over size; (b) popular object ratio over size.]

Figure 8: Miss ratio and popular object ratio on a Zipfian dataset (α = 1.0).

Miss ratio over size. Fig. 8a displays the miss ratios of LRU, LFU, ARC, and SIEVE at different cache sizes. Notably, LFU, ARC, and SIEVE all exhibit lower miss ratios than LRU, demonstrating their efficiency. Despite being considered optimal for synthetic power-law workloads, LFU performs similarly to ARC and is visibly worse than SIEVE. This is because objects with medium popularity, such as objects with ranks around the cache size C, are only requested once before their eviction. LFU cannot distinguish the true popularity of these objects and misses out on an opportunity for better performance. As a comparison, both ARC and SIEVE can quickly remove new and potentially unpopular objects, which allows cached objects to enjoy more time in the cache to demonstrate their popularity. Between the two algorithms, SIEVE further extends the tenure of these objects in the cache because when the hand sweeps through the newly inserted objects, the objects closer to the head must have strictly shorter inter-arrival times (expected to be more popular) to survive.

Popular object ratio over size. To capture how well different algorithms manage popular objects, we define a metric called the "popular object ratio". Under the assumption of a static and known popularity distribution, the optimal caching policy retains the most popular content within the cache at all times. Given a cache size C and a workload following a power-law distribution, the popular objects are the C most frequent objects in the workload, denoted by H. The popular object ratio of the cache at time t is calculated as $I_t = |H \cap A_t| / C$, where $A_t$ denotes the cache contents at time t.

Fig. 8b shows the popular object ratio at different cache sizes. LRU evicts objects based on recency, which only weakly correlates with popularity. In this scenario, LRU stores the smallest number of popular objects. LFU stores slightly more "popular objects" than ARC. SIEVE, however, successfully filters out unpopular objects from the cache.

Varying the popularity skew. Fig. 8 shows a distribution with Zipfian skewness α = 1. We further studied how different concentrations of popularity affect SIEVE's effectiveness. Due to space restrictions, we focus on results with the large cache size for the remainder of this subsection. Results using the small cache size are either similar or do not reveal interesting patterns.

Fig. 9a and Fig. 9b demonstrate the impact of varying skew on miss and popular object ratios. As skew increases, making popular objects more prominent, it becomes easier to identify
and cache the popular objects, increasing the popular object ratio. Among ARC, LFU, and SIEVE, we observe that SIEVE always shows a higher popular object ratio with a lower miss ratio across skewness levels, indicating that the efficiency of SIEVE is not limited to very skewed workloads.

[Figure 9 (three panels; (a) and (b): miss ratio and popular object ratio vs. the Zipf parameter (skewness α) from 0.2 to 1.6, with curves SIEVE, LFU, ARC, and LRU; (c): hand position over logical time (#requests) for Zipf 0.8, Zipf 1.0, and Zipf 1.2).]

Figure 9: Left two: miss ratio and popular object ratio on Zipfian workloads with different α. Right: hand position in the cache over time in Zipfian workloads.

Fig. 9c illustrates the hand position in the SIEVE cache over time, advancing towards the head with each retained object and pausing during evictions. Therefore, the more objects are retained, the faster the movement. We observe that the hand moves slowly in the first round because that is when many unpopular objects are evicted. In subsequent rounds, the hand lingers at positions close to the head for most of the time because SIEVE keeps a new object at position p only if it is more popular (shorter inter-arrival time) than the object at position p − 1. In other words, SIEVE performs quick demotion [87].

In more skewed workloads, the hand moves quickly due to the early arrival and higher request volumes of popular objects, allowing SIEVE to cache the most popular objects by the end of the first round. Consequently, the hand rapidly transitions from tail to head with fewer evictions and spends less time near the head, as new objects are more likely to be retained, hastening its progress. Nevertheless, the time of each round varies depending on the frequency of encountering potentially popular objects, highlighting SIEVE's adaptability to workload shifts. When new popular objects appear, the hand accelerates, replacing existing cached objects with the newcomers by giving them less time to set their visited bit.

[Figure 10 (two panels over logical time; curves: SIEVE, LFU, ARC, LRU): (a) interval miss ratio; (b) popular object ratio over time.]

Figure 10: Interval miss ratio and popular object ratio over time on a workload constructed by connecting two different Zipfian workloads (α = 1).

SIEVE is adaptive. To visualize SIEVE's adaptivity via the sifting process, we created a new workload by joining two Zipfian (α = 1.0) workloads that request different populations of objects. Fig. 10 shows the interval miss ratio (per 100,000 requests) over time on this conjoined workload. The changeover happens at the 50% midway time mark. We observe that the interval miss ratio of LFU skyrockets to nearly 100% (beyond figure bounds) since new objects cannot replace the old objects. Relative to LRU and ARC, SIEVE's miss ratio spike is larger because it takes time for the hand to move back to the tail before it can evict old objects. However, SIEVE's spike is invisible when the cache size is small (not shown). Coinciding with the interval miss ratio spike, we observe the popular object ratio of all algorithms (the curves overlap) dropping to 0 when the workload changes at the midway point. Whereas LFU never recovers from the drop, the popular object ratios of all other algorithms quickly recover to large proportions. Finally, the figures corroborate our interpretation of the sifting process: SIEVE's miss ratio drops over time, while the fraction of popular objects increases over time.

[Figure 11 (two panels of bars for LRU, FIFO, and SIEVE; y-axis: instructions per request; x-axis: Zipf parameter (skewness α) of 0.8, 1.0, and 1.2): (a) large cache; (b) small cache.]

Figure 11: Average number of instructions per request when running LRU, FIFO, and SIEVE caches. The top number denotes the miss ratio.

6 SIEVE as a Turn-key Cache Primitive

6.1 Cache primitives

Beyond being a cache eviction algorithm, SIEVE can serve as a cache primitive for designing more advanced eviction policies. To study the range of such policies, we categorize existing cache eviction algorithm designs into four main approaches. (1) We can design simple and easy-to-understand eviction algorithms, such as FIFO queues, LRU queues, LFU queues, and Random eviction. We call these simple algorithms cache primitives. SIEVE falls under this category. (2) We can improve the cache primitives. For example, FIFO-Reinsertion is designed by adding reinsertion to FIFO; LRU-K [70] is designed by changing the recency metric in LRU. (3) We can compose multiple cache primitives with objects moved between them. For example, ARC, SLRU, and MQ use multiple LRU queues. (4) We can run multiple cache primitives and
craft a decision-maker to select eviction candidates suggested by the primitives. For example, LeCaR [84] uses reinforcement learning to choose between the eviction candidates from LRU and LFU; HALP [80] uses machine learning (an MLP) to choose one object from the eight objects at the LRU tail.

Having an efficient cache primitive not only provides an effective and simple eviction algorithm but also enables other approaches to design more efficient algorithms. The ideal cache primitive is simultaneously (1) simple, (2) efficient, and (3) fast, in terms of high throughput. For example, FIFO and LRU meet these requirements and are frequently used to construct more advanced algorithms. However, they are less efficient than complex algorithms.

While we have shown that SIEVE is simple, efficient, and fast in §4, to further understand SIEVE as a cache primitive, we compare the number of instructions needed to run FIFO, LRU, and SIEVE caches. We remark that the number of instructions may not necessarily correlate with latency or throughput; it is rather a rough metric of CPU resource usage. We used perf stat to measure the number of instructions for serving power-law workloads (100 million requests, 1 million objects) in our simulator. We then deduct the simulator overhead by measuring a no-op cache, which performs nothing on cache hits and misses.

Fig. 11 shows that SIEVE generally executes fewer instructions per request than FIFO and LRU, a difference accentuated in skewed workloads and at larger cache sizes. Compared to LRU, SIEVE requires fewer instructions since SIEVE needs only to check and possibly update a Boolean field on cache hits, which is much simpler than moving an object to the head of the queue. Besides LRU, SIEVE also requires fewer instructions than FIFO because of the difference in miss ratios. Because SIEVE has a lower miss ratio than FIFO, fewer objects need to be inserted due to cache misses, leading to fewer instructions per request on average. The only exception is when SIEVE and FIFO have similar miss ratios, in which case FIFO executes fewer instructions than SIEVE. Overall, SIEVE requires up to 40% and 24% fewer instructions than LRU and FIFO, respectively.

[Figure 12 (three panels; (a) large cache and (b) small cache: miss ratio reduction from FIFO for SIEVE, LeCaR, TwoQ, ARC, and S3-FIFO, comparing the original algorithms with versions whose LRU is replaced by SIEVE; (c): the fraction of traces on which FIFO, LRU, and SIEVE perform best when endowed with foresight).]

Figure 12: Impact of replacing LRU with SIEVE in advanced algorithms (a, b). The potential of FIFO, LRU, and SIEVE when endowed with foresight (c).

6.2 Turn-key cache eviction with SIEVE

As a cache primitive, SIEVE can facilitate the design of more advanced eviction algorithms. To understand the benefits of using a better cache primitive, we replaced the LRU in LeCaR, TwoQ, and ARC with SIEVE. Note that for ARC, we only replace the LRU for frequent objects.

We evaluate these algorithms on all traces and show the miss ratio reduction (from FIFO) in Fig. 12a and Fig. 12b. Compared to SIEVE, LeCaR has much lower efficiency; however, replacing the LRU in LeCaR with SIEVE significantly reduces LeCaR's miss ratio, by 4.5% on average. TwoQ and ARC achieve efficiency close to SIEVE; however, when replacing the LRU with SIEVE, the efficiency of both algorithms gets boosted. For example, ARC-SIEVE achieves the best efficiency among all compared algorithms at both small and large cache sizes. It reduces ARC's miss ratio by 3.7% on average and up to 62.5% at the large cache size (recall that ARC reduces LRU's miss ratio by 6.3% on average). ARC-SIEVE also reduces SIEVE's miss ratio by an average of 2.4% and up to 40.6%.

To understand the potential in suggesting eviction candidates, we evaluated the efficiency of FIFO, LRU, and SIEVE when granted access to future request data. Each eviction candidate is either evicted or reinserted, depending on whether the object will be requested soon. We assume that an object will be requested soon if the logical time (number of requests) until the object's next access is no more than C/mr, where C is the cache size and mr is the miss ratio. This mimics the case where we have a perfect decision-maker choosing between the eviction candidates suggested by multiple simple eviction algorithms. Fig. 12c shows that when supplied with this additional information, SIEVE achieves the lowest miss ratio on 97% and 94% of the 1559 traces at the large and small cache sizes, respectively.

These results highlight the potential of SIEVE as a powerful cache primitive for designing advanced cache eviction algorithms. Leveraging lazy promotion and quick demotion, SIEVE not only performs well on its own but also bolsters the performance of more complex algorithms.
0.4 0.3
Consequently, both types of objects are rapidly evicted af-
Byte Miss Ratio Reduction
from FIFO
0.2
shadow cache that keeps track of recently evicted items to
0.1
0.1 make smarter future eviction decisions — it cannot recognize
0.0 0.0 the popular objects when they are requested again. This prob-
lem is less severe on the large cache size, but when the cache
E
C
Tw U
oQ
RS
er S
O c
CK
U
E
C
Tw U
oQ
RS
er S
O c
CK
U
U
CL oli
CL oli
EV
yp EU
EV
yp EU
AR
AR
F
LR
LR
LR
LR
LH
LH
yL
yL
LI
LI
b
b
SI
SI
B-
B-
CH
CH
n
n
Ti
Ti
CA
CA
size is small, we observe that having a ghost is critical to be
H
H
(a) Large cache (b) Small cache scan-resistant. We conjecture that not being scan-resistant
Figure 13: Byte miss ratio across all CDN traces. is probably the reason why S IEVE remained undiscovered
over the decades of caching research, which has been mostly
0.3
0.20
LRU LRU focused on page and block accesses.
Byte Miss Ratio
ARC
Byte Miss Ratio
ARC
0.2 LRB
0.15 LRB
SIEVE SIEVE
7.3 TTL-friendliness
0.10
0.1 Time-to-live (TTL) is a common feature in web
0.05
caching [97, 98]. It specifies the duration during which an
0.00 0.0
1 2 5 10 20 40 1 2 5 10 20 40 object can be used. After the TTL has elapsed, the object
Cache Size (X% Working Set, 5688 GB) Cache Size (X% Working Set, 8627 GB)
expires and can no longer be served to the user, even if it may
(a) Wiki2018 trace (b) Wiki2019 trace
still be cached. Most existing eviction algorithms today do not
Figure 14: Byte miss ratios at different cache sizes on two Wiki CDN traces consider object expiration and require a separate procedure,
used in LRB evaluation.
e.g., scanning the cache, to remove expired objects. Similar
to FIFO, S IEVE maintains objects in insertion order, which
7 Discussion allows objects in TTL-partitioned caches, e.g., Segcache [98],
7.1 Byte miss ratio to be sorted by expiration time. This provides a convenient
method for discovering and removing expired objects.
To gauge SIEVE's efficiency in reducing network bandwidth usage in CDNs, we analyzed its byte miss ratio by considering object sizes. We chose the cache size at 10% and 0.1% of the trace footprint in bytes. Fig. 13a and Fig. 13b show that SIEVE presents larger byte miss ratio reductions at all percentiles than state-of-the-art algorithms at both cache sizes, showcasing its high efficiency in CDN caches.

We further compared SIEVE with LRB [79], the state-of-the-art machine-learning-based cache eviction algorithm optimized for byte miss ratio. Due to LRB's long run time, we only evaluated LRB on the two open-source Wiki traces provided by the authors. Fig. 14a and Fig. 14b show that LRB performs better at small cache sizes (1% and 2%), while SIEVE excels at larger cache sizes. We conjecture that at a small cache size, the ideal objects to cache are popular objects with many requests, which LRB can more easily identify because they have more features (most of LRB's features are about the time between accesses to an object). When the cache size is large, most objects in the cache have few requests; without enough features, a learned model provides little benefit [94, 99]. In summary, compared to complex machine-learning-based algorithms, SIEVE still has competitive efficiency.
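For reference, byte miss ratio weights each request by its object size; in our notation (an assumption for exposition), with $s_i$ denoting the size of the object fetched by request $i$:

$\text{byte miss ratio} = \dfrac{\sum_{i \in \text{misses}} s_i}{\sum_{i \in \text{all requests}} s_i}$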
7.2 SIEVE is not scan-resistant

Besides web cache workloads, we evaluated SIEVE on some block cache workloads. However, we find that SIEVE sometimes shows a miss ratio higher than LRU. The primary reason for this discrepancy is that SIEVE is not scan-resistant. In block cache workloads, which frequently feature scans, popular objects often intermingle with objects from scans. Consequently, both types of objects are rapidly evicted after insertion. Since SIEVE does not have a ghost cache — a shadow cache that keeps track of recently evicted items to make smarter future eviction decisions — it cannot recognize the popular objects when they are requested again. This problem is less severe on the large cache size, but when the cache size is small, we observe that having a ghost is critical to be scan-resistant. We conjecture that not being scan-resistant is probably the reason why SIEVE remained undiscovered over the decades of caching research, which has been mostly focused on page and block accesses.
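To clarify the term: a ghost cache, as used by algorithms such as ARC [67] and LIRS [59], stores only the keys of recently evicted objects. The sketch below illustrates the idea; it is not part of SIEVE, and the class and field names are invented for illustration.

    from collections import OrderedDict

    class GhostCache:
        # Remembers keys (metadata only) of recently evicted objects.
        def __init__(self, capacity):
            self.capacity = capacity
            self.keys = OrderedDict()          # oldest ghost first

        def record_eviction(self, key):
            self.keys[key] = None
            self.keys.move_to_end(key)
            if len(self.keys) > self.capacity:
                self.keys.popitem(last=False)  # forget the oldest ghost

        def was_recently_evicted(self, key):
            # A hit here flags a popular object that a scan pushed out,
            # so a scan-resistant policy can re-admit it with priority.
            return key in self.keys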
7.3 TTL-friendliness

Time-to-live (TTL) is a common feature in web caching [97, 98]. It specifies the duration during which an object can be used. After the TTL has elapsed, the object expires and can no longer be served to the user, even if it may still be cached. Most existing eviction algorithms today do not consider object expiration and require a separate procedure, e.g., scanning the cache, to remove expired objects. Similar to FIFO, SIEVE maintains objects in insertion order, which allows objects in TTL-partitioned caches, e.g., Segcache [98], to be sorted by expiration time. This provides a convenient method for discovering and removing expired objects.
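As a concrete illustration of the point above: when all objects in a partition share one TTL, the insertion order that SIEVE (like FIFO) preserves is also the expiration order, so expired objects cluster at the oldest end of the queue. The sketch below assumes a hypothetical partition structure; it is not Segcache's or our prototypes' code.

    import time
    from collections import deque

    class TtlPartition:
        # Hypothetical partition in which every object shares one TTL.
        def __init__(self, ttl_seconds):
            self.ttl = ttl_seconds
            self.queue = deque()   # (insert_time, key), oldest at the left

        def insert(self, key):
            self.queue.append((time.time(), key))

        def remove_expired(self):
            # Expired objects sit at the head of the queue, so removal
            # stops at the first unexpired object; no full-cache scan.
            now = time.time()
            while self.queue and now - self.queue[0][0] >= self.ttl:
                self.queue.popleft()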
8 Conclusion

We design SIEVE, a simple, efficient, fast, and scalable cache eviction algorithm for web caches that leverages "lazy promotion" and "quick demotion". The high efficiency of SIEVE comes from gradually sifting out the unpopular objects. SIEVE is the first and the simplest cache primitive that supports both lazy promotion and quick demotion, which serves as the foundation for its high efficiency and high performance. Evaluated on 1559 traces from 7 datasets, we show that SIEVE outperforms complex state-of-the-art algorithms on over 45% of the traces. We implemented SIEVE in five open-source production libraries using fewer than 20 lines of code changes on average.

Availability

The code and data used in this work are open-sourced at https://github.com/cacheMon/NSDI24-SIEVE. This repository includes both the simulator and the prototypes. Additionally, we have engineered cache libraries based on SIEVE for various programming languages. More information is available at https://sievecache.com.

Acknowledgments

We thank the anonymous reviewers and our shepherd Kay Ousterhout for constructive suggestions. We are grateful to the individuals and organizations that have generously open-sourced and shared production traces. We thank CloudLab [40] for the infrastructure support for running experiments. We also appreciate the members of SimBioSys and the PDL Consortium for their interest, insights, feedback, and support.
References

[1] Analytics/data lake/traffic/caching. https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Caching. Accessed: 2023-04-27.
[2] Apache traffic server. https://trafficserver.apache.org/. Accessed: 2023-04-27.
[3] lru-dict. https://github.com/amitdev/lru-dict. Accessed: 2023-04-27.
[4] lru-rs. https://github.com/jeromefroe/lru-rs. Accessed: 2023-04-27.
[5] Memcached - a distributed memory object caching system. http://memcached.org/. Accessed: 2023-04-27.
[6] mnemonist. https://github.com/Yomguithereal/mnemonist. Accessed: 2023-04-27.
[7] Nginx. https://nginx.org/. Accessed: 2023-04-27.
[8] pelikan. https://github.com/pelikan-io/pelikan. Accessed: 2023-04-27.
[9] Redis. http://redis.io/. Accessed: 2023-04-27.
[10] Running cachebench with the trace workload. https://cachelib.org/docs/Cache_Library_User_Guides/Cachebench_FB_HW_eval. Accessed: 2023-04-27.
[11] Varnish cache. https://varnish-cache.org/. Accessed: 2023-04-27.
[12] Ismail Ari, Ahmed Amer, Robert B Gramacy, Ethan L Miller, Scott A Brandt, and Darrell DE Long. ACME: Adaptive Caching Using Multiple Experts. In WDAS, volume 2, pages 143–158, 2002.
[13] Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. Workload Analysis of a Large-Scale Key-Value Store. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '12, pages 53–64, New York, NY, USA, 2012. Association for Computing Machinery.
[14] Nirav Atre, Justine Sherry, Weina Wang, and Daniel S. Berger. Caching with Delayed Hits. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication, SIGCOMM '20, pages 495–513, New York, NY, USA, 2020. Association for Computing Machinery.
[15] Sorav Bansal and Dharmendra S. Modha. CAR: Clock with Adaptive Replacement. In 3rd USENIX Conference on File and Storage Technologies, FAST'04, 2004.
[16] Nathan Beckmann, Haoxian Chen, and Asaf Cidon. LHD: Improving cache hit rate by maximizing hit density. In 15th USENIX symposium on networked systems design and implementation, NSDI'18, pages 389–403, 2018.
[17] Nathan Beckmann and Daniel Sanchez. Talus: A simple way to remove cliffs in cache performance. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture, HPCA'15, pages 64–75, Burlingame, CA, USA, February 2015. IEEE.
[18] Nathan Beckmann and Daniel Sanchez. Maximizing Cache Performance Under Uncertainty. In 2017 IEEE International Symposium on High Performance Computer Architecture, HPCA'17, pages 109–120, Austin, TX, February 2017. IEEE.
[19] L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5(2):78–101, 1966.
[20] Benjamin Berg, Daniel S. Berger, Sara McAllister, Isaac Grosof, Sathya Gunasekar, Jimmy Lu, Michael Uhlar, Jim Carrig, Nathan Beckmann, Mor Harchol-Balter, and Gregory R. Ganger. The CacheLib caching engine: Design and experiences at scale. In 14th USENIX symposium on operating systems design and implementation, OSDI'20, pages 753–768. USENIX Association, November 2020.
[21] Daniel S. Berger, Benjamin Berg, Timothy Zhu, Siddhartha Sen, and Mor Harchol-Balter. RobinHood: Tail latency aware caching – dynamic reallocation from Cache-Rich to Cache-Poor. In 13th USENIX symposium on operating systems design and implementation, OSDI'18, pages 195–212, Carlsbad, CA, October 2018. USENIX Association.
[22] Daniel S Berger, Ramesh K Sitaraman, and Mor Harchol-Balter. AdaptSize: Orchestrating the hot object memory cache in a content delivery network. In 14th USENIX symposium on networked systems design and implementation, NSDI'17, pages 483–498, 2017.
[23] Adit Bhardwaj and Vaishnav Janardhan. PECC: Prediction-error correcting cache. In Workshop on ML for Systems at NeurIPS, volume 32, 2018.
[24] Aaron Blankstein, Siddhartha Sen, and Michael J. Freedman. Hyperbolic caching: Flexible caching for web applications. In 2017 USENIX annual technical conference, ATC'17, pages 499–511, Santa Clara, CA, July 2017. USENIX Association.
[25] bradfitz. groupcache. https://github.com/golang/groupcache. Accessed: 2023-04-27.
[26] L. Breslau, Pei Cao, Li Fan, G. Phillips, and S. Shenker. Web caching and Zipf-like distributions: evidence and implications. In Proceedings. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies, pages 126–134 vol.1, New York, NY, USA, 1999. IEEE.
[27] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. On the implications of Zipf's law for web caching. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 1998.
[28] Daniel Byrne, Nilufer Onder, and Zhenlin Wang. mPart: miss-ratio curve guided partitioning in key-value stores. In Proceedings of the 2018 ACM SIGPLAN International Symposium on Memory Management, ISMM'18, pages 84–95, Philadelphia, PA, USA, June 2018. ACM.
[29] Pei Cao and Sandy Irani. Cost-Aware WWW Proxy Caching Algorithms. In USENIX Symposium on Internet Technologies and Systems, USITS'97, Monterey, CA, December 1997. USENIX Association.
[30] Richard W. Carr and John L. Hennessy. WSCLOCK: a simple and effective algorithm for virtual memory management. In Proceedings of the eighth ACM symposium on Operating systems principles, SOSP '81, pages 87–95, New York, NY, USA, December 1981. Association for Computing Machinery.
[31] H. Che, Z. Wang, and Y. Tung. Analysis and design of hierarchical Web caching systems. In Proceedings IEEE INFOCOM 2001. Conference on Computer Communications. Twentieth Annual Joint Conference of the IEEE Computer and Communications Society (Cat. No.01CH37213), volume 3, pages 1416–1424, Anchorage, AK, USA, 2001. IEEE.
[32] Asaf Cidon, Assaf Eisenman, Mohammad Alizadeh, and Sachin Katti. Cliffhanger: Scaling performance cliffs in web memory caches. In 13th USENIX symposium on networked systems design and implementation, NSDI'16, pages 379–392, 2016.
[33] Asaf Cidon, Daniel Rushton, Stephen M. Rumble, and Ryan Stutsman. Memshare: a dynamic multi-tenant key-value cache. In 2017 USENIX annual technical conference, ATC'17, pages 321–334, Santa Clara, CA, July 2017. USENIX Association.
[34] Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM symposium on Cloud computing, pages 143–154, 2010.
[35] Fernando J Corbato. A paging experiment with the Multics system. Technical report, Massachusetts Institute of Technology, Project MAC, 1968.
[36] Peter J Denning. The working set model for program behavior. Communications of the ACM, 11(5):323–333, 1968.
[37] Meta developers. CacheLib. https://cachelib.org. Accessed: 2023-04-27.
[38] Meta developers. Distributed mutex. https://github.com/facebook/folly/blob/2c00d14adb9b632936f3abfbf741373871cd64a6/folly/synchronization/DistributedMutex.h. Accessed: 2023-04-27.
[39] Donghee Lee, Jongmoo Choi, Jong-Hun Kim, S.H. Noh, Sang Lyul Min, Yookun Cho, and Chong Sang Kim. LRFU: a spectrum of policies that subsumes the least recently used and least frequently used policies. IEEE Transactions on Computers, 50(12):1352–1361, December 2001.
[40] Dmitry Duplyakin, Robert Ricci, Aleksander Maricq, Gary Wong, Jonathon Duerig, Eric Eide, Leigh Stoller, Mike Hibler, David Johnson, Kirk Webb, Aditya Akella, Kuangching Wang, Glenn Ricart, Larry Landweber, Chip Elliott, Michael Zink, Emmanuel Cecchet, Snigdhaswin Kar, and Prabodh Mishra. The design and operation of CloudLab. In Proceedings of the USENIX Annual Technical Conference (ATC), pages 1–14, July 2019.
[41] Gil Einziger, Ohad Eytan, Roy Friedman, and Ben Manes. Adaptive Software Cache Management. In Proceedings of the 19th International Middleware Conference, pages 94–106, Rennes, France, November 2018. ACM.
[42] Gil Einziger, Ohad Eytan, Roy Friedman, and Benjamin Manes. Lightweight robust size aware cache management. ACM Transactions on Storage, 18(3), August 2022.
[43] Gil Einziger, Roy Friedman, and Ben Manes. TinyLFU: A Highly Efficient Cache Admission Policy. ACM Transactions on Storage, 13(4):1–31, December 2017.
[44] Assaf Eisenman, Asaf Cidon, Evgenya Pergament, Or Haimovich, Ryan Stutsman, Mohammad Alizadeh, and Sachin Katti. Flashield: a hybrid key-value cache that controls flash write amplification. In 16th USENIX symposium on networked systems design and implementation, NSDI'19, pages 65–78, Boston, MA, February 2019. USENIX Association.
[45] Assaf Eisenman, Maxim Naumov, Darryl Gardner, Misha Smelyanskiy, Sergey Pupyrev, Kim Hazelwood, Asaf Cidon, and Sachin Katti. Bandana: Using non-volatile memory for storing deep learning models. In A. Talwalkar, V. Smith, and M. Zaharia, editors, Proceedings of machine learning and systems, volume 1 of MLSys'20, pages 40–52, 2019.
[46] Ohad Eytan, Danny Harnik, Effi Ofer, Roy Friedman, and Ronen Kat. It's time to revisit LRU vs. FIFO. In 12th USENIX workshop on hot topics in storage and file systems, HotStorage'20. USENIX Association, July 2020.
[47] Bin Fan, David G Andersen, and Michael Kaminsky. MemC3: Compact and concurrent MemCache with dumber caching and smarter hashing. In 10th USENIX symposium on networked systems design and implementation, NSDI'13, pages 371–384, 2013.
[48] Philippe Flajolet, Daniele Gardy, and Loÿs Thimonier. Birthday paradox, coupon collectors, caching algorithms and self-organizing search. Discrete Applied Mathematics, 39(3):207–229, 1992.
[49] Phillipa Gill, Martin Arlitt, Zongpeng Li, and Anirban Mahanti. Youtube traffic characterization: a view from the edge. In Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, pages 15–28, 2007.
[50] Xiaoming Gu and Chen Ding. On the theory and potential of LRU-MRU collaborative cache management. SIGPLAN Not., 46(11):43–54, June 2011.
[51] Lei Guo, Enhua Tan, Songqing Chen, Zhen Xiao, and Xiaodong Zhang. The stretched exponential distribution of internet media access patterns. In Proceedings of the twenty-seventh ACM symposium on Principles of distributed computing, pages 283–294, 2008.
[52] Syed Hasan, Sergey Gorinsky, Constantine Dovrolis, and Ramesh K Sitaraman. Trade-offs in optimizing the cache deployments of CDNs. In IEEE INFOCOM 2014 - IEEE conference on computer communications, pages 460–468. IEEE, 2014.
[53] Xiameng Hu, Xiaolin Wang, Lan Zhou, Yingwei Luo, Chen Ding, and Zhenlin Wang. Kinetic modeling of data eviction in cache. In 2016 USENIX annual technical conference, ATC'16, pages 351–364, Denver, CO, June 2016. USENIX Association.
[54] Xinyue Hu, Eman Ramadan, Wei Ye, Feng Tian, and Zhi-Li Zhang. Raven: Belady-guided, predictive (deep) learning for in-memory and content caching. In Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies, CoNEXT '22, pages 72–90, New York, NY, USA, November 2022. Association for Computing Machinery.
[55] Qi Huang, Ken Birman, Robbert van Renesse, Wyatt Lloyd, Sanjeev Kumar, and Harry C. Li. An analysis of Facebook photo caching. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 167–181, New York, NY, USA, November 2013. Association for Computing Machinery.
[56] Stratis Ioannidis, Laurent Massoulie, and Augustin Chaintreau. Distributed caching over heterogeneous mobile networks. In Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems, SIGMETRICS'10, pages 311–322, 2010.
[57] Stratis Ioannidis and Edmund Yeh. Adaptive Caching Networks with Optimality Guarantees. In Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science, SIGMETRICS'16, pages 113–124, Antibes Juan-les-Pins, France, June 2016. ACM.
[58] Song Jiang, Feng Chen, and Xiaodong Zhang. CLOCK-Pro: an effective improvement of the CLOCK replacement. In Proceedings of the annual conference on USENIX Annual Technical Conference, ATC'05, page 35, USA, April 2005. USENIX Association.
[59] Song Jiang and Xiaodong Zhang. LIRS: an efficient low inter-reference recency set replacement policy to improve buffer cache performance. In ACM SIGMETRICS Performance Evaluation Review, volume 30 of SIGMETRICS'02, pages 31–42, June 2002.
[60] Theodore Johnson and Dennis Shasha. 2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB'94, pages 439–450, San Francisco, CA, USA, September 1994. Morgan Kaufmann Publishers Inc.
[61] R. Karedla, J.S. Love, and B.G. Wherry. Caching strategies to improve disk system performance. Computer, 27(3):38–46, March 1994.
[62] Cong Li. DLIRS: Improving Low Inter-Reference Recency Set Cache Replacement Policy with Dynamics. In Proceedings of the 11th ACM International Systems and Storage Conference, SYSTOR '18, pages 59–64, New York, NY, USA, June 2018. Association for Computing Machinery.
[63] Conglong Li and Alan L. Cox. GD-Wheel: a cost-aware replacement policy for key-value stores. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys'15, pages 1–15, Bordeaux, France, April 2015. ACM.
[64] Hyeontaek Lim, Dongsu Han, David G. Andersen, and Michael Kaminsky. MICA: A holistic approach to fast in-memory key-value storage. In 11th USENIX symposium on networked systems design and implementation, NSDI'14, pages 429–444, Seattle, WA, April 2014. USENIX Association.
[65] Adnan Maruf, Ashikee Ghosh, Janki Bhimani, Daniel Campello, Andy Rudoff, and Raju Rangaswami. MULTI-CLOCK: Dynamic Tiering for Hybrid Memory Systems. In 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 925–937, April 2022. ISSN: 2378-203X.
[66] Sara McAllister, Benjamin Berg, Julian Tutuncu-Macias, Juncheng Yang, Sathya Gunasekar, Jimmy Lu, Daniel S. Berger, Nathan Beckmann, and Gregory R. Ganger. Kangaroo: Theory and practice of caching billions of tiny objects on flash. In ACM Transactions on Storage, volume 18 of TOS'22, August 2022.
[67] Nimrod Megiddo and Dharmendra S Modha. ARC: A self-tuning, low overhead replacement cache. In 2nd USENIX conference on file and storage technologies, FAST'03, 2003.
[68] Sailesh Mukil. Cache warming: Leveraging EBS for moving petabytes of data. https://netflixtechblog.medium.com/cache-warming-leveraging-ebs-for-moving-petabytes-of-data-adcf7a4a78c3. Accessed: 2023-04-27.
[69] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, and others. Scaling memcache at facebook. In 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI'13, pages 385–398, 2013.
[70] Elizabeth J. O'Neil, Patrick E. O'Neil, and Gerhard Weikum. The LRU-K page replacement algorithm for database disk buffering. ACM SIGMOD Record, 22(2):297–306, June 1993.
[71] Sejin Park and Chanik Park. FRD: A filtering based buffer cache algorithm that considers both frequency and reuse distance. In Proc. of the 33rd IEEE International Conference on Massive Storage Systems and Technology (MSST), 2017.
[72] Ziyue Qiu, Juncheng Yang, Juncheng Zhang, Cheng Li, Xiaosong Ma, Qi Chen, Mao Yang, and Yinlong Xu. FrozenHot cache: Rethinking cache management for modern software. In Twenty-third EuroSys Conference, EuroSys'23, New York, NY, USA, 2023. Association for Computing Machinery.
[73] KV Rashmi, Mosharaf Chowdhury, Jack Kosaian, Ion Stoica, and Kannan Ramchandran. EC-Cache: load-balanced, low-latency cluster caching with online erasure coding. In 12th USENIX symposium on operating systems design and implementation, OSDI'16, pages 401–417, 2016.
[74] John T. Robinson and Murthy V. Devarakonda. Data cache management using frequency-based replacement. In Proceedings of the 1990 ACM SIGMETRICS conference on measurement and modeling of computer systems, SIGMETRICS'90, pages 134–142, New York, NY, USA, 1990. Association for Computing Machinery.
[75] Liana V. Rodriguez, Farzana Yusuf, Steven Lyons, Eysler Paz, Raju Rangaswami, Jason Liu, Ming Zhao, and Giri Narasimhan. Learning Cache Replacement with CACHEUS. In 19th USENIX Conference on File and Storage Technologies, FAST'21, pages 341–354. USENIX Association, February 2021.
[76] Arjun Singhvi, Aditya Akella, Maggie Anderson, Rob Cauble, Harshad Deshmukh, Dan Gibson, Milo M. K. Martin, Amanda Strominger, Thomas F. Wenisch, and Amin Vahdat. CliqueMap: productionizing an RMA-based distributed caching system. In Proceedings of the 2021 ACM SIGCOMM 2021 Conference, SIGCOMM'21, pages 93–105, Virtual Event, USA, August 2021. ACM.
[77] Yannis Smaragdakis, Scott Kaplan, and Paul Wilson. EELRU: simple and effective adaptive page replacement. ACM SIGMETRICS Performance Evaluation Review, 27(1):122–133, May 1999.
[78] Alan Jay Smith. Sequentiality and prefetching in database systems. ACM Transactions on Database Systems, 3(3):223–247, September 1978.
[79] Zhenyu Song, Daniel S Berger, Kai Li, Anees Shaikh, Wyatt Lloyd, Soudeh Ghorbani, Changhoon Kim, Aditya Akella, Arvind Krishnamurthy, Emmett Witchel, and others. Learning relaxed belady for content distribution network caching. In 17th USENIX symposium on networked systems design and implementation, NSDI'20, pages 529–544, 2020.
[80] Zhenyu Song, Kevin Chen, Nikhil Sarda, Deniz Altinbuken, Eugene Brevdo, Jimmy Coleman, Xiao Ju, Pawel Jurczyk, Richard Schooler, and Ramki Gummadi. HALP: Heuristic aided learned preference eviction policy for YouTube content delivery network. In 20th USENIX Symposium on Networked Systems Design and Implementation, pages 1149–1163, Boston, MA, April 2023. USENIX Association.
[81] Kunwadee Sripanidkulchai, Bruce Maggs, and Hui Zhang. An analysis of live streaming workloads on the internet. In Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, pages 41–54, 2004.
[82] Aditya Sundarrajan, Mingdong Feng, Mangesh Kasbekar, and Ramesh K Sitaraman. Footprint descriptors: Theory and practice of cache provisioning in a global CDN. In Proceedings of the 13th International Conference on emerging Networking EXperiments and Technologies, pages 55–67, 2017.
[83] Linpeng Tang, Qi Huang, Wyatt Lloyd, Sanjeev Kumar, and Kai Li. RIPQ: Advanced photo caching on flash for facebook. In 13th USENIX Conference on File and Storage Technologies, FAST'15, pages 373–386, 2015.
[84] Giuseppe Vietri, Liana V. Rodriguez, Wendy A. Martinez, Steven Lyons, Jason Liu, Raju Rangaswami, Ming Zhao, and Giri Narasimhan. Driving cache replacement with ML-based LeCaR. In 10th USENIX workshop on hot topics in storage and file systems, HotStorage'18, Boston, MA, July 2018. USENIX Association.
[85] Carl Waldspurger, Trausti Saemundsson, Irfan Ahmad, and Nohhyun Park. Cache modeling and optimization using miniature simulations. In 2017 USENIX annual technical conference, ATC'17, pages 487–498, Santa Clara, CA, July 2017. USENIX Association.
[86] Hua Wang, Xinbo Yi, Ping Huang, Bin Cheng, and Ke Zhou. Efficient SSD Caching by Avoiding Unnecessary Writes using Machine Learning. In Proceedings of the 47th International Conference on Parallel Processing, ICPP'18, pages 1–10, Eugene, OR, USA, August 2018. ACM.
[87] Theodore M Wong and John Wilkes. My cache or yours?: Making storage more exclusive. In USENIX Annual Technical Conference, ATC'02, pages 161–175, 2002.
[88] Nan Wu and Pengcheng Li. Phoebe: Reuse-Aware Online Caching with Reinforcement Learning for Emerging Storage Models, November 2020.
[89] Xingbo Wu, Li Zhang, Yandong Wang, Yufei Ren, Michel Hack, and Song Jiang. zExpander: a key-value cache with both high performance and fewer misses. In Proceedings of the Eleventh European Conference on Computer Systems, EuroSys'16, pages 1–15, London, United Kingdom, April 2016. ACM.
[90] Gang Yan and Jian Li. RL-Bélády: A Unified Learning Framework for Content Caching. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1009–1017, Seattle, WA, USA, October 2020. ACM.
[91] Gang Yan and Jian Li. Towards Latency Awareness for Content Delivery Network Caching. In USENIX Annual Technical Conference, ATC'22, pages 789–804, 2022.
[92] Juncheng Yang. libcachesim: a high-performance library for building cache simulators. https://libcachesim.com/. Accessed: 2023-04-27.
[93] Juncheng Yang, Ziming Mao, Yao Yue, and K. V. Rashmi. GL-Cache: Group-level learning for efficient and high-performance caching. In 21st USENIX Conference on File and Storage Technologies, FAST'23, pages 115–134, 2023.
[94] Juncheng Yang, Ziyue Qiu, Yazhuo Zhang, Yao Yue, and K. V. Rashmi. FIFO can be better than LRU: the power of lazy promotion and quick demotion. In The 19th Workshop on Hot Topics in Operating Systems (HotOS'23), 2023.
[95] Juncheng Yang, Anirudh Sabnis, Daniel S. Berger, K. V. Rashmi, and Ramesh K. Sitaraman. C2DN: How to harness erasure codes at the edge for efficient content delivery. In 19th USENIX symposium on networked systems design and implementation, NSDI'22, pages 1159–1177, Renton, WA, April 2022. USENIX Association.
[96] Juncheng Yang, Yao Yue, and K. V. Rashmi. Slides of a large scale analysis of hundreds of in-memory cache clusters at twitter. https://www.usenix.org/sites/default/files/conference/protected-files/osdi20_slides_yang.pdf. Accessed: 2023-04-27.
[97] Juncheng Yang, Yao Yue, and K. V. Rashmi. A large scale analysis of hundreds of in-memory cache clusters at Twitter. In 14th USENIX symposium on operating systems design and implementation, OSDI'20, pages 191–208. USENIX Association, November 2020.
[98] Juncheng Yang, Yao Yue, and Rashmi Vinayak. Segcache: a memory-efficient and scalable in-memory key-value cache for small objects. In 18th USENIX Symposium on Networked Systems Design and Implementation, NSDI'21, pages 503–518. USENIX Association, April 2021.
[99] Juncheng Yang, Yazhuo Zhang, Ziyue Qiu, Yao Yue, and K.V. Rashmi. FIFO queues are all you need for cache eviction. In Symposium on Operating Systems Principles (SOSP'23), 2023.
[100] Tzu-Wei Yang, Seth Pollen, Mustafa Uysal, Arif Merchant, and Homer Wolfmeister. CacheSack: Admission Optimization for Google Datacenter Flash Caches. In 2022 USENIX Annual Technical Conference, ATC'22, pages 1021–1036, Carlsbad, CA, July 2022. USENIX Association.
[101] Yiying Zhang, Gokul Soundararajan, Mark W. Storer, Lakshmi N. Bairavasundaram, Sethuraman Subbiah, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Warming up storage-level caches with Bonfire. In Proceedings of the 11th USENIX conference on File and Storage Technologies, FAST'13, pages 59–72, USA, February 2013. USENIX Association.
[102] Chen Zhong, Xingsheng Zhao, and Song Jiang. LIRS2: an improved LIRS replacement algorithm. In Proceedings of the 14th ACM International Conference on Systems and Storage, SYSTOR'21, pages 1–12, Haifa, Israel, June 2021. ACM.
[103] Ke Zhou, Si Sun, Hua Wang, Ping Huang, Xubin He, Rui Lan, Wenyan Li, Wenji Liu, and Tianming Yang. Tencent photo cache traces (SNIA IOTTA trace set 27476). In Geoff Kuenning, editor, SNIA IOTTA Trace Repository. Storage Networking Industry Association, February 2016.
[104] Ke Zhou, Si Sun, Hua Wang, Ping Huang, Xubin He, Rui Lan, Wenyan Li, Wenjie Liu, and Tianming Yang. Demystifying cache policies for photo stores at scale: A Tencent case study. In Proceedings of the 2018 International Conference on Supercomputing, ICS '18, pages 284–294, New York, NY, USA, 2018. Association for Computing Machinery.
[105] Y. Zhou, Z. Chen, and K. Li. Second-level buffer cache management. IEEE Transactions on Parallel and Distributed Systems, 15(6):505–519, June 2004.
[106] Yuanyuan Zhou, James Philbin, and Kai Li. The multi-queue replacement algorithm for second level buffer caches. In Proceedings of the annual conference on USENIX Annual Technical Conference, ATC'01, pages 91–104, USA, 2001. USENIX Association.